30005 Commits

Author SHA1 Message Date
EmanuelCN
9ff663517c schedutil: Inline clo code with 5.4 2024-09-30 16:34:15 +03:00
Joel Fernandes (Google)
3e312c8e7b schedutil: Allow cpufreq requests to be made even when kthread kicked
Currently a schedutil cpufreq update request can be dropped if there is
a pending update request. The pending request can be delayed by
scheduling delay of the irq_work and the wakeup of the schedutil
governor kthread.

A very bad scenario is when a schedutil request has just been made, such
as one to reduce the CPU frequency, and then a newer request to increase
the CPU frequency (even an urgent sched-deadline frequency increase
request) is dropped, even though the rate limits suggest it's OK to
process a request. This is because of the way the work_in_progress flag
is used.

This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-30 16:31:50 +03:00
Tim Zimmermann
92494b9920 syscall: Increase fake uname to 6.6.40
Change-Id: I77a8c5cd0e74eff97ae9b6b7e37812e2972cb3c8
2024-09-07 12:26:37 +03:00
Tim Zimmermann
31f7a8793a syscall: Only fake uname on very first call of netbpfload
* The BPF programs actually still support older kernels;
  we just need to bypass the very first kernel version check

Change-Id: I4264782ee63efb26b95abd94774938d5456200a3
2024-09-07 12:26:24 +03:00
Daniel Borkmann
248cbcedae BACKPORT: bpf: fix unconnected udp hooks
The intention of the cgroup bind/connect/sendmsg BPF hooks is to act
transparently to applications, as also stated in the original motivation
in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently
integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this, and thus act more transparently to the application,
this needs reverse translation on the recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And at the actual packet level it shows that we're using the backend
server when talking via the 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have the same semantics as in sendmsg BPF
programs, we only allow return codes in the [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on the recvmsg side, we act in
a similar way and call into the BPF program whenever a non-NULL
msg->msg_name was passed, independent of sk->sk_state being
TCP_ESTABLISHED or not. Note that for the TCP case, the msg->msg_name is
ignored in the regular recvmsg path and is therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Change-Id: If2bab00efe5f37a591083fe2676e76f35f8cecc3
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-07 12:24:22 +03:00
Alexei Starovoitov
1980188064 BACKPORT: bpf: enforce return code for cgroup-bpf programs
With the addition of the tnum logic, the verifier got smart enough that
we can enforce return codes at program load time.
For now, do so for cgroup-bpf program types.

Change-Id: Iae3a46c3d38810e47cbf4ec23356abae03ded736
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-07 12:24:22 +03:00
Andrey Ignatov
735c155332 bpf: Hooks for sys_sendmsg
In addition to already existing BPF hooks for sys_bind and sys_connect,
the patch provides new hooks for sys_sendmsg.

It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
that provides access to the socket itself (properties like family, type,
protocol) and user-passed `struct sockaddr *` so that BPF program can
override destination IP and port for system calls such as sendto(2) or
sendmsg(2) and/or assign source IP to the socket.

The hooks are implemented as two new attach types:
`BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
UDPv6 correspondingly.

UDPv4 and UDPv6 have separate attach types for the same reason as the
sys_bind and sys_connect hooks, i.e. to prevent reading from / writing
to e.g. user_ip6 fields when the user passes a sockaddr_in, since that
would be out-of-bounds.

The difference from the already existing hooks is that the sys_sendmsg
hooks are implemented only for unconnected UDP.

For TCP it doesn't make sense to change user-provided `struct sockaddr *`
at sendto(2)/sendmsg(2) time since socket either was already connected
and has source/destination set or wasn't connected and call to
sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.

Connected UDP is already handled by sys_connect hooks that can override
source/destination at connect time and use fast-path later, i.e. these
hooks don't affect UDP fast-path.

Rewriting source IP is implemented differently than that in sys_connect
hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
just bind socket to desired local IP address since source IP can be set
on per-packet basis by using ancillary data (cmsg(3)). So no matter if
socket is bound or not, source IP has to be rewritten on every call to
sys_sendmsg.

To do so, two new fields are added to UAPI `struct bpf_sock_addr`:
* `msg_src_ip4` to set source IPv4 for UDPv4;
* `msg_src_ip6` to set source IPv6 for UDPv6.

Change-Id: Icf5938b0b69ddfb1e99dc2abc90204f7c97f0473
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2024-09-07 12:24:00 +03:00
kondors1995
478461a998 Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into 15.0 2024-09-07 12:22:53 +03:00
Sultan Alsawaf
d9324db241 cpufreq: schedutil: Allow single-CPU frequency to drop without idling
Given that a CPU's clock is gated at even the shallowest idle state,
waiting until a CPU idles at least once before reducing its frequency is
putting the cart before the horse. For long-running workloads with low
compute needs, requiring an idle call since the last frequency update to
lower the CPU's frequency results in significantly increased energy usage.

Given that there is already a mechanism in place to ratelimit frequency
changes, this heuristic is wholly unnecessary.

Allow single-CPU performance domains to drop their frequency without
requiring an idle call in between, to save energy. Right off the bat,
this reduces CPU power consumption by 7.5% playing a cat gif in Firefox on
a Pixel 8 (270 mW -> 250 mW). And there is no visible loss of performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-09-01 17:45:32 +03:00
EmanuelCN
de9a4993ce Revert "cpufreq: schedutil: Ignore CPU load older than WALT window size"
This reverts commit b9d1ecec68e9750f819dca02af47caee55d06a58.
2024-09-01 17:44:20 +03:00
Helium-Studio
ea10cc32c6 cpufreq: schedutil: Checkout to msm-4.19
* HEAD: 2c85e5f15e
* Remove WALT and SchedTune pieces as we won't use them anymore
* Remove SCHED_CPUFREQ_RT_DL in include/linux/sched/cpufreq.h as it's unused now
* Add some new scheduler definitions from msm-4.19 for new schedutil
* Optimize freq switching for sm8150 (sm8150 uses software freq switching)
* Minor format and code adjustments (compared to HEAD)

* Note: The new schedutil requires UClamp, which will be backported later

Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-28 18:31:29 +03:00
kondors1995
be5603e38e sched: remove unused value 2024-08-25 22:13:17 +03:00
Peter Zijlstra
73bad98694 sched: Improve wake_up_all_idle_cpus() take #2
As reported by syzbot and experienced by Pavel, using cpus_read_lock()
in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem
and possibly others).

Instead, shrink the preempt disable region by iterating all CPUs and
checking the online status for each individual CPU while having
preemption disabled.

Fixes: 8850cb663b5c ("sched: Simplify wake_up_*idle*()")
Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com
Reported-by: Pavel Machek <pavel@ucw.cz>
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Change-Id: I652eb678e8a2e30d71beeebae35a36d4d4c49a8d
(cherry picked from 96611c26dc351c33f73b48756a9feacc109e5bab)
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>

# Conflicts:
#	kernel/smp.c
2024-08-25 22:11:37 +03:00
Peter Zijlstra
5ed0fda462 sched: Simplify wake_up_*idle*()
Simplify and make wake_up_if_idle() more robust, also don't iterate
the whole machine with preempt_disable() in its caller:
wake_up_all_idle_cpus().

This prepares for another wake_up_if_idle() user that needs a full
do_idle() cycle.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org
Change-Id: If9f0dd5d99dd675828656161bfdba299f447f998
(cherry picked from 8850cb663b5cda04d33f9cfbc38889d73d3c8e24)
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2024-08-25 22:06:41 +03:00
kondors1995
6be9588eed Revert "sched/core: wake up from idle by sending IPI_WAKEUP"
This reverts commit f49348dd5a.
2024-08-25 22:06:20 +03:00
Paolo Abeni
4915f9a08a Revert "softirq: Let ksoftirqd do its job"
This reverts the following commits:

  4cd13c21b2 ("softirq: Let ksoftirqd do its job")
  3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
  1342d8080f61 ("softirq: Don't skip softirq execution when softirq thread is parking")

in a single change to avoid known bad intermediate states introduced by a
patch series reverting them individually.

Due to the mentioned commit, when the ksoftirqd threads take charge of
softirq processing, the system can experience high latencies.

In the past a few workarounds have been implemented for specific
side-effects of the initial ksoftirqd enforcement commit:

commit 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred")
commit 8d5755b3f7 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
commit 217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")
commit 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")

But the latency problem still exists in real-life workloads, see the link
below.

The reverted commit intended to solve a live-lock scenario that can now be
addressed with the NAPI threaded mode, introduced with commit 29863d41bb6e
("net: implement threaded-able napi poll loop support"), which is nowadays
in a pretty stable status.

While a complete solution to put softirq processing under nice resource
control would be preferable, that has proven to be a very hard task. In
the short term, remove the main pain point, and also simplify a bit the
current softirq implementation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/netdev/305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com
Link: https://lore.kernel.org/r/57e66b364f1b6f09c9bc0316742c3b14f4ce83bd.1683526542.git.pabeni@redhat.com
(cherry picked from commit d15121be7485655129101f3960ae6add40204463)
Change-Id: If014afbfa3bb56f7c490a22b8334857c8308f901
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2024-08-25 22:04:37 +03:00
kondors1995
fe902b1eb4 Squashed commit of the following:
commit 7004ae7dfb08c25dc5165abc447a70036b00b60e
Author: kondors1995 <normandija1945@gmail.com>
Date:   Fri Aug 16 17:06:30 2024 +0300

    arm64: dts: sm8150-v2: Update gpu voltages

commit 7d2152a68429cc4e44ec5e0d6b3340c7b3bd7130
Author: kondors1995 <normandija1945@gmail.com>
Date:   Thu Aug 15 17:46:10 2024 +0300

    sched:fair: fix unused warning

    not sure why it's there, but let's keep it

commit 37a33c25346a355a07b4900a864f6326740c3a61
Author: kondors1995 <normandija1945@gmail.com>
Date:   Thu Aug 15 17:37:29 2024 +0300

    arm64: dts: sm8150-v2: Fixup @2cc338b

commit ab938b00393e1a8e6a11b0f54d5c24d56f68a4de
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Sun Aug 11 14:38:41 2024 +0300

    drivers: gpu: drm: Do not affine pm qos requests to prime core

    * This was accidentally re-added when I rebased techpack/display and drm.

commit 42966bed31891213f5bdb64be9f23aae0acb2c2d
Author: kondors1995 <normandija1945@gmail.com>
Date:   Wed Aug 14 10:56:36 2024 +0300

    raphael:defconfig: Disable CONFIG_FAIR_GROUP_SCHED

commit 4e1c9539727fc898016524700f5cced0cda9f55d
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Wed Aug 7 13:57:13 2024 +0300

    cpufreq: schedutil: Store the cached ratelimits values

    In a merge resolution Google accidentally removed this; since we have checked out the scheduler to redbull, this part was missing.

commit 2dc779651497edbe71c74f4b41eab79b4b0e2368
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Tue Aug 6 20:24:40 2024 +0300

    cpufreq: schedutil: Set rate-limits globally

    Since we are not going to modify them per cluster anymore, it's redundant to keep them this way; plus it's less expensive.

commit 4ac25bd369c6ba572b582ea7099e42dfd7f7137f
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Wed Jul 19 21:05:27 2023 +0800

    cpufreq: schedutil: Update next_freq when cpufreq_limits change

    When cpufreq's policy is 'single', there is a scenario that will
    cause sg_policy's next_freq to be unable to update.

    When the CPU's util is always max, the cpufreq will be max, and then
    if we change the policy's scaling_max_freq to a lower freq, the
    sg_policy's next_freq indeed needs to change to the lower freq;
    however, because the CPU is busy, the next_freq would keep the
    max_freq.

    For example:

    The cpu7 is a single CPU:

      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
      pid 4737's current affinity mask: ff
      pid 4737's new affinity mask: 80
      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
      2301000
      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
      2301000
      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
      unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
      2171000

    At this time, the sg_policy's next_freq would stay at 2301000, which
    is wrong.

    To fix this, add a check for the ->need_freq_update flag.

    [ mingo: Clarified the changelog. ]

    Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com

commit e88732a5887bfaa8054408514f2af952e119a9c0
Author: kondors1995 <normandija1945@gmail.com>
Date:   Mon Aug 12 20:37:01 2024 +0300

    cpufreq: schedutil:remove walt bits

commit a19bfda67be888221f9348eff56a53ac3b34756a
Author: Dawei Li <daweilics@gmail.com>
Date:   Sun Aug 4 21:38:34 2024 +0300

    sched/fair: Fix initial util_avg calculation
    Change se->load.weight to se_weight(se) in the calculation for the
    initial util_avg to avoid unnecessarily inflating the util_avg by 1024
    times.

    The reason is that se->load.weight has the unit/scale as the scaled-up
    load, while cfs_rq->avg.load_avg has the unit/scale as the true task
    weight (as mapped directly from the task's nice/priority value). With
    CONFIG_32BIT, the scaled-up load is equal to the true task weight. With
    CONFIG_64BIT, the scaled-up load is 1024 times the true task weight.
    Thus, the current code may inflate the util_avg by 1024 times. The
    follow-up capping will not allow the util_avg value to go wild. But the
    calculation should have the correct logic.

    Signed-off-by: Dawei Li <daweilics@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com

commit b147bc04acfe5b5647ef50ea9af580aaed04894a
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Wed Aug 7 14:13:40 2024 +0300

    cpufreq: schedutil: Give 25% headroom to prime core

    If we don't give any headroom to the prime core, it can result in a big performance regression.

commit 010effdcd39496611bca9f284c32a5e554a4fd27
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Wed Jul 3 00:16:04 2024 +0300

    sched: Introduce Per-Cluster DVFS Headroom

    Typically, UI-demanding tasks are handled by little and big cores,
    while the prime core is more power-hungry and leads to unnecessary boosts,
    causing power wastage. From my observations, boosting the little cores
    provides the most significant performance improvement for UI tasks,
    followed by a modest boost to the big cores. This results in fewer
    stutters across the entire system.

    To avoid excessive boosting, the implementation ensures that when
    the utilization is equal to or higher than the maximum capacity value,
    util is returned as-is.

commit b72fefb30cfc63426415bdbf06d1bb4c87c4d32f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Sun Jan 14 19:36:00 2024 +0100

    sched/fair: Fix frequency selection for non-invariant case

    Linus reported a ~50% performance regression on single-threaded
    workloads on his AMD Ryzen system, and bisected it to:

      9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

    When frequency invariance is not enabled, get_capacity_ref_freq(policy)
    is supposed to return the current frequency and the performance margin
    applied by map_util_perf(), enabling the utilization to go above the
    maximum compute capacity and to select a higher frequency than the current one.

    After the changes in 9c0b4bb7f630, the performance margin was applied
    earlier in the path to take into account utilization clampings and
    we couldn't get a utilization higher than the maximum compute capacity,
    and the CPU remained 'stuck' at lower frequencies.

    To fix this, we must use a frequency above the current frequency to
    get a chance to select a higher OPP when the current one becomes fully used.
    Apply the same margin and return a frequency 25% higher than the current
    one in order to switch to the next OPP before we fully use the CPU
    at the current one.

    [ mingo: Clarified the changelog. ]

    Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reported-by: Wyes Karny <wkarny@gmail.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Wyes Karny <wkarny@gmail.com>
    Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org

commit dcad748d1a2dcdb781fbcc514dc977d8ddc91be6
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Nov 22 14:39:04 2023 +0100

    sched/cpufreq: Rework iowait boost

    Use the max value that has already been computed inside sugov_get_util()
    to cap the iowait boost and remove dependency with uclamp_rq_util_with()
    which is not used anymore.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org

commit 31a906e3fcb3ba783d45157b722bcb3a321c1f8f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Nov 22 14:39:03 2023 +0100

    sched/cpufreq: Rework schedutil governor performance estimation

    The current method to take into account uclamp hints when estimating the
    target frequency can end in a situation where the selected target
    frequency is finally higher than uclamp hints, whereas there are no real
    needs. Such cases mainly happen because we are currently mixing the
    traditional scheduler utilization signal with the uclamp performance
    hints. By adding these 2 metrics, we lose important information when
    it comes to select the target frequency, and we have to make some
    assumptions which can't fit all cases.

    Rework the interface between the scheduler and schedutil governor in order
    to propagate all information down to the cpufreq governor.

    effective_cpu_util() interface changes and now returns the actual
    utilization of the CPU with 2 optional inputs:

    - The minimum performance for this CPU; typically the capacity to handle
      the deadline task and the interrupt pressure. But also uclamp_min
      request when available.

    - The maximum targeting performance for this CPU which reflects the
      maximum level that we would like to not exceed. By default it will be
      the CPU capacity but can be reduced because of some performance hints
      set with uclamp. The value can be lower than actual utilization and/or
      min performance level.

    A new sugov_effective_cpu_perf() interface is also available to compute
    the final performance level that is targeted for the CPU, after applying
    some cpufreq headroom and taking into account all inputs.

    With these 2 functions, schedutil is now able to decide when it must go
    above uclamp hints. It now also has a generic way to get the min
    performance level.

    The dependency between energy model and cpufreq governor and its headroom
    policy doesn't exist anymore.

    eenv_pd_max_util() asks schedutil for the targeted performance after
    applying the impact of the waking task.

    [ mingo: Refined the changelog & C comments. ]

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org

commit 307c28badda88662bfd2cf4afb4168506d1e8e96
Author: Lukasz Luba <lukasz.luba@arm.com>
Date:   Thu Dec 8 16:02:56 2022 +0000

    cpufreq, sched/util: Optimize operations with single CPU capacity lookup

    The max CPU capacity is the same for all CPUs sharing frequency domain.
    There is a way to avoid heavy operations in a loop for each CPU by
    leveraging this knowledge. Thus, simplify the looping code in the
    sugov_next_freq_shared() and drop heavy multiplications. Instead, use
    simple max() to get the highest utilization from these CPUs.

    This is useful for platforms with many (4 or 6) little CPUs. We avoid
    heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions
    of times, since it's not limited by the schedutil time delta filter in
    sugov_should_update_freq(). When there was no need to change frequency
    the code bailed out, not updating the sg_policy::last_freq_update_time.
    Then every visit after a delta_ns longer than
    sg_policy::freq_update_delay_ns goes through and triggers the next
    frequency calculation. However, if the resulting next frequency is
    the same as the current frequency, we won't update
    sg_policy::last_freq_update_time and the story repeats (in a very
    short period, sometimes a few microseconds).

    The max CPU capacity must be fetched every time we are called, due to
    difficulties during the policy setup, where we are not able to get the
    normalized CPU capacity at the right time.

    The fetched CPU capacity value is then used in sugov_iowait_apply() to
    calculate the right boost. This required a few changes in the local
    functions and arguments. The capacity value should hopefully be fetched
    once when needed and then passed over CPU registers to those functions.

    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Patrick Bellasi <patrick.bellasi@arm.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>

commit db9c6be8f578a2decb289f5e2cce1af14819690e
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Mar 28 11:33:21 2019 +0100

    cpufreq: schedutil: Simplify iowait boosting

    There is no reason for the minimum iowait boost value in the
    schedutil cpufreq governor to depend on the available range of CPU
    frequencies.  In fact, that dependency is generally confusing,
    because it causes the iowait boost to behave somewhat differently
    on CPUs with the same maximum frequency and different minimum
    frequencies, for example.

    For this reason, replace the min field in struct sugov_cpu
    with a constant and choose its value to be 1/8 of
    SCHED_CAPACITY_SCALE (for consistency with the intel_pstate
    driver's internal governor).

    [Note that policy->cpuinfo.max_freq will not be a constant any more
     after a subsequent change, so this change is depended on by it.]

    Link: https://lore.kernel.org/lkml/20190305083202.GU32494@hirez.programming.kicks-ass.net/T/#ee20bdc98b7d89f6110c0d00e5c3ee8c2ced93c3d
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

commit 7e1f2edaa9a3738cdf7bc23eaec0feb866633e3a
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Tue Jun 21 10:04:10 2022 +0100

    sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()

    effective_cpu_util() already has a `int cpu' parameter which allows to
    retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
    this function via an arch_scale_cpu_capacity(cpu).

    A lot of code calling effective_cpu_util() (or the shim
    sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
    arch_scale_cpu_capacity() already.
    But not having to pass it into effective_cpu_util() will make the EAS
    wake-up code easier, especially when the maximum CPU capacity reduced
    by the thermal pressure is passed through the EAS wake-up functions.

    Due to the asymmetric CPU capacity support of arm/arm64 architectures,
    arch_scale_cpu_capacity(int cpu) is a per-CPU variable, read via
    per_cpu(cpu_scale, cpu) on such a system.
    On all other architectures it is a compile-time constant
    (SCHED_CAPACITY_SCALE).

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
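
The shape of the signature change can be sketched with simplified stand-ins (stub names below are invented for illustration, not the kernel implementation): the max capacity is derived from the cpu id inside the function rather than passed in by every caller.

```c
#include <assert.h>

/* Stub standing in for arch_scale_cpu_capacity(); returns the
 * compile-time constant used on symmetric-capacity architectures. */
static unsigned long arch_scale_cpu_capacity_stub(int cpu)
{
	(void)cpu;
	return 1024;	/* SCHED_CAPACITY_SCALE */
}

/* Sketch: no 'max' parameter -- it is looked up from 'cpu' internally. */
static unsigned long effective_cpu_util_sketch(int cpu, unsigned long util_cfs)
{
	unsigned long max = arch_scale_cpu_capacity_stub(cpu);

	return util_cfs < max ? util_cfs : max;	/* clamp to capacity */
}
```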

commit caa24c3962bc2a002034548e3c93dce95c9330c4
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Fri Jun 21 22:06:38 2024 +0300

    cpufreq: schedutil: Add util to struct sg_cpu

    Instead of passing util and max between functions while computing the
    utilization and capacity, store the former in struct sg_cpu (along
    with the latter and bw_dl).

    This will allow the current utilization value to be compared with the
    one obtained previously (which is requisite for some code changes to
    follow this one), but also it causes the code to look slightly more
    consistent and cleaner.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

commit 176187b62d8637f44e8fcdee2b539dee8c30491b
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Thu Jun 27 21:52:34 2024 -0400

    sched/eevdf: 1ms base slice

commit c64c439cffc03bff0b5b69455dbd9018ae920a64
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Thu Jun 27 21:49:47 2024 -0400

    sched/features: enable SCHED_FEAT_PLACE_DEADLINE_INITIAL

commit 190d38b6481aea95d454dcfab5801a578a86f3dc
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Apr 18 15:09:35 2023 +0100

    sched/uclamp: Fix fits_capacity() check in feec()

    commit 244226035a1f9b2b6c326e55ae5188fab4f428cb upstream.

    As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
    it'll always pick the previous CPU because fits_capacity() will always
    return false in this case.

    The new util_fits_cpu() logic should handle this correctly for us beside
    more corner cases where similar failures could occur, like when using
    UCLAMP_MAX.

    We open code uclamp_rq_util_with() except for the clamp() part, as
    util_fits_cpu() needs the 'raw' values to be passed to it.

    Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
    value for the rq. Makes the code more readable and ensures the right
    rules (use READ_ONCE/WRITE_ONCE) are respected transparently.

    [1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html

    Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
    Reported-by: Yun Hsiang <hsiang023167@gmail.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
    (cherry picked from commit 244226035a1f9b2b6c326e55ae5188fab4f428cb)
    [Fix trivial conflict in kernel/sched/fair.c due to new automatic
    variables in master vs 5.10]
    Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
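
The shorthand accessors can be sketched as below with simplified types; the in-kernel versions wrap WRITE_ONCE()/READ_ONCE() on the real struct rq, which this plain-C stand-in only notes in comments.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's uclamp bookkeeping. */
enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

struct rq_sketch {
	unsigned int uclamp[UCLAMP_CNT];
};

static inline void uclamp_rq_set(struct rq_sketch *rq, enum uclamp_id id,
				 unsigned int value)
{
	rq->uclamp[id] = value;		/* WRITE_ONCE() in the kernel */
}

static inline unsigned int uclamp_rq_get(struct rq_sketch *rq,
					 enum uclamp_id id)
{
	return rq->uclamp[id];		/* READ_ONCE() in the kernel */
}
```

Funneling all accesses through one pair of helpers is what lets the READ_ONCE/WRITE_ONCE rule be enforced in a single place.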

commit a661b80985ae13d3ad1a1be93f482192e4a361b5
Author: Youssef Esmat <youssefesmat@chromium.org>
Date:   Thu Jun 20 19:26:56 2024 +0300

    sched/eevdf: Change base_slice to 3ms

    The default base slice of 0.75 msec is causing excessive context
    switches. Raise it to 3 msecs, which lowers the amount of context
    switches and the added overhead of the scheduler.

    BUG=b:308209790
    TEST=Manual tested on DUT
    UPSTREAM-TASK=b:324602237

    Change-Id: I7c7a7ecb377933a4032eb11161f5b5873e3e1e4c
    Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>
    Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5291506
    Reviewed-by: Suleiman Souhlal <suleiman@google.com>
    Reviewed-by: Vineeth Pillai <vineethrp@google.com>
    Reviewed-by: Steven Rostedt <rostedt@google.com>

commit 7d18cd0125c2a4347268d2f72b53a9045937d98a
Author: Youssef Esmat <youssefesmat@chromium.org>
Date:   Thu Jun 20 19:26:07 2024 +0300

    sched/eevdf: Add ENFORCE_ELIGIBILITY and default settings

    Eligibility reduces scheduling latency of high nice tasks at the
    cost of runtime of low nice tasks. This hurts the Chrome model, where
    UI tasks should not be interrupted by background tasks. This adds
    a feature to be able to disable eligibility checks.

    Note that fairness is preserved in the way the deadline is calculated,
    since the deadline is adjusted by the weight of the task.

    Disabling lag placement is similar to eligibility since lag will
    adjust the CPU time based on the amount of runtime.

    BUG=b:308209790
    TEST=Manual tested on DUT
    UPSTREAM-TASK=b:324602237
    Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>

commit 6ef762984070d49371eb803bf50aeacfebe51613
Author: Aaron Lu <aaron.lu@intel.com>
Date:   Tue Sep 12 14:58:08 2023 +0800

    sched/fair: Ratelimit update to tg->load_avg

    When using sysbench to benchmark Postgres in a single docker instance
    with sysbench's nr_threads set to nr_cpu, it is observed there are times
    update_cfs_group() and update_load_avg() shows noticeable overhead on
    a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):

        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

    Annotate shows the cycles are mostly spent on accessing tg->load_avg
    with update_load_avg() being the write side and update_cfs_group() being
    the read side. tg->load_avg is per task group and when different tasks
    of the same taskgroup running on different CPUs frequently access
    tg->load_avg, it can be heavily contended.

    E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
    Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
    the migration number is 11 million, and with each migration, the task's
    load will transfer from src cfs_rq to target cfs_rq and each change
    involves an update to tg->load_avg. Since the workload can trigger this
    many wakeups and migrations, the access (both read and write) to
    tg->load_avg can be unbound. As a result, the two mentioned functions
    showed noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the
    problem is worse: during a 5s window, wakeup number is 21 million and
    migration number is 14 million; update_cfs_group() costs ~25% and
    update_load_avg() costs ~16%.

    Reduce the overhead by limiting updates to tg->load_avg to at most once
    per ms. The update frequency is a tradeoff between tracking accuracy and
    overhead. 1ms is chosen because PELT window is roughly 1ms and it
    delivered good results for the tests that I've done. After this change,
    the cost of accessing tg->load_avg is greatly reduced and performance
    improved. Detailed test results below.

      ==============================
      postgres_sysbench on SPR:
      25%
      base:   42382±19.8%
      patch:  50174±9.5%  (noise)

      50%
      base:   67626±1.3%
      patch:  67365±3.1%  (noise)

      75%
      base:   100216±1.2%
      patch:  112470±0.1% +12.2%

      100%
      base:    93671±0.4%
      patch:  113563±0.2% +21.2%

      ==============================
      hackbench on ICL:
      group=1
      base:    114912±5.2%
      patch:   117857±2.5%  (noise)

      group=4
      base:    359902±1.6%
      patch:   361685±2.7%  (noise)

      group=8
      base:    461070±0.8%
      patch:   491713±0.3% +6.6%

      group=16
      base:    309032±5.0%
      patch:   378337±1.3% +22.4%

      =============================
      hackbench on SPR:
      group=1
      base:    100768±2.9%
      patch:   103134±2.9%  (noise)

      group=4
      base:    413830±12.5%
      patch:   378660±16.6% (noise)

      group=8
      base:    436124±0.6%
      patch:   490787±3.2% +12.5%

      group=16
      base:    457730±3.2%
      patch:   680452±1.3% +48.8%

      ============================
      netperf/udp_rr on ICL
      25%
      base:    114413±0.1%
      patch:   115111±0.0% +0.6%

      50%
      base:    86803±0.5%
      patch:   86611±0.0%  (noise)

      75%
      base:    35959±5.3%
      patch:   49801±0.6% +38.5%

      100%
      base:    61951±6.4%
      patch:   70224±0.8% +13.4%

      ===========================
      netperf/udp_rr on SPR
      25%
      base:   104954±1.3%
      patch:  107312±2.8%  (noise)

      50%
      base:    55394±4.6%
      patch:   54940±7.4%  (noise)

      75%
      base:    13779±3.1%
      patch:   36105±1.1% +162%

      100%
      base:     9703±3.7%
      patch:   28011±0.2% +189%

      ==============================================
      netperf/tcp_stream on ICL (all in noise range)
      25%
      base:    43092±0.1%
      patch:   42891±0.5%

      50%
      base:    19278±14.9%
      patch:   22369±7.2%

      75%
      base:    16822±3.0%
      patch:   17086±2.3%

      100%
      base:    18216±0.6%
      patch:   18078±2.9%

      ===============================================
      netperf/tcp_stream on SPR (all in noise range)
      25%
      base:    34491±0.3%
      patch:   34886±0.5%

      50%
      base:    19278±14.9%
      patch:   22369±7.2%

      75%
      base:    16822±3.0%
      patch:   17086±2.3%

      100%
      base:    18216±0.6%
      patch:   18078±2.9%

    Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
    Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Reviewed-by: David Vernet <void@manifault.com>
    Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
    Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
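
The ratelimit itself is a simple timestamp check; the sketch below uses a hypothetical helper name and simplified types (the actual patch folds the check into update_tg_load_avg()).

```c
#include <assert.h>

#define NSEC_PER_MSEC 1000000ULL

/* Simplified stand-in for the task group's new ratelimit timestamp. */
struct tg_sketch {
	unsigned long long last_update_tg_load_avg;
};

/* Return 1 if an update to tg->load_avg may proceed, 0 if ratelimited. */
static int tg_load_avg_update_allowed(struct tg_sketch *tg,
				      unsigned long long now)
{
	if (now - tg->last_update_tg_load_avg < NSEC_PER_MSEC)
		return 0;	/* updated less than 1 ms ago: skip */

	tg->last_update_tg_load_avg = now;
	return 1;
}
```

The 1 ms window matches the PELT window, so the skipped updates are within the accuracy PELT already tolerates.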

commit c0c01cb55fa02d91f3a9f1fd456ecc6842990f99
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:48:02 2022 +0800

    sched/fair: Fix another detach on unattached task corner case

    commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
    fixed two load tracking problems for new tasks, including the
    detach on unattached new task problem.

    There still left another detach on unattached task problem for the task
    which has been woken up by try_to_wake_up() and waiting for actually
    being woken up by sched_ttwu_pending().

    try_to_wake_up(p)
      cpu = select_task_rq(p)
      if (task_cpu(p) != cpu)
        set_task_cpu(p, cpu)
          migrate_task_rq_fair()
            remove_entity_load_avg()       --> unattached
            se->avg.last_update_time = 0;
          __set_task_cpu()
      ttwu_queue(p, cpu)
        ttwu_queue_wakelist()
          __ttwu_queue_wakelist()

    task_change_group_fair()
      detach_task_cfs_rq()
        detach_entity_cfs_rq()
          detach_entity_load_avg()   --> detach on unattached task
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_cfs_rq()
          attach_entity_load_avg()

    The reason for this problem is similar: we should check in detach_entity_cfs_rq()
    that se->avg.last_update_time != 0 before doing detach_entity_load_avg().

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
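
The guard can be sketched with simplified types (function and field names below are illustrative stand-ins): a last_update_time of zero marks an entity whose load was never attached, so there is nothing to detach.

```c
#include <assert.h>

/* Simplified stand-in for a sched entity's PELT state. */
struct se_sketch {
	unsigned long long last_update_time;	/* 0 == unattached */
	long attached_load;
};

static void detach_entity_load_avg_sketch(struct se_sketch *se)
{
	se->attached_load = 0;
}

static void detach_entity_cfs_rq_sketch(struct se_sketch *se)
{
	if (!se->last_update_time)	/* unattached: skip the detach */
		return;

	detach_entity_load_avg_sketch(se);
}
```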

commit 60cfd4b1c31915161d1c942a4a9067314fba8d79
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Mon May 6 23:43:57 2024 -0700

    sched/cass: Honor uclamp even when no CPUs can satisfy the requirement

    When all CPUs available to a uclamp'd process are thermal throttled, it is
    possible for them to be throttled below the uclamp minimum requirement. In
    this case, CASS only considers uclamp when it compares relative utilization
    and nowhere else; i.e., CASS essentially ignores the most important aspect
    of uclamp.

    Fix it so that CASS tries to honor uclamp even when no CPUs available to a
    uclamp'd process are capable of fully meeting the uclamp minimum.

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit 82d9f4952ab48ec0b1dd5cd48da1c3619ceb01c9
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Mon May 6 20:14:08 2024 -0700

    sched/cass: Fix disproportionate load spreading when CPUs are throttled

    When CPUs are thermal throttled, CASS tries to spread load such that their
    resulting P-state is scaled relatively to their _throttled_ maximum
    capacity, rather than their original capacity.

    As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
    to receive the extra burden and thus run at a disproportionately higher
    P-state relative to the throttled CPUs. This not only hurts performance,
    but also greatly diminishes energy efficiency since it breaks CASS's basic
    load balancing principle.

    To fix this, some convoluted logic is required in order to make CASS aware
    of a CPU's throttled and non-throttled capacity. The non-throttled capacity
    is used for the fundamental relative utilization comparison, while the
    throttled capacity is used in conjunction to ensure a throttled CPU isn't
    accidentally overloaded as a result.

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
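
A hypothetical sketch of the two-capacity idea (all names below are invented for illustration, not the CASS source): relative utilization is computed against the original capacity, while the throttled capacity gates whether a CPU can accept more load at all.

```c
#include <assert.h>

struct cass_cpu_sketch {
	unsigned long cap_orig;		/* non-throttled capacity */
	unsigned long cap_thermal;	/* throttled (current) capacity */
	unsigned long util;
};

/* Fixed-point (<<10) utilization relative to the *original* capacity,
 * so throttled CPUs keep receiving their fair share of load. */
static unsigned long rel_util(const struct cass_cpu_sketch *c)
{
	return (c->util << 10) / c->cap_orig;
}

/* But a CPU only qualifies if the extra load fits under its throttled
 * capacity, so it is not accidentally overloaded. */
static int fits_throttled(const struct cass_cpu_sketch *c, unsigned long extra)
{
	return c->util + extra <= c->cap_thermal;
}
```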

commit 6341ff1997878ad3b6b90a5bf0e20b99aa9e6d89
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Tue Dec 18 10:31:30 2018 +0000

    ANDROID: sched/fair: EAS: Add uclamp support to find_best_target()

    Utilization clamping can be used to boost the utilization of small tasks
    or cap that of big tasks. Thus, one of its possible usages is to bias
    task placement to "promote" small tasks on higher capacity (less energy
    efficient) CPUs or "constrain" big tasks on small capacity (more energy
    efficient) CPUs.

    When the Energy Aware Scheduler (EAS) looks for the most energy
    efficient CPU to run a task on, it currently considers only the
    effective utilization estimated for a task.

    Fix this by adding an additional check to skip CPUs whose capacity is
    smaller than the task's clamped utilization.

    Change-Id: I43fa6fa27e27c1eb5272c6a45d1a6a5b0faae1aa
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c0ccb5593e4c2ac72758c5a0f68e274a5698a839
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Mon Apr 8 21:44:58 2024 +0300

    sched/eevdf: Use the other placement strategy

    Place lag is currently broken (4.19 needs more backports to make it work as intended)

    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 052afe14ea97a104e3c384f56405c30fb571efe9
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Mon Apr 22 16:22:38 2024 +0800

    sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()

    It was possible to have pick_eevdf() return NULL, which then causes a
    NULL-deref. This turned out to be due to entity_eligible() returning
    a false negative because of an s64 multiplication overflow.

    Specifically, reweight_eevdf() computes the vlag without considering
    the limit placed upon vlag as update_entity_lag() does, and then the
    scaling multiplication (remember that weight is 20bit fixed point) can
    overflow. This then leads to the new vruntime being weird which then
    causes the above entity_eligible() to go side-ways and claim nothing
    is eligible.

    Thus limit the range of vlag accordingly.

    All this was quite rare, but fatal when it does happen.

    Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
    Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
    Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
    Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
    Reported-by: Sergei Trofimovich <slyich@gmail.com>
    Reported-by: Igor Raits <igor@gooddata.com>
    Reported-by: Breno Leitao <leitao@debian.org>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Reported-by: Yujie Liu <yujie.liu@intel.com>
    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
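
The fix bounds vlag the same way update_entity_lag() does, so the later fixed-point weight multiplication cannot overflow s64. A minimal sketch of the clamping (the limit value itself comes from the entity's slice in the real code):

```c
#include <assert.h>

/* Bound vlag to [-limit, limit]; simplified stand-in, not kernel code. */
static long long clamp_vlag(long long vlag, long long limit)
{
	if (vlag > limit)
		return limit;
	if (vlag < -limit)
		return -limit;
	return vlag;
}
```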

commit e224fc314c4b2f28785f30f8c85af7c566d540b3
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Wed Mar 6 10:21:33 2024 +0800

    sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr

    reweight_eevdf() only keeps V unchanged inside itself. When se !=
    cfs_rq->curr, it would be dequeued from the rb tree first, so V
    changes and the result is wrong. Pass the original V to reweight_eevdf()
    to fix this issue.

    Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    [peterz: flip if() condition for clarity]
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
    Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 6f605d6768465bec0f16cead98e1b3b08e548afe
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Wed Mar 6 10:21:32 2024 +0800

    sched/eevdf: Always update V if se->on_rq when reweighting

    reweight_eevdf() needs the latest V to do accurate calculation for new
    ve and vd. So update V unconditionally when se is runnable.

    Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
    Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Chen Yu <yu.c.chen@intel.com>
    Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3416506799b6b2b3eab0688051082ed2fc1d1f5b
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date:   Thu Dec 14 13:20:29 2023 +0800

    sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()

    This variable became unused in:

        5e963f2bd465 ("sched/fair: Commit to EEVDF")

    Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 2cfaa805dc73ae4e6395a8aa69ee79b9580cc7bb
Author: Yiwei Lin <s921975628@gmail.com>
Date:   Fri Nov 17 16:01:06 2023 +0800

    sched/fair: Update min_vruntime for reweight_entity() correctly

    Since reweight_entity() may change the weight of the
    cfs_rq->curr entity, we should also update_min_vruntime() if
    this is the case.

    Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
    Signed-off-by: Yiwei Lin <s921975628@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
    Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 13ffac20f571298c2007b90674694aebe367be89
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Wed Nov 15 11:36:46 2023 +0800

    sched/eevdf: O(1) fastpath for task selection

    Since the RB-tree is now sorted by deadline, let's first try the
    leftmost entity which has the earliest virtual deadline. I've done
    some benchmarks to see its effectiveness.

    All the benchmarks are done inside a normal cpu cgroup in a clean
    environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
    Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.

      hackbench: process/thread + pipe/socket + 1/2/4/8 groups
      netperf:   TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
      tbench:    loopback 24/48/72/96/192 threads
      schbench:  1/2/4/8 mthreads

      direct:    cfs_rq has only one entity
      parity:    RUN_TO_PARITY
      fast:      O(1) fastpath
      slow:	     heap search

        (%)		direct	parity	fast	slow
      hackbench	92.95	2.02	4.91	0.12
      netperf	68.08	6.60	24.18	1.14
      tbench	67.55	11.22	20.61	0.62
      schbench	69.91	2.65	25.73	1.71

    The above results indicate that this fastpath really makes task
    selection more efficient.

    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
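
With the tree sorted by virtual deadline, the cached leftmost node is the earliest-deadline entity, so the pick can try it first and only fall back to the O(log n) heap search when it is not eligible. A simplified stand-in sketch (not the kernel's pick_eevdf()):

```c
#include <assert.h>
#include <stddef.h>

struct ent_sketch {
	int eligible;
	long long deadline;
};

static const struct ent_sketch *
pick_eevdf_sketch(const struct ent_sketch *leftmost,
		  const struct ent_sketch *slowpath_pick)
{
	if (leftmost && leftmost->eligible)
		return leftmost;	/* O(1) fastpath */

	return slowpath_pick;		/* result of the heap search */
}
```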

commit 010771bc9b9ab17c289743afe1241127ba5263df
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Wed Nov 15 11:36:45 2023 +0800

    sched/eevdf: Sort the rbtree by virtual deadline

    Sort the task timeline by virtual deadline and keep the min_vruntime
    in the augmented tree, so we can avoid doubling the worst case cost
    and make full use of the cached leftmost node to enable O(1) fastpath
    picking in next patch.

    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 1e35ae32e182db3531da5d9055e45c7ccf826c96
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Tue Nov 21 21:44:26 2023 +0200

    sched/eevdf: Fix vruntime adjustment on reweight

    vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
    it gets re-weighted, and the calculations can be simplified based
    on the fact that re-weight won't change the w-average of all the
    entities. Please check the proofs in comments.

    But adjusting vruntime can also cause position change in RB-tree
    hence require re-queue to fix up which might be costly. This might
    be avoided by deferring adjustment to the time the entity actually
    leaves tree (dequeue/pick), but that will negatively affect task
    selection and probably not good enough either.

    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 73fa83e38da16aad4009b37fe7ee7a9fb289adb9
Author: Yiwei Lin <s921975628@gmail.com>
Date:   Fri Oct 20 13:56:17 2023 +0800

    sched/fair: Remove unused 'curr' argument from pick_next_entity()

    The 'curr' argument of pick_next_entity() has become unused after
    the EEVDF changes.

    [ mingo: Updated the changelog. ]

    Signed-off-by: Yiwei Lin <s921975628@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d2d81ece57c3c14431dc2525f551e2d2eadb2834
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 17 16:59:47 2023 +0200

    sched/eevdf: Fix heap corruption more

    Because someone is a flaming idiot... and forgot we have current as
    se->on_rq but not actually in the tree itself, and walking rb_parent()
    on an entry not in the tree is 'funky' and KASAN complains.

    Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity")
    Reported-by: 0599jiangyc@gmail.com
    Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020
    Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b97630c04b5e7869bc0a5900e6e7dda8f7bcf5d3
Author: Benjamin Segall <bsegall@google.com>
Date:   Fri Sep 29 17:09:30 2023 -0700

    sched/eevdf: Fix pick_eevdf()

    The old pick_eevdf() could fail to find the actual earliest eligible
    deadline when it descended to the right looking for min_deadline, but
    it turned out that that min_deadline wasn't actually eligible. In that
    case we need to go back and search through any left branches we
    skipped looking for the actual best _eligible_ min_deadline.

    This is more expensive, but still O(log n), and at worst should only
    involve descending two branches of the rbtree.

    I've run this through a userspace stress test (thank you
    tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
    corner cases.

    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Signed-off-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit df45b4b6a01c84ccb6ac72b0d5d20da7c90572d8
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 6 21:24:45 2023 +0200

    sched/eevdf: Fix min_deadline heap integrity

    Marek and Biju reported instances of:

      "EEVDF scheduling fail, picking leftmost"

    which Mike correlated with cgroup scheduling and the min_deadline heap
    getting corrupted; some trace output confirms:

    > And yeah, min_deadline is hosed somehow:
    >
    >    validate_cfs_rq: --- /
    >    __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
    >    __print_se:   ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
    >    __print_se:   ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
    >    validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256

    Turns out that reweight_entity(), which tries really hard to be fast,
    does not do the normal dequeue+update+enqueue pattern but *does* scale
    the deadline.

    However, it then fails to propagate the updated deadline value up the
    heap.

    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
    Reported-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
    Tested-by: Mike Galbraith <efault@gmx.de>
    Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3a51a2af114512733bfe3b6965a7320af592ac6e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 26 14:29:50 2023 +0200

    sched/eevdf: Fix avg_vruntime()

    The expectation is that placing a task at avg_vruntime() makes it
    eligible. Turns out there is a corner case where this is not the case.

    Specifically, avg_vruntime() relies on the fact that integer division
    is a flooring function (eg. it discards the remainder). By this
    property the value returned is slightly left of the true average.

    However! when the average is a negative (relative to min_vruntime) the
    effect is flipped and it becomes a ceil, with the result that the
    returned value is just right of the average and thus not eligible.

    Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
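
C's `/` truncates toward zero, so for a negative dividend the computed "average" rounds up (a ceil), landing just right of the true average. A flooring division restores the intended bias; this is a sketch of the arithmetic, not the kernel's avg_vruntime() itself:

```c
#include <assert.h>

/* Floor division: round toward negative infinity instead of zero. */
static long long div_floor(long long x, long long y)
{
	long long q = x / y;
	long long r = x % y;

	/* adjust when there is a remainder and the signs differ */
	if (r && ((r < 0) != (y < 0)))
		q--;
	return q;
}
```

E.g. plain C gives -7 / 2 == -3 (right of the true average -3.5), while div_floor(-7, 2) == -4 (left of it), which keeps a task placed at avg_vruntime() eligible.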

commit 94630ff1c5dace213fc4d886abdddff610331975
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Sep 15 00:48:55 2023 +0200

    sched/eevdf: Also update slice on placement

    Tasks that never consume their full slice would not update their slice value.
    This means that tasks that are spawned before the sysctl scaling keep their
    original (UP) slice length.

    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit ebdbfa17b89f097f7dbb25126033767d392ec08a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Sep 20 15:00:24 2023 +0200

    sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl

    The /proc/sys/kernel/sched_child_runs_first knob is no longer connected since:

       5e963f2bd4654 ("sched/fair: Commit to EEVDF")

    Remove it.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230920130025.412071-2-bigeasy@linutronix.de
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit f3c2be38e880f64190b90cfa7a587e224036fa58
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Aug 16 15:40:59 2023 +0200

    sched/eevdf: Curb wakeup-preemption

    Mike and others noticed that EEVDF does like to over-schedule quite a
    bit -- which does hurt performance of a number of benchmarks /
    workloads.

    In particular, what seems to cause over-scheduling is that when lag is
    of the same order (or larger) than the request / slice then placement
    will not only cause the task to be placed left of current, but also
    with a smaller deadline than current, which causes immediate
    preemption.

    [ notably, lag bounds are relative to HZ ]

    Mike suggested we stick to picking 'current' for as long as it's
    eligible to run, giving it uninterrupted runtime until it reaches
    parity with the pack.

    Augment Mike's suggestion by only allowing it to exhaust its initial
    request.

    One random data point:

    echo NO_RUN_TO_PARITY > /debug/sched/features
    perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

    	3,723,554        context-switches      ( +-  0.56% )
    	9.5136 +- 0.0394 seconds time elapsed  ( +-  0.41% )

    echo RUN_TO_PARITY > /debug/sched/features
    perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

    	2,556,535        context-switches      ( +-  0.51% )
    	9.2427 +- 0.0302 seconds time elapsed  ( +-  0.33% )

    Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit bb48e3bf106a493dad3be6f2a84f507d7a5d0202
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:49 2023 +0200

    sched/fair: Propagate enqueue flags into place_entity()

    This allows place_entity() to consider ENQUEUE_WAKEUP and
    ENQUEUE_MIGRATED.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 498bfeceabaf3803261f7c934e4303d6f2137ccc
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:48 2023 +0200

    sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice

    EEVDF uses this tunable as the base request/slice -- make sure the
    name reflects this.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b194d2bfc5dc8070c78357d0c2b63a0cde2877d6
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Nov 21 21:21:06 2023 +0200

    sched/fair: Commit to EEVDF

    EEVDF is a better defined scheduling policy, as a result it has less
    heuristics/tunables. There is no compelling reason to keep CFS around.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
    [@Helium-Studio: Also remove sysctl entries that dropped by this commit]
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

    # Conflicts:
    #	kernel/sched/fair.c
    #	kernel/sysctl.c

commit 29c0c9b70590c86dbb3504704b4043ffe793d48f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:46 2023 +0200

    sched/smp: Use lag to simplify cross-runqueue placement

    Using lag is both more correct and simpler when moving between
    runqueues.

    Notably, min_vruntime() was invented as a cheap approximation of
    avg_vruntime() for this very purpose (SMP migration). Since we now
    have the real thing, use it.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 241fa4e083655b4f67cae56fa716ca6234506b30
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:48:01 2022 +0800

    sched/fair: Combine detach into dequeue when migrating task

    When we are migrating task out of the CPU, we can combine detach and
    propagation into dequeue_entity() to save the detach_entity_cfs_rq()
    in migrate_task_rq_fair().

    This optimization is like combining DO_ATTACH in the enqueue_entity()
    when migrating a task to the CPU. So we don't have to traverse the CFS
    tree an extra time to do the detach_entity_cfs_rq() ->
    propagate_entity_cfs_rq(), which wouldn't be called anymore with this
    patch's change.

    detach_task()
      deactivate_task()
        dequeue_task_fair()
          for_each_sched_entity(se)
            dequeue_entity()
              update_load_avg() /* (1) */
                detach_entity_load_avg()

      set_task_cpu()
        migrate_task_rq_fair()
          detach_entity_cfs_rq() /* (2) */
            update_load_avg();
            detach_entity_load_avg();
            propagate_entity_cfs_rq();
              for_each_sched_entity()
                update_load_avg()

    This patch saves the detach_entity_cfs_rq() call in (2) by doing
    the detach_entity_load_avg() for a CPU-migrating task inside (1)
    (the task being the first se in the loop).

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d563959f4820312d3d1b462c78ca62a8a5fe3157
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:45 2023 +0200

    sched/fair: Commit to lag based placement

    Removes the FAIR_SLEEPERS code in favour of the new LAG based
    placement.

    Specifically, the whole FAIR_SLEEPER thing was a very crude
    approximation to make up for the lack of lag based placement,
    specifically the 'service owed' part. This is important for things
    like 'starve' and 'hackbench'.

    One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
    specifically, by always ignoring up-to 'thresh' sleeptime it would
    have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
    while strictly speaking this should (of course) result in a 33%/67%
    split (as CFS will also do if the sleep period exceeds 'thresh').

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit ac875e302647f7266ce5678d02f06732e4512f2c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:44 2023 +0200

    sched/fair: Implement an EEVDF-like scheduling policy

    CFS is currently a WFQ-based scheduler with only a single knob, the
    weight. The addition of a second, latency-oriented parameter makes
    something like WF2Q or EEVDF a much better fit.

    Specifically, EEVDF does EDF like scheduling in the left half of the
    tree -- those entities that are owed service. Except because this is a
    virtual time scheduler, the deadlines are in virtual time as well,
    which is what allows over-subscription.

    EEVDF has two parameters:

     - weight, or time-slope: which is mapped to nice just as before

     - request size, or slice length: which is used to compute
       the virtual deadline as: vd_i = ve_i + r_i/w_i

    Basically, by setting a smaller slice, the deadline will be earlier
    and the task will be more eligible and run earlier.

    Tick driven preemption is driven by request/slice completion; while
    wakeup preemption is driven by the deadline.

    Because the tree is now effectively an interval tree, and the
    selection is no longer 'leftmost', over-scheduling is less of a
    problem.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
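
    The virtual deadline formula in the message above (vd_i = ve_i + r_i/w_i)
    can be sketched in plain C. This is an illustration only -- the function
    and variable names are invented here, not the kernel's:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Illustrative sketch: virtual deadline = virtual eligible time plus
     * the requested slice scaled down by weight (all in virtual time). */
    static uint64_t virtual_deadline(uint64_t ve, uint64_t slice, uint64_t weight)
    {
        return ve + slice / weight;
    }

    int main(void)
    {
        /* A smaller slice yields an earlier deadline at equal weight... */
        assert(virtual_deadline(100, 40, 2) < virtual_deadline(100, 80, 2));
        /* ...and a larger weight (higher priority) also pulls it in. */
        assert(virtual_deadline(100, 80, 4) < virtual_deadline(100, 80, 2));
        return 0;
    }
    ```

    This matches the message's claim that a smaller slice makes a task's
    deadline earlier and hence makes it run sooner.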

commit fa237d82950d832b045ca398b0bdd7088422cb37
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:42 2023 +0200

    sched/fair: Add lag based placement

    With the introduction of avg_vruntime, it is possible to approximate
    lag (the entire purpose of introducing it in fact). Use this to do lag
    based placement over sleep+wake.

    Specifically, the FAIR_SLEEPERS thing places things too far to the
    left and messes up the deadline aspect of EEVDF.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 6063bb0c8a1acd02f35d9de3a2f73c17fd99c3f2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:41 2023 +0200

    sched/fair: Remove sched_feat(START_DEBIT)

    With the introduction of avg_vruntime() there is no need to use worse
    approximations. Take the 0-lag point as starting point for inserting
    new tasks.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 45d55875264f7f7756c814df55a0c505de1f13c0
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:40 2023 +0200

    sched/fair: Add cfs_rq::avg_vruntime

    In order to move to an eligibility based scheduling policy, we need
    to have a better approximation of the ideal scheduler.

    Specifically, for a virtual time weighted fair queueing based
    scheduler the ideal scheduler will be the weighted average of the
    individual virtual runtimes (math in the comment).

    As such, compute the weighted average to approximate the ideal
    scheduler -- note that the approximation is in the individual task
    behaviour, which isn't strictly conformant.

    Specifically, consider adding a task with a vruntime left of center;
    in this case the average will move backwards in time -- something the
    ideal scheduler would of course never do.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
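
    The weighted average described above can be sketched in userspace C.
    The struct and field names below are made up for illustration and do
    not mirror the kernel's actual cfs_rq bookkeeping:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    struct ent { uint64_t vruntime; uint64_t weight; };

    /* Weighted average of entity vruntimes -- an approximation of the
     * "ideal scheduler" position the commit message refers to. */
    static uint64_t avg_vruntime(const struct ent *e, size_t n)
    {
        uint64_t sum = 0, w = 0;
        for (size_t i = 0; i < n; i++) {
            sum += e[i].vruntime * e[i].weight;
            w   += e[i].weight;
        }
        return w ? sum / w : 0;
    }

    int main(void)
    {
        struct ent rq[] = { {100, 1}, {200, 3} };
        /* (100*1 + 200*3) / 4 = 175: the heavier entity pulls the average. */
        assert(avg_vruntime(rq, 2) == 175);
        return 0;
    }
    ```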

commit 482a42257efb9da9db63f03a8af5d5f07d3518f6
Author: Jiang Biao <benbjiang@tencent.com>
Date:   Tue Aug 11 19:32:09 2020 +0800

    sched/fair: Simplify the work when reweighting entity

    The code in reweight_entity() can be simplified.

    For a sched entity on the rq, the entity accounting can be replaced by
    cfs_rq instantaneous load updates currently called from within the
    entity accounting.

    Even though an entity on the rq can't represent a task in
    reweight_entity() (a task is always dequeued before calling this
    function) and so the numa task accounting and the rq->cfs_tasks list
    management of the entity accounting are never called, the redundant
    cfs_rq->nr_running decrement/increment will be avoided.

    Signed-off-by: Jiang Biao <benbjiang@tencent.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20200811113209.34057-1-benbjiang@tencent.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d21036a9134537a7e56654ea6191797905c78148
Author: Vincent Donnefort <vincent.donnefort@arm.com>
Date:   Tue Jun 21 10:04:08 2022 +0100

    sched/fair: Provide u64 read for 32-bits arch helper

    Introducing macro helpers u64_u32_{store,load}() to factorize lockless
    accesses to u64 variables for 32-bits architectures.

    Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
    accommodate the latter, where the copy lies outside of the structure
    (cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
    use the _copy() version of those helpers.

    Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
    therefore, have a small penalty for 32-bits machines in set_task_rq_fair()
    and init_cfs_rq().

    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
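
    The store/load-with-copy idea can be sketched as below. This is a
    single-threaded userspace illustration with the smp_rmb()/smp_wmb()
    barriers elided; the names are mine, not the kernel macros:

    ```c
    #include <assert.h>
    #include <stdint.h>

    struct split_u64 {
        uint64_t val;
        uint64_t copy;   /* a second copy kept next to the variable */
    };

    static void u64_store(struct split_u64 *s, uint64_t v)
    {
        s->val = v;      /* smp_wmb() would sit between these two writes */
        s->copy = v;
    }

    static uint64_t u64_load(const struct split_u64 *s)
    {
        uint64_t v, c;
        do {             /* smp_rmb() would sit inside this loop */
            c = s->copy;
            v = s->val;
        } while (v != c);  /* mismatch => torn 64-bit read, retry */
        return v;
    }

    int main(void)
    {
        struct split_u64 s;
        u64_store(&s, 0x123456789abcdef0ULL);
        assert(u64_load(&s) == 0x123456789abcdef0ULL);
        return 0;
    }
    ```

    On a 32-bit machine a 64-bit store is two word writes, so a concurrent
    reader could see a half-updated value; retrying until value and copy
    agree detects the tear.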

commit fded4adb0d75b4c4c5400278f277a5ff8a782020
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Apr 29 17:04:12 2020 +0200

    rbtree, sched/fair: Use rb_add_cached()

    Reduce rbtree boiler plate by using the new helper function.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 759f0c10c642f4917e94b9ad3a29a68cfe02f1a5
Author: Michel Lespinasse <walken@google.com>
Date:   Wed Sep 25 16:46:10 2019 -0700

    augmented rbtree: rework the RB_DECLARE_CALLBACKS macro definition

    Change the definition of the RBCOMPUTE function.  The propagate callback
    repeatedly calls RBCOMPUTE as it moves from leaf to root.  It wants to
    stop recomputing once the augmented subtree information doesn't change.
    This was previously checked using the == operator, but that only works
    when the augmented subtree information is a scalar field.  This commit
    modifies the RBCOMPUTE function so that it now sets the augmented subtree
    information instead of returning it, and returns a boolean value
    indicating if the propagate callback should stop.

    The motivation for this change is that I want to introduce augmented
    rbtree uses where the augmented data for the subtree is a struct instead
    of a scalar.

    Link: http://lkml.kernel.org/r/20190703040156.56953-4-walken@google.com
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dbueso@suse.de>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 9372e034d9366c3cad0772ee1b310652d90bcb70
Author: Michel Lespinasse <walken@google.com>
Date:   Wed Sep 25 16:46:07 2019 -0700

    augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro

    Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
    for the case where the augmented value is a scalar whose definition
    follows a max(f(node)) pattern.  This actually covers all present uses of
    RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
    various RBCOMPUTE function definitions.

    [walken@google.com: fix mm/vmalloc.c]
      Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
    [walken@google.com: re-add check to check_augmented()]
      Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
    Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dbueso@suse.de>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a3175d39d501e3512d4df600939201cbda8e9b33
Author: Michel Lespinasse <walken@google.com>
Date:   Wed Sep 25 16:46:04 2019 -0700

    augmented rbtree: add comments for RB_DECLARE_CALLBACKS macro

    Patch series "make RB_DECLARE_CALLBACKS more generic", v3.

    These changes are intended to make the RB_DECLARE_CALLBACKS macro more
    generic (allowing the augmented subtree information to be a struct instead
    of a scalar).

    I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
    check that they didn't change.  This held as expected for interval_tree.o;
    mmap.o did have some changes which could be reverted by marking
    __vma_link_rb as noinline.  I did not add such a change to the patchset; I
    felt it was reasonable enough to leave the inlining decision up to the
    compiler.

    This patch (of 3):

    Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
    The arguments are also now capitalized.  This copies the style of the
    INTERVAL_TREE_DEFINE macro.

    No functional changes in this commit, only comments and capitalization.

    Link: http://lkml.kernel.org/r/20190703040156.56953-2-walken@google.com
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 9600cbd0d286c6cf86b68e88ea0600a219815d37
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:43 2023 +0200

    rbtree: Add rb_add_augmented_cached() helper

    While slightly sub-optimal, updating the augmented data while going
    down the tree during lookup would be faster -- alas, the augment
    interface does not currently allow for that. Provide a generic helper
    to add a node to an augmented cached tree.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.862983648@infradead.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3773003b378a5a3e5e44d0083d05c7437fd8a4b5
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Oct 9 10:36:53 2017 +0200

    sched/core: Ensure load_balance() respects the active_mask

    While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.

    Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 1a4a49b4cd14bdcf91e3c9de97aa98c912e11c83
Author: Wanpeng Li <wanpengli@tencent.com>
Date:   Mon Jan 13 08:50:27 2020 +0800

    [BACKPORT]sched/nohz: Optimize get_nohz_timer_target()

    On a machine, CPU 0 is used for housekeeping, the other 39 CPUs in the
    same socket are in nohz_full mode. We can observe huge time burn in the
    loop for searching the nearest busy housekeeper CPU by ftrace.

      2)               |                        get_nohz_timer_target() {
      2)   0.240 us    |                          housekeeping_test_cpu();
      2)   0.458 us    |                          housekeeping_test_cpu();

      ...

      2)   0.292 us    |                          housekeeping_test_cpu();
      2)   0.240 us    |                          housekeeping_test_cpu();
      2)   0.227 us    |                          housekeeping_any_cpu();
      2) + 43.460 us   |                        }

    This patch optimizes the searching logic by finding a nearest housekeeper
    CPU in the housekeeping cpumask, it can minimize the worst searching time
    from ~44us to < 10us in my testing. In addition, the last iterated busy
    housekeeper can become a random candidate while current CPU is a better
    fallback if it is a housekeeper.

    Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
    Signed-off-by: DennySPB <dennyspb@gmail.com>

commit f773f3b275ad618158280c9793a1d27639f298f7
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Dec 14 17:01:56 2018 +0100

    sched/fair: Trigger asym_packing during idle load balance

    Newly idle load balancing is not always triggered when a CPU becomes idle.
    This prevents the scheduler from getting a chance to migrate the task
    for asym packing.

    Enable active migration during idle load balance too.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: valentin.schneider@arm.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Jesse Chan <jc@linux.com>
    Signed-off-by: billaids <jimmy.nelle@hsw-stud.de>

commit 75352e3fb31924ec3af106be73e210d2258fefe8
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Oct 9 10:36:53 2017 +0200

    sched/core: Ensure load_balance() respects the active_mask

    While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.

    Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 74b6dcb03b6b9fa56fbcdf2ffb33e50fa97db6dc
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue Dec 3 17:01:06 2019 +0100

    sched: Use fair:prio_changed() instead of ad-hoc implementation

    set_user_nice() implements its own version of fair::prio_changed() and
    therefore misses a specific optimization towards nohz_full CPUs that
    avoids sending a resched IPI to a reniced task running alone. Use the
    proper callback instead.

    Change-Id: I51ba67826dfcec0aa423758281943c01ba267c91
    Reported-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>
    Signed-off-by: DennySPB <dennyspb@gmail.com>

commit 1994e58a6bf1726d32a1c58af16b2be6fdfafd80
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Apr 26 12:19:32 2018 +0200

    sched/fair: Fix the update of blocked load when newly idle

    With commit:

      31e77c93e432 ("sched/fair: Update blocked load when newly idle")

    ... we release the rq->lock when updating blocked load of idle CPUs.

    This opens a time window during which another CPU can add a task to this
    CPU's cfs_rq.

    The check for a newly added task in idle_balance() is not in the common path.
    Move the out label to include this check.

    Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
    Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit ddd49e696ad91a561a8c8bf12ef61bc0e113ac41
Author: Josh Don <joshdon@google.com>
Date:   Tue Aug 4 12:34:13 2020 -0700

    sched/fair: Ignore cache hotness for SMT migration

    SMT siblings share caches, so cache hotness should be irrelevant for
    cross-sibling migration.

    Signed-off-by: Josh Don <joshdon@google.com>
    Proposed-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200804193413.510651-1-joshdon@google.com

commit de8674e8809c07543874ec4a4c91631c481caf4f
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Sep 30 10:35:32 2020 -0700

    sched/fair: Tweak pick_next_entity()

    Currently, pick_next_entity(...) has the following structure
    (simplified):

      [...]
      if (last_buddy_ok())
        result = last_buddy;
      if (next_buddy_ok())
        result = next_buddy;
      [...]

    The intended behavior is to prefer next buddy over last buddy;
    the current code somewhat obfuscates this, and also wastes
    cycles checking the last buddy when eventually the next buddy is
    picked up.

    So this patch refactors two 'ifs' above into

      [...]
      if (next_buddy_ok())
          result = next_buddy;
      else if (last_buddy_ok())
          result = last_buddy;
      [...]

    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com

commit 6a20c355412765268bfc7635b626899873f73337
Author: Clement Courbet <courbet@google.com>
Date:   Wed Mar 3 14:46:53 2021 -0800

    sched: Optimize __calc_delta()

    A significant portion of __calc_delta() time is spent in the loop
    shifting a u64 by 32 bits. Use `fls` instead of iterating.

    This is ~7x faster on benchmarks.

    The generic `fls` implementation (`generic_fls`) is still ~4x faster
    than the loop.
    Architectures that have a better implementation will make use of it. For
    example, on x86 we get an additional factor 2 in speed without dedicated
    implementation.

    On GCC, the asm versions of `fls` are about the same speed as the
    builtin. On Clang, the versions that use fls are more than twice as
    slow as the builtin. This is because the way the `fls` function is
    written, clang puts the value in memory:
    https://godbolt.org/z/EfMbYe. This bug is filed at
    https://bugs.llvm.org/show_bug.cgi?id=49406.

    ```
    name                                   cpu/op
    BM_Calc<__calc_delta_loop>             9.57ms ±12%
    BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
    BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
    BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
    BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
    BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
    BM_Calc<__calc_delta_builtin>          1.32ms ±11%
    ```

    Signed-off-by: Clement Courbet <courbet@google.com>
    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
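
    The optimization described above -- replacing a shift-one-bit-at-a-time
    loop with a single shift computed from the highest set bit -- can be
    sketched as follows. The helper names are illustrative, not the
    kernel's, and the portable fls64_() stand-in is itself a loop (real
    implementations use a builtin or a dedicated instruction):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Portable stand-in for fls64(): 1-based index of the highest set
     * bit, 0 when x == 0. The kernel uses an asm/builtin version. */
    static int fls64_(uint64_t x)
    {
        int r = 0;
        while (x) { r++; x >>= 1; }
        return r;
    }

    /* Shrink 'fact' to fit in 32 bits with one computed shift instead of
     * testing and shifting one bit per iteration. */
    static uint64_t fit_32bit(uint64_t fact, int *shift)
    {
        int bits = fls64_(fact);
        int extra = bits > 32 ? bits - 32 : 0;  /* one computation, no loop */
        *shift = extra;
        return fact >> extra;
    }

    int main(void)
    {
        int s;
        assert(fit_32bit(1ULL << 40, &s) == (1ULL << 31) && s == 9);
        assert(fit_32bit(5, &s) == 5 && s == 0);  /* already fits */
        return 0;
    }
    ```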

commit e4793308d8214d546ae2eead6f9fac0426ada0f9
Author: Amit Kucheria <amit.kucheria@linaro.org>
Date:   Mon Oct 21 17:45:12 2019 +0530

    cpufreq: Initialize the governors in core_initcall

    Initialize the cpufreq governors earlier to allow for earlier
    performance control during the boot process.

    Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
    Link: https://lore.kernel.org/r/b98eae9b44eb2f034d7f5d12a161f5f831be1eb7.1571656015.git.amit.kucheria@linaro.org

    # Conflicts:
    #	drivers/cpufreq/cpufreq_performance.c

commit 7f1bc529bc787dcbedfd2899cb9c99326dd57215
Author: Mathieu Poirier <mathieu.poirier@linaro.org>
Date:   Fri Jul 19 15:59:53 2019 +0200

    sched/topology: Add partition_sched_domains_locked()

    Introduce the partition_sched_domains_locked() function by taking
    the mutex locking code out of the original function.  That way
    the work done by partition_sched_domains_locked() can be reused
    without dropping the mutex lock.

    No change of functionality is introduced by this patch.

    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: rostedt@goodmis.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit e17f27e0d39d77e4b1011278796a30e8d7b05d6f
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Tue Apr 9 18:35:45 2019 +0100

    sched/topology: Skip duplicate group rewrites in build_sched_groups()

    While staring at build_sched_domains(), I realized that get_group()
    does several duplicate (thus useless) writes.

    If you take the Arm Juno r0 (LITTLEs = [0, 3, 4, 5], bigs = [1, 2]), the
    sched_group build flow would look like this:

    ('MC[cpu]->sg' means 'per_cpu_ptr(&tl->data->sg, cpu)' with 'tl == MC')

    build_sched_groups(MC[CPU0]->sd, CPU0)
      get_group(0) -> MC[CPU0]->sg
      get_group(3) -> MC[CPU3]->sg
      get_group(4) -> MC[CPU4]->sg
      get_group(5) -> MC[CPU5]->sg

    build_sched_groups(DIE[CPU0]->sd, CPU0)
      get_group(0) -> DIE[CPU0]->sg
      get_group(1) -> DIE[CPU1]->sg <=================+
    						  |
    build_sched_groups(MC[CPU1]->sd, CPU1)            |
      get_group(1) -> MC[CPU1]->sg                    |
      get_group(2) -> MC[CPU2]->sg                    |
    						  |
    build_sched_groups(DIE[CPU1]->sd, CPU1)           ^
      get_group(1) -> DIE[CPU1]->sg  } We've set up these two up here!
      get_group(3) -> DIE[CPU0]->sg  }

    From this point on, we will only use sched_groups that have been
    previously visited & initialized. The only new operation will
    be which group pointer we assign to sd->groups.

    On the Juno r0 we get 32 get_group() calls, every single one of them
    writing to a sched_group->cpumask. However, all of the data structures
    we need are set up after 8 visits (see above).

    Return early from get_group() if we've already visited (and thus
    initialized) the sched_group we're looking at. Overlapping domains
    are not affected as they do not use build_sched_groups().

    Tested on a Juno and a 2 * (Xeon E5-2690) system.

    ( FWIW I initially checked the refs for both sg && sg->sgc, but figured if
      they weren't both 0 or > 1 then something must have gone wrong, so I
      threw in a WARN_ON(). )

    No change in functionality intended.

    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit c62d6453c251607e2fe6f6e204a0711b7edc03df
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Jun 17 17:00:17 2019 +0200

    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()

    The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
    unused since commit:

      765d0af19f5f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

    Remove it.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: gregkh@linuxfoundation.org
    Cc: linux@armlinux.org.uk
    Cc: quentin.perret@arm.com
    Cc: rafael@kernel.org
    Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 360635bafb36f39478a5c32b43e3cb64b1df4dad
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Mon Apr 22 16:01:47 2024 -0400

    sched-pelt: make half life nonconfigurable

    16ms only

commit baa288445831e3cef3b7f6cd20bc0c52f331acc2
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Mon Apr 22 15:32:50 2024 -0400

    sched/fair: decrease migration cost

commit 7e1c7b478d201f3909d4c8f1f9f121064457f4e9
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Aug 1 17:26:48 2023 +0200

    sched/rt: Don't try to push tasks if there are none.

    I have an RT task X at a high priority and cyclictest on each CPU with
    lower priority than X's. If X is active and each CPU wakes their own
    cyclictest thread then it ends in a longer rto_push storm.
    A random CPU determines via balance_rt() that the CPU on which X is
    running needs to push tasks. X has the highest priority, cyclictest is
    next in line so there is nothing that can be done since the task with
    the higher priority is not touched.

    tell_cpu_to_push() increments rto_loop_next and schedules
    rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
    loop counter and do the same. Once rto_push_irq_work_func() is active it
    does nothing because it has _no_ pushable tasks on its runqueue. It then
    checks rto_next_cpu() and decides to queue irq_work on the local CPU
    because another CPU requested a push by incrementing the counter.

    I have traces where ~30 CPUs request this ~3 times each before it
    finally ends. This greatly increases X's runtime while X isn't making
    much progress.

    Teach rto_next_cpu() to only return CPUs which also have tasks on their
    runqueue which can be pushed away. This does not reduce the
    tell_cpu_to_push() invocations (rto_loop_next counter increments) but
    reduces the amount of issued rto_push_irq_work_func() if nothing can be
    done. As a result, the overloaded CPU is blocked less often.
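    The added filter can be sketched in userspace C (the array and the
    CPU scan are hypothetical stand-ins for the kernel's rto mask and
    has_pushable_tasks() check):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU flag: does this runqueue have pushable RT tasks? */
static bool has_pushable[NR_CPUS];

/* Sketch of the fixed rto_next_cpu(): scan CPUs after 'prev' and return
 * only those that actually have something to push; -1 means no IPI. */
static int rto_next_cpu(int prev)
{
	for (int cpu = prev + 1; cpu < NR_CPUS; cpu++) {
		if (has_pushable[cpu])
			return cpu;	/* worth queueing irq_work here */
	}
	return -1;			/* nothing pushable: skip the IPI */
}
```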

    There are still cases where the "same job" is repeated several times
    (for instance the current CPU needs to resched but didn't yet because
    the irq-work is repeated a few times and so the old task remains on the
    CPU), but the majority of requests end in tell_cpu_to_push() before an IPI
    is issued.

    Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

commit 93662a6fbca530b6480fc92e9d119c5ccf09da7f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed May 13 15:55:28 2020 +0200

    sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list

    Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
    are quite close and follow the same sequence for enqueuing an entity in the
    cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
    enqueue_task_fair(). This fixes a problem already faced with the latter and
    add an optimization in the last for_each_sched_entity loop.

    Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
    Reported-by: Tao Zhou <zohooouoto@zoho.com.cn>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org

commit e8164ed37c41537ee1db3584095cd533705d5ad5
Author: Josh Don <joshdon@google.com>
Date:   Fri Apr 10 15:52:08 2020 -0700

    sched/fair: Remove distribute_running from CFS bandwidth

    This is mostly a revert of commit:

      baa9be4ffb55 ("sched/fair: Fix throttle_list starvation with low CFS quota")

    The primary use of distribute_running was to determine whether to add
    throttled entities to the head or the tail of the throttled list. Now
    that we always add to the tail, we can remove this field.

    The other use of distribute_running is in the slack_timer, so that we
    don't start a distribution while one is already running. However, even
    in the event that this race occurs, it is fine to have two distributions
    running (especially now that distribute grabs the cfs_b->lock to
    determine remaining quota before assigning).

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Tested-by: Phil Auld <pauld@redhat.com>
    Link: https://lkml.kernel.org/r/20200410225208.109717-3-joshdon@google.com

commit 1eb50996e28558aabcd74d3afb37e044fedac26d
Author: bsegall@google.com <bsegall@google.com>
Date:   Thu Jun 6 10:21:01 2019 -0700

    sched/fair: Don't push cfs_bandwidth slack timers forward

    When a cfs_rq sleeps and returns its quota, we delay for 5ms before
    waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
    sleep, as this has to be done outside of the rq lock we hold.

    The current code waits for 5ms without any sleeps, instead of waiting
    for 5ms from the first sleep, which can delay the unthrottle more than
    we want. Switch this around so that we can't push this forward forever.

    This requires an extra flag rather than using hrtimer_active, since we
    need to start a new timer if the current one is in the process of
    finishing.
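    The extra flag can be sketched as follows (names and time units are
    hypothetical simplifications of the cfs_bandwidth code):

```c
#include <assert.h>
#include <stdbool.h>

#define SLACK_DELAY 5	/* the 5ms coalescing window, arbitrary units */

/* 'slack_started' is the extra flag: an hrtimer_active-style check cannot
 * tell a finishing timer from a pending one. */
struct cfs_b {
	bool slack_started;
	long timer_expires;
};

/* Sketch: arm the timer only on the first quota return; later returns
 * inside the window no longer push the expiry forward. */
static void start_cfs_slack_bandwidth(struct cfs_b *b, long now)
{
	if (b->slack_started)
		return;		/* keep the original expiry */
	b->slack_started = true;
	b->timer_expires = now + SLACK_DELAY;
}
```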

    Signed-off-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    Acked-by: Phil Auld <pauld@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: DennySPB <dennyspb@gmail.com>

commit 2d4490a11d8952e8016647a5bbfa98f6df781dde
Author: Paul Turner <pjt@google.com>
Date:   Fri Apr 10 15:52:07 2020 -0700

    sched/fair: Eliminate bandwidth race between throttling and distribution

    There is a race window in which an entity begins throttling before quota
    is added to the pool, but does not finish throttling until after we have
    finished with distribute_cfs_runtime(). This entity is not observed by
    distribute_cfs_runtime() because it was not on the throttled list at the
    time that distribution was running. This race manifests as rare
    period-length stalls for such entities.

    Rather than heavy-weight the synchronization with the progress of
    distribution, we can fix this by aborting throttling if bandwidth has
    become available. Otherwise, we immediately add the entity to the
    throttled list so that it can be observed by a subsequent distribution.
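    The control flow can be sketched with a minimal model (a plain quota
    counter stands in for cfs_b state; all names here are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: under cfs_b->lock, re-check whether runtime became available;
 * if so, consume it and abort throttling rather than missing a concurrent
 * distribution. Returns true when the cfs_rq must really throttle. */
static bool throttle_cfs_rq(long *pool, long need)
{
	if (*pool >= need) {	/* bandwidth arrived meanwhile */
		*pool -= need;
		return false;	/* abort throttling */
	}
	return true;		/* throttle; caller appends to list tail */
}
```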

    Additionally, we can remove the case of adding the throttled entity to
    the head of the throttled list, and simply always add to the tail.
    Thanks to 26a8b12747c97, distribute_cfs_runtime() no longer holds onto
    its own pool of runtime. This means that if we do hit the !assign and
    distribute_running case, we know that distribution is about to end.

    Signed-off-by: Paul Turner <pjt@google.com>
    Signed-off-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com

commit 4f08342e65cedd1082630c987d9a7bafa7703c70
Author: Huaixin Chang <changhuaixin@linux.alibaba.com>
Date:   Fri Mar 27 11:26:25 2020 +0800

    sched/fair: Fix race between runtime distribution and assignment

    Currently, there is a potential race between distribute_cfs_runtime()
    and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is
    read and distribution proceeds without holding the lock, only to find
    there is not enough runtime to charge against after distribution. This
    is because assign_cfs_rq_runtime() might be called during distribution
    and use cfs_b->runtime at the same time.

    Fibtest is the tool to test this race. Assume all gcfs_rq is throttled
    and cfs period timer runs, slow threads might run and sleep, returning
    unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local
    pool. If all this happens sufficiently quickly, cfs_b->runtime will drop
    a lot. If runtime distributed is large too, over-use of runtime happens.

    Runtime over-use of about 70 percent of quota is seen when we run
    fibtest on a 96-core machine. We run fibtest with 1 fast thread and
    95 slow threads in test group, configure 10ms quota for this group and
    see the CPU usage of fibtest is 17.0%, which is far more than the
    expected 10%.

    On a smaller machine with 32 cores, we also run fibtest with 96
    threads. CPU usage is more than 12%, which is also more than expected
    10%. This shows that on similar workloads, this race does affect CPU
    bandwidth control.

    Solve this by holding the lock inside distribute_cfs_runtime().
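    The fixed accounting can be sketched in userspace C (sequential code
    stands in for "runs with cfs_b->lock held"; names are simplified):

```c
#include <assert.h>

/* Hypothetical bandwidth pool, standing in for cfs_b->runtime. */
struct pool {
	long runtime;
};

/* Sketch: with the lock held, each grant decrements the shared pool, so
 * no consumer can ever see more runtime than is actually left. */
static long distribute_cfs_runtime(struct pool *p, const long want[], int n)
{
	long granted = 0;

	for (int i = 0; i < n; i++) {
		long give = want[i] < p->runtime ? want[i] : p->runtime;

		p->runtime -= give;
		granted += give;
		if (!p->runtime)
			break;	/* pool exhausted: stop distributing */
	}
	return granted;
}
```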

    Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
    Reviewed-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/

commit 455896c7c0c647b57e0b742d00b3b65aa8f02e87
Author: kondors1995 <normandija1945@gmail.com>
Date:   Thu Apr 4 21:48:12 2024 +0300

    sched/cass: fixup

commit 4b88ec1c7957d055e3a23067908fa0ec47cbd2c5
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Wed Mar 13 21:25:29 2024 -0700

    sched/cass: Eliminate redundant calls to smp_processor_id()

    Calling smp_processor_id() can be expensive depending on how an arch
    implements it, so avoid calling it more than necessary.

    Use the raw variant too since this code is always guaranteed to run with
    preemption disabled.

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit f7d512d05d97dea633689bf689fd8a63960248ca
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Tue Mar 12 16:55:56 2024 -0700

    sched/cass: Only treat sync waker CPU as idle if there's one task running

    For synchronized wakes, the waker's CPU should only be treated as idle if
    there aren't any other running tasks on that CPU. This is because, for
    synchronized wakes, it is assumed that the waker will immediately go to
    sleep after waking the wakee; therefore, if there aren't any other tasks
    running on the waker's CPU, it'll go idle and should be treated as such to
    improve task placement.

    This optimization only applies when there aren't any other tasks running on
    the waker's CPU, however.

    Fix it by ensuring that there's only the waker running on its CPU.
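    The fixed condition is essentially a one-liner; a hedged sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the fix: for a sync wake, the waker's CPU counts as idle only
 * when the waker is the sole running task there (it is about to sleep). */
static bool treat_waker_cpu_as_idle(bool sync, int nr_running)
{
	return sync && nr_running == 1;
}
```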

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit 3264ea2961b6a7a2e5589b199c031aa17361401a
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Mon Feb 19 13:13:02 2024 -0800

    sched/cass: Fix suboptimal task placement when uclamp is used

    Uclamp is designed to specify a process' CPU performance requirement scaled
    as a CPU capacity value. It simply denotes the process' requirement for the
    CPU's raw performance and thus P-state.

    CASS, however, currently treats uclamp as a CPU load value, producing
    wildly suboptimal CPU placement decisions for tasks which use uclamp.
    This hurts
    performance and, even worse, massively hurts energy efficiency, with CASS
    sometimes yielding power consumption that is a few times higher than EAS.

    Since uclamp inherently throws a wrench into CASS's goal of keeping
    relative P-states as low as possible across all CPUs, making it cooperate
    with CASS requires a multipronged approach.

    Make the following three changes to fix the uclamp task placement issue:
      1. Treat uclamp as a CPU performance value rather than a CPU load value.
      2. Clamp a CPU's utilization to the task's uclamp floor in order to keep
         relative P-states as low as possible across all CPUs.
      3. Consider preferring a non-idle CPU for uclamped tasks to avoid pushing
         up the P-state of more than one CPU when there are multiple concurrent
         uclamped tasks.

    This fixes CASS's massive energy efficiency and performance issues when
    uclamp is used.
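    Change #2 boils down to raising a CPU's utilization to the task's
    uclamp floor; a minimal sketch (the function name is hypothetical):

```c
#include <assert.h>

/* Sketch: clamp utilization up to the uclamp floor so the floor raises
 * the P-state only as much as the task actually requires. */
static unsigned long clamped_cpu_util(unsigned long util,
				      unsigned long uclamp_min)
{
	return util > uclamp_min ? util : uclamp_min;
}
```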

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit dc098ba6aab229c19099642f561d33562f1ff755
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Sat Jan 6 00:34:48 2024 -0800

    sched/cass: Perform runqueue selection for RT tasks too

    RT tasks aren't placed on CPUs in a load-balanced manner, much less an
    energy efficient one. On systems which contain many RT tasks and/or IRQ
    threads, energy efficiency and throughput are diminished significantly by
    the default RT runqueue selection scheme which targets minimal latency.

    In practice, performance is actually improved by spreading RT tasks fairly,
    despite the small latency impact. Additionally, energy efficiency is
    significantly improved since the placement of all tasks benefits from
    energy-efficient runqueue selection, rather than just CFS tasks.

    Perform runqueue selection for RT tasks in CASS to significantly improve
    energy efficiency and overall performance.

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit c8bbbe7148731e0024d0681f6cc8738cbe1cee06
Author: kondors1995 <normandija1945@gmail.com>
Date:   Thu Apr 4 12:48:06 2024 +0300

    sched/cass: checkout to kerneltoast/android_kernel_google_zuma@63f0b82d3

commit 1d4260eba475bad8a2aba7a974e854baefeb1040
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Wed Feb 28 21:07:44 2024 -0800

    sched/cass: Clean up local variable scope in cass_best_cpu()

    Move `curr` and `idle_state` to within the loop's scope for better
    readability. Also, leave a comment about `curr->cpu` to make it clear that
    `curr->cpu` must be initialized within the loop in order for `best->cpu` to
    be valid.

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit cc47ef1ef764b7d261402bfa04a42dc4a418ab0b
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Sun Dec 17 19:04:28 2023 -0800

    sched/cass: Fix CPU selection when no candidate CPUs are idle

    When no candidate CPUs are idle, CASS would keep `cidx` unchanged, and thus
    `best == curr` would always be true. As a result, since the empty candidate
    slot never changes, the current candidate `curr` always overwrites the best
    candidate `best`. This causes the last valid CPU to always be selected by
    CASS when no CPUs are idle (i.e., under heavy load).

    Fix it by ensuring that the CPU loop in cass_best_cpu() flips the free
    candidate index after the first candidate CPU is evaluated.
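    The two-slot candidate scheme and the flip can be sketched in plain C
    (a simplified model scoring CPUs by an abstract 'score'; the real
    cass_best_cpu() compares much richer per-CPU state):

```c
#include <assert.h>

struct cand {
	int cpu;
	long score;
};

/* Sketch: 'cidx' indexes the free slot of a two-entry candidate buffer.
 * The fix flips it whenever 'curr' becomes 'best', so 'best' and 'curr'
 * stop aliasing even when no candidate CPU is idle. */
static int best_cpu(const long score[], int n)
{
	struct cand cands[2];
	int cidx = 0;
	struct cand *best = &cands[cidx];

	for (int cpu = 0; cpu < n; cpu++) {
		struct cand *curr = &cands[cidx];

		curr->cpu = cpu;
		curr->score = score[cpu];
		if (best == curr || curr->score > best->score)
			best = curr;
		cidx ^= (best == curr);	/* free the other slot */
	}
	return best->cpu;
}
```

    Without the flip, best == curr holds on every iteration, so the last
    CPU always wins, matching the bug described in the commit.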

    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

commit 2dec239da28c7e1d19ead79686f5114e417e6d65
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Mon Nov 5 14:54:00 2018 +0000

    UPSTREAM: sched/fair: Add lsub_positive() and use it consistently

    The following pattern:

       var -= min_t(typeof(var), var, val);

    is used multiple times in fair.c.

    The existing sub_positive() already captures that pattern, but it also
    adds an explicit load-store to properly support lockless observations.
    In other cases the pattern above is used to update local, and/or not
    concurrently accessed, variables.

    Let's add a simpler version of sub_positive(), targeted at local variables
    updates, which gives the same readability benefits at calling sites,
    without enforcing {READ,WRITE}_ONCE() barriers.

    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
    Change-Id: I6a6a3b2ae9e4baa4ab6e906bf2aaed7306303025
    Signed-off-by: Jason Edson <jaysonedson@gmail.com>
    Signed-off-by: DennySPb <dennyspb@gmail.com>
    Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit fc9b2f93da21cddfc7d76fa941eb496935e48db4
Author: Rohit Jain <rohit.k.jain@oracle.com>
Date:   Wed May 2 13:52:10 2018 -0700

    sched/core: Don't schedule threads on pre-empted vCPUs

    In paravirt configurations today, spinlocks figure out whether a vCPU is
    running to determine whether or not the spinlock should bother spinning. We
    can use the same logic to prioritize CPUs when scheduling threads. If a
    vCPU has been pre-empted, it will incur the extra cost of VMENTER and
    the time it actually spends to be running on the host CPU. If we had
    other vCPUs which were actually running on the host CPU and idle we
    should schedule threads there.
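    The selection bias can be sketched in userspace C (the 'preempted'
    array stands in for the paravirt vcpu_is_preempted() query; all names
    here are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for vcpu_is_preempted(cpu). */
static bool preempted[NR_CPUS];

/* Sketch: prefer idle vCPUs that are actually running on the host, since
 * waking a task on a preempted vCPU pays an extra VMENTER plus the wait
 * until the host schedules that vCPU again. Returns -1 if none fit. */
static int pick_idle_cpu(const bool idle[], int n)
{
	for (int cpu = 0; cpu < n; cpu++) {
		if (idle[cpu] && !preempted[cpu])
			return cpu;
	}
	return -1;
}
```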

    Performance numbers:

    Note: With patch is referred to as Paravirt in the following and without
    patch is referred to as Base.

    1) When only 1 VM is running:

        a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):

    	+-------+-----------------+----------------+
    	|Number |Paravirt         |Base            |
    	|of     +---------+-------+-------+--------+
    	|Threads|Average  |Std Dev|Average| Std Dev|
    	+-------+---------+-------+-------+--------+
    	|1      |1.817    |0.076  |1.721  | 0.067  |
    	|2      |3.467    |0.120  |3.468  | 0.074  |
    	|4      |6.266    |0.035  |6.314  | 0.068  |
    	|8      |11.437   |0.105  |11.418 | 0.132  |
    	|16     |21.862   |0.167  |22.161 | 0.129  |
    	|25     |33.341   |0.326  |33.692 | 0.147  |
    	+-------+---------+-------+-------+--------+

    2) When two VMs are running with same CPU affinities:

        a) tbench test on VM 8 cpus

        Base:

    	VM1:

    	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
    	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
    	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
    	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms

    	VM2:

    	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
    	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
    	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
    	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms

        Paravirt:

    	VM1:

    	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
    	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
    	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
    	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms

    	VM2:

    	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
    	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
    	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
    	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms

        b) Hackbench with 4 fd 1,000,000 loops

    	+-------+--------------------------------------+----------------------------------------+
    	|Number |Paravirt                              |Base                                    |
    	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
    	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
    	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
    	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
    	+-------+----------+--------+---------+--------+----------+--------+---------+----------+

        Note that this test was run just to show the interference effect
        over-subscription can have on the baseline

        c) schbench results with 2 message groups on 8 vCPU VMs

    	+-----------+-------+---------------+--------------+------------+
    	|           |       | Paravirt      | Base         |            |
    	+-----------+-------+-------+-------+-------+------+------------+
    	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
    	+-----------+-------+-------+-------+-------+------+------------+
    	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
    	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
    	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
    	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
    	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
    	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
    	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
    	+-----------+-------+-------+-------+-------+------+------------+
    	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
    	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
    	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
    	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
    	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
    	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
    	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
    	+-----------+-------+-------+-------+-------+------+------------+
    	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
    	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
    	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
    	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
    	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
    	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
    	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
    	+-----------+-------+-------+-------+-------+------+------------+

        Note: Improvement is measured for (VM1+VM2)

    Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: dhaval.giani@oracle.com
    Cc: matt@codeblueprint.co.uk
    Cc: steven.sistare@oracle.com
    Cc: subhra.mazumdar@oracle.com
    Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 495ad164dcd1731db6746346355ecf52a193c720
Author: EmanuelCN <emanuelghub@gmail.com>
Date:   Wed Apr 3 18:31:28 2024 +0000

    init: Enable SCHED_THERMAL_PRESSURE by default

    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 9170eccde2a4c93f9be5b4bc0bed7bce1277f8da
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:13 2020 -0500

    sched/fair: Enable tuning of decay period

    Thermal pressure follows pelt signals which means the decay period for
    thermal pressure is the default pelt decay period. Depending on SoC
    characteristics and thermal activity, it might be beneficial to decay
    thermal pressure slower, but still in-tune with the pelt signals.  One way
    to achieve this is to provide a command line parameter to set a decay
    shift parameter to an integer between 0 and 10.
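    The mechanism amounts to right-shifting the clock fed into the PELT
    decay, so a shift of s makes thermal pressure decay 2^s times slower;
    a minimal sketch:

```c
#include <assert.h>

/* Sketch: apply the command-line decay shift (0..10) to the elapsed time
 * before it reaches the PELT machinery; shift 0 keeps the default period. */
static unsigned long thermal_decay_clock(unsigned long delta, int shift)
{
	return delta >> shift;
}
```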

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org
    Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0c8c7aaf09aea4f6296130605b75697fa21f2615
Author: Lukasz Luba <lukasz.luba@arm.com>
Date:   Thu Jun 10 16:03:22 2021 +0100

    UPSTREAM: thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure

    The thermal pressure signal gives information to the scheduler about
    reduced CPU capacity due to thermal. It is based on a value stored in a
    per-cpu 'thermal_pressure' variable. The online CPUs will get the new
    value there, while the offline won't. Unfortunately, when the CPU is back
    online, the value read from per-cpu variable might be wrong (stale data).
    This might affect the scheduler decisions, since it sees the CPU capacity
    differently than what is actually available.

    Fix it by making sure that all online+offline CPUs would get the proper
    value in their per-cpu variable when thermal framework sets capping.

    Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

    Link: https://lore.kernel.org/all/20210614191030.22241-1-lukasz.luba@arm.com/
    Bug: 199501011
    Change-Id: I10cceb48b72ccce1f51cfc0a7ecfa8d8e67d4394
    (cherry picked from commit 2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb)
    Signed-off-by: Ram Chandrasekar <quic_rkumbako@quicinc.com>
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit cdbf22d3d8bc1b7a384677f4af7ea20fd169c538
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:12 2020 -0500

    thermal: cpu-cooling: Update thermal pressure in case of a maximum frequency capping

    Thermal governors can request for a CPU's maximum supported frequency to
    be capped in case of an overheat event. This in turn means that the
    maximum capacity available for tasks to run on the particular CPU is
    reduced. Delta between the original maximum capacity and capped maximum
    capacity is known as thermal pressure. Enable cpufreq cooling device to
    update the thermal pressure in event of a capped maximum frequency.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-9-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d81bc16f14b74721dcea680568e862361f531ca8
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:10 2020 -0500

    sched/fair: Enable periodic update of average thermal pressure

    Introduce support in scheduler periodic tick and other CFS bookkeeping
    APIs to trigger the process of computing average thermal pressure for a
    CPU. Also consider avg_thermal.load_avg in others_have_blocked which
    allows for decay of pelt signals.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0b07e1ccac5d97bf63df30b1a9e36c99a1d59390
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:09 2020 -0500

    arm/topology: Populate arch_scale_thermal_pressure() for ARM platforms

    Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
    enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-6-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit f4dd9d9f8a6e9eb80615f3cf46b33a823f743e3a
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:08 2020 -0500

    arm64/topology: Populate arch_scale_thermal_pressure() for arm64 platforms

    Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
    enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-5-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 8d62e23c1aff88f3ea55d7bb1f8f69295e150189
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:07 2020 -0500

    drivers/base/arch_topology: Add infrastructure to store and update instantaneous thermal pressure

    Add architecture specific APIs to update and track thermal pressure on a
    per CPU basis. A per CPU variable thermal_pressure is introduced to keep
    track of instantaneous per CPU thermal pressure. Thermal pressure is the
    delta between maximum capacity and capped capacity due to a thermal event.

    topology_get_thermal_pressure can be hooked into the scheduler specified
    arch_scale_thermal_pressure to retrieve instantaneous thermal pressure of
    a CPU.

    arch_set_thermal_pressure can be used to update the thermal pressure.

    Considering topology_get_thermal_pressure reads thermal_pressure and
    arch_set_thermal_pressure writes into thermal_pressure, one can argue for
    some sort of locking mechanism to avoid a stale value.  But considering
    topology_get_thermal_pressure can be called from a system critical path
    like scheduler tick function, a locking mechanism is not ideal. This means
    that it is possible the thermal_pressure value used to calculate average
    thermal pressure for a CPU can be stale for up to 1 tick period.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-4-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
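
The delta described in the changelog above can be sketched as a trivial helper; the names below are illustrative only, not the kernel's actual API:

```c
#include <assert.h>

/* Illustrative sketch: thermal pressure is the delta between a CPU's
 * maximum capacity and its thermally capped capacity. The helper name
 * is hypothetical, not the kernel's. */
#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long thermal_pressure_of(unsigned long max_capacity,
					 unsigned long capped_capacity)
{
	/* A thermal event that caps capacity raises the pressure. */
	return max_capacity - capped_capacity;
}
```

With no cap in effect the pressure is zero; a cap to 768 out of 1024 yields a pressure of 256.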

commit 92727ae00ca5d23464f5d7ec7702bf62ab4ba026
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:06 2020 -0500

    sched/topology: Add callback to read per CPU thermal pressure

    Introduce the arch_scale_thermal_pressure() callback to retrieve per CPU thermal
    pressure.

    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-3-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 62abba1d3769cd0c8566f1d354035499fb6cb8a0
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date:   Fri Feb 21 19:52:05 2020 -0500

    sched/pelt: Add support to track thermal pressure

    Extrapolating on the existing framework to track rt/dl utilization using
    pelt signals, add a similar mechanism to track thermal pressure. The
    difference here from rt/dl utilization tracking is that, instead of
    tracking time spent by a CPU running a RT/DL task through util_avg, the
    average thermal pressure is tracked through load_avg. This is because
    thermal pressure signal is a time-weighted "delta" capacity, unlike util_avg
    which is binary. "Delta capacity" here means the delta between the actual
    capacity of a CPU and the decreased capacity of that CPU due to a thermal event.

    In order to track average thermal pressure, a new sched_avg variable
    avg_thermal is introduced. Function update_thermal_load_avg can be called
    to do the periodic bookkeeping (accumulate, decay and average) of the
    thermal pressure.

    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org
    Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
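
The "accumulate, decay and average" bookkeeping above relies on PELT's geometric decay, where a contribution loses roughly half its weight every 32 periods (y^32 ~= 1/2). A simplified fixed-point sketch of that decay, illustrative rather than the kernel's implementation:

```c
#include <assert.h>

/* Simplified sketch of PELT-style geometric decay: 1002/1024
 * approximates y where y^32 ~= 1/2, so a value decayed over 32
 * periods drops to roughly half. Illustrative only. */
static unsigned long decay_load(unsigned long val, unsigned int periods)
{
	while (periods--)
		val = val * 1002 / 1024;
	return val;
}
```

Decaying 1024 over 32 periods lands near 512 (a little lower due to integer truncation at each step).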

commit 82a4f68550bde32a72afb0085c8250c7dd672d21
Author: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Date:   Thu Feb 4 15:52:03 2021 +0530

    sched: fair: consider all running tasks in cpu for load balance

    Load_balancer considers only cfs running tasks when finding the busiest cpu
    to do load balancing. But a cpu may be busy with other types of tasks
    (ex: RT), in which case it might not be selected as the busy cpu because
    the weight vs nr_running checks fail. The cfs tasks running on that cpu
    would then suffer until the other tasks finish or the weight checks pass,
    while other cpus sit idle, unable to load balance.

    So, consider all running tasks when checking cpu busyness.

    Change-Id: Iddf3f668507e20359f6388fc30ff5897d234c902
    Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
    Signed-off-by: atndko <z1281552865@gmail.com>
    Signed-off-by: Cyber Knight <cyberknight755@gmail.com>

commit 21028ad73e84230251c5704576d5c97eb4390e31
Author: Lucas Stach <l.stach@pengutronix.de>
Date:   Mon Aug 31 13:07:19 2020 +0200

    sched/deadline: Fix stale throttling on de-/boosted tasks

    When a boosted task gets throttled, what normally happens is that it's
    immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
    runtime and clears the dl_throttled flag. There is a special case however:
    if the throttling happened on sched-out and the task has been deboosted in
    the meantime, the replenish is skipped as the task will return to its
    normal scheduling class. This leaves the task with the dl_throttled flag
    set.

    Now if the task gets boosted up to the deadline scheduling class again
    while it is sleeping, it's still in the throttled state. The normal wakeup
    however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
    actually place it on the rq. Thus we end up with a task that is runnable,
    but not actually on the rq, and neither an immediate replenishment happens,
    nor is the replenishment timer set up, so the task is stuck in
    forever-throttled limbo.

    Clear the dl_throttled flag before dropping back to the normal scheduling
    class to fix this issue.

    Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
    Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
    Signed-off-by: Cyber Knight <cyberknight755@gmail.com>

commit 0541494a6aa9d50deb09f922e74c12b7d319e211
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Sep 21 09:24:22 2020 +0200

    sched/fair: Reduce minimal imbalance threshold

    The 25% default imbalance threshold for DIE and NUMA domain is large
    enough to generate significant unfairness between threads. A typical
    example is the case of 11 threads running on 2x4 CPUs. The imbalance of
    20% between the 2 groups of 4 cores is just low enough to not trigger
    the load balance between the 2 groups. We will always have the same 6
    threads on one group of 4 CPUs and the other 5 threads on the other
    group of CPUs. With fair time sharing in each group, we end up with
    +20% running time for the group of 5 threads.

    Consider decreasing the imbalance threshold for overloaded case where we
    use the load to balance task and to ensure fair time sharing.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Acked-by: Hillf Danton <hdanton@sina.com>
    Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
    Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
    Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
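
The 11-threads-on-2x4-CPUs example works out as follows (a worked sketch of the changelog's arithmetic, not kernel code):

```c
#include <assert.h>

/* Worked example from the changelog: 11 threads on 2 groups of 4 CPUs
 * split 6/5. Per-thread running time is 4/5 of a CPU in one group and
 * 4/6 in the other, a gap of (4/5)/(4/6) = 6/5, i.e. +20%, which sits
 * just under the old 25% imbalance threshold, so no balance triggers. */
static int runtime_gap_pct(int busier_tasks, int quieter_tasks)
{
	/* same CPU count per group, so the gap reduces to the task ratio */
	return 100 * busier_tasks / quieter_tasks - 100;
}
```

With 6 vs 5 threads the gap is 20%, below the 25% default threshold the patch reduces.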

commit ae5539ae9bd8eb9319a3a9773f1df1aa3daacbcb
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Sep 21 09:24:24 2020 +0200

    sched/fair: Reduce busy load balance interval

    The busy_factor, which increases load balance interval when a cpu is busy,
    is set to 32 by default. This value generates some huge LB interval on
    large system like the THX2 made of 2 node x 28 cores x 4 threads.
    For such system, the interval increases from 112ms to 3584ms at MC level.
    And from 228ms to 7168ms at NUMA level.

    Even on smaller systems, a lower busy factor has shown improvement on the
    fair distribution of the running time, so let's reduce it for all.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
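
The interval growth described above follows directly from scaling the base balance interval by busy_factor (a sketch of the arithmetic, with the helper name invented here):

```c
#include <assert.h>

/* Sketch: the load-balance interval scales with busy_factor when a CPU
 * is busy. With the old busy_factor of 32, a 112ms MC-level base
 * interval grows to 3584ms on a THX2-class machine; halving the factor
 * halves the busy interval. */
static unsigned int busy_interval_ms(unsigned int base_ms,
				     unsigned int busy_factor)
{
	return base_ms * busy_factor;
}
```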

commit e644e2307fbc7b33c1baae6e8e50517b0a5924c0
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Wed Dec 11 11:38:50 2019 +0000

    BACKPORT: sched/fair: Make task_fits_capacity() consider uclamp restrictions

    task_fits_capacity() drives CPU selection at wakeup time, and is also used
    to detect misfit tasks. Right now it does so by comparing task_util_est()
    with a CPU's capacity, but doesn't take into account uclamp restrictions.

    There are a few interesting uses that can come out of doing this. For
    instance, a low uclamp.max value could prevent certain tasks from being
    flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
    Similarly, a high uclamp.min value would steer tasks towards high capacity
    CPUs at wakeup (and, should that fail, later steered via misfit balancing),
    so such "boosted" tasks would favor CPUs of higher capacity.

    Introduce uclamp_task_util() and make task_fits_capacity() use it.

    [QP: fixed missing dependency on fits_capacity() by using the open coded
    alternative]

    Bug: 120440300
    Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit a7008c07a568278ed2763436404752a98004c7ff)
    Signed-off-by: Quentin Perret <qperret@google.com>
    Change-Id: Iabde2eda7252c3bcc273e61260a7a12a7de991b1
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
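
The effect described above can be sketched by clamping the task's utilization estimate with its uclamp values before the capacity fitness check; the code below is illustrative (using the familiar ~80% fitness margin), not the kernel's implementation:

```c
#include <assert.h>

/* Hedged sketch: clamp task utilization by uclamp.min/max before the
 * capacity check. fits_capacity() uses an ~80% margin as in mainline;
 * helper names here are illustrative. */
static unsigned long clamp_util(unsigned long util, unsigned long lo,
				unsigned long hi)
{
	if (util < lo)
		return lo;
	return util > hi ? hi : util;
}

static int fits_capacity(unsigned long util, unsigned long capacity)
{
	return capacity * 1024 > util * 1280;	/* ~80% margin */
}
```

A task with util 300 fits a capacity-512 CPU, but a uclamp.min of 700 flags it as misfit there, steering it to bigger CPUs; conversely a low uclamp.max keeps a heavy task from being flagged.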

commit fed39a8a5b86100174489179f80a7514c7a85c4b
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Sep 27 17:58:58 2018 +0100

    BACKPORT: ANDROID: sched/core: Move SchedTune task API into UtilClamp wrappers

    The main SchedTune API calls related to task tuning attributes are now
    wrapped by more generic and mainlinish UtilClamp calls.

    The new APIs are:

     - uclamp_task(p)               <= boosted_task_util(p)
     - uclamp_boosted(p)            <= schedtune_task_boost(p) > 0
     - uclamp_latency_sensitive(p)  <= schedtune_prefer_idle(p)

    Let's provide also an implementation of the same API based on the new
    uclamp.uclamp_latency_sensitive flag.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    [Modified the patch to use uclamp.latency_sensitive instead of mainline
    attributes]
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ib1a6902e1c07a82a370e36bf1776d895b7528cbc
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0152bf348f6c517ba445393032bb9049c533aa27
Author: Rick Yiu <rickyiu@google.com>
Date:   Mon Jun 15 22:03:08 2020 +0800

    ANDROID: sched/tune: Consider stune boost margin when computing energy

    If CONFIG_SCHED_TUNE is enabled, it does not use boosted cpu util
    to compute energy, so it could not reflect the real freq when a
    cpu has boosted tasks on it. Addressing it by adding boost margin
    if type is FREQUENCY_UTIL in schedutil_cpu_util().

    Bug: 158637636
    Signed-off-by: Rick Yiu <rickyiu@google.com>
    Change-Id: I570920cb1e67d07de87006fca058d50e9358b7cd
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a07859716117b26c7794a22ec3a4dec64b618ba0
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Apr 4 10:24:43 2019 +0100

    ANDROID: sched/tune: Move SchedTune cpu API into UtilClamp wrappers

    The SchedTune CPU boosting API is currently used from sugov_get_util()
    to get the boosted utilization and to pass it into schedutil_cpu_util().

    When UtilClamp is in use instead we call schedutil_cpu_util() by
    passing in just the CFS utilization and the clamping is done internally
    on the aggregated CFS+RT utilization for FREQUENCY_UTIL calls.

    This asymmetry is not required; moreover, schedutil code is polluted by
    non-mainline SchedTune code.

    Wrap SchedTune API call related to cpu utilization boosting with a more
    generic and mainlinish UtilClamp call:

     - uclamp_rq_util_with(cpu, util, p)  <= boosted_cpu_util(cpu)

    This new API is already used in schedutil_cpu_util() to clamp the
    aggregated RT+CFS utilization on FREQUENCY_UTIL calls.

    Move the cpu boosting into uclamp_rq_util_with() so that we remove any
    SchedTune specific bit from kernel/sched/cpufreq_schedutil.c.

    Get rid of the no longer required boosted_cpu_util(cpu) method and replace
    it with a stune_util(cpu, util) whose signature is better aligned with
    its uclamp_rq_util_with(cpu, util, p) counterpart.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I45b0f0f54123fe0a2515fa9f1683842e6b99234f
    [Removed superfluous __maybe_unused for capacity_orig_of]
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 75d6dd5a51250b259269afa979ee3fdd2f089db2
Author: Quentin Perret <qperret@google.com>
Date:   Thu Jun 10 15:13:06 2021 +0000

    ANDROID: sched/core: Make uclamp changes depend on CAP_SYS_NICE

    There is currently nothing preventing tasks from changing their per-task
    clamp values in any way they like. The rationale is probably that
    system administrators are still able to limit those clamps thanks to the
    cgroup interface. However, this causes pain in a system where both
    per-task and per-cgroup clamp values are expected to be under the
    control of core system components (as is the case for Android).

    To fix this, let's require CAP_SYS_NICE to change per-task clamp values.
    There are ongoing discussions upstream about more flexible approaches
    than this using the RLIMIT API -- see [1]. But the upstream discussion
    has not converged yet, and this is way too late for UAPI changes in
    android12-5.10 anyway, so let's apply this change which provides the
    behaviour we want without actually impacting UAPIs.

    [1] https://lore.kernel.org/lkml/20210623123441.592348-4-qperret@google.com/

    Bug: 187186685
    Signed-off-by: Quentin Perret <qperret@google.com>
    Change-Id: I749312a77306460318ac5374cf243d00b78120dd
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b6e71f141280f787c584a8c597d7898dbff28d10
Author: Quentin Perret <qperret@google.com>
Date:   Thu Aug 5 11:21:53 2021 +0100

    BACKPORT: sched/uclamp: Fix UCLAMP_FLAG_IDLE setting

    The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
    uclamp active task (that is, when buckets.tasks reaches 0 for all
    buckets) to maintain the last uclamp.max and prevent blocked util from
    suddenly becoming visible.

    However, there is an asymmetry in how the flag is set and cleared which
    can lead to having the flag set whilst there are active tasks on the rq.
    Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
    called at enqueue time, but set in uclamp_rq_dec_id() which is called
    both when dequeueing a task _and_ in the update_uclamp_active() path. As
    a result, when both uclamp_rq_{dec,inc}_id() are called from
    update_uclamp_active(), the flag ends up being set but not cleared,
    hence leaving the runqueue in a broken state.

    Fix this by clearing the flag in update_uclamp_active() as well.

    Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
    Reported-by: Rick Yiu <rickyiu@google.com>
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
    Signed-off-by: RuRuTiaSaMa <1009087450@qq.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c0fda42ce6b4f1af0cd02cdf259aff2aaf8038cc
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Nov 13 12:34:54 2020 +0100

    UPSTREAM: sched/uclamp: Allow to reset a task uclamp constraint value

    In case the user wants to stop controlling a uclamp constraint value
    for a task, use the magic value -1 in sched_util_{min,max} with the
    appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
    the reset.

    The advantage over the 'additional flag' approach (i.e. introducing
    SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
    exported via uapi. This avoids the need to document how this new flag
    has be used in conjunction with the existing uclamp related flags.

    The following subtle issue is fixed as well. When a uclamp constraint
    value is set on a !user_defined uclamp_se it is currently first reset
    and then set.
    Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
    stands for the 'sched class change' case.
    The related condition 'if (uc_se->user_defined)' moved from
    __setscheduler_uclamp() into uclamp_reset().

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
    Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c2295e608322cc2ed46f7bc785e9e092813a630f
Author: YueHaibing <yuehaibing@huawei.com>
Date:   Tue Sep 22 21:24:10 2020 +0800

    UPSTREAM: sched/core: Remove unused inline function uclamp_bucket_base_value()

    There is no caller in tree, so it can be removed.

    Signed-off-by: YueHaibing <yuehaibing@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 384e707b762fe96e3257dc45fdd06f102077e6f0
Author: Qinglang Miao <miaoqinglang@huawei.com>
Date:   Sat Jul 25 16:56:29 2020 +0800

    UPSTREAM: sched/uclamp: Remove unnecessary mutex_init()

    The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(),
    it is unnecessary to initialize it runtime via mutex_init().

    Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Patrick Bellasi <patrick.bellasi@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a15a1c8215338f6381551dd7e42fe8103da1b104
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Jul 16 12:03:45 2020 +0100

    UPSTREAM: sched/uclamp: Add a new sysctl to control RT default boost value

    RT tasks by default run at the highest capacity/performance level. When
    uclamp is selected this default behavior is retained by enforcing the
    requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
    uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
    value.

    This is also referred to as 'the default boost value of RT tasks'.

    See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

    On battery powered devices, it is desired to control this default
    (currently hardcoded) behavior at runtime to reduce energy consumed by
    RT tasks.

    For example, a mobile device manufacturer where big.LITTLE architecture
    is dominant, the performance of the little cores varies across SoCs, and
    on high end ones the big cores could be too power hungry.

    Given the diversity of SoCs, the new knob allows manufacturers to tune
    the best performance/power for RT tasks for the particular hardware they
    run on.

    They could opt to further tune the value when the user selects
    a different power saving mode or when the device is actively charging.

    The runtime aspect of it further helps in creating a single kernel image
    that can be run on multiple devices that require different tuning.

    Keep in mind that a lot of RT tasks in the system are created by the
    kernel. On Android for instance I can see over 50 RT tasks, only
    a handful of which are created by the Android framework.

    To control the default behavior globally by system admins and device
    integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
    to change the default boost value of the RT tasks.

    I anticipate this to be mostly in the form of modifying the init script
    of a particular device.

    To avoid polluting the fast path with unnecessary code, the approach
    taken is to synchronously do the update by traversing all the existing
    tasks in the system. This could race with a concurrent fork(), which is
    dealt with by introducing sched_post_fork() function which will ensure
    the racy fork will get the right update applied.

    Tested on Juno-r2 in combination with the RT capacity awareness [1].
    By default an RT task will go to the highest capacity CPU and run at the
    maximum frequency, which is particularly energy inefficient on high end
    mobile devices because the biggest core[s] are 'huge' and power hungry.

    With this patch the RT task can be controlled to run anywhere by
    default, and doesn't cause the frequency to be maximum all the time.
    Yet any task that really needs to be boosted can easily escape this
    default behavior by modifying its requested uclamp.min value
    (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

    [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
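
As anticipated above, the knob would typically be set from a device init script; a minimal config fragment along these lines, with the path following the mainline sysctl naming and the value purely an example:

```shell
# Illustrative init-script fragment: cap the default RT boost so RT
# tasks no longer request maximum capacity by default. The value 128
# is an example, not a recommendation; tune per device.
echo 128 > /proc/sys/kernel/sched_util_clamp_min_rt_default
```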

commit 7f15a8da33343d817d828b5ca95d207e4a89495f
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Dec 2 11:20:33 2021 +0000

    UPSTREAM: sched/uclamp: Fix rq->uclamp_max not set on first enqueue

    Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct
    uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
    match the woken up task's uclamp_max when the rq is idle.

    The code was relying on rq->uclamp_max initialized to zero, so on first
    enqueue

    	static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
    					    enum uclamp_id clamp_id)
    	{
    		...

    		if (uc_se->value > READ_ONCE(uc_rq->value))
    			WRITE_ONCE(uc_rq->value, uc_se->value);
    	}

    was actually resetting it. But since commit d81ae8aac85c changed the
    default to 1024, this no longer works. And since rq->uclamp_flags is
    also initialized to 0, neither the above code path nor uclamp_idle_reset()
    update the rq->uclamp_max on first wake up from idle.

    This is only visible from first wake up(s) until the first dequeue to
    idle after enabling the static key. And it only matters if the
    uclamp_max of this task is < 1024 since only then its uclamp_max will be
    effectively ignored.

    Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
    ensure uclamp_idle_reset() is called which then will update the rq
    uclamp_max value as expected.

    Bug: 254441685
    Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
    (cherry picked from commit 315c4f884800c45cb6bd8c90422fad554a8b9588)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I621fc463a3e51361516c2479aff6c80213aaf918
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3aa7ab1653f0b9cb224c095c7f9cfa92782749f0
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Jun 17 17:51:55 2021 +0100

    UPSTREAM: sched/uclamp: Fix uclamp_tg_restrict()

    Now cpu.uclamp.min acts as a protection, we need to make sure that the
    uclamp request of the task is within the allowed range of the cgroup,
    that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
    tg->uclamp[UCLAMP_MAX].

    As reported by Xuewen [1] we can have some corner cases where there's
    inversion between uclamp requested by task (p) and the uclamp values of
    the taskgroup it's attached to (tg). The following table demonstrates
    2 corner cases:

    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	CASE 1
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  60%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	CASE 2
    	-----------+-----+------+-----------
    	uclamp_min | 0%  | 30%  |  30%
    	-----------+-----+------+-----------
    	uclamp_max | 20% | 50%  |  20%
    	-----------+-----+------+-----------

    With this fix we get:

    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	CASE 1
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  50%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	CASE 2
    	-----------+-----+------+-----------
    	uclamp_min | 0%  | 30%  |  30%
    	-----------+-----+------+-----------
    	uclamp_max | 20% | 50%  |  30%
    	-----------+-----+------+-----------

    Additionally uclamp_update_active_tasks() must now unconditionally
    update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
    instance could have an impact on the effective UCLAMP_MIN of the tasks.

    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	old
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  50%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	*new*
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   | *60%*
    	-----------+-----+------+-----------
    	uclamp_max | 80% |*70%* | *70%*
    	-----------+-----+------+-----------

    [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/

    Bug: 254441685
    Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
    Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
    (cherry picked from commit 0213b7083e81f4acd69db32cb72eb4e5f220329a)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I128d75fea2900ec7bc360b44f18cada76c968578
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
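
The restriction described above reduces to clamping the task's requested value into the cgroup's range; a sketch reproducing the tables (values are percentages, code is illustrative rather than the kernel's):

```c
#include <assert.h>

/* Sketch of uclamp_tg_restrict(): the task's requested value is
 * clamp()'ed into [tg->uclamp[UCLAMP_MIN], tg->uclamp[UCLAMP_MAX]].
 * Values are percentages as in the changelog tables. */
static int tg_restrict(int task_val, int tg_min, int tg_max)
{
	if (task_val < tg_min)
		return tg_min;
	return task_val > tg_max ? tg_max : task_val;
}
```

This reproduces both fixed cases: CASE 1 yields 50%/50% and CASE 2 yields 30%/30% for min/max respectively.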

commit 02bc00f504e7174e58fbf90c176d37c91e1e759e
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Jul 16 12:03:47 2020 +0100

    UPSTREAM: sched/uclamp: Fix a deadlock when enabling uclamp static key

    The following splat was caught when setting uclamp value of a task:

      BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

       cpus_read_lock+0x68/0x130
       static_key_enable+0x1c/0x38
       __sched_setscheduler+0x900/0xad8

    Fix by ensuring we enable the key outside of the critical section in
    __sched_setscheduler()

    Bug: 254441685
    Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
    (cherry picked from commit e65855a52b479f98674998cb23b21ef5a8144b04)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I9b33882f72b2f5a8bb8a1e077e7785f3462d1cee
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 557b3cd9f47d60ea7d5bf8be9f6d06e34786a8ec
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Wed Jun 30 22:12:04 2021 +0800

    UPSTREAM: sched/uclamp: Ignore max aggregation if rq is idle

    When a task wakes up on an idle rq, uclamp_rq_util_with() would max
    aggregate with rq value. But since there is no task enqueued yet, the
    values are stale based on the last task that was running. When the new
    task actually wakes up and enqueued, then the rq uclamp values should
    reflect that of the newly woken up task effective uclamp values.

    This is a problem particularly for uclamp_max because it defaults to
    1024. If a task p with uclamp_max = 512 wakes up, then max aggregation
    would ignore the capping that should apply when this task is enqueued,
    which is wrong.

    Fix that by ignoring max aggregation if the rq is idle since in that
    case the effective uclamp value of the rq will be the ones of the task
    that will wake up.

    Bug: 254441685
    Fixes: 9d20ad7dfc9a ("sched/uclamp: Add uclamp_util_with()")
    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    [qais: Changelog]
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
    (cherry picked from commit 3e1493f46390618ea78607cb30c58fc19e2a5035)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I6ea180d854d9d8ffa94abdac4800c9cb130f77cf
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
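
The fix above amounts to skipping the rq max-aggregation when the rq is idle; a minimal sketch for UCLAMP_MAX, with names invented for illustration:

```c
#include <assert.h>

/* Sketch of the fix: max-aggregate the task's clamp with the rq's only
 * when the rq is not idle; on an idle rq the stale rq value is ignored
 * and the waking task's own clamp wins. Illustrative only. */
static unsigned long eff_uclamp_max(unsigned long task_max,
				    unsigned long rq_max, int rq_idle)
{
	if (rq_idle)
		return task_max;
	return task_max > rq_max ? task_max : rq_max;
}
```

A task with uclamp_max = 512 waking on an idle rq is now capped at 512 instead of inheriting the stale 1024.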

commit 04eeff8cf3fc25bde9471441080f3900ccb09c3a
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Mon May 10 15:50:32 2021 +0100

    UPSTREAM: sched/uclamp: Fix locking around cpu_util_update_eff()

    cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
    uclamp_mutex or rcu_read_lock() like other call sites, which is
    a mistake.

    The uclamp_mutex is required to protect against concurrent reads and
    writes that could update the cgroup hierarchy.

    The rcu_read_lock() is required to traverse the cgroup data structures
    in cpu_util_update_eff().

    Surround the caller with the required locks and add some asserts to
    better document the dependency in cpu_util_update_eff().

    Bug: 254441685
    Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
    Reported-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com
    (cherry picked from commit 93b73858701fd01de26a4a874eb95f9b7156fd4b)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I6b11073f23f58ce4c2415cdfc46140a60e3411a2
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b60f7bed3b76a71ec848c5b690f6927bf36e2f0e
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Mon May 10 15:50:31 2021 +0100

    UPSTREAM: sched/uclamp: Fix wrong implementation of cpu.uclamp.min

    cpu.uclamp.min is a protection as described in cgroup-v2 Resource
    Distribution Model

    	Documentation/admin-guide/cgroup-v2.rst

    which means we try our best to preserve the minimum performance point of
    tasks in this group. See full description of cpu.uclamp.min in the
    cgroup-v2.rst.

    But the current implementation makes it a limit, which is not what was
    intended.

    For example:

    	tg->cpu.uclamp.min = 20%

    	p0->uclamp[UCLAMP_MIN] = 0
    	p1->uclamp[UCLAMP_MIN] = 50%

    	Previous Behavior (limit):

    		p0->effective_uclamp = 0
    		p1->effective_uclamp = 20%

    	New Behavior (Protection):

    		p0->effective_uclamp = 20%
    		p1->effective_uclamp = 50%

    Which is in line with how protections should work.

    With this change the cgroup and per-task behaviors are the same, as
    expected.

    Additionally, we remove the confusing relationship between cgroup and
    !user_defined flag.

    We don't want for example RT tasks that are boosted by default to max to
    change their boost value when they attach to a cgroup. If a cgroup wants
    to limit the max performance point of tasks attached to it, then
    cpu.uclamp.max must be set accordingly.

    Or if they want to set different boost value based on cgroup, then
    sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
    and set the right cpu.uclamp.min for each group to let the RT tasks
    obtain the desired boost value when attached to that group.

    As it stands the dependency on !user_defined flag adds an extra layer of
    complexity that is not required now cpu.uclamp.min behaves properly as
    a protection.

    The propagation model of effective cpu.uclamp.min in child cgroups as
    implemented by cpu_util_update_eff() is still correct. The parent
    protection sets an upper limit of what the child cgroups will
    effectively get.

    Bug: 254441685
    Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
    (cherry picked from commit 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I9f9f7b9e7ef3d19ccb1685f271639c9ed76b580f
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b2135621698d06e6a73b0f4ef15529321cc68e31
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Jun 30 12:21:23 2020 +0100

    BACKPORT: sched/uclamp: Protect uclamp fast path code with static key

    There is a report that when uclamp is enabled, a netperf UDP test
    regresses compared to a kernel compiled without uclamp.

    https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

    While investigating the root cause, there was no sign that the uclamp
    code was doing anything particularly expensive, but it could suffer from
    bad cache behavior under certain circumstances that are yet to be
    understood.

    https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

    To reduce the pressure on the fast path anyway, add a static key that
    by default skips executing uclamp logic in the
    enqueue/dequeue_task() fast path until it's needed.

    As soon as the user starts using util clamp by:

    	1. Changing uclamp value of a task with sched_setattr()
    	2. Modifying the default sysctl_sched_util_clamp_{min, max}
    	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

    We flip the static key now that the user has opted to use util clamp.
    Effectively re-introducing uclamp logic in the enqueue/dequeue_task()
    fast path. It stays on from that point forward until the next reboot.

    This should help minimize the effect of util clamp on workloads that
    don't need it but still allow distros to ship their kernels with uclamp
    compiled in by default.

    SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since we can now end
    up with an unbalanced call to uclamp_rq_dec_id() if we flip the key while
    a task is running in the rq. Since we know it is harmless, we just
    quietly return if we attempt a uclamp_rq_dec_id() when
    rq->uclamp[].bucket[].tasks is 0.

    In schedutil, we introduce a new uclamp_is_enabled() helper which takes
    the static key into account to ensure RT boosting behavior is retained.

    The following results demonstrate how this helps on a 2-socket Xeon E5
    2x10-core system.

                                       nouclamp                 uclamp      uclamp-static-key
    Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
    Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
    Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
    Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
    Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
    Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
    Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
    Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
    Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

    The perf diff between nouclamp and uclamp-static-key when uclamp is
    disabled in the fast path:

         8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
         0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
         0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

    The diff between nouclamp and uclamp-static-key when uclamp is enabled
    in the fast path:

         8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
         0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
         0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

    Bug: 254441685
    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Reported-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
    (cherry picked from commit 46609ce227039fd192e0ecc7d940bed587fd2c78)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I80555c22b856fbbd46692f83d501f03b6f393c35
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 322f697f698c34fd19271a1c21e756bbe0752aad
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Jun 30 12:21:22 2020 +0100

    BACKPORT: sched/uclamp: Fix initialization of struct uclamp_rq

    struct uclamp_rq was zeroed out entirely on the assumption that on the
    first call to uclamp_rq_inc() it would be initialized correctly in
    accordance with the default settings.

    But when next patch introduces a static key to skip
    uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
    will fail to perform any frequency changes because the
    rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
    means all rqs are capped to 0 by default.

    Fix it by making sure we do proper initialization at init without
    relying on uclamp_rq_inc() doing it later.

    Bug: 254441685
    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
    (cherry picked from commit d81ae8aac85ca2e307d273f6dc7863a721bf054e)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I014101dbe53c85f412f87f7f6937b18d2f141800
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 416ea5e54d1471df3352093b1aa0c3b7a4341a8f
Author: Quentin Perret <qperret@google.com>
Date:   Fri Apr 30 15:14:12 2021 +0000

    UPSTREAM: sched: Fix out-of-bound access in uclamp

    Util-clamp places tasks in different buckets based on their clamp values
    for performance reasons. However, the size of buckets is currently
    computed using a rounding division, which can lead to an off-by-one
    error in some configurations.

    For instance, with 20 buckets, the bucket size will be 1024/20=51. A
    task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly,
    correct indexes are in range [0,19], hence leading to an out of bound
    memory access.

    Clamp the bucket id to fix the issue.

    Bug: 186415778
    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Suggested-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20210430151412.160913-1-qperret@google.com
    (cherry picked from commit 6d2f8909a5fabb73fe2a63918117943986c39b6c)
    Signed-off-by: Quentin Perret <qperret@google.com>
    Change-Id: I8097f5ed34abcff36c5ed395643d65727ea969eb
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 41abff075e5f46de03fcb89a58b41a73c3f137a6
Author: Quentin Perret <qperret@google.com>
Date:   Thu Apr 16 09:59:56 2020 +0100

    BACKPORT: sched/core: Fix reset-on-fork from RT with uclamp

    uclamp_fork() resets the uclamp values to their default when the
    reset-on-fork flag is set. It also checks whether the task has a RT
    policy, and sets its uclamp.min to 1024 accordingly. However, during
    reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
    hence leading to an erroneous uclamp.min setting for the new task if it
    was forked from RT.

    Fix this by removing the unnecessary check on rt_task() in
    uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
    set.

    [ qperret: BACKPORT because of a conflict with the Android-specific
      SUGOV_RT_MAX_FREQ sched_feat, which is equally unnecessary in this
      path ]

    Bug: 120440300
    Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks")
    Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org>
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
    (cherry picked from commit eaf5a92ebde5bca3bb2565616115bd6d579486cd)
    Signed-off-by: Quentin Perret <qperret@google.com>
    Change-Id: I9a19ac5474d0508b8437e4a1d859573b4106ed08
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c719bd9543490474cb07d09d04cf7a01780c7a28
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Jan 14 21:09:47 2020 +0000

    UPSTREAM: sched/uclamp: Reject negative values in cpu_uclamp_write()

    The check to ensure that the new written value into cpu.uclamp.{min,max}
    is within range, [0:100], wasn't working because of the signed
    comparison

     7301                 if (req.percent > UCLAMP_PERCENT_SCALE) {
     7302                         req.ret = -ERANGE;
     7303                         return req;
     7304                 }

    	# echo -1 > cpu.uclamp.min
    	# cat cpu.uclamp.min
    	42949671.96

    Cast req.percent into u64 to force the comparison to be unsigned and
    work as intended in capacity_from_percent().

    	# echo -1 > cpu.uclamp.min
    	sh: write error: Numerical result out of range

    Bug: 120440300
    Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
    (cherry picked from commit b562d140649966d4daedd0483a8fe59ad3bb465a)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I17fc2b119dcbffb212e130ed2c37ae3a8d5bbb61
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit bea7af6318b69aed941c04b64fb7aec6760692fa
Author: Quentin Perret <quentin.perret@arm.com>
Date:   Tue Jul 30 13:54:00 2019 +0100

    BACKPORT: ANDROID: sched/core: Add a latency-sensitive flag to uclamp

    Add a 'latency_sensitive' flag to uclamp in order to express the need
    for some tasks to find a CPU where they can wake-up quickly. This is not
    expected to be used without cgroup support, so add solely a cgroup
    interface for it.

    As this flag represents a boolean attribute and not an amount of
    resources to be shared, it is not clear what the delegation logic should
    be. As such, it is kept simple: every new cgroup starts with
    latency_sensitive set to false, regardless of the parent.

    In essence, this is similar to SchedTune's prefer-idle flag which was
    used in android-4.19 and prior.

    Bug: 120440300
    Change-Id: I722d8ecabb428bb7b95a5b54bc70a87f182dde2a
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    (cherry picked from commit ad7dd648fc7dbe11f23673a3463af2468a274998)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit dcf228d2be609f20fb826ec420e1af7074b3265a
Author: Li Guanglei <guanglei.li@unisoc.com>
Date:   Wed Dec 25 15:44:04 2019 +0800

    FROMGIT: sched/core: Fix size of rq::uclamp initialization

    rq::uclamp is an array of struct uclamp_rq, make sure we clear the
    whole thing.

    Bug: 120440300
    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
    (cherry picked from commit dcd6dffb0a75741471297724640733fa4e958d72
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Id36a2b77c45e586535e8fadfb7d66868ca8fe8c7
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 2b98fd7366175ec1903c23885f0357b51bc4c929
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Dec 24 11:54:04 2019 +0000

    FROMGIT: sched/uclamp: Fix a bug in propagating uclamp value in new cgroups

    When a new cgroup is created, the effective uclamp value wasn't updated
    with a call to cpu_util_update_eff() that looks at the hierarchy and
    updates to the most restrictive values.

    Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
    becomes online.

    Without this change, the newly created cgroup uses the default
    root_task_group uclamp values, which are 1024 for both uclamp_{min, max}.
    This causes the rq to be clamped to max, hence causing the system to run
    at max frequency.

    The problem was observed on Ubuntu server and was reproduced on Debian
    and Buildroot rootfs.

    By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
    and add all tasks to it - which creates enough noise to keep the rq
    uclamp value at max most of the time. Imitating this behavior makes the
    problem visible in Buildroot too which otherwise looks fine since it's a
    minimal userspace.

    Bug: 120440300
    Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
    Reported-by: Doug Smythies <dsmythies@telus.net>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Doug Smythies <dsmythies@telus.net>
    Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
    (cherry picked from commit 7226017ad37a888915628e59a84a2d1e57b40707
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I9636c60e04d58bbfc5041df1305b34a12b5a3f46
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit ca1d456778b535f2b7aa6c473d4d67e4ece5a6d0
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Wed Dec 11 11:38:49 2019 +0000

    FROMGIT: sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()

    The current helper returns (CPU) rq utilization with uclamp restrictions
    taken into account. A uclamp task utilization helper would be quite
    helpful, but this requires some renaming.

    Prepare the code for the introduction of a uclamp_task_util() by renaming
    the existing uclamp_util_with() to uclamp_rq_util_with().

    Bug: 120440300
    Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit d2b58a286e89824900d501db0be1d4f6aed474fc
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I3e7146b788e079e400167203df5e5dadee2fd232
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3b187fe5f97ec4e11f313001b70ce174b068efb8
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Wed Dec 11 11:38:48 2019 +0000

    FROMGIT: sched/uclamp: Make uclamp util helpers use and return UL values

    Vincent pointed out recently that the canonical type for utilization
    values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
    cache optimization, but this doesn't have to be exported to its users.

    Make the uclamp helpers that deal with utilization use and return unsigned
    long values.

    Bug: 120440300
    Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 686516b55e98edf18c2a02d36aaaa6f4c0f6c39c
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Id3837f12237e5b77eb3a236bd32457dcd7de743e
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 9f7cbbc57004a73d8550358a95c803f49a4067bb
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Wed Dec 11 11:38:47 2019 +0000

    FROMGIT: sched/uclamp: Remove uclamp_util()

    The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
    uclamp_util_with() instead in commit:

      af24bde8df20 ("sched/uclamp: Add uclamp support to energy_compute()")

    From then on, uclamp_util() has remained unused. Being a simple wrapper
    around uclamp_util_with(), we can get rid of it and win back a few lines.

    Bug: 120440300
    Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 59fe675248ffc37d4167e9ec6920a2f3d5ec67bb
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I11dbff80c6c4be9666438800b2527aca8cd24025
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 1e2856646d5f69ab499a1c5cd9f2e3376b7a1ef9
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Wed Oct 9 11:46:11 2019 +0100

    BACKPORT: sched/rt: Make RT capacity-aware

    Capacity Awareness refers to the fact that on heterogeneous systems
    (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
    when placing tasks we need to be aware of this difference of CPU
    capacities.

    In such scenarios we want to ensure that the selected CPU has enough
    capacity to meet the requirement of the running task. Enough capacity
    means here that capacity_orig_of(cpu) >= task.requirement.

    The definition of task.requirement is dependent on the scheduling class.

    For CFS, utilization is used to select a CPU that has >= capacity value
    than the cfs_task.util.

    	capacity_orig_of(cpu) >= cfs_task.util

    DL isn't capacity aware at the moment but can make use of the bandwidth
    reservation to implement that in a similar manner CFS uses utilization.
    The following patchset implements that:

    https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/

    	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime

    For RT we don't have a per task utilization signal and we lack any
    information in general about what performance requirement the RT task
    needs. But with the introduction of uclamp, RT tasks can now control
    that by setting uclamp_min to guarantee a minimum performance point.

    ATM the uclamp values are only used for frequency selection; but on
    heterogeneous systems this is not enough and we need to ensure that the
    capacity of the CPU is >= uclamp_min. Which is what is implemented here.

    	capacity_orig_of(cpu) >= rt_task.uclamp_min

    Note that by default uclamp.min is 1024, which means that RT tasks will
    always be biased towards the big CPUs, which makes for a better, more
    predictable behavior for the default case.

    Must stress that the bias acts as a hint rather than a definite
    placement strategy. For example, if all big cores are busy executing
    other RT tasks we can't guarantee that a new RT task will be placed
    there.

    On non-heterogeneous systems the original behavior of RT should be
    retained. Similarly if uclamp is not selected in the config.

    [ mingo: Minor edits to comments. ]

    Bug: 120440300
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 804d402fb6f6487b825aae8cf42fda6426c62867
     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git)
    [Qais: resolved minor conflict in kernel/sched/cpupri.c]
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ifc9da1c47de1aec9b4d87be2614e4c8968366900
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit fbcf2f9dcec79826507ced228a9278ae404c2a8a
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Fri Nov 15 10:39:08 2019 +0000

    UPSTREAM: sched/uclamp: Fix overzealous type replacement

    Some uclamp helpers had their return type changed from 'unsigned int' to
    'enum uclamp_id' by commit

      0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

    but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
    range, which should really be unsigned int. The affected helpers are
    uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

    Note that this doesn't lead to any obj diff using a relatively recent
    aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
    properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
    anything funny. I'm still marking this as fixing the above commit to be on
    the safe side.

    Bug: 120440300
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: patrick.bellasi@matbug.net
    Cc: qperret@google.com
    Cc: surenb@google.com
    Cc: tj@kernel.org
    Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
    Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 7763baace1b738d65efa46d68326c9406311c6bf)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I924a99c125372a8fca81cb4bc0c82e6a7183fc8a
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit b52d3d696bf2b1aecece38539b2fcf4f7f8445f5
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Nov 14 21:10:52 2019 +0000

    UPSTREAM: sched/uclamp: Fix incorrect condition

    uclamp_update_active() should perform the update when
    p->uclamp[clamp_id].active is true. But when the logic was inverted in
    [1], the if condition wasn't inverted along with it.

    [1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/

    Bug: 120440300
    Reported-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
    Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 6e1ff0773f49c7d38e8b4a9df598def6afb9f415)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I51b58a6089290277e08a0aaa72b86f852eec1512
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit bbe475a9c1f66f4da03c5870aff5883fed09eb77
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Tue Nov 5 11:22:12 2019 +0000

    UPSTREAM: sched/core: Fix compilation error when cgroup not selected

    When cgroup is disabled the following compilation error was hit

    	kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
    	kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
    	  struct css_task_iter it;
    			       ^~
    	kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
    	  css_task_iter_start(css, 0, &it);
    	  ^~~~~~~~~~~~~~~~~~~
    	  __sg_page_iter_start
    	kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
    	  while ((p = css_task_iter_next(&it))) {
    		      ^~~~~~~~~~~~~~~~~~
    		      __sg_page_iter_next
    	kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
    	  css_task_iter_end(&it);
    	  ^~~~~~~~~~~~~~~~~
    	  get_task_cred
    	kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
    	  struct css_task_iter it;
    			       ^~
    	cc1: some warnings being treated as errors
    	make[2]: *** [kernel/sched/core.o] Error 1

    Fix by protecting uclamp_update_active_tasks() with
    CONFIG_UCLAMP_TASK_GROUP.

    Bug: 120440300
    Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
    Reported-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Ben Segall <bsegall@google.com>
    Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
    (cherry picked from commit e3b8b6a0d12cccf772113d6b5c1875192186fbd4)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ia4c0f801d68050526f9f117ec9189e448b01345a
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 35f8e35625c692abfc81b698e5112058d37dacad
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Sep 4 09:55:32 2019 +0200

    UPSTREAM: sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code

    Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels:

      $ chrt -p $$
      chrt: failed to get pid 26306's policy: Argument list too long

    and he has root-caused the bug to the following commit increasing sched_attr
    size and breaking sched_read_attr() into returning -EFBIG:

      a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")

    The other, bigger bug is that the whole sched_getattr() and sched_read_attr()
    logic of checking non-zero bits in new ABI components is arguably broken,
    and pretty much any extension of the ABI will spuriously break the ABI.
    That's way too fragile.

    Instead implement the perf syscall's extensible ABI instead, which we
    already implement on the sched_setattr() side:

     - if user-attributes have the same size as kernel attributes then the
       logic is unchanged.

     - if user-attributes are larger than the kernel knows about then simply
       skip the extra bits, but set attr->size to the (smaller) kernel size
       so that tooling can (in principle) handle older kernel as well.

     - if user-attributes are smaller than the kernel knows about then just
       copy whatever user-space can accept.
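
    The three cases above can be sketched in plain C (a userspace sketch
    of the copy logic, not the kernel implementation):

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the extensible-ABI copy: 'ksize' is the attribute size this
 * kernel knows about, 'usize' is the size of the user-space buffer.
 * Struct and function names here are illustrative.
 */
struct kattr_sketch {
	unsigned int size;
	unsigned int policy;
	unsigned int newer_field;	/* stands in for future extensions */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Copy min(ksize, usize) bytes; never fail with -EFBIG/-E2BIG. */
static int sched_attr_copy_sketch(void *ubuf, unsigned int usize,
				  struct kattr_sketch *kattr,
				  unsigned int ksize)
{
	/* Report the effectively copied size so tooling can adapt. */
	kattr->size = min_u(usize, ksize);
	memcpy(ubuf, kattr, kattr->size);
	return 0;
}
```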

    Also clean up the whole logic:

     - Simplify the code flow - there's no need for 'ret' for example.

     - Standardize on 'kattr/uattr' and 'ksize/usize' naming to make sure we
       always know which side we are dealing with.

     - Why is it called 'read' when what it does is to copy to user? This
       code is so far away from VFS read() semantics that the naming is
       actively confusing. Name it sched_attr_copy_to_user() instead, which
       mirrors other copy_to_user() functionality.

     - Move the attr->size assignment from the head of sched_getattr() to the
       sched_attr_copy_to_user() function. Nothing else within the kernel
       should care about the size of the structure.

    With these fixes the sched_getattr() syscall now nicely supports an
    extensible ABI in both a forward and backward compatible fashion, and
    will also fix the chrt bug.

    As an added bonus the bogus -EFBIG return is removed as well, which as
    Thadeu noted should have been -E2BIG to begin with.

    Bug: 120440300
    Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Tested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Patrick Bellasi <patrick.bellasi@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
    Link: https://lkml.kernel.org/r/20190904075532.GA26751@gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 1251201c0d34fadf69d56efa675c2b7dd0a90eca)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I67e653c4f69db0140e9651c125b60e2b8cfd62f1
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 2e5fab725ddfc0db94558a0eb68fbb7584ab1a93
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:11 2019 +0100

    UPSTREAM: sched/uclamp: Always use 'enum uclamp_id' for clamp_id values

    The supported clamp indexes are defined in 'enum clamp_id', however, because
    of the code logic in some of the first utilization clamping series version,
    sometimes we needed to use 'unsigned int' to represent indices.

    This is no longer required since the final version of the uclamp_* APIs can
    always use the proper enum uclamp_id type.

    Fix it with a bulk rename now that we have all the bits merged.
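
    For reference, the clamp indexes and the iteration idiom the rename
    standardizes on look like this (simplified sketch):

```c
#include <assert.h>

/* The clamp indexes; iterating with the enum type rather than
 * 'unsigned int' is what the bulk rename enforces. */
enum uclamp_id {
	UCLAMP_MIN = 0,		/* minimum utilization (protection) */
	UCLAMP_MAX,		/* maximum utilization (limit) */
	UCLAMP_CNT
};

static int count_clamp_ids(void)
{
	enum uclamp_id clamp_id;
	int n = 0;

	for (clamp_id = UCLAMP_MIN; clamp_id < UCLAMP_CNT; clamp_id++)
		n++;
	return n;
}
```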

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 0413d7f33e60751570fd6c179546bde2f7d82dcb)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I0be680b2489fa07244bac63b5c6fe1a79a53bef7
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 27326e00f48ecb7241256b777090282295061a10
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:10 2019 +0100

    UPSTREAM: sched/uclamp: Update CPU's refcount on TG's clamp changes

    On updates of task group (TG) clamp values, ensure that these new values
    are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
    tasks are immediately boosted and/or capped as requested.

    Do that each time we update effective clamps from cpu_util_update_eff().
    Use the *cgroup_subsys_state (css) to walk the list of tasks in each
    affected TG and update their RUNNABLE tasks.
    Update each task by using the same mechanism used for cpu affinity masks
    updates, i.e. by taking the rq lock.
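
    A minimal sketch of that walk, with a plain array standing in for
    the css_task_iter and the rq lock elided:

```c
#include <assert.h>

/* Illustrative task representation; in the kernel the iteration is
 * css_task_iter_start()/_next()/_end() under task_rq_lock(). */
struct sk_task {
	int runnable;
	unsigned int uclamp_min;
};

static void tg_update_active_tasks_sketch(struct sk_task *tasks, int n,
					  unsigned int new_min)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!tasks[i].runnable)
			continue;	/* sleepers pick it up at enqueue */
		tasks[i].uclamp_min = new_min;	/* dequeue/requeue elided */
	}
}
```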

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit babbe170e053c6ec2343751749995b7b9fd5fd2c)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I5e48891bd48c266dd282e1bab8f60533e4e29b48
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit e66ba8289ce3a479f0c6c52d25307c4d2b22f256
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:09 2019 +0100

    UPSTREAM: sched/uclamp: Use TG's clamps to restrict TASK's clamps

    When a task specific clamp value is configured via sched_setattr(2), this
    value is accounted in the corresponding clamp bucket every time the task is
    {en,de}queued. However, when cgroups are also in use, the task specific
    clamp values could be restricted by the task_group (TG) clamp values.

    Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
    task is enqueued, it's accounted in the clamp bucket tracking the smaller
    clamp between the task specific value and its TG effective value. This
    allows to:

    1. ensure cgroup clamps are always used to restrict task specific requests,
       i.e. boosted not more than its TG effective protection and capped at
       least as its TG effective limit.

    2. implement a "nice-like" policy, where tasks are still allowed to request
       less than what enforced by their TG effective limits and protections

    Do this by exploiting the concept of "effective" clamp, which is already
    used by a TG to track parent enforced restrictions.

    Apply task group clamp restrictions only to tasks belonging to a child
    group. While, for tasks in the root group or in an autogroup, system
    defaults are still enforced.
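
    The resulting aggregation rule can be sketched as (illustrative
    names, not the kernel's):

```c
#include <assert.h>

/* The clamp bucket a task is accounted in tracks the smaller of its
 * own request and the task group's effective value. */
static unsigned int uclamp_tg_restrict_sketch(unsigned int task_req,
					      unsigned int tg_effective)
{
	/* "nice-like": a task may request less than its group allows,
	 * never more. */
	return task_req < tg_effective ? task_req : tg_effective;
}
```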

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 3eac870a324728e5d17118888840dad70bcd37f3)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I0215e0a68cc0fa7c441e33052757f8571b7c99b9
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d26df4b554e45013689e091881875e767cc55122
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:08 2019 +0100

    UPSTREAM: sched/uclamp: Propagate system defaults to the root group

    The clamp values are not tunable at the level of the root task group.
    That's for two main reasons:

     - the root group represents "system resources" which are always
       entirely available from the cgroup standpoint.

     - when tuning/restricting "system resources" makes sense, tuning must
       be done using a system wide API which should also be available when
       control groups are not.

    When a system wide restriction is available, cgroups should be aware of
    its value in order to know exactly how much "system resources" are
    available for the subgroups.

    Utilization clamping already supports the concepts of:

     - system defaults: which define the maximum possible clamp values
       usable by tasks.

     - effective clamps: which allows a parent cgroup to constraint (maybe
       temporarily) its descendants without losing the information related
       to the values "requested" from them.

    Exploit these two concepts and bind them together in such a way that,
    whenever system default are tuned, the new values are propagated to
    (possibly) restrict or relax the "effective" value of nested cgroups.

    When cgroups are in use, force an update of all the RUNNABLE tasks.
    Otherwise, keep things simple and do just a lazy update next time each
    task will be enqueued.
    Do that since we assume stricter resource control is required when
    cgroups are in use. This also keeps "effective" clamp values
    updated in case we need to expose them to user-space.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ibf7ce5c46b67c79765b56b792ee22ed9595802c3
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 46b50a614c89320b23ab7615fb3cd92083c38bcf
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:07 2019 +0100

    UPSTREAM: sched/uclamp: Propagate parent clamps

    In order to properly support hierarchical resources control, the cgroup
    delegation model requires that attribute writes from a child group never
    fail but still are locally consistent and constrained based on parent's
    assigned resources. This requires properly propagating and aggregating
    parent attributes down to its descendants.

    Implement this mechanism by adding a new "effective" clamp value for each
    task group. The effective clamp value is defined as the smaller value
    between the clamp value of a group and the effective clamp value of its
    parent. This is the actual clamp value enforced on tasks in a task group.

    Since it's possible for a cpu.uclamp.min value to be bigger than the
    cpu.uclamp.max value, ensure local consistency by restricting each
    "protection" (i.e. min utilization) with the corresponding "limit"
    (i.e. max utilization).

    Do that at effective clamp propagation time to ensure user-space writes
    never fail while still tracking the most restrictive values.
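
    A sketch of one propagation step (illustrative types and names):

```c
#include <assert.h>

struct eff_clamps {
	unsigned int min;	/* protection */
	unsigned int max;	/* limit */
};

/*
 * A group's effective value is the smaller of its own request and its
 * parent's effective value; the protection is then restricted by the
 * limit so a cpu.uclamp.min > cpu.uclamp.max write never fails.
 */
static struct eff_clamps uclamp_effective_sketch(struct eff_clamps req,
						 struct eff_clamps parent)
{
	struct eff_clamps eff;

	eff.min = req.min < parent.min ? req.min : parent.min;
	eff.max = req.max < parent.max ? req.max : parent.max;
	if (eff.min > eff.max)
		eff.min = eff.max;
	return eff;
}
```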

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 0b60ba2dd342016e4e717dbaa4ca9af3a43f4434)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: If1cc136e1fb4a8f4c6ea15dc440b28d833a8d7e7
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 067c0b19313dc9e3730abf9ceb5942420a2c3fdd
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Thu Aug 22 14:28:06 2019 +0100

    BACKPORT: sched/uclamp: Extend CPU's cgroup controller

    The cgroup CPU bandwidth controller allows to assign a specified
    (maximum) bandwidth to the tasks of a group. However this bandwidth is
    defined and enforced only on a temporal base, without considering the
    actual frequency a CPU is running on. Thus, the amount of computation
    completed by a task within an allocated bandwidth can be very different
    depending on the actual frequency at which the CPU runs that task.
    The amount of computation can be affected also by the specific CPU a
    task is running on, especially when running on asymmetric capacity
    systems like Arm's big.LITTLE.

    With the availability of schedutil, the scheduler is now able
    to drive frequency selections based on actual task utilization.
    Moreover, the utilization clamping support provides a mechanism to
    bias the frequency selection operated by schedutil depending on
    constraints assigned to the tasks currently RUNNABLE on a CPU.

    Giving the mechanisms described above, it is now possible to extend the
    cpu controller to specify the minimum (or maximum) utilization which
    should be considered for tasks RUNNABLE on a cpu.
    This makes it possible to better define the actual computational
    power assigned to task groups, thus improving the cgroup CPU bandwidth
    controller which is currently based just on time constraints.

    Extend the CPU controller with a couple of new attributes uclamp.{min,max}
    which allow to enforce utilization boosting and capping for all the
    tasks in a group.

    Specifically:

    - uclamp.min: defines the minimum utilization which should be considered
    	      i.e. the RUNNABLE tasks of this group will run at least at a
    	      minimum frequency which corresponds to the uclamp.min
    	      utilization

    - uclamp.max: defines the maximum utilization which should be considered
    	      i.e. the RUNNABLE tasks of this group will run up to a
    	      maximum frequency which corresponds to the uclamp.max
    	      utilization

    These attributes:

    a) are available only for non-root nodes, both on default and legacy
       hierarchies, while system wide clamps are defined by a generic
       interface which does not depend on cgroups. This system wide
       interface enforces constraints on tasks in the root node.

    b) enforce effective constraints at each level of the hierarchy which
       are a restriction of the group requests considering its parent's
       effective constraints. Root group effective constraints are defined
       by the system wide interface.
       This mechanism allows each (non-root) level of the hierarchy to:
       - request whatever clamp values it would like to get
       - effectively get only up to the maximum amount allowed by its parent

    c) have higher priority than task-specific clamps, defined via
       sched_setattr(), thus allowing to control and restrict task requests.

    Add two new attributes to the cpu controller to collect "requested"
    clamp values. Allow that at each non-root level of the hierarchy.
    Keep it simple by not caring now about "effective" values computation
    and propagation along the hierarchy.

    Update sysctl_sched_uclamp_handler() to use the newly introduced
    uclamp_mutex so that we serialize system default updates with cgroup
    relate updates.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutny <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 2480c093130f64ac3a410504fa8b3db1fc4b87ce)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I0285c44910bf073b80d7996361e6698bc5aedfae
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 64b042aac4283cf4456f213f3546daee0c731785
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:11 2019 +0100

    UPSTREAM: sched/uclamp: Add uclamp_util_with()

    So far uclamp_util() allows to clamp a specified utilization considering
    the clamp values requested by RUNNABLE tasks in a CPU. For the Energy
    Aware Scheduler (EAS) it is interesting to test how clamp values will
    change when a task is becoming RUNNABLE on a given CPU.
    For example, EAS is interested in comparing the energy impact of
    different scheduling decisions and the clamp values can play a role on
    that.

    Add uclamp_util_with() which allows to clamp a given utilization by
    considering the possible impact on CPU clamp values of a specified task.
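
    A sketch of the intended semantics (illustrative signature, not the
    kernel's uclamp_util_with()):

```c
#include <assert.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * "What if" evaluation: combine the CPU's current clamps with those of
 * a task p that may become RUNNABLE there, then clamp the utilization.
 */
static unsigned int uclamp_util_with_sketch(unsigned int util,
					    unsigned int cpu_min,
					    unsigned int cpu_max,
					    unsigned int p_min,
					    unsigned int p_max)
{
	unsigned int min_util = max_u(cpu_min, p_min);
	unsigned int max_util = max_u(cpu_max, p_max);

	if (min_util > max_util)	/* the boost wins over the cap */
		max_util = min_util;
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```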

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 9d20ad7dfc9a5cc64e33d725902d3863d350a66a)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ida153a3526b87f5674a6e037d4725d99eec7b478
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0fc858cc7a9712c08da29adc187e69279768a6e4
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:10 2019 +0100

    BACKPORT: sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks

    Each time a frequency update is required via schedutil, a frequency is
    selected to (possibly) satisfy the utilization reported by each
    scheduling class and irqs. However, when utilization clamping is in use,
    the frequency selection should consider userspace utilization clamping
    hints.  This will allow, for example, to:

     - boost tasks which are directly affecting the user experience
       by running them at least at a minimum "requested" frequency

     - cap low priority tasks not directly affecting the user experience
       by running them only up to a maximum "allowed" frequency

    These constraints are meant to support a per-task based tuning of the
    frequency selection thus supporting a fine grained definition of
    performance boosting vs energy saving strategies in kernel space.

    Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
    within the boundaries defined by their aggregated utilization clamp
    constraints.

    Do that by considering the max(min_util, max_util) to give boosted tasks
    the performance they need even when they happen to be co-scheduled with
    other capped tasks.
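
    The clamping rule can be sketched as (illustrative name):

```c
#include <assert.h>

/* max(min_util, max_util): when the clamps conflict, the boost wins
 * over the cap, so a boosted task keeps its performance even when
 * co-scheduled with capped tasks. */
static unsigned int sugov_clamp_util_sketch(unsigned int util,
					    unsigned int min_util,
					    unsigned int max_util)
{
	if (min_util > max_util)
		max_util = min_util;
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```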

    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 982d9cdc22c9f6df5ad790caa229ff74fb1d95e7)

    Conflicts:
    	kernel/sched/cpufreq_schedutil.c

    	1. Merged the if condition to include the non-upstream
    	   sched_feat(SUGOV_RT_MAX_FREQ) check

    	2. Change the function signature to pass util_cfs and define
    	   util as an automatic variable.

    Bug: 120440300
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ie222c9ad84776fc2948e30c116eee876df697a17
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c927abaedd42cf29f8e3dc92613e84c4375e4773
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:09 2019 +0100

    UPSTREAM: sched/uclamp: Set default clamps for RT tasks

    By default FAIR tasks start without clamps, i.e. neither boosted nor
    capped, and they run at the best frequency matching their utilization
    demand.  This default behavior does not fit RT tasks which instead are
    expected to run at the maximum available frequency, if not otherwise
    required by explicitly capping them.

    Enforce the correct behavior for RT tasks by setting util_min to max
    whenever:

     1. the task is switched to the RT class and it does not already have a
        user-defined clamp value assigned.

     2. an RT task is forked from a parent with RESET_ON_FORK set.

    NOTE: utilization clamp values are cross scheduling class attributes and
    thus they are never changed/reset once a value has been explicitly
    defined from user-space.
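
    A sketch of rule 1 above (illustrative types; SKETCH_CAPACITY_SCALE
    stands in for SCHED_CAPACITY_SCALE):

```c
#include <assert.h>

#define SKETCH_CAPACITY_SCALE 1024

struct rt_clamp_sketch {
	unsigned int util_min;
	int user_defined;	/* set once user-space configured a clamp */
};

/* On switching to the RT class: boost util_min to max, but never
 * override an explicit user-space request. */
static void uclamp_rt_default_sketch(struct rt_clamp_sketch *tc)
{
	if (!tc->user_defined)
		tc->util_min = SKETCH_CAPACITY_SCALE;
}
```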

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-9-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 1a00d999971c78ab024a17b0efc37d78404dd120)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I81fcadaea34f557e531fa5ac6aab84fcb0ee37c7
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit bd6028f582359eac294fe7363206330696c84b9c
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:08 2019 +0100

    UPSTREAM: sched/uclamp: Reset uclamp values on RESET_ON_FORK

    A forked task gets the same clamp values as its parent. However, when
    the RESET_ON_FORK flag is set on the parent, e.g. via:

       sys_sched_setattr()
          sched_setattr()
             __sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)

    the new forked task is expected to start with all attributes reset to
    default values.

    Do that for utilization clamp values too by checking the reset request
    from the existing uclamp_fork() call which already provides the required
    initialization for other uclamp related bits.
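
    A sketch of the reset path (illustrative types and default values):

```c
#include <assert.h>

struct fork_clamp_sketch {
	unsigned int util_min;
	unsigned int util_max;
	int user_defined;
};

/* uclamp_fork() sketch: the child inherited the parent's clamps at
 * copy time; if RESET_ON_FORK was requested, put the defaults back. */
static void uclamp_fork_sketch(struct fork_clamp_sketch *child,
			       int reset_on_fork)
{
	if (!reset_on_fork)
		return;
	child->util_min = 0;	/* default: no boost */
	child->util_max = 1024;	/* default: no cap   */
	child->user_defined = 0;
}
```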

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-8-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit a87498ace58e23b62a572dc7267579ede4c8495c)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: If7bda202707aac3a2696a42f8146f607cdd36905
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c713ddf1b1d0c7fac555eacf8579b8ffd5bdd96d
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:07 2019 +0100

    BACKPORT: sched/uclamp: Extend sched_setattr() to support utilization clamping

    The SCHED_DEADLINE scheduling class provides an advanced and formal
    model to define tasks requirements that can translate into proper
    decisions for both task placements and frequencies selections. Other
    classes have a more simplified model based on the POSIX concept of
    priorities.

    Such a simple priority-based model, however, does not allow exploiting
    the most advanced features of the Linux scheduler like, for example, driving
    frequencies selection via the schedutil cpufreq governor. However, also
    for non SCHED_DEADLINE tasks, it's still interesting to define tasks
    properties to support scheduler decisions.

    Utilization clamping exposes to user-space a new set of per-task
    attributes the scheduler can use as hints about the expected/required
    utilization for a task. This allows to implement a "proactive" per-task
    frequency control policy, a more advanced policy than the current one
    based just on "passive" measured task utilization. For example, it's
    possible to boost interactive tasks (e.g. to get better performance) or
    cap background tasks (e.g. to be more energy/thermal efficient).

    Introduce a new API to set utilization clamping values for a specified
    task by extending sched_setattr(), a syscall which already allows to
    define task specific properties for different scheduling classes. A new
    pair of attributes allows to specify a minimum and maximum utilization
    the scheduler can consider for a task.

    Do that by validating the required clamp values before and then applying
    the required changes using the same pattern already in use for
    __setscheduler(). This ensures that the task is re-enqueued with the new
    clamp values.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit a509a7cd79747074a2c018a45bbbc52d1f4aed44)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I420e7ece5628bc639811a79654c35135a65bfd02
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 6306516692f50c3a2d2119142e393f5ef9f48259
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:06 2019 +0100

    BACKPORT: sched/core: Allow sched_setattr() to use the current policy

    The sched_setattr() syscall mandates that a policy is always specified.
    This requires to always know which policy a task will have when
    attributes are configured and this makes it impossible to add more
    generic task attributes valid across different scheduling policies.
    Reading the policy before setting generic task attributes is racy since
    we cannot be sure it is not changed concurrently.

    Introduce the required support to change generic task attributes without
    affecting the current task policy. This is done by adding an attribute flag
    (SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

    Add support for the SETPARAM_POLICY policy, which is already used by the
    sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
    syscall.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-6-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 1d6362fa0cfc8c7b243fa92924429d826599e691)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I41cbe73d7aa30123adbd757fa30e346938651784
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 62d50c5a8556377176c6e3ef1226aa968e714730
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:05 2019 +0100

    UPSTREAM: sched/uclamp: Add system default clamps

    Tasks without a user-defined clamp value are considered not clamped
    and by default their utilization can have any value in the
    [0..SCHED_CAPACITY_SCALE] range.

    Tasks with a user-defined clamp value are allowed to request any value
    in that range, and the required clamp is unconditionally enforced.
    However, a "System Management Software" could be interested in limiting
    the range of clamp values allowed for all tasks.

    Add a privileged interface to define a system default configuration via:

      /proc/sys/kernel/sched_uclamp_util_{min,max}

    which works as an unconditional clamp range restriction for all tasks.

    With the default configuration, the full SCHED_CAPACITY_SCALE range of
    values is allowed for each clamp index. Otherwise, the task-specific
    clamp is capped by the corresponding system default value.

    Do that by tracking, for each task, the "effective" clamp value and
    bucket the task has been refcounted in at enqueue time. This
    allows to lazy aggregate "requested" and "system default" values at
    enqueue time and simplifies refcounting updates at dequeue time.

    The cached bucket ids are used to avoid (relatively) more expensive
    integer divisions every time a task is enqueued.

    An active flag is used to report when the "effective" value is valid and
    thus the task is actually refcounted in the corresponding rq's bucket.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit e8f14172c6b11e9a86c65532497087f8eb0f91b1)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I4f014c5ec9c312aaad606518f6e205fd0cfbcaa2
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 8bd52ebd588ad2c724fb789ee11b6e3179145eba
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:04 2019 +0100

    UPSTREAM: sched/uclamp: Enforce last task's UCLAMP_MAX

    When a task sleeps it removes its max utilization clamp from its CPU.
    However, the blocked utilization on that CPU can be higher than the max
    clamp value enforced while the task was running. This allows undesired
    CPU frequency increases while a CPU is idle, for example, when another
    CPU on the same frequency domain triggers a frequency update, since
    schedutil can now see the full not clamped blocked utilization of the
    idle CPU.

    Fix this by using:

      uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
        uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)

    to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
    condition.

    Don't track any minimum utilization clamps since an idle CPU never
    requires a minimum frequency. The decay of the blocked utilization is
    good enough to reduce the CPU frequency.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit e496187da71070687b55ff455e7d8d7d7f0ae0b9)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: Ie9eab897eb654ec9d4fba5eda20f66a91a712817
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit d3048c160f56d4003be0a9867b975140ea5efc0e
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:03 2019 +0100

    UPSTREAM: sched/uclamp: Add bucket local max tracking

    Because of bucketization, different task-specific clamp values are
    tracked in the same bucket.  For example, with 20% bucket size and
    assuming to have:

      Task1: util_min=25%
      Task2: util_min=35%

    both tasks will be refcounted in the [20..39]% bucket and always boosted
    only up to 20% thus implementing a simple floor aggregation normally
    used in histograms.

    In systems with only few and well-defined clamp values, it would be
    useful to track the exact clamp value required by a task whenever
    possible. For example, if a system requires only 23% and 47% boost
    values then it's possible to track the exact boost required by each
    task using only 3 buckets of ~33% size each.

    Introduce a mechanism to max aggregate the requested clamp values of
    RUNNABLE tasks in the same bucket. Keep it simple by resetting the
    bucket value to its base value only when a bucket becomes inactive.
    Allow a limited and controlled overboosting margin for tasks refcounted
    in the same bucket.

    In systems where the boost values are not known in advance, it is still
    possible to control the maximum acceptable overboosting margin by tuning
    the number of clamp groups. For example, 20 groups ensure a 5% maximum
    overboost.

    Remove the rq bucket initialization code since a correct bucket value
    is now computed when a task is refcounted into a CPU's rq.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-3-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 60daf9c19410604f08c99e146bc378c8a64f4ccd)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I8782971f8867033cee5aaf981c96f9de33a5288c
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 016fa3c1ea0a4a5bed08ff564ae45903ce3dfb9b
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date:   Fri Jun 21 09:42:02 2019 +0100

    BACKPORT: sched/uclamp: Add CPU's clamp buckets refcounting

    Utilization clamping allows to clamp the CPU's utilization within a
    [util_min, util_max] range, depending on the set of RUNNABLE tasks on
    that CPU. Each task references two "clamp buckets" defining its minimum
    and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
    bucket is active if there is at least one RUNNABLE task enqueued on
    that CPU and refcounting that bucket.

    When a task is {en,de}queued {on,from} a rq, the set of active clamp
    buckets on that CPU can change. If the set of active clamp buckets
    changes for a CPU a new "aggregated" clamp value is computed for that
    CPU. This is because each clamp bucket enforces a different utilization
    clamp value.

    Clamp values are always MAX aggregated for both util_min and util_max.
    This ensures that no task can affect the performance of other
    co-scheduled tasks which are more boosted (i.e. with higher util_min
    clamp) or less capped (i.e. with higher util_max clamp).

    A task has:
       task_struct::uclamp[clamp_id]::bucket_id
    to track the "bucket index" of the CPU's clamp bucket it refcounts while
    enqueued, for each clamp index (clamp_id).

    A runqueue has:
       rq::uclamp[clamp_id]::bucket[bucket_id].tasks
    to track how many RUNNABLE tasks on that CPU refcount each
    clamp bucket (bucket_id) of a clamp index (clamp_id).
    It also has a:
       rq::uclamp[clamp_id]::bucket[bucket_id].value
    to track the clamp value of each clamp bucket (bucket_id) of a clamp
    index (clamp_id).

    The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
    needed to find a new MAX aggregated clamp value for a clamp_id. This
    operation is required only when it's dequeued the last task of a clamp
    bucket tracking the current MAX aggregated clamp value. In this case,
    the CPU is either entering IDLE or going to schedule a less boosted or
    more clamped task.
    The expected number of different clamp values configured at build time
    is small enough to fit the full unordered array into a single cache
    line, for configurations of up to 7 buckets.

    Add to struct rq the basic data structures required to refcount the
    number of RUNNABLE tasks for each clamp bucket. Add also the max
    aggregation required to update the rq's clamp value at each
    enqueue/dequeue event.

    Use a simple linear mapping of clamp values into clamp buckets.
    Pre-compute and cache bucket_id to avoid integer divisions at
    enqueue/dequeue time.

    Bug: 120440300
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alessio Balsini <balsini@android.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
    Cc: Steve Muckle <smuckle@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Todd Kjos <tkjos@google.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 69842cba9ace84849bb9b8edcdf2cefccd97901c)
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Change-Id: I2c2c23572fb82e004f815cc9c783881355df6836
    Signed-off-by: Quentin Perret <qperret@google.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit ff72a25515d4673ccdfe6a7a8606326725ee057f
Author: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Date:   Wed May 22 15:56:14 2024 +0800

    cpufreq: schedutil: Checkout to msm-4.19

    * From: https://github.com/EmanuelCN/kernel_xiaomi_sm8250/blob/staging/kernel/sched/cpufreq_schedutil.c
    * Removed sugov_get_util() under CONFIG_SCHED_WALT guard, because we don't use WALT anymore
    * Preserved 819f63f and c2bdfaf, also introduce new definitions from 4.19 for new schedutil

    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit eecd40ea2f0a9b22cd79f288da288e1d4513c086
Author: Viresh Kumar <viresh.kumar@linaro.org>
Date:   Tue May 22 15:31:30 2018 +0530

    cpufreq: Rename cpufreq_can_do_remote_dvfs()

    This routine checks if the CPU running this code belongs to the policy
    of the target CPU or if not, can it do remote DVFS for it remotely. But
    the current name of it implies as if it is only about doing remote
    updates.

    Rename it to make it more relevant.

    Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c201be3d3447e286e7e6924dc06fd61fe923954b
Author: kondors1995 <normandija1945@gmail.com>
Date:   Mon Aug 5 10:10:39 2024 +0300

    Revert "cpufreq: Avoid leaving stale IRQ work items during CPU offline"

    This reverts commit c0079a7b3b.

commit 69078f1d44b561a0eddceb43873b2ad5a772fe9d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Feb 6 17:14:22 2019 +0100

    sched/fair: Fix O(nr_cgroups) in the load balancing path

    commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 upstream.

    This re-applies the commit reverted here:

      commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

    I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply:

     commit a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: sargun@sargun.me
    Cc: tj@kernel.org
    Cc: xiexiuqi@huawei.com
    Cc: xiezhipeng1@huawei.com
    Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 178af989ed8209fc342dfd89bdf23b935f7a2ce1
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Feb 6 17:14:21 2019 +0100

    sched/fair: Optimize update_blocked_averages()

    commit 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c upstream.

    Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
    ordering of the list when it will be added back. In order to remove an
    empty and fully decayed cfs_rq, we must remove its children too, so they
    will be added back in the right order next time.

    With a normal decay of PELT, a parent will be empty and fully decayed
    if all children are empty and fully decayed too. In such a case, we just
    have to ensure that the whole branch will be added when a new task is
    enqueued. This is default behavior since :

      commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")

    In case of throttling, the PELT of throttled cfs_rq will not be updated
    whereas the parent will. This breaks the assumption made above unless we
    remove the children of a cfs_rq that is throttled. Then, they will be
    added back when unthrottled and a sched_entity will be enqueued.

    As throttled cfs_rq are now removed from the list, we can remove the
    associated test in update_blocked_averages().

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: sargun@sargun.me
    Cc: tj@kernel.org
    Cc: xiexiuqi@huawei.com
    Cc: xiezhipeng1@huawei.com
    Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 5f6df438eb114f68750cbaf73ff655cf09767aa9
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Jan 30 06:22:47 2019 +0100

    sched/fair: Fix insertion in rq->leaf_cfs_rq_list

    commit f6783319737f28e4436a69611853a5a098cbe974 upstream.

    Sargun reported a crash:

      "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
       infinite loop in update_blocked_averages() by reverting a9e7f6544b
       and put it on top of 4.19.13. In addition to this, I uninlined
       list_add_leaf_cfs_rq for debugging.

       This revealed a new bug that we didn't get to because we kept getting
       crashes from the previous issue. When we are running with cgroups that
       are rapidly changing, with CFS bandwidth control, and in addition
       using the cpusets cgroup, we see this crash. Specifically, it seems to
       occur with cgroups that are throttled and we change the allowed
       cpuset."

    The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
    it will walk down to root the 1st time a cfs_rq is used and we will finish
    to add either a cfs_rq without parent or a cfs_rq with a parent that is
    already on the list. But this is not always true in presence of throttling.
    Because a cfs_rq can be throttled even if it has never been used but other CPUs
    of the cgroup have already used all the bandwidth, we are not sure to go down to
    the root and add all cfs_rq in the list.

    Ensure that all cfs_rq will be added in the list even if they are throttled.

    [ mingo: Fix !CGROUPS build. ]

    Reported-by: Sargun Dhillon <sargun@sargun.me>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: tj@kernel.org
    Fixes: 9c2791f936 ("Fix hierarchical order in rq->leaf_cfs_rq_list")
    Link: https://lkml.kernel.org/r/1548825767-10799-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Janne Huttunen <janne.huttunen@nokia.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 43bf835d9112937ae2b7d07b2b0a64b53598623d
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jan 30 14:41:04 2019 +0100

    sched/fair: Add tmp_alone_branch assertion

    commit 5d299eabea5a251fbf66e8277704b874bbba92dc upstream.

    The magic in list_add_leaf_cfs_rq() requires that at the end of
    enqueue_task_fair():

      rq->tmp_alone_branch == &rq->leaf_cfs_rq_list

    If this is violated, list integrity is compromised for list entries
    and the tmp_alone_branch pointer might dangle.

    Also, reflow list_add_leaf_cfs_rq() while there. This loses one
    indentation level and generates a form that's convenient for the next
    patch.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Janne Huttunen <janne.huttunen@nokia.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 4a1c8a809ec62e802d9c2b174bcddd90750e6805
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Jun 12 17:47:03 2020 +0200

    sched/pelt: Cleanup PELT divider

    Factorize in a single place the calculation of the divider to be used
    to compute *_avg from *_sum values.

    Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200612154703.23555-1-vincent.guittot@linaro.org
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit ed092eafa66750326f15caa1a8e7c2f5f62f5951
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Jun 3 10:03:01 2020 +0200

    sched/pelt: Remove redundant cap_scale() definition

    Besides PELT, cap_scale() is used in the Deadline scheduler class for
    scale-invariant bandwidth enforcement.
    Remove the cap_scale() definition in kernel/sched/pelt.c and keep the
    one in kernel/sched/sched.h.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20200603080304.16548-2-dietmar.eggemann@arm.com
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a48e6b38b3b57cb00853bd405524f2158d454ba9
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Apr 8 19:53:08 2022 +0800

    UPSTREAM: sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq

    Since commit 23127296889f ("sched/fair: Update scale invariance of PELT")
    change to use rq_clock_pelt() instead of rq_clock_task(), we should also
    use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task
    accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And
    rename throttled_clock_task(_time) to be clock_pelt rather than clock_task.

    Bug: 254441685
    Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT")
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
    (cherry picked from commit 64eaf50731ac0a8c76ce2fedd50ef6652aabc5ff)
    Signed-off-by: Lee Jones <joneslee@google.com>
    Change-Id: I61e971d09f14708b8ee170fd5d5109144bba6e34
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 98290e08beea376902e4e3d2eaeb3851b0192f74
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Jan 23 16:26:53 2019 +0100

    UPSTREAM: sched/fair: Update scale invariance of PELT

    The current implementation of load tracking invariance scales the
    contribution with current frequency and uarch performance (only for
    utilization) of the CPU. One main result of this formula is that the
    figures are capped by current capacity of CPU. Another one is that the
    load_avg is not invariant because not scaled with uarch.

    The util_avg of a periodic task that runs r time slots every p time slots
    varies in the range :

        U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

    with U is the max util_avg value = SCHED_CAPACITY_SCALE

    At a lower capacity, the range becomes:

        U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

    with C reflecting the compute capacity ratio between current capacity and
    max capacity.

    so C tries to compensate changes in (1-y^r') but it can't be accurate.

    Instead of scaling the contribution value of PELT algo, we should scale the
    running time. The PELT signal aims to track the amount of computation of
    tasks and/or rq so it seems more correct to scale the running time to
    reflect the effective amount of computation done since the last update.

    In order to be fully invariant, we need to apply the same amount of
    running time and idle time whatever the current capacity. Because running
    at lower capacity implies that the task will run longer, we have to ensure
    that the same amount of idle time will be applied when system becomes idle
    and no idle time has been "stolen". But reaching the maximum utilization
    value (SCHED_CAPACITY_SCALE) means that the task is seen as an
    always-running task whatever the capacity of the CPU (even at max compute
    capacity). In this case, we can discard this "stolen" idle times which
    becomes meaningless.

    In order to achieve this time scaling, a new clock_pelt is created per rq.
    The increase of this clock scales with current capacity when something
    is running on rq and synchronizes with clock_task when rq is idle. With
    this mechanism, we ensure the same running and idle time whatever the
    current capacity. This also enables to simplify the pelt algorithm by
    removing all references of uarch and frequency and applying the same
    contribution to utilization and loads. Furthermore, the scaling is done
    only once per update of clock (update_rq_clock_task()) instead of during
    each update of sched_entities and cfs/rt/dl_rq of the rq like the current
    implementation. This is interesting when cgroup are involved as shown in
    the results below:

    On a hikey (octo Arm64 platform).
    Performance cpufreq governor and only shallowest c-state to remove variance
    generated by those power features so we only track the impact of pelt algo.

    each test runs 16 times:

    	./perf bench sched pipe
    	(higher is better)
    	kernel	tip/sched/core     + patch
    	        ops/seconds        ops/seconds         diff
    	cgroup
    	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
    	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
    	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

    	hackbench -l 1000
    	(lower is better)
    	kernel	tip/sched/core     + patch
    	        duration(sec)      duration(sec)        diff
    	cgroup
    	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
    	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
    	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

    The responsiveness of PELT is also improved when the CPU is not running
    at max capacity with this new algorithm. I have put below some examples
    of the duration needed to reach some typical load values according to the
    capacity of the CPU, with the current implementation and with this patch.
    These values have been computed based on the geometric series and the
    half-period value:

      Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
      972 (95%)    138ms         not reachable            276ms
      486 (47.5%)  30ms          138ms                     60ms
      256 (25%)    13ms           32ms                     26ms

    On my hikey (octo Arm64 platform) with schedutil governor, the time to
    reach max OPP when starting from a null utilization, decreases from 223ms
    with current scale invariance down to 121ms with the new algorithm.

    Bug: 120440300
    Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pjt@google.com
    Cc: pkondeti@codeaurora.org
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 23127296889fe84b0762b191b5d041e8ba6f2599)
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0a7060dfca03fe7a078974f343372ad623e7a25d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Jan 23 16:26:52 2019 +0100

    UPSTREAM: sched/fair: Move the rq_of() helper function

    Move rq_of() helper function so it can be used in pelt.c

    [ mingo: Improve readability while at it. ]

    Bug: 120440300
    Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pjt@google.com
    Cc: pkondeti@codeaurora.org
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 62478d9911fab9694c195f0ca8e4701de09be98e)
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 4ed482487f9d5ad5d11417da5376c596ffc2956c
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Aug 3 15:05:38 2018 +0100

    UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update

    A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
    se->runnable_weight must always be in sync with its se->load.weight.

    se->runnable_weight is set to se->load.weight when the task is
    forked (init_entity_runnable_average()) or reniced (reweight_entity()).

    There are two cases in set_load_weight() which since they currently only
    set se->load.weight could lead to a situation in which se->load.weight
    is different to se->runnable_weight for a CFS task:

    (1) A task switches to SCHED_IDLE.

    (2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
        (during which only its static priority gets set) switches to
        SCHED_OTHER or SCHED_BATCH.

    Set se->runnable_weight to se->load.weight in these two cases to prevent
    this. This eliminates the need to explicitly set it to se->load.weight
    during PELT updates in the CFS scheduler fastpath.

    Bug: 120440300
    Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Patrick Bellasi <patrick.bellasi@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Quentin Perret <quentin.perret@arm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 4a465e3ebbc8004ce4f7f08f6022ee8315a94edf)
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a97016777175a6305b14bbae29f4ce0a3a3930cd
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Dec 14 23:10:06 2018 +0100

    sched/pelt: Fix warning and clean up IRQ PELT config

    Commit 11d4afd4ff667f9b6178ee8c142c36cb78bd84db upstream.

    Create a config for enabling irq load tracking in the scheduler.
    irq load tracking is useful only when irq or paravirtual time is
    accounted but it's only possible with SMP for now.

    Also use __maybe_unused to remove the compilation warning in
    update_rq_clock_task() that has been introduced by:

      2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")

    Suggested-by: Ingo Molnar <mingo@redhat.com>
    Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
    Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: bp@alien8.de
    Cc: dou_liyang@163.com
    Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
    Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 8c3f9eb95d52cb68c7f62e2cfcf99a662008e4f7
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jul 19 14:00:06 2018 +0200

    sched/fair: Remove #ifdefs from scale_rt_capacity()

    Reuse cpu_util_irq() that has been defined for schedutil and set irq util
    to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.

    But the compiler is not able to optimize the sequence (at least with
    aarch64 GCC 7.2.1):

    	free *= (max - irq);
    	free /= max;

    when irq is fixed to 0

    Add a new inline function scale_irq_capacity() that will scale utilization
    when irq is accounted. Reuse this function in schedutil, which applies a
    similar formula.

    Suggested-by: Ingo Molnar <mingo@redhat.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: rjw@rjwysocki.net
    Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit bf59401f1068438e811e2bb93f5b2ac3c88f1b52
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Aug 31 17:22:55 2018 +0200

    sched/pelt: Fix update_blocked_averages() for RT and DL classes

    update_blocked_averages() is called to periodically decay the stale load
    of idle CPUs and to sync all loads before running load balance.

    When the cfs rq is idle, it triggers a load balance during
    pick_next_task_fair() in order to potentially pull tasks and use this
    newly idle CPU. This load balance happens while the prev task from
    another class has not been put yet and its utilization has not been
    updated, which may lead to wrongly accounting running time as idle time
    for the RT or DL classes.

    Test that no RT or DL task is running when updating their utilization in
    update_blocked_averages().

    We still update RT and DL utilization instead of simply skipping them to
    make sure that all metrics are synced when used during load balance.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
    Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
    Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit cc819e146c895dd29b3fe93fbd558ccd0a190ed6
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jun 28 17:45:12 2018 +0200

    sched/core: Use PELT for scale_rt_capacity()

    The utilization of the CPU by RT, DL and IRQs are now tracked with
    PELT so we can use these metrics instead of rt_avg to evaluate the remaining
    capacity available for CFS class.

    scale_rt_capacity() behavior has been changed and now returns the
    remaining capacity available for CFS instead of a scaling factor, because
    RT, DL and IRQ now provide an absolute utilization value.

    The same formula as schedutil is used:

      IRQ util_avg + (1 - IRQ util_avg / max capacity) * /Sum rq util_avg

    but the implementation is different because it doesn't return the same
    value and doesn't benefit from the same optimization.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 0f6598d6f0def162745d059a67d1148cf42cfaa3
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jun 28 17:45:09 2018 +0200

    sched/irq: Add IRQ utilization tracking

    Interrupt and steal time are the only remaining activities tracked by
    rt_avg. As for the sched classes, we can use PELT to track their average
    utilization of the CPU. But unlike a sched class, we don't track when
    entering/leaving interrupt; instead, we take into account the time spent
    under interrupt context when we update the rqs' clock (rq_clock_task).
    This also means that we have to decay the normal context time and account
    for interrupt time during the update.

    It is also important to note that, because:

      rq_clock == rq_clock_task + interrupt time

    and rq_clock_task is used by a sched class to compute its utilization,
    the util_avg of a sched class only reflects the utilization of the time
    spent in normal context and not of the whole time of the CPU. The
    utilization of interrupt gives a more accurate level of utilization of
    the CPU.

    The CPU utilization is:

      avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

    Most of the time, avg_irq is small and negligible, so the use of the
    approximation CPU utilization = /Sum avg_rq was enough.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c616a7241810a508cb961a6e1658748113161221
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jun 28 17:45:07 2018 +0200

    sched/dl: Add dl_rq utilization tracking

    Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
    tasks and the CFS utilization might no longer describe the real
    utilization level.

    The current DL bandwidth reflects the requirements to meet the deadline
    when tasks are enqueued, but not the current utilization of the DL sched
    class. We track the DL class utilization to estimate the system
    utilization.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit abc5cabb9b98b262baad2250bbafbe54d8f2fc3e
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jun 28 17:45:05 2018 +0200

    sched/rt: Add rt_rq utilization tracking

    The schedutil governor relies on cfs_rq's util_avg to choose the OPP when
    CFS tasks are running. When the CPU is overloaded by CFS and RT tasks,
    CFS tasks are preempted by RT tasks and in this case util_avg reflects
    the remaining capacity but not what CFS wants to use. In such a case,
    schedutil can select a lower OPP whereas the CPU is overloaded. In order
    to have a more accurate view of the utilization of the CPU, we track the
    utilization of RT tasks. Only util_avg is correctly tracked, not load_avg
    and runnable_load_avg, which are useless for rt_rq.

    rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
    the same at the root group level, so the PELT windows of the util_sum are
    aligned.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 7ce46e2c80931a4da93dbe7674998fee851c8f1f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Jun 28 17:45:04 2018 +0200

    sched/pelt: Move PELT related code in a dedicated file

    We want to track rt_rq's utilization as a part of the estimation of the
    whole rq's utilization. This is necessary because rt tasks can steal
    utilization from cfs tasks and make them look lighter than they are.
    As we want to use the same load tracking mechanism for both and prevent
    useless dependencies between cfs and rt code, the PELT code is moved
    into a dedicated file.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit a7703d111f4372d3eba166af9ebd9092ff3dc66b
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Nov 16 15:21:52 2017 +0100

    sched/fair: Update and fix the runnable propagation rule

    Unlike running, the runnable part can't be directly propagated through
    the hierarchy when we migrate a task. The main reason is that runnable
    time can be shared with other sched_entities that stay on the rq and
    this runnable time will also remain on prev cfs_rq and must not be
    removed.

    Instead, we can estimate what the new runnable of the prev cfs_rq should
    be and check that this estimation stays in a plausible range. The
    prop_runnable_sum is a good estimation when adding runnable_sum, but it
    fails most often when we remove it. Instead, we can use the formula
    below:

      gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

    which assumes that tasks are equally runnable, which is not true but is
    easy to compute.

    Beside these estimates, we have several simple rules that help us to filter
    out wrong ones:

     - ge->avg.runnable_sum <= than LOAD_AVG_MAX
     - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
     - ge->avg.runnable_sum can't increase when we detach a task

    The effect of these fixes is better cgroups balancing.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Chris Mason <clm@fb.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Yuyang Du <yuyang.du@intel.com>
    Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 480aacf775e2ec594e0fed31b15678ef967f29fb
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Aug 24 13:06:35 2017 +0200

    sched/fair: Update calc_group_*() comments

    I had a wee bit of trouble recalling how the calc_group_runnable()
    stuff worked; add hopefully better comments.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 5b49acfeddbd52712395f011c4ab6e4695863738
Author: Josef Bacik <jbacik@fb.com>
Date:   Thu Aug 3 11:13:39 2017 -0400

    sched/fair: Calculate runnable_weight slightly differently

    Our runnable_weight currently looks like this:

    runnable_weight = shares * runnable_load_avg / load_avg

    The goal is to scale the runnable weight for the group based on its
    runnable-to-load_avg ratio. The problem with this is that it biases us
    towards tasks that never go to sleep. Tasks that go to sleep are going
    to have their runnable_load_avg decayed pretty hard, which will
    drastically reduce the runnable weight of groups with interactive tasks.
    To solve this imbalance we tweak this slightly, so in the ideal case it
    is still the above, but in the interactive case it is
    runnable_weight = shares * runnable_weight / load_weight

    which will make the weight distribution fairer between interactive and
    non-interactive groups.

    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: kernel-team@fb.com
    Cc: linux-kernel@vger.kernel.org
    Cc: riel@redhat.com
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 7be10d3b02e103159be137da4bc8fd9ebfda2b81
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri May 12 14:18:10 2017 +0200

    sched/fair: Implement more accurate async detach

    The problem with the overestimate is that it will subtract too big a
    value from the load_sum, thereby pushing it down further than it ought
    to go. Since runnable_load_avg is not subject to a similar 'force',
    this results in the occasional 'runnable_load > load' situation.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit eebab46a1ff61d392a032456a0274185ec09c78d
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri May 12 14:16:30 2017 +0200

    sched/fair: Align PELT windows between cfs_rq and its se

    The PELT _sum values are a saw-tooth function, dropping on the decay
    edge and then growing back up again during the window.

    When these window-edges are not aligned between cfs_rq and se, we can
    have the situation where, for example, on dequeue, the se decays
    first.

    Its _sum values will be small(er), while the cfs_rq _sum values will
    still be on their way up. Because of this, the subtraction:
    cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
    will then, once the cfs_rq reaches an edge, translate into its _avg
    value jumping up.

    This is especially visible with the runnable_load bits, since they get
    added/subtracted a lot.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit dd849181691e6b895696b331c9a750139a1bffd4
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu May 11 17:57:24 2017 +0200

    sched/fair: Implement synchronous PELT detach on load-balance migrate

    Vincent wondered why his self-migrating task had a roughly 50% dip in
    load_avg when landing on the new CPU. This is because we unconditionally
    take the asynchronous detach_entity route, which can lead to the attach
    on the new CPU still seeing the old CPU's contribution to tg->load_avg,
    effectively halving the new CPU's shares.

    While in general this is something we have to live with, there is the
    special case of runnable migration where we can do better.

    Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit fa4e5dee6619a0a3c5368e25147263b8576318e4
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 15:59:54 2017 +0200

    sched/fair: Propagate an effective runnable_load_avg

    The load balancer uses runnable_load_avg as load indicator. For
    !cgroup this is:

      runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

    That is, a direct sum of all runnable tasks on that runqueue. As
    opposed to load_avg, which is a sum of all tasks on the runqueue,
    which includes a blocked component.

    However, in the cgroup case, this comes apart since the group entities
    are always runnable, even if most of their constituent entities are
    blocked.

    Therefore introduce a runnable_weight which for task entities is the
    same as the regular weight, but for group entities is a fraction of
    the entity weight and represents the runnable part of the group
    runqueue.

    Then propagate this load through the PELT hierarchy to arrive at an
    effective runnable load average -- which we should not confuse with
    the canonical runnable load average.

    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c83be534eabf6400d82a999eb1072a5e7ecf2eb2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon May 8 17:30:46 2017 +0200

    sched/fair: Rewrite PELT migration propagation

    When an entity migrates in (or out) of a runqueue, we need to add (or
    remove) its contribution from the entire PELT hierarchy, because even
    non-runnable entities are included in the load average sums.

    In order to do this we have some propagation logic that updates the
    PELT tree, however the way it 'propagates' the runnable (or load)
    change is (more or less):

                         tg->weight * grq->avg.load_avg
      ge->avg.load_avg = ------------------------------
                                   tg->load_avg

    But that is the expression for ge->weight, and per the definition of
    load_avg:

      ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

    That destroys the runnable_avg (by setting it to 1) we wanted to
    propagate.

    Instead directly propagate runnable_sum.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 067555d1c451e393e04adf8056c17505de978ac1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon May 8 16:51:41 2017 +0200

    sched/fair: Rewrite cfs_rq->removed_*avg

    Since on wakeup migration we don't hold the rq->lock for the old CPU
    we cannot update its state. Instead we add the removed 'load' to an
    atomic variable and have the next update on that CPU collect and
    process it.

    Currently we have two atomic variables, which already have the issue
    that they can be read out of sync. Also, two atomic ops on a single
    cacheline are already more expensive than an uncontended lock.

    Since we want to add more, convert the thing over to an explicit
    cacheline with a lock in it.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 938acabba842513db34a25ace967f7771998ee29
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed May 17 11:50:45 2017 +0200

    sched/fair: Use reweight_entity() for set_user_nice()

    Now that we directly change load_avg and propagate that change into the
    sums, sys_nice() and co should do the same; otherwise it's possible to
    confuse load accounting when we migrate near the weight change.

    Fixes-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    [ Added changelog, fixed the call condition. ]
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit c1d25beea75097359c3d0300330e2bba0106cff0
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 16:11:34 2017 +0200

    sched/fair: More accurate reweight_entity()

    When a (group) entity changes its weight we should instantly change
    its load_avg and propagate that change into the sums it is part of,
    because we use these values to predict future behaviour and are not
    interested in their historical value.

    Without this change, the change in load would need to propagate
    through the average, by which time it could again have changed, etc.,
    always chasing itself.

    With this change, the cfs_rq load_avg sum will more accurately reflect
    the current runnable and expected return of blocked load.
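
    A minimal sketch of instant reweighting, with simplified stand-in
    names (not the kernel's actual helpers): scale the entity's load_avg
    by the weight ratio and move the delta into the cfs_rq sum at once,
    rather than letting the old contribution decay out over time.

```c
#include <stdio.h>

/* Sketch: on a weight change, detach the old average from the parent
 * sum, rescale it by new_w/old_w, and attach the result immediately. */
static void reweight_sketch(unsigned long *se_load_avg,
                            unsigned long *cfs_load_avg,
                            unsigned long old_w, unsigned long new_w)
{
    unsigned long new_avg = *se_load_avg * new_w / old_w;

    *cfs_load_avg -= *se_load_avg;  /* detach old contribution */
    *cfs_load_avg += new_avg;       /* attach new one instantly */
    *se_load_avg = new_avg;
}
```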

    Reported-by: Paul Turner <pjt@google.com>
    [josef: compile fix !SMP || !FAIR_GROUP]
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 3092e27836e4e71089267a27e5a9ab4100627618
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Aug 24 17:45:35 2017 +0200

    sched/fair: Introduce {en,de}queue_load_avg()

    Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
    for {en,de}queue_load_avg(). More users will follow.

    Includes some code movement to avoid fwd declarations.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 2d9eb6d1ce196555a3571506a0566791f53da0d4
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Aug 24 17:38:30 2017 +0200

    sched/fair: Rename {en,de}queue_entity_load_avg()

    Since they're now purely about runnable_load, rename them.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit beda9e28dbb0eb34d787d4cbac2bc8399fa6f1a6
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 17:37:03 2017 +0200

    sched/fair: Move enqueue migrate handling

    Move the entity migrate handling from enqueue_entity_load_avg() to
    update_load_avg(). This has two benefits:

     - {en,de}queue_entity_load_avg() will become purely about managing
       runnable_load

     - we can avoid a double update_tg_load_avg() and reduce pressure on
       the global tg->shares cacheline

    The reason we do this is so that we can change update_cfs_shares() to
    change both weight and (future) runnable_weight. For this to work we
    need to have the cfs_rq averages up-to-date (which means having done
    the attach), but we need the cfs_rq->avg.runnable_avg to not yet
    include the se's contribution (since se->on_rq == 0).

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 94f87f43b19ceeecc31c9e7169533ad64efeb085
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 17:32:43 2017 +0200

    sched/fair: Change update_load_avg() arguments

    Most call sites of update_load_avg() already have cfs_rq_of(se)
    available, pass it down instead of recomputing it.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit f1bfd1f0c849264e58ce1415ddea244440a92797
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 16:42:08 2017 +0200

    sched/fair: Remove se->load.weight from se->avg.load_sum

    Remove the load from the load_sum for sched_entities, basically
    turning load_sum into runnable_sum.  This prepares for better
    reweighting of group entities.

    Since we now have different rules for computing load_avg, split
    ___update_load_avg() into two parts, ___update_load_sum() and
    ___update_load_avg().

    So for se:

      ___update_load_sum(.weight = 1)
      ___update_load_avg(.weight = se->load.weight)

    and for cfs_rq:

      ___update_load_sum(.weight = cfs_rq->load.weight)
      ___update_load_avg(.weight = 1)

    Since the primary consumable is load_avg, most things will not be
    affected. Only those few sites that initialize/modify load_sum need
    attention.
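
    A simplified sketch of the split described above (names and the PELT
    divider are stand-ins, not the kernel's exact arithmetic): the sum is
    tracked in weight-free units and the weight is applied only when
    producing the average, so the two call orders above yield the same
    load_avg.

```c
#include <stdio.h>

#define LOAD_AVG_MAX 47742UL  /* stand-in for the PELT maximum sum */

/* Accumulate into the sum; a sched_entity passes weight == 1 here. */
static unsigned long update_load_sum(unsigned long sum,
                                     unsigned long weight,
                                     unsigned long contrib)
{
    return sum + weight * contrib;
}

/* Turn the sum into an average; a sched_entity applies its weight here. */
static unsigned long update_load_avg(unsigned long sum,
                                     unsigned long weight)
{
    return weight * sum / LOAD_AVG_MAX;
}
```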

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 9600274a37f34758ce3dfda9f5257b7737f298bb
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu May 11 18:16:06 2017 +0200

    sched/fair: Cure calc_cfs_shares() vs. reweight_entity()

    Vincent reported that when running in a cgroup, his root
    cfs_rq->avg.load_avg dropped to 0 on task idle.

    This is because reweight_entity() will now immediately propagate the
    weight change of the group entity to its cfs_rq, and as it happens,
    our approximation (5) for calc_cfs_shares() results in 0 when the group
    is idle.

    Avoid this by using the correct (3) as a lower bound on (5). This way
    the empty cgroup will slowly decay instead of instantly drop to 0.

    Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 664ab7a3e73cbb37c84125925f2cbbc91ed7fdca
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue May 9 11:04:07 2017 +0200

    sched/fair: Add comment to calc_cfs_shares()

    Explain the magic equation in calc_cfs_shares() a bit better.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit 8cdb20cedab8458aafc8d4766ecb08a28d6ad076
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat May 6 16:03:17 2017 +0200

    sched/fair: Clean up calc_cfs_shares()

    For consistency's sake, we should have only a single reading of tg->shares.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

commit e4f0f21e76d364601055d5d1f14eae828fd4383a
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:36 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Add CPU's clamp buckets refcounting"

    This reverts commit 116cf381f4.

commit 8a408ac031b6990f1a2d538680ffac88e42e477c
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:36 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Add bucket local max tracking"

    This reverts commit a684fdc2de.

commit 6f2949c1073b97efb28f05d086fdc9e473d421c4
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:35 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Enforce last task's UCLAMP_MAX"

    This reverts commit 41ce223686.

commit 86d7baea30401efcc1d609c619c1526a952e6419
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:35 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Add system default clamps"

    This reverts commit 5bee2de619.

commit 9d758d18c03983c60e0aba8bfd98ca0084e7d7c4
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:33 2024 +0300

    Revert "UPSTREAM: sched/core: Allow sched_setattr() to use the current policy"

    This reverts commit 38b1e6723b.

commit 164913a4e0ab61a201d7802eb0de81099453971a
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 12:00:20 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Extend sched_setattr() to support utilization clamping"

    This reverts commit 486d866797.

commit 98262c0640ec825cc93c5faf076c7d24ae8ae6a7
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:57 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Reset uclamp values on RESET_ON_FORK"

    This reverts commit 13caaea80c.

commit 17a9d13f5c3bc718f83c71b7b18b40a6525f22c0
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:57 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Set default clamps for RT tasks"

    This reverts commit 1e06d5830a.

commit 65d7db99c966d5c853da01eb8897be026f5674ed
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:57 2024 +0300

    Revert "BACKPORT: sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks"

    This reverts commit 1f42ab6cc1.

commit 318ea0855d8084332eb060ca8948bb970cf08129
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:57 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Add uclamp_util_with()"

    This reverts commit e6a7b88cc2.

commit f1ec7c15ea378c0fa4089a36d753e99054265c10
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:57 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Extend CPU's cgroup controller"

    This reverts commit 6a41399960.

commit 8814a297260c377a08f12350ab47f302d333043b
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:56 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Propagate parent clamps"

    This reverts commit 4963c9685e.

commit fe9d3b66e407939f83c8b9ee26b9318ec4316609
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:56 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Propagate system defaults to the root group"

    This reverts commit a3bbe48401.

commit f6256062570adc1d7e261ffff855f0fa1128d8b1
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:56 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Use TG's clamps to restrict TASK's clamps"

    This reverts commit dbb3b287e2.

commit 21d5ea4fd377bedd6ad5b49a829fa7b3867d9bd6
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:56 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Update CPU's refcount on TG's clamp changes"

    This reverts commit b5c75307c4.

commit 2d9b5a47c47a33c965f744ce34c8924524558bae
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:56 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Always use 'enum uclamp_id' for clamp_id values"

    This reverts commit a479e1de9b.

commit 962708c53ab41d17d9233543581553226f6da382
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:51 2024 +0300

    Revert "UPSTREAM: sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code"

    This reverts commit fe18af6732.

commit f1bb535dbf77aec377d06a795a52c3e1c3489ab9
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "UPSTREAM: sched/core: Fix compilation error when cgroup not selected"

    This reverts commit 4f89a28e12.

commit 7e33f0d51debf2dd24a33ea7fbb8ac19edb42b1e
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "sched/uclamp: Reject negative values in cpu_uclamp_write()"

    This reverts commit a12fbe805e.

commit 546d51a84878a1046177b739ccdfbfe4748818b4
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "sched/core: Fix size of rq::uclamp initialization"

    This reverts commit c7be60603f.

commit f8456cecf7592fd67d02f5250579d5332a3e168d
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "sched/core: Fix reset-on-fork from RT with uclamp"

    This reverts commit fbeca2f356.

commit 7f497bb9e5d2be6a0182e30b1b9cd6feef3b70ab
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "BACKPORT: sched/uclamp: Remove uclamp_util()"

    This reverts commit b6f58cb73d.

commit 1573bacd6d6455ddcf9aa62e6ed71080ee8fe73d
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:19 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Make uclamp util helpers use and return UL values"

    This reverts commit 1fbeb27caf.

commit 33c125b0a8d140d9f6333d1334784b6eaae89009
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:18 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Fix initialization of struct uclamp_rq"

    This reverts commit 294a706618.

commit fd06089869bde9880e4953e1f43f288db2fc5b14
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:18 2024 +0300

    Revert "BACKPORT: sched/uclamp: Protect uclamp fast path code with static key"

    This reverts commit b2f4e5d8c3.

commit 592e4a8ed2c0a59fac12c39b2a023c3517c7c801
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:18 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Fix a deadlock when enabling uclamp static key"

    This reverts commit 3eb05145c9.

commit 60dc9ca3b6caf6db7cb3b209527111812b78587b
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:59:13 2024 +0300

    Revert "BACKPORT: sched/uclamp: Add a new sysctl to control RT default boost value"

    This reverts commit 1abcbcbea7.

commit 72296d4745d8b53ccedd64edfaa455ca6c8fae9c
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:42 2024 +0300

    Revert "ANDROID: sched/core: Add a latency-sensitive flag to uclamp"

    This reverts commit b5fa516b3a.

commit 9b478422f4676d419918afb41a71015a02f79a84
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:42 2024 +0300

    Revert "ANDROID: sched: Introduce uclamp latency and boost wrapper"

    This reverts commit fe73bc33df.

commit 2de1a0534caaa6fa2255b9eda5d64fb2b9daa209
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:42 2024 +0300

    Revert "sched/fair: Modify boosted_task_util() to reflect uclamp changes"

    This reverts commit 8c84f3cd7f.

commit 523dab791c4ebd52b8b6473d52967edeb7bdbfd8
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:36 2024 +0300

    Revert "Revert "sched/fair: Revert Google's capacity margin hacks""

    This reverts commit b623172b47.

commit 43bd6d52219dcca4cdbd64129e4c4b6919549d9a
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:36 2024 +0300

    Revert "Revert "sched: Stub prefer_high_cap""

    This reverts commit 8e1140bfeb.

commit 64662ef0b3e7c6b11d7516ebc3f83483a1fa71a9
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:58:30 2024 +0300

    Revert "Revert "Revert "sched: separate boost signal from placement hint"""

    This reverts commit 79111a20d9.

commit 1bac20806128df71140bd14c52b83dc1117ade38
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:57:50 2024 +0300

    Revert "Revert "Revert "sched/fair: check if mid capacity cpu exists"""

    This reverts commit 7508259cb6.

commit 42e0ef5e1b4f56d6337c1a6f3504d275cbc9a043
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:58 2024 +0300

    Revert "Revert "Revert "sched: separate capacity margin for boosted tasks"""

    This reverts commit 435cf700fa.

commit eab623c38f37d7cc0973478f55256173ce3d213d
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:57 2024 +0300

    Revert "Revert "Revert "sched: change capacity margin down to 20% for non-boost task"""

    This reverts commit 0567c2dfa0.

commit 81f6c4feb22ee8da71f7adde5513ce1ae8a48346
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:57 2024 +0300

    Revert "Revert "Revert "sched/fair: use actual cpu capacity to calculate boosted util"""

    This reverts commit 9468b2ce80.

commit c5bff7085b144149dc1ed774a3312c19a1669103
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:57 2024 +0300

    Revert "Revert "Revert "sched/fair: do not use boosted margin for prefer_high_cap case"""

    This reverts commit 25f6b813b4.

commit d5cc737eb99b7aeced3b5d1546799ce1cffe0129
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:57 2024 +0300

    Revert "Revert "Revert "Revert "sched/fair: use actual cpu capacity to calculate boosted util""""

    This reverts commit 607d0c1f5a.

commit 944f85f208f1a520eb9759571d49f75db84b3ffc
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:50 2024 +0300

    Revert "sched/fair: Make boosted and prefer_idle tunables uclamp aware"

    This reverts commit 8d58395724.

commit ff7fc5b1110dd9b31817b7186467938974ed8017
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:43 2024 +0300

    Revert "sched/uclamp: Fix incorrect uclamp.latency_sensitive setting"

    This reverts commit 282a8ac6f5.

commit 4e5d3a1c362e4c3616126dfc3422d6207eeae977
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:43 2024 +0300

    Revert "sched/uclamp: Make uclamp_boosted() return proper boosted value"

    This reverts commit e1ffe11ccc.

commit b6ae08ea8bbe41f3c4b90b303e2c39a4c7c5f7b3
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:56:10 2024 +0300

    Revert "sched/uclamp: Allow to reset a task uclamp constraint value"

    This reverts commit b6b69eece5.

commit 658ddb506825027979ee263aff8d55e0cff3dd98
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:14 2024 +0300

    Revert "UPSTREAM: sched/uclamp: Fix incorrect condition"

    This reverts commit ce1a62791e.

commit 1c5a70035dafad8750eafa8c5ee40b11115be18e
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:14 2024 +0300

    Revert "FROMGIT: sched/uclamp: Fix a bug in propagating uclamp value in new cgroups"

    This reverts commit 16d73cafea.

commit ed892012da806647a464df46203dad13ca07e4d8
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:14 2024 +0300

    Revert "sched/uclamp: Fix locking around cpu_util_update_eff()"

    This reverts commit e954be1dbf.

commit 505f58a688443900fdaeb4cca4e3ca66b0cf867c
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "sched/uclamp: Remove unnecessary mutex_init()"

    This reverts commit 31714e404d.

commit 8f96c904caa429cb4e413bae75e746a38b9f1214
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "FROMLIST: sched: Fix out-of-bound access in uclamp"

    This reverts commit 973affb92e.

commit 29df4134d5a23f686000ebfd31ee3edaf2f6bb44
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "sched/uclamp: Fix wrong implementation of cpu.uclamp.min"

    This reverts commit 07c3a9af2f.

commit 4efefc8fcc48eaff905fc548099b5c5f181e8c8c
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "sched/uclamp: Fix uclamp_tg_restrict()"

    This reverts commit 7656386687.

commit 7d5d5fe448300a61d6718919d3c38277e32285db
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "sched/uclamp: Ignore max aggregation if rq is idle"

    This reverts commit 08a6710f89.

commit 488c9711417e6fc0d07401e928e9523bd241915c
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "ANDROID: sched: Make uclamp changes depend on CAP_SYS_NICE"

    This reverts commit 957d737358.

commit 1ac8237c8049014ea11f5505c95fb0c91d5608d1
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:55:13 2024 +0300

    Revert "sched: Fix UCLAMP_FLAG_IDLE setting"

    This reverts commit e9a6a0f106.

commit 0ef1e5ecb196bd556499ba0a7021a53ed87cadc4
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:54:15 2024 +0300

    Revert "sched/uclamp: Fix rq->uclamp_max not set on first enqueue"

    This reverts commit 977a80d795.

commit b69fc8ea5fbcc014e4a8fdafefb468dc348baa38
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:53:53 2024 +0300

    Revert "sched/fair: Reduce busy load balance interval"

    This reverts commit c54355abd4.

commit 45daeebd161dab7ff6550030d0a8bcb302b2c271
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:53:40 2024 +0300

    Revert "sched/fair: Reduce minimal imbalance threshold"

    This reverts commit 47a4df59c1.

commit 8557606f8d15209e2f68ff49485e62fb97a441c3
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:53:32 2024 +0300

    Revert "sched/deadline: Fix stale throttling on de-/boosted tasks"

    This reverts commit 1f2196de96.

commit 705944ee989af6d24932895c541abca24caa1b21
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:53:22 2024 +0300

    Revert "sched: fair: consider all running tasks in cpu for load balance"

    This reverts commit c1e04a4d13.

commit ecffc3948595a8cb3c246135082ee17ab7d414ca
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:53:03 2024 +0300

    Revert "sched/fair: Revert Google's capacity margin hacks"

    This reverts commit e7c82d8c4d.

commit c064aca8d86c768eb1ebf8536c25e49718cb55b9
Author: kondors1995 <normandija1945@gmail.com>
Date:   Sat Aug 17 11:48:56 2024 +0300

    Revert "BACKPORT: sched/fair: Make task_fits_capacity() consider uclamp restrictions"

    This reverts commit a59a899e6a.

# Conflicts:
#	include/linux/sched/sysctl.h
#	kernel/sched/fair.c
#	kernel/sched/sched.h
#	kernel/sched/tune.c
2024-08-25 15:00:04 +03:00
Jinliang Zheng
3409634195 mm: optimize the redundant loop of mm_update_owner_next()
commit cf3f9a593dab87a032d2b6a6fb205e7f3de4f0a1 upstream.

When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.
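
A user-space model of the early exit (not the kernel's actual types;
names are simplified): before walking a potentially long task list to
find a new owner for the mm, bail out when the mm has no other users,
so the race described above cannot turn into an unbounded scan under
tasklist_lock.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures. */
struct mm {
    int mm_users;   /* models atomic_t mm_users */
    void *owner;
};

struct task {
    struct mm *mm;
    struct task *next;
};

/* Return the first other task using @mm, or NULL. The early exit is
 * the check the patch adds: with at most one user left, no candidate
 * can exist, so don't scan the list at all. */
static struct task *find_new_owner(struct mm *mm, struct task *head)
{
    if (mm->mm_users <= 1)          /* recognize the race, exit early */
        return NULL;
    for (struct task *t = head; t; t = t->next)
        if (t->mm == mm)
            return t;
    return NULL;
}
```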

Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2890a7faf552dd3e4e40e343610ba3e0ba5b788e)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-08-23 16:50:05 +00:00
Connor O'Brien
ef859d5eea sched/cpufreq: Stop ignoring util updates
Eliminate the check for SCHED_CPUFREQ_WALT flag in
cpufreq_update_util(), update calling code to stop using that flag,
and replace its definition with a placeholder.

Test: Trace shows more frequent util updates
Bug: 110604715
Change-Id: I3b74e950b984194f08ecbbf91872467a200c9d1d
Signed-off-by: Connor O'Brien <connoro@google.com>

 Conflicts:
	kernel/sched/core.c
	kernel/sched/sched.h
	kernel/sched/walt.c

Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>

# Conflicts:
#	kernel/sched/core.c
2024-08-23 18:57:59 +03:00
kondors1995
075ec6c652 sched/fair: fix unused warning
Not sure why it's there, but let's keep it.
2024-08-15 17:46:10 +03:00
EmanuelCN
b8f0ad42cb cpufreq: schedutil: Store the cached ratelimits values
In a merge resolution, Google accidentally removed this; since we checked out the scheduler from redbull, this part was missing.
2024-08-12 21:07:31 +03:00
EmanuelCN
3eb3287a03 cpufreq: schedutil: Set rate-limits globally
Since we are not going to modify them per cluster anymore, it's redundant to keep them that way; a global setting is also less expensive.
2024-08-12 21:07:30 +03:00
Xuewen Yan
55b8aa5676 cpufreq: schedutil: Update next_freq when cpufreq_limits change
When cpufreq's policy is 'single', there is a scenario that will
cause sg_policy's next_freq to be unable to update.

When the CPU's util is always max, the cpufreq will be max, and then
if we change the policy's scaling_max_freq to a lower freq, the
sg_policy's next_freq indeed needs to change to the lower freq;
however, because the CPU is busy, next_freq stays at the max_freq.

For example:

The cpu7 is a single CPU:

  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
  pid 4737's current affinity mask: ff
  pid 4737's new affinity mask: 80
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2171000

At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.

To fix this, add a check for the ->need_freq_update flag.

[ mingo: Clarified the changelog. ]
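
A simplified sketch of the fix (invented names, not the governor's real
call path): even when the CPU looks busy and the governor would normally
refuse to lower the frequency, a pending limits change signalled via
->need_freq_update must be allowed through.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for sugov's policy state. */
struct sg_policy {
    bool need_freq_update;   /* set when cpufreq limits changed */
    unsigned int next_freq;
};

/* Old behaviour: a busy CPU never accepted a lower target, so a new,
 * lower scaling_max_freq was never applied. The !need_freq_update
 * check below is the addition that unblocks a limits change. */
static unsigned int pick_freq(struct sg_policy *sg, bool busy,
                              unsigned int target)
{
    if (busy && !sg->need_freq_update && target < sg->next_freq)
        return sg->next_freq;        /* keep the current frequency */

    sg->need_freq_update = false;
    sg->next_freq = target;
    return target;
}
```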

Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
2024-08-12 21:07:25 +03:00
kondors1995
fa49b7a369 cpufreq: schedutil: remove walt bits 2024-08-12 21:06:56 +03:00
Dawei Li
37b421a31a sched/fair: Fix initial util_avg calculation
Change se->load.weight to se_weight(se) in the calculation for the
initial util_avg to avoid unnecessarily inflating the util_avg by 1024
times.

The reason is that se->load.weight has the unit/scale as the scaled-up
load, while cfs_rq->avg.load_avg has the unit/scale as the true task
weight (as mapped directly from the task's nice/priority value). With
CONFIG_32BIT, the scaled-up load is equal to the true task weight. With
CONFIG_64BIT, the scaled-up load is 1024 times the true task weight.
Thus, the current code may inflate the util_avg by 1024 times. The
follow-up capping will not allow the util_avg value to go wild. But the
calculation should have the correct logic.

Signed-off-by: Dawei Li <daweilics@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
2024-08-12 21:06:08 +03:00
EmanuelCN
ac1e9f45c2 cpufreq: schedutil: Give 25% headroom to prime core
If we don't give any headroom to the prime core, it can result in a big performance regression.
2024-08-12 21:06:08 +03:00
EmanuelCN
8aa48a9f97 sched: Introduce Per-Cluster DVFS Headroom
Typically, UI-demanding tasks are handled by little and big cores,
while the prime core is more power-hungry and leads to unnecessary boosts,
causing power wastage. From my observations, boosting the little cores
provides the most significant performance improvement for UI tasks,
followed by a modest boost to the big cores. This results in fewer
stutters across the entire system.

To avoid excessive boosting, the implementation ensures that when
the utilization is equal to or higher than the maximum capacity value,
util is returned as-is.
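
A minimal sketch of a per-cluster headroom as described above, assuming
the usual SCHED_CAPACITY_SCALE == 1024 convention; the percentage
parameter and function name are invented for illustration.

```c
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Boost util by a per-cluster percentage, but return it as-is once it
 * is already at or above the maximum capacity, and never exceed it. */
static unsigned long apply_headroom(unsigned long util,
                                    unsigned long headroom_pct,
                                    unsigned long max_cap)
{
    if (util >= max_cap)            /* saturated: no extra boost */
        return util;
    util += util * headroom_pct / 100;
    return util < max_cap ? util : max_cap;
}
```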
2024-08-12 21:06:08 +03:00
Vincent Guittot
e3169b7855 sched/fair: Fix frequency selection for non-invariant case
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:

  9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.

After the changes in 9c0b4bb7f630, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.

To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.

[ mingo: Clarified the changelog. ]

Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
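
A simplified sketch of the non-invariant case described above (not the
governor's exact formula; names are stand-ins): scale from a frequency
25% above the current one, so that once utilization approaches max the
result exceeds cur_freq and the next OPP can be selected.

```c
#include <stdio.h>

/* In the non-invariant case the reference frequency is the current
 * one plus a 25% margin; at full utilization the returned target is
 * therefore above cur_freq, unsticking the CPU from a low OPP. */
static unsigned long pick_next_freq(unsigned long cur_freq,
                                    unsigned long util,
                                    unsigned long max)
{
    return (cur_freq + (cur_freq >> 2)) * util / max;
}
```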
2024-08-12 21:06:08 +03:00
Vincent Guittot
3a07d786ab sched/cpufreq: Rework iowait boost
Use the max value that has already been computed inside sugov_get_util()
to cap the iowait boost and remove dependency with uclamp_rq_util_with()
which is not used anymore.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org
2024-08-12 21:06:08 +03:00
Vincent Guittot
ab4bc46a57 sched/cpufreq: Rework schedutil governor performance estimation
The current method to take into account uclamp hints when estimating the
target frequency can end in a situation where the selected target
frequency is finally higher than uclamp hints, whereas there are no real
needs. Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we lose important information when
it comes to selecting the target frequency, and we have to make some
assumptions which can't fit all cases.

Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.

effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:

- The minimum performance for this CPU; typically the capacity to handle
  the deadline task and the interrupt pressure. But also uclamp_min
  request when available.

- The maximum target performance for this CPU, which reflects the
  maximum level that we would like not to exceed. By default it will be
  the CPU capacity, but it can be reduced because of some performance
  hints set with uclamp. The value can be lower than the actual
  utilization and/or the min performance level.

A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.

With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.

The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.

eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.

[ mingo: Refined the changelog & C comments. ]

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
2024-08-12 21:06:08 +03:00
Lukasz Luba
e80b03e731 cpufreq, sched/util: Optimize operations with single CPU capacity lookup
The max CPU capacity is the same for all CPUs sharing a frequency domain.
There is a way to avoid heavy operations in a loop for each CPU by
leveraging this knowledge. Thus, simplify the looping code in the
sugov_next_freq_shared() and drop heavy multiplications. Instead, use
simple max() to get the highest utilization from these CPUs.

This is useful for platforms with many (4 or 6) little CPUs. We avoid
heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions
of times, since it's not limited by the schedutil time delta filter in
sugov_should_update_freq(). When there was no need to change frequency
the code bailed out, not updating the sg_policy::last_freq_update_time.
Then every visit after delta_ns time longer than the
sg_policy::freq_update_delay_ns goes through and triggers the next
frequency calculation code. If the next frequency resulting from that
calculation is the same as the current frequency, we won't update
sg_policy::last_freq_update_time and the story repeats (over a very
short period, sometimes a few microseconds).

The max CPU capacity must be fetched every time we are called, due to
difficulties during the policy setup, where we are not able to get the
normalized CPU capacity at the right time.

The fetched CPU capacity value is then used in sugov_iowait_apply() to
calculate the right boost. This required a few changes in the local
functions and arguments. The capacity value should hopefully be fetched
once when needed and then passed in CPU registers to those functions.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
2024-08-12 21:06:08 +03:00
Rafael J. Wysocki
dfb4c19d93 cpufreq: schedutil: Simplify iowait boosting
There is no reason for the minimum iowait boost value in the
schedutil cpufreq governor to depend on the available range of CPU
frequencies.  In fact, that dependency is generally confusing,
because it causes the iowait boost to behave somewhat differently
on CPUs with the same maximum frequency and different minimum
frequencies, for example.

For this reason, replace the min field in struct sugov_cpu
with a constant and choose its value to be 1/8 of
SCHED_CAPACITY_SCALE (for consistency with the intel_pstate
driver's internal governor).

[Note that policy->cpuinfo.max_freq will not be a constant any more
 after a subsequent change, so that change depends on this one.]

Link: https://lore.kernel.org/lkml/20190305083202.GU32494@hirez.programming.kicks-ass.net/T/#ee20bdc98b7d89f6110c0d00e5c3ee8c2ced93c3d
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2024-08-12 21:06:08 +03:00
Dietmar Eggemann
fb15c34397 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has an `int cpu' parameter which allows
retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a compile-time constant
(SCHED_CAPACITY_SCALE).

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
2024-08-12 21:06:08 +03:00
Rafael J. Wysocki
3ad0d4c283 cpufreq: schedutil: Add util to struct sg_cpu
Instead of passing util and max between functions while computing the
utilization and capacity, store the former in struct sg_cpu (along
with the latter and bw_dl).

This will allow the current utilization value to be compared with the
one obtained previously (which is requisite for some code changes to
follow this one), but also it causes the code to look slightly more
consistent and cleaner.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2024-08-12 21:06:08 +03:00
John Galt
57e593068f sched/eevdf: 1ms base slice 2024-08-12 21:06:07 +03:00
John Galt
c36321830c sched/features: enable SCHED_FEAT_PLACE_DEADLINE_INITIAL 2024-08-12 21:06:07 +03:00
Qais Yousef
a39ee7eda1 sched/uclamp: Fix fits_capacity() check in feec()
commit 244226035a1f9b2b6c326e55ae5188fab4f428cb upstream.

As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.

The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.

We open-code uclamp_rq_util_with() except for the clamp() part, as
util_fits_cpu() needs the 'raw' values to be passed to it.

Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.

[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <hsiang023167@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
(cherry picked from commit 244226035a1f9b2b6c326e55ae5188fab4f428cb)
[Fix trivial conflict in kernel/sched/fair.c due to new automatic
variables in master vs 5.10]
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-12 21:06:07 +03:00
Youssef Esmat
a7fecb061a sched/eevdf: Change base_slice to 3ms
The default base slice of 0.75 msec is causing excessive context
switches. Raise it to 3 msecs, which lowers the amount of context
switches and the added overhead of the scheduler.

BUG=b:308209790
TEST=Manual tested on DUT
UPSTREAM-TASK=b:324602237

Change-Id: I7c7a7ecb377933a4032eb11161f5b5873e3e1e4c
Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5291506
Reviewed-by: Suleiman Souhlal <suleiman@google.com>
Reviewed-by: Vineeth Pillai <vineethrp@google.com>
Reviewed-by: Steven Rostedt <rostedt@google.com>
2024-08-12 21:06:07 +03:00
Youssef Esmat
810bfe810a sched/eevdf: Add ENFORCE_ELIGIBILITY and default settings
Eligibility reduces the scheduling latency of high-nice tasks at the
cost of runtime for low-nice tasks. This hurts the Chrome model, where
UI tasks should not be interrupted by background tasks. This adds
a feature to allow disabling eligibility checks.

Note that fairness is preserved in the way the deadline is calculated,
since the deadline is adjusted by the weight of the task.

Disabling lag placement is similar to eligibility since lag will
adjust the CPU time based on the amount of runtime.

BUG=b:308209790
TEST=Manual tested on DUT
UPSTREAM-TASK=b:324602237
Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>
2024-08-12 21:06:07 +03:00
Aaron Lu
889347ba4e sched/fair: Ratelimit update to tg->load_avg
When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that
update_cfs_group() and update_load_avg() at times show noticeable
overhead on a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same taskgroup running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
the migration number is 11 million; with each migration, the task's load
will transfer from the src cfs_rq to the target cfs_rq and each change
involves an update to tg->load_avg. Since the workload can trigger that
many wakeups and migrations, the access (both read and write) to
tg->load_avg can be unbounded. As a result, the two mentioned functions
showed noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the
problem is worse: during a 5s window, the wakeup number is 21 million and
the migration number is 14 million; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because PELT window is roughly 1ms and it
delivered good results for the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.

  ==============================
  postgres_sysbench on SPR:
  25%
  base:   42382±19.8%
  patch:  50174±9.5%  (noise)

  50%
  base:   67626±1.3%
  patch:  67365±3.1%  (noise)

  75%
  base:   100216±1.2%
  patch:  112470±0.1% +12.2%

  100%
  base:    93671±0.4%
  patch:  113563±0.2% +21.2%

  ==============================
  hackbench on ICL:
  group=1
  base:    114912±5.2%
  patch:   117857±2.5%  (noise)

  group=4
  base:    359902±1.6%
  patch:   361685±2.7%  (noise)

  group=8
  base:    461070±0.8%
  patch:   491713±0.3% +6.6%

  group=16
  base:    309032±5.0%
  patch:   378337±1.3% +22.4%

  =============================
  hackbench on SPR:
  group=1
  base:    100768±2.9%
  patch:   103134±2.9%  (noise)

  group=4
  base:    413830±12.5%
  patch:   378660±16.6% (noise)

  group=8
  base:    436124±0.6%
  patch:   490787±3.2% +12.5%

  group=16
  base:    457730±3.2%
  patch:   680452±1.3% +48.8%

  ============================
  netperf/udp_rr on ICL
  25%
  base:    114413±0.1%
  patch:   115111±0.0% +0.6%

  50%
  base:    86803±0.5%
  patch:   86611±0.0%  (noise)

  75%
  base:    35959±5.3%
  patch:   49801±0.6% +38.5%

  100%
  base:    61951±6.4%
  patch:   70224±0.8% +13.4%

  ===========================
  netperf/udp_rr on SPR
  25%
  base:   104954±1.3%
  patch:  107312±2.8%  (noise)

  50%
  base:    55394±4.6%
  patch:   54940±7.4%  (noise)

  75%
  base:    13779±3.1%
  patch:   36105±1.1% +162%

  100%
  base:     9703±3.7%
  patch:   28011±0.2% +189%

  ==============================================
  netperf/tcp_stream on ICL (all in noise range)
  25%
  base:    43092±0.1%
  patch:   42891±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

  ===============================================
  netperf/tcp_stream on SPR (all in noise range)
  25%
  base:    34491±0.3%
  patch:   34886±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
2024-08-12 21:06:07 +03:00
Chengming Zhou
4650b915ab sched/fair: Fix another detach on unattached task corner case
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
fixed two load tracking problems for new tasks, including the detach on
unattached new task problem.

There is still another detach on unattached task problem left, for a
task which has been woken up by try_to_wake_up() and is waiting to
actually be woken up by sched_ttwu_pending().

try_to_wake_up(p)
  cpu = select_task_rq(p)
  if (task_cpu(p) != cpu)
    set_task_cpu(p, cpu)
      migrate_task_rq_fair()
        remove_entity_load_avg()       --> unattached
        se->avg.last_update_time = 0;
      __set_task_cpu()
  ttwu_queue(p, cpu)
    ttwu_queue_wakelist()
      __ttwu_queue_wakelist()

task_change_group_fair()
  detach_task_cfs_rq()
    detach_entity_cfs_rq()
      detach_entity_load_avg()   --> detach on unattached task
  set_task_rq()
  attach_task_cfs_rq()
    attach_entity_cfs_rq()
      attach_entity_load_avg()

The cause of this problem is similar: we should check in detach_entity_cfs_rq()
that se->avg.last_update_time != 0 before doing detach_entity_load_avg().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
2024-08-12 21:06:07 +03:00
Sultan Alsawaf
cc8232d068 sched/cass: Honor uclamp even when no CPUs can satisfy the requirement
When all CPUs available to a uclamp'd process are thermally throttled, it
is possible for them to be throttled below the uclamp minimum requirement.
In this case, CASS only considers uclamp when it compares relative
utilization and nowhere else; i.e., CASS essentially ignores the most
important aspect of uclamp.

Fix it so that CASS tries to honor uclamp even when no CPUs available to a
uclamp'd process are capable of fully meeting the uclamp minimum.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-12 21:06:07 +03:00
Sultan Alsawaf
d1a78a2d42 sched/cass: Fix disproportionate load spreading when CPUs are throttled
When CPUs are thermally throttled, CASS tries to spread load such that
their resulting P-state is scaled relative to their _throttled_ maximum
capacity, rather than their original capacity.

As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
to receive the extra burden and thus run at a disproportionately higher
P-state relative to the throttled CPUs. This not only hurts performance,
but also greatly diminishes energy efficiency since it breaks CASS's basic
load balancing principle.

To fix this, some convoluted logic is required in order to make CASS aware
of a CPU's throttled and non-throttled capacity. The non-throttled capacity
is used for the fundamental relative utilization comparison, while the
throttled capacity is used in conjunction to ensure a throttled CPU isn't
accidentally overloaded as a result.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-12 21:06:07 +03:00
Patrick Bellasi
efaf04ee87 ANDROID: sched/fair: EAS: Add uclamp support to find_best_target()
Utilization clamping can be used to boost the utilization of small tasks
or cap that of big tasks. Thus, one of its possible usages is to bias
tasks placement to "promote" small tasks on higher capacity (less energy
efficient) CPUs or "constraint" big tasks on small capacity (more energy
efficient) CPUs.

When the Energy Aware Scheduler (EAS) looks for the most energy
efficient CPU to run a task on, it currently considers only the
effective utiliation estimated for a task.

Fix this by adding an additional check to skip CPUs which capacity is
smaller then the task clamped utilization.

Change-Id: I43fa6fa27e27c1eb5272c6a45d1a6a5b0faae1aa
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-12 21:06:06 +03:00
kondors1995
c227f99b52 sched: fair: Adapt for current uclamp setup 2024-08-12 21:05:35 +03:00
kondors1995
6ba2fc3636 sched: remove Google's capacity fuckery 2024-08-12 21:04:53 +03:00
EmanuelCN
ffd9073999 sched/eevdf: Use the other placement strategy
Place lag is currently broken (4.19 needs more backports to make it work as intended)

Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-12 21:00:44 +03:00
Xuewen Yan
1791a0596a sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()
It was possible for pick_eevdf() to return NULL, which then causes a
NULL-deref. This turned out to be due to entity_eligible() returning
a false negative because of an s64 multiplication overflow.

Specifically, reweight_eevdf() computes the vlag without considering
the limit placed upon vlag as update_entity_lag() does, and then the
scaling multiplication (remember that weight is 20bit fixed point) can
overflow. This then leads to the new vruntime being weird which then
causes the above entity_eligible() to go side-ways and claim nothing
is eligible.

Thus limit the range of vlag accordingly.

All this was quite rare, but fatal when it does happen.

Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Reported-by: Igor Raits <igor@gooddata.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-12 21:00:44 +03:00
Tianchen Ding
b5be62ec9e sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr
reweight_eevdf() only keeps V unchanged inside itself. When se !=
cfs_rq->curr, the entity is dequeued from the rb tree first, so V is
changed and the result is wrong. Pass the original V to reweight_eevdf()
to fix this issue.

Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
[peterz: flip if() condition for clarity]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-12 21:00:44 +03:00
Tianchen Ding
9314e8ead4 sched/eevdf: Always update V if se->on_rq when reweighting
reweight_eevdf() needs the latest V to do accurate calculation for new
ve and vd. So update V unconditionally when se is runnable.

Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
2024-08-12 21:00:43 +03:00