Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.
A very bad scenario is when a schedutil request was already just made,
such as to reduce the CPU frequency, then a newer request to increase
CPU frequency (even sched deadline urgent frequency increase requests)
can be dropped, even though the rate limits suggest that its Ok to
process a request. This is because of the way the work_in_progress flag
is used.
This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* The bpf programs actually still support older kernels,
we just need to bypass the very first check for kernel version
Change-Id: I4264782ee63efb26b95abd94774938d5456200a3
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e3779 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.
Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:
# cat /etc/resolv.conf
nameserver 147.75.207.207
nameserver 147.75.207.208
For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:
# cilium service list
ID Frontend Backend
1 147.75.207.207:53 1 => 8.8.8.8:53
2 147.75.207.208:53 1 => 8.8.8.8:53
The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:
# nslookup 1.1.1.1
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
[...]
;; connection timed out; no servers could be reached
# dig 1.1.1.1
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
[...]
; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
;; global options: +cmd
;; connection timed out; no servers could be reached
For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:
# nslookup 1.1.1.1 8.8.8.8
1.1.1.1.in-addr.arpa name = one.one.one.one.
In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.
Same example after this fix:
# cilium service list
ID Frontend Backend
1 147.75.207.207:53 1 => 8.8.8.8:53
2 147.75.207.208:53 1 => 8.8.8.8:53
Lookups work fine now:
# nslookup 1.1.1.1
1.1.1.1.in-addr.arpa name = one.one.one.one.
Authoritative answers can be found from:
# dig 1.1.1.1
; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;1.1.1.1. IN A
;; AUTHORITY SECTION:
. 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
;; Query time: 17 msec
;; SERVER: 147.75.207.207#53(147.75.207.207)
;; WHEN: Tue May 21 12:59:38 UTC 2019
;; MSG SIZE rcvd: 111
And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:
# tcpdump -i any udp
[...]
12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
[...]
In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.
The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.
For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.
Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Change-Id: If2bab00efe5f37a591083fe2676e76f35f8cecc3
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
with addition of tnum logic the verifier got smart enough and
we can enforce return codes at program load time.
For now do so for cgroup-bpf program types.
Change-Id: Iae3a46c3d38810e47cbf4ec23356abae03ded736
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to already existing BPF hooks for sys_bind and sys_connect,
the patch provides new hooks for sys_sendmsg.
It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
that provides access to socket itlself (properties like family, type,
protocol) and user-passed `struct sockaddr *` so that BPF program can
override destination IP and port for system calls such as sendto(2) or
sendmsg(2) and/or assign source IP to the socket.
The hooks are implemented as two new attach types:
`BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
UDPv6 correspondingly.
UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
sys_connect hooks, i.e. to prevent reading from / writing to e.g.
user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
The difference with already existing hooks is sys_sendmsg are
implemented only for unconnected UDP.
For TCP it doesn't make sense to change user-provided `struct sockaddr *`
at sendto(2)/sendmsg(2) time since socket either was already connected
and has source/destination set or wasn't connected and call to
sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
Connected UDP is already handled by sys_connect hooks that can override
source/destination at connect time and use fast-path later, i.e. these
hooks don't affect UDP fast-path.
Rewriting source IP is implemented differently than that in sys_connect
hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
just bind socket to desired local IP address since source IP can be set
on per-packet basis by using ancillary data (cmsg(3)). So no matter if
socket is bound or not, source IP has to be rewritten on every call to
sys_sendmsg.
To do so two new fields are added to UAPI `struct bpf_sock_addr`;
* `msg_src_ip4` to set source IPv4 for UDPv4;
* `msg_src_ip6` to set source IPv6 for UDPv6.
Change-Id: Icf5938b0b69ddfb1e99dc2abc90204f7c97f0473
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, while parsing scan RNR Ie data is moved to
next neighbor_ap_info_field after parsing the current
neighbor_ap_info_field. But in last iteration pointer may
try to access invalid data if (uint8_t *)ie + rnr_ie_len + 2)
bytes are less than sizeof neighbor_ap_info_field and same
is the case with tbtt_length access.
Fix is to add a length check of data + next data size to be parsed
< (uint8_t *)ie + rnr_ie_len + 2) instead of adding a validation
of data length only.
CRs-Fixed: 3710080
Change-Id: I05e5a9a02f0f4f9bc468db894588e676f0a248c0
Given that a CPU's clock is gated at even the shallowest idle state,
waiting until a CPU idles at least once before reducing its frequency is
putting the cart before the horse. For long-running workloads with low
compute needs, requiring an idle call since the last frequency update to
lower the CPU's frequency results in significantly increased energy usage.
Given that there is already a mechanism in place to ratelimit frequency
changes, this heuristic is wholly unnecessary.
Allow single-CPU performance domains to drop their frequency without
requiring an idle call in between to improve energy. Right off the bat,
this reduces CPU power consumption by 7.5% playing a cat gif in Firefox on
a Pixel 8 (270 mW -> 250 mW). And there is no visible loss of performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
* HEAD: 2c85e5f15e
* Remove WALT and SchedTune pieces as we won't use them anymore
* Remove SCHED_CPUFREQ_RT_DL in include/linux/sched/cpufreq.h as it's unused now
* Add some new scheduler definitions from msm-4.19 for new schedutil
* Optimize freq switching for sm8150 (sm8150 use software freq switching)
* Minor format and code adjustments (compared to HEAD)
* Note: The new schedutil requires UClamp, which will be backported later
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
As reported by syzbot and experienced by Pavel, using cpus_read_lock()
in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem
and possibly others).
Instead, shrink the preempt disable region by iterating all CPUs and
checking the online status for each individual CPU while having
preemption disabled.
Fixes: 8850cb663b5c ("sched: Simplify wake_up_*idle*()")
Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com
Reported-by: Pavel Machek <pavel@ucw.cz>
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Change-Id: I652eb678e8a2e30d71beeebae35a36d4d4c49a8d
(cherry picked from 96611c26dc351c33f73b48756a9feacc109e5bab)
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
# Conflicts:
# kernel/smp.c
Simplify and make wake_up_if_idle() more robust, also don't iterate
the whole machine with preempt_disable() in it's caller:
wake_up_all_idle_cpus().
This prepares for another wake_up_if_idle() user that needs a full
do_idle() cycle.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org
Change-Id: If9f0dd5d99dd675828656161bfdba299f447f998
(cherry picked from 8850cb663b5cda04d33f9cfbc38889d73d3c8e24)
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This reverts the following commits:
4cd13c21b2 ("softirq: Let ksoftirqd do its job")
3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
1342d8080f61 ("softirq: Don't skip softirq execution when softirq thread is parking")
in a single change to avoid known bad intermediate states introduced by a
patch series reverting them individually.
Due to the mentioned commit, when the ksoftirqd threads take charge of
softirq processing, the system can experience high latencies.
In the past a few workarounds have been implemented for specific
side-effects of the initial ksoftirqd enforcement commit:
commit 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred")
commit 8d5755b3f7 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
commit 217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")
commit 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
But the latency problem still exists in real-life workloads, see the link
below.
The reverted commit intended to solve a live-lock scenario that can now be
addressed with the NAPI threaded mode, introduced with commit 29863d41bb6e
("net: implement threaded-able napi poll loop support"), which is nowadays
in a pretty stable status.
While a complete solution to put softirq processing under nice resource
control would be preferable, that has proven to be a very hard task. In
the short term, remove the main pain point, and also simplify a bit the
current softirq implementation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/netdev/305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com
Link: https://lore.kernel.org/r/57e66b364f1b6f09c9bc0316742c3b14f4ce83bd.1683526542.git.pabeni@redhat.com
(cherry picked from commit d15121be7485655129101f3960ae6add40204463)
Change-Id: If014afbfa3bb56f7c490a22b8334857c8308f901
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
commit 7004ae7dfb08c25dc5165abc447a70036b00b60e
Author: kondors1995 <normandija1945@gmail.com>
Date: Fri Aug 16 17:06:30 2024 +0300
arm64: dts: sm8150-v2: Update gpu votlages
commit 7d2152a68429cc4e44ec5e0d6b3340c7b3bd7130
Author: kondors1995 <normandija1945@gmail.com>
Date: Thu Aug 15 17:46:10 2024 +0300
sched:fair: fix unused warning
not sure way its there but lets keep it
commit 37a33c25346a355a07b4900a864f6326740c3a61
Author: kondors1995 <normandija1945@gmail.com>
Date: Thu Aug 15 17:37:29 2024 +0300
arm64: dts: sm8150-v2: Fixup @2cc338b
commit ab938b00393e1a8e6a11b0f54d5c24d56f68a4de
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Sun Aug 11 14:38:41 2024 +0300
drivers: gpu: drm: Do not affine pm qos requests to prime core
*This was accidentally readded when i rebased techpack/display and drm.
commit 42966bed31891213f5bdb64be9f23aae0acb2c2d
Author: kondors1995 <normandija1945@gmail.com>
Date: Wed Aug 14 10:56:36 2024 +0300
raphael:defconfig: Disable CONFIG_FAIR_GROUP_SCHED
commit 4e1c9539727fc898016524700f5cced0cda9f55d
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Wed Aug 7 13:57:13 2024 +0300
cpufreq: schedutil: Store the cached ratelimits values
In a merge resolution google accidentally removed this, since we have checkouted the scheduler to redbull this part was missing.
commit 2dc779651497edbe71c74f4b41eab79b4b0e2368
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Tue Aug 6 20:24:40 2024 +0300
cpufreq: schedutil: Set rate-limits globally
Since we are not gonna modify them per cluster anymore its redundant to keep them this way, plus its less expensive.
commit 4ac25bd369c6ba572b582ea7099e42dfd7f7137f
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Wed Jul 19 21:05:27 2023 +0800
cpufreq: schedutil: Update next_freq when cpufreq_limits change
When cpufreq's policy is 'single', there is a scenario that will
cause sg_policy's next_freq to be unable to update.
When the CPU's util is always max, the cpufreq will be max,
and then if we change the policy's scaling_max_freq to be a
lower freq, indeed, the sg_policy's next_freq need change to
be the lower freq, however, because the cpu_is_busy, the next_freq
would keep the max_freq.
For example:
The cpu7 is a single CPU:
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
pid 4737's current affinity mask: ff
pid 4737's new affinity mask: 80
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2171000
At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.
To fix this, add a check for the ->need_freq_update flag.
[ mingo: Clarified the changelog. ]
Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
commit e88732a5887bfaa8054408514f2af952e119a9c0
Author: kondors1995 <normandija1945@gmail.com>
Date: Mon Aug 12 20:37:01 2024 +0300
cpufreq: schedutil:remove walt bits
commit a19bfda67be888221f9348eff56a53ac3b34756a
Author: Dawei Li <daweilics@gmail.com>
Date: Sun Aug 4 21:38:34 2024 +0300
sched/fair: Fix initial util_avg calculation
Change se->load.weight to se_weight(se) in the calculation for the
initial util_avg to avoid unnecessarily inflating the util_avg by 1024
times.
The reason is that se->load.weight has the unit/scale as the scaled-up
load, while cfs_rg->avg.load_avg has the unit/scale as the true task
weight (as mapped directly from the task's nice/priority value). With
CONFIG_32BIT, the scaled-up load is equal to the true task weight. With
CONFIG_64BIT, the scaled-up load is 1024 times the true task weight.
Thus, the current code may inflate the util_avg by 1024 times. The
follow-up capping will not allow the util_avg value to go wild. But the
calculation should have the correct logic.
Signed-off-by: Dawei Li <daweilics@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
commit b147bc04acfe5b5647ef50ea9af580aaed04894a
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Wed Aug 7 14:13:40 2024 +0300
cpufreq: schedutil: Give 25% headroom to prime core
If we don't give any headroom to prime core it can result in big performance regression
commit 010effdcd39496611bca9f284c32a5e554a4fd27
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Wed Jul 3 00:16:04 2024 +0300
sched: Introduce Per-Cluster DVFS Headroom
Typically, UI-demanding tasks are handled by little and big cores,
while the prime core is more power-hungry and leads to unnecessary boosts,
causing power wastage. From my observations, boosting the little cores
provides the most significant performance improvement for UI tasks,
followed by a modest boost to the big cores. This results in fewer
stutters across the entire system.
To avoid excessive boosting, the implementation ensures that when
the utilization is equal to or higher than the maximum capacity value,
util is returned as-is.
commit b72fefb30cfc63426415bdbf06d1bb4c87c4d32f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Sun Jan 14 19:36:00 2024 +0100
sched/fair: Fix frequency selection for non-invariant case
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.
After the changes in 9c0b4bb7f630, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.
To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.
[ mingo: Clarified the changelog. ]
Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
commit dcad748d1a2dcdb781fbcc514dc977d8ddc91be6
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Nov 22 14:39:04 2023 +0100
sched/cpufreq: Rework iowait boost
Use the max value that has already been computed inside sugov_get_util()
to cap the iowait boost and remove dependency with uclamp_rq_util_with()
which is not used anymore.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org
commit 31a906e3fcb3ba783d45157b722bcb3a321c1f8f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Nov 22 14:39:03 2023 +0100
sched/cpufreq: Rework schedutil governor performance estimation
The current method to take into account uclamp hints when estimating the
target frequency can end in a situation where the selected target
frequency is finally higher than uclamp hints, whereas there are no real
needs. Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we loose an important information when
it comes to select the target frequency, and we have to make some
assumptions which can't fit all cases.
Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.
effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:
- The minimum performance for this CPU; typically the capacity to handle
the deadline task and the interrupt pressure. But also uclamp_min
request when available.
- The maximum targeting performance for this CPU which reflects the
maximum level that we would like to not exceed. By default it will be
the CPU capacity but can be reduced because of some performance hints
set with uclamp. The value can be lower than actual utilization and/or
min performance level.
A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.
With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.
The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.
eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.
[ mingo: Refined the changelog & C comments. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
commit 307c28badda88662bfd2cf4afb4168506d1e8e96
Author: Lukasz Luba <lukasz.luba@arm.com>
Date: Thu Dec 8 16:02:56 2022 +0000
cpufreq, sched/util: Optimize operations with single CPU capacity lookup
The max CPU capacity is the same for all CPUs sharing frequency domain.
There is a way to avoid heavy operations in a loop for each CPU by
leveraging this knowledge. Thus, simplify the looping code in the
sugov_next_freq_shared() and drop heavy multiplications. Instead, use
simple max() to get the highest utilization from these CPUs.
This is useful for platforms with many (4 or 6) little CPUs. We avoid
heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions
of times, since it's not limited by the schedutil time delta filter in
sugov_should_update_freq(). When there was no need to change frequency
the code bailed out, not updating the sg_policy::last_freq_update_time.
Then every visit after delta_ns time longer than the
sg_policy::freq_update_delay_ns goes through and triggers the next
frequency calculation code. Although, if the next frequency, as outcome
of that, would be the same as current frequency, we won't update the
sg_policy::last_freq_update_time and the story will be repeated (in
a very short period, sometimes a few microseconds).
The max CPU capacity must be fetched every time we are called, due to
difficulties during the policy setup, where we are not able to get the
normalized CPU capacity at the right time.
The fetched CPU capacity value is than used in sugov_iowait_apply() to
calculate the right boost. This required a few changes in the local
functions and arguments. The capacity value should hopefully be fetched
once when needed and then passed over CPU registers to those functions.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
commit db9c6be8f578a2decb289f5e2cce1af14819690e
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Thu Mar 28 11:33:21 2019 +0100
cpufreq: schedutil: Simplify iowait boosting
There is not reason for the minimum iowait boost value in the
schedutil cpufreq governor to depend on the available range of CPU
frequencies. In fact, that dependency is generally confusing,
because it causes the iowait boost to behave somewhat differently
on CPUs with the same maximum frequency and different minimum
frequencies, for example.
For this reason, replace the min field in struct sugov_cpu
with a constant and choose its values to be 1/8 of
SCHED_CAPACITY_SCALE (for consistency with the intel_pstate
driver's internal governor).
[Note that policy->cpuinfo.max_freq will not be a constant any more
after a subsequent change, so this change is depended on by it.]
Link: https://lore.kernel.org/lkml/20190305083202.GU32494@hirez.programming.kicks-ass.net/T/#ee20bdc98b7d89f6110c0d00e5c3ee8c2ced93c3d
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
commit 7e1f2edaa9a3738cdf7bc23eaec0feb866633e3a
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Tue Jun 21 10:04:10 2022 +0100
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has a `int cpu' parameter which allows to
retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).
A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.
Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a a compile-time constant
(SCHED_CAPACITY_SCALE).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
commit caa24c3962bc2a002034548e3c93dce95c9330c4
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Fri Jun 21 22:06:38 2024 +0300
cpufreq: schedutil: Add util to struct sg_cpu
Instead of passing util and max between functions while computing the
utilization and capacity, store the former in struct sg_cpu (along
with the latter and bw_dl).
This will allow the current utilization value to be compared with the
one obtained previously (which is requisite for some code changes to
follow this one), but also it causes the code to look slightly more
consistent and cleaner.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
commit 176187b62d8637f44e8fcdee2b539dee8c30491b
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Thu Jun 27 21:52:34 2024 -0400
sched/eevdf: 1ms base slice
commit c64c439cffc03bff0b5b69455dbd9018ae920a64
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Thu Jun 27 21:49:47 2024 -0400
sched/features: enable SCHED_FEAT_PLACE_DEADLINE_INITIAL
commit 190d38b6481aea95d454dcfab5801a578a86f3dc
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Apr 18 15:09:35 2023 +0100
sched/uclamp: Fix fits_capacity() check in feec()
commit 244226035a1f9b2b6c326e55ae5188fab4f428cb upstream.
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.
The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.
We open code uclamp_rq_util_with() except for the clamp() part,
util_fits_cpu() needs the 'raw' values to be passed to it.
Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.
[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html
Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <hsiang023167@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
(cherry picked from commit 244226035a1f9b2b6c326e55ae5188fab4f428cb)
[Fix trivial conflict in kernel/sched/fair.c due to new automatic
variables in master vs 5.10]
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a661b80985ae13d3ad1a1be93f482192e4a361b5
Author: Youssef Esmat <youssefesmat@chromium.org>
Date: Thu Jun 20 19:26:56 2024 +0300
sched/eevdf: Change base_slice to 3ms
The default base slice of 0.75 msec is causing excessive context
switches. Raise it to 3 msecs, which lowers the amount of context
switches and the added overhead of the scheduler.
BUG=b:308209790
TEST=Manual tested on DUT
UPSTREAM-TASK=b:324602237
Change-Id: I7c7a7ecb377933a4032eb11161f5b5873e3e1e4c
Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5291506
Reviewed-by: Suleiman Souhlal <suleiman@google.com>
Reviewed-by: Vineeth Pillai <vineethrp@google.com>
Reviewed-by: Steven Rostedt <rostedt@google.com>
commit 7d18cd0125c2a4347268d2f72b53a9045937d98a
Author: Youssef Esmat <youssefesmat@chromium.org>
Date: Thu Jun 20 19:26:07 2024 +0300
sched/eevdf: Add ENFORCE_ELIGIBILITY and default settings
Eligibility reduces scheduling latency of high nice task at the
cost of runtime of low nice tasks. This hurts the chrome model where
UI tasks should not be interrupted by background tasks. This adds
a feature to be able to disable eligibility checks.
Note that fairness is presvered in the way the deadline is calculated
since the deadline is adjusted by the weight of the task and a task.
Disabling lag placement is similar to eligibility since lag will
adjust the CPU time based on the amount of runtime.
BUG=b:308209790
TEST=Manual tested on DUT
UPSTREAM-TASK=b:324602237
Signed-off-by: Youssef Esmat <youssefesmat@chromium.org>
commit 6ef762984070d49371eb803bf50aeacfebe51613
Author: Aaron Lu <aaron.lu@intel.com>
Date: Tue Sep 12 14:58:08 2023 +0800
sched/fair: Ratelimit update to tg->load_avg
When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed there are times
update_cfs_group() and update_load_avg() shows noticeable overhead on
a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group
10.63% 10.04% [kernel.vmlinux] [k] update_load_avg
Annotate shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same taskgroup running on different CPUs frequently access
tg->load_avg, it can be heavily contended.
E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sappire Rapids, during a 5s window, the wakeup number is 14millions and
migration number is 11millions and with each migration, the task's load
will transfer from src cfs_rq to target cfs_rq and each change involves
an update to tg->load_avg. Since the workload can trigger as many wakeups
and migrations, the access(both read and write) to tg->load_avg can be
unbound. As a result, the two mentioned functions showed noticeable
overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
during a 5s window, wakeup number is 21millions and migration number is
14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because PELT window is roughly 1ms and it
delivered good results for the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.
==============================
postgres_sysbench on SPR:
25%
base: 42382±19.8%
patch: 50174±9.5% (noise)
50%
base: 67626±1.3%
patch: 67365±3.1% (noise)
75%
base: 100216±1.2%
patch: 112470±0.1% +12.2%
100%
base: 93671±0.4%
patch: 113563±0.2% +21.2%
==============================
hackbench on ICL:
group=1
base: 114912±5.2%
patch: 117857±2.5% (noise)
group=4
base: 359902±1.6%
patch: 361685±2.7% (noise)
group=8
base: 461070±0.8%
patch: 491713±0.3% +6.6%
group=16
base: 309032±5.0%
patch: 378337±1.3% +22.4%
=============================
hackbench on SPR:
group=1
base: 100768±2.9%
patch: 103134±2.9% (noise)
group=4
base: 413830±12.5%
patch: 378660±16.6% (noise)
group=8
base: 436124±0.6%
patch: 490787±3.2% +12.5%
group=16
base: 457730±3.2%
patch: 680452±1.3% +48.8%
============================
netperf/udp_rr on ICL
25%
base: 114413±0.1%
patch: 115111±0.0% +0.6%
50%
base: 86803±0.5%
patch: 86611±0.0% (noise)
75%
base: 35959±5.3%
patch: 49801±0.6% +38.5%
100%
base: 61951±6.4%
patch: 70224±0.8% +13.4%
===========================
netperf/udp_rr on SPR
25%
base: 104954±1.3%
patch: 107312±2.8% (noise)
50%
base: 55394±4.6%
patch: 54940±7.4% (noise)
75%
base: 13779±3.1%
patch: 36105±1.1% +162%
100%
base: 9703±3.7%
patch: 28011±0.2% +189%
==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base: 43092±0.1%
patch: 42891±0.5%
50%
base: 19278±14.9%
patch: 22369±7.2%
75%
base: 16822±3.0%
patch: 17086±2.3%
100%
base: 18216±0.6%
patch: 18078±2.9%
===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base: 34491±0.3%
patch: 34886±0.5%
50%
base: 19278±14.9%
patch: 22369±7.2%
75%
base: 16822±3.0%
patch: 17086±2.3%
100%
base: 18216±0.6%
patch: 18078±2.9%
Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
commit c0c01cb55fa02d91f3a9f1fd456ecc6842990f99
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:48:02 2022 +0800
sched/fair: Fix another detach on unattached task corner case
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
fixed two load tracking problems for new task, including detach on
unattached new task problem.
There still left another detach on unattached task problem for the task
which has been woken up by try_to_wake_up() and waiting for actually
being woken up by sched_ttwu_pending().
try_to_wake_up(p)
cpu = select_task_rq(p)
if (task_cpu(p) != cpu)
set_task_cpu(p, cpu)
migrate_task_rq_fair()
remove_entity_load_avg() --> unattached
se->avg.last_update_time = 0;
__set_task_cpu()
ttwu_queue(p, cpu)
ttwu_queue_wakelist()
__ttwu_queue_wakelist()
task_change_group_fair()
detach_task_cfs_rq()
detach_entity_cfs_rq()
detach_entity_load_avg() --> detach on unattached task
set_task_rq()
attach_task_cfs_rq()
attach_entity_cfs_rq()
attach_entity_load_avg()
The reason of this problem is similar, we should check in detach_entity_cfs_rq()
that se->avg.last_update_time != 0, before do detach_entity_load_avg().
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
commit 60cfd4b1c31915161d1c942a4a9067314fba8d79
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Mon May 6 23:43:57 2024 -0700
sched/cass: Honor uclamp even when no CPUs can satisfy the requirement
When all CPUs available to a uclamp'd process are thermal throttled, it is
possible for them to be throttled below the uclamp minimum requirement. In
this case, CASS only considers uclamp when it compares relative utilization
and nowhere else; i.e., CASS essentially ignores the most important aspect
of uclamp.
Fix it so that CASS tries to honor uclamp even when no CPUs available to a
uclamp'd process are capable of fully meeting the uclamp minimum.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit 82d9f4952ab48ec0b1dd5cd48da1c3619ceb01c9
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Mon May 6 20:14:08 2024 -0700
sched/cass: Fix disproportionate load spreading when CPUs are throttled
When CPUs are thermal throttled, CASS tries to spread load such that their
resulting P-state is scaled relatively to their _throttled_ maximum
capacity, rather than their original capacity.
As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
to receive the extra burden and thus run at a disproportionately higher
P-state relative to the throttled CPUs. This not only hurts performance,
but also greatly diminishes energy efficiency since it breaks CASS's basic
load balancing principle.
To fix this, some convoluted logic is required in order to make CASS aware
of a CPU's throttled and non-throttled capacity. The non-throttled capacity
is used for the fundamental relative utilization comparison, while the
throttled capacity is used in conjunction to ensure a throttled CPU isn't
accidentally overloaded as a result.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit 6341ff1997878ad3b6b90a5bf0e20b99aa9e6d89
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Tue Dec 18 10:31:30 2018 +0000
ANDROID: sched/fair: EAS: Add uclamp support to find_best_target()
Utilization clamping can be used to boost the utilization of small tasks
or cap that of big tasks. Thus, one of its possible usages is to bias
tasks placement to "promote" small tasks on higher capacity (less energy
efficient) CPUs or "constraint" big tasks on small capacity (more energy
efficient) CPUs.
When the Energy Aware Scheduler (EAS) looks for the most energy
efficient CPU to run a task on, it currently considers only the
effective utiliation estimated for a task.
Fix this by adding an additional check to skip CPUs which capacity is
smaller then the task clamped utilization.
Change-Id: I43fa6fa27e27c1eb5272c6a45d1a6a5b0faae1aa
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c0ccb5593e4c2ac72758c5a0f68e274a5698a839
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Mon Apr 8 21:44:58 2024 +0300
sched/eevdf: Use the other placement strategy
Place lag is currently broken (4.19 needs more backports to make it work as intended)
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 052afe14ea97a104e3c384f56405c30fb571efe9
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Mon Apr 22 16:22:38 2024 +0800
sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()
It was possible to have pick_eevdf() return NULL, which then causes a
NULL-deref. This turned out to be due to entity_eligible() returning
falsely negative because of a s64 multiplcation overflow.
Specifically, reweight_eevdf() computes the vlag without considering
the limit placed upon vlag as update_entity_lag() does, and then the
scaling multiplication (remember that weight is 20bit fixed point) can
overflow. This then leads to the new vruntime being weird which then
causes the above entity_eligible() to go side-ways and claim nothing
is eligible.
Thus limit the range of vlag accordingly.
All this was quite rare, but fatal when it does happen.
Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Reported-by: Igor Raits <igor@gooddata.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit e224fc314c4b2f28785f30f8c85af7c566d540b3
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date: Wed Mar 6 10:21:33 2024 +0800
sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr
reweight_eevdf() only keeps V unchanged inside itself. When se !=
cfs_rq->curr, it would be dequeued from rb tree first. So that V is
changed and the result is wrong. Pass the original V to reweight_eevdf()
to fix this issue.
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
[peterz: flip if() condition for clarity]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 6f605d6768465bec0f16cead98e1b3b08e548afe
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date: Wed Mar 6 10:21:32 2024 +0800
sched/eevdf: Always update V if se->on_rq when reweighting
reweight_eevdf() needs the latest V to do accurate calculation for new
ve and vd. So update V unconditionally when se is runnable.
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3416506799b6b2b3eab0688051082ed2fc1d1f5b
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date: Thu Dec 14 13:20:29 2023 +0800
sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
This variable became unused in:
5e963f2bd465 ("sched/fair: Commit to EEVDF")
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 2cfaa805dc73ae4e6395a8aa69ee79b9580cc7bb
Author: Yiwei Lin <s921975628@gmail.com>
Date: Fri Nov 17 16:01:06 2023 +0800
sched/fair: Update min_vruntime for reweight_entity() correctly
Since reweight_entity() may have chance to change the weight of
cfs_rq->curr entity, we should also update_min_vruntime() if
this is the case
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 13ffac20f571298c2007b90674694aebe367be89
Author: Abel Wu <wuyun.abel@bytedance.com>
Date: Wed Nov 15 11:36:46 2023 +0800
sched/eevdf: O(1) fastpath for task selection
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity which has the earliest virtual deadline. I've done
some benchmarks to see its effectiveness.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
hackbench: process/thread + pipe/socket + 1/2/4/8 groups
netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
tbench: loopback 24/48/72/96/192 threads
schbench: 1/2/4/8 mthreads
direct: cfs_rq has only one entity
parity: RUN_TO_PARITY
fast: O(1) fastpath
slow: heap search
(%) direct parity fast slow
hackbench 92.95 2.02 4.91 0.12
netperf 68.08 6.60 24.18 1.14
tbench 67.55 11.22 20.61 0.62
schbench 69.91 2.65 25.73 1.71
The above results indicate that this fastpath really makes task
selection more efficient.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 010771bc9b9ab17c289743afe1241127ba5263df
Author: Abel Wu <wuyun.abel@bytedance.com>
Date: Wed Nov 15 11:36:45 2023 +0800
sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 1e35ae32e182db3531da5d9055e45c7ccf826c96
Author: Abel Wu <wuyun.abel@bytedance.com>
Date: Tue Nov 21 21:44:26 2023 +0200
sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.
But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 73fa83e38da16aad4009b37fe7ee7a9fb289adb9
Author: Yiwei Lin <s921975628@gmail.com>
Date: Fri Oct 20 13:56:17 2023 +0800
sched/fair: Remove unused 'curr' argument from pick_next_entity()
The 'curr' argument of pick_next_entity() has become unused after
the EEVDF changes.
[ mingo: Updated the changelog. ]
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d2d81ece57c3c14431dc2525f551e2d2eadb2834
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Oct 17 16:59:47 2023 +0200
sched/eevdf: Fix heap corruption more
Because someone is a flaming idiot... and forgot we have current as
se->on_rq but not actually in the tree itself, and walking rb_parent()
on an entry not in the tree is 'funky' and KASAN complains.
Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity")
Reported-by: 0599jiangyc@gmail.com
Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020
Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b97630c04b5e7869bc0a5900e6e7dda8f7bcf5d3
Author: Benjamin Segall <bsegall@google.com>
Date: Fri Sep 29 17:09:30 2023 -0700
sched/eevdf: Fix pick_eevdf()
The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.
This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.
I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit df45b4b6a01c84ccb6ac72b0d5d20da7c90572d8
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 6 21:24:45 2023 +0200
sched/eevdf: Fix min_deadline heap integrity
Marek and Biju reported instances of:
"EEVDF scheduling fail, picking leftmost"
which Mike correlated with cgroup scheduling and the min_deadline heap
getting corrupted; some trace output confirms:
> And yeah, min_deadline is hosed somehow:
>
> validate_cfs_rq: --- /
> __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
> __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
> __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
> validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256
Turns out that reweight_entity(), which tries really hard to be fast,
does not do the normal dequeue+update+enqueue pattern but *does* scale
the deadline.
However, it then fails to propagate the updated deadline value up the
heap.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3a51a2af114512733bfe3b6965a7320af592ac6e
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 26 14:29:50 2023 +0200
sched/eevdf: Fix avg_vruntime()
The expectation is that placing a task at avg_vruntime() makes it
eligible. Turns out there is a corner case where this is not the case.
Specifically, avg_vruntime() relies on the fact that integer division
is a flooring function (eg. it discards the remainder). By this
property the value returned is slightly left of the true average.
However! when the average is a negative (relative to min_vruntime) the
effect is flipped and it becomes a ceil, with the result that the
returned value is just right of the average and thus not eligible.
Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 94630ff1c5dace213fc4d886abdddff610331975
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Sep 15 00:48:55 2023 +0200
sched/eevdf: Also update slice on placement
Tasks that never consume their full slice would not update their slice value.
This means that tasks that are spawned before the sysctl scaling keep their
original (UP) slice length.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit ebdbfa17b89f097f7dbb25126033767d392ec08a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed Sep 20 15:00:24 2023 +0200
sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl
The /proc/sys/kernel/sched_child_runs_first knob is no longer connected since:
5e963f2bd4654 ("sched/fair: Commit to EEVDF")
Remove it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230920130025.412071-2-bigeasy@linutronix.de
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit f3c2be38e880f64190b90cfa7a587e224036fa58
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 16 15:40:59 2023 +0200
sched/eevdf: Curb wakeup-preemption
Mike and others noticed that EEVDF does like to over-schedule quite a
bit -- which does hurt performance of a number of benchmarks /
workloads.
In particular, what seems to cause over-scheduling is that when lag is
of the same order (or larger) than the request / slice then placement
will not only cause the task to be placed left of current, but also
with a smaller deadline than current, which causes immediate
preemption.
[ notably, lag bounds are relative to HZ ]
Mike suggested we stick to picking 'current' for as long as it's
eligible to run, giving it uninterrupted runtime until it reaches
parity with the pack.
Augment Mike's suggestion by only allowing it to exhaust it's initial
request.
One random data point:
echo NO_RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
3,723,554 context-switches ( +- 0.56% )
9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
echo RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
2,556,535 context-switches ( +- 0.51% )
9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )
Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit bb48e3bf106a493dad3be6f2a84f507d7a5d0202
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:49 2023 +0200
sched/fair: Propagate enqueue flags into place_entity()
This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 498bfeceabaf3803261f7c934e4303d6f2137ccc
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:48 2023 +0200
sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b194d2bfc5dc8070c78357d0c2b63a0cde2877d6
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Nov 21 21:21:06 2023 +0200
sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
[@Helium-Studio: Also remove sysctl entries that dropped by this commit]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
# Conflicts:
# kernel/sched/fair.c
# kernel/sysctl.c
commit 29c0c9b70590c86dbb3504704b4043ffe793d48f
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:46 2023 +0200
sched/smp: Use lag to simplify cross-runqueue placement
Using lag is both more correct and simpler when moving between
runqueues.
Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 241fa4e083655b4f67cae56fa716ca6234506b30
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:48:01 2022 +0800
sched/fair: Combine detach into dequeue when migrating task
When we are migrating task out of the CPU, we can combine detach and
propagation into dequeue_entity() to save the detach_entity_cfs_rq()
in migrate_task_rq_fair().
This optimization is like combining DO_ATTACH in the enqueue_entity()
when migrating task to the CPU. So we don't have to traverse the CFS tree
extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(),
which wouldn't be called anymore with this patch's change.
detach_task()
deactivate_task()
dequeue_task_fair()
for_each_sched_entity(se)
dequeue_entity()
update_load_avg() /* (1) */
detach_entity_load_avg()
set_task_cpu()
migrate_task_rq_fair()
detach_entity_cfs_rq() /* (2) */
update_load_avg();
detach_entity_load_avg();
propagate_entity_cfs_rq();
for_each_sched_entity()
update_load_avg()
This patch save the detach_entity_cfs_rq() called in (2) by doing
the detach_entity_load_avg() for a CPU migrating task inside (1)
(the task being the first se in the loop)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d563959f4820312d3d1b462c78ca62a8a5fe3157
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:45 2023 +0200
sched/fair: Commit to lag based placement
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.
Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.
One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit ac875e302647f7266ce5678d02f06732e4512f2c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:44 2023 +0200
sched/fair: Implement an EEVDF-like scheduling policy
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute
the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.
Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fa237d82950d832b045ca398b0bdd7088422cb37
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:42 2023 +0200
sched/fair: Add lag based placement
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.
Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 6063bb0c8a1acd02f35d9de3a2f73c17fd99c3f2
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:41 2023 +0200
sched/fair: Remove sched_feat(START_DEBIT)
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 45d55875264f7f7756c814df55a0c505de1f13c0
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:40 2023 +0200
sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.
Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.
Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 482a42257efb9da9db63f03a8af5d5f07d3518f6
Author: Jiang Biao <benbjiang@tencent.com>
Date: Tue Aug 11 19:32:09 2020 +0800
sched/fair: Simplify the work when reweighting entity
The code in reweight_entity() can be simplified.
For a sched entity on the rq, the entity accounting can be replaced by
cfs_rq instantaneous load updates currently called from within the
entity accounting.
Even though an entity on the rq can't represent a task in
reweight_entity() (a task is always dequeued before calling this
function) and so the numa task accounting and the rq->cfs_tasks list
management of the entity accounting are never called, the redundant
cfs_rq->nr_running decrement/increment will be avoided.
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200811113209.34057-1-benbjiang@tencent.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d21036a9134537a7e56654ea6191797905c78148
Author: Vincent Donnefort <vincent.donnefort@arm.com>
Date: Tue Jun 21 10:04:08 2022 +0100
sched/fair: Provide u64 read for 32-bits arch helper
Introducing macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables for 32-bits architectures.
Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the later where the copy lies outside of the structure
(cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.
Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore, have a small penalty for 32-bits machines in set_task_rq_fair()
and init_cfs_rq().
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fded4adb0d75b4c4c5400278f277a5ff8a782020
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Apr 29 17:04:12 2020 +0200
rbtree, sched/fair: Use rb_add_cached()
Reduce rbtree boiler plate by using the new helper function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 759f0c10c642f4917e94b9ad3a29a68cfe02f1a5
Author: Michel Lespinasse <walken@google.com>
Date: Wed Sep 25 16:46:10 2019 -0700
augmented rbtree: rework the RB_DECLARE_CALLBACKS macro definition
Change the definition of the RBCOMPUTE function. The propagate callback
repeatedly calls RBCOMPUTE as it moves from leaf to root. it wants to
stop recomputing once the augmented subtree information doesn't change.
This was previously checked using the == operator, but that only works
when the augmented subtree information is a scalar field. This commit
modifies the RBCOMPUTE function so that it now sets the augmented subtree
information instead of returning it, and returns a boolean value
indicating if the propagate callback should stop.
The motivation for this change is that I want to introduce augmented
rbtree uses where the augmented data for the subtree is a struct instead
of a scalar.
Link: http://lkml.kernel.org/r/20190703040156.56953-4-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 9372e034d9366c3cad0772ee1b310652d90bcb70
Author: Michel Lespinasse <walken@google.com>
Date: Wed Sep 25 16:46:07 2019 -0700
augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro
Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern. This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.
[walken@google.com: fix mm/vmalloc.c]
Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[walken@google.com: re-add check to check_augmented()]
Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a3175d39d501e3512d4df600939201cbda8e9b33
Author: Michel Lespinasse <walken@google.com>
Date: Wed Sep 25 16:46:04 2019 -0700
augmented rbtree: add comments for RB_DECLARE_CALLBACKS macro
Patch series "make RB_DECLARE_CALLBACKS more generic", v3.
These changes are intended to make the RB_DECLARE_CALLBACKS macro more
generic (allowing the aubmented subtree information to be a struct instead
of a scalar).
I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
check that they didn't change. This held as expected for interval_tree.o;
mmap.o did have some changes which could be reverted by marking
__vma_link_rb as noinline. I did not add such a change to the patchset; I
felt it was reasonable enough to leave the inlining decision up to the
compiler.
This patch (of 3):
Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
The arguments are also now capitalized. This copies the style of the
INTERVAL_TREE_DEFINE macro.
No functional changes in this commit, only comments and capitalization.
Link: http://lkml.kernel.org/r/20190703040156.56953-2-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 9600cbd0d286c6cf86b68e88ea0600a219815d37
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:43 2023 +0200
rbtree: Add rb_add_augmented_cached() helper
While slightly sub-optimal, updating the augmented data while going
down the tree during lookup would be faster -- alas the augment
interface does not currently allow for that, provide a generic helper
to add a node to an augmented cached tree.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.862983648@infradead.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3773003b378a5a3e5e44d0083d05c7437fd8a4b5
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon Oct 9 10:36:53 2017 +0200
sched/core: Ensure load_balance() respects the active_mask
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 1a4a49b4cd14bdcf91e3c9de97aa98c912e11c83
Author: Wanpeng Li <wanpengli@tencent.com>
Date: Mon Jan 13 08:50:27 2020 +0800
[BACKPORT]sched/nohz: Optimize get_nohz_timer_target()
On a machine, CPU 0 is used for housekeeping, the other 39 CPUs in the
same socket are in nohz_full mode. We can observe huge time burn in the
loop for seaching nearest busy housekeeper cpu by ftrace.
2) | get_nohz_timer_target() {
2) 0.240 us | housekeeping_test_cpu();
2) 0.458 us | housekeeping_test_cpu();
...
2) 0.292 us | housekeeping_test_cpu();
2) 0.240 us | housekeeping_test_cpu();
2) 0.227 us | housekeeping_any_cpu();
2) + 43.460 us | }
This patch optimizes the searching logic by finding a nearest housekeeper
CPU in the housekeeping cpumask, it can minimize the worst searching time
from ~44us to < 10us in my testing. In addition, the last iterated busy
housekeeper can become a random candidate while current CPU is a better
fallback if it is a housekeeper.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
Signed-off-by: DennySPB <dennyspb@gmail.com>
commit f773f3b275ad618158280c9793a1d27639f298f7
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Dec 14 17:01:56 2018 +0100
sched/fair: Trigger asym_packing during idle load balance
Newly idle load balancing is not always triggered when a CPU becomes idle.
This prevents the scheduler from getting a chance to migrate the task
for asym packing.
Enable active migration during idle load balance too.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: valentin.schneider@arm.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: billaids <jimmy.nelle@hsw-stud.de>
commit 75352e3fb31924ec3af106be73e210d2258fefe8
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon Oct 9 10:36:53 2017 +0200
sched/core: Ensure load_balance() respects the active_mask
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 74b6dcb03b6b9fa56fbcdf2ffb33e50fa97db6dc
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Tue Dec 3 17:01:06 2019 +0100
sched: Use fair:prio_changed() instead of ad-hoc implementation
set_user_nice() implements its own version of fair::prio_changed() and
therefore misses a specific optimization towards nohz_full CPUs that
avoid sending an resched IPI to a reniced task running alone. Use the
proper callback instead.
Change-Id: I51ba67826dfcec0aa423758281943c01ba267c91
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
Signed-off-by: DennySPB <dennyspb@gmail.com>
commit 1994e58a6bf1726d32a1c58af16b2be6fdfafd80
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Apr 26 12:19:32 2018 +0200
sched/fair: Fix the update of blocked load when newly idle
With commit:
31e77c93e432 ("sched/fair: Update blocked load when newly idle")
... we release the rq->lock when updating blocked load of idle CPUs.
This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.
The check for newly added task of idle_balance() is not in the common path.
Move the out label to include this check.
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit ddd49e696ad91a561a8c8bf12ef61bc0e113ac41
Author: Josh Don <joshdon@google.com>
Date: Tue Aug 4 12:34:13 2020 -0700
sched/fair: Ignore cache hotness for SMT migration
SMT siblings share caches, so cache hotness should be irrelevant for
cross-sibling migration.
Signed-off-by: Josh Don <joshdon@google.com>
Proposed-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200804193413.510651-1-joshdon@google.com
commit de8674e8809c07543874ec4a4c91631c481caf4f
Author: Peter Oskolkov <posk@google.com>
Date: Wed Sep 30 10:35:32 2020 -0700
sched/fair: Tweak pick_next_entity()
Currently, pick_next_entity(...) has the following structure
(simplified):
[...]
if (last_buddy_ok())
result = last_buddy;
if (next_buddy_ok())
result = next_buddy;
[...]
The intended behavior is to prefer next buddy over last buddy;
the current code somewhat obfuscates this, and also wastes
cycles checking the last buddy when eventually the next buddy is
picked up.
So this patch refactors two 'ifs' above into
[...]
if (next_buddy_ok())
result = next_buddy;
else if (last_buddy_ok())
result = last_buddy;
[...]
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guitttot@linaro.org>
Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com
commit 6a20c355412765268bfc7635b626899873f73337
Author: Clement Courbet <courbet@google.com>
Date: Wed Mar 3 14:46:53 2021 -0800
sched: Optimize __calc_delta()
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.
This is ~7x faster on benchmarks.
The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.
On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?idI406.
```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms Â=B112%
BM_Calc<__calc_delta_generic_fls> 2.36ms Â=B113%
BM_Calc<__calc_delta_asm_fls> 2.45ms Â=B113%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms Â=B112%
BM_Calc<__calc_delta_asm_fls64> 2.46ms Â=B113%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms Â=B115%
BM_Calc<__calc_delta_builtin> 1.32ms Â=B111%
```
Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
commit e4793308d8214d546ae2eead6f9fac0426ada0f9
Author: Amit Kucheria <amit.kucheria@linaro.org>
Date: Mon Oct 21 17:45:12 2019 +0530
cpufreq: Initialize the governors in core_initcall
Initialize the cpufreq governors earlier to allow for earlier
performance control during the boot process.
Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/b98eae9b44eb2f034d7f5d12a161f5f831be1eb7.1571656015.git.amit.kucheria@linaro.org
# Conflicts:
# drivers/cpufreq/cpufreq_performance.c
commit 7f1bc529bc787dcbedfd2899cb9c99326dd57215
Author: Mathieu Poirier <mathieu.poirier@linaro.org>
Date: Fri Jul 19 15:59:53 2019 +0200
sched/topology: Add partition_sched_domains_locked()
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function. That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.
No change of functionality is introduced by this patch.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit e17f27e0d39d77e4b1011278796a30e8d7b05d6f
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Tue Apr 9 18:35:45 2019 +0100
sched/topology: Skip duplicate group rewrites in build_sched_groups()
While staring at build_sched_domains(), I realized that get_group()
does several duplicate (thus useless) writes.
If you take the Arm Juno r0 (LITTLEs = [0, 3, 4, 5], bigs = [1, 2]), the
sched_group build flow would look like this:
('MC[cpu]->sg' means 'per_cpu_ptr(&tl->data->sg, cpu)' with 'tl == MC')
build_sched_groups(MC[CPU0]->sd, CPU0)
get_group(0) -> MC[CPU0]->sg
get_group(3) -> MC[CPU3]->sg
get_group(4) -> MC[CPU4]->sg
get_group(5) -> MC[CPU5]->sg
build_sched_groups(DIE[CPU0]->sd, CPU0)
get_group(0) -> DIE[CPU0]->sg
get_group(1) -> DIE[CPU1]->sg <=================+
|
build_sched_groups(MC[CPU1]->sd, CPU1) |
get_group(1) -> MC[CPU1]->sg |
get_group(2) -> MC[CPU2]->sg |
|
build_sched_groups(DIE[CPU1]->sd, CPU1) ^
get_group(1) -> DIE[CPU1]->sg } We've set up these two up here!
get_group(3) -> DIE[CPU0]->sg }
From this point on, we will only use sched_groups that have been
previously visited & initialized. The only new operation will
be which group pointer we affect to sd->groups.
On the Juno r0 we get 32 get_group() calls, every single one of them
writing to a sched_group->cpumask. However, all of the data structures
we need are set up after 8 visits (see above).
Return early from get_group() if we've already visited (and thus
initialized) the sched_group we're looking at. Overlapping domains
are not affected as they do not use build_sched_groups().
Tested on a Juno and a 2 * (Xeon E5-2690) system.
( FWIW I initially checked the refs for both sg && sg->sgc, but figured if
they weren't both 0 or > 1 then something must have gone wrong, so I
threw in a WARN_ON(). )
No change in functionality intended.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit c62d6453c251607e2fe6f6e204a0711b7edc03df
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Jun 17 17:00:17 2019 +0200
sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
unused since commit:
765d0af19f5f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")
Remove it.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: linux@armlinux.org.uk
Cc: quentin.perret@arm.com
Cc: rafael@kernel.org
Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 360635bafb36f39478a5c32b43e3cb64b1df4dad
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Mon Apr 22 16:01:47 2024 -0400
sched-pelt: make half life nonconfigurable
16ms only
commit baa288445831e3cef3b7f6cd20bc0c52f331acc2
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Mon Apr 22 15:32:50 2024 -0400
sched/fair: decrease migration cost
commit 7e1c7b478d201f3909d4c8f1f9f121064457f4e9
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Tue Aug 1 17:26:48 2023 +0200
sched/rt: Don't try push tasks if there are none.
I have a RT task X at a high priority and cyclictest on each CPU with
lower priority than X's. If X is active and each CPU wakes their own
cylictest thread then it ends in a longer rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority, cyclictest is
next in line so there is nothing that can be done since the task with
the higher priority is not touched.
tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active it
does nothing because it has _no_ pushable tasks on its runqueue. Then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.
I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.
Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the amount of issued rto_push_irq_work_func() if nothing can be
done. As the result the overloaded CPU is blocked less often.
There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but didn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU) but the majority of request end in tell_cpu_to_push() before an IPI
is issued.
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
commit 93662a6fbca530b6480fc92e9d119c5ccf09da7f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed May 13 15:55:28 2020 +0200
sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.
Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
commit e8164ed37c41537ee1db3584095cd533705d5ad5
Author: Josh Don <joshdon@google.com>
Date: Fri Apr 10 15:52:08 2020 -0700
sched/fair: Remove distribute_running from CFS bandwidth
This is mostly a revert of commit:
baa9be4ffb55 ("sched/fair: Fix throttle_list starvation with low CFS quota")
The primary use of distribute_running was to determine whether to add
throttled entities to the head or the tail of the throttled list. Now
that we always add to the tail, we can remove this field.
The other use of distribute_running is in the slack_timer, so that we
don't start a distribution while one is already running. However, even
in the event that this race occurs, it is fine to have two distributions
running (especially now that distribute grabs the cfs_b->lock to
determine remaining quota before assigning).
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200410225208.109717-3-joshdon@google.com
commit 1eb50996e28558aabcd74d3afb37e044fedac26d
Author: bsegall@google.com <bsegall@google.com>
Date: Thu Jun 6 10:21:01 2019 -0700
sched/fair: Don't push cfs_bandwith slack timers forward
When a cfs_rq sleeps and returns its quota, we delay for 5ms before
waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
sleep, as this has to be done outside of the rq lock we hold.
The current code waits for 5ms without any sleeps, instead of waiting
for 5ms from the first sleep, which can delay the unthrottle more than
we want. Switch this around so that we can't push this forward forever.
This requires an extra flag rather than using hrtimer_active, since we
need to start a new timer if the current one is in the process of
finishing.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: DennySPB <dennyspb@gmail.com>
commit 2d4490a11d8952e8016647a5bbfa98f6df781dde
Author: Paul Turner <pjt@google.com>
Date: Fri Apr 10 15:52:07 2020 -0700
sched/fair: Eliminate bandwidth race between throttling and distribution
There is a race window in which an entity begins throttling before quota
is added to the pool, but does not finish throttling until after we have
finished with distribute_cfs_runtime(). This entity is not observed by
distribute_cfs_runtime() because it was not on the throttled list at the
time that distribution was running. This race manifests as rare
period-length statlls for such entities.
Rather than heavy-weight the synchronization with the progress of
distribution, we can fix this by aborting throttling if bandwidth has
become available. Otherwise, we immediately add the entity to the
throttled list so that it can be observed by a subsequent distribution.
Additionally, we can remove the case of adding the throttled entity to
the head of the throttled list, and simply always add to the tail.
Thanks to 26a8b12747c97, distribute_cfs_runtime() no longer holds onto
its own pool of runtime. This means that if we do hit the !assign and
distribute_running case, we know that distribution is about to end.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com
commit 4f08342e65cedd1082630c987d9a7bafa7703c70
Author: Huaixin Chang <changhuaixin@linux.alibaba.com>
Date: Fri Mar 27 11:26:25 2020 +0800
sched/fair: Fix race between runtime distribution and assignment
Currently, there is a potential race between distribute_cfs_runtime()
and assign_cfs_rq_runtime(). Race happens when cfs_b->runtime is read,
distributes without holding lock and finds out there is not enough
runtime to charge against after distribution. Because
assign_cfs_rq_runtime() might be called during distribution, and use
cfs_b->runtime at the same time.
Fibtest is the tool to test this race. Assume all gcfs_rq is throttled
and cfs period timer runs, slow threads might run and sleep, returning
unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local
pool. If all this happens sufficiently quickly, cfs_b->runtime will drop
a lot. If runtime distributed is large too, over-use of runtime happens.
A runtime over-using by about 70 percent of quota is seen when we
test fibtest on a 96-core machine. We run fibtest with 1 fast thread and
95 slow threads in test group, configure 10ms quota for this group and
see the CPU usage of fibtest is 17.0%, which is far more than the
expected 10%.
On a smaller machine with 32 cores, we also run fibtest with 96
threads. CPU usage is more than 12%, which is also more than expected
10%. This shows that on similar workloads, this race do affect CPU
bandwidth control.
Solve this by holding lock inside distribute_cfs_runtime().
Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
commit 455896c7c0c647b57e0b742d00b3b65aa8f02e87
Author: kondors1995 <normandija1945@gmail.com>
Date: Thu Apr 4 21:48:12 2024 +0300
sched/cass:fixup
commit 4b88ec1c7957d055e3a23067908fa0ec47cbd2c5
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Wed Mar 13 21:25:29 2024 -0700
sched/cass: Eliminate redundant calls to smp_processor_id()
Calling smp_processor_id() can be expensive depending on how an arch
implements it, so avoid calling it more than necessary.
Use the raw variant too since this code is always guaranteed to run with
preemption disabled.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit f7d512d05d97dea633689bf689fd8a63960248ca
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Tue Mar 12 16:55:56 2024 -0700
sched/cass: Only treat sync waker CPU as idle if there's one task running
For synchronized wakes, the waker's CPU should only be treated as idle if
there aren't any other running tasks on that CPU. This is because, for
synchronized wakes, it is assumed that the waker will immediately go to
sleep after waking the wakee; therefore, if there aren't any other tasks
running on the waker's CPU, it'll go idle and should be treated as such to
improve task placement.
This optimization only applies when there aren't any other tasks running on
the waker's CPU, however.
Fix it by ensuring that there's only the waker running on its CPU.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit 3264ea2961b6a7a2e5589b199c031aa17361401a
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Mon Feb 19 13:13:02 2024 -0800
sched/cass: Fix suboptimal task placement when uclamp is used
Uclamp is designed to specify a process' CPU performance requirement scaled
as a CPU capacity value. It simply denotes the process' requirement for the
CPU's raw performance and thus P-state.
CASS currently treats uclamp as a CPU load value however, producing wildly
suboptimal CPU placement decisions for tasks which use uclamp. This hurts
performance and, even worse, massively hurts energy efficiency, with CASS
sometimes yielding power consumption that is a few times higher than EAS.
Since uclamp inherently throws a wrench into CASS's goal of keeping
relative P-states as low as possible across all CPUs, making it cooperate
with CASS requires a multipronged approach.
Make the following three changes to fix the uclamp task placement issue:
1. Treat uclamp as a CPU performance value rather than a CPU load value.
2. Clamp a CPU's utilization to the task's uclamp floor in order to keep
relative P-states as low as possible across all CPUs.
3. Consider preferring a non-idle CPU for uclamped tasks to avoid pushing
up the P-state of more than one CPU when there are multiple concurrent
uclamped tasks.
This fixes CASS's massive energy efficiency and performance issues when
uclamp is used.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit dc098ba6aab229c19099642f561d33562f1ff755
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Sat Jan 6 00:34:48 2024 -0800
sched/cass: Perform runqueue selection for RT tasks too
RT tasks aren't placed on CPUs in a load-balanced manner, much less an
energy efficient one. On systems which contain many RT tasks and/or IRQ
threads, energy efficiency and throughput are diminished significantly by
the default RT runqueue selection scheme which targets minimal latency.
In practice, performance is actually improved by spreading RT tasks fairly,
despite the small latency impact. Additionally, energy efficiency is
significantly improved since the placement of all tasks benefits from
energy-efficient runqueue selection, rather than just CFS tasks.
Perform runqueue selection for RT tasks in CASS to significantly improve
energy efficiency and overall performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit c8bbbe7148731e0024d0681f6cc8738cbe1cee06
Author: kondors1995 <normandija1945@gmail.com>
Date: Thu Apr 4 12:48:06 2024 +0300
sched/cass:checkout to kerneltoast/android_kernel_google_zuma@63f0b82d3
commit 1d4260eba475bad8a2aba7a974e854baefeb1040
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Wed Feb 28 21:07:44 2024 -0800
sched/cass: Clean up local variable scope in cass_best_cpu()
Move `curr` and `idle_state` to within the loop's scope for better
readability. Also, leave a comment about `curr->cpu` to make it clear that
`curr->cpu` must be initialized within the loop in order for `best->cpu` to
be valid.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit cc47ef1ef764b7d261402bfa04a42dc4a418ab0b
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Sun Dec 17 19:04:28 2023 -0800
sched/cass: Fix CPU selection when no candidate CPUs are idle
When no candidate CPUs are idle, CASS would keep `cidx` unchanged, and thus
`best == curr` would always be true. As a result, since the empty candidate
slot never changes, the current candidate `curr` always overwrites the best
candidate `best`. This causes the last valid CPU to always be selected by
CASS when no CPUs are idle (i.e., under heavy load).
Fix it by ensuring that the CPU loop in cass_best_cpu() flips the free
candidate index after the first candidate CPU is evaluated.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit 2dec239da28c7e1d19ead79686f5114e417e6d65
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Mon Nov 5 14:54:00 2018 +0000
UPSTREAM: sched/fair: Add lsub_positive() and use it consistently
The following pattern:
var -= min_t(typeof(var), var, val);
is used multiple times in fair.c.
The existing sub_positive() already captures that pattern, but it also
adds an explicit load-store to properly support lockless observations.
In other cases the pattern above is used to update local, and/or not
concurrently accessed, variables.
Let's add a simpler version of sub_positive(), targeted at local variables
updates, which gives the same readability benefits at calling sites,
without enforcing {READ,WRITE}_ONCE() barriers.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
Change-Id: I6a6a3b2ae9e4baa4ab6e906bf2aaed7306303025
Signed-off-by: Jason Edson <jaysonedson@gmail.com>
Signed-off-by: DennySPb <dennyspb@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fc9b2f93da21cddfc7d76fa941eb496935e48db4
Author: Rohit Jain <rohit.k.jain@oracle.com>
Date: Wed May 2 13:52:10 2018 -0700
sched/core: Don't schedule threads on pre-empted vCPUs
In paravirt configurations today, spinlocks figure out whether a vCPU is
running to determine whether or not spinlock should bother spinning. We
can use the same logic to prioritize CPUs when scheduling threads. If a
vCPU has been pre-empted, it will incur the extra cost of VMENTER and
the time it actually spends to be running on the host CPU. If we had
other vCPUs which were actually running on the host CPU and idle we
should schedule threads there.
Performance numbers:
Note: With patch is referred to as Paravirt in the following and without
patch is referred to as Base.
1) When only 1 VM is running:
a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):
+-------+-----------------+----------------+
|Number |Paravirt |Base |
|of +---------+-------+-------+--------+
|Threads|Average |Std Dev|Average| Std Dev|
+-------+---------+-------+-------+--------+
|1 |1.817 |0.076 |1.721 | 0.067 |
|2 |3.467 |0.120 |3.468 | 0.074 |
|4 |6.266 |0.035 |6.314 | 0.068 |
|8 |11.437 |0.105 |11.418 | 0.132 |
|16 |21.862 |0.167 |22.161 | 0.129 |
|25 |33.341 |0.326 |33.692 | 0.147 |
+-------+---------+-------+-------+--------+
2) When two VMs are running with same CPU affinities:
a) tbench test on VM 8 cpus
Base:
VM1:
Throughput 220.59 MB/sec 1 clients 1 procs max_latency=12.872 ms
Throughput 448.716 MB/sec 2 clients 2 procs max_latency=7.555 ms
Throughput 861.009 MB/sec 4 clients 4 procs max_latency=49.501 ms
Throughput 1261.81 MB/sec 7 clients 7 procs max_latency=76.990 ms
VM2:
Throughput 219.937 MB/sec 1 clients 1 procs max_latency=12.517 ms
Throughput 470.99 MB/sec 2 clients 2 procs max_latency=12.419 ms
Throughput 841.299 MB/sec 4 clients 4 procs max_latency=37.043 ms
Throughput 1240.78 MB/sec 7 clients 7 procs max_latency=77.489 ms
Paravirt:
VM1:
Throughput 222.572 MB/sec 1 clients 1 procs max_latency=7.057 ms
Throughput 485.993 MB/sec 2 clients 2 procs max_latency=26.049 ms
Throughput 947.095 MB/sec 4 clients 4 procs max_latency=45.338 ms
Throughput 1364.26 MB/sec 7 clients 7 procs max_latency=145.124 ms
VM2:
Throughput 224.128 MB/sec 1 clients 1 procs max_latency=4.564 ms
Throughput 501.878 MB/sec 2 clients 2 procs max_latency=11.061 ms
Throughput 965.455 MB/sec 4 clients 4 procs max_latency=45.370 ms
Throughput 1359.08 MB/sec 7 clients 7 procs max_latency=168.053 ms
b) Hackbench with 4 fd 1,000,000 loops
+-------+--------------------------------------+----------------------------------------+
|Number |Paravirt |Base |
|of +----------+--------+---------+--------+----------+--------+---------+----------+
|Threads|Average1 |Std Dev1|Average2 | Std Dev|Average1 |Std Dev1|Average2 | Std Dev 2|
+-------+----------+--------+---------+--------+----------+--------+---------+----------+
| 1 | 3.748 | 0.620 | 3.576 | 0.432 | 4.006 | 0.395 | 3.446 | 0.787 |
+-------+----------+--------+---------+--------+----------+--------+---------+----------+
Note that this test was run just to show the interference effect
over-subscription can have in baseline
c) schbench results with 2 message groups on 8 vCPU VMs
+-----------+-------+---------------+--------------+------------+
| | | Paravirt | Base | |
+-----------+-------+-------+-------+-------+------+------------+
| |Threads| VM1 | VM2 | VM1 | VM2 |%Improvement|
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th | 1 | 52 | 53 | 58 | 54 | +6.25% |
|75.0000th | 1 | 69 | 61 | 83 | 59 | +8.45% |
|90.0000th | 1 | 80 | 80 | 89 | 83 | +6.98% |
|95.0000th | 1 | 83 | 83 | 93 | 87 | +7.78% |
|*99.0000th | 1 | 92 | 94 | 99 | 97 | +5.10% |
|99.5000th | 1 | 95 | 100 | 102 | 103 | +4.88% |
|99.9000th | 1 | 107 | 123 | 105 | 203 | +25.32% |
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th | 2 | 56 | 62 | 67 | 59 | +6.35% |
|75.0000th | 2 | 69 | 75 | 80 | 71 | +4.64% |
|90.0000th | 2 | 80 | 82 | 90 | 81 | +5.26% |
|95.0000th | 2 | 85 | 87 | 97 | 91 | +8.51% |
|*99.0000th | 2 | 98 | 99 | 107 | 109 | +8.79% |
|99.5000th | 2 | 107 | 105 | 109 | 116 | +5.78% |
|99.9000th | 2 | 9968 | 609 | 875 | 3116 | -165.02% |
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th | 4 | 78 | 77 | 78 | 79 | +1.27% |
|75.0000th | 4 | 98 | 106 | 100 | 104 | 0.00% |
|90.0000th | 4 | 987 | 1001 | 995 | 1015 | +1.09% |
|95.0000th | 4 | 4136 | 5368 | 5752 | 5192 | +13.16% |
|*99.0000th | 4 | 11632 | 11344 | 11024| 10736| -5.59% |
|99.5000th | 4 | 12624 | 13040 | 12720| 12144| -3.22% |
|99.9000th | 4 | 13168 | 18912 | 14992| 17824| +2.24% |
+-----------+-------+-------+-------+-------+------+------------+
Note: Improvement is measured for (VM1+VM2)
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 495ad164dcd1731db6746346355ecf52a193c720
Author: EmanuelCN <emanuelghub@gmail.com>
Date: Wed Apr 3 18:31:28 2024 +0000
init: Enable SCHED_THERMAL_PRESSURE by default
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 9170eccde2a4c93f9be5b4bc0bed7bce1277f8da
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:13 2020 -0500
sched/fair: Enable tuning of decay period
Thermal pressure follows pelt signals which means the decay period for
thermal pressure is the default pelt decay period. Depending on SoC
characteristics and thermal activity, it might be beneficial to decay
thermal pressure slower, but still in-tune with the pelt signals. One way
to achieve this is to provide a command line parameter to set a decay
shift parameter to an integer between 0 and 10.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0c8c7aaf09aea4f6296130605b75697fa21f2615
Author: Lukasz Luba <lukasz.luba@arm.com>
Date: Thu Jun 10 16:03:22 2021 +0100
UPSTREAM: thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure
The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in a
per-cpu 'thermal_pressure' variable. The online CPUs will get the new
value there, while the offline won't. Unfortunately, when the CPU is back
online, the value read from per-cpu variable might be wrong (stale data).
This might affect the scheduler decisions, since it sees the CPU capacity
differently than what is actually available.
Fix it by making sure that all online+offline CPUs would get the proper
value in their per-cpu variable when thermal framework sets capping.
Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/all/20210614191030.22241-1-lukasz.luba@arm.com/
Bug: 199501011
Change-Id: I10cceb48b72ccce1f51cfc0a7ecfa8d8e67d4394
(cherry picked from commit 2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb)
Signed-off-by: Ram Chandrasekar <quic_rkumbako@quicinc.com>
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit cdbf22d3d8bc1b7a384677f4af7ea20fd169c538
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:12 2020 -0500
thermal: cpu-cooling: Update thermal pressure in case of a maximum frequency capping
Thermal governors can request for a CPU's maximum supported frequency to
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular CPU is
reduced. Delta between the original maximum capacity and capped maximum
capacity is known as thermal pressure. Enable cpufreq cooling device to
update the thermal pressure in event of a capped maximum frequency.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-9-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d81bc16f14b74721dcea680568e862361f531ca8
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:10 2020 -0500
sched/fair: Enable periodic update of average thermal pressure
Introduce support in scheduler periodic tick and other CFS bookkeeping
APIs to trigger the process of computing average thermal pressure for a
CPU. Also consider avg_thermal.load_avg in others_have_blocked which
allows for decay of pelt signals.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0b07e1ccac5d97bf63df30b1a9e36c99a1d59390
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:09 2020 -0500
arm/topology: Populate arch_scale_thermal_pressure() for ARM platforms
Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-6-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit f4dd9d9f8a6e9eb80615f3cf46b33a823f743e3a
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:08 2020 -0500
arm64/topology: Populate arch_scale_thermal_pressure() for arm64 platforms
Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-5-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 8d62e23c1aff88f3ea55d7bb1f8f69295e150189
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:07 2020 -0500
drivers/base/arch_topology: Add infrastructure to store and update instantaneous thermal pressure
Add architecture specific APIs to update and track thermal pressure on a
per CPU basis. A per CPU variable thermal_pressure is introduced to keep
track of instantaneous per CPU thermal pressure. Thermal pressure is the
delta between maximum capacity and capped capacity due to a thermal event.
topology_get_thermal_pressure can be hooked into the scheduler specified
arch_scale_thermal_pressure to retrieve instantaneous thermal pressure of
a CPU.
arch_set_thermal_pressure can be used to update the thermal pressure.
Considering topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into thermal_pressure, one can argue for
some sort of locking mechanism to avoid a stale value. But considering
topology_get_thermal_pressure can be called from a system critical path
like scheduler tick function, a locking mechanism is not ideal. This means
that it is possible the thermal_pressure value used to calculate average
thermal pressure for a CPU can be stale for up to 1 tick period.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-4-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 92727ae00ca5d23464f5d7ec7702bf62ab4ba026
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:06 2020 -0500
sched/topology: Add callback to read per CPU thermal pressure
Introduce the arch_scale_thermal_pressure() callback to retrieve per CPU thermal
pressure.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-3-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 62abba1d3769cd0c8566f1d354035499fb6cb8a0
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Fri Feb 21 19:52:05 2020 -0500
sched/pelt: Add support to track thermal pressure
Extrapolating on the existing framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a CPU running a RT/DL task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
thermal pressure signal is weighted time "delta" capacity unlike util_avg
which is binary. "delta capacity" here means delta between the actual
capacity of a CPU and the decreased capacity a CPU due to a thermal event.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org
Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 82a4f68550bde32a72afb0085c8250c7dd672d21
Author: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Date: Thu Feb 4 15:52:03 2021 +0530
sched: fair: consider all running tasks in cpu for load balance
Load_balancer considers only cfs running tasks for finding busiest cpu
to do load balancing. But cpu may be busy with other type tasks (ex: RT),
then that cpu might not selected as busy cpu due to weight vs nr_run
checks fails). And possibley cfs tasks running on that cpu would suffer
till other type tasks finishes or weight checks passes, while other cpus
sitting idle and not able to do load balance.
So, consider all running tasks to check cpu busieness.
Change-Id: Iddf3f668507e20359f6388fc30ff5897d234c902
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
commit 21028ad73e84230251c5704576d5c97eb4390e31
Author: Lucas Stach <l.stach@pengutronix.de>
Date: Mon Aug 31 13:07:19 2020 +0200
sched/deadline: Fix stale throttling on de-/boosted tasks
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.
Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is runnable,
but not actually on the rq and neither a immediate replenishment happens,
nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.
Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
commit 0541494a6aa9d50deb09f922e74c12b7d319e211
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Sep 21 09:24:22 2020 +0200
sched/fair: Reduce minimal imbalance threshold
The 25% default imbalance threshold for DIE and NUMA domain is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The imbalance of
20% between the 2 groups of 4 cores is just low enough to not trigger
the load balance between the 2 groups. We will have always the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUS. With a fair time sharing in each group, we ends up with
+20% running time for the group of 5 threads.
Consider decreasing the imbalance threshold for overloaded case where we
use the load to balance task and to ensure fair time sharing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
commit ae5539ae9bd8eb9319a3a9773f1df1aa3daacbcb
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Sep 21 09:24:24 2020 +0200
sched/fair: Reduce busy load balance interval
The busy_factor, which increases load balance interval when a cpu is busy,
is set to 32 by default. This value generates some huge LB interval on
large system like the THX2 made of 2 node x 28 cores x 4 threads.
For such system, the interval increases from 112ms to 3584ms at MC level.
And from 228ms to 7168ms at NUMA level.
Even on smaller system, a lower busy factor has shown improvement on the
fair distribution of the running time so let reduce it for all.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
commit e644e2307fbc7b33c1baae6e8e50517b0a5924c0
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Wed Dec 11 11:38:50 2019 +0000
BACKPORT: sched/fair: Make task_fits_capacity() consider uclamp restrictions
task_fits_capacity() drives CPU selection at wakeup time, and is also used
to detect misfit tasks. Right now it does so by comparing task_util_est()
with a CPU's capacity, but doesn't take into account uclamp restrictions.
There's a few interesting uses that can come out of doing this. For
instance, a low uclamp.max value could prevent certain tasks from being
flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
Similarly, a high uclamp.min value would steer tasks towards high capacity
CPUs at wakeup (and, should that fail, later steered via misfit balancing),
so such "boosted" tasks would favor CPUs of higher capacity.
Introduce uclamp_task_util() and make task_fits_capacity() use it.
[QP: fixed missing dependency on fits_capacity() by using the open coded
alternative]
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a7008c07a568278ed2763436404752a98004c7ff)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Iabde2eda7252c3bcc273e61260a7a12a7de991b1
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fed39a8a5b86100174489179f80a7514c7a85c4b
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Sep 27 17:58:58 2018 +0100
BACKPORT: ANDROID: sched/core: Move SchedTune task API into UtilClamp wrappers
The main SchedTune API calls realted to task tuning attributes are now
wrapped by more generic and mainlinish UtilClamp calls.
The new APIs are:
- uclamp_task(p) <= boosted_task_util(p)
- uclamp_boosted(p) <= schedtune_task_boost(p) > 0
- uclamp_latency_sensitive(p) <= schedtune_prefer_idle(p)
Let's provide also an implementation of the same API based on the new
uclamp.uclamp_latency_sensitive flag.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[Modified the patch to use uclamp.latency_sensitive instead mainline
attributes]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ib1a6902e1c07a82a370e36bf1776d895b7528cbc
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0152bf348f6c517ba445393032bb9049c533aa27
Author: Rick Yiu <rickyiu@google.com>
Date: Mon Jun 15 22:03:08 2020 +0800
ANDROID: sched/tune: Consider stune boost margin when computing energy
If CONFIG_SCHED_TUNE is enabled, it does not use boosted cpu util
to compute energy, so it could not reflect the real freq when a
cpu has boosted tasks on it. Addressing it by adding boost margin
if type is FREQUENCY_UTIL in schedutil_cpu_util().
Bug: 158637636
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I570920cb1e67d07de87006fca058d50e9358b7cd
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a07859716117b26c7794a22ec3a4dec64b618ba0
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Apr 4 10:24:43 2019 +0100
ANDROID: sched/tune: Move SchedTune cpu API into UtilClamp wrappers
The SchedTune CPU boosting API is currently used from sugov_get_util()
to get the boosted utilization and to pass it into schedutil_cpu_util().
When UtilClamp is in use instead we call schedutil_cpu_util() by
passing in just the CFS utilization and the clamping is done internally
on the aggregated CFS+RT utilization for FREQUENCY_UTIL calls.
This asymmetry is not required moreover, schedutil code is polluted by
non-mainline SchedTune code.
Wrap SchedTune API call related to cpu utilization boosting with a more
generic and mainlinish UtilClamp call:
- uclamp_rq_util_with(cpu, util, p) <= boosted_cpu_util(cpu)
This new API is already used in schedutil_cpu_util() to clamp the
aggregated RT+CFS utilization on FREQUENCY_UTIL calls.
Move the cpu boosting into uclamp_rq_util_with() so that we remove any
SchedTune specific bit from kernel/sched/cpufreq_schedutil.c.
Get rid of the no more required boosted_cpu_util(cpu) method and replace
it with a stune_util(cpu, util) which signature is better aligned with
its uclamp_rq_util_with(cpu, util, p) counterpart.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I45b0f0f54123fe0a2515fa9f1683842e6b99234f
[Removed superfluous __maybe_unused for capacity_orig_of]
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 75d6dd5a51250b259269afa979ee3fdd2f089db2
Author: Quentin Perret <qperret@google.com>
Date: Thu Jun 10 15:13:06 2021 +0000
ANDROID: sched/core: Make uclamp changes depend on CAP_SYS_NICE
There is currently nothing preventing tasks from changing their per-task
clamp values in anyway that they like. The rationale is probably that
system administrators are still able to limit those clamps thanks to the
cgroup interface. However, this causes pain in a system where both
per-task and per-cgroup clamp values are expected to be under the
control of core system components (as is the case for Android).
To fix this, let's require CAP_SYS_NICE to change per-task clamp values.
There are ongoing discussions upstream about more flexible approaches
than this using the RLIMIT API -- see [1]. But the upstream discussion
has not converged yet, and this is way too late for UAPI changes in
android12-5.10 anyway, so let's apply this change which provides the
behaviour we want without actually impacting UAPIs.
[1] https://lore.kernel.org/lkml/20210623123441.592348-4-qperret@google.com/
Bug: 187186685
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I749312a77306460318ac5374cf243d00b78120dd
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b6e71f141280f787c584a8c597d7898dbff28d10
Author: Quentin Perret <qperret@google.com>
Date: Thu Aug 5 11:21:53 2021 +0100
BACKPORT: sched/uclamp: Fix UCLAMP_FLAG_IDLE setting
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.
However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,ind}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.
Fix this by clearing the flag in update_uclamp_active() as well.
Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
Signed-off-by: RuRuTiaSaMa <1009087450@qq.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c0fda42ce6b4f1af0cd02cdf259aff2aaf8038cc
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Fri Nov 13 12:34:54 2020 +0100
UPSTREAM: sched/uclamp: Allow to reset a task uclamp constraint value
In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.
The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has be used in conjunction with the existing uclamp related flags.
The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c2295e608322cc2ed46f7bc785e9e092813a630f
Author: YueHaibing <yuehaibing@huawei.com>
Date: Tue Sep 22 21:24:10 2020 +0800
UPSTREAM: sched/core: Remove unused inline function uclamp_bucket_base_value()
There is no caller in tree, so can remove it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 384e707b762fe96e3257dc45fdd06f102077e6f0
Author: Qinglang Miao <miaoqinglang@huawei.com>
Date: Sat Jul 25 16:56:29 2020 +0800
UPSTREAM: sched/uclamp: Remove unnecessary mutex_init()
The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(),
it is unnecessary to initialize it runtime via mutex_init().
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a15a1c8215338f6381551dd7e42fe8103da1b104
Author: Qais Yousef <qais.yousef@arm.com>
Date: Thu Jul 16 12:03:45 2020 +0100
UPSTREAM: sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.
To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 7f15a8da33343d817d828b5ca95d207e4a89495f
Author: Qais Yousef <qais.yousef@arm.com>
Date: Thu Dec 2 11:20:33 2021 +0000
UPSTREAM: sched/uclamp: Fix rq->uclamp_max not set on first enqueue
Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.
The code was relying on rq->uclamp_max initialized to zero, so on first
enqueue
static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
enum uclamp_id clamp_id)
{
...
if (uc_se->value > READ_ONCE(uc_rq->value))
WRITE_ONCE(uc_rq->value, uc_se->value);
}
was actually resetting it. But since commit d81ae8aac85c changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither above code path nor uclamp_idle_reset()
update the rq->uclamp_max on first wake up from idle.
This is only visible from first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024 since only then its uclamp_max will be
effectively ignored.
Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.
Bug: 254441685
Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
(cherry picked from commit 315c4f884800c45cb6bd8c90422fad554a8b9588)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I621fc463a3e51361516c2479aff6c80213aaf918
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3aa7ab1653f0b9cb224c095c7f9cfa92782749f0
Author: Qais Yousef <qais.yousef@arm.com>
Date: Thu Jun 17 17:51:55 2021 +0100
UPSTREAM: sched/uclamp: Fix uclamp_tg_restrict()
Now cpu.uclamp.min acts as a protection, we need to make sure that the
uclamp request of the task is within the allowed range of the cgroup,
that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
tg->uclamp[UCLAMP_MAX].
As reported by Xuewen [1] we can have some corner cases where there's
inversion between uclamp requested by task (p) and the uclamp values of
the taskgroup it's attached to (tg). Following table demonstrates
2 corner cases:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 60%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 20%
-----------+-----+------+-----------
With this fix we get:
| p | tg | effective
-----------+-----+------+-----------
CASE 1
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
CASE 2
-----------+-----+------+-----------
uclamp_min | 0% | 30% | 30%
-----------+-----+------+-----------
uclamp_max | 20% | 50% | 30%
-----------+-----+------+-----------
Additionally uclamp_update_active_tasks() must now unconditionally
update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
instance could have an impact on the effective UCLAMP_MIN of the tasks.
| p | tg | effective
-----------+-----+------+-----------
old
-----------+-----+------+-----------
uclamp_min | 60% | 0% | 50%
-----------+-----+------+-----------
uclamp_max | 80% | 50% | 50%
-----------+-----+------+-----------
*new*
-----------+-----+------+-----------
uclamp_min | 60% | 0% | *60%*
-----------+-----+------+-----------
uclamp_max | 80% |*70%* | *70%*
-----------+-----+------+-----------
[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
Bug: 254441685
Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
(cherry picked from commit 0213b7083e81f4acd69db32cb72eb4e5f220329a)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I128d75fea2900ec7bc360b44f18cada76c968578
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 02bc00f504e7174e58fbf90c176d37c91e1e759e
Author: Qais Yousef <qais.yousef@arm.com>
Date: Thu Jul 16 12:03:47 2020 +0100
UPSTREAM: sched/uclamp: Fix a deadlock when enabling uclamp static key
The following splat was caught when setting uclamp value of a task:
BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
cpus_read_lock+0x68/0x130
static_key_enable+0x1c/0x38
__sched_setscheduler+0x900/0xad8
Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()
Bug: 254441685
Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
(cherry picked from commit e65855a52b479f98674998cb23b21ef5a8144b04)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I9b33882f72b2f5a8bb8a1e077e7785f3462d1cee
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 557b3cd9f47d60ea7d5bf8be9f6d06e34786a8ec
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Wed Jun 30 22:12:04 2021 +0800
UPSTREAM: sched/uclamp: Ignore max aggregation if rq is idle
When a task wakes up on an idle rq, uclamp_rq_util_with() would max
aggregate with rq value. But since there is no task enqueued yet, the
values are stale based on the last task that was running. When the new
task actually wakes up and enqueued, then the rq uclamp values should
reflect that of the newly woken up task effective uclamp values.
This is a problem particularly for uclamp_max because it default to
1024. If a task p with uclamp_max = 512 wakes up, then max aggregation
would ignore the capping that should apply when this task is enqueued,
which is wrong.
Fix that by ignoring max aggregation if the rq is idle since in that
case the effective uclamp value of the rq will be the ones of the task
that will wake up.
Bug: 254441685
Fixes: 9d20ad7dfc9a ("sched/uclamp: Add uclamp_util_with()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
[qias: Changelog]
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
(cherry picked from commit 3e1493f46390618ea78607cb30c58fc19e2a5035)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I6ea180d854d9d8ffa94abdac4800c9cb130f77cf
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 04eeff8cf3fc25bde9471441080f3900ccb09c3a
Author: Qais Yousef <qais.yousef@arm.com>
Date: Mon May 10 15:50:32 2021 +0100
UPSTREAM: sched/uclamp: Fix locking around cpu_util_update_eff()
cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
uclamp_mutex or rcu_read_lock() like other call sites, which is
a mistake.
The uclamp_mutex is required to protect against concurrent reads and
writes that could update the cgroup hierarchy.
The rcu_read_lock() is required to traverse the cgroup data structures
in cpu_util_update_eff().
Surround the caller with the required locks and add some asserts to
better document the dependency in cpu_util_update_eff().
Bug: 254441685
Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com
(cherry picked from commit 93b73858701fd01de26a4a874eb95f9b7156fd4b)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I6b11073f23f58ce4c2415cdfc46140a60e3411a2
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b60f7bed3b76a71ec848c5b690f6927bf36e2f0e
Author: Qais Yousef <qais.yousef@arm.com>
Date: Mon May 10 15:50:31 2021 +0100
UPSTREAM: sched/uclamp: Fix wrong implementation of cpu.uclamp.min
cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model
Documentation/admin-guide/cgroup-v2.rst
which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.
But the current implementation makes it a limit, which is not what was
intended.
For example:
tg->cpu.uclamp.min = 20%
p0->uclamp[UCLAMP_MIN] = 0
p1->uclamp[UCLAMP_MIN] = 50%
Previous Behavior (limit):
p0->effective_uclamp = 0
p1->effective_uclamp = 20%
New Behavior (Protection):
p0->effective_uclamp = 20%
p1->effective_uclamp = 50%
Which is inline with how protections should work.
With this change the cgroup and per-task behaviors are the same, as
expected.
Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.
We don't want for example RT tasks that are boosted by default to max to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.
Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.
As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.
The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.
Bug: 254441685
Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
(cherry picked from commit 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I9f9f7b9e7ef3d19ccb1685f271639c9ed76b580f
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b2135621698d06e6a73b0f4ef15529321cc68e31
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Jun 30 12:21:23 2020 +0100
BACKPORT: sched/uclamp: Protect uclamp fast path code with static key
There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.
https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/
While investigating the root cause, there were no sign that the uclamp
code is doing anything particularly expensive but could suffer from bad
cache behavior under certain circumstances that are yet to be
understood.
https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
To reduce the pressure on the fast path anyway, add a static key that is
by default will skip executing uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.
As soon as the user start using util clamp by:
1. Changing uclamp value of a task with sched_setattr()
2. Modifying the default sysctl_sched_util_clamp_{min, max}
3. Modifying the default cpu.uclamp.{min, max} value in cgroup
We flip the static key now that the user has opted to use util clamp.
Effectively re-introducing uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.
This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.
SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end
up with unbalanced call to uclamp_rq_dec_id() if we flip the key while
a task is running in the rq. Since we know it is harmless we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.
In schedutil, we introduce a new uclamp_is_enabled() helper which takes
the static key into account to ensure RT boosting behavior is retained.
The following results demonstrates how this helps on 2 Sockets Xeon E5
2x10-Cores system.
nouclamp uclamp uclamp-static-key
Hmean send-64 162.43 ( 0.00%) 157.84 * -2.82%* 163.39 * 0.59%*
Hmean send-128 324.71 ( 0.00%) 314.78 * -3.06%* 326.18 * 0.45%*
Hmean send-256 641.55 ( 0.00%) 628.67 * -2.01%* 648.12 * 1.02%*
Hmean send-1024 2525.28 ( 0.00%) 2448.26 * -3.05%* 2543.73 * 0.73%*
Hmean send-2048 4836.14 ( 0.00%) 4712.08 * -2.57%* 4867.69 * 0.65%*
Hmean send-3312 7540.83 ( 0.00%) 7425.45 * -1.53%* 7621.06 * 1.06%*
Hmean send-4096 9124.53 ( 0.00%) 8948.82 * -1.93%* 9276.25 * 1.66%*
Hmean send-8192 15589.67 ( 0.00%) 15486.35 * -0.66%* 15819.98 * 1.48%*
Hmean send-16384 26386.47 ( 0.00%) 25752.25 * -2.40%* 26773.74 * 1.47%*
The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:
8.73% -1.55% [kernel.kallsyms] [k] try_to_wake_up
0.07% +0.04% [kernel.kallsyms] [k] deactivate_task
0.13% -0.02% [kernel.kallsyms] [k] activate_task
The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:
8.73% -0.72% [kernel.kallsyms] [k] try_to_wake_up
0.13% +0.39% [kernel.kallsyms] [k] activate_task
0.07% +0.38% [kernel.kallsyms] [k] deactivate_task
Bug: 254441685
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
(cherry picked from commit 46609ce227039fd192e0ecc7d940bed587fd2c78)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I80555c22b856fbbd46692f83d501f03b6f393c35
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 322f697f698c34fd19271a1c21e756bbe0752aad
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Jun 30 12:21:22 2020 +0100
BACKPORT: sched/uclamp: Fix initialization of struct uclamp_rq
struct uclamp_rq was zeroed out entirely in assumption that in the first
call to uclamp_rq_inc() they'd be initialized correctly in accordance to
default settings.
But when next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because the
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
means all rqs are capped to 0 by default.
Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.
Bug: 254441685
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
(cherry picked from commit d81ae8aac85ca2e307d273f6dc7863a721bf054e)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I014101dbe53c85f412f87f7f6937b18d2f141800
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 416ea5e54d1471df3352093b1aa0c3b7a4341a8f
Author: Quentin Perret <qperret@google.com>
Date: Fri Apr 30 15:14:12 2021 +0000
UPSTREAM: sched: Fix out-of-bound access in uclamp
Util-clamp places tasks in different buckets based on their clamp values
for performance reasons. However, the size of buckets is currently
computed using a rounding division, which can lead to an off-by-one
error in some configurations.
For instance, with 20 buckets, the bucket size will be 1024/20=51. A
task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly,
correct indexes are in range [0,19], hence leading to an out of bound
memory access.
Clamp the bucket id to fix the issue.
Bug: 186415778
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Suggested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210430151412.160913-1-qperret@google.com
(cherry picked from commit 6d2f8909a5fabb73fe2a63918117943986c39b6c)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I8097f5ed34abcff36c5ed395643d65727ea969eb
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 41abff075e5f46de03fcb89a58b41a73c3f137a6
Author: Quentin Perret <qperret@google.com>
Date: Thu Apr 16 09:59:56 2020 +0100
BACKPORT: sched/core: Fix reset-on-fork from RT with uclamp
uclamp_fork() resets the uclamp values to their default when the
reset-on-fork flag is set. It also checks whether the task has a RT
policy, and sets its uclamp.min to 1024 accordingly. However, during
reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
hence leading to an erroneous uclamp.min setting for the new task if it
was forked from RT.
Fix this by removing the unnecessary check on rt_task() in
uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
set.
[ qperret: BACKPORT because of a conflict with the Android-specific
SUGOV_RT_MAX_FREQ sched_feat, which is equally unnecessary in this
path ]
Bug: 120440300
Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks")
Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
(cherry picked from commit eaf5a92ebde5bca3bb2565616115bd6d579486cd)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I9a19ac5474d0508b8437e4a1d859573b4106ed08
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c719bd9543490474cb07d09d04cf7a01780c7a28
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Jan 14 21:09:47 2020 +0000
UPSTREAM: sched/uclamp: Reject negative values in cpu_uclamp_write()
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison
7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }
# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96
Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().
# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of range
Bug: 120440300
Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
(cherry picked from commit b562d140649966d4daedd0483a8fe59ad3bb465a)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I17fc2b119dcbffb212e130ed2c37ae3a8d5bbb61
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit bea7af6318b69aed941c04b64fb7aec6760692fa
Author: Quentin Perret <quentin.perret@arm.com>
Date: Tue Jul 30 13:54:00 2019 +0100
BACKPORT: ANDROID: sched/core: Add a latency-sensitive flag to uclamp
Add a 'latency_sensitive' flag to uclamp in order to express the need
for some tasks to find a CPU where they can wake-up quickly. This is not
expected to be used without cgroup support, so add solely a cgroup
interface for it.
As this flag represents a boolean attribute and not an amount of
resources to be shared, it is not clear what the delegation logic should
be. As such, it is kept simple: every new cgroup starts with
latency_sensitive set to false, regardless of the parent.
In essence, this is similar to SchedTune's prefer-idle flag which was
used in android-4.19 and prior.
Bug: 120440300
Change-Id: I722d8ecabb428bb7b95a5b54bc70a87f182dde2a
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
(cherry picked from commit ad7dd648fc7dbe11f23673a3463af2468a274998)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit dcf228d2be609f20fb826ec420e1af7074b3265a
Author: Li Guanglei <guanglei.li@unisoc.com>
Date: Wed Dec 25 15:44:04 2019 +0800
FROMGIT: sched/core: Fix size of rq::uclamp initialization
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.
Bug: 120440300
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcountinga")
Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
(cherry picked from commit dcd6dffb0a75741471297724640733fa4e958d72
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Id36a2b77c45e586535e8fadfb7d66868ca8fe8c7
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 2b98fd7366175ec1903c23885f0357b51bc4c929
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Dec 24 11:54:04 2019 +0000
FROMGIT: sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.
Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Bug: 120440300
Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
(cherry picked from commit 7226017ad37a888915628e59a84a2d1e57b40707
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I9636c60e04d58bbfc5041df1305b34a12b5a3f46
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit ca1d456778b535f2b7aa6c473d4d67e4ece5a6d0
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Wed Dec 11 11:38:49 2019 +0000
FROMGIT: sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
The current helper returns (CPU) rq utilization with uclamp restrictions
taken into account. A uclamp task utilization helper would be quite
helpful, but this requires some renaming.
Prepare the code for the introduction of a uclamp_task_util() by renaming
the existing uclamp_util_with() to uclamp_rq_util_with().
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit d2b58a286e89824900d501db0be1d4f6aed474fc
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I3e7146b788e079e400167203df5e5dadee2fd232
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3b187fe5f97ec4e11f313001b70ce174b068efb8
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Wed Dec 11 11:38:48 2019 +0000
FROMGIT: sched/uclamp: Make uclamp util helpers use and return UL values
Vincent pointed out recently that the canonical type for utilization
values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
cache optimization, but this doesn't have to be exported to its users.
Make the uclamp helpers that deal with utilization use and return unsigned
long values.
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 686516b55e98edf18c2a02d36aaaa6f4c0f6c39c
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Id3837f12237e5b77eb3a236bd32457dcd7de743e
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 9f7cbbc57004a73d8550358a95c803f49a4067bb
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Wed Dec 11 11:38:47 2019 +0000
FROMGIT: sched/uclamp: Remove uclamp_util()
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
uclamp_util_with() instead in commit:
af24bde8df20 ("sched/uclamp: Add uclamp support to energy_compute()")
From then on, uclamp_util() has remained unused. Being a simple wrapper
around uclamp_util_with(), we can get rid of it and win back a few lines.
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59fe675248ffc37d4167e9ec6920a2f3d5ec67bb
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I11dbff80c6c4be9666438800b2527aca8cd24025
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 1e2856646d5f69ab499a1c5cd9f2e3376b7a1ef9
Author: Qais Yousef <qais.yousef@arm.com>
Date: Wed Oct 9 11:46:11 2019 +0100
BACKPORT: sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.
Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
[ mingo: Minor edits to comments. ]
Bug: 120440300
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 804d402fb6f6487b825aae8cf42fda6426c62867
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git)
[Qais: resolved minor conflict in kernel/sched/cpupri.c]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ifc9da1c47de1aec9b4d87be2614e4c8968366900
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fbcf2f9dcec79826507ced228a9278ae404c2a8a
Author: Valentin Schneider <valentin.schneider@arm.com>
Date: Fri Nov 15 10:39:08 2019 +0000
UPSTREAM: sched/uclamp: Fix overzealous type replacement
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit
0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.
Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.
Bug: 120440300
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7763baace1b738d65efa46d68326c9406311c6bf)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I924a99c125372a8fca81cb4bc0c82e6a7183fc8a
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit b52d3d696bf2b1aecece38539b2fcf4f7f8445f5
Author: Qais Yousef <qais.yousef@arm.com>
Date: Thu Nov 14 21:10:52 2019 +0000
UPSTREAM: sched/uclamp: Fix incorrect condition
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.
[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/
Bug: 120440300
Reported-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6e1ff0773f49c7d38e8b4a9df598def6afb9f415)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I51b58a6089290277e08a0aaa72b86f852eec1512
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit bbe475a9c1f66f4da03c5870aff5883fed09eb77
Author: Qais Yousef <qais.yousef@arm.com>
Date: Tue Nov 5 11:22:12 2019 +0000
UPSTREAM: sched/core: Fix compilation error when cgroup not selected
When cgroup is disabled the following compilation error was hit
kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
struct css_task_iter it;
^~
kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
css_task_iter_start(css, 0, &it);
^~~~~~~~~~~~~~~~~~~
__sg_page_iter_start
kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
while ((p = css_task_iter_next(&it))) {
^~~~~~~~~~~~~~~~~~
__sg_page_iter_next
kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
css_task_iter_end(&it);
^~~~~~~~~~~~~~~~~
get_task_cred
kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
struct css_task_iter it;
^~
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1
Fix by protetion uclamp_update_active_tasks() with
CONFIG_UCLAMP_TASK_GROUP
Bug: 120440300
Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
(cherry picked from commit e3b8b6a0d12cccf772113d6b5c1875192186fbd4)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ia4c0f801d68050526f9f117ec9189e448b01345a
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 35f8e35625c692abfc81b698e5112058d37dacad
Author: Ingo Molnar <mingo@kernel.org>
Date: Wed Sep 4 09:55:32 2019 +0200
UPSTREAM: sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code
Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels:
$ chrt -p $$
chrt: failed to get pid 26306's policy: Argument list too long
and he has root-caused the bug to the following commit increasing sched_attr
size and breaking sched_read_attr() into returning -EFBIG:
a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
The other, bigger bug is that the whole sched_getattr() and sched_read_attr()
logic of checking non-zero bits in new ABI components is arguably broken,
and pretty much any extension of the ABI will spuriously break the ABI.
That's way too fragile.
Instead implement the perf syscall's extensible ABI instead, which we
already implement on the sched_setattr() side:
- if user-attributes have the same size as kernel attributes then the
logic is unchanged.
- if user-attributes are larger than the kernel knows about then simply
skip the extra bits, but set attr->size to the (smaller) kernel size
so that tooling can (in principle) handle older kernel as well.
- if user-attributes are smaller than the kernel knows about then just
copy whatever user-space can accept.
Also clean up the whole logic:
- Simplify the code flow - there's no need for 'ret' for example.
- Standardize on 'kattr/uattr' and 'ksize/usize' naming to make sure we
always know which side we are dealing with.
- Why is it called 'read' when what it does is to copy to user? This
code is so far away from VFS read() semantics that the naming is
actively confusing. Name it sched_attr_copy_to_user() instead, which
mirrors other copy_to_user() functionality.
- Move the attr->size assignment from the head of sched_getattr() to the
sched_attr_copy_to_user() function. Nothing else within the kernel
should care about the size of the structure.
With these fixes the sched_getattr() syscall now nicely supports an
extensible ABI in both a forward and backward compatible fashion, and
will also fix the chrt bug.
As an added bonus the bogus -EFBIG return is removed as well, which as
Thadeu noted should have been -E2BIG to begin with.
Bug: 120440300
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
Link: https://lkml.kernel.org/r/20190904075532.GA26751@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1251201c0d34fadf69d56efa675c2b7dd0a90eca)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I67e653c4f69db0140e9651c125b60e2b8cfd62f1
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 2e5fab725ddfc0db94558a0eb68fbb7584ab1a93
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:11 2019 +0100
UPSTREAM: sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
The supported clamp indexes are defined in 'enum clamp_id', however, because
of the code logic in some of the first utilization clamping series version,
sometimes we needed to use 'unsigned int' to represent indices.
This is not more required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.
Fix it with a bulk rename now that we have all the bits merged.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0413d7f33e60751570fd6c179546bde2f7d82dcb)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I0be680b2489fa07244bac63b5c6fe1a79a53bef7
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 27326e00f48ecb7241256b777090282295061a10
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:10 2019 +0100
UPSTREAM: sched/uclamp: Update CPU's refcount on TG's clamp changes
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.
Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit babbe170e053c6ec2343751749995b7b9fd5fd2c)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I5e48891bd48c266dd282e1bab8f60533e4e29b48
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit e66ba8289ce3a479f0c6c52d25307c4d2b22f256
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:09 2019 +0100
UPSTREAM: sched/uclamp: Use TG's clamps to restrict TASK's clamps
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.
Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:
1. ensure cgroup clamps are always used to restrict task specific requests,
i.e. boosted not more than its TG effective protection and capped at
least as its TG effective limit.
2. implement a "nice-like" policy, where tasks are still allowed to request
less than what enforced by their TG effective limits and protections
Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.
Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3eac870a324728e5d17118888840dad70bcd37f3)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I0215e0a68cc0fa7c441e33052757f8571b7c99b9
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d26df4b554e45013689e091881875e767cc55122
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:08 2019 +0100
UPSTREAM: sched/uclamp: Propagate system defaults to the root group
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:
- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.
- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.
When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.
Utilization clamping supports already the concepts of:
- system defaults: which define the maximum possible clamp values
usable by tasks.
- effective clamps: which allows a parent cgroup to constraint (maybe
temporarily) its descendants without losing the information related
to the values "requested" from them.
Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ibf7ce5c46b67c79765b56b792ee22ed9595802c3
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 46b50a614c89320b23ab7615fb3cd92083c38bcf
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:07 2019 +0100
UPSTREAM: sched/uclamp: Propagate parent clamps
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.
Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.
Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).
Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0b60ba2dd342016e4e717dbaa4ca9af3a43f4434)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: If1cc136e1fb4a8f4c6ea15dc440b28d833a8d7e7
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 067c0b19313dc9e3730abf9ceb5942420a2c3fdd
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Thu Aug 22 14:28:06 2019 +0100
BACKPORT: sched/uclamp: Extend CPU's cgroup controller
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.
Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.
Specifically:
- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization
- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depends on cgroups. This system wide
interface enforces constraints on tasks in the root node.
b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent
c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing to control and restrict task requests.
Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.
Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2480c093130f64ac3a410504fa8b3db1fc4b87ce)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I0285c44910bf073b80d7996361e6698bc5aedfae
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 64b042aac4283cf4456f213f3546daee0c731785
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:11 2019 +0100
UPSTREAM: sched/uclamp: Add uclamp_util_with()
So far uclamp_util() allows to clamp a specified utilization considering
the clamp values requested by RUNNABLE tasks in a CPU. For the Energy
Aware Scheduler (EAS) it is interesting to test how clamp values will
change when a task is becoming RUNNABLE on a given CPU.
For example, EAS is interested in comparing the energy impact of
different scheduling decisions and the clamp values can play a role on
that.
Add uclamp_util_with() which allows to clamp a given utilization by
considering the possible impact on CPU clamp values of a specified task.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9d20ad7dfc9a5cc64e33d725902d3863d350a66a)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ida153a3526b87f5674a6e037d4725d99eec7b478
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0fc858cc7a9712c08da29adc187e69279768a6e4
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:10 2019 +0100
BACKPORT: sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints. This will allow, for example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
frequency selection thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.
Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 982d9cdc22c9f6df5ad790caa229ff74fb1d95e7)
Conflicts:
kernel/sched/cpufreq_schedutil.c
1. Merged the if condition to include the non-upstream
sched_feat(SUGOV_RT_MAX_FREQ) check
2. Change the function signature to pass util_cfs and define
util as an automatic variable.
Bug: 120440300
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ie222c9ad84776fc2948e30c116eee876df697a17
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c927abaedd42cf29f8e3dc92613e84c4375e4773
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:09 2019 +0100
UPSTREAM: sched/uclamp: Set default clamps for RT tasks
By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks which instead are
expected to run at the maximum available frequency, if not otherwise
required by explicitly capping them.
Enforce the correct behavior for RT tasks by setting util_min to max
whenever:
1. the task is switched to the RT class and it does not already have a
user-defined clamp value assigned.
2. an RT task is forked from a parent with RESET_ON_FORK set.
NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-9-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1a00d999971c78ab024a17b0efc37d78404dd120)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I81fcadaea34f557e531fa5ac6aab84fcb0ee37c7
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit bd6028f582359eac294fe7363206330696c84b9c
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:08 2019 +0100
UPSTREAM: sched/uclamp: Reset uclamp values on RESET_ON_FORK
A forked tasks gets the same clamp values of its parent however, when
the RESET_ON_FORK flag is set on parent, e.g. via:
sys_sched_setattr()
sched_setattr()
__sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)
the new forked task is expected to start with all attributes reset to
default values.
Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-8-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a87498ace58e23b62a572dc7267579ede4c8495c)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: If7bda202707aac3a2696a42f8146f607cdd36905
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c713ddf1b1d0c7fac555eacf8579b8ffd5bdd96d
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:07 2019 +0100
BACKPORT: sched/uclamp: Extend sched_setattr() to support utilization clamping
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define tasks requirements that can translate into proper
decisions for both task placements and frequencies selections. Other
classes have a more simplified model based on the POSIX concept of
priorities.
Such a simple priority based model however does not allow to exploit
most advanced features of the Linux scheduler like, for example, driving
frequencies selection via the schedutil cpufreq governor. However, also
for non SCHED_DEADLINE tasks, it's still interesting to define tasks
properties to support scheduler decisions.
Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This allows to implement a "proactive" per-task
frequency control policy, a more advanced policy than the current one
based just on "passive" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).
Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows to
define task specific properties for different scheduling classes. A new
pair of attributes allows to specify a minimum and maximum utilization
the scheduler can consider for a task.
Do that by validating the required clamp values before and then applying
the required changes using _the_ same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the new
clamp values.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a509a7cd79747074a2c018a45bbbc52d1f4aed44)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I420e7ece5628bc639811a79654c35135a65bfd02
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 6306516692f50c3a2d2119142e393f5ef9f48259
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:06 2019 +0100
BACKPORT: sched/core: Allow sched_setattr() to use the current policy
The sched_setattr() syscall mandates that a policy is always specified.
This requires to always know which policy a task will have when
attributes are configured and this makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic tasks attributes is racy since
we cannot be sure it is not changed concurrently.
Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.
Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1d6362fa0cfc8c7b243fa92924429d826599e691)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I41cbe73d7aa30123adbd757fa30e346938651784
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 62d50c5a8556377176c6e3ef1226aa968e714730
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:05 2019 +0100
UPSTREAM: sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e8f14172c6b11e9a86c65532497087f8eb0f91b1)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I4f014c5ec9c312aaad606518f6e205fd0cfbcaa2
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 8bd52ebd588ad2c724fb789ee11b6e3179145eba
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:04 2019 +0100
UPSTREAM: sched/uclamp: Enforce last task's UCLAMP_MAX
When a task sleeps it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full not clamped blocked utilization of the
idle CPU.
Fix this by using:
uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.
Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e496187da71070687b55ff455e7d8d7d7f0ae0b9)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ie9eab897eb654ec9d4fba5eda20f66a91a712817
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit d3048c160f56d4003be0a9867b975140ea5efc0e
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:03 2019 +0100
UPSTREAM: sched/uclamp: Add bucket local max tracking
Because of bucketization, different task-specific clamp values are
tracked in the same bucket. For example, with 20% bucket size and
assuming to have:
Task1: util_min=25%
Task2: util_min=35%
both tasks will be refcounted in the [20..39]% bucket and always boosted
only up to 20% thus implementing a simple floor aggregation normally
used in histograms.
In systems with only few and well-defined clamp values, it would be
useful to track the exact clamp value required by a task whenever
possible. For example, if a system requires only 23% and 47% boost
values then it's possible to track the exact boost required by each
task using only 3 buckets of ~33% size each.
Introduce a mechanism to max aggregate the requested clamp values of
RUNNABLE tasks in the same bucket. Keep it simple by resetting the
bucket value to its base value only when a bucket becomes inactive.
Allow a limited and controlled overboosting margin for tasks recounted
in the same bucket.
In systems where the boost values are not known in advance, it is still
possible to control the maximum acceptable overboosting margin by tuning
the number of clamp groups. For example, 20 groups ensure a 5% maximum
overboost.
Remove the rq bucket initialization code since a correct bucket value
is now computed when a task is refcounted into a CPU's rq.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 60daf9c19410604f08c99e146bc378c8a64f4ccd)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I8782971f8867033cee5aaf981c96f9de33a5288c
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 016fa3c1ea0a4a5bed08ff564ae45903ce3dfb9b
Author: Patrick Bellasi <patrick.bellasi@arm.com>
Date: Fri Jun 21 09:42:02 2019 +0100
BACKPORT: sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.
When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.
Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).
A task has:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).
A runqueue has:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).
The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.
Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.
Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 69842cba9ace84849bb9b8edcdf2cefccd97901c)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I2c2c23572fb82e004f815cc9c783881355df6836
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit ff72a25515d4673ccdfe6a7a8606326725ee057f
Author: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Date: Wed May 22 15:56:14 2024 +0800
cpufreq: schedutil: Checkout to msm-4.19
* From: https://github.com/EmanuelCN/kernel_xiaomi_sm8250/blob/staging/kernel/sched/cpufreq_schedutil.c
* Removed sugov_get_util() under CONFIG_SCHED_WALT guard, because we don't use WALT anymore
* Preserved 819f63f and c2bdfaf, also introduce new definitions from 4.19 for new schedutil
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit eecd40ea2f0a9b22cd79f288da288e1d4513c086
Author: Viresh Kumar <viresh.kumar@linaro.org>
Date: Tue May 22 15:31:30 2018 +0530
cpufreq: Rename cpufreq_can_do_remote_dvfs()
This routine checks if the CPU running this code belongs to the policy
of the target CPU or if not, can it do remote DVFS for it remotely. But
the current name of it implies as if it is only about doing remote
updates.
Rename it to make it more relevant.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c201be3d3447e286e7e6924dc06fd61fe923954b
Author: kondors1995 <normandija1945@gmail.com>
Date: Mon Aug 5 10:10:39 2024 +0300
Revert "cpufreq: Avoid leaving stale IRQ work items during CPU offline"
This reverts commit c0079a7b3b.
commit 69078f1d44b561a0eddceb43873b2ad5a772fe9d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Feb 6 17:14:22 2019 +0100
sched/fair: Fix O(nr_cgroups) in the load balancing path
commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 upstream.
This re-applies the commit reverted here:
commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply:
commit a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 178af989ed8209fc342dfd89bdf23b935f7a2ce1
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Feb 6 17:14:21 2019 +0100
sched/fair: Optimize update_blocked_averages()
commit 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c upstream.
Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it will be added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.
With a normal decay of PELT, a parent will be empty and fully decayed
if all children are empty and fully decayed too. In such a case, we just
have to ensure that the whole branch will be added when a new task is
enqueued. This is default behavior since :
commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")
In case of throttling, the PELT of throttled cfs_rq will not be updated
whereas the parent will. This breaks the assumption made above unless we
remove the children of a cfs_rq that is throttled. Then, they will be
added back when unthrottled and a sched_entity will be enqueued.
As throttled cfs_rq are now removed from the list, we can remove the
associated test in update_blocked_averages().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 5f6df438eb114f68750cbaf73ff655cf09767aa9
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Jan 30 06:22:47 2019 +0100
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
commit f6783319737f28e4436a69611853a5a098cbe974 upstream.
Sargun reported a crash:
"I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
infinite loop in update_blocked_averages() by reverting a9e7f6544b
and put it on top of 4.19.13. In addition to this, I uninlined
list_add_leaf_cfs_rq for debugging.
This revealed a new bug that we didn't get to because we kept getting
crashes from the previous issue. When we are running with cgroups that
are rapidly changing, with CFS bandwidth control, and in addition
using the cpusets cgroup, we see this crash. Specifically, it seems to
occur with cgroups that are throttled and we change the allowed
cpuset."
The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
it will walk down to root the 1st time a cfs_rq is used and we will finish
to add either a cfs_rq without parent or a cfs_rq with a parent that is
already on the list. But this is not always true in presence of throttling.
Because a cfs_rq can be throttled even if it has never been used but other CPUs
of the cgroup have already used all the bandwdith, we are not sure to go down to
the root and add all cfs_rq in the list.
Ensure that all cfs_rq will be added in the list even if they are throttled.
[ mingo: Fix !CGROUPS build. ]
Reported-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: tj@kernel.org
Fixes: 9c2791f936 ("Fix hierarchical order in rq->leaf_cfs_rq_list")
Link: https://lkml.kernel.org/r/1548825767-10799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Janne Huttunen <janne.huttunen@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 43bf835d9112937ae2b7d07b2b0a64b53598623d
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jan 30 14:41:04 2019 +0100
sched/fair: Add tmp_alone_branch assertion
commit 5d299eabea5a251fbf66e8277704b874bbba92dc upstream.
The magic in list_add_leaf_cfs_rq() requires that at the end of
enqueue_task_fair():
rq->tmp_alone_branch == &rq->lead_cfs_rq_list
If this is violated, list integrity is compromised for list entries
and the tmp_alone_branch pointer might dangle.
Also, reflow list_add_leaf_cfs_rq() while there. This looses one
indentation level and generates a form that's convenient for the next
patch.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Janne Huttunen <janne.huttunen@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 4a1c8a809ec62e802d9c2b174bcddd90750e6805
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Jun 12 17:47:03 2020 +0200
sched/pelt: Cleanup PELT divider
Factorize in a single place the calculation of the divider to be used to
to compute *_avg from *_sum value
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200612154703.23555-1-vincent.guittot@linaro.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit ed092eafa66750326f15caa1a8e7c2f5f62f5951
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Wed Jun 3 10:03:01 2020 +0200
sched/pelt: Remove redundant cap_scale() definition
Besides in PELT cap_scale() is used in the Deadline scheduler class for
scale-invariant bandwidth enforcement.
Remove the cap_scale() definition in kernel/sched/pelt.c and keep the
one in kernel/sched/sched.h.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-2-dietmar.eggemann@arm.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a48e6b38b3b57cb00853bd405524f2158d454ba9
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Fri Apr 8 19:53:08 2022 +0800
UPSTREAM: sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
Since commit 23127296889f ("sched/fair: Update scale invariance of PELT")
change to use rq_clock_pelt() instead of rq_clock_task(), we should also
use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task
accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And
rename throttled_clock_task(_time) to be clock_pelt rather than clock_task.
Bug: 254441685
Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
(cherry picked from commit 64eaf50731ac0a8c76ce2fedd50ef6652aabc5ff)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I61e971d09f14708b8ee170fd5d5109144bba6e34
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 98290e08beea376902e4e3d2eaeb3851b0192f74
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Jan 23 16:26:53 2019 +0100
UPSTREAM: sched/fair: Update scale invariance of PELT
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.
The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :
U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
with U is the max util_avg value = SCHED_CAPACITY_SCALE
At a lower capacity, the range becomes:
U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
with C reflecting the compute capacity ratio between current capacity and
max capacity.
so C tries to compensate changes in (1-y^r') but it can't be accurate.
Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.
In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.
In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:
On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.
each test runs 16 times:
./perf bench sched pipe
(higher is better)
kernel tip/sched/core + patch
ops/seconds ops/seconds diff
cgroup
root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38%
level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57%
level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86%
hackbench -l 1000
(lower is better)
kernel tip/sched/core + patch
duration(sec) duration(sec) diff
cgroup
root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57%
level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60%
level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66%
Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:
Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms not reachable 276ms
486 (47.5%) 30ms 138ms 60ms
256 (25%) 13ms 32ms 26ms
On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.
Bug: 120440300
Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 23127296889fe84b0762b191b5d041e8ba6f2599)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0a7060dfca03fe7a078974f343372ad623e7a25d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Jan 23 16:26:52 2019 +0100
UPSTREAM: sched/fair: Move the rq_of() helper function
Move rq_of() helper function so it can be used in pelt.c
[ mingo: Improve readability while at it. ]
Bug: 120440300
Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 62478d9911fab9694c195f0ca8e4701de09be98e)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 4ed482487f9d5ad5d11417da5376c596ffc2956c
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Fri Aug 3 15:05:38 2018 +0100
UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.
se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).
There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:
(1) A task switches to SCHED_IDLE.
(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
(during which only its static priority gets set) switches to
SCHED_OTHER or SCHED_BATCH.
Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.
Bug: 120440300
Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a465e3ebbc8004ce4f7f08f6022ee8315a94edf)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a97016777175a6305b14bbae29f4ce0a3a3930cd
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Dec 14 23:10:06 2018 +0100
sched/pelt: Fix warning and clean up IRQ PELT config
Commit 11d4afd4ff667f9b6178ee8c142c36cb78bd84db upstream.
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.
Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:
2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 8c3f9eb95d52cb68c7f62e2cfcf99a662008e4f7
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jul 19 14:00:06 2018 +0200
sched/fair: Remove #ifdefs from scale_rt_capacity()
Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):
free *= (max - irq);
free /= max;
when irq is fixed to 0
Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit bf59401f1068438e811e2bb93f5b2ac3c88f1b52
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Aug 31 17:22:55 2018 +0200
sched/pelt: Fix update_blocked_averages() for RT and DL classes
update_blocked_averages() is called to periodiccally decay the stalled load
of idle CPUs and to sync all loads before running load balance.
When cfs rq is idle, it trigs a load balance during pick_next_task_fair()
in order to potentially pull tasks and to use this newly idle CPU. This
load balance happens whereas prev task from another class has not been put
and its utilization updated yet. This may lead to wrongly account running
time as idle time for RT or DL classes.
Test that no RT or DL task is running when updating their utilization in
update_blocked_averages().
We still update RT and DL utilization instead of simply skipping them to
make sure that all metrics are synced when used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit cc819e146c895dd29b3fe93fbd558ccd0a190ed6
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jun 28 17:45:12 2018 +0200
sched/core: Use PELT for scale_rt_capacity()
The utilization of the CPU by RT, DL and IRQs are now tracked with
PELT so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for CFS class.
scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for CFS instead of a scaling factor because RT, DL and
IRQ provide now absolute utilization value.
The same formula as schedutil is used:
IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit of the same optimization.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 0f6598d6f0def162745d059a67d1148cf42cfaa3
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jun 28 17:45:09 2018 +0200
sched/irq: Add IRQ utilization tracking
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.
That's also important to note that because:
rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is:
avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c616a7241810a508cb961a6e1658748113161221
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jun 28 17:45:07 2018 +0200
sched/dl: Add dl_rq utilization tracking
Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
tasks and the CFS's utilization might no longer describes the real
utilization level.
Current DL bandwidth reflects the requirements to meet deadline when tasks are
enqueued but not the current utilization of the DL sched class. We track
DL class utilization to estimate the system utilization.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit abc5cabb9b98b262baad2250bbafbe54d8f2fc3e
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jun 28 17:45:05 2018 +0200
sched/rt: Add rt_rq utilization tracking
schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS
tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks
are preempted by RT tasks and in this case util_avg reflects the remaining
capacity but not what CFS want to use. In such case, schedutil can select a
lower OPP whereas the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of RT tasks.
Only util_avg is correctly tracked but not load_avg and runnable_load_avg
which are useless for rt_rq.
rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 7ce46e2c80931a4da93dbe7674998fee851c8f1f
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Jun 28 17:45:04 2018 +0200
sched/pelt: Move PELT related code in a dedicated file
We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization to cfs tasks and make them lighter than they are.
As we want to use the same load tracking mecanism for both and prevent
useless dependency between cfs and rt code, PELT code is moved in a
dedicated file.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit a7703d111f4372d3eba166af9ebd9092ff3dc66b
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Nov 16 15:21:52 2017 +0100
sched/fair: Update and fix the runnable propagation rule
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.
Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:
gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
which assumes that tasks are equally runnable which is not true but
easy to compute.
Beside these estimates, we have several simple rules that help us to filter
out wrong ones:
- ge->avg.runnable_sum <= than LOAD_AVG_MAX
- ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
- ge->avg.runnable_sum can't increase when we detach a task
The effect of these fixes is better cgroups balancing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 480aacf775e2ec594e0fed31b15678ef967f29fb
Author: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 24 13:06:35 2017 +0200
sched/fair: Update calc_group_*() comments
I had a wee bit of trouble recalling how the calc_group_runnable()
stuff worked.. add hopefully better comments.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 5b49acfeddbd52712395f011c4ab6e4695863738
Author: Josef Bacik <jbacik@fb.com>
Date: Thu Aug 3 11:13:39 2017 -0400
sched/fair: Calculate runnable_weight slightly differently
Our runnable_weight currently looks like this
runnable_weight = shares * runnable_load_avg / load_avg
The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio. The problem with this is it biases us towards tasks that never
go to sleep. Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks. To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is
runnable_weight = shares * runnable_weight / load_weight
which will make the weight distribution fairer between interactive and
non-interactive groups.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org
Cc: riel@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 7be10d3b02e103159be137da4bc8fd9ebfda2b81
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri May 12 14:18:10 2017 +0200
sched/fair: Implement more accurate async detach
The problem with the overestimate is that it will subtract too big a
value from the load_sum, thereby pushing it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit eebab46a1ff61d392a032456a0274185ec09c78d
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri May 12 14:16:30 2017 +0200
sched/fair: Align PELT windows between cfs_rq and its se
The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.
When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.
Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction:
cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.
This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit dd849181691e6b895696b331c9a750139a1bffd4
Author: Peter Zijlstra <peterz@infradead.org>
Date: Thu May 11 17:57:24 2017 +0200
sched/fair: Implement synchonous PELT detach on load-balance migrate
Vincent wondered why his self migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we uncondionally
take the asynchronous detatch_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.
While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit fa4e5dee6619a0a3c5368e25147263b8576318e4
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 15:59:54 2017 +0200
sched/fair: Propagate an effective runnable_load_avg
The load balancer uses runnable_load_avg as load indicator. For
!cgroup this is:
runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
That is, a direct sum of all runnable tasks on that runqueue. As
opposed to load_avg, which is a sum of all tasks on the runqueue,
which includes a blocked component.
However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.
Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.
Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load avgerage -- which we should not confuse with
the canonical runnable load average.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c83be534eabf6400d82a999eb1072a5e7ecf2eb2
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon May 8 17:30:46 2017 +0200
sched/fair: Rewrite PELT migration propagation
When an entity migrates in (or out) of a runqueue, we need to add (or
remove) its contribution from the entire PELT hierarchy, because even
non-runnable entities are included in the load average sums.
In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):
tg->weight * grq->avg.load_avg
ge->avg.load_avg = ------------------------------
tg->load_avg
But that is the expression for ge->weight, and per the definition of
load_avg:
ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
That destroys the runnable_avg (by setting it to 1) we wanted to
propagate.
Instead directly propagate runnable_sum.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 067555d1c451e393e04adf8056c17505de978ac1
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon May 8 16:51:41 2017 +0200
sched/fair: Rewrite cfs_rq->removed_*avg
Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.
Currently we have 2 atomic variables; which already have the issue
that they can be read out-of-sync. Also, two atomic ops on a single
cacheline is already more expensive than an uncontended lock.
Since we want to add more, convert the thing over to an explicit
cacheline with a lock in.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 938acabba842513db34a25ace967f7771998ee29
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed May 17 11:50:45 2017 +0200
sched/fair: Use reweight_entity() for set_user_nice()
Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co should do the same, otherwise its possible
to confuse load accounting when we migrate near the weight change.
Fixes-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added changelog, fixed the call condition. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit c1d25beea75097359c3d0300330e2bba0106cff0
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 16:11:34 2017 +0200
sched/fair: More accurate reweight_entity()
When a (group) entity changes it's weight we should instantly change
its load_avg and propagate that change into the sums it is part of.
Because we use these values to predict future behaviour and are not
interested in its historical value.
Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.
With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.
Reported-by: Paul Turner <pjt@google.com>
[josef: compile fix !SMP || !FAIR_GROUP]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 3092e27836e4e71089267a27e5a9ab4100627618
Author: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 24 17:45:35 2017 +0200
sched/fair: Introduce {en,de}queue_load_avg()
Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
for {en,de}queue_load_avg(). More users will follow.
Includes some code movement to avoid fwd declarations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 2d9eb6d1ce196555a3571506a0566791f53da0d4
Author: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 24 17:38:30 2017 +0200
sched/fair: Rename {en,de}queue_entity_load_avg()
Since they're now purely about runnable_load, rename them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit beda9e28dbb0eb34d787d4cbac2bc8399fa6f1a6
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 17:37:03 2017 +0200
sched/fair: Move enqueue migrate handling
Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:
- {en,de}queue_entity_load_avg() will become purely about managing
runnable_load
- we can avoid a double update_tg_load_avg() and reduce pressure on
the global tg->shares cacheline
The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 94f87f43b19ceeecc31c9e7169533ad64efeb085
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 17:32:43 2017 +0200
sched/fair: Change update_load_avg() arguments
Most call sites of update_load_avg() already have cfs_rq_of(se)
available, pass it down instead of recomputing it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit f1bfd1f0c849264e58ce1415ddea244440a92797
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 16:42:08 2017 +0200
sched/fair: Remove se->load.weight from se->avg.load_sum
Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum. This prepares for better
reweighting of group entities.
Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().
So for se:
___update_load_sum(.weight = 1)
___upate_load_avg(.weight = se->load.weight)
and for cfs_rq:
___update_load_sum(.weight = cfs_rq->load.weight)
___upate_load_avg(.weight = 1)
Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 9600274a37f34758ce3dfda9f5257b7737f298bb
Author: Peter Zijlstra <peterz@infradead.org>
Date: Thu May 11 18:16:06 2017 +0200
sched/fair: Cure calc_cfs_shares() vs. reweight_entity()
Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.
This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approxmation (5) for calc_cfs_shares() results in 0 when the group
is idle.
Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly drop to 0.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 664ab7a3e73cbb37c84125925f2cbbc91ed7fdca
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 9 11:04:07 2017 +0200
sched/fair: Add comment to calc_cfs_shares()
Explain the magic equation in calc_cfs_shares() a bit better.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit 8cdb20cedab8458aafc8d4766ecb08a28d6ad076
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat May 6 16:03:17 2017 +0200
sched/fair: Clean up calc_cfs_shares()
For consistencies sake, we should have only a single reading of tg->shares.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
commit e4f0f21e76d364601055d5d1f14eae828fd4383a
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:36 2024 +0300
Revert "UPSTREAM: sched/uclamp: Add CPU's clamp buckets refcounting"
This reverts commit 116cf381f4.
commit 8a408ac031b6990f1a2d538680ffac88e42e477c
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:36 2024 +0300
Revert "UPSTREAM: sched/uclamp: Add bucket local max tracking"
This reverts commit a684fdc2de.
commit 6f2949c1073b97efb28f05d086fdc9e473d421c4
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:35 2024 +0300
Revert "UPSTREAM: sched/uclamp: Enforce last task's UCLAMP_MAX"
This reverts commit 41ce223686.
commit 86d7baea30401efcc1d609c619c1526a952e6419
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:35 2024 +0300
Revert "UPSTREAM: sched/uclamp: Add system default clamps"
This reverts commit 5bee2de619.
commit 9d758d18c03983c60e0aba8bfd98ca0084e7d7c4
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:33 2024 +0300
Revert "UPSTREAM: sched/core: Allow sched_setattr() to use the current policy"
This reverts commit 38b1e6723b.
commit 164913a4e0ab61a201d7802eb0de81099453971a
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 12:00:20 2024 +0300
Revert "UPSTREAM: sched/uclamp: Extend sched_setattr() to support utilization clamping"
This reverts commit 486d866797.
commit 98262c0640ec825cc93c5faf076c7d24ae8ae6a7
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:57 2024 +0300
Revert "UPSTREAM: sched/uclamp: Reset uclamp values on RESET_ON_FORK"
This reverts commit 13caaea80c.
commit 17a9d13f5c3bc718f83c71b7b18b40a6525f22c0
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:57 2024 +0300
Revert "UPSTREAM: sched/uclamp: Set default clamps for RT tasks"
This reverts commit 1e06d5830a.
commit 65d7db99c966d5c853da01eb8897be026f5674ed
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:57 2024 +0300
Revert "BACKPORT: sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks"
This reverts commit 1f42ab6cc1.
commit 318ea0855d8084332eb060ca8948bb970cf08129
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:57 2024 +0300
Revert "UPSTREAM: sched/uclamp: Add uclamp_util_with()"
This reverts commit e6a7b88cc2.
commit f1ec7c15ea378c0fa4089a36d753e99054265c10
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:57 2024 +0300
Revert "UPSTREAM: sched/uclamp: Extend CPU's cgroup controller"
This reverts commit 6a41399960.
commit 8814a297260c377a08f12350ab47f302d333043b
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:56 2024 +0300
Revert "UPSTREAM: sched/uclamp: Propagate parent clamps"
This reverts commit 4963c9685e.
commit fe9d3b66e407939f83c8b9ee26b9318ec4316609
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:56 2024 +0300
Revert "UPSTREAM: sched/uclamp: Propagate system defaults to the root group"
This reverts commit a3bbe48401.
commit f6256062570adc1d7e261ffff855f0fa1128d8b1
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:56 2024 +0300
Revert "UPSTREAM: sched/uclamp: Use TG's clamps to restrict TASK's clamps"
This reverts commit dbb3b287e2.
commit 21d5ea4fd377bedd6ad5b49a829fa7b3867d9bd6
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:56 2024 +0300
Revert "UPSTREAM: sched/uclamp: Update CPU's refcount on TG's clamp changes"
This reverts commit b5c75307c4.
commit 2d9b5a47c47a33c965f744ce34c8924524558bae
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:56 2024 +0300
Revert "UPSTREAM: sched/uclamp: Always use 'enum uclamp_id' for clamp_id values"
This reverts commit a479e1de9b.
commit 962708c53ab41d17d9233543581553226f6da382
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:51 2024 +0300
Revert "UPSTREAM: sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code"
This reverts commit fe18af6732.
commit f1bb535dbf77aec377d06a795a52c3e1c3489ab9
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "UPSTREAM: sched/core: Fix compilation error when cgroup not selected"
This reverts commit 4f89a28e12.
commit 7e33f0d51debf2dd24a33ea7fbb8ac19edb42b1e
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "sched/uclamp: Reject negative values in cpu_uclamp_write()"
This reverts commit a12fbe805e.
commit 546d51a84878a1046177b739ccdfbfe4748818b4
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "sched/core: Fix size of rq::uclamp initialization"
This reverts commit c7be60603f.
commit f8456cecf7592fd67d02f5250579d5332a3e168d
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "sched/core: Fix reset-on-fork from RT with uclamp"
This reverts commit fbeca2f356.
commit 7f497bb9e5d2be6a0182e30b1b9cd6feef3b70ab
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "BACKPORT: sched/uclamp: Remove uclamp_util()"
This reverts commit b6f58cb73d.
commit 1573bacd6d6455ddcf9aa62e6ed71080ee8fe73d
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:19 2024 +0300
Revert "UPSTREAM: sched/uclamp: Make uclamp util helpers use and return UL values"
This reverts commit 1fbeb27caf.
commit 33c125b0a8d140d9f6333d1334784b6eaae89009
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:18 2024 +0300
Revert "UPSTREAM: sched/uclamp: Fix initialization of struct uclamp_rq"
This reverts commit 294a706618.
commit fd06089869bde9880e4953e1f43f288db2fc5b14
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:18 2024 +0300
Revert "BACKPORT: sched/uclamp: Protect uclamp fast path code with static key"
This reverts commit b2f4e5d8c3.
commit 592e4a8ed2c0a59fac12c39b2a023c3517c7c801
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:18 2024 +0300
Revert "UPSTREAM: sched/uclamp: Fix a deadlock when enabling uclamp static key"
This reverts commit 3eb05145c9.
commit 60dc9ca3b6caf6db7cb3b209527111812b78587b
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:59:13 2024 +0300
Revert "BACKPORT: sched/uclamp: Add a new sysctl to control RT default boost value"
This reverts commit 1abcbcbea7.
commit 72296d4745d8b53ccedd64edfaa455ca6c8fae9c
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:42 2024 +0300
Revert "ANDROID: sched/core: Add a latency-sensitive flag to uclamp"
This reverts commit b5fa516b3a.
commit 9b478422f4676d419918afb41a71015a02f79a84
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:42 2024 +0300
Revert "ANDROID: sched: Introduce uclamp latency and boost wrapper"
This reverts commit fe73bc33df.
commit 2de1a0534caaa6fa2255b9eda5d64fb2b9daa209
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:42 2024 +0300
Revert "sched/fair: Modify boosted_task_util() to reflect uclamp changes"
This reverts commit 8c84f3cd7f.
commit 523dab791c4ebd52b8b6473d52967edeb7bdbfd8
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:36 2024 +0300
Revert "Revert "sched/fair: Revert Google's capacity margin hacks""
This reverts commit b623172b47.
commit 43bd6d52219dcca4cdbd64129e4c4b6919549d9a
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:36 2024 +0300
Revert "Revert "sched: Stub prefer_high_cap""
This reverts commit 8e1140bfeb.
commit 64662ef0b3e7c6b11d7516ebc3f83483a1fa71a9
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:58:30 2024 +0300
Revert "Revert "Revert "sched: separate boost signal from placement hint"""
This reverts commit 79111a20d9.
commit 1bac20806128df71140bd14c52b83dc1117ade38
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:57:50 2024 +0300
Revert "Revert "Revert "sched/fair: check if mid capacity cpu exists"""
This reverts commit 7508259cb6.
commit 42e0ef5e1b4f56d6337c1a6f3504d275cbc9a043
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:58 2024 +0300
Revert "Revert "Revert "sched: separate capacity margin for boosted tasks"""
This reverts commit 435cf700fa.
commit eab623c38f37d7cc0973478f55256173ce3d213d
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:57 2024 +0300
Revert "Revert "Revert "sched: change capacity margin down to 20% for non-boost task"""
This reverts commit 0567c2dfa0.
commit 81f6c4feb22ee8da71f7adde5513ce1ae8a48346
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:57 2024 +0300
Revert "Revert "Revert "sched/fair: use actual cpu capacity to calculate boosted util"""
This reverts commit 9468b2ce80.
commit c5bff7085b144149dc1ed774a3312c19a1669103
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:57 2024 +0300
Revert "Revert "Revert "sched/fair: do not use boosted margin for prefer_high_cap case"""
This reverts commit 25f6b813b4.
commit d5cc737eb99b7aeced3b5d1546799ce1cffe0129
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:57 2024 +0300
Revert "Revert "Revert "Revert "sched/fair: use actual cpu capacity to calculate boosted util""""
This reverts commit 607d0c1f5a.
commit 944f85f208f1a520eb9759571d49f75db84b3ffc
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:50 2024 +0300
Revert "sched/fair: Make boosted and prefer_idle tunables uclamp aware"
This reverts commit 8d58395724.
commit ff7fc5b1110dd9b31817b7186467938974ed8017
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:43 2024 +0300
Revert "sched/uclamp: Fix incorrect uclamp.latency_sensitive setting"
This reverts commit 282a8ac6f5.
commit 4e5d3a1c362e4c3616126dfc3422d6207eeae977
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:43 2024 +0300
Revert "sched/uclamp: Make uclamp_boosted() return proper boosted value"
This reverts commit e1ffe11ccc.
commit b6ae08ea8bbe41f3c4b90b303e2c39a4c7c5f7b3
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:56:10 2024 +0300
Revert "sched/uclamp: Allow to reset a task uclamp constraint value"
This reverts commit b6b69eece5.
commit 658ddb506825027979ee263aff8d55e0cff3dd98
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:14 2024 +0300
Revert "UPSTREAM: sched/uclamp: Fix incorrect condition"
This reverts commit ce1a62791e.
commit 1c5a70035dafad8750eafa8c5ee40b11115be18e
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:14 2024 +0300
Revert "FROMGIT: sched/uclamp: Fix a bug in propagating uclamp value in new cgroups"
This reverts commit 16d73cafea.
commit ed892012da806647a464df46203dad13ca07e4d8
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:14 2024 +0300
Revert "sched/uclamp: Fix locking around cpu_util_update_eff()"
This reverts commit e954be1dbf.
commit 505f58a688443900fdaeb4cca4e3ca66b0cf867c
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "sched/uclamp: Remove unnecessary mutex_init()"
This reverts commit 31714e404d.
commit 8f96c904caa429cb4e413bae75e746a38b9f1214
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "FROMLIST: sched: Fix out-of-bound access in uclamp"
This reverts commit 973affb92e.
commit 29df4134d5a23f686000ebfd31ee3edaf2f6bb44
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "sched/uclamp: Fix wrong implementation of cpu.uclamp.min"
This reverts commit 07c3a9af2f.
commit 4efefc8fcc48eaff905fc548099b5c5f181e8c8c
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "sched/uclamp: Fix uclamp_tg_restrict()"
This reverts commit 7656386687.
commit 7d5d5fe448300a61d6718919d3c38277e32285db
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "sched/uclamp: Ignore max aggregation if rq is idle"
This reverts commit 08a6710f89.
commit 488c9711417e6fc0d07401e928e9523bd241915c
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "ANDROID: sched: Make uclamp changes depend on CAP_SYS_NICE"
This reverts commit 957d737358.
commit 1ac8237c8049014ea11f5505c95fb0c91d5608d1
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:55:13 2024 +0300
Revert "sched: Fix UCLAMP_FLAG_IDLE setting"
This reverts commit e9a6a0f106.
commit 0ef1e5ecb196bd556499ba0a7021a53ed87cadc4
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:54:15 2024 +0300
Revert "sched/uclamp: Fix rq->uclamp_max not set on first enqueue"
This reverts commit 977a80d795.
commit b69fc8ea5fbcc014e4a8fdafefb468dc348baa38
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:53:53 2024 +0300
Revert "sched/fair: Reduce busy load balance interval"
This reverts commit c54355abd4.
commit 45daeebd161dab7ff6550030d0a8bcb302b2c271
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:53:40 2024 +0300
Revert "sched/fair: Reduce minimal imbalance threshold"
This reverts commit 47a4df59c1.
commit 8557606f8d15209e2f68ff49485e62fb97a441c3
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:53:32 2024 +0300
Revert "sched/deadline: Fix stale throttling on de-/boosted tasks"
This reverts commit 1f2196de96.
commit 705944ee989af6d24932895c541abca24caa1b21
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:53:22 2024 +0300
Revert "sched: fair: consider all running tasks in cpu for load balance"
This reverts commit c1e04a4d13.
commit ecffc3948595a8cb3c246135082ee17ab7d414ca
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:53:03 2024 +0300
Revert "sched/fair: Revert Google's capacity margin hacks"
This reverts commit e7c82d8c4d.
commit c064aca8d86c768eb1ebf8536c25e49718cb55b9
Author: kondors1995 <normandija1945@gmail.com>
Date: Sat Aug 17 11:48:56 2024 +0300
Revert "BACKPORT: sched/fair: Make task_fits_capacity() consider uclamp restrictions"
This reverts commit a59a899e6a.
# Conflicts:
# include/linux/sched/sysctl.h
# kernel/sched/fair.c
# kernel/sched/sched.h
# kernel/sched/tune.c
commit d0fa70aca54c8643248e89061da23752506ec0d4 upstream.
Add a check before visiting the members of ea to
make sure each ea stays within the ealist.
Signed-off-by: lei lu <llfamsec@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7f91bd0f2941fa36449ce1a15faaa64f840d9746)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
commit 233323f9b9f828cd7cd5145ad811c1990b692542 upstream.
The acpi_cst_latency_cmp() comparison function currently used for
sorting C-state latencies does not satisfy transitivity, causing
incorrect sorting results.
Specifically, if there are two valid acpi_processor_cx elements A and B
and one invalid element C, it may occur that A < B, A = C, and B = C.
Sorting algorithms assume that if A < B and A = C, then C < B, leading
to incorrect ordering.
Given the small size of the array (<=8), we replace the library sort
function with a simple insertion sort that properly ignores invalid
elements and sorts valid ones based on latency. This change ensures
correct ordering of the C-state latencies.
Fixes: 65ea8f2c6e23 ("ACPI: processor idle: Fix up C-state latency if not ordered")
Reported-by: Julian Sikorski <belegdol@gmail.com>
Closes: https://lore.kernel.org/lkml/70674dc7-5586-4183-8953-8095567e73df@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Tested-by: Julian Sikorski <belegdol@gmail.com>
Cc: All applicable <stable@vger.kernel.org>
Link: https://patch.msgid.link/20240701205639.117194-1-visitorckw@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c9d6e349f7aad4ab9c557047d357df256c15f25e)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
commit 24d3ba0a7b44c1617c27f5045eecc4f34752ab03 upstream.
The 32-bit ARM kernel stops working if the kernel grows to the point
where veneers for __get_user_* are created.
AAPCS32 [1] states, "Register r12 (IP) may be used by a linker as a
scratch register between a routine and any subroutine it calls. It
can also be used within a routine to hold intermediate values between
subroutine calls."
However, bl instructions buried within the inline asm are unpredictable
for compilers; hence, "ip" must be added to the clobber list.
This becomes critical when veneers for __get_user_* are created because
veneers use the ip register since commit 02e541db05 ("ARM: 8323/1:
force linker to use PIC veneers").
[1]: https://github.com/ARM-software/abi-aa/blob/2023Q1/aapcs32/aapcs32.rst
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: John Stultz <jstultz@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 41a5c1717bf4ad1b6084e8682de64b178eabc059)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
commit 3cad1bc010416c6dd780643476bc59ed742436b9 upstream.
When fcntl_setlk() races with close(), it removes the created lock with
do_lock_file_wait().
However, LSMs can allow the first do_lock_file_wait() that created the lock
while denying the second do_lock_file_wait() that tries to remove the lock.
In theory (but AFAIK not in practice), posix_lock_file() could also fail to
remove a lock due to GFP_KERNEL allocation failure (when splitting a range
in the middle).
After the bug has been triggered, use-after-free reads will occur in
lock_get_status() when userspace reads /proc/locks. This can likely be used
to read arbitrary kernel memory, but can't corrupt kernel memory.
This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
enforcing mode and only works from some security contexts.
Fix it by calling locks_remove_posix() instead, which is designed to
reliably get rid of POSIX locks associated with the given file and
files_struct and is also used by filp_flush().
Fixes: c293621bbf ("[PATCH] stale POSIX lock handling")
Cc: stable@kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
[stable fixup: ->c.flc_type was ->fl_type in older kernels]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d30ff33040834c3b9eee29740acd92f9c7ba2250)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 73810cd45b99c6c418e1c6a487b52c1e74edb20d ]
When building with clang, via:
make LLVM=1 -C tools/testing/selftests
...there are several warnings, and an error. This fixes all of those and
allows these tests to run and pass.
1. Fix linker error (undefined reference to memcpy) by providing a local
version of memcpy.
2. clang complains about using this form:
if (g = h & 0xf0000000)
...so factor out the assignment into a separate step.
3. The code is passing a signed const char* to elf_hash(), which expects
a const unsigned char *. There are several callers, so fix this at
the source by allowing the function to accept a signed argument, and
then converting to unsigned operations, once inside the function.
4. clang doesn't have __attribute__((externally_visible)) and generates
a warning to that effect. Fortunately, gcc 12 and gcc 13 do not seem
to require that attribute in order to build, run and pass tests here,
so remove it.
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit d5e9dddd18fdfe04772bce07d4a34e39e7b1e402)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit ce1dac560a74220f2e53845ec0723b562288aed4 ]
While in commit 2dd33f9cec ("spi: imx: support DMA for imx35") it was
claimed that DMA works on i.MX25, i.MX31 and i.MX35 the respective
device trees don't add DMA channels. The Reference manuals of i.MX31 and
i.MX25 also don't mention the CSPI core being DMA capable. (I didn't
check the others.)
Since commit e267a5b3ec59 ("spi: spi-imx: Use dev_err_probe for failed
DMA channel requests") this results in an error message
spi_imx 43fa4000.spi: error -ENODEV: can't get the TX DMA channel!
during boot. However that isn't fatal and the driver gets loaded just
fine, just without using DMA.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://patch.msgid.link/20240508095610.2146640-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 4f5e56dddabe947cc840ffb2db60d9df6ca9e8b9)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 391b59b045004d5b985d033263ccba3e941a7740 ]
Jan reported that 'cd ..' may take a long time in deep directory
hierarchies under a bind-mount. If concurrent renames happen it is
possible to livelock in is_subdir() because it will keep retrying.
Change is_subdir() from simply retrying over and over to retry once and
then acquire the rename lock to handle deep ancestor chains better. The
list of alternatives to this approach were less then pleasant. Change
the scope of rcu lock to cover the whole walk while at it.
A big thanks to Jan and Linus. Both Jan and Linus had proposed
effectively the same thing just that one version ended up being slightly
more elegant.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit a5c4645346b0efb5a10ed28ae281a9af29037608)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 0d151a103775dd9645c78c97f77d6e2a5298d913 ]
syzbot is reporting that calling hci_release_dev() from hci_error_reset()
due to hci_dev_put() from hci_error_reset() can cause deadlock at
destroy_workqueue(), for hci_error_reset() is called from
hdev->req_workqueue which destroy_workqueue() needs to flush.
We need to make sure that hdev->{rx_work,cmd_work,tx_work} which are
queued into hdev->workqueue and hdev->{power_on,error_reset} which are
queued into hdev->req_workqueue are no longer running by the moment
destroy_workqueue(hdev->workqueue);
destroy_workqueue(hdev->req_workqueue);
are called from hci_release_dev().
Call cancel_work_sync() on these work items from hci_unregister_dev()
as soon as hdev->list is removed from hci_dev_list.
Reported-by: syzbot <syzbot+da0a9c9721e36db712e8@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=da0a9c9721e36db712e8
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 48542881997e17b49dc16b93fe910e0cfcf7a9f9)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 6a7db25aad8ce6512b366d2ce1d0e60bac00a09d ]
When dmaengine supports pause function, in suspend state,
dmaengine_pause() is called instead of dmaengine_terminate_async(),
In end of playback stream, the runtime->state will go to
SNDRV_PCM_STATE_DRAINING, if system suspend & resume happen
at this time, application will not resume playback stream, the
stream will be closed directly, the dmaengine_terminate_async()
will not be called before the dmaengine_synchronize(), which
violates the call sequence for dmaengine_synchronize().
This behavior also happens for capture streams, but there is no
SNDRV_PCM_STATE_DRAINING state for capture. So use
dmaengine_tx_status() to check the DMA status if the status is
DMA_PAUSED, then call dmaengine_terminate_async() to terminate
dmaengine before dmaengine_synchronize().
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Link: https://patch.msgid.link/1718851218-27803-1-git-send-email-shengjiu.wang@nxp.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit fe0a6e7eb38f9d5396f6ff548186a6cd62c08b1a)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit a69ce592cbe0417664bc5a075205aa75c2ec1273 ]
The Lenovo N24 on resume becomes stuck in a state where it
sends incorrect packets, causing elantech_packet_check_v4 to fail.
The only way for the device to resume sending the correct packets is for
it to be disabled and then re-enabled.
This change adds a dmi check to trigger this behavior on resume.
Signed-off-by: Jonathan Denose <jdenose@google.com>
Link: https://lore.kernel.org/r/20240503155020.v2.1.Ifa0e25ebf968d8f307f58d678036944141ab17e6@changeid
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 9b6a1cb833dc8ceab3fbc45a261a8dd37c4f8013)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 1db5322b7e6b58e1b304ce69a50e9dca798ca95b ]
Change level for the "not connected" client message in the write
callback from error to debug.
The MEI driver currently disconnects all clients upon system suspend.
This behavior is by design and user-space applications with
open connections before the suspend are expected to handle errors upon
resume, by reopening their handles, reconnecting,
and retrying their operations.
However, the current driver implementation logs an error message every
time a write operation is attempted on a disconnected client.
Since this is a normal and expected flow after system resume
logging this as an error can be misleading.
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Link: https://lore.kernel.org/r/20240530091415.725247-1-tomas.winkler@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit bd2a753fa12cf3d28726a4bf067398514e52d57c)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit ed8c7fbdfe117abbef81f65428ba263118ef298a ]
The maximum possible return value of find_next_zero_bit(fdt->full_fds_bits,
maxbit, bitbit) is maxbit. This return value, multiplied by BITS_PER_LONG,
gives the value of bitbit, which can never be greater than maxfd, it can
only be equal to maxfd at most, so the following check 'if (bitbit > maxfd)'
will never be true.
Moreover, when bitbit equals maxfd, it indicates that there are no unused
fds, and the function can directly return.
Fix this check.
Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev>
Link: https://lore.kernel.org/r/20240529160656.209352-1-yuntao.wang@linux.dev
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 5611e11988535125b3a05305680851ff587702a9)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 77a92660d8fe8d29503fae768d9f5eb529c88b36 ]
expr_trans_bool() performs an incorrect transformation.
[Test Code]
config MODULES
def_bool y
modules
config A
def_bool y
select C if B != n
config B
def_tristate m
config C
tristate
[Result]
CONFIG_MODULES=y
CONFIG_A=y
CONFIG_B=m
CONFIG_C=m
This output is incorrect because CONFIG_C=y is expected.
Documentation/kbuild/kconfig-language.rst clearly explains the function
of the '!=' operator:
If the values of both symbols are equal, it returns 'n',
otherwise 'y'.
Therefore, the statement:
select C if B != n
should be equivalent to:
select C if y
Or, more simply:
select C
Hence, the symbol C should be selected by the value of A, which is 'y'.
However, expr_trans_bool() wrongly transforms it to:
select C if B
Therefore, the symbol C is selected by (A && B), which is 'm'.
The comment block of expr_trans_bool() correctly explains its intention:
* bool FOO!=n => FOO
^^^^
If FOO is bool, FOO!=n can be simplified into FOO. This is correct.
However, the actual code performs this transformation when FOO is
tristate:
if (e->left.sym->type == S_TRISTATE) {
^^^^^^^^^^
While it can be fixed to S_BOOLEAN, there is no point in doing so
because expr_tranform() already transforms FOO!=n to FOO when FOO is
bool. (see the "case E_UNEQUAL" part)
expr_trans_bool() is wrong and unnecessary.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit b366d89859fe7b58894b3698844b551fe32f892a)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 46edf4372e336ef3a61c3126e49518099d2e2e6d ]
Currently, the initial state of the "Save" button is always active.
If none of the CONFIG options are changed while loading the .config
file, the "Save" button should be greyed out.
This can be fixed by calling conf_read() after widget initialization.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit b6d6a91b584a022424d99264741bdfa6b336c83b)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit cf28ff8e4c02e1ffa850755288ac954b6ff0db8c ]
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
ila_output() is called from lwtunnel_output()
possibly from process context, and under rcu_read_lock().
We might be interrupted by a softirq, re-enter ila_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 7435bd2f84a25aba607030237261b3795ba782da)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 38a38f5a36da9820680d413972cb733349400532 ]
When support for Silead touchscreens was orginal added some touchscreens
with older firmware versions only supported 5 fingers and this was made
the default requiring the setting of a "silead,max-fingers=10" uint32
device-property for all touchscreen models which do support 10 fingers.
There are very few models with the old 5 finger fw, so in practice the
setting of the "silead,max-fingers=10" is boilerplate which needs to
be copy and pasted to every touchscreen config.
Reporting that 10 fingers are supported on devices which only support
5 fingers doesn't cause any problems for userspace in practice, since
at max 4 finger gestures are supported anyways. Drop the max_fingers
configuration and simply always assume 10 fingers.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Link: https://lore.kernel.org/r/20240525193854.39130-2-hdegoede@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit ce0368a52554d213c5cd447ba786b54390a845e1)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
On some x86 tablets with a silead touchscreen the windows logo on the
front is a capacitive home button. Touching this button results in a touch
with bits 12-15 of the Y coordinates set, while normally only the lower 12
are used.
Detect this and report a KEY_LEFTMETA press when this happens. Note for
now we only respond to the Y coordinate bits 12-15 containing 0x01, on some
tablets *without* a capacative button I've noticed these bits containing
0x04 when crossing the edges of the screen.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit eca3be9b95ac7cf9442654a54962859d74f8e38a)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 92ecbb3ac6f3fe8ae9edf3226c76aa17b6800699 ]
When testing the previous patch with CONFIG_UBSAN_BOUNDS, I've
noticed the following:
UBSAN: array-index-out-of-bounds in net/mac80211/scan.c:372:4
index 0 is out of range for type 'struct ieee80211_channel *[]'
CPU: 0 PID: 1435 Comm: wpa_supplicant Not tainted 6.9.0+ #1
Hardware name: LENOVO 20UN005QRT/20UN005QRT <...BIOS details...>
Call Trace:
<TASK>
dump_stack_lvl+0x2d/0x90
__ubsan_handle_out_of_bounds+0xe7/0x140
? timerqueue_add+0x98/0xb0
ieee80211_prep_hw_scan+0x2db/0x480 [mac80211]
? __kmalloc+0xe1/0x470
__ieee80211_start_scan+0x541/0x760 [mac80211]
rdev_scan+0x1f/0xe0 [cfg80211]
nl80211_trigger_scan+0x9b6/0xae0 [cfg80211]
...<the rest is not too useful...>
Since '__ieee80211_start_scan()' leaves 'hw_scan_req->req.n_channels'
uninitialized, actual boundaries of 'hw_scan_req->req.channels' can't
be checked in 'ieee80211_prep_hw_scan()'. Although an initialization
of 'hw_scan_req->req.n_channels' introduces some confusion around
allocated vs. used VLA members, this shouldn't be a problem since
everything is correctly adjusted soon in 'ieee80211_prep_hw_scan()'.
Cleanup 'kmalloc()' math in '__ieee80211_start_scan()' by using the
convenient 'struct_size()' as well.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://msgid.link/20240517153332.18271-2-dmantipov@yandex.ru
[improve (imho) indentation a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit cd3212a9e0209dff7eda30f01ab8590f5e8d92fb)
[Harshit: add #include that was pulled in through some other unknown
header on 4.19.y]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>