exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") added a TIF_MEMDIE and PF_EXITING check, but
it checks the flags on the current task rather than the given one.
This doesn't make much sense and it is actually wrong. If the current
task which updates the nodemask of a cpuset got killed by the OOM killer
then a part of the cpuset cgroup processes would have incompatible
nodemask which is surprising to say the least.
The comment suggests the intention was to skip an oom victim or an
exiting task, so we should be checking the given task. But even then
it would be a layering violation, because it is up to the memory
allocator to interpret the meaning of TIF_MEMDIE. Simply drop both
checks. All tasks in the cpuset should simply follow the same mask.
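The inconsistency can be seen in a small userspace model (simplified and illustrative only; the real code walks the cgroup's tasks via the cpuset machinery, and the struct and function names here are stand-ins):

```c
#include <stddef.h>

struct task {
    unsigned long mems_allowed; /* bitmask of allowed memory nodes */
    int exiting;                /* task is an oom victim / exiting */
};

/* After the fix: apply the new nodemask to every task in the cpuset,
 * with no special-casing of exiting or OOM-killed tasks. */
static void update_tasks_nodemask(struct task *tasks, size_t n,
                                  unsigned long newmask)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].mems_allowed = newmask;
}
```

With the dropped checks, no task is left behind with a stale, incompatible nodemask.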
Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
It is, however, checking the flag on the current task rather than the
given one. This is really confusing because freezing() can also be
called on !current tasks. It would end up working correctly for its
main purpose because __refrigerator will always be called on the
current task, so the oom victim will never get frozen. But it could
lead to surprising results when a task which is freezing a cgroup gets
oom killed, because only part of the cgroup would get frozen. This is
highly unlikely but worth fixing, as the resulting code will be
clearer anyway.
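The fix can be illustrated with a minimal userspace model (the `task` struct, `current` pointer, and flag value here are simplified stand-ins for the kernel's, not the real API):

```c
#include <stdbool.h>
#include <stddef.h>

#define TIF_MEMDIE (1u << 0)

struct task {
    unsigned int flags;
};

struct task *current; /* the task executing right now */

/* Before the fix: the check used current, not the task passed in. */
static bool should_freeze_buggy(struct task *p)
{
    (void)p;
    return !(current->flags & TIF_MEMDIE);
}

/* After the fix: check the given task, which may be !current. */
static bool should_freeze_fixed(struct task *p)
{
    return !(p->flags & TIF_MEMDIE);
}
```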
Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
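The selection logic described above can be sketched as a userspace simulation (names, the priority field, and the scan loop are illustrative, not Simple LMK's actual code):

```c
#include <stddef.h>

struct proc {
    int adj;        /* Android importance: higher = less important */
    long rss_pages; /* memory the kill would free */
    int killed;
};

/* Kill least-important processes first until the memory deficit is
 * satisfied. Returns the number of pages expected to be freed. */
static long simple_lmk_scan(struct proc *procs, size_t n,
                            long pages_needed)
{
    long freed = 0;

    while (freed < pages_needed) {
        struct proc *victim = NULL;
        size_t i;

        /* pick the not-yet-killed process with the highest adj */
        for (i = 0; i < n; i++) {
            if (!procs[i].killed &&
                (!victim || procs[i].adj > victim->adj))
                victim = &procs[i];
        }
        if (!victim)
            break; /* nothing left to kill */

        victim->killed = 1;
        freed += victim->rss_pages;
    }
    return freed;
}
```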
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The implementation is utterly broken, resulting in all processes being
allows to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only
capability anyway, so let's get rid of this code.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of unnecessarily increased register
pressure, meaning complex programs would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
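The operand swap described above can be modeled in plain C (a hedged illustration of the semantics, not actual BPF bytecode or the verifier's rewrite):

```c
#include <stdbool.h>

/* Without BPF_JLT, "X < IMM" must be rewritten: load IMM into a
 * scratch register Y, then test Y > X with BPF_JGT, since the
 * destination operand must be a register. */
static bool jump_taken_without_jlt(long x, long imm)
{
    long y = imm;  /* extra register load: Y := IMM */
    return y > x;  /* BPF_JGT Y, X */
}

/* With BPF_JLT the comparison is a single instruction: X < IMM,
 * with no scratch register needed. */
static bool jump_taken_with_jlt(long x, long imm)
{
    return x < imm; /* BPF_JLT X, IMM */
}
```

Both forms branch identically; the new opcodes simply avoid the register load and any spill/fill it forces.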
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
files reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel probed for the
presence of the extensions, enabling them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Change-Id: Ic56500aaeaf5f3ebdfda094ad6ef4666c82e18c5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Free up the BPF_JMP | BPF_CALL | BPF_X opcode to be used by an actual
indirect call by register, and use a kernel-internal opcode to
mark call instructions into the bpf_tail_call() helper.
Change-Id: I1a45b8e3c13848c9689ce288d4862935ede97fa7
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
that a single __weak function in the core that can be overridden
similarly to the eBPF one. Also remove stale pr_err() mentions
of bpf_jit_compile.
Change-Id: Iac221c09e9ae0879acdd7064d710c4f7cb8f478d
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This is trivial to do:
- add flags argument to simple_rename()
- check that flags contains nothing other than RENAME_NOREPLACE
- assign simple_rename() to .rename2 instead of .rename
Filesystems converted:
hugetlbfs, ramfs, bpf.
Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access. For this case pass zero flags to simple_rename().
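The flags check amounts to something like the following (a simplified stand-in; RENAME_NOREPLACE's value and the -EINVAL convention mirror the kernel's, but this is not the kernel function itself):

```c
#include <errno.h>

#define RENAME_NOREPLACE (1 << 0)

/* Reject any rename flag other than RENAME_NOREPLACE, as
 * simple_rename() is taught to do for the .rename2 conversion. */
static int check_rename_flags(unsigned int flags)
{
    if (flags & ~RENAME_NOREPLACE)
        return -EINVAL;
    return 0; /* proceed with the rename */
}
```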
Change-Id: I1a46ece3b40b05c9f18fd13b98062d2a959b76a0
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Reinitialize rq->next_balance when a CPU is hot-added. Otherwise,
scheduler domain rebalancing may be skipped if rq->next_balance was
set to a future time when the CPU was last active, and the
newly-re-added CPU is in idle_balance(). As a result, the
newly-re-added CPU will remain idle with no tasks scheduled until the
softlockup watchdog runs - potentially 4 seconds later. This can
waste energy and reduce performance.
This behavior can be observed in some SoC kernels, which use CPU
hotplug to dynamically remove and add CPUs in response to load. In
one case that triggered this behavior,
0. the system started with all cores enabled, running multi-threaded
CPU-bound code;
1. the system entered some single-threaded code;
2. a CPU went idle and was hot-removed;
3. the system started executing a multi-threaded CPU-bound task;
4. the CPU from event 2 was re-added, to respond to the load.
The time interval between events 2 and 4 was approximately 300
milliseconds.
Of course, ideally CPU hotplug would not be used in this manner,
but this patch does appear to fix a real bug.
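A minimal model of the fix (simplified and illustrative; `jiffies`, the rq field, and the helper names are stand-ins for the kernel's):

```c
#include <stdbool.h>

static unsigned long jiffies; /* simplified global clock */

struct rq {
    unsigned long next_balance;
};

/* idle_balance() skips rebalancing while next_balance is in the
 * future (time-after comparison via signed subtraction). */
static bool idle_balance_would_run(struct rq *rq)
{
    return (long)(jiffies - rq->next_balance) >= 0;
}

/* Fix: on hot-add, pull next_balance back to now so a stale future
 * timestamp from the CPU's last active period cannot stall it. */
static void rq_online_fixup(struct rq *rq)
{
    rq->next_balance = jiffies;
}
```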
Nvidia folks: this patch is submitted as at least a partial fix for
bug 1243368 ("[sched] Load-balancing not happening correctly after
cores brought online")
Change-Id: Iabac21e110402bb581b7db40c42babc951d378d0
Signed-off-by: Paul Walmsley <pwalmsley@nvidia.com>
Cc: Peter Boonstoppel <pboonstoppel@nvidia.com>
Reviewed-on: http://git-master/r/206918
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Amit Kamath <akamath@nvidia.com>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Reviewed-by: Diwakar Tundlam <dtundlam@nvidia.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
While setting the smpboot thread state to running, there is a
possibility that an IRQ fires on the same core and wakes up the
smpboot thread of that core, creating a self-deadlock. To avoid this,
protect the state update with spin_lock_irqsave().
Change-Id: I5eca9b27af94fee22af3bb201f26b63ed8930efe
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Some drivers need to know what the status of the interrupt line is.
This is especially true for drivers that register a handler with
IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING and in the handler they
need to know which edge transition it was invoked for. Provide a way
for these handlers to read the logical status of the line after their
handler was invoked. If the line reads high it was called for a
rising edge and if the line reads low it was called for a falling edge.
The irq_read_line callback in the chip allows the controller to provide
the real time status of this line. Controllers that can read the status
of an interrupt line should implement this by doing necessary
hardware reads and return the logical state of the line.
Interrupt controllers based on the slow bus architecture should conduct
the transaction in this callback. The genirq code will call the chip's
bus lock prior to calling irq_read_line. Obviously since the transaction
would be completed before returning from irq_read_line it need not do
any transactions in the bus unlock call.
Change-Id: I3c8746706530bba14a373c671d22ee963b84dfab
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Currently, there exists a corner case when there is only one
clocksource, e.g. the RTC, and the system fails to enter suspend
mode. On resume, rtc_resume() injects the sleep time because
timekeeping_rtc_skipresume() returned 'false' (the default value of
sleeptime_injected), due to which we can see a mismatch in
timestamps.
This issue can also occur on a system where more than one clocksource
is present and the very first suspend fails.
Success case:
------------
{sleeptime_injected=false}
rtc_suspend() => timekeeping_suspend() => timekeeping_resume() =>
(sleeptime injected)
rtc_resume()
Failure case:
------------
{failure in sleep path} {sleeptime_injected=false}
rtc_suspend() => rtc_resume()
{sleeptime injected again which was not required as the suspend failed}
Fix this by handling the boolean logic properly.
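The boolean fix can be modeled as follows (a simplified simulation; the actual patch reworks the injection flag inside timekeeping, and these helper names are illustrative):

```c
#include <stdbool.h>

/* Set when timekeeping actually suspended and the sleep interval
 * still needs to be accounted for on resume. */
static bool suspend_timing_needed;

static void timekeeping_suspend(void)
{
    suspend_timing_needed = true;
}

/* rtc_resume() asks whether it should inject sleep time. If the
 * suspend path failed before reaching timekeeping_suspend() (the
 * failure case above), nothing must be injected. */
static bool rtc_may_inject_sleeptime(void)
{
    if (suspend_timing_needed) {
        suspend_timing_needed = false;
        return true;
    }
    return false;
}
```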
Change-Id: I7ac5210ec326b41f4d36bb87209b667f21f3aa50
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Git-commit: f473e5f467f6049370575390b08dc42131315d60
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
CPUs that are idle are excellent candidates for latency sensitive or
high-performance tasks. Decrementing their capacity while they are idle
will result in these CPUs being chosen less, and they will prefer to
schedule smaller tasks instead of large ones. Disable this.
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: clarencelol <clarencekuiek@icloud.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Even the interactive governor utilizes a realtime priority. It is
beneficial for schedutil to process its workload at a priority >= that
of mundane tasks (KGSL, audio, etc.).
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: clarencelol <clarencekuiek@icloud.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
The schedtune cgroup controller allows up to 5 cgroups including the
default/root cgroup. Until now, user space created only 4 additional
cgroups, namely foreground, background, top-app and audio-app.
Recently another cgroup called rt was created before the audio-app
cgroup. Since the kernel limits the cgroups to 5, creation of the
audio-app cgroup is failing. Fix this by increasing the schedtune
cgroup controller's cgroup limit to 6.
Change-Id: I13252a90dba9b8010324eda29b8901cb0b20bc21
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Inform scheduler about capacity restrictions, such as during frequency
boosting.
Change-Id: Ic65bede69608acf8ca3f144f144049a4392a70f6
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
We should apply the iowait boost only if cpufreq policy has iowait boost
enabled. Also make it a schedutil configuration from sysfs so it can be
turned on/off if needed (by default initialize it to the policy value).
For systems that don't need/want it enabled, such as those on arm64
based mobile devices that are battery operated, it saves energy when the
cpufreq driver policy doesn't have it enabled (details below):
Here are some results for energy measurements collected running a
YouTube video for 30 seconds:
Before: 8.042533 mWh
After: 7.948377 mWh
Energy savings are ~1.2%
Bug: 38010527
Link: https://lkml.org/lkml/2017/5/19/42
Change-Id: If124076ad0c16ade369253840dedfbf870aff927
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Schedtune boosted tasks are biased to higher capacity CPUs by default.
Add a sched feature to enable/disable this behaviour.
Change-Id: I3500675c182f3929e893dbb33850fe033db6c146
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
We now need to pass the functions a boost slot argument. Also we rename
the functions to reflect that we intend to perform sched_boost.
Change-Id: I84a63aea2c9035267095762804efabf7be6c66d5
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Switch from a counter-based system to a slot-based system for managing
multiple dynamic Schedtune boost requests.
The primary limitation of the counter-based system was that it could
only keep track of two boost values at a time: the current dynamic
boost value and the default boost value. When more than one boost
request is issued, the system remembers only the highest value of
them all. Even after the task that requested the highest value has
unboosted, that value is still maintained as long as other active
boosts are still running. A more ideal outcome would be for the
system to unboost to the maximum boost value of the remaining active
boosts.
The slot-based system provides a solution to the problem by keeping
track of the boost values of all ongoing active boosts. It ensures that
the current boost value will be equal to the maximum boost value of
all ongoing active boosts. This is achieved with two linked lists
(active_boost_slots and available_boost_slots), which assign and keep
track of boost slot numbers for each successful boost request. The boost
value of each request is stored in an array (slot_boost[]), at an index
value equal to the assigned boost slot number.
For now we limit the number of active boost slots to 5 per Schedtune
group.
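A condensed userspace model of the slot bookkeeping described above (the list handling is collapsed into a flat active-flag array for brevity; `slot_boost[]` follows the description but the code is illustrative, not the actual implementation):

```c
#define MAX_BOOST_SLOTS 5

static int slot_boost[MAX_BOOST_SLOTS];  /* boost value per slot */
static int slot_active[MAX_BOOST_SLOTS]; /* nonzero while boost active */

/* Claim a free slot for a new boost request; returns the slot
 * number, or -1 if all slots are busy. */
static int boost_slot_acquire(int boost)
{
    for (int i = 0; i < MAX_BOOST_SLOTS; i++) {
        if (!slot_active[i]) {
            slot_active[i] = 1;
            slot_boost[i] = boost;
            return i;
        }
    }
    return -1;
}

static void boost_slot_release(int slot)
{
    slot_active[slot] = 0;
}

/* The effective boost is the maximum over all active slots, so
 * releasing one request falls back to the next-highest boost. */
static int current_boost(void)
{
    int max = 0;
    for (int i = 0; i < MAX_BOOST_SLOTS; i++)
        if (slot_active[i] && slot_boost[i] > max)
            max = slot_boost[i];
    return max;
}
```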
Change-Id: Iadc738fc919af092fd4c1b6312becf9567bc4c62
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
To reflect that the function is to be used mainly with CAF's devices
that have sched_boost. However, developers may use it as a switch to
dynamically boost schedtune to the values specified in
/dev/stune/*/schedtune.sched_boost.
Change-Id: I5012273e5572c6091a99a6954452bed3a2501c55
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This was confusing to deal with given that it had the same name as the
Dynamic Schedtune Boost framework. It will be more apt to call it
sched_boost given that it was created to work with the sched_boost
feature in CAF devices.
The new tunable can be found in /dev/stune/*/schedtune.sched_boost
Change-Id: Iafa3e35ef7c7991f09595ba452d8050ddc694743
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
It does not make sense to be unable to reset Schedtune boost for a
particular Schedtune group if another Schedtune group's boost is still
active. Instead of using a global count, we should use a per-Schedtune
group count to keep track of active boosts taking place.
Change-Id: Ic47ccd2582dbb31aa245a13d301ddf538b0d318b
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
We will need to take care to ensure that every do_stune_boost() we call
is followed eventually by a reset_stune_boost() so that
stune_boost_count is managed correctly.
This allows us to stack several Dynamic Schedtune Boosts and reset only
when all Dynamic Schedtune Boosts have been disengaged.
Change-Id: I09b739e4503930eaf0e3f14870758b21ce9868f5
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Boost top-app SchedTune tasks using the dynamic_boost value when
/proc/sys/kernel/sched_boost is activated. This is usually triggered by
CAF's perf daemon.
Change-Id: I23f0e7822673230288fbaeda0a7f4aa8546bf7d3
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
We will use this in conjunction with CAF's perf daemon to somewhat
replicate core_ctl's sched_boost capabilities.
Credits to the developers at Codeaurora for the code.
Change-Id: Ifc4f76e02eed97ac2c5fc8c9a60e56c09aed6578
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Add a simple function to activate Dynamic Schedtune Boost and use the
dynamic_boost value of the SchedTune CGroup.
Change-Id: I106c1ad169419a575df400fc511b4be046b52152
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
For added flexibility and in preparation for introducing another function.
Change-Id: Ic95ba54e1549b0b70222c82a5ee1e164340e3258
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This is to reduce confusion when we create a new dynamic_boost_write()
function in future patches.
Change-Id: I0cef57875a193034ce4a7dab6769449c9c0cda8a
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Provide functions to activate and reset SchedTune boost:
int do_stune_boost(char *st_name, int boost);
int reset_stune_boost(char *st_name);
Change-Id: Id3f93a63b7a94a08b124cb304bc0ffe9cc889d7a
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This patch fixes one of the infrequent conditions in
commit 54b6baeca500 ("sched/fair: Skip frequency updates if CPU about to idle")
where we could have skipped a frequency update. The fix is to use the
correct flag which skips freq updates.
Note that this is a rare issue (can show up only during CFS throttling)
and even then we just do an additional frequency update which we were
doing anyway before the above patch.
Bug: 64689959
Change-Id: I0117442f395cea932ad56617065151bdeb9a3b53
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
CPU rq util updates happen when rq signals are updated as part of
enqueue and dequeue operations. Doing these updates triggers a call to
the registered util update handler, which takes schedtune boosting
into account. Enqueueing the task in the correct schedtune group after
this happens means that we will potentially not see the boost for an
entire throttle period.
Move the enqueue/dequeue operations for schedtune before the signal
updates which can trigger OPP changes.
Change-Id: I4236e6b194bc5daad32ff33067d4be1987996780
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
If the CPU is about to idle, prevent a frequency update. With this,
the number of schedutil governor wake-ups is reduced by more than
half in a test playing Bluetooth audio.
Test: sugov wake ups drop by more than half when playing music with
screen off (476 / 1092)
Bug: 64689959
Change-Id: I400026557b4134c0ac77f51c79610a96eb985b4a
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
One SoC can have multiple CPU speedbins, which cannot be represented
with the current energy model due to its fixed capacity per CPU
frequency step.
Provide all of a CPU's possible frequency steps, instead of
capacities, along with the corresponding energy costs, to be able to
support different speedbins.
Change-Id: I96ff01372da5c383cd3172999ea1dcf95a7862ce
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: therootlord <igor_cestari@hotmail.com>
[kdrag0n: added missing sched_feat(ENERGY_AWARE) check]
Signed-off-by: kdrag0n <dragon@khronodragon.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
The CPU selection process for a prefer_idle task either minimizes or
maximizes the CPU capacity for idle CPUs depending on the task being
boosted or not.
Given that we are iterating through all CPUs, additionally filter the
choice by preferring a CPU in a more shallow idle state. This will
provide both a faster wake-up for the task and higher energy efficiency,
by allowing CPUs in deeper idle states to remain idle.
Change-Id: Ic55f727a0c551adc0af8e6ee03de6a41337a571b
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
The CPU selection for non-latency-sensitive tasks targets an active
CPU in the little cluster. The shallowest c-state CPU is stored as
a backup. However, if all CPUs in the little cluster are idle, we
pick an active CPU in the BIG cluster as the target CPU. This incorrect
choice of the target CPU may not get corrected by the
select_energy_cpu_idx() depending on the energy difference between
previous CPU and target CPU.
This can be fixed easily by maintaining the same variable that tracks
maximum capacity of the traversed CPU for both idle and active CPUs.
Change-Id: I3efb8bc82ff005383163921ef2bd39fcac4589ad
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Given that we have a few sites where the spare capacity of a CPU is
calculated as the difference between the original capacity of the CPU
and its computed new utilization, let's unify the calculation and use
that value tracked with a local spare_cap variable.
Change-Id: I78daece7543f78d4f74edbee5e9ceb62908af507
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
For !prefer_idle tasks we want to minimize capacity_orig to bias their
scheduling towards more energy efficient CPUs. This does not happen in
the current code for boosted tasks due to the order of CPUs considered
(from big CPUs to LITTLE CPUs), and to the shallow idle state and
spare capacity maximization filters, which are used to select the best
idle backup CPU and the best active CPU candidates.
Let's fix this by enabling the above filters only when we are within
same capacity CPUs.
Taking each of the two cases in turn:
1. Selection of a backup idle CPU - Non prefer_idle tasks should prefer
more energy efficient CPUs when there are idle CPUs in the system,
independent of the order given by the presence of a boosted margin.
This is the behaviour for !sysctl_sched_cstate_aware, and it should
be the behaviour when sysctl_sched_cstate_aware is set as well,
given that we should prefer a more efficient CPU even if it's in a
deeper idle state.
2. Selection of an active target CPU: There is no reason for boosted
tasks to benefit from a higher chance of being placed on a big CPU,
which is what ordering CPUs from bigs to littles provides.
The other mechanism in place for boosted tasks (making sure we
select a CPU that fits the task) is enough for the non latency
sensitive case. Also, by choosing a CPU with maximum spare capacity
we also cover the preference towards spreading tasks, rather than
packing them, which improves the chances for tasks to get better
performance due to potential reduced preemption. Therefore, prefer
more energy efficient CPUs and only consider spare capacity for CPUs
with equal capacity_orig.
Change-Id: I3b97010e682674420015e771f0717192444a63a2
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
find_best_target is currently split into code handling latency sensitive
tasks and code handling non-latency sensitive tasks based on the value
of the prefer_idle flag.
Another differentiation is done for boosted tasks, preferring to start
with higher-capacity CPU when boosted, and with more efficient CPUs when
not boosted. This additional differentiation is obtained by imposing an
order when considering CPUs for selection. This order is determined in
typical big.LITTLE systems by the start point (the CPU with the maximum
or minimum capacity) and by the order of big and little CPU groups
provided in the sched domain hierarchy.
However, it's not guaranteed that the sched domain hierarchy will give
us a sorted list of CPU groups based on their maximum capacities when
dealing with systems with more than 2 capacity groups.
For example, if we consider a system with three groups of CPUs (LITTLEs,
mediums, bigs), the sched domain hierarchy might provide the following
scheduling groups ordering for a prefer_idle-boosted task:
big CPUs -> LITTLE CPUs -> medium CPUs.
If the big CPUs are not idle, but there are a few LITTLEs and mediums
as idle CPUs, then by returning the first idle CPU we would be
incorrectly preferring a lower capacity CPU over a higher capacity CPU.
In order to eliminate this reliance on assuming sched groups are ordered
by capacity, let's:
1. Iterate through all candidate CPUs for all cases.
2. Minimise or maximise the capacity of the considered CPU, depending
on prefer_idle and boost information.
Taking each of the four possible cases in turn, we analyse the
implementation and impact of this solution:
1. prefer_idle and boosted
This type of task needs to favour the selection of a reserved idle
CPU, and thus we still start from the biggest CPU in the system, but
we iterate through all CPUs so as to correctly handle the example
above by maximising the capacity of the idle CPU we select. When all
CPUs are active, we already iterate through all CPUs and are able to
maximise spare capacity or minimise utilisation for the considered
target or backup CPU.
2. prefer_idle and !boosted
For these tasks we prefer the selection of a more energy efficient
CPU, and therefore we start from the smallest CPUs in the system, but
we iterate through all the CPUs so as to select the most energy
efficient idle CPU, an implementation which mimics existing
behaviour. When all CPUs are active, we already iterate through all
CPUs and are able to maximise spare capacity or minimise utilisation
for the considered target or backup CPU.
3. !prefer_idle and boosted, and
4. !prefer_idle and !boosted
For these tasks we already iterate through all CPUs and are able to
maximise the energy efficiency of the selected CPU.
Change-Id: I940399e22eff29453cba0e2ec52a03b17eec12ae
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
The cumulative runnable average is maintained in the cfs_rq along
with the rq, so that when a cfs_rq is throttled/unthrottled, the
contribution of that cfs_rq can be updated at the rq level. Implement
the fixup_cumulative_runnable_avg callback for the fair class to
handle the cfs_rq cumulative runnable average updates when the
runnable tasks' demand changes.
Bug: 139071966
Change-Id: Iccd473677cf491920aa82a6fc7e0a5374e5bb27f
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>