Instead of adding weird retry logic in that function, use
__GFP_NOFAIL to ensure that the VM takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
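A minimal sketch of the result (assuming the GFP_NOFS context of the
buffer layer; not the verbatim diff):

  /* Let the VM retry internally instead of looping here. */
  gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
  struct buffer_head *bh = alloc_buffer_head(gfp);
  /* bh is guaranteed non-NULL: with __GFP_NOFAIL the allocator never
   * fails, so the free_more_memory() retry loop can go away. */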
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After a period of intense memory pressure is over, it's common for
vmpressure to still have old reclaim efficiency data accumulated from
that period. When memory pressure starts to rise again, this stale data
will factor into vmpressure's calculations, and can cause vmpressure to
report an erroneously high pressure. The reverse is possible, too:
vmpressure may report pressures that are erroneously low due to stale
data that's been stored.
Furthermore, since kswapd can still be performing reclaim when there are
no failed memory allocations stuck in the page allocator's slow path,
vmpressure may still report pressures when there aren't any memory
allocations to satisfy. This can cause last-resort memory reclaimers to
kill processes to free memory when it's not needed.
To fix the rampant stale data, keep track of when there are processes
using reclaim in the page allocator's slow path, and reset the
accumulated data in vmpressure when a new period of elevated memory
pressure begins. Extra measures are taken for the kswapd issue mentioned
above: all reclaim efficiency data reported by kswapd is ignored when
there aren't any failed memory allocations in the page allocator that
are using reclaim.
Note that since sr_lock can now be used from IRQ context, IRQs must be
disabled whenever sr_lock is used to prevent deadlocks.
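A rough sketch of the reset, with illustrative names (nr_reclaim_waiters
is hypothetical; scanned/reclaimed stand in for the accumulated
efficiency data):

  static atomic_long_t nr_reclaim_waiters; /* allocators in the slow path */

  static void reclaim_period_begin(struct vmpressure *vmpr)
  {
      unsigned long flags;

      /* The first waiter marks a new period of elevated pressure,
       * so discard the stale efficiency data. sr_lock must be taken
       * IRQ-safely, per the note above. */
      if (atomic_long_inc_return(&nr_reclaim_waiters) == 1) {
          spin_lock_irqsave(&vmpr->sr_lock, flags);
          vmpr->scanned = vmpr->reclaimed = 0;
          spin_unlock_irqrestore(&vmpr->sr_lock, flags);
      }
  }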
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The direct reclaim vmpressure path was erroneously excluded from the
PAGE_ALLOC_COSTLY_ORDER check which was added in commit "mm: vmpressure:
Ignore allocation orders above PAGE_ALLOC_COSTLY_ORDER".
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Hard-coding adj ranges to search for victims results in a few problems.
Firstly, the hard-coded adjs must be vigilantly updated to match what
userspace uses, which makes long-term support a headache. Secondly, a
full traversal of every running process must be done for each adj range,
which can turn out to be quite expensive, especially if userspace
assigns many different adj values and we want to enumerate them all.
This leads us to the final problem, which is that processes with
different adjs within the same hard-coded adj range will be treated the
same, even though they're not: the process with a higher adj is less
important, and the process with a lower adj is more important. This
could be fixed by enumerating every possible adj, but again, that would
necessitate several scans through the active process list, which is bad
for performance, especially since latency is critical here.
Since adjs are only 16 bits, and we only care about positive adjs, that
leaves us with 15 bits of the adj that matter. This is a relatively
small number of potential adjs (32,768), which makes it possible to
allocate a static array that's indexed using the adj. Each entry in this
array is a pointer to the first task_struct in a singly-linked list of
task_structs sharing an adj. A `simple_lmk_next` member is added to
task_struct to accommodate this linked list. The victim finder now
iterates downward through the array searching for linked lists of tasks,
starting from the highest adj found, so that the lowest-priority
processes are always considered first for reclaim. This fixes all of the
problems mentioned above, and now there is only one traversal through
every running process. The array itself only takes up 256 KiB of memory
on 64-bit, which is a very small price to pay for the advantages gained.
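A condensed sketch of the scheme (names mirror the description above;
not the verbatim patch):

  #define NR_ADJS (1 << 15) /* positive half of the 16-bit adj space */

  /* Head of a singly-linked list of tasks per adj; 32,768 pointers,
   * i.e. 256 KiB on 64-bit. */
  static struct task_struct *adj_tasks[NR_ADJS];

  /* One traversal buckets every running process by its adj
   * (under tasklist protection). */
  for_each_process(p) {
      short adj = p->signal->oom_score_adj; /* illustrative adj source */

      if (adj < 0)
          continue;
      p->simple_lmk_next = adj_tasks[adj];
      adj_tasks[adj] = p;
  }
  /* The victim finder then walks adj_tasks[] downward from the highest
   * adj found, so the least important tasks are considered first. */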
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The page allocator wakes all kswapds in an allocation context's allowed
nodemask in the slow path, so it doesn't make sense to have the kswapd-
waiter count per NUMA node. Instead, it should be a global counter used
to stop all kswapds when there are no failed allocation requests.
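Conceptually (a sketch; the counter name is illustrative):

  /* One global count of allocators stuck in the slow path, replacing
   * the per-node counters. */
  static atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0);

  /* Each kswapd checks it to decide whether to keep running: */
  if (!atomic_long_read(&kswapd_waiters))
      goto sleep; /* no failed allocation requests left to satisfy */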
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
PAGE_ALLOC_COSTLY_ORDER allocations can cause vmpressure to incorrectly
think that memory pressure is high, when it's really just that the
allocation's high order is difficult to satisfy. When this rare scenario
occurs, ignore the input to vmpressure to avoid sending out a spurious
high-pressure signal.
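The gist is an early return in the vmpressure path (a sketch; how the
allocation order reaches the check is an implementation detail):

  /* Costly orders are hard to satisfy even when memory is plentiful,
   * so don't let them feed the pressure calculation. */
  if (order > PAGE_ALLOC_COSTLY_ORDER)
      return;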
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using kswapd's scan depth to trigger task kills is inconsistent and
unreliable. When memory pressure quickly spikes, the kswapd scan depth
trigger fails to kick off Simple LMK fast enough, causing severe lag.
Additionally, kswapd could stop scanning prematurely before reaching the
desired scan depth to trigger Simple LMK, which could also cause stalls.
To remedy this, use the vmpressure framework instead, since it provides
more consistent and accurate readings on memory pressure. This is not
very tunable though, so remove CONFIG_ANDROID_SIMPLE_LMK_AGGRESSION.
Triggering Simple LMK to kill when the reported memory pressure is 100
should yield good results on all setups.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.
This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and the desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
commit 5c4e0a21fae877a7ef89be6dcc6263ec672372b8 upstream.
When building m68k:allmodconfig, recent versions of gcc generate the
following error if the length of UTS_RELEASE is less than 8 bytes.
  In function 'memcpy_and_pad',
      inlined from 'nvmet_execute_disc_identify' at
      drivers/nvme/target/discovery.c:268:2:
  arch/m68k/include/asm/string.h:72:25: error:
      '__builtin_memcpy' reading 8 bytes from a region of size 7
Discussions around the problem suggest that this only happens if an
architecture does not provide strlen(), if -ffreestanding is provided as
compiler option, and if CONFIG_FORTIFY_SOURCE=n. All of this is the case
for m68k. The exact reasons are unknown, but seem to be related to the
ability of the compiler to evaluate the return value of strlen() and
the resulting execution flow in memcpy_and_pad(). It would be possible
to work around the problem by using sizeof(UTS_RELEASE) instead of
strlen(UTS_RELEASE), but that would only postpone the problem until the
function is called in a similar way. Uninline memcpy_and_pad() instead
to solve the problem for good.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Change-Id: I21516b6de0b5f3d8af30ebbbfcac2d4a495658ac
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alexander Grund <theflamefire89@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
commit 1359798f9d4082eb04575efdd19512fbd9c28464 upstream.
The way I'd implemented the new helper memcpy_and_pad with
__FORTIFY_INLINE caused compiler warnings for certain kernel
configurations.
This helper is only used in a single place at this time, and thus
doesn't benefit much from fortification. So simplify the code
by dropping fortification support for now.
Fixes: 01f33c336e2d ("string.h: add memcpy_and_pad()")
Change-Id: I8bb1ec4490e27d450ba2042074d6f228b102462a
Signed-off-by: Martin Wilck <mwilck@suse.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexander Grund <theflamefire89@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
commit 01f33c336e2d298ea5d4ce5d6e5bcd12865cc30f upstream.
This helper function is useful for the nvme subsystem, and maybe
others.
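For reference, the helper copies count bytes from src into dest and
fills any remaining space with the pad byte; its semantics are roughly:

  void memcpy_and_pad(void *dest, size_t dest_len,
                      const void *src, size_t count, int pad)
  {
      if (dest_len > count) {
          memcpy(dest, src, count);
          memset(dest + count, pad, dest_len - count);
      } else {
          memcpy(dest, src, dest_len);
      }
  }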
Note: the warnings reported by the kbuild test robot for this patch
are actually generated by the use of CONFIG_PROFILE_ALL_BRANCHES
together with __FORTIFY_INLINE.
Change-Id: I5f7e1e9143ce9df88af0afd02aef971d5172bd3e
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[AG: Backported to 4.4]
Signed-off-by: Alexander Grund <theflamefire89@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
The implementation is utterly broken, resulting in all processes being
allowed to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only a
capability anyway, so let's get rid of this code.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Free up the BPF_JMP | BPF_CALL | BPF_X opcode to be used by an actual
indirect call by register, and use a kernel-internal opcode to
mark the call instruction into the bpf_tail_call() helper.
Change-Id: I1a45b8e3c13848c9689ce288d4862935ede97fa7
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
that a single __weak function in the core that can be overridden
similarly to the eBPF one. Also remove stale pr_err() mentions
of bpf_jit_compile.
Change-Id: Iac221c09e9ae0879acdd7064d710c4f7cb8f478d
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Change-Id: Ic0f16cf514773f473705d48c787527f910943f1a
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This is trivial to do:
- add a flags argument to simple_rename()
- check that flags contains nothing other than RENAME_NOREPLACE
- assign simple_rename() to .rename2 instead of .rename (see the sketch
  below)
Filesystems converted:
hugetlbfs, ramfs, bpf.
Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access. For this case pass zero flags to simple_rename().
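A sketch of the converted helper (abridged; the body after the flags
check is unchanged):

  int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
                    struct inode *new_dir, struct dentry *new_dentry,
                    unsigned int flags)
  {
      if (flags & ~RENAME_NOREPLACE)
          return -EINVAL;
      /* ... existing rename logic ... */
  }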
Change-Id: I1a46ece3b40b05c9f18fd13b98062d2a959b76a0
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Disabling preemption is only required on 32-bit systems.
Skip toggling preemption on 64-bit for better real-time performance.
Change-Id: I02b3be0c62387b184267683da7fcdc740d0ecffe
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Most dentry allocations exceed 32 bytes.
Increase it by 192 bytes to accommodate larger allocation requests.
This still ensures 64-byte cacheline alignment.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This allows other kernel code to directly call drop_caches.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.
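The new helper boils down to the following (a sketch; the exact flag set
reflects this era of the kernel):

  static inline gfp_t readahead_gfp_mask(struct address_space *x)
  {
      return mapping_gfp_mask(x) |
             __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
  }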
This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well. The same applies to read_cache_pages.
ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Upgrade to the latest upstream zstd version 1.4.10.
This patch is 100% generated from upstream zstd commit 20821a46f412 [0].
This patch is very large because it is transitioning from the custom
kernel zstd to using upstream directly. The new zstd follows upstream's
file structure, which is different. Future update patches will be much
smaller because they will only contain the changes from one upstream
zstd release.
As an aid for review I've created a commit [1] that shows the diff
between upstream zstd as-is (which doesn't compile), and the zstd
code imported in this patch. The version of zstd in this patch is
generated from upstream with changes applied by automation to replace
upstream's libc dependencies, remove unnecessary portability macros,
replace `/**` comments with `/*` comments, and use the kernel's xxhash
instead of bundling it.
The benefits of this patch are as follows:
1. Using upstream directly with automated script to generate kernel
code. This allows us to update the kernel every upstream release, so
the kernel gets the latest bug fixes and performance improvements,
and doesn't get 3 years out of date again. The automation and the
translated code are tested every upstream commit to ensure it
continues to work.
2. Upgrades from a custom zstd based on 1.3.1 to 1.4.10, getting 3 years
of performance improvements and bug fixes. On x86_64 I've measured
15% faster BtrFS and SquashFS decompression+read speeds, 35% faster
kernel decompression, and 30% faster ZRAM decompression+read speeds.
3. Zstd-1.4.10 supports negative compression levels, which allow zstd to
match or subsume lzo's performance.
4. Maintains the same kernel-specific wrapper API, so no callers have to
be modified with zstd version updates.
One concern that was brought up was stack usage. Upstream zstd had
already removed most of its heavy stack usage functions, but I just
removed the last functions that allocate arrays on the stack. I've
measured the high water mark for both compression and decompression
before and after this patch. Decompression is approximately neutral,
using about 1.2KB of stack space. Compression levels up to 3 regressed
from 1.4KB -> 1.6KB, and higher compression levels regressed from 1.5KB
-> 2KB. We've added unit tests upstream to prevent further regression.
I believe that this is a reasonable increase, and if it does end up
causing problems, this commit can be cleanly reverted, because it only
touches zstd.
I chose the bulk update instead of replaying upstream commits because
there have been ~3500 upstream commits since the 1.3.1 release, zstd
wasn't ready to be used in the kernel as-is before a month ago, and not
all upstream zstd commits build. The bulk update preserves bisectablity
because bugs can be bisected to the zstd version update. At that point
the update can be reverted, and we can work with upstream to find and
fix the bug.
Note that upstream zstd release 1.4.10 doesn't exist yet. I have cut a
staging branch at 20821a46f412 [0] and will apply any changes requested
to the staging branch. Once we're ready to merge this update I will cut
a zstd release at the commit we merge, so we have a known zstd release
in the kernel.
The implementation of the kernel API is contained in
zstd_compress_module.c and zstd_decompress_module.c.
[0] 20821a46f4
[1] e0fa481d0e
Signed-off-by: Nick Terrell <terrelln@fb.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
This patch:
- Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h`
- Updates modified zstd headers to yearless copyright
- Adds a new API in `include/linux/zstd.h` that is functionally
equivalent to the in-use subset of the current API. Functions are
renamed to avoid symbol collisions with zstd, to make it clear it is
not the upstream zstd API, and to follow the kernel style guide.
- Updates all callers to use the new API.
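As an illustration of the rename pattern, a hypothetical caller (the
exact signatures are defined in the new include/linux/zstd.h):

  /* before: upstream-style name, parameters passed by value */
  size_t ret = ZSTD_compressCCtx(cctx, dst, dst_len, src, src_len, params);

  /* after: kernel-style name with a zstd_ prefix that avoids symbol
   * collisions with upstream zstd */
  size_t ret = zstd_compress_cctx(cctx, dst, dst_len, src, src_len, &params);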
There are no functional changes in this patch. Since there are no
functional changes, I felt it was okay to update all the callers in a
single patch; once the API is approved, the callers can be changed
mechanically.
This patch is preparing for the 3rd patch in this series, which updates
zstd to version 1.4.10. Since the upstream zstd API is no longer exposed
to callers, the update can happen transparently.
Signed-off-by: Nick Terrell <terrelln@fb.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Some drivers need to know the status of the interrupt line. This is
especially true for drivers that register a handler with
IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING; in the handler they need to
know which edge transition it was invoked for. Provide a way for these
drivers to read the logical status of the line from within the handler.
If the line reads high the handler was called for a rising edge, and if
the line reads low it was called for a falling edge.
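For example, a handler registered for both edges might do the following
(a sketch; the edge helpers are hypothetical):

  static irqreturn_t my_gpio_handler(int irq, void *dev_id)
  {
      /* irq_read_line() returns the logical state of the line. */
      if (irq_read_line(irq))
          handle_rising_edge(dev_id);  /* line reads high */
      else
          handle_falling_edge(dev_id); /* line reads low */
      return IRQ_HANDLED;
  }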
The irq_read_line callback in the chip allows the controller to provide
the real-time status of this line. Controllers that can read the status
of an interrupt line should implement this by doing the necessary
hardware reads and returning the logical state of the line.
Interrupt controllers behind a slow bus should conduct the bus
transaction in this callback. The genirq code will take the chip's bus
lock prior to calling irq_read_line. Since the transaction completes
before irq_read_line returns, no further transaction is needed in the
bus unlock call.
Change-Id: I3c8746706530bba14a373c671d22ee963b84dfab
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
SDM660 doesn't really have a clean way of populating the OPP without
HMP, so this should bypass that requirement.
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Make iowait boost a cpufreq policy option and enable it for intel_pstate
cpufreq driver. Governors like schedutil can use it to determine if
boosting for tasks that wake up with p->in_iowait set is needed.
Bug: 38010527
Link: https://lkml.org/lkml/2017/5/19/43
Change-Id: Icf59e75fbe731dc67abb28fb837f7bb0cd5ec6cc
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Switch from a counter-based system to a slot-based system for managing
multiple dynamic Schedtune boost requests.
The primary limitation of the counter-based system was that it could
only keep track of two boost values at a time: the current dynamic boost
value and the default boost value. When more than one boost request was
issued, the system would only remember the highest value of them all.
Even after the task that requested the highest value had unboosted, that
value was still maintained as long as other active boosts were still
running. The ideal outcome would be for the system to unboost to the
maximum boost value of the remaining active boosts.
The slot-based system provides a solution to the problem by keeping
track of the boost values of all ongoing active boosts. It ensures that
the current boost value will be equal to the maximum boost value of
all ongoing active boosts. This is achieved with two linked lists
(active_boost_slots and available_boost_slots), which assign and keep
track of boost slot numbers for each successful boost request. The boost
value of each request is stored in an array (slot_boost[]), at an index
value equal to the assigned boost slot number.
For now we limit the number of active boost slots to 5 per Schedtune
group.
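In outline, the per-group bookkeeping looks like this (a sketch; names
mirror the description above):

  #define MAX_BOOST_SLOTS 5

  struct boost_slot {
      struct list_head list;
      int idx; /* index into slot_boost[] */
  };

  /* Per-schedtune-group state: */
  struct list_head active_boost_slots;    /* slots holding live requests */
  struct list_head available_boost_slots; /* free slot numbers */
  int slot_boost[MAX_BOOST_SLOTS];        /* boost value per slot */

  /* On every boost or unboost, the applied boost becomes
   * max(slot_boost[i]) over all active slots. */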
Change-Id: Iadc738fc919af092fd4c1b6312becf9567bc4c62
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
To reflect that the function is to be used mainly with CAF's devices
that have sched_boost. However, developers may use it as a switch to
dynamically boost schedtune to the values specified in
/dev/stune/*/schedtune.sched_boost.
Change-Id: I5012273e5572c6091a99a6954452bed3a2501c55
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
We will use this in conjunction with CAF's perf daemon to somewhat
replicate core_ctl's sched_boost capabilities.
Credits to the developers at Codeaurora for the code.
Change-Id: Ifc4f76e02eed97ac2c5fc8c9a60e56c09aed6578
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Add a simple function to activate Dynamic Schedtune Boost and use the
dynamic_boost value of the SchedTune CGroup.
Change-Id: I106c1ad169419a575df400fc511b4be046b52152
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Provide functions to activate and reset SchedTune boost:
int do_stune_boost(char *st_name, int boost);
int reset_stune_boost(char *st_name);
Change-Id: Id3f93a63b7a94a08b124cb304bc0ffe9cc889d7a
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
One SoC can have multiple CPU speedbins, which cannot be represented
with the current energy model due to its fixed capacity per CPU
frequency step.
Provide all of the CPU's possible frequency steps, along with the
corresponding energy costs, instead of capacities, to be able to support
different speedbins.
Change-Id: I96ff01372da5c383cd3172999ea1dcf95a7862ce
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: therootlord <igor_cestari@hotmail.com>
[kdrag0n: added missing sched_feat(ENERGY_AWARE) check]
Signed-off-by: kdrag0n <dragon@khronodragon.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
When the minimum frequency available to a policy is modified either
by userspace or by external actors setting a minimum frequency for a
voltage domain to control behaviour of some connected component, we
expect that the cpufreq policy will be updated to reflect this.
If we wish to use this information to guide energy estimation and
scheduling decisions, we need to track it.
Implement cpufreq_scale_min_freq_capacity() to provide the scheduler
with a minimum frequency scaling correction factor for more accurate
cpu capacity information.
This scaling factor describes the influence of running a cpu with a
current policy minimum frequency higher than the minimum possible
frequency.
The factor is:
(current_min_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_freq(cpu)
This factor is computed in scale_min_freq_capacity and returned,
per cpu, in cpufreq_scale_min_freq_capacity.
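In code, the factor amounts to the following (a sketch using the
standard cpufreq policy fields; the real setter stores the result per
cpu):

  unsigned long scale = (policy->min << SCHED_CAPACITY_SHIFT)
                        / policy->cpuinfo.max_freq;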
Change-Id: I66237025a7c0bce6bfd6e973ea22b8d3f6c41827
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Implements the Max Frequency Capping Engine (MFCE) getter function
cpufreq_scale_max_freq_capacity() to provide the scheduler with a
maximum frequency scaling correction factor for more accurate cpu
capacity handling by being able to deal with max frequency capping.
This scaling factor describes the influence of running a cpu with a
current maximum frequency (policy) lower than the maximum possible
frequency (cpuinfo).
The factor is:
(policy_max_freq(cpu) << SCHED_CAPACITY_SHIFT) / cpuinfo_max_freq(cpu)
It also implements the MFCE setter function scale_max_freq_capacity()
which is called from cpufreq_set_policy().
Change-Id: I38ef736cfa587520cf4f97012be25cbb0c5af04d
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
Energy cost estimation has been a long-lasting challenge for WALT
because WALT guides CPU frequency based on the CPU utilization of the
previous window. Consequently, it's not possible to know a newly
waking task's energy cost until the end of WALT's current window.
WALT already tracks the 'Previous Runnable Sum' (prev_runnable_sum)
and the 'Cumulative Runnable Average' (cr_avg). They are designed for
CPU frequency guidance and task placement, but unfortunately neither
is suitable for energy cost estimation.
Using prev_runnable_sum for energy cost calculation would make us
account CPU and task energy solely based on activity in the previous
window, so, for example, any task that had no activity in the previous
window would be accounted as a 'zero energy cost' task.
Energy estimation with cr_avg is what energy_diff() relies on at present.
However, cr_avg can only represent an instantaneous picture of the
energy cost; for example, if a CPU was fully occupied for an entire WALT
window but became idle just before the window boundary, then upon a
wake-up energy_diff() accounts that CPU as a 'zero energy cost' CPU.
As a result, introduce a new accounting unit, 'Cumulative Window
Demand'. The cumulative window demand tracks the demands of all the
tasks seen in the current window, which is neither instantaneous nor
actual execution time. Because a task's demand represents its estimated,
scaled execution time were it to run a full window, the accumulation of
all demands represents the predicted CPU load at the end of the window.
Thus we can estimate the CPU's frequency at the end of the current WALT
window with the cumulative window demand.
The use of prev_runnable_sum for CPU frequency guidance and of cr_avg
for task placement has not changed; they will continue to serve those
purposes, while this patch adds an additional statistic.
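A sketch of the accounting (names are illustrative of the description
above):

  /* Add an enqueued task's demand to the current window's total. */
  static inline void walt_inc_cum_window_demand(struct rq *rq,
                                                struct task_struct *p)
  {
      rq->cum_window_demand += p->ravg.demand;
  }

  /* At the window boundary, cum_window_demand predicts the CPU load,
   * and hence the frequency, at the end of the window. */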
Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>