26318 Commits

Author SHA1 Message Date
Wei Wang
6400cd3b94 sched: restrict iowait boost to tasks with prefer_idle
Currently iowait boost doesn't distinguish background/foreground tasks and
we have seen cases where a device runs at high frequency unnecessarily
when running some background I/O. This patch limits iowait boost to tasks with
prefer_idle only. Specifically, on Pixel, those are foreground and top
app tasks.

Bug: 130308826
Test: Boot and trace
Change-Id: I2d892beeb4b12b7e8f0fb2848c23982148648a10
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Lau <laststandrighthere@gmail.com>
2024-08-15 08:22:43 +05:30
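A minimal sketch of the gating described above; `struct task` and `should_iowait_boost()` are hypothetical stand-ins for the kernel's structures, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the gating above: iowait boost is requested
 * only when the waking task also has prefer_idle set (foreground and
 * top-app tasks on Pixel). Names are hypothetical, not the real
 * kernel structures. */
struct task {
    bool in_iowait;   /* task is waking up from I/O wait */
    bool prefer_idle; /* set for foreground/top-app tasks */
};

static bool should_iowait_boost(const struct task *p)
{
    /* background I/O (prefer_idle == false) no longer drives the CPU
     * to a high frequency on I/O completion */
    return p->in_iowait && p->prefer_idle;
}
```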
Maria Yu
b5c22baa21 sched: core: Clear walt rq request in cpu starting
Clear walt rq request in cpu starting.

Change-Id: Id3004337f3924984b8b812151a6ba01c6f1c013e
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 32df8f93e147dd54331161e9180d7ea488b750f9)
2024-08-15 08:22:18 +05:30
Pavankumar Kondeti
74a8607aa7 sched/walt: Fix the memory leak of idle task load pointers
The memory for task load pointers is allocated twice for each
idle thread except for the boot CPU. This happens during boot
from idle_threads_init()->idle_init() in the following 2 paths.

1. idle_init()->fork_idle()->copy_process()->
		sched_fork()->init_new_task_load()

2. idle_init()->fork_idle()-> init_idle()->init_new_task_load()

The memory allocation for all tasks happens through the 1st path,
so use the same for idle tasks and kill the 2nd path. Since
the idle thread of boot CPU does not go through fork_idle(),
allocate the memory for it separately.

Change-Id: I4696a414ffe07d4114b56d326463026019e278f1
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit eb58f47212c9621be82108de57bcf3e94ce1035a)
2024-08-15 07:11:04 +05:30
DhineshCool
c0dd3261ad Revert "sched: Do not reduce perceived CPU capacity while idle"
This reverts commit 20dfb57cb1.
2024-08-15 06:33:57 +05:30
DhineshCool
f99e24746b Revert "cpufreq: schedutil: Enforce realtime priority"
This reverts commit 970b81bf75.
2024-08-15 06:17:11 +05:30
Maria Yu
d6631fffef sched/fair: Consider others if target cpu overutilized
If the target cpu is overutilized, it's better to consider
cpus in other groups. This avoids a task unnecessarily waiting
on an overutilized cpu until load balance finally moves it.

Change-Id: I6f8bccb611d2f11471254cf2795fb5bf3f122292
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit b9f8fdc34eeb61fcc7c770b6277a83fd30fc7d8e)
2024-08-13 23:40:43 +05:30
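A rough illustration of the "overutilized" test implied above; the ~20% capacity margin is an assumption borrowed from common EAS code, not necessarily this tree's value:

```c
#include <assert.h>

/* Hypothetical overutilization check: a cpu is overutilized when its
 * utilization exceeds ~80% of its capacity (margin value assumed). A
 * task placed on such a cpu would likely wait for a later load
 * balance, so cpus in other groups should be considered instead. */
static int cpu_overutilized(unsigned long util, unsigned long capacity)
{
    return util * 1280 > capacity * 1024; /* util > capacity * 0.8 */
}
```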
Chris Redpath
9314b62205 FROMLIST: sched/fair: Don't move tasks to lower capacity cpus unless necessary
When lower capacity CPUs are load balancing and considering pulling
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Change-Id: Ib86570abdd453a51be885b086c8d80be2773a6f2
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Git-commit: 07e7ce6c8459defc34e63ae0f0334e811d223990
Git-repo: https://android.googlesource.com/kernel/common/
[clingutla@codeaurora.org: Resolved merge conflicts.]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 779459e3fffda001181cfd6b1be2ffd3da25002c)
2024-08-13 23:40:43 +05:30
Joonwoo Park
ef01112e02 sched: ceil idle index to prevent from out of bound accessing
It's possible that the size of the given idle cost index is smaller
than the CPU's possible idle index size.  Ceil the CPU's idle index to
prevent out-of-bounds access.

Change-Id: Idecb4f68758dd0183886ea74d0e9da3d236b0062
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit ecedc7afd841c8d7ef0145924620304608d269ef)
2024-08-13 23:40:42 +05:30
Joonwoo Park
12312cb361 sched: prevent out of bound access in sched_group_energy()
group_idle_state() can return INT_MAX + 1, which is undefined behaviour,
when there are no CPUs in the sched_group.  Prevent this by handling the
error case correctly.

Change-Id: If9796c829c091e461231569dc38c5e5456f58037
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
[clingutla@codeaurora.org: Fixed trivial merge conflicts and squashed
  msm-4.14 change]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit bb5b0e61527011e4ebfc4058713a9068da9e7492)
2024-08-13 23:40:42 +05:30
Maria Yu
57d6066272 cpufreq: schedutil: Queue sugov irq work on policy online cpu
The frequency is never updated if the sugov irq work is scheduled on
an offlined cpu: the work stays pending forever. Queue the sugov irq
work on an online cpu of the policy if the current
cpu is offline.

Change-Id: I33fc691917b5866488b6aeb11ed902a2753130b2
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 1d2db9ab99a9abd0d9dcb320e6e0d266e21884f9)
2024-08-13 23:40:42 +05:30
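The fallback above can be sketched like this, with a hypothetical 4-cpu policy modelled as a plain array (not the kernel's cpumask/irq_work API):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the fallback above: if the current cpu is offline, queue
 * the sugov irq work on any online cpu of the policy so it cannot
 * stay pending forever. POLICY_CPUS and the online[] array are
 * illustrative stand-ins for the policy's cpumask. */
#define POLICY_CPUS 4

static int pick_irq_work_cpu(int this_cpu, const bool online[POLICY_CPUS])
{
    if (online[this_cpu])
        return this_cpu;        /* common case: queue on the local cpu */
    for (int cpu = 0; cpu < POLICY_CPUS; cpu++)
        if (online[cpu])
            return cpu;         /* fall back to any online policy cpu */
    return -1;                  /* whole policy offline: nothing to do */
}
```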
Maria Yu
aa4a0a2807 sched/walt: Avoid walt irq work in offlined cpu
Avoid walt irq work in offlined cpu.

Change-Id: Ia4410562f66bfa57daa15d8c0a785a2c7a95f2a0
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 702cec976c863388c784eff37a71fa3ee8bb84d7)
2024-08-13 23:40:42 +05:30
Pavankumar Kondeti
37a5c34f00 Revert "sched: Remove sched_ktime_clock()"
This reverts 'commit 24c18127e9 ("sched: Remove sched_ktime_clock()")'

WALT accounting uses ktime_get() as time source to keep windows in
align with the tick. ktime_get() API should not be called while the
timekeeping subsystem is suspended during the system suspend. The
code before the reverted patch has a wrapper around ktime_get() to
avoid calling ktime_get() when timekeeping subsystem is suspended.

The reverted patch removed this wrapper with the assumption that there
will not be any scheduler activity while timekeeping subsystem is
suspended. The timekeeping subsystem is resumed very early even before
non-boot CPUs are brought online. However, it is possible that tasks
can wake up from the idle notifiers, which get called before the
timekeeping subsystem is resumed.

When this happens, the time read from ktime_get() will not be consistent.
We see a jump from the values that would be returned later when timekeeping
subsystem is resumed. The rq->window_start update happens with incorrect
time. This rq->window_start becomes inconsistent with the rest of the
CPUs' rq->window_start and wallclock time after the timekeeping subsystem is
resumed. This results in WALT accounting bugs.

Change-Id: I9c3b2fb9ffbf1103d1bd78778882450560dac09f
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit faa04442e7a31357724dbb8e49ba64372ef37862)
2024-08-13 23:40:42 +05:30
Pavankumar Kondeti
e8e661152f sched/fair: Fix redundant load balancer reattempt due to LBF_ALL_PINNED
The LBF_ALL_PINNED flag should be cleared in can_migrate_task() if the
task can run on the destination CPU during load balance. In the current
code, can_migrate_task() returns without clearing this flag
when the task can't be migrated to the destination CPU due to
cumulative window demand constraints. Since the LBF_ALL_PINNED flag
is left set, the load balancer thinks that none of the tasks running
on the busiest group can be migrated to the destination CPU due
to affinity settings and tries to find another busiest group. Prevent
this incorrect reattempt of load balance by clearing the LBF_ALL_PINNED
flag right after the task affinity check in can_migrate_task().

Change-Id: Iad1cf42b1aaf70106ee5ecfbd9499ccb6eb7497e
[clingutla@codeaurora.org: Resolved merge conflicts]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit 5ee367fc9386d4e36af644942d9d10f97827bab1)
2024-08-13 23:40:41 +05:30
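The ordering fix above can be sketched as follows; the flag value and the `demand_fits` parameter (standing in for the window-demand constraint) are illustrative, not the actual fair.c code:

```c
#include <assert.h>

/* Sketch of the ordering fix above: clear LBF_ALL_PINNED right after
 * the affinity check succeeds, so a later rejection (here a simplified
 * stand-in for the cumulative window demand check) does not make the
 * balancer believe every task is pinned. Flag value is illustrative. */
#define LBF_ALL_PINNED 0x01

static int can_migrate_task(int allowed_on_dst, int demand_fits,
                            unsigned int *lb_flags)
{
    if (!allowed_on_dst)
        return 0;                 /* truly pinned: leave the flag set */

    *lb_flags &= ~LBF_ALL_PINNED; /* at least one task is movable */

    if (!demand_fits)
        return 0;                 /* rejected, but not "all pinned" */
    return 1;
}
```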
Maria Yu
a01e3aaaff sched/fair: Avoid unnecessary active load balance
When finding the busiest group, load balance is avoided if only one
task is running on the src cpu. However, when different cpus do a
newly idle load balance at the same time, they can race; check the
src cpu's nr_running again to avoid an unnecessary active load
balance.
See the race condition example here:
  1) cpu2 has 2 tasks, so cpu2's rq->nr_running == 2 and
     cfs.h_nr_running == 2.
  2) cpu4 and cpu5 do a newly idle load balance at the same time.
  3) cpu4 and cpu5 both see cpu2's group stats with sum_nr_running == 2,
     so they both pick cpu2 as the busiest rq.
  4) cpu5 successfully migrates a task from cpu2, so cpu2 has only 1
     task left: rq->nr_running == 1 and cfs.h_nr_running == 1.
  5) cpu4 takes the no_move path because cpu2 now only has 1 task,
     which is currently running.
  6) cpu4 then goes on to check whether cpu2 needs active load balance.

Change-Id: Ia9539a43e9769c4936f06ecfcc11864984c50c29
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit fc61703628de002e2a5bf88e09933dbc3552d156)
2024-08-13 23:40:41 +05:30
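The recheck at step 6 reduces to a simple condition; a toy sketch under the assumption that the busiest rq's nr_running is re-read at that point:

```c
#include <assert.h>

/* Sketch of the recheck above: before kicking active load balance,
 * re-read the busiest rq's nr_running, since a concurrent newly idle
 * balance on another cpu may already have migrated the task away. */
static int still_needs_active_balance(int busiest_nr_running)
{
    /* only one (currently running) task left: nothing can be pulled */
    return busiest_nr_running > 1;
}
```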
Pavankumar Kondeti
9efe3c5438 sched/walt: Fix stale max_capacity issue during CPU hotplug
Scheduler keeps track of the maximum capacity among all online CPUs
in max_capacity. This is useful in checking if a given cluster/CPU
is a max capacity CPU or not. The capacity of a CPU gets updated
when its max frequency is limited by cpufreq and/or thermal. The
CPUfreq limits notifications are received via CPUfreq policy
notifier. However CPUfreq keeps the policy intact even when all
of the CPUs governed by the policy are hotplugged out. So the
CPUFREQ_REMOVE_POLICY notification never arrives and scheduler's
notion of max_capacity becomes stale. The max_capacity may get
corrected at some point later when CPUFREQ_NOTIFY notification
comes for other online CPUs. But when the hotplugged CPUs come back
online, max_capacity is not updated, since CPUFREQ_ADD_POLICY is not
sent by cpufreq.

For example consider a system with 4 BIG and 4 little CPUs. Their
original capacities are 2048 and 1024 respectively. The max_capacity
points to 2048 when all CPUs are online. Now,

1. All 4 BIG CPUs are hotplugged out. Since there is no notification,
the max_capacity still points to 2048, which is incorrect.
2. User clips the little CPUs' max_freq by 50%. CPUFREQ_NOTIFY arrives
and max_capacity is updated by iterating all the online CPUs. At this
point max_capacity becomes 512 which is correct.
3. User removes the above limits of little CPUs. The max_capacity
becomes 1024 which is correct.
4. Now, BIG CPUs are hotplugged in. Since there is no notification,
the max_capacity still points to 1024, which is incorrect.

Fix this issue by wiring the max_capacity updates in WALT to scheduler
hotplug callbacks. Ideally we want cpufreq domain hotplug callbacks
but such notifiers are not present. So the max_capacity update is
forced even when it is not necessary, but that should not be a concern,
because CPU hotplug is supposed to be a rare event.

The scheduler hotplug callbacks happen even before the hotplug CPU is
removed from cpu_online_mask, so use cpu_active() check while evaluating
the max_capacity. Since cpu_active_mask is a subset of cpu_online_mask,
this is sufficient.

Change-Id: I97b1974e2de1a9730285715858f1ada416d92a7a
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit 3cd81b52aedf6802aaf7b41f3550b1850c7a09a4)
2024-08-13 23:40:41 +05:30
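The recomputation in the hotplug callback can be sketched against the BIG/little example above; the `active[]` array is an illustrative stand-in for cpu_active():

```c
#include <assert.h>

/* Illustrative recomputation from the commit above: walk the active
 * cpus and take the maximum capacity. cpu_active() is modelled by the
 * active[] array, since the dying cpu is still in cpu_online_mask
 * when the scheduler hotplug callback runs. */
#define NR_CPUS 8

static unsigned long compute_max_capacity(const int active[NR_CPUS],
                                          const unsigned long cap[NR_CPUS])
{
    unsigned long max_cap = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (active[cpu] && cap[cpu] > max_cap)
            max_cap = cap[cpu];
    return max_cap;
}
```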
tip-bot for Jacob Shin
2bc84a0ac1 sched/fair: Force balancing on NOHZ balance if local group has capacity
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.

The original commit that adds it:

  fab476228b ("sched: Force balancing on newidle balance if local group has capacity")

explains:

    Under certain situations, such as a niced down task (i.e. nice =
    -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
    on a sched group and kicks away other tasks because of its large
    weight. This leads to sub-optimal utilization of the
    machine. Even though the sched group has capacity, it does not
    pull tasks because sds.this_load >> sds.max_load, and f_b_g()
    returns NULL.

A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".

The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.

Change-Id: I6b0db178c0707603c8fd764fd3e44524c5345241
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 583ffd99d7657755736d831bbc182612d1d2697d
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 3d9aec71e139bce6d592b56afaa30f02c344e80e)
2024-08-13 23:40:41 +05:30
Lingutla Chandrasekhar
70e5add1e9 sched: energy: rebuild sched_domains with actual capacities
During scheduler initialization, sched_domains might have been built
with default capacity values, and the max_{min_}cap_org_cpu's
were updated based on them. After the energy probe is called,
these capacities change, but the max_{min_}cap_org_cpu's
still hold the old values. Using these stale cpus could give the
wrong start_cpu when finding an energy efficient cpu.

So rebuild the sched_domains, which updates all cpus' group capacities
with the actual capacities and builds the domains again, and update
the max_{min_}cap_org_cpus as well.

Change-Id: I07d58bc849de363c5ed8fb743ab98d3fba727130
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 5b2c99599d1dcf79ef7dec93c7935d6fc48869db)
2024-08-13 23:40:41 +05:30
Martin KaFai Lau
a6710190e0 bpf: Refactor codes handling percpu map
Refactor the code that populates the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:22 +05:30
Martin KaFai Lau
84360a36df bpf: Add percpu LRU list
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute.  The map attribute will be added in the later
patch.

While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).

Each percpu LRU is isolated from each other.  The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.

Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
 1 cpus: 2934082 updates
 4 cpus: 7391434 updates
 8 cpus: 6500576 updates

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
  1 cpus: 2896553 updates
  4 cpus: 9766395 updates
  8 cpus: 17460553 updates

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:22 +05:30
Martin KaFai Lau
d15c5c69e6 bpf: LRU List
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.

* General Thoughts:
1. Target use case.  Read is more often than update.
   (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
   If bpf_prog does a bpf_lookup_elem() first and then an in-place
   update, it still counts as a read operation as far as the LRU list
   is concerned.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
   3.1 No lock in read case
   3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wide LRU
   property.  A completely isolated percpu LRU list has the best
   performance but the memory utilization is not ideal considering
   the workload may be imbalanced.
5. Hence, this patch starts the LRU implementation with a global LRU
   list with batched operations before accessing the global LRU list.
   As an LRU cache where #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
   'struct bpf_lru_locallist'.  This local list is not used to sort
   the LRU property.  Instead, the local list is to batch enough
   operations before acquiring the lock of the global LRU list.  More
   details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
   map-attribute for scalability reason and for use cases that need to
   prepare for the worst (and pathological) case like DoS attack.
   The percpu LRU list is completely isolated from each other and the
   LRU nodes (including free nodes) cannot be moved across the list.  The
   following description is for the global LRU list but mostly applicable
   to the percpu LRU list also.

* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
   page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
   global LRU list) at the beginning.  The pre-allocation reasoning
   is similar to the existing BPF_MAP_TYPE_HASH.  However,
   opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
   the LRU map.

* Active/Inactive List (of the global LRU list):
1. The active list, as its name says, maintains the active set of
   the nodes.  We can think of it as the working set or more frequently
   accessed nodes.  The access frequency is approximated by a ref-bit.
   The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says, maintains a less
   active set of nodes.  They are the candidates to be removed
   from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
   The tail of the inactive list is the older nodes and
   should be released first if the bpf_htab needs a free element.

* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
   the global list.
2. The active list is only rotated when the inactive list is running
   low.  This idea is similar to the current page cache.
   Inactive running low is currently defined as
   "# of inactive < # of active".
3. The active list rotation always starts from the tail.  It moves
   nodes without the ref-bit set to the head of the inactive list.
   It moves nodes with the ref-bit set back to the head of the active
   list and then clears their ref-bit.
4. The inactive rotation is pretty simple.
   It walks the inactive list and moves a node back to the head of the
   active list if its ref-bit is set.  The ref-bit is cleared after
   moving to the active list.
   Nodes without the ref-bit set are left as they are,
   because they are already in the inactive list.

* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
   full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail and
   delete the nodes without the ref-bit set from the bpf_htab.
4. If no free node is found after step (3), it will forcefully take
   one node from the tail of the inactive or active list.  "Forcefully"
   means that it ignores the ref-bit.

* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
   batch enough operations before acquiring the lock of the
   global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get a node from the
   free-list of the current CPU's local list.
4. If the local free-list is empty, it will acquire from the
   global LRU list.  The global LRU list can either satisfy it
   by its global free-list or by shrinking the global inactive
   list.  Since we have acquired the global LRU list lock,
   it will try to move at most LOCAL_FREE_TARGET elements
   to the local free list.
5. When a new element is added to the bpf_htab, it will
   first sit on the pending-list (of the local list).
   The pending-list will be flushed to the global LRU list
   when it needs to acquire free nodes from the global list
   next time.

* Lock Consideration:
The LRU list has a lock (lru_lock).  Each bucket of htab has a
lock (buck_lock).  If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.

In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:21 +05:30
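The active-list rotation rule described above boils down to a second-chance decision per node; a toy model, with hypothetical names rather than the actual bpf_lru_list.c types:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the active-list rotation above: rotation starts from
 * the tail of the active list; a node with its ref-bit set goes back
 * to the head of the active list with the bit cleared, a node without
 * it is demoted to the head of the inactive list. */
enum lru_list { LRU_ACTIVE, LRU_INACTIVE };

struct lru_node {
    bool ref; /* set on bpf_lookup_elem(); approximates access frequency */
};

static enum lru_list rotate_active_node(struct lru_node *node)
{
    if (node->ref) {
        node->ref = false;  /* second chance: clear bit, stay active */
        return LRU_ACTIVE;
    }
    return LRU_INACTIVE;    /* cold node: candidate for shrinking */
}
```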
Michal Hocko
d9af72efb8 bpf: do not use KMALLOC_SHIFT_MAX
Commit 01b3f52157 ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.

While this is not incorrect it is not very clean because we already have
KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
KMALLOC_MAX_SIZE instead.

The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".

Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:21 +05:30
Tyler Nijmeh
20dfb57cb1 sched: Do not reduce perceived CPU capacity while idle
CPUs that are idle are excellent candidates for latency sensitive or
high-performance tasks. Decrementing their capacity while they are idle
will result in these CPUs being chosen less, and they will prefer to
schedule smaller tasks instead of large ones. Disable this.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:36:15 +05:30
Tyler Nijmeh
f5daa9d7ec sched: Enable NEXT_BUDDY for better cache locality
By scheduling the last woken task first, we can increase cache locality
since that task is likely to touch the same data as before.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:36:15 +05:30
Tyler Nijmeh
970b81bf75 cpufreq: schedutil: Enforce realtime priority
Even the interactive governor utilizes a realtime priority. It is
beneficial for schedutil to process its workload at a priority >=
that of mundane tasks (KGSL/audio/etc.).

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: clarencelol <clarencekuiek@icloud.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
2024-08-13 23:36:14 +05:30
Sultan Alsawaf
26a793cb28 Revert "mutex: Add a delay into the SPIN_ON_OWNER wait loop."
This reverts commit c8de3f45ee.

This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?

Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.

Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Albert I <kras@raphielgang.org>
2024-08-13 23:36:11 +05:30
Sultan Alsawaf
0e39b53ee6 PM / freezer: Reduce freeze timeout to 1 second for Android
Freezing processes on Android usually takes less than 100 ms, and if it
takes longer than that to the point where the 20 second freeze timeout is
reached, it's because the remaining processes to be frozen are deadlocked
waiting for something from a process which is already frozen. There's no
point in burning power trying to freeze for that long, so reduce the freeze
timeout to a very generous 1 second for Android and don't let anything mess
with it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
2024-08-13 23:32:53 +05:30
Sultan Alsawaf
2e211875a5 PM / freezer: Abort suspend when there's a wakeup while freezing
Although try_to_freeze_tasks() stops when there's a wakeup, it doesn't
return an error when it successfully freezes everything it wants to freeze.
As a result, the suspend attempt can continue even after a wakeup is
issued. Although the wakeup will be eventually caught later in the suspend
process, kicking the can down the road is suboptimal; when there's a wakeup
detected, suspend should be immediately aborted by returning an error
instead. Make try_to_freeze_tasks() do just that, and also move the wakeup
check above the `todo` check so that we don't miss a wakeup from a process
that successfully froze.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I6d0ff54b1e1e143df2679d3848019590725c6351
2024-08-13 23:31:23 +05:30
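The reordering described above can be sketched as a pure decision function; names are illustrative, not the actual process.c code:

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the reordering above: check the wakeup condition before
 * the `todo` (unfrozen tasks) check, so a wakeup that races with a
 * fully successful freeze still aborts the suspend attempt. */
static int freeze_tasks_result(int wakeup_pending, int todo)
{
    if (wakeup_pending)
        return -EBUSY; /* abort suspend immediately on wakeup */
    if (todo)
        return -EBUSY; /* some tasks failed to freeze in time */
    return 0;          /* everything frozen, no wakeup: proceed */
}
```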
Sultan Alsawaf
a65385e72e PM: sleep: Don't allow s2idle to be used
Unfortunately, s2idle is only somewhat functional. Although commit
70441d36af58 ("cpuidle: lpm_levels: add soft watchdog for s2idle") makes
s2idle usable, there are still CPU stalls caused by s2idle's buggy
behavior, and the aforementioned hack doesn't address them. Therefore,
let's stop userspace from enabling s2idle and instead enforce the
default deep sleep mode.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 23:31:23 +05:30
Xunlei Pang
55c26752ef sched/fair: Advance global expiration when period timer is restarted
When period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than rq clock and go to the clock
drift logic, wasting needless CPU cycles on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: RuRuTiaSaMa <1009087450@qq.com>
2024-08-13 23:31:04 +05:30
Sultan Alsawaf
d6ba2bb7dd sched/fair: Compile out NUMA code entirely when NUMA is disabled
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 23:31:04 +05:30
Pavankumar Kondeti
7aa59f8faf BACKPORT: ANDROID: sched: Exempt paused CPU from nohz idle balance
A CPU can be paused while it is idle with its tick stopped.
nohz_balance_exit_idle() should be called from the local CPU,
so it can't be called during pause, which can happen remotely.
This results in the paused CPU participating in the nohz idle balance,
which should be avoided.

Fix this issue by calling nohz_balance_exit_idle() from the paused
CPU when it exits and enters idle again. This lazy approach avoids
waking the CPU from idle during pause.

Bug: 180530906
Change-Id: Ia2dfd9c9cac9b0f37c55a9256b9d5f3141ca0421
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
[ Tashar02: Backport to k4.19 ]
[ RealJohnGalt: Backport to k4.14 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2024-08-13 23:31:04 +05:30
Patrick Bellasi
b306fed53f cpufreq: schedutil: Fix iowait boost reset
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:

   commit a5a0809 ("cpufreq: schedutil: Make iowait boost more energy
efficient")

where the boost value is expected to be:

 - doubled at each successive wakeup from IO
   starting from the minimum frequency supported by a CPU

 - reset when a CPU is not updated for more than one tick
   by either disabling the IO wait boost or resetting its value to the
   minimum frequency if this new update requires an IO boost.

This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequences of wakeups from IO operations.

However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more than one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.

This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.

Let's fix this issue in sugov_set_iowait_boost() by:
 - first checking the IO wait boost reset conditions
   and resetting the boost value if needed
 - then applying the correct IO boost value
   if required by the caller

Fixes: a5a0809 (cpufreq: schedutil: Make iowait boost more energy
efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com> - backport to 4.4
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2024-08-13 23:28:07 +05:30
Peter Zijlstra
eb7d9be835 sched/core: Fix rules for running on online && !active CPUs
[ Upstream commit 175f0e25abeaa2218d431141ce19cf1de70fa82d ]

As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
for running on an online && !active CPU are stricter than just being a
kthread, you need to be a per-cpu kthread.

If you're not strictly per-CPU, you have better CPUs to run on and
don't need the partially booted one to get your work done.

The exception is to allow smpboot threads to bootstrap the CPU itself
and get kernel 'services' initialized before we allow userspace on it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs")
Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-13 23:27:23 +05:30
Xunlei Pang
41d1faa508 sched/fair: Fix bandwidth timer clock drift condition
commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

I noticed that cgroup task groups constantly get throttled even
if they have low CPU usage, this causes some jitters on the response
time to some of our business containers when enabling CPU quotas.

It's very simple to reproduce:

  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 100000 > cpu.cfs_quota_us
  echo $$ > tasks

then repeat:

  cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" values after the period timer is re-armed.

The current condition to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires values are actually the same when clock
drift happens, so this condition can never hit. The original design was
correctly done by this commit:

  a9cf55b286 ("sched: Expire invalid runtime")

... but was changed to be the current implementation due to its locking bug.

This patch introduces another way, it adds a new field in both structures
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
uses them to figure out if clock drift happens (true if they are equal).

Change-Id: Ida0d756728675758499caa225238ed13b4423168
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[alakeshh: backport: Fixed merge conflicts:
 - sched.h: Fix the indentation and order in which the variables are
   declared to match with coding style of the existing code in 4.14
   Struct members of same type were declared in separate lines in
   upstream patch which has been changed back to having multiple
   members of same type in the same line.
   e.g. int a; int b; ->  int a, b; ]
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 51f2176d74 ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-13 23:26:23 +05:30
Runmin Wang
2dd90440a1 sched/fair: load balance if a group is overloaded
Do more aggressive balancing if a sched_group is overloaded.

Change-Id: I00950c23c67a40b3431b68ac7ce2a1e470e563ed
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
2024-08-13 23:26:23 +05:30
Tyler Nijmeh
e89e4a37bb perf: Restrict perf event sampling CPU time to 5%
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:26:05 +05:30
Tyler Nijmeh
cc2af967c1 sched: Process new forks before processing their parent
This should let brand new tasks launch marginally faster.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:26:05 +05:30
Jacob Pan
7fa204cbe8 cpuidle: Allow enforcing deepest idle state selection
When idle injection is used to cap power, we need to override the
governor's choice of idle states.

For this reason, make it possible to enforce selection of the deepest
idle state by setting a flag on a given CPU, to achieve the maximum
potential power draw reduction.

Change-Id: I9737e99c4f3f4bc38016b313e76b50cec4cf56cb
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-13 23:26:04 +05:30
Joel Fernandes
27def5fb20 cpufreq: schedutil: Use unsigned int for iowait boost
Make iowait_boost and iowait_boost_max unsigned int, since their unit
is kHz and this is consistent with struct cpufreq_policy.  Also change
the local variables in sugov_iowait_boost() to match.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-13 23:25:59 +05:30
Joel Fernandes
04ec17e1bd cpufreq: schedutil: Make iowait boost more energy efficient
Currently the iowait_boost feature in schedutil makes the frequency
go to max on iowait wakeups.  This feature was added to handle a case
that Peter described, where the throughput of operations involving
continuous I/O requests [1] is reduced when running at a lower
frequency; however, the lower throughput itself causes utilization to
be low, which keeps the frequency low, hence it is "stuck".

Instead of going to max, it's also possible to achieve the same effect
by ramping up to max if there are repeated in_iowait wakeups
happening. This patch is an attempt to do that. We start from a lower
frequency (policy->min) and double the boost for every consecutive
iowait update until we reach the maximum iowait boost frequency
(iowait_boost_max).

I ran a synthetic test (continuous O_DIRECT writes in a loop) on an
x86 machine with intel_pstate in passive mode using schedutil.  In
this test the iowait_boost value ramped from 800MHz to 4GHz in 60ms.
The patch achieves the same improved throughput as the existing
behavior.

[1] https://patchwork.kernel.org/patch/9735885/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-13 23:25:58 +05:30
Wei Wang
468e53bb05 ANDROID: sched: fair: balance for single core cluster
Android will unset SD_LOAD_BALANCE for a single core cluster domain,
and some products really do have a single core cluster, so the MC
domain lacks the SD_LOAD_BALANCE flag. This breaks the
select_task_rq_fair logic and the task will spin forever
on that core.

Fixes: 00bbe7d605a9 "ANDROID: sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability"

Bug: 141334320
Test: boot and see task on core7 scheduled correctly
Change-Id: I7c2845b1f7bc1d4051eb3ad6a5f9838fb0b1ba04
Signed-off-by: Wei Wang <wvw@google.com>
2024-08-13 23:25:58 +05:30
Sultan Alsawaf
93937605cf kernel: Don't allow IRQ affinity masks to have more than one CPU
Even with an affinity mask that has multiple CPUs set, IRQs always run
on the first CPU in their affinity mask. Drivers that register an IRQ
affinity notifier (such as pm_qos) will therefore have an incorrect
assumption of where an IRQ is affined.

Fix the IRQ affinity mask deception by forcing it to only contain one
set CPU.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2024-08-13 23:25:57 +05:30
Zachariah Kennedy
d5c9e16340 sched/fair.c: Don't allow SchedTune boosted tasks to be migrated to small cores
We want boosted tasks to run on big cores. But CAF's load balancer
changes do not account for SchedTune boosting, so this allows for
boosted tasks to be migrated to a suboptimal core. Let's mitigate
this by setting LBF_IGNORE_BIG_TASKS for tasks migrating from a
larger capacity core to a smaller one and checking whether the task is
SchedTune boosted. If both are true, do not migrate the task.

Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
2024-08-13 23:24:53 +05:30
darkhz
2dff7f511d kernel: time: Silence "Suspended for..." debug messages.
Change-Id: Id585557f265d748e1d8d8bf2e4471bfcca2fe0a4
2024-08-13 23:21:36 +05:30
tytydraco
f91d2df8a3 power: Start killing wakelocks after one minute of idle
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:19:41 +05:30
Charan Teja Reddy
f038b94806 mm: oom_kill: reap memory of a task that receives SIGKILL
Free the pages in parallel for a task that receives SIGKILL, using the
oom_reaper. This helps return the pages to the buddy system well in
advance.
Reaping is done for a process that received SIGKILL through
either sys_kill from user space or kill_pid from the kernel, when the
sending process has the CAP_KILL capability.
A sysctl interface, reap_mem_on_sigkill, is also added to turn this
feature on/off.

[ExactExampl]: make it enabled by default

Change-Id: I21adb95de5e380a80d7eb0b87d9b5b553f52e28a
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
(cherry picked from commit f9920cfa7ecf420e6a1ced2b53920f3ea9ddfc19)
2024-08-13 23:14:33 +05:30
Nguyen Huu Huy
0694dd0d3f workqueue: add root permission for control of wq_power_efficient
Add this so the power-efficient workqueue setting can be enabled or
disabled from kernel manager apps.
Change from commit: e8abf85c64
2024-08-13 23:11:51 +05:30
Johannes Weiner
2cd0679c57 BACKPORT: psi: Optimize switching tasks inside shared cgroups
When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
Signed-off-by: Aarqw12 <lcockx@protonmail.com>
Signed-off-by: prorooter007 <shreyashwasnik112@gmail.com>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
2024-08-13 23:11:51 +05:30
Johannes Weiner
c7afdeb9a9 BACKPORT: psi: Fix cpu.pressure for cpu.max and competing cgroups
For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

	NR_RUNNING > 1

to

	NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
Signed-off-by: Aarqw12 <lcockx@protonmail.com>
Signed-off-by: prorooter007 <shreyashwasnik112@gmail.com>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
2024-08-13 23:11:51 +05:30
Yafang Shao
63fdb8eeb1 BACKPORT: psi: Move PF_MEMSTALL out of task->flags
task->flags is a 32-bit field of which 31 bits have already been
consumed, so it is hard to introduce a new per-process flag.
Currently there is still enough space in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead.
This patch also removes an out-of-date comment pointed out by Matthew.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Aarqw12 <lcockx@protonmail.com>
Signed-off-by: prorooter007 <shreyashwasnik112@gmail.com>
2024-08-13 23:11:51 +05:30