29441 Commits

Author SHA1 Message Date
Neill Kapron
627db36ed3 UPSTREAM: ANDROID: bpf: do not fail to load if log is full
Upstream commit 973c7a0d8a38 ("bpf: fix precision backtracking
instruction iteration") slightly changes the logic in the verifier which
results in the verifier log growing. This results in the log being too
small when loading the filterPowerSupplyEvents BPF program in Android,
and therefore causing the program loading to fail. Because this
program is labeled 'critical', a load failure forces a boot loop.

This BPF program exists on the vendor partition, and therefore we must
maintain the GRF/Treble boundary and modify the kernel logic.

The kernel's bpf log logic is refactored in the 6.4 kernel and
acknowledges the shortcomings of the existing approach which causes the
program load to fail. Instead of backporting the significant changes,
this change simply ignores the fact that the log is full.

For more information see commit 121664093803 ("bpf: Switch BPF verifier
log to be a rotating log by default")
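
The idea of ignoring a full log can be sketched roughly as follows. This
is an illustration only, not the actual kernel change: struct vlog,
vlog_append() and the ignore_full flag are hypothetical names, and the
buffer is deliberately tiny so that it overflows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size log; not the kernel's verifier log struct. */
struct vlog {
    char buf[16];
    size_t len;        /* bytes used, excluding the trailing NUL */
    int ignore_full;   /* 1: truncate silently, 0: report -ENOSPC */
};

/* Append msg. Returns 0 on success, or -ENOSPC when the message does
 * not fit and ignore_full is not set. The buffer stays NUL-terminated
 * in both cases. */
static int vlog_append(struct vlog *log, const char *msg)
{
    size_t room, n;
    int full;

    assert(log != NULL && msg != NULL);
    room = sizeof(log->buf) - 1 - log->len;
    n = strlen(msg);
    full = n > room;

    if (full)
        n = room;              /* keep only what fits */
    memcpy(log->buf + log->len, msg, n);
    log->len += n;
    log->buf[log->len] = '\0';

    return (full && !log->ignore_full) ? -ENOSPC : 0;
}
```

With ignore_full set, a caller that treats any non-zero return as a load
failure keeps going even though the log text was cut short.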

Bug: 432207940
Bug: 433641053
Test: verify pixel 6 boots on a 5.10 kernel including commit 973c7a0d8a38
Change-Id: I35c3d2074dd9b39e44bfdbaf66fa56ec917df0a6
Signed-off-by: Neill Kapron <nkapron@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2026-01-04 11:55:52 +05:30
Daniel Mentz
257fc94aa2 BACKPORT: ANDROID: Disable kthread delayed work fp check in CFI builds
With non-canonical CFI, LLVM generates jump table entries for external
symbols in modules and as a result, a function pointer passed from a
module to the core kernel will have a different address.

Disable the warning for now.

Bug: 145210207
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Change-Id: I576a07206a465902773481e51a84529f0ac2e84b
2026-01-04 11:55:52 +05:30
Mayank Grover
81e236915d BACKPORT: ANDROID: modules: cfi cleanup for module load failure
Clean up the CFI shadow when module loading fails,
to avoid causing warnings.

Bug: 172542186
Change-Id: I1de7ffa7d884c8e46891b8bbc8196ec0d2cef0d6
Signed-off-by: Mayank Grover <groverm@codeaurora.org>
2026-01-04 11:55:52 +05:30
sidex15
1219b14df9 fs: implement susfs v1.5.12
- This is a heavily modified version of susfs v1.5.12
- It does not comply with the upstream official susfs v1.5.12
- sus_mount functionality still remains at v1.5.5, as updating it to the latest version would result in a mount detection leak in some apps/detectors
- Increase susfs_open_redirect UID limit to <11000
- susfs magic mount support is still implemented and enabled
- sus_map is implemented and complies with the upstream v1.5.12 codebase

This commit requires a number of backported commits from v4.19 and v5.x to make sus_map work:

0a8cbf3725edbacc5f1ead33eeae7e4d78823b5a proc: less memory for /proc/*/map_files readdir
37ae2444584654f6785f2cc49181f05af788c9b2 mm: smaps: split PSS into components
49a5115e11350ee68f6a5fbd56b3e817bf9e5aac fs/task_mmu: add pkeys header
6f94042bed51121f8f28a5e572cda20c21fed2e1 mm/pkeys: Add an empty arch_pkeys_enabled()
bbd5aec12b32097a71dc6a0097194a18f3ee9a17 mm/pkeys, powerpc, x86: Provide an empty vma_pkey() in linux/pkeys.h
849ca8ce954d9dbb082dcf83c98af861e98e5635 mm: /proc/pid/smaps_rollup: convert to single value seq_file
6071a482c8e603be25895cc2cac5f0eab61c4051 mm: /proc/pid/smaps: factor out common stats printing
03fd2fbe9c40da8128cec5c69ef54755c0f38c6c mm: /proc/pid/smaps: factor out mem stats gathering
95f8be4c8a86a491a1c2ac9bfe470aef9e1baa8f mm: /proc/pid/*maps remove is_pid and related wrappers
27956d255e3b012372951dd6131e07c106d2daae procfs: add seq_put_hex_ll to speed up /proc/pid/maps
7f2847d02cdc4491b5ee6d4a0043854cbd6c7a1a proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps

For the KernelSU-side patches for this commit, you need sidex15's KernelSU-Next fork:
https://github.com/sidex15/KernelSU-Next/tree/n3x7g3n-kernel

Or, if you want to patch it on your own, here is the susfs commit in KernelSU-Next:
13b1dfd6e2

Co-authored-by: simonpunk <simonpunk2016@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:52 +05:30
NeilBrown
6e4f18d8ab BACKPORT: cred: add get_cred_rcu()
Sometimes we want to opportunistically get a
ref to a cred in an rcu_read_lock protected section.
get_task_cred() does this, and NFS does a similar thing
with its own credential structures.
To prepare for NFS converting to use 'struct cred' more
uniformly, define get_cred_rcu(), and use it in
get_task_cred().
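
The "opportunistic" part can be illustrated with a userspace sketch in
C11 atomics: take a reference only if the count has not already dropped
to zero. struct obj and obj_get_not_zero() are hypothetical stand-ins
for illustration, not the kernel's struct cred or its helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a credential. */
struct obj {
    atomic_int usage;
};

/* Take a reference only if the count is still non-zero. Returns true
 * when a reference was taken, false when the object is already dying. */
static bool obj_get_not_zero(struct obj *o)
{
    int old;

    assert(o != NULL);
    old = atomic_load(&o->usage);
    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&o->usage, &old, old + 1))
            return true;
    }
    return false;
}
```

A reader inside an RCU-style read section would call this and simply
give up on the object when it returns false.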

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
[neobuddy89: Backport for KernelSU-Next]
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:51 +05:30
Zhongqiu Han
3af245f6b3 sched: idle: Optimize the generic idle loop by removing needless memory barrier
The memory barrier rmb() in the generic idle loop function do_idle() is
not needed: it doesn't order any load instruction. Remove it, as a
needless rmb() can hurt performance.

The rmb() was introduced by the tglx/history.git commit f2f1b44c75c4
("[PATCH] Remove RCU abuse in cpu_idle()") to order the loads between
cpu_idle_map and pm_idle. It pairs with wmb() in function cpu_idle_wait().

And then with the removal of cpu_idle_state in function cpu_idle() and
wmb() in function cpu_idle_wait() in commit 783e391b7b ("x86: Simplify
cpu_idle_wait"), rmb() no longer has a reason to exist.

After that, commit d166991234 ("idle: Implement generic idle function")
implemented a generic idle function cpu_idle_loop() which resembles the
functionality found in arch/. And it retained the rmb() in generic idle
loop in file kernel/cpu/idle.c.

And at last, commit cf37b6b484 ("sched/idle: Move cpu/idle.c to
sched/idle.c") moved cpu/idle.c to sched/idle.c. And commit c1de45ca83
("sched/idle: Add support for tasks that inject idle") renamed function
cpu_idle_loop() to do_idle().

History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241009093745.9504-1-quic_zhonhan@quicinc.com
Change-Id: I7d04d05f25b66ab266b66424dfddd58857e5242b
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:50 +05:30
Ashwini Oruganti
f9d4d4f4d7 Revert "mm: oom_kill: reap memory of a task that receives SIGKILL"
This reverts commit 724adb4433.

Reason for revert: The changes introduced in this commit are causing an
undesirable SELinux denial as a side-effect. Additionally, it appears
(http://b/130391762) that we do not enable the functionality that this
commit adds. Reverting the commit fixes the SELinux denial bug.

Bug: 138594811
Change-Id: Ie440e290dc89b72c46de2194da0275910c03a66d
Signed-off-by: Ashwini Oruganti <ashfall@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Park Ju Hyung
c89b15482e sched: do not allocate window cpu arrays separately
These are allocated extremely frequently.

Allocate them with CONFIG_NR_CPUS upon struct ravg's allocation.

This will break WALT debug tracing.

Change-Id: I8f67bb00fb916e04bfc954d812a3b99a3a5495c2
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Kazuki H
06e0bcff04 sched/idle: Enter wfi state instead of polling during active migration
WFI's wakeup latency is low enough, so use it instead of polling and
burning power.

Change-Id: Iee1c1cdf515224267925037a859c6a74fc61abb7
Signed-off-by: Kazuki H <kazukih0205@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Srikar Dronamraju
094310650c sched/numa: Modify migrate_swap() to accept additional parameters
There are checks in migrate_swap_stop() that check if the task/CPU
combination is as per migrate_swap_arg before migrating.

However at least one of the two tasks to be swapped by migrate_swap() could
have migrated to a completely different CPU before updating the
migrate_swap_arg. The new CPU where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node.  Instead of achieving node
consolidation, it may end up spreading the load across nodes.

To avoid that pass the CPUs as additional parameters.

While here, place migrate_swap under CONFIG_NUMA_BALANCING.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25377.3     25226.6     -0.59
1     72287       73326       1.437

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0ad4e3dfe6cf3f207e61cbd8e3e4a943f8c1ad20)
Change-Id: Ia520fdeb7233d96891af72f80a44b71658951981
[dereference23: Backport to msm-4.14]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Rishabh Bhatnagar
e05cb79c60 sched: walt: Increase nr_threshold to 40 percent
Increase the nr_threshold percentage to 40 from 15.

Change-Id: I32ce7246fc4cd32d4c8110bef63971c9a2dceb55
Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Pavankumar Kondeti
592c86ff28 sched: walt: Fix stale walt CPU reservation flag
When a CPU tries to move a task to another CPU, in active load balance
or by other means, the helping CPU is marked as reserved so that other
scheduler decisions avoid it. Once the task has moved successfully, the
reservation is cleared, making the CPU available for other scheduler
decisions again. The reserved flag is protected analogously to the busy
CPU's rq->active_balance, which is protected by the runqueue locks. So
whenever rq->active_balance is set for the busy CPU, the reserved flag
is set for the helping CPU.

Sometimes a CPU is observed to be marked as reserved while no CPU has
rq->active_balance set. Some unlikely corner cases may cause this
behavior:
 - On the active load balance path, the cpu stop machine returns the
   queued status of the active_balance work on cpu_stopper, which is
   not checked on the active balance path. So when the stop machine is
   unable to queue the work (unlikely), the reserved flag isn't cleared.

   So, catch the return value and, on failure, clear the reserved flag
   for the CPU.

 - clear_walt_request() is called on the CPU to clear any pending WALT
   work; push_task might have changed or been cleared by then, in which
   case the reserved CPU would be left uncleared.

   So clear push_cpu independent of push_task.
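
The first corner case above can be sketched like this. It is a
simplified model under assumptions, not the WALT code: struct cpu_state,
queue_stop_work() and start_active_balance() are hypothetical names, and
"queueing" is reduced to a flag check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state; 'reserved' models the reservation flag. */
struct cpu_state {
    bool reserved;
    bool stopper_enabled;  /* whether stop work can be queued here */
};

/* Stand-in for queueing stopper work: true if queued, false if not. */
static bool queue_stop_work(const struct cpu_state *busy)
{
    return busy->stopper_enabled;
}

/* Reserve the helper, try to queue the migration work on the busy CPU,
 * and undo the reservation if the work could not be queued. */
static bool start_active_balance(struct cpu_state *busy,
                                 struct cpu_state *helper)
{
    assert(busy != NULL && helper != NULL);
    helper->reserved = true;
    if (!queue_stop_work(busy)) {
        helper->reserved = false;  /* nothing will run to clear it later */
        return false;
    }
    return true;
}
```

Without the failure branch, the helper would stay reserved forever,
because only the queued work is expected to clear the flag.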

Change-Id: I75d032bf399cb3da8e807186b1bc903114168a4e
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Abhijeet Dharmapurikar
c68528621a sched/walt: Improve the scheduler
This change is for general scheduler improvement.

Change-Id: Iffd4ae221581aaa4aeb244a0cddd40a8b6aac74d
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
[dereference23: Backport to msm-4.14]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Lingutla Chandrasekhar
528defcb34 sched: Improve the scheduler
This change is for general scheduler improvements.

Change-Id: I37d6cb75ca8b08d9ca155b86b7d71ff369f46e14
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Lingutla Chandrasekhar
13f713dce4 sched: walt: Improve the scheduler
This change is for general scheduler improvements.

Change-Id: Ia2854ae8701151761fe0780b6451133ab09a050b
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Abhijeet Dharmapurikar
71569effa8 sched: Improve the Scheduler
This change is for general scheduler improvement.

Change-Id: I7cb85ea7133a94923fae97d99f5b0027750ce189
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Pavankumar Kondeti
da6e9f09d4 sched/fair: Optimize the tick path active migration
When a task is upmigrating via the tick path, the lower capacity CPU
that is running the task wakes up the migration task to carry the
migration to the higher capacity CPU. The migration task dequeues the
task from the lower capacity CPU and enqueues it on the higher capacity
CPU. Only then is a reschedule IPI sent to the higher capacity CPU. If
the higher capacity CPU was in a deep sleep state, the task waits
longer to be upmigrated. This can be optimized by waking up the higher
capacity CPU along with waking the migration task on the lower capacity
CPU. Since we reserve the higher capacity CPU, the is_reserved() API can
be used to prevent the CPU from entering idle again.

Change-Id: I7bda9a905a66a9326c1dc74e50fa94eb58e6b705
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[clingutla@codeaurora.org: Resolved minor merge conflicts]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Alexander Winkowski
6c89e8231e sched: Introduce rotation_ctl
This is WALT rotation logic extracted from core_ctl to avoid
CPU isolation overhead while retaining the performance gain.

Change-Id: I912d2dabf7e32eaf9da2f30b38898d1b29ff0a53
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Alexander Winkowski
e280ba9583 sched: Remove unused core_ctl.h
To avoid confusion with include/linux/sched/core_ctl.h

Change-Id: I037b1cc0fa09c06737a369b4e7dfdd89cd7ad9af
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Sultan Alsawaf
401abc5b6b sched/fair: Set asym priority equally for all CPUs in a performance domain
All CPUs in a performance domain share the same capacity, and therefore
aren't different from one another when distinguishing between which one is
better for asymmetric packing.

Instead of unfairly prioritizing lower-numbered CPUs within the same
performance domain, treat all CPUs in a performance domain equally for
asymmetric packing.

Change-Id: Ibc18d45fabc2983650ebebec910578e26bd26809
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Sultan Alsawaf
ea91002ce3 sched/fair: Always update CPU capacity when load balancing
Limiting CPU capacity updates, which are quite cheap, results in worse
balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE).
This causes opportunistic placement decisions to be skewed using stale CPU
capacity data, and when a CPU isn't idling much, its capacity suffers from
even more staleness since the only exception to the 100 ms capacity update
ratelimit is a CPU exiting idle.

Since the capacity updates are cheap, always do them when load balancing
in order to improve opportunistic task placement decisions.

Change-Id: If1d451ce742fd093010057e31e71012d47fad70a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Wei Wang
8e507d74c2 Revert "sched/core: fix userspace affining threads incorrectly"
This reverts commit d43b69c4ad.

Bug: 133481659
Test: build
Change-Id: I615023c611c4de1eb334e4374af7306991f4216b
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:48 +05:30
Wei Wang
a56a7ae6df Revert "sched/core: Fix use after free issue in is_sched_lib_based_app()"
This reverts commit 0e6ca1640c.

Bug: 133481659
Test: build
Change-Id: Ie6a0b5e46386c98882614be19dedc61ffd3870e5
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Wei Wang
3f5bda5a63 Revert "sched: Improve the scheduler"
This reverts commit a3dd94a1bb.

Bug: 133481659
Test: build
Change-Id: Ib23609315f3446223521612621fe54469537c172
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Alexander Winkowski
a1e19c30de Revert "sched: Improve the scheduler"
This reverts commit 92daaf50af.

Change-Id: I52d562da3c755f114d459ad09813188697ca81d8
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Sultan Alsawaf
0a497b785d cpufreq: schedutil: Use the frequency below the target if they're close
Schedutil targets a frequency tipping point of 80% to vote for a higher
frequency when utilization crosses that threshold.

Since the tipping point calculation is done without regard to the size of
the gap between each frequency step, this often results in a large
frequency jump when it isn't strictly necessary, which hurts energy
efficiency.

For example, if a CPU has 2000 MHz and 3000 MHz frequency steps, and
schedutil targets a frequency of 2005 MHz, then the 3000 MHz frequency step
will be used even though the target frequency of 2005 MHz is very close to
2000 MHz. In this hypothetical scenario, using 2000 MHz would clearly
satisfy the system's performance needs while consuming less energy than the
3000 MHz step.

To counter-balance the frequency tipping point, add a frequency tipping
point in the opposite direction to prefer the frequency step below the
calculated target frequency when the target frequency is less than 20%
higher than that lower step. A threshold of 20% was empirically determined
to provide significant energy savings without really impacting performance.

This improves schedutil's energy efficiency on CPUs which have large gaps
between their frequency steps, as is often the case on ARM.
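
The selection rule described above can be sketched as a small function.
This is an illustration under assumptions, not the patch itself:
pick_freq() is a hypothetical name, the step table is assumed to be
sorted in ascending order, and frequencies are plain integers in MHz.

```c
#include <assert.h>
#include <stddef.h>

/* Pick a frequency step for 'target' from an ascending table.
 * Normally the lowest step >= target is chosen. If the step just below
 * is close enough (target is less than 20% above it), prefer that
 * lower step instead. */
static unsigned int pick_freq(const unsigned int *steps, size_t n,
                              unsigned int target)
{
    size_t i;

    assert(steps != NULL && n > 0);
    for (i = 0; i < n; i++) {
        if (steps[i] >= target)
            break;
    }
    if (i == n)
        return steps[n - 1];   /* target is above the highest step */
    if (i > 0 && steps[i] != target) {
        unsigned int lower = steps[i - 1];

        /* target < lower * 1.2, written in integer arithmetic */
        if ((unsigned long long)target * 5 < (unsigned long long)lower * 6)
            return lower;
    }
    return steps[i];
}
```

With steps of 1000, 2000 and 3000 MHz, a target of 2005 MHz selects
2000 MHz, while a target of 2500 MHz still selects 3000 MHz.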

Change-Id: Ie75b79e5eb9f52c966848a9fb1c8016d7ae22098
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Miguel de Dios
b1838a31b8 kernel: sched: cpufreq_schedutil: Make iowait boost optional.
Bug: 120438505
Change-Id: I59e3675a320ce71c3c90be3904756b125300ba6b
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Connor O'Brien
18a9d3584e cpufreq: schedutil: fix check for stale utilization values
Part of the fix from commit d86ab9cff8 ("cpufreq: schedutil: use now
as reference when aggregating shared policy requests") is reversed in
commit 05d2ca2420 ("cpufreq: schedutil: Ignore CPU load older than
WALT window size") due to a porting mistake. Restore it while keeping
the relevant change from the latter patch.

Bug: 117438867
Test: build & boot
Change-Id: I21399be760d7c8e2fff6c158368a285dc6261647
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Danny Lin
2fc7b997ad Revert "cpufreq: schedutil: Fix for CR 2040904"
This reverts commit b8b6f565c0.

CAF's hispeed boost and predicted load features aren't any good. Remove
them entirely to prevent userspace from trying to enable them
(specifically pl) and to reduce useless overhead in schedutil, since it
runs *very* often.

Change-Id: I0446b49a59e5dce8e1b7712bdb654c9a5e6ff0ed
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Daniel Bristot de Oliveira
ccfb5d1c83 UPSTREAM: sched/rt: Disable RT_RUNTIME_SHARE by default
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time while leaving more space for non-real-time tasks to run on the CPUs
that lend rt_runtime.

The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.

This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
(cherry picked from commit 2586af1ac187f6b3a50930a4e33497074e81762d)
Change-Id: Ibb1b185d512130783ac9f0a29f0e20e9828c86fd

Bug: 169673278
Test: build, boot and check the trace with RT task
Signed-off-by: Kyle Lin <kylelin@google.com>
Change-Id: Iffede8107863b02ad4a0cb902fc8119416931bdb
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:47 +05:30
Shaokun Zhang
896b136ff7 UPSTREAM: cgroup: Remove unused cgrp variable
The 'cgrp' is set but not used in commit <76f969e8948d8>
("cgroup: cgroup v2 freezer").
Remove it to avoid [-Wunused-but-set-variable] warning.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from 533307dc20a9e84a0687d4ca24aeb669516c0243)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: I6221a975c04f06249a4f8d693852776ae08a8d8e
2026-01-04 11:55:40 +05:30
Oleg Nesterov
5df874bc76 UPSTREAM: cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.

Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: Ic53e0f2d6624b0bb90817b0c57060fb7db971348
(cherry picked from commit 937c6b27c73e02cd4114f95f5c37ba2c29fadba1)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Roman Gushchin
6e0a12380b UPSTREAM: cgroup: freezer: fix frozen state inheritance
If a new child cgroup is created in the frozen cgroup hierarchy
(one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise, a process attached to the child cgroup
won't become frozen.
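
The inheritance rule can be sketched with a toy tree. This is a
simplified model, not the kernel's struct cgroup: struct cg, its fields
and cg_init_child() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cgroup node. */
struct cg {
    struct cg *parent;
    bool self_freeze;      /* this cgroup itself was asked to freeze */
    bool effective_freeze; /* this cgroup or some ancestor is frozen */
};

/* Initialise a child so that it inherits the frozen state of its
 * ancestors at creation time. */
static void cg_init_child(struct cg *child, struct cg *parent)
{
    assert(child != NULL);
    child->parent = parent;
    child->self_freeze = false;
    child->effective_freeze = parent ? parent->effective_freeze : false;
}
```

A child created under a frozen parent starts out effectively frozen even
though nobody asked for that child itself to be frozen.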

The problem can be reproduced with the test_cgfreezer_mkdir test.

This is the output before this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
  not ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

And with this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: I6ba7b8dec5600e78bb7448f03fd97a9b43838fa0
(cherry picked from commit 97a61369830ab085df5aed0ff9256f35b07d425a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Roman Gushchin
fbad9ee881 UPSTREAM: signal: unconditionally leave the frozen state in ptrace_stop()
Alex Xu reported a regression in strace, caused by the introduction of
the cgroup v2 freezer. The regression can be reproduced by stracing
the following simple program:

  #include <unistd.h>

  int main() {
      write(1, "a", 1);
      return 0;
  }

An attempt to run strace ./a.out leads to an infinite loop:
  [ pre-main omitted ]
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  [ repeats forever ]

The problem occurs because the traced task leaves ptrace_stop()
(and the signal handling loop) with the frozen bit set. So let's
call cgroup_leave_frozen(true) unconditionally after sleeping
in ptrace_stop().

With this patch applied, strace works as expected:
  [ pre-main omitted ]
  write(1, "a", 1)                        = 1
  exit_group(0)                           = ?
  +++ exited with 0 +++

Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: If644b15ead36ce13f0c2c3dd57eebe3658e3edf7
(cherry picked from commit 05b289263772b0698589abc47771264a685cd365)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Roman Gushchin
c9cbc44943 UPSTREAM: cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.

This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
which mostly tried to imitate the system-wide freezer. While
uninterruptible sleep is fine when all tasks are going to be frozen
(the hibernation case), it is not an acceptable state for just a subset
of the system.

The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH does not work because non-fatal signal delivery
is blocked in frozen state.

There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
via the cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes to the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
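
Because the interface is asynchronous, userspace has to read the actual
state back from cgroup.events. A minimal sketch of that read side,
assuming the file holds one "key value" pair per line; parse_frozen() is
a hypothetical helper name, not part of any kernel or library API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 or 0 for the "frozen" field in cgroup.events content,
 * or -1 if the field is missing or malformed. */
static int parse_frozen(const char *events)
{
    static const char key[] = "frozen ";
    const char *line = events;

    assert(events != NULL);
    while (*line) {
        /* Only match the key at the start of a line. */
        if (strncmp(line, key, sizeof(key) - 1) == 0) {
            char v = line[sizeof(key) - 1];

            return (v == '0' || v == '1') ? v - '0' : -1;
        }
        line = strchr(line, '\n');
        if (!line)
            break;
        line++;
    }
    return -1;
}
```

A caller would write "1" to the cgroup's cgroup.freeze file, then read
cgroup.events and feed the text to this helper until it reports 1.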

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Roman Gushchin
15663069d0 UPSTREAM: cgroup: implement __cgroup_task_count() helper
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.
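
The locked/unlocked helper split can be sketched as below. This is a
toy model: struct group and its boolean "lock" are hypothetical
stand-ins for css_set_lock and the real task list, and the userspace
sketch uses a _locked suffix where the kernel uses a double-underscore
prefix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cgroup guarded by a single lock. */
struct group {
    bool lock_held;
    int nr_tasks;
};

static void group_lock(struct group *g)
{
    assert(!g->lock_held);
    g->lock_held = true;
}

static void group_unlock(struct group *g)
{
    assert(g->lock_held);
    g->lock_held = false;
}

/* Caller must already hold the lock. */
static int group_task_count_locked(const struct group *g)
{
    assert(g->lock_held);
    return g->nr_tasks;
}

/* Takes the lock itself, then defers to the locked variant. */
static int group_task_count(struct group *g)
{
    int count;

    group_lock(g);
    count = group_task_count_locked(g);
    group_unlock(g);
    return count;
}
```

Code that already holds the lock for other reasons calls the _locked
variant directly and avoids taking the lock twice.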

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Change-Id: Iaa9085d2375d395a051543d2555389213c2892d6
(cherry picked from commit aade7f9efba098859681f8e88d81a5b44ad09b12)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Roman Gushchin
fe7a69af89 UPSTREAM: cgroup: rename freezer.c into legacy_freezer.c
freezer.c will contain an implementation of the cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Change-Id: Ie196fbcca1e0bf46af9200752d8fdf90b97e5a8b
(cherry picked from commit 50943f3e136adfc421f9768d6ae09ba7b83aaefd)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Shakeel Butt
208b6124ed UPSTREAM: cgroup: remove extra cgroup_migrate_finish() call
The callers of cgroup_migrate_prepare_dst() correctly call
cgroup_migrate_finish() in both the success and failure cases. There is
no need to call it in cgroup_migrate_prepare_dst() in the failure case.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I785d7ab70a42b1b79aea9852bb14ba5abefcaa9b
(cherry picked from commit d6e486ee0ef2f99a4069d9186e53dac61b28cb3c)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Al Viro
ec1bc657c6 UPSTREAM: cgroup: saner refcounting for cgroup_root
* make the reference from superblock to cgroup_root counting -
do cgroup_put() in cgroup_kill_sb() whether we'd done
percpu_ref_kill() or not; matching grab is done when we allocate
a new root.  That gives the same refcounting rules for all callers
of cgroup_do_mount() - a reference to cgroup_root has been grabbed
by caller and it either is transferred to new superblock or dropped.

* have cgroup_kill_sb() treat an already killed refcount as "just
don't bother killing it, then".

* after successful cgroup_do_mount() have cgroup1_mount() recheck
if we'd raced with mount/umount from somebody else and cgroup_root
got killed.  In that case we drop the superblock and bugger off
with -ERESTARTSYS, same as if we'd found it in the list already
dying.

* don't bother with delayed initialization of refcount - it's
unreliable and not needed.  No need to prevent attempts to bump
the refcount if we find cgroup_root of another mount in progress -
sget will reuse an existing superblock just fine and if the
other sb manages to die before we get there, we'll catch
that immediately after cgroup_do_mount().

* don't bother with kernfs_pin_sb() - no need for doing that
either.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I8e088dfc516b76c42d9d4b34db7f49f0cebc5414
(cherry picked from commit 35ac1184244f1329783e1d897f74926d8bb1103a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Tejun Heo
c38520af07 UPSTREAM: cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param
It can be useful to inhibit all cgroup1 hierarchies especially during
transition and for debugging.  cgroup_no_v1 can block hierarchies with
controllers which leaves out the named hierarchies.  Expand it to
cover the named hierarchies so that "cgroup_no_v1=all,named" disables
all cgroup1 hierarchies.
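
A minimal userspace sketch of how a comma-separated value such as
"all,named" could be split into flags. The struct and function names
here are hypothetical and are not taken from the kernel source; tokens
that would name individual controllers are simply skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type for this sketch only. */
struct no_v1_opts {
	bool all;	/* "all": block every controller on cgroup1 */
	bool named;	/* "named": block named cgroup1 hierarchies */
};

/* Walk the comma-separated list and record the two special tokens. */
void parse_cgroup_no_v1(const char *str, struct no_v1_opts *opts)
{
	assert(str != NULL && opts != NULL);

	opts->all = false;
	opts->named = false;

	while (*str) {
		size_t len = strcspn(str, ",");	/* length of current token */

		if (len == 3 && strncmp(str, "all", 3) == 0)
			opts->all = true;
		else if (len == 5 && strncmp(str, "named", 5) == 0)
			opts->named = true;
		/* any other token would be a controller name; ignored here */

		str += len;
		if (*str == ',')
			str++;
	}
}
```

With "all,named" both flags end up set, which corresponds to the case
described above where every cgroup1 hierarchy is disabled.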

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Marcin Pawlowski <mpawlowski@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ibd093dd9b70d15402a21db3c1ef56005ebc7f99e
(cherry picked from commit 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Yangtao Li
a40db08240 UPSTREAM: cgroup: remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary to use
unlikely.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I092c0aae2a06b13d3fc9ecfbb24ab3e8d10235f6
(cherry picked from commit 4d9ebbe2b061a9c25e12ba8539ba172533132eb6)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:40 +05:30
Tejun Heo
e0118bca02 UPSTREAM: cgroup: Explicitly remove core interface files
The "cgroup." core interface files bypass the usual interface removal
path and get removed recursively along with the cgroup itself.  While
this works now, the subtle discrepancy gets in the way of implementing
common mechanisms.

This patch updates cgroup core interface file handling so that it's
consistent with controller interface files.  When added, the css is
marked CSS_VISIBLE and they're explicitly removed before the cgroup is
destroyed.

This doesn't cause user-visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I4091581388cb1514171d6de8fdac5f0fe6ae1695
(cherry picked from commit 5faaf05f2976fd9ec0ecd582bcfb3a41cde4c65e)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Roman Gushchin
568fa46b10 UPSTREAM: cgroup: make cgroup.threads delegatable
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I82d23cd511122e5a75b23b26e03ccc9e43b171e5
(cherry picked from commit 4f58424da3deead2605e39a9df65f5f06107a3cb)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Tejun Heo
849520d94b BACKPORT: string: drop __must_check from strscpy() and restore strscpy() usages in cgroup
e7fd37ba1217 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copies, because strscpy() is marked __must_check, it led to the
following warnings.

  kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
  kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
     strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
	       ^

To avoid the warnings, 50034ed49645 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strscpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two functions.  It's just
that someone added __must_check to strscpy() and not to strlcpy().
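
The behavioural difference between the two helpers can be sketched in
userspace as follows. These are simplified stand-ins with hypothetical
names (sketch_strlcpy, sketch_strscpy), not the kernel implementations;
they only illustrate the return-value contract and the extra strlen()
pass over the source.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* strlcpy-style: always measures the whole source, returns its length. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);	/* unconditional walk over the source */

	if (size) {
		size_t n = len >= size ? size - 1 : len;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* strscpy-style: reads at most size bytes, returns -E2BIG on truncation. */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	if (size == 0)
		return -E2BIG;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source did not end here, the copy was truncated. */
	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

A caller that is fine with truncation can ignore either return value;
the destination is NUL-terminated in both cases as long as size is
non-zero.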

These basic string copy operations are used in a variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value.  The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
(cherry picked from commit 08a77676f9c5fc69a681ccd2cd8140e65dcb26c7)
[backport the cgroup portions that weren't applied with the earlier
patch 779128d80c 'string: drop __must_check from strscpy() and restore
strscpy() usages in cgroup']
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: Iaa636d39d15c44be47fc6b6ba202ecb7ff73c5e7
2026-01-04 11:55:39 +05:30
Arnd Bergmann
fc6e9945d7 UPSTREAM: cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I8215beea12d94fda6a7834f8be6f8e0891285d0e
(cherry picked from commit 50034ed49645463a16327cad05694e201e6b4126)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Ma Shimiao
1b59565001 UPSTREAM: cgroup: avoid copying strings longer than the buffers
The cgroup root name and file name have maximum length limits; we should
avoid copying names longer than those limits into the buffers.

tj: minor update to $SUBJ.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Iff4f30be79184f19d9f3ec253bbab9c4ad91f36c
(cherry picked from commit e7fd37ba12170cc414be8b639dfc2c5f7172fac2)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Roman Gushchin
7abaa878d0 UPSTREAM: cgroup: export list of cgroups v2 features using sysfs
The active development of cgroups v2 sometimes leads to the creation
of interfaces that are not turned on by default (to provide
backward compatibility). It's handy to know from userspace which
cgroup v2 features are supported without deriving it from
the kernel version. So, let's export the list of such features
using the /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate
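
A minimal sketch of how userspace might check for a feature in this
one-name-per-line format. The helper names (line_matches,
cgroup_has_feature) are hypothetical, and the 128-byte line buffer is an
assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the line (with or without trailing newline) equals name. */
bool line_matches(const char *line, const char *name)
{
	size_t len = strcspn(line, "\n");

	return len == strlen(name) && strncmp(line, name, len) == 0;
}

/*
 * Returns 1 if name is listed in the file at path, 0 if it is not,
 * and -1 if the file cannot be opened (e.g. on an older kernel).
 */
int cgroup_has_feature(const char *path, const char *name)
{
	char line[128];	/* assumed to be long enough for one feature name */
	int found = 0;
	FILE *f;

	assert(path != NULL && name != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (line_matches(line, name)) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

A caller would pass "/sys/kernel/cgroup/features" as path and, for
instance, "nsdelegate" as name, treating -1 as "feature list not
available".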

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I2baf0b7bcc27491568772defc43a06d0a5ed46bf
(cherry picked from commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Roman Gushchin
5a336133b7 UPSTREAM: cgroup: export list of delegatable control files using sysfs
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and is likely to be extended in the future).

To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.

The format is simple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I9d3143ecbae9d7579d2b1e6ccf381190ef5d3255
(cherry picked from commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Tejun Heo
f06e789336 UPSTREAM: cgroup: statically initialize init_css_set->dfl_cgrp
Like other csets, init_css_set's dfl_cgrp is initialized when the cset
gets linked.  init_css_set gets linked in cgroup_init().  This has
been fine till now but the recently added basic CPU usage accounting
may end up accessing dfl_cgrp of init before cgroup_init() leading to
the following oops.

  SELinux:  Initializing.
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: account_system_index_time+0x60/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd64 #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  +1.9.3-20161025_171302-gandalf 04/01/2014
  task: ffffffff81e10480 task.stack: ffffffff81e00000
  RIP: 0010:account_system_index_time+0x60/0x90
  RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
  RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
  RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
  RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
  R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
  FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   account_system_time+0x45/0x60
   account_process_tick+0x5a/0x140
   update_process_times+0x22/0x60
   tick_periodic+0x2b/0x90
   tick_handle_periodic+0x25/0x70
   timer_interrupt+0x15/0x20
   __handle_irq_event_percpu+0x7e/0x1b0
   handle_irq_event_percpu+0x23/0x60
   handle_irq_event+0x42/0x70
   handle_level_irq+0x83/0x100
   handle_irq+0x6f/0x110
   do_IRQ+0x46/0xd0
   common_interrupt+0x9d/0x9d

Fix it by statically initializing init_css_set.dfl_cgrp so that init's
default cgroup is accessible from the get-go.

Fixes: 041cd640b2f3 ("cgroup: Implement cgroup2 basic CPU usage accounting")
Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ia754e3d34561ff09db126712e1a40d993b28f5d9
(cherry picked from commit 38683148828165ea0b66ace93a9fedc2d3281e27)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2026-01-04 11:55:39 +05:30
Xiaoming Ni
f265c613f7 kernel/notifier.c: intercept duplicate registrations to avoid infinite loops
[ Upstream commit 1a50cb80f219c44adb6265f5071b81fc3c1deced ]

Registering the same notifier to a hook repeatedly can cause the hook
list to form a ring or lose other members of the list.

  case1: An infinite loop in notifier_chain_register() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);

  case2: An infinite loop in notifier_call_chain() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_call_chain(&test_notifier_list, 0, NULL);

  case3: lose other hook test2
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);
          atomic_notifier_chain_register(&test_notifier_list, &test1);

  case4: Unregister returns 0, but the hook is still in the linked list,
         and it is not really registered. If you call
         notifier_call_chain after ko is unloaded, it will trigger oops.

If the system is configured with softlockup_panic and the same hook is
repeatedly registered on the panic_notifier_list, it will cause a loop
panic.

Add a check in notifier_chain_register(), intercepting duplicate
registrations to avoid infinite loops.
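
The idea can be sketched in userspace with a reduced notifier_block that
keeps only the link and the priority. This is an illustration, not the
kernel code: the function names are hypothetical, and returning -EEXIST
for a duplicate is an assumption of this sketch, since the message above
does not say what the check returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced stand-in for the kernel's struct notifier_block. */
struct notifier_block {
	struct notifier_block *next;
	int priority;
};

/*
 * Insert n into the list at *nl, kept sorted by descending priority.
 * A block that is already on the list is rejected instead of being
 * linked a second time, which is what would turn the list into a ring.
 */
int chain_register(struct notifier_block **nl, struct notifier_block *n)
{
	assert(nl != NULL && n != NULL);

	while (*nl != NULL) {
		if (*nl == n)
			return -EEXIST;	/* duplicate registration (assumed code) */
		if (n->priority > (*nl)->priority)
			break;
		nl = &(*nl)->next;
	}
	n->next = *nl;
	*nl = n;
	return 0;
}

/* Count list entries; only safe because duplicates are rejected. */
int chain_length(const struct notifier_block *head)
{
	int count = 0;

	for (; head != NULL; head = head->next)
		count++;
	return count;
}
```

Replaying case3 against this sketch (register test1, test2, then test1
again) leaves both entries on the list, because the third call is
rejected before any pointer is rewritten.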

Link: http://lkml.kernel.org/r/1568861888-34045-2-git-send-email-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
2026-01-04 11:55:39 +05:30