Commit Graph

31341 Commits

Author SHA1 Message Date
Frederic Weisbecker
e00a2dfe71 rcu: Assume rcu_init() is called before smp
The rcu_init() function is called way before SMP is initialized and
therefore only the boot CPU should be online at this stage.

Simplify the boot per-cpu initialization accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-12-08 00:52:05 +00:00
Cyrill Gorcunov
c9e54c78d7 rcu: rcu_qs -- Use raise_softirq_irqoff to not save irqs twice
The rcu_qs is disabling IRQs by self so no need to do the same in raise_softirq
but instead we can save some cycles using raise_softirq_irqoff directly.

CC: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2025-12-08 00:52:05 +00:00
Paul E. McKenney
04904ffe37 rcu/tiny: Convert to SPDX license identifier
Replace the license boiler plate with a SPDX license identifier.
While in the area, update an email address.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2025-12-08 00:52:05 +00:00
Paul E. McKenney
3dd3836e93 rcu: Rename rcu_check_callbacks() to rcu_sched_clock_irq()
The name rcu_check_callbacks() arguably made sense back in the early
2000s when RCU was quite a bit simpler than it is today, but it has
become quite misleading, especially with the advent of dyntick-idle
and NO_HZ_FULL.  The rcu_check_callbacks() function is RCU's hook into
the scheduling-clock interrupt, and is now but one of many ways that
callbacks get promoted to invocable state.

This commit therefore changes the name to rcu_sched_clock_irq(),
which is the same number of characters and clearly indicates this
function's relation to the rest of the Linux kernel.  In addition, for
the sake of consistency, rcu_flavor_check_callbacks() is also renamed
to rcu_flavor_sched_clock_irq().

While in the area, the header comments for both functions are reworked.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2025-12-08 00:52:04 +00:00
Paul E. McKenney
20743a0645 srcu: Make call_srcu() available during very early boot
Event tracing is moving to SRCU in order to take advantage of the fact
that SRCU may be safely used from idle and even offline CPUs.  However,
event tracing can invoke call_srcu() very early in the boot process,
even before workqueue_init_early() is invoked (let alone rcu_init()).
Therefore, call_srcu()'s attempts to queue work fail miserably.

This commit therefore detects this situation, and refrains from attempting
to queue work before rcu_init() time, but does everything else that it
would have done, and in addition, adds the srcu_struct to a global list.
The rcu_init() function now invokes a new srcu_init() function, which
is empty if CONFIG_SRCU=n.  Otherwise, srcu_init() queues work for
each srcu_struct on the list.  This all happens early enough in boot
that there is but a single CPU with interrupts disabled, which allows
synchronization to be dispensed with.

Of course, the queued work won't actually be invoked until after
workqueue_init() is invoked, which happens shortly after the scheduler
is up and running.  This means that although call_srcu() may be invoked
any time after per-CPU variables have been set up, there is still a very
narrow window when synchronize_srcu() won't work, and this window
extends from the time that the scheduler starts until the time that
workqueue_init() returns.  This can be fixed in a manner similar to
the fix for synchronize_rcu_expedited() and friends, but until someone
actually needs to use synchronize_srcu() during this window, this fix
is added churn for no benefit.

Finally, note that Tree SRCU's new srcu_init() function invokes
queue_work() rather than the queue_delayed_work() function that is
invoked post-boot.  The reason is that queue_delayed_work() will (as you
would expect) post a timer, and timers have not yet been initialized.
So use of queue_work() avoids the complaints about use of uninitialized
spinlocks that would otherwise result.  Besides, some delay is already
provide by the aforementioned fact that the queued work won't actually
be invoked until after the scheduler is up and running.

Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-12-08 00:52:04 +00:00
Paul E. McKenney
071728046f rcu: Motivate Tiny RCU forward progress
If a long-running CPU-bound in-kernel task invokes call_rcu(), the
callback won't be invoked until the next context switch.  If there are
no other runnable tasks (which is not an uncommon situation on deep
embedded systems), the callback might never be invoked.

This commit therefore causes rcu_check_callbacks() to ask the scheduler
for a context switch if there are callbacks posted that are still waiting
for a grace period.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2025-12-08 00:52:04 +00:00
Paul E. McKenney
1e9c40c21d rcu: Clean up flavor-related definitions and comments in tiny.c
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2025-12-08 00:52:04 +00:00
Paul E. McKenney
8cb39e4b1e rcu: Express Tiny RCU updates in terms of RCU rather than RCU-sched
This commit renames Tiny RCU functions so that the lowest level of
functionality is RCU (e.g., synchronize_rcu()) rather than RCU-sched
(e.g., synchronize_sched()).  This provides greater naming compatibility
with Tree RCU, which will in turn permit more LoC removal once
the RCU-sched and RCU-bh update-side API is removed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix Tiny call_rcu()'s EXPORT_SYMBOL() in response to a bug
  report from kbuild test robot. ]
2025-12-08 00:52:04 +00:00
Paul E. McKenney
c10274c5a7 rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds
Now that RCU-preempt knows about preemption disabling, its implementation
of synchronize_rcu() works for synchronize_sched(), and likewise for the
other RCU-sched update-side API members.  This commit therefore confines
the RCU-sched update-side code to CONFIG_PREEMPT=n builds, and defines
RCU-sched's update-side API members in terms of those of RCU-preempt.

This means that any given build of the Linux kernel has only one
update-side flavor of RCU, namely RCU-preempt for CONFIG_PREEMPT=y builds
and RCU-sched for CONFIG_PREEMPT=n builds.  This in turn means that kernels
built with CONFIG_RCU_NOCB_CPU=y have only one rcuo kthread per CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
2025-12-08 00:52:04 +00:00
Paul E. McKenney
d3b896077a rcu: Define RCU-bh update API in terms of RCU
Now that the main RCU API knows about softirq disabling and softirq's
quiescent states, the RCU-bh update code can be dispensed with.
This commit therefore removes the RCU-bh update-side implementation and
defines RCU-bh's update-side API in terms of that of either RCU-preempt or
RCU-sched, depending on the setting of the CONFIG_PREEMPT Kconfig option.

In kernels built with CONFIG_RCU_NOCB_CPU=y this has the knock-on effect
of reducing by one the number of rcuo kthreads per CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2025-12-08 00:52:04 +00:00
Woomymy
54404ec743 kernel: irq_work: Remove mediatek schedule monitor support
Change-Id: I4cf9879d9e8eb605f37e50fcde089b32ef6e7c9d
2025-12-08 00:52:03 +00:00
Sultan Alsawaf
d17ffb511b sched/fair: Compile out NUMA code entirely when NUMA is disabled
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I7334594fbe835f615a199cfe02ee526135abab06
2025-12-08 00:13:47 +00:00
Yaroslav Furman
83c0ffd088 rcu: fix a performance regression
Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU
to using local workqueses and removed power efficiency flag from them.

This caused a performance regression that can be observed in Geekbench 5
after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from
760/2500 to 620/2300 (single/multi core respectively).

Add WQ_POWER_EFFICIENT flag to avoid this regression.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: atndko <z1281552865@gmail.com>
2025-10-18 10:51:05 +00:00
Peter Zijlstra
2de72a359a sched: Fix rq->nr_iowait ordering
schedule()				ttwu()
    deactivate_task();			  if (p->on_rq && ...) // false
					    atomic_dec(&task_rq(p)->nr_iowait);
    if (prev->in_iowait)
      atomic_inc(&rq->nr_iowait);

Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.

Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
Change-Id: Iee2ed007cbdbe9cb1ca8e028d928d263d85e1f2b
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2025-10-18 10:51:05 +00:00
Peter Zijlstra
ee8d7807c7 sched/core: Fix preempt warning in ttwu
John reported a DEBUG_PREEMPT warning caused by commit:

  aacedf26fb76 ("sched/core: Optimize try_to_wake_up() for local wakeups")

I overlooked that ttwu_stat() requires preemption disabled.

Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: aacedf26fb76 ("sched/core: Optimize try_to_wake_up() for local wakeups")
Link: https://lkml.kernel.org/r/20190710105736.GK3402@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change-Id: I385c5e0796c36e35761cc4658edd3c00ac908bfb
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2025-10-18 10:51:05 +00:00
Peter Zijlstra
a8736dd057 sched/core: Optimize try_to_wake_up() for local wakeups
Jens reported that significant performance can be had on some block
workloads by special casing local wakeups. That is, wakeups on the
current task before it schedules out.

Given something like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: gkohli@codeaurora.org
Cc: hch@lst.de
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change-Id: I603279a68529f271e74b2bd123fba2ccb09ee0d3
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2025-10-18 10:51:05 +00:00
Peter Oskolkov
53570fa708 sched/fair: Tweak pick_next_entity()
Currently, pick_next_entity(...) has the following structure
(simplified):

  [...]
  if (last_buddy_ok())
    result = last_buddy;
  if (next_buddy_ok())
    result = next_buddy;
  [...]

The intended behavior is to prefer next buddy over last buddy;
the current code somewhat obfuscates this, and also wastes
cycles checking the last buddy when eventually the next buddy is
picked up.

So this patch refactors two 'ifs' above into

  [...]
  if (next_buddy_ok())
      result = next_buddy;
  else if (last_buddy_ok())
      result = last_buddy;
  [...]

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guitttot@linaro.org>
Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com
Signed-off-by: John Vincent <git@tenseventyseven.cf>
2025-10-18 10:51:00 +00:00
LinkBoi00
bba5960695 {kernel, net}: Disable useless kernel modules
Signed-off-by: LinkBoi00 <linkdevel@protonmail.com>
2025-10-18 10:51:00 +00:00
Pavankumar Kondeti
2a7433f75e sched/walt: Fix negative count of sched_asym_cpucapacity static key
The current code sets per-cpu variable sd_asym_cpucapacity while
building sched domains even when there are no asymmetric CPUs.
This is done to make sure that EAS remains enabled on a b.L system
after hotplugging out all big/LITTLE CPUs. However it is causing
the below warning during CPU hotplug.

[13988.932604] pc : static_key_slow_dec_cpuslocked+0xe8/0x150
[13988.932608] lr : static_key_slow_dec_cpuslocked+0xe8/0x150
[13988.932610] sp : ffffffc010333c00
[13988.932612] x29: ffffffc010333c00 x28: ffffff8138d88088
[13988.932615] x27: 0000000000000000 x26: 0000000000000081
[13988.932618] x25: ffffff80917efc80 x24: ffffffc010333c60
[13988.932621] x23: ffffffd32bf09c58 x22: 0000000000000000
[13988.932623] x21: 0000000000000000 x20: ffffff80917efc80
[13988.932626] x19: ffffffd32bf0a3e0 x18: ffffff8138039c38
[13988.932628] x17: ffffffd32bf2b000 x16: 0000000000000050
[13988.932631] x15: 0000000000000050 x14: 0000000000040000
[13988.932633] x13: 0000000000000178 x12: 0000000000000001
[13988.932636] x11: 16a9ca5426841300 x10: 16a9ca5426841300
[13988.932639] x9 : 16a9ca5426841300 x8 : 16a9ca5426841300
[13988.932641] x7 : 0000000000000000 x6 : ffffff813f4edadb
[13988.932643] x5 : 0000000000000000 x4 : 0000000000000004
[13988.932646] x3 : ffffffc010333880 x2 : ffffffd32a683a2c
[13988.932648] x1 : ffffffd329355498 x0 : 000000000000001b
[13988.932651] Call trace:
[13988.932656]  static_key_slow_dec_cpuslocked+0xe8/0x150
[13988.932660]  partition_sched_domains_locked+0x1f8/0x80c
[13988.932666]  sched_cpu_deactivate+0x9c/0x13c
[13988.932670]  cpuhp_invoke_callback+0x6ac/0xa8c
[13988.932675]  cpuhp_thread_fun+0x158/0x1ac
[13988.932678]  smpboot_thread_fn+0x244/0x3e4
[13988.932681]  kthread+0x168/0x178
[13988.932685]  ret_from_fork+0x10/0x18

The mismatch between increment/decrement of sched_asym_cpucapacity
static key is resulting in the above warning. It is due to
the fact that the increment happens only when the system really
has asymmetric capacity CPUs. This check is done by going through
each CPU capacity. So when system becomes SMP during hotplug,
the increment never happens. However the decrement of this static
key is done when any of the currently online CPU has per-cpu variable
sd_asym_cpucapacity value as non-NULL. Since we always set this
variable, we run into this issue.

Our goal was to enable EAS on SMP. To achieve that enable EAS and
build perf domains (required for compute energy) irrespective
of per-cpu variable sd_asym_cpucapacity value. In this way we
no longer have to enable sched_asym_cpucapacity feature on SMP
to enable EAS.

Change-Id: Id46f2b80350b742c75195ad6939b814d4695eb07
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2025-10-18 10:51:00 +00:00
Eric W. Biederman
b2d486cd8f tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.

Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Change-Id: I8b3d0891c3a68e549c10e7e10bfc814f82138a5d
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 0ff7b2cfbae36ebcd216c6a5ad7f8534eebeaee2
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Srinivasarao Pathipati <quic_spathi@quicinc.com>
2025-10-18 10:51:00 +00:00
Eric W. Biederman
af0b4b4a83 tasks: Add a count of task RCU users
Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Change-Id: Ib1f00439f5e119cce4af2bf712df5a60b47fa81f
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 3fbd7ee285b2bbc6eebd15a3c8786d9776a402a8
Git-repo: https://android.googlesource.com/kernel/common/
[quic_spathi@quicinc.com: resolved trivial merge conflicts]
Signed-off-by: Srinivasarao Pathipati <quic_spathi@quicinc.com>
2025-10-18 10:50:59 +00:00
Park Ju Hyung
33b56f04b6 cgroup: use kmem_cache pool for struct cgrp_cset_link
These get allocated and freed millions of times on Android.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Change-Id: Id0c0c8dfbd835fabf7b13ab9e97322cf6868802c
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
2025-10-17 23:08:52 +00:00
bengris32
e8f6ddc79a drivers: {cpufreq,mediatek}: Allow disabling MediaTek CPU policy
Change-Id: I19610741ae7e153d88164798050ede7158d09a12
Signed-off-by: bengris32 <bengris32@protonmail.ch>
2025-10-17 23:08:10 +00:00
LinkBoi00
c415fc91d2 kernel: time: Drop some spammy alarmtimer verbose logging
Change-Id: Ic23453b1ba8fd848cf3b7b411148a6d2344abbc0
Signed-off-by: LinkBoi00 <linkdevel@protonmail.com>
2025-10-17 23:06:57 +00:00
Qinglang Miao
fb17f77612 UPSTREAM:sched/uclamp: Remove unnecessary mutex_init()
The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(),
it is unnecessary to initialize it runtime via mutex_init().

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
Signed-off-by: RuRuTiaSaMa <1009087450@qq.com>
2025-10-17 23:01:38 +00:00
Michel Lespinasse
7dfd6b5320 UPSTREAM: mmap locking API: convert nested write lock sites
Add API for nested write locks and convert the few call sites doing that.

Change-Id: I5d0be03fe9bb4d1e12b608464ce0353b81185a7a
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-7-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-12 15:06:04 +01:00
Artem Savkov
e337088063 UPSTREAM: ftrace: Return the first found result in lookup_rec()
It appears that ip ranges can overlap so. In that case lookup_rec()
returns whatever results it got last even if it found nothing in last
searched page.

This breaks an obscure livepatch late module patching usecase:
  - load livepatch
  - load the patched module
  - unload livepatch
  - try to load livepatch again

To fix this return from lookup_rec() as soon as it found the record
containing searched-for ip. This used to be this way prior lookup_rec()
introduction.

Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com

Cc: stable@vger.kernel.org
Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
Change-Id: Ibfd941aa40df53bce30b7973d58c3665a4a4a8d8
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 15:02:48 +01:00
Steven Rostedt (VMware)
84f8dcd11a UPSTREAM: ftrace: Separate out functionality from ftrace_location_range()
Create a new function called lookup_rec() from the functionality of
ftrace_location_range(). The difference between lookup_rec() is that it
returns the record that it finds, where as ftrace_location_range() returns
only if it found a match or not.

The lookup_rec() is static, and can be used for new functionality where
ftrace needs to find a record of a specific address.

Change-Id: I7e5a80df3f1486b889d5fa533728794f79afa24a
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 15:02:45 +01:00
Pankaj Bharadiya
fd78a84b36 BACKPORT: treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Change-Id: I24296633f28fea05d12618c8e47dc8acb8df18d8
Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2025-10-12 15:02:25 +01:00
Toke Høiland-Jørgensen
9c88ef1085 UPSTREAM: bpf: Fix stackmap overflow check on 32-bit arches
[ Upstream commit 7a4b21250bf79eef26543d35bd390448646c536b ]

The stackmap code relies on roundup_pow_of_two() to compute the number
of hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code.

The commit in the fixes tag actually attempted to fix this, but the fix
did not account for the UB, so the fix only works on CPUs where an
overflow does result in a neat truncation to zero, which is not
guaranteed. Checking the value before rounding does not have this
problem.

Fixes: 6183f4d3a0a2 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
Change-Id: Id67d50b83af553ac5c1087ebded62c4526e95235
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Message-ID: <20240307120340.99577-4-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-12 15:01:33 +01:00
Michel Lespinasse
82d4befc18 UPSTREAM: mmap locking API: add mmap_read_trylock_non_owner()
Add a couple APIs used by kernel/bpf/stackmap.c only:
- mmap_read_trylock_non_owner()
- mmap_read_unlock_non_owner() (may be called from a work queue).

It's still not ideal that bpf/stackmap subverts the lock ownership in this
way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
way of addressing this in the short term.

Change-Id: I2126c9da656dfa59753b280c645c0eb3228ce323
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-8-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-12 15:01:33 +01:00
Qian Cai
172dec8a20 BACKPORT: locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:

  b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Change-Id: Ic64f4522a848a31195e546ff531d59201e6e8deb
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-12 15:01:32 +01:00
Mike Rapoport
213670f0fe BACKPORT: mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Change-Id: I8a69883a0091366839170f569a44e12544327183
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-12 14:59:59 +01:00
Kuan-Ying Lee
409397cc7d ANDROID: syscall_check: add vendor hook for bpf syscall
Through this vendor hook, we can get the timing to check
current running task for the validation of its credential
and bpf operations.

Bug: 191291287

Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ie4ed8df7ad66df2486fc7e52a26d9191fc0c176e
2025-10-12 14:59:59 +01:00
Michel Lespinasse
b8ba613d40 BACKPORT: mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Change-Id: If729000ea8cedab7079ccc1350db26ed71f0df92
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-12 14:53:46 +01:00
Julien Thierry
b50f106397 BACKPORT: objtool: Rename frame.h -> objtool.h
Header frame.h is getting more code annotations to help objtool analyze
object files.

Rename the file to objtool.h.

[ jpoimboe: add objtool.h to MAINTAINERS ]

Change-Id: I0b95a75fb3cfe673bf18d8d5a886b2809ea3b5f5
Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2025-10-12 13:56:31 +01:00
Sami Tolvanen
dbabc70396 ANDROID: Revert "ANDROID: bpf: validate bpf_func when BPF_JIT is enabled with CFI"
This reverts commit 788bbf4f261fc558b714bdedd4122d7115efc940.

Reason for revert: fixes a conflict with upcoming upstream BPF changes.
Bug: 145210207
Change-Id: I3bbc1279fc613be0d2e833008413ad3561b851df
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-10-12 13:56:31 +01:00
Peter Zijlstra
ec55a92169 UPSTREAM: x86/ibt,ftrace: Search for __fentry__ location
commit aebfd12521d9c7d0b502cf6d06314cfbcdccfe3b upstream.

Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.

Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.

Then audit/update all callsites of this function to consistently use
these new semantics.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Change-Id: I72966b96df528f86121b6f6c866b56bf09a4227f
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
Stable-dep-of: e60b613df8b6 ("ftrace: Fix possible use-after-free issue in ftrace_location()")
[Shivani: Modified to apply on v5.10.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-12 13:55:27 +01:00
Steven Rostedt (VMware)
d69a112c03 UPSTREAM: ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
If a direct ftrace callback is at a location that does not have any other
ftrace helpers attached to it, it is possible to simply just change the
text to call the new caller (if the architecture supports it). But this
requires special architecture code. Currently, modify_ftrace_direct() uses a
trick to add a stub ftrace callback to the location forcing it to call the
ftrace iterator. Then it can change the direct helper to call the new
function in C, and then remove the stub. Removing the stub will have the
location now call the new location that the direct helper is using.

The new helper function does the registering the stub trick, but is a weak
function, allowing an architecture to override it to do something a bit more
direct.

Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: I24ee1bebc80aa17ee382063b3cd7d58ea6126508
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
Steven Rostedt (VMware)
28f9678c46 UPSTREAM: ftrace: Add ftrace_find_direct_func()
As function_graph tracer modifies the return address to insert a trampoline
to trace the return of a function, it must be aware of a direct caller, as
when it gets called, the function's return address may not be at on the
stack where it expects. It may have to see if that return address points to
the a direct caller and adjust if it is.

Change-Id: I80bd23932c426ec3b2db76538dacd283b972db0a
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
Steven Rostedt (VMware)
0ad1df7e64 UPSTREAM: ftrace: Add helper find_direct_entry() to consolidate code
Both unregister_ftrace_direct() and modify_ftrace_direct() needs to
normalize the ip passed in to match the rec->ip, as it is acceptable to have
the ip on the ftrace call site but not the start. There are also common
validity checks with the record found by the ip, these should be done for
both unregister_ftrace_direct() and modify_ftrace_direct().

Change-Id: Ib74ac2ae2f0c9d6c261b409bd87ce9f908e7c8da
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
Kuan-Ying Lee
0d91eb18f0 ANDROID: bpf: Add vendor hook
Add vendor hook for bpf, so we can get memory type and
use it to do memory type check for architecture
dependent page table setting.

Bug: 181639260

Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Icac325a040fb88c7f6b04b2409029b623bd8515f
2025-10-12 13:55:15 +01:00
bengris32
c6aa1292ca Merge branch 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip into android-4.19.y-mediatek
* 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip:
  CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree
  Update localversion-st, tree is up-to-date with 5.4.298.
  f2fs: fix to do sanity check on ino and xnid
  squashfs: fix memory leak in squashfs_fill_super
  pNFS: Handle RPC size limit for layoutcommits
  wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect
  usb: core: usb_submit_urb: downgrade type check
  udf: Verify partition map count
  f2fs: fix to avoid panic in f2fs_evict_inode
  usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm
  Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS"
  net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions
  HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
  HID: asus: fix UAF via HID_CLAIMED_INPUT validation
  efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
  sctp: initialize more fields in sctp_v6_from_sk()
  net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
  net/mlx5e: Set local Xoff after FW update
  net: dlink: fix multicast stats being counted incorrectly
  atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().
  net/atm: remove the atmdev_ops {get, set}sockopt methods
  Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced
  powerpc/kvm: Fix ifdef to remove build warning
  net: ipv4: fix regression in local-broadcast routes
  vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()
  scsi: core: sysfs: Correct sysfs attributes access rights
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  alloc_fdtable(): change calling conventions.
  ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
  net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit
  ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add
  ALSA: usb-audio: Fix size validation in convert_chmap_v3()
  scsi: qla4xxx: Prevent a potential error pointer dereference
  usb: xhci: Fix slot_id resource race conflict
  nfs: fix UAF in direct writes
  NFS: Fix up commit deadlocks
  Bluetooth: fix use-after-free in device_for_each_child()
  selftests: forwarding: tc_actions.sh: add matchall mirror test
  codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
  sch_qfq: make qfq_qlen_notify() idempotent
  sch_hfsc: make hfsc_qlen_notify() idempotent
  sch_drr: make drr_qlen_notify() idempotent
  btrfs: populate otime when logging an inode item
  media: venus: hfi: explicitly release IRQ during teardown
  f2fs: fix to avoid out-of-boundary access in dnode page
  media: venus: protect against spurious interrupts during probe
  media: venus: vdec: Clamp param smaller than 1fps and bigger than 240.
  drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
  media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt()
  media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free()
  ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig
  usb: musb: omap2430: fix device leak at unbind
  NFS: Fix the setting of capabilities when automounting a new filesystem
  NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode()
  NFSv4: Fix nfs4_bitmap_copy_adjust()
  usb: typec: fusb302: cache PD RX state
  cdc-acm: fix race between initial clearing halt and open
  USB: cdc-acm: do not log successful probe on later errors
  nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm()
  tracing: Add down_write(trace_event_sem) when adding trace event
  usb: hub: Don't try to recover devices lost during warm reset.
  usb: hub: avoid warm port reset during USB3 disconnect
  x86/mce/amd: Add default names for MCA banks and blocks
  iio: hid-sensor-prox: Fix incorrect OFFSET calculation
  mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
  mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()
  net: usbnet: Fix the wrong netif_carrier_on() call
  net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
  ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value
  kbuild: Add KBUILD_CPPFLAGS to as-option invocation
  kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS
  kbuild: Add CLANG_FLAGS to as-instr
  mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation
  kbuild: Update assembler calls to use proper flags and language target
  ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS
  usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
  USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
  usb: storage: realtek_cr: Use correct byte order for bcs->Residue
  USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
  usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
  iio: proximity: isl29501: fix buffered read on big-endian systems
  ftrace: Also allocate and copy hash for reading of filter files
  fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
  fs/buffer: fix use-after-free when call bh_read() helper
  drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
  media: venus: Add a check for packet size after reading from shared memory
  media: ov2659: Fix memory leaks in ov2659_probe()
  media: usbtv: Lock resolution while streaming
  media: gspca: Add bounds checking to firmware parser
  jbd2: prevent softlockup in jbd2_log_do_checkpoint()
  PCI: endpoint: Fix configfs group removal on driver teardown
  PCI: endpoint: Fix configfs group list head handling
  mtd: rawnand: fsmc: Add missing check after DMA map
  wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table()
  zynq_fpga: use sgtable-based scatterlist wrappers
  ata: libata-scsi: Fix ata_to_sense_error() status handling
  ext4: fix reserved gdt blocks handling in fsmap
  ext4: fix fsmap end of range reporting with bigalloc
  ext4: check fast symlink for ea_inode correctly
  Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
  vt: defkeymap: Map keycodes above 127 to K_HOLE
  usb: gadget: udc: renesas_usb3: fix device leak at unbind
  usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init()
  m68k: Fix lost column on framebuffer debug console
  serial: 8250: fix panic due to PSLVERR
  media: uvcvideo: Do not mark valid metadata as invalid
  media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format()
  btrfs: fix log tree replay failure due to file with 0 links and extents
  thunderbolt: Fix copy+paste error in match_service_id()
  misc: rtsx: usb: Ensure mmc child device is active when card is present
  scsi: lpfc: Remove redundant assignment to avoid memory leak
  rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe
  pNFS: Fix uninited ptr deref in block/scsi layout
  pNFS: Fix disk addr range check in block/scsi layout
  pNFS: Fix stripe mapping in block/scsi layout
  ipmi: Fix strcpy source and destination the same
  kconfig: lxdialog: fix 'space' to (de)select options
  kconfig: gconf: fix potential memory leak in renderer_edited()
  kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed()
  scsi: aacraid: Stop using PCI_IRQ_AFFINITY
  scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans
  kconfig: nconf: Ensure null termination where strncpy is used
  kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
  PCI: pnv_php: Work around switches with broken presence detection
  media: uvcvideo: Fix bandwidth issue for Alcor camera
  media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar
  media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb()
  media: usb: hdpvr: disable zero-length read messages
  media: tc358743: Increase FIFO trigger level to 374
  media: tc358743: Return an appropriate colorspace from tc358743_set_fmt
  media: tc358743: Check I2C succeeded during probe
  pinctrl: stm32: Manage irq affinity settings
  scsi: mpt3sas: Correctly handle ATA device errors
  RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
  MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
  jfs: upper bound check of tree index in dbAllocAG
  jfs: Regular file corruption check
  jfs: truncate good inode pages when hard link is 0
  scsi: bfa: Double-free fix
  MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free}
  watchdog: dw_wdt: Fix default timeout
  fs/orangefs: use snprintf() instead of sprintf()
  scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated
  ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
  vhost: fail early when __vhost_add_used() fails
  uapi: in6: restore visibility of most IPv6 socket options
  net: ncsi: Fix buffer overflow in fetching version id
  net: dsa: b53: fix b53_imp_vlan_setup for BCM5325
  net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
  wifi: iwlegacy: Check rate_idx range after addition
  netmem: fix skb_frag_address_safe with unreadable skbs
  wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`.
  wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd()
  net: fec: allow disable coalescing
  (powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer
  s390/stp: Remove udelay from stp_sync_clock()
  wifi: iwlwifi: mvm: fix scan request validation
  net: thunderx: Fix format-truncation warning in bgx_acpi_match_id()
  net: ipv4: fix incorrect MTU in broadcast routes
  wifi: cfg80211: Fix interface type validation
  et131x: Add missing check after DMA map
  be2net: Use correct byte order and format string for TCP seq and ack_seq
  s390/time: Use monotonic clock in get_cycles()
  wifi: cfg80211: reject HTC bit for management frames
  ktest.pl: Prevent recursion of default variable options
  ASoC: codecs: rt5640: Retry DEVICE_ID verification
  ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros
  ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control
  platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches
  pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop()
  ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4
  ASoC: hdac_hdmi: Rate limit logging on connection and disconnection
  mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode()
  ACPI: processor: fix acpi_object initialization
  PM: sleep: console: Fix the black screen issue
  thermal: sysfs: Return ENODATA instead of EAGAIN for reads
  selftests: tracing: Use mutex_unlock for testing glob filter
  ARM: tegra: Use I/O memcpy to write to IRAM
  gpio: tps65912: check the return value of regmap_update_bits()
  ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed
  cpufreq: Exit governor when failed to start old governor
  usb: xhci: Avoid showing errors during surprise removal
  usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command
  usb: xhci: Avoid showing warnings for dying controller
  selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
  usb: xhci: print xhci->xhc_state when queue_command failed
  securityfs: don't pin dentries twice, once is enough...
  hfs: fix not erasing deleted b-tree node issue
  drbd: add missing kref_get in handle_write_conflicts
  arm64: Handle KCOV __init vs inline mismatches
  hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
  hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
  hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
  hfs: fix slab-out-of-bounds in hfs_bnode_read()
  sctp: linearize cloned gso packets in sctp_rcv
  netfilter: ctnetlink: fix refcount leak on table dump
  udp: also consider secpath when evaluating ipsec use for checksumming
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  NFSD: detect mismatch of file handle and delegation stateid in OPEN op
  net: dpaa: fix device leak when querying time stamp info
  net: gianfar: fix device leak when querying time stamp info
  netlink: avoid infinite retry looping in netlink_unicast()
  ALSA: usb-audio: Validate UAC3 cluster segment descriptors
  ALSA: usb-audio: Validate UAC3 power domain descriptors, too
  usb: gadget : fix use-after-free in composite_dev_cleanup()
  MIPS: mm: tlb-r4k: Uniquify TLB entries on init
  USB: serial: option: add Foxconn T99W709
  vsock: Do not allow binding to VMADDR_PORT_ANY
  net/packet: fix a race in packet_set_ring() and packet_notifier()
  perf/core: Prevent VMA split of buffer mappings
  perf/core: Exit early on perf_mmap() fail
  perf/core: Don't leak AUX buffer refcount on allocation failure
  pptp: fix pptp_xmit() error path
  smb: client: let recv_done() cleanup before notifying the callers.
  benet: fix BUG when creating VFs
  ipv6: reject malicious packets in ipv6_gso_segment()
  pptp: ensure minimal skb length in pptp_xmit()
  netpoll: prevent hanging NAPI when netcons gets enabled
  NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
  pci/hotplug/pnv-php: Wrap warnings in macro
  pci/hotplug/pnv-php: Improve error msg on power state change failure
  usb: chipidea: udc: fix sleeping function called from invalid context
  f2fs: fix to avoid out-of-boundary access in devs.path
  f2fs: fix to avoid UAF in f2fs_sync_inode_meta()
  rtc: pcf8563: fix incorrect maximum clock rate handling
  rtc: hym8563: fix incorrect maximum clock rate handling
  rtc: ds1307: fix incorrect maximum clock rate handling
  mtd: rawnand: atmel: set pmecc data setup time
  mtd: rawnand: atmel: Fix dma_mapping_error() address
  jfs: fix metapage reference count leak in dbAllocCtl
  fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
  crypto: qat - fix seq_file position update in adf_ring_next()
  dmaengine: nbpfaxi: Add missing check after DMA map
  dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
  fs/orangefs: Allow 2 more characters in do_c_string()
  crypto: img-hash - Fix dma_unmap_sg() nents value
  scsi: isci: Fix dma_unmap_sg() nents value
  scsi: mvsas: Fix dma_unmap_sg() nents value
  scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value
  perf tests bp_account: Fix leaked file descriptor
  crypto: ccp - Fix crash when rebind ccp device for ccp.ko
  pinctrl: sunxi: Fix memory leak on krealloc failure
  power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set
  clk: davinci: Add NULL check in davinci_lpsc_clk_register()
  mtd: fix possible integer overflow in erase_xfer()
  crypto: marvell/cesa - Fix engine load inaccuracy
  PCI: rockchip-host: Fix "Unexpected Completion" log message
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  netfilter: xt_nfacct: don't assume acct name is null-terminated
  can: kvaser_usb: Assign netdev.dev_port based on device channel index
  wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE
  Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
  mwl8k: Add missing check after DMA map
  wifi: rtl8xxxu: Fix RX skb size for aggregation disabled
  net/sched: Restrict conditions for adding duplicating netems to qdisc tree
  arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
  netfilter: nf_tables: adjust lockdep assertions handling
  drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value
  m68k: Don't unregister boot console needlessly
  tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range
  iwlwifi: Add missing check for alloc_ordered_workqueue
  wifi: iwlwifi: Fix memory leak in iwl_mvm_init()
  wifi: rtl818x: Kill URBs before clearing tx status queue
  caif: reduce stack size, again
  staging: nvec: Fix incorrect null termination of battery manufacturer
  samples: mei: Fix building on musl libc
  usb: early: xhci-dbc: Fix early_ioremap leak
  Revert "vmci: Prevent the dispatching of uninitialized payloads"
  pps: fix poll support
  vmci: Prevent the dispatching of uninitialized payloads
  staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc()
  ARM: dts: vfxxx: Correctly use two tuples for timer address
  ASoC: ops: dynamically allocate struct snd_ctl_elem_value
  hfsplus: remove mutex_lock check in hfsplus_free_extents
  ASoC: Intel: fix SND_SOC_SOF dependencies
  ethernet: intel: fix building with large NR_CPUS
  usb: phy: mxs: disconnect line when USB charger is attached
  usb: chipidea: udc: protect usb interrupt enable
  usb: chipidea: udc: add new API ci_hdrc_gadget_connect
  comedi: comedi_test: Fix possible deletion of uninitialized timers
  nilfs2: reject invalid file types when reading inodes
  i2c: qup: jump out of the loop in case of timeout
  net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class
  net: appletalk: Fix use-after-free in AARP proxy probe
  net: appletalk: fix kerneldoc warnings
  RDMA/core: Rate limit GID cache warning messages
  usb: hub: fix detection of high tier USB3 devices behind suspended hubs
  net_sched: sch_sfq: reject invalid perturb period
  net_sched: sch_sfq: move the limit validation
  net_sched: sch_sfq: use a temporary work area for validating configuration
  net_sched: sch_sfq: don't allow 1 packet limit
  net_sched: sch_sfq: handle bigger packets
  net_sched: sch_sfq: annotate data-races around q->perturb_period
  power: supply: bq24190_charger: Fix runtime PM imbalance on error
  xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS
  virtio-net: ensure the received length does not exceed allocated size
  usb: dwc3: qcom: Don't leave BCR asserted
  usb: musb: fix gadget state on disconnect
  net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree
  net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime
  Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
  Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
  Bluetooth: SMP: If an unallowed command is received consider it a failure
  Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
  usb: net: sierra: check for no status endpoint
  net/sched: sch_qfq: Fix race condition on qfq_aggregate
  net: emaclite: Fix missing pointer increment in aligned_read()
  comedi: Fix use of uninitialized data in insn_rw_emulate_bits()
  comedi: Fix some signed shift left operations
  comedi: das6402: Fix bit shift out of bounds
  comedi: das16m1: Fix bit shift out of bounds
  comedi: aio_iiro_16: Fix bit shift out of bounds
  comedi: pcl812: Fix bit shift out of bounds
  iio: adc: max1363: Reorder mode_list[] entries
  iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[]
  soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled
  soc: aspeed: lpc-snoop: Cleanup resources in stack-order
  mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models
  memstick: core: Zero initialize id_reg in h_memstick_read_dev_id()
  isofs: Verify inode mode when loading from disk
  dmaengine: nbpfaxi: Fix memory corruption in probe()
  af_packet: fix soft lockup issue caused by tpacket_snd()
  af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()
  phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept()
  HID: core: do not bypass hid_hw_raw_request
  HID: core: ensure __hid_request reserves the report ID as the first byte
  HID: core: ensure the allocated report buffer can contain the reserved report ID
  pch_uart: Fix dma_sync_sg_for_device() nents value
  Input: xpad - set correct controller type for Acer NGR200
  i2c: stm32: fix the device used for the DMA map
  usb: gadget: configfs: Fix OOB read on empty string write
  USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI
  USB: serial: option: add Foxconn T99W640
  USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition
  dma-mapping: add generic helpers for mapping sgtable objects
  usb: renesas_usbhs: Flush the notify_hotplug_work
  gpio: rcar: Use raw_spinlock to protect register access

Change-Id: Ia6b8b00918487999c648f298d3550afc7eaaae03
Signed-off-by: bengris32 <bengris32@protonmail.ch>
2025-10-12 13:39:56 +01:00
Tengda Wu
cdb89bab3e ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
[ Upstream commit 4013aef2ced9b756a410f50d12df9ebe6a883e4a ]

When calling ftrace_dump_one() concurrently with reading trace_pipe,
a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race
condition.

The issue occurs because:

CPU0 (ftrace_dump)                              CPU1 (reader)
echo z > /proc/sysrq-trigger

!trace_empty(&iter)
trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
trace_find_next_entry_inc(&iter)
  __find_next_entry
    ring_buffer_empty_cpu <- all empty
  return NULL

trace_printk_seq(&iter.seq)
  WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f86 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli@kernel.org>
2025-09-22 10:17:52 +02:00
Micah Morton
da65e40eca BACKPORT: LSM: add SafeSetID module that gates setid calls
This change ensures that the set*uid family of syscalls in kernel/sys.c
(setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with
the CAP_OPT_INSETID flag, so capability checks in the security_capable
hook can know whether they are being called from within a set*uid
syscall. This change is a no-op by itself, but is needed for the
proposed SafeSetID LSM.

Change-Id: Ie661692d340f57b74c5cd6623159c028795d481f
Signed-off-by: Micah Morton <mortonm@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
2025-09-20 03:24:13 +01:00
Song Liu
d44aabf053 UPSTREAM: perf, bpf: Consider events with attr.bpf_event as side-band events
Events with attr.bpf_event set should be considered as side-band events,
as they carry information about BPF programs.

Change-Id: I581267aaad653f95ba8e3f286dff0a5d31479a69
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
Link: http://lkml.kernel.org/r/20190226002019.3748539-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-20 03:24:11 +01:00
Odin Ugedal
0aea134aeb UPSTREAM: cgroup: fix psi monitor for root cgroup
Fix NULL pointer dereference when adding new psi monitor to the root
cgroup. PSI files for root cgroup was introduced in df5ba5be742 by using
system wide psi struct when reading, but file write/monitor was not
properly fixed. Since the PSI config for the root cgroup isn't
initialized, the current implementation tries to lock a NULL ptr,
resulting in a crash.

Can be triggered by running this as root:
$ tee /sys/fs/cgroup/cpu.pressure <<< "some 10000 1000000"

Change-Id: Id2137e41eab9efa13f52ac2afa937f66b69847e1
Signed-off-by: Odin Ugedal <odin@uged.al>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: df5ba5be7425 ("kernel/sched/psi.c: expose pressure metrics on root cgroup")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # 5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-20 03:24:10 +01:00
Qian Cai
dd0d07a16f UPSTREAM: cgroup: fix psi_show() crash on 32bit ino archs
Similar to the commit d7495343228f ("cgroup: fix incorrect
WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not
equal to 1 on 32bit ino archs which triggers all sorts of issues with
psi_show() on s390x. For example,

 BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
 Read of size 4 at addr 000000001e0ce000 by task read_all/3667
 collect_percpu_times+0x2d0/0x798
 psi_show+0x7c/0x2a8
 seq_read+0x2ac/0x830
 vfs_read+0x92/0x150
 ksys_read+0xe2/0x188
 system_call+0xd8/0x2b4

Fix it by using cgroup_ino().

Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
Change-Id: Iefbc4965c651541e1bb1b23ac67c991d913ed5a7
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.5
2025-09-20 03:24:10 +01:00
Lorenzo Bianconi
da419dac95 BACKPORT: bpf: cpumap: Implement XDP_REDIRECT for eBPF programs attached to map entries
Introduce XDP_REDIRECT support for eBPF programs attached to cpumap
entries.
This patch has been tested on Marvell ESPRESSObin using a modified
version of xdp_redirect_cpu sample in order to attach a XDP program
to CPUMAP entries to perform a redirect on the mvneta interface.
In particular the following scenario has been tested:

rq (cpu0) --> mvneta - XDP_REDIRECT (cpu0) --> CPUMAP - XDP_REDIRECT (cpu1) --> mvneta

$./xdp_redirect_cpu -p xdp_cpu_map0 -d eth0 -c 1 -e xdp_redirect \
	-f xdp_redirect_kern.o -m tx_port -r eth0

tx: 285.2 Kpps rx: 285.2 Kpps

Attaching a simple XDP program on eth0 to perform XDP_TX gives
comparable results:

tx: 288.4 Kpps rx: 288.4 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Change-Id: I485e490be5fc60d91fb37a4e9b694b72ef83c627
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/2cf8373a731867af302b00c4ff16c122630c4980.1594734381.git.lorenzo@kernel.org
2025-09-20 03:24:09 +01:00
Daniel Borkmann
0fd96bafc4 UPSTREAM: bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
[ Upstream commit ff40e51043af63715ab413995ff46996ecf9583f ]

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

  1) The audit events that are triggered due to calls to security_locked_down()
     can OOM kill a machine, see below details [0].

  2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
     when trying to wake up kauditd, for example, when using trace_sched_switch()
     tracepoint, see details in [1]. Triggering this was not via some hypothetical
     corner case, but with existing tools like runqlat & runqslower from bcc, for
     example, which make use of this tracepoint. Rough call sequence goes like:

     rq_lock(rq) -> -------------------------+
       trace_sched_switch() ->               |
         bpf_prog_xyz() ->                   +-> deadlock
           selinux_lockdown() ->             |
             audit_log_end() ->              |
               wake_up_interruptible() ->    |
                 try_to_wake_up() ->         |
                   rq_lock(rq) --------------+

What's worse is that the intention of 59438b46471a to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:

  allow <who> <who> : lockdown { <reason> };

However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:

  bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.

The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b46471a where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

  I starting seeing this with F-34. When I run a container that is traced with
  BPF to record the syscalls it is doing, auditd is flooded with messages like:

  type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
    for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
      scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
        tclass=lockdown permissive=0

  This seems to be leading to auditd running out of space in the backlog buffer
  and eventually OOMs the machine.

  [...]
  auditd running at 99% CPU presumably processing all the messages, eventually I get:
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
  Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
  Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
  Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
    Serhei Makarov says:

  Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
  bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
  is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
  testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
  ppc64le. Example stack trace:

  [...]
  [  730.868702] stack backtrace:
  [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
  [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [  730.873278] Call Trace:
  [  730.873770]  dump_stack+0x7f/0xa1
  [  730.874433]  check_noncircular+0xdf/0x100
  [  730.875232]  __lock_acquire+0x1202/0x1e10
  [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
  [  730.876844]  lock_acquire+0xc2/0x3a0
  [  730.877551]  ? __wake_up_common_lock+0x52/0x90
  [  730.878434]  ? lock_acquire+0xc2/0x3a0
  [  730.879186]  ? lock_is_held_type+0xa7/0x120
  [  730.880044]  ? skb_queue_tail+0x1b/0x50
  [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
  [  730.881656]  ? __wake_up_common_lock+0x52/0x90
  [  730.882532]  __wake_up_common_lock+0x52/0x90
  [  730.883375]  audit_log_end+0x5b/0x100
  [  730.884104]  slow_avc_audit+0x69/0x90
  [  730.884836]  avc_has_perm+0x8b/0xb0
  [  730.885532]  selinux_lockdown+0xa5/0xd0
  [  730.886297]  security_locked_down+0x20/0x40
  [  730.887133]  bpf_probe_read_compat+0x66/0xd0
  [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
  [  730.888917]  trace_call_bpf+0xe9/0x240
  [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
  [  730.890579]  perf_trace_sched_switch+0x142/0x180
  [  730.891485]  ? __schedule+0x6d8/0xb20
  [  730.892209]  __schedule+0x6d8/0xb20
  [  730.892899]  schedule+0x5b/0xc0
  [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
  [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
  [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Change-Id: Ie9ec178238bbe62a9e284d52eecff1ee569b307d
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-20 03:24:09 +01:00