29492 Commits

Author SHA1 Message Date
kondors1995
7972295b00 Merge tag 'v4.14.327' of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux into 13.0
This is the 4.14.327 stable release
2023-10-11 15:44:44 +03:00
pwnrazr
7a8eb680e8 kernel: sysctl: stub sched_group*
The values are taken from our kernel tree. Not sure whether this is needed, though.

fixes logspam:
E ANDR-PERF-RESOURCEQS: Failed to apply optimization [3, 22]
E ANDR-PERF-UTIL: Failed to read /proc/sys/kernel/sched_group_upmigrate
E ANDR-PERF-OPTSHANDLER: Failed to read /proc/sys/kernel/sched_group_upmigrate
E ANDR-PERF-RESOURCEQS: Failed to apply optimization [3, 21]
2023-10-11 15:31:37 +03:00
pwnrazr
47803f2c11 kernel: sysctl: stub sched_boost
stops logspam
E ANDR-PERF-RESOURCEQS: Failed to apply optimization [3, 0]

[3, 0] refers to major, minor from 59660e6ac4/proprietary/vendor/etc/perf/commonresourceconfigs.xml (L55)
where a major of 3 (0x03) and a minor of 0 (0x0) refers to sched_boost
2023-10-11 15:31:37 +03:00
John Galt
226422bd47 kernel/sysctl: guard sysctl stub for walt if walt is enabled
2023-10-11 15:31:37 +03:00
kondors1995
e448e7d88c Revert "sched: Never migrate tasks upon execve()"
As per Emanuel & Sultan's recommendation

https://t.me/sultanskernel/956122

This reverts commit 230a5b636c.
2023-10-11 15:31:37 +03:00
Tashfin Shakeer Rhythm
a3b30e9dca Revert "ANDROID: sched/fair: Avoid unnecessary balancing of asymmetric capacity groups"
This negatively affects the scheduler performance. Revert it.

This reverts commit 0d6eeac4fba5f065b4033f59c03897d9b1203d20.

Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-10-11 15:31:36 +03:00
Valentin Schneider
b88ddb23ab UPSTREAM: sched/fair: Introduce a CPU capacity comparison helper
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.

One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Also note that comparing capacity extrema of local and source sched_group's
doesn't make much sense when, at the end of the day, the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.

While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
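
The difference between the two kinds of comparison can be sketched in plain C. The `fits_capacity()` body below follows from the 20% margin described above; the `capacity_greater()` name and its roughly 5% margin are assumptions made for illustration, since this message does not spell out the new helper.

```c
#include <stdbool.h>

/*
 * Utilization-vs-capacity check: true only when 'util' leaves about 20%
 * headroom below 'capacity' (util * 1.25 < capacity).
 */
static inline bool fits_capacity(unsigned long util, unsigned long capacity)
{
    return util * 1280 < capacity * 1024;
}

/*
 * Capacity-vs-capacity check (name and ~5% margin are assumptions):
 * true when cap1 is noticeably greater than cap2.
 */
static inline bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
    return cap1 * 1024 > cap2 * 1078;
}
```

With these definitions, `fits_capacity(871, 1024)` is false, which is the Pixel 4 "medium" CPU problem described above, while `capacity_greater(1024, 871)` is true.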

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-10-11 15:31:36 +03:00
Valentin Schneider
1b602a22e6 UPSTREAM: sched/fair: Clean up active balance nr_balance_failed trickery
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Clean up the LBF_DST_PINNED active balance special case.
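
A flag-based check of this kind might look roughly like the sketch below. It is a simplified model, not the kernel's can_migrate_task(): the struct, the `TOY_LBF_ACTIVE_LB` name and its value are hypothetical, and affinity, locking and NUMA checks are left out.

```c
#include <stdbool.h>

#define TOY_LBF_ACTIVE_LB 0x10 /* hypothetical flag value */

struct toy_lb_env {
    unsigned int flags;
    unsigned int nr_balance_failed;
    unsigned int cache_nice_tries;
};

static inline bool toy_should_migrate(const struct toy_lb_env *env,
                                      bool task_is_cache_hot)
{
    /* Active balance: migrate regardless of cache hotness. */
    if (env->flags & TOY_LBF_ACTIVE_LB)
        return true;

    /* Cache-cold tasks are always fair game. */
    if (!task_is_cache_hot)
        return true;

    /* Cache-hot tasks move only after enough failed balance attempts. */
    return env->nr_balance_failed > env->cache_nice_tries;
}
```

The point of the flag is that the decision no longer depends on a counter written from another CPU, so it cannot race with the stopper thread.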

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-10-11 15:31:36 +03:00
Lingutla Chandrasekhar
424ab00d82 sched/fair: Ignore percpu threads for imbalance pulls
[ Upstream commit 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2 ]

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.
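
The ordering that matters here can be shown with a small model: the per-CPU kthread test runs before the affinity test, so such threads never raise the "some pinned" flag. All names below (`toy_pin_task`, `toy_pin_env`, `TOY_LBF_SOME_PINNED`) are hypothetical stand-ins, and the boolean field replaces the real kthread_is_per_cpu() query.

```c
#include <stdbool.h>

#define TOY_LBF_SOME_PINNED 0x08 /* hypothetical flag value */

struct toy_pin_task {
    bool is_per_cpu_kthread;    /* stand-in for kthread_is_per_cpu(p) */
    unsigned long allowed_mask; /* bit N set: task may run on CPU N */
};

struct toy_pin_env {
    int dst_cpu;
    unsigned int flags;
};

static inline bool toy_can_migrate(const struct toy_pin_task *p,
                                   struct toy_pin_env *env)
{
    /* Per-CPU kthreads never move: bail out without flagging anything. */
    if (p->is_per_cpu_kthread)
        return false;

    /* Other tasks pinned away from dst_cpu do raise the flag. */
    if (!(p->allowed_mask & (1UL << env->dst_cpu))) {
        env->flags |= TOY_LBF_SOME_PINNED;
        return false;
    }

    return true;
}
```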

Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-10-11 15:31:36 +03:00
pwnrazr
fd2e0d75aa Revert "Trigger sbalance on enable_nonboot_cpus"
This reverts commit 7226f90fb58123c5dbece9d74c9095af946d1219.

As per Sultan:
"This is only after wake up from deep sleep.
Also, the IRQ subsystem remembers which CPU
an IRQ was affined to and should put the
IRQs back upon resume"
2023-10-11 15:31:36 +03:00
Sultan Alsawaf
f5062c7917 sched/cass: Introduce the Capacity Aware Superset Scheduler
The Capacity Aware Superset Scheduler (CASS) optimizes runqueue selection
of CFS tasks. By using CPU capacity as a basis for comparing the relative
utilization between different CPUs, CASS fairly balances load across CPUs
of varying capacities. This results in improved multi-core performance,
especially when CPUs are overutilized because CASS doesn't clip a CPU's
utilization when it eclipses the CPU's capacity.

As a superset of capacity aware scheduling, CASS implements a hierarchy of
criteria to determine the better CPU to wake a task upon between CPUs that
have the same relative utilization. This way, single-core performance,
latency, and cache affinity are all optimized where possible.

CASS doesn't feature explicit energy awareness but its basic load balancing
principle results in decreased overall energy, often better than what is
possible with explicit energy awareness. By fairly balancing load based on
relative utilization, all CPUs are kept at their lowest P-state necessary
to satisfy the overall load at any given moment.
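
As an illustration of comparing CPUs by relative utilization, the sketch below compares util/capacity ratios by cross-multiplication. It is not taken from the CASS source; the function name and parameters are hypothetical, and the real selection logic applies further tie-breaking criteria as described above.

```c
#include <stdbool.h>

/*
 * True when CPU A is relatively less utilized than CPU B, i.e.
 * util_a / cap_a < util_b / cap_b. Cross-multiplying avoids division,
 * and utilization is deliberately not clipped to the CPU's capacity.
 */
static inline bool toy_rel_util_lower(unsigned long util_a, unsigned long cap_a,
                                      unsigned long util_b, unsigned long cap_b)
{
    return (unsigned long long)util_a * cap_b <
           (unsigned long long)util_b * cap_a;
}
```

For example, a big CPU at utilization 1100 with capacity 1024 still ranks as less loaded than a little CPU at utilization 300 with capacity 261, because 1100/1024 is smaller than 300/261.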

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2023-10-11 15:31:35 +03:00
Viresh Kumar
1ee0aba9f4 sched/fair: Load balance aggressively for SCHED_IDLE CPUs
The fair scheduler performs periodic load balance on every CPU to check
if it can pull some tasks from other busy CPUs. The duration of this
periodic load balance is set to sd->balance_interval for the idle CPUs
and is calculated by multiplying the sd->balance_interval with the
sd->busy_factor (set to 32 by default) for the busy CPUs. The
multiplication is done for busy CPUs to avoid doing load balance too
often and rather spend more time executing actual tasks. While that is
the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
tasks.

With the recent enhancements in the fair scheduler around SCHED_IDLE
CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
CPU instead of other busy or idle CPUs. The same reasoning should be
applied to the load balancer as well to make it migrate tasks more
aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
latency of the migrated (SCHED_OTHER) tasks.

This patch makes minimal changes to the fair scheduler to do the next
load balance soon after the last non SCHED_IDLE task is dequeued from a
runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is
ignored while calculating the balance_interval for such CPUs. This is
done to avoid delaying the periodic load balance by a few hundred
milliseconds for SCHED_IDLE CPUs.
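
The interval rule can be sketched as follows. This is a simplified model in milliseconds rather than the kernel's jiffies-based code; `toy_sd`, the task counters and the function name are hypothetical.

```c
#include <stdbool.h>

struct toy_sd {
    unsigned long balance_interval_ms;
    unsigned int busy_factor; /* 32 by default, per the text above */
};

/*
 * Next periodic balance interval for a CPU with 'nr_running' runnable
 * tasks, of which 'nr_sched_idle' have the SCHED_IDLE policy.
 */
static inline unsigned long toy_balance_interval_ms(const struct toy_sd *sd,
                                                    unsigned int nr_running,
                                                    unsigned int nr_sched_idle)
{
    bool only_sched_idle = nr_running > 0 && nr_running == nr_sched_idle;
    bool busy = nr_running > 0 && !only_sched_idle;
    unsigned long interval = sd->balance_interval_ms;

    /* Only CPUs running non-SCHED_IDLE work get the longer interval. */
    if (busy)
        interval *= sd->busy_factor;

    return interval;
}
```

With a hypothetical 8 ms base interval, a busy CPU waits 256 ms, whereas a CPU running only SCHED_IDLE tasks rebalances after 8 ms, just like an idle one.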

This is tested on ARM64 Hikey620 platform (octa-core) with the help of
rt-app and it is verified, using kernel traces, that the newly
SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE
and pulls tasks from other busy CPUs.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:35 +03:00
Rohit Jain
0bb885973d sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
In the following commit:

  247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs")

... we distinguish between idle_cpu() when the vCPU is not running for
scheduling threads.

However, the idle_cpu() function is used in other places for
actually checking whether the state of the CPU is idle or not.

Hence split the use of that function based on the desired return value,
by introducing the available_idle_cpu() function.

This fixes a (slight) regression in that initial vCPU commit, because
some code paths (like the load-balancer) don't care and shouldn't care
if the vCPU is preempted or not, they just want to know if there's any
tasks on the CPU.
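
The split between the two questions can be modelled like this. The struct and the `toy_` names are hypothetical; the `vcpu_preempted` field stands in for the paravirt vCPU-preemption query.

```c
#include <stdbool.h>

struct toy_cpu {
    unsigned int nr_running;
    bool vcpu_preempted; /* stand-in for the paravirt preemption check */
};

/* "Is the CPU idle?": what the load balancer wants to know. */
static inline bool toy_idle_cpu(const struct toy_cpu *c)
{
    return c->nr_running == 0;
}

/* "Is the CPU idle and able to run a task right now?": for task placement. */
static inline bool toy_available_idle_cpu(const struct toy_cpu *c)
{
    return toy_idle_cpu(c) && !c->vcpu_preempted;
}
```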

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:35 +03:00
Rohit Jain
ad126723c3 sched/core: Don't schedule threads on pre-empted vCPUs
In paravirt configurations today, spinlocks figure out whether a vCPU is
running to determine whether or not spinlock should bother spinning. We
can use the same logic to prioritize CPUs when scheduling threads. If a
vCPU has been pre-empted, it will incur the extra cost of VMENTER and
the time it actually spends to be running on the host CPU. If we had
other vCPUs which were actually running on the host CPU and idle we
should schedule threads there.

Performance numbers:

Note: With patch is referred to as Paravirt in the following and without
patch is referred to as Base.

1) When only 1 VM is running:

    a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):

	+-------+-----------------+----------------+
	|Number |Paravirt         |Base            |
	|of     +---------+-------+-------+--------+
	|Threads|Average  |Std Dev|Average| Std Dev|
	+-------+---------+-------+-------+--------+
	|1      |1.817    |0.076  |1.721  | 0.067  |
	|2      |3.467    |0.120  |3.468  | 0.074  |
	|4      |6.266    |0.035  |6.314  | 0.068  |
	|8      |11.437   |0.105  |11.418 | 0.132  |
	|16     |21.862   |0.167  |22.161 | 0.129  |
	|25     |33.341   |0.326  |33.692 | 0.147  |
	+-------+---------+-------+-------+--------+

2) When two VMs are running with same CPU affinities:

    a) tbench test on VM 8 cpus

    Base:

	VM1:

	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms

	VM2:

	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms

    Paravirt:

	VM1:

	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms

	VM2:

	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms

    b) Hackbench with 4 fd 1,000,000 loops

	+-------+--------------------------------------+----------------------------------------+
	|Number |Paravirt                              |Base                                    |
	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
	|Threads|Average1  |Std Dev1|Average2 |Std Dev2|Average1  |Std Dev1|Average2 | Std Dev2 |
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+

    Note that this test was run just to show the interference effect
    over-subscription can have in baseline

    c) schbench results with 2 message groups on 8 vCPU VMs

	+-----------+-------+---------------+--------------+------------+
	|           |       | Paravirt      | Base         |            |
	+-----------+-------+-------+-------+-------+------+------------+
	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
	+-----------+-------+-------+-------+-------+------+------------+

    Note: Improvement is measured for (VM1+VM2)

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:35 +03:00
Viresh Kumar
f8e4f55165 sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reasons.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criterion, as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which needs to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.
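
A minimal sketch of the fallback, assuming a per-runqueue count of SCHED_IDLE tasks is available. The struct, field and function names are hypothetical, and the real fast and slow paths are considerably more involved.

```c
#include <stdbool.h>

struct toy_rq {
    unsigned int nr_running;      /* all runnable tasks */
    unsigned int idle_nr_running; /* runnable SCHED_IDLE tasks */
};

/* A CPU is "sched-idle" when it runs something, but only SCHED_IDLE tasks. */
static inline bool toy_sched_idle_rq(const struct toy_rq *rq)
{
    return rq->nr_running && rq->nr_running == rq->idle_nr_running;
}

/*
 * Return the first fully idle CPU; if there is none, fall back to the
 * first sched-idle CPU; return -1 when neither exists.
 */
static inline int toy_pick_cpu(const struct toy_rq *rqs, int nr_cpus)
{
    int fallback = -1;

    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        if (rqs[cpu].nr_running == 0)
            return cpu;
        if (fallback < 0 && toy_sched_idle_rq(&rqs[cpu]))
            fallback = cpu;
    }

    return fallback;
}
```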

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
2023-10-11 15:31:35 +03:00
Viresh Kumar
0b25ce8b82 sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq
Track how many tasks are present with SCHED_IDLE policy in each cfs_rq.
This will be used by later commits.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/0d3cdc427fc68808ad5bccc40e86ed0bf9da8bb4.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:35 +03:00
Samuel Pascua
a0618b6810 Revert "sched/fair: Improve the scheduler"
This reverts commit 04d07a9b6e.
2023-10-11 15:31:35 +03:00
Samuel Pascua
18a48cb626 sched/fair: Add lsub_positive() and use it consistently
The following pattern:

   var -= min_t(typeof(var), var, val);

is used multiple times in fair.c.

The existing sub_positive() already captures that pattern, but it also
adds an explicit load-store to properly support lockless observations.
In other cases the pattern above is used to update local, and/or not
concurrently accessed, variables.

Let's add a simpler version of sub_positive(), targeted at local variables
updates, which gives the same readability benefits at calling sites,
without enforcing {READ,WRITE}_ONCE() barriers.
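
The saturating subtraction behind this pattern can be sketched in portable C. The function below is a type-specific illustration for `unsigned long` with a hypothetical name; the kernel helper is a generic macro rather than a function.

```c
/*
 * Subtract 'val' from '*ptr', clamping at zero instead of wrapping.
 * Plain loads and stores only: intended for local variables that are
 * not observed concurrently.
 */
static inline void toy_lsub_positive_ul(unsigned long *ptr, unsigned long val)
{
    *ptr -= (*ptr < val) ? *ptr : val;
}
```

Starting from 10, subtracting 3 leaves 7, and subtracting 100 from that leaves 0 rather than a wrapped-around huge value.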

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:35 +03:00
Patrick Bellasi
d5a4645f08 sched/fair: Add lsub_positive() and use it consistently
The following pattern:

   var -= min_t(typeof(var), var, val);

is used multiple times in fair.c.

The existing sub_positive() already captures that pattern, but it also
adds an explicit load-store to properly support lockless observations.
In other cases the pattern above is used to update local, and/or not
concurrently accessed, variables.

Let's add a simpler version of sub_positive(), targeted at local variables
updates, which gives the same readability benefits at calling sites,
without enforcing {READ,WRITE}_ONCE() barriers.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:34 +03:00
Patrick Bellasi
240202b52c sched/fair: Mask UTIL_AVG_UNCHANGED usages
The _task_util_est() is mainly used to add/remove the task contribution
to/from the rq's estimated utilization at task enqueue/dequeue time.
In both cases we ensure the UTIL_AVG_UNCHANGED flag is set to keep
consistency between enqueue and dequeue time while still being
transparent to update_load_avg calls which will eventually reset the
flag.

Let's move the flag forcing within _task_util_est() itself so that we
can simplify calling code by hiding that estimated utilization
implementation detail into one of its internal functions.

This will affect also the "public" API task_util_est() but we know that
the flag will (eventually) impact just on the LSB of the estimated
utilization, thus it's certainly acceptable.
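
The effect on the returned value can be sketched as below, assuming the flag occupies the least significant bit as the text implies. The struct and function names are hypothetical simplifications.

```c
#define TOY_UTIL_AVG_UNCHANGED 0x1 /* assumed to be the LSB, per the text */

struct toy_util_est {
    unsigned int enqueued;
    unsigned int ewma;
};

/* Estimated utilization with the flag forced on inside the helper. */
static inline unsigned long toy_task_util_est(const struct toy_util_est *ue)
{
    unsigned int est = ue->ewma > ue->enqueued ? ue->ewma : ue->enqueued;

    return est | TOY_UTIL_AVG_UNCHANGED;
}
```

Since only the lowest bit is forced, the returned estimate differs from the raw maximum by at most 1.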

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:34 +03:00
Viresh Kumar
8fc30a0937 sched/core: Create task_has_idle_policy() helper
We already have task_has_rt_policy() and task_has_dl_policy() helpers,
create task_has_idle_policy() as well and update sched core to start
using it.

While at it, use task_has_dl_policy() at one more place.
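
Such a helper is a thin wrapper around a policy comparison. In the sketch below the `TOY_SCHED_*` constants mirror the usual Linux policy numbers, but they and the struct are defined locally for illustration only.

```c
#include <stdbool.h>

/* Local copies of the policy numbers; values assumed to match Linux. */
enum {
    TOY_SCHED_NORMAL = 0,
    TOY_SCHED_FIFO = 1,
    TOY_SCHED_RR = 2,
    TOY_SCHED_BATCH = 3,
    TOY_SCHED_IDLE = 5,
    TOY_SCHED_DEADLINE = 6,
};

struct toy_policy_task {
    int policy;
};

static inline bool toy_idle_policy(int policy)
{
    return policy == TOY_SCHED_IDLE;
}

static inline bool toy_task_has_idle_policy(const struct toy_policy_task *p)
{
    return toy_idle_policy(p->policy);
}
```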

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 15:31:21 +03:00
Zheng Yejian
f1c2b40261 ring-buffer: Avoid softlockup in ring_buffer_resize()
[ Upstream commit f6bd2c92488c30ef53b5bd80c52f0a7eee9d545a ]

When a user resizes all trace ring buffers through the file 'buffer_size_kb',
then in ring_buffer_resize(), the kernel allocates buffer pages for each
cpu in a loop.

If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be allocated, it may not give up the cpu
for a long time and finally cause a softlockup.

To avoid it, call cond_resched() after each cpu buffer allocation.

Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10 21:43:39 +02:00
John Galt
3d74d9fd12 kernel/sysctl: move some values read only
Forgotten post rebase
2023-10-05 18:28:47 +03:00
John Galt
527afe22e6 treewide: more sched_set_fifo conversion
2023-10-05 18:28:46 +03:00
Linus Torvalds
69ad5d98c0 treewide: backport sched_set_fifo*
Merge tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull sched/fifo updates from Ingo Molnar:
 "This adds the sched_set_fifo*() encapsulation APIs to remove static
  priority level knowledge from non-scheduler code.

  The three APIs for non-scheduler code to set SCHED_FIFO are:

   - sched_set_fifo()
   - sched_set_fifo_low()
   - sched_set_normal()

  These are two FIFO priority levels: default (high), and a 'low'
  priority level, plus sched_set_normal() to set the policy back to
  non-SCHED_FIFO.

  Since the changes affect a lot of non-scheduler code, we kept this in
  a separate tree"
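
What the three helpers encapsulate can be modelled on a toy task structure. The concrete priority values used here (half of the RT range for the default level, 1 for the low level) are assumptions for illustration, as are all `toy_` names; callers are only meant to pick a level, never a number.

```c
#define TOY_MAX_RT_PRIO 100 /* assumed size of the RT priority range */

enum { TOY_POLICY_NORMAL = 0, TOY_POLICY_FIFO = 1 };

struct toy_sched_task {
    int policy;
    int rt_priority;
    int nice;
};

/* Default (high) FIFO level. */
static inline void toy_sched_set_fifo(struct toy_sched_task *p)
{
    p->policy = TOY_POLICY_FIFO;
    p->rt_priority = TOY_MAX_RT_PRIO / 2;
}

/* Low FIFO level: just above all non-realtime tasks. */
static inline void toy_sched_set_fifo_low(struct toy_sched_task *p)
{
    p->policy = TOY_POLICY_FIFO;
    p->rt_priority = 1;
}

/* Back to the normal (non-FIFO) policy with the given nice value. */
static inline void toy_sched_set_normal(struct toy_sched_task *p, int nice)
{
    p->policy = TOY_POLICY_NORMAL;
    p->rt_priority = 0;
    p->nice = nice;
}
```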

* tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched,tracing: Convert to sched_set_fifo()
  sched: Remove sched_set_*() return value
  sched: Remove sched_setscheduler*() EXPORTs
  sched,psi: Convert to sched_set_fifo_low()
  sched,rcutorture: Convert to sched_set_fifo_low()
  sched,rcuperf: Convert to sched_set_fifo_low()
  sched,locktorture: Convert to sched_set_fifo()
  sched,irq: Convert to sched_set_fifo()
  sched,watchdog: Convert to sched_set_fifo()
  sched,serial: Convert to sched_set_fifo()
  sched,powerclamp: Convert to sched_set_fifo()
  sched,ion: Convert to sched_set_normal()
  sched,powercap: Convert to sched_set_fifo*()
  sched,spi: Convert to sched_set_fifo*()
  sched,mmc: Convert to sched_set_fifo*()
  sched,ivtv: Convert to sched_set_fifo*()
  sched,drm/scheduler: Convert to sched_set_fifo*()
  sched,msm: Convert to sched_set_fifo*()
  sched,psci: Convert to sched_set_fifo*()
  sched,drbd: Convert to sched_set_fifo*()
  ...
2023-10-05 18:28:46 +03:00
John Galt
c801b87451 Revert "kernel/rcu: drop WQ_UNBOUND"
This reverts commit 8fe630b1cf954f23c65990cd10bccfa4c207597c.
2023-10-05 18:28:40 +03:00
Peng Liu
f2e8f5a9ed UPSTREAM: sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
commit bf475ce0a3 ("sched/fair: Add per-CPU min capacity to
sched_group_capacity") introduced per-cpu min_capacity.

commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity")
introduced per-cpu max_capacity.

In the SD_OVERLAP case, the local variable 'capacity' represents the sum
of CPU capacity of all CPUs in the first sched group (sg) of the sched
domain (sd).

It is erroneously used to calculate sg's min and max CPU capacity.
To fix this use capacity_of(cpu) instead of 'capacity'.

The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity
(for rq->sd != NULL) can be removed since it delivers the same value as
capacity_of(cpu) which is currently only used for the (!rq->sd) case
(see update_cpu_capacity()).
An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single
CPU (and hence sg->sgc->capacity == capacity_of(cpu)).
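
The corrected calculation can be sketched over a plain array of per-CPU capacities: the sum feeds the group capacity, while min and max are taken from each CPU's own value rather than from the running sum. The struct and function names are hypothetical.

```c
#include <limits.h>

struct toy_sgc {
    unsigned long capacity;     /* sum over the group's CPUs */
    unsigned long min_capacity; /* smallest single-CPU capacity */
    unsigned long max_capacity; /* largest single-CPU capacity */
};

static inline void toy_update_group_capacity(const unsigned long *cpu_capacity,
                                             int nr_cpus, struct toy_sgc *sgc)
{
    unsigned long capacity = 0, min_capacity = ULONG_MAX, max_capacity = 0;

    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        unsigned long cpu_cap = cpu_capacity[cpu];

        capacity += cpu_cap;
        /* Compare the per-CPU value, not the accumulated 'capacity'. */
        if (cpu_cap < min_capacity)
            min_capacity = cpu_cap;
        if (cpu_cap > max_capacity)
            max_capacity = cpu_cap;
    }

    sgc->capacity = capacity;
    sgc->min_capacity = min_capacity;
    sgc->max_capacity = max_capacity;
}
```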

Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
Signed-off-by: DennySPb <dennyspb@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-29 17:35:05 +03:00
kondors1995
220cc484cd Merge tag 'v4.14.326' of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux into dev/13.0
This is the 4.14.326 stable release
2023-09-29 17:19:45 +03:00
John Galt
395ebe933f Trigger sbalance on enable_nonboot_cpus
To improve the situation of longer irq balancing period:

The biggest irq placement regression is after wake, due to the many irqs
which have been migrated back to cpu0 from non-boot cpu offlining.
2023-09-27 18:06:08 +03:00
John Galt
090b2b0af5 kernel/power: implement pm qos stub
2023-09-27 18:05:49 +03:00
Vincent Guittot
e92ce85c9f sched/fair: Don't set LBF_ALL_PINNED unnecessarily
Setting LBF_ALL_PINNED during active load balance is only valid when there
is only 1 running task on the rq; otherwise this ends up increasing the
balance interval whereas other tasks could migrate after the next interval
once they become cache-cold, for example.

The LBF_ALL_PINNED flag is now always set by default. It is then cleared
when we find one task that can be pulled when calling detach_tasks() or
during active migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210107103325.30851-3-vincent.guittot@linaro.org
2023-09-27 18:05:40 +03:00
Vincent Guittot
44e7e1816e sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, keep looping through the list until we find at least one
movable task.

The maximum number of detached tasks remains the same as before.
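
The scan rule described above can be modelled in a few lines of userspace C.
This is only a sketch under simplifying assumptions, not the kernel's
detach_tasks(): the sketch_* names are hypothetical, the task list is a plain
array scanned from index 0, and "a movable task was found" is approximated by
"at least one task was detached".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task on the source runqueue. */
struct sketch_task {
	bool allowed_on_dst;	/* affinity permits running on dst_cpu */
	bool detached;		/* set once the task has been pulled */
};

/*
 * Scan up to nr_tasks candidates. After loop_max candidates have been
 * examined, stop as soon as at least one task has been detached; if none
 * has been found yet, keep scanning. Never detach more than max_detach.
 */
static size_t sketch_detach_tasks(struct sketch_task *tasks, size_t nr_tasks,
				  size_t loop_max, size_t max_detach)
{
	size_t detached = 0;
	size_t loop = 0;

	assert(tasks != NULL || nr_tasks == 0);

	for (size_t i = 0; i < nr_tasks && detached < max_detach; i++) {
		loop++;

		/* Past the budget: quit unless nothing movable was found yet. */
		if (loop > loop_max && detached > 0)
			break;

		if (!tasks[i].allowed_on_dst)
			continue;	/* pinned elsewhere, try the next one */

		tasks[i].detached = true;
		detached++;
	}

	return detached;
}
```

With loop_max = 2 and the first three tasks pinned away from dst_cpu, the scan
continues past the budget, detaches the fourth task, and then stops.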

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
2023-09-27 18:05:40 +03:00
Sultan Alsawaf
2e57085d1e cpumask: Add optimized helpers when NR_CPUS fits in a long
When NR_CPUS fits in a long, it's possible to use compiler built-ins to
produce much faster code when operating on cpumasks compared to just using
the generic bitops APIs.

Therefore, add optimized helpers using compiler built-ins when NR_CPUS fits
in a long. This also turns nr_cpu_ids into a compile-time constant for
further optimization potential.

Note that compared to the upstream cpumask rewrite with this feature, these
optimized helpers perfectly preserve the semantics of the helpers they
replace. And this change is much smaller than the upstream version.
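
As a rough illustration of the idea (not the code from this commit), a cpumask
that fits in one unsigned long can be handled with GCC/Clang built-ins such as
__builtin_popcountl() and __builtin_ctzl(). All sketch_* names and the value of
SKETCH_NR_CPUS are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical values and names, chosen only for this sketch. */
#define SKETCH_NR_CPUS		8
#define SKETCH_BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))

_Static_assert(SKETCH_NR_CPUS >= 1 &&
	       SKETCH_NR_CPUS <= sizeof(unsigned long) * CHAR_BIT,
	       "the mask must fit in a single long");

struct sketch_cpumask {
	unsigned long bits;	/* bit n set means CPU n is in the mask */
};

/* Bits that correspond to possible CPUs. */
static inline unsigned long sketch_valid_bits(void)
{
	return ~0UL >> (SKETCH_BITS_PER_LONG - SKETCH_NR_CPUS);
}

/* Number of CPUs in the mask: a single population count. */
static inline int sketch_cpumask_weight(const struct sketch_cpumask *m)
{
	return __builtin_popcountl(m->bits & sketch_valid_bits());
}

/* Lowest CPU in the mask, or SKETCH_NR_CPUS if the mask is empty. */
static inline int sketch_cpumask_first(const struct sketch_cpumask *m)
{
	unsigned long b = m->bits & sketch_valid_bits();

	return b ? __builtin_ctzl(b) : SKETCH_NR_CPUS;
}

/* Lowest CPU above n in the mask, or SKETCH_NR_CPUS if there is none. */
static inline int sketch_cpumask_next(int n, const struct sketch_cpumask *m)
{
	unsigned long b;

	assert(n >= -1);
	if (n + 1 >= SKETCH_NR_CPUS)
		return SKETCH_NR_CPUS;

	b = m->bits & sketch_valid_bits() & (~0UL << (n + 1));
	return b ? __builtin_ctzl(b) : SKETCH_NR_CPUS;
}
```

Because SKETCH_NR_CPUS is a compile-time constant here, the compiler can fold
the valid-bits mask and the "not found" return value into immediates.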

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2023-09-27 18:05:39 +03:00
Lu Jialin
3bb207a29e cgroup:namespace: Remove unused cgroup_namespaces_init()
[ Upstream commit 82b90b6c5b38e457c7081d50dff11ecbafc1e61a ]

cgroup_namespaces_init() just returns 0. Therefore, there is no need to
call it during start_kernel(). Just remove it.

Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 10:46:58 +02:00
Gaosheng Cui
d061e2bfc2 audit: fix possible soft lockup in __audit_inode_child()
[ Upstream commit b59bc6e37237e37eadf50cd5de369e913f524463 ]

Tracefs or debugfs may cause hundreds to thousands of PATH records, and
too many PATH records may cause a soft lockup.

For example:
  1. CONFIG_KASAN=y && CONFIG_PREEMPTION=n
  2. auditctl -a exit,always -S open -k key
  3. sysctl -w kernel.watchdog_thresh=5
  4. mkdir /sys/kernel/debug/tracing/instances/test

There may be a soft lockup as follows:
  watchdog: BUG: soft lockup - CPU#45 stuck for 7s! [mkdir:15498]
  Kernel panic - not syncing: softlockup: hung tasks
  Call trace:
   dump_backtrace+0x0/0x30c
   show_stack+0x20/0x30
   dump_stack+0x11c/0x174
   panic+0x27c/0x494
   watchdog_timer_fn+0x2bc/0x390
   __run_hrtimer+0x148/0x4fc
   __hrtimer_run_queues+0x154/0x210
   hrtimer_interrupt+0x2c4/0x760
   arch_timer_handler_phys+0x48/0x60
   handle_percpu_devid_irq+0xe0/0x340
   __handle_domain_irq+0xbc/0x130
   gic_handle_irq+0x78/0x460
   el1_irq+0xb8/0x140
   __audit_inode_child+0x240/0x7bc
   tracefs_create_file+0x1b8/0x2a0
   trace_create_file+0x18/0x50
   event_create_dir+0x204/0x30c
   __trace_add_new_event+0xac/0x100
   event_trace_add_tracer+0xa0/0x130
   trace_array_create_dir+0x60/0x140
   trace_array_create+0x1e0/0x370
   instance_mkdir+0x90/0xd0
   tracefs_syscall_mkdir+0x68/0xa0
   vfs_mkdir+0x21c/0x34c
   do_mkdirat+0x1b4/0x1d4
   __arm64_sys_mkdirat+0x4c/0x60
   el0_svc_common.constprop.0+0xa8/0x240
   do_el0_svc+0x8c/0xc0
   el0_svc+0x20/0x30
   el0_sync_handler+0xb0/0xb4
   el0_sync+0x160/0x180

Therefore, we add cond_resched() to __audit_inode_child() to fix it.

Fixes: 5195d8e217 ("audit: dynamically allocate audit_names when not enough space is in the names array")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 10:46:56 +02:00
Christoph Hellwig
35c739f793 modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
commit 9011e49d54dcc7653ebb8a1e05b5badb5ecfa9f9 upstream.

It has recently come to my attention that nvidia is circumventing the
protection added in 262e6ae7081d ("modules: inherit
TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
modules into an allegedly GPL licensed module and then re-exporting them.

Given that symbol_get was only ever intended for tightly cooperating
modules using very internal symbols it is logical to restrict it to
being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
Circumvention of Access Controls lawsuits.

All symbols except for four used through symbol_get were already exported
as EXPORT_SYMBOL_GPL, and the remaining four ones were switched over in
the preparation patches.

Fixes: 262e6ae7081d ("modules: inherit TAINT_PROPRIETARY_MODULE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23 10:46:52 +02:00
RealJohnGalt
eb18bc9c8e treewide: selectively over inline 2023-09-15 15:35:29 +03:00
Josh Don
43f6681f8c sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.

However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.

This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
  affined to two cores
- a large number of threads with 100% utilization are pinned to all
  other cores

Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io>
2023-09-15 15:03:44 +03:00
Pavankumar Kondeti
6a15624121 BACKPORT: ANDROID: sched: Exempt paused CPU from nohz idle balance
A CPU can be paused while it is idle with its tick stopped.
nohz_balance_exit_idle() should be called from the local CPU,
so it can't be called during pause, which can happen remotely.
This results in the paused CPU participating in the nohz idle balance,
which should be avoided.

Fix this issue by calling nohz_balance_exit_idle() from the paused
CPU when it exits and enters idle again. This lazy approach avoids
waking the CPU from idle during pause.

Bug: 180530906
Change-Id: Ia2dfd9c9cac9b0f37c55a9256b9d5f3141ca0421
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
[ Tashar02: Backport to k4.19 ]
[ RealJohnGalt: Backport to k4.14 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-15 15:03:35 +03:00
Tashfin Shakeer Rhythm
d0bb2fa40d Revert "wait: Do accept() in LIFO order for cache efficiency"
This is irrelevant on Android, and the commit itself contains one unused
function along with inconsistent use of extern in its function prototypes.

This reverts commit 124bd40a326554900f9641ed7e8df675bb490a61.

Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-15 15:00:51 +03:00
Tashfin Shakeer Rhythm
e208d3826c Revert "cpuidle: Hardcode stop_tick"
This patch has been abandoned by its author. Hence, drop it.

This reverts commit 49d8c6657a86d16e39db66ec8c5eb46dc69ea5fb.

Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-15 14:59:08 +03:00
pwnrazr
48ea411c4e irq: Demote "no longer affine to" logging to debug 2023-09-15 14:55:19 +03:00
Sultan Alsawaf
230a5b636c sched: Never migrate tasks upon execve()
Explicitly return the previous CPU for SD_BALANCE_EXEC in order to never
migrate tasks upon execve(). We don't know much about a task after it's
called execve(), so there's not much task-specific information to go off of
for load balancing.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[ EmanuelCN: Adapt to msm-4.19 with Sultan's help ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-10 13:11:26 +03:00
Sultan Alsawaf
b88065c266 sched/fair: Always update CPU capacity when load balancing
Limiting CPU capacity updates, which are quite cheap, results in worse
balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE).
This causes opportunistic placement decisions to be skewed using stale CPU
capacity data, and when a CPU isn't idling much, its capacity suffers from
even more staleness since the only exception to the 100 ms capacity update
ratelimit is a CPU exiting idle.

Since the capacity updates are cheap, always do it when load balancing in
order to improve opportunistic task placement decisions.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
2023-09-10 13:11:22 +03:00
Sultan Alsawaf
bd4d045fec arm64: Disable GENERIC_IRQ_EFFECTIVE_AFF_MASK
The effective affinity mask causes a lot of bugs by virtue of many
set_irq_affinity handlers only setting an effective affinity mask for an
IRQ's parent but not the IRQ itself. Since this is a widespread issue that
would require manual fixing on every different SoC, just disable the
effective affinity mask altogether and use the first CPU in an affinity
mask configured.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Dark-Matter7232 <me@const.eu.org>
2023-09-10 13:11:15 +03:00
Sultan Alsawaf
8f9fbc81ff kernel: Introduce SBalance IRQ balancer
This is a simple IRQ balancer that polls every X number of milliseconds and
moves IRQs from the most interrupt-heavy CPU to the least interrupt-heavy
CPUs until the heaviest CPU is no longer the heaviest. IRQs are only moved
from one source CPU to any number of destination CPUs per balance run.
Balancing is skipped if the gap between the most interrupt-heavy CPU and
the least interrupt-heavy CPU is below the configured threshold of
interrupts.

The heaviest IRQs are targeted for migration in order to reduce the number
of IRQs to migrate. If moving an IRQ would reduce overall balance, then it
won't be migrated.

The most interrupt-heavy CPU is calculated by scaling the number of new
interrupts on that CPU to the CPU's current capacity. This way, interrupt
heaviness takes into account factors such as thermal pressure and time
spent processing interrupts rather than just the sheer number of them. This
also makes SBalance aware of CPU asymmetry, where different CPUs can have
different performance capacities and be proportionally balanced.
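
The selection step described above can be sketched in userspace C as follows.
This is not SBalance's actual code. The sketch_* names are hypothetical, and two
details are assumptions made for the example: interrupts are scaled as
new_intrs * 1024 / capacity, and the threshold is compared against the gap
between scaled values.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CAPACITY_SCALE 1024UL	/* assumed full-capacity value */

/* Hypothetical per-CPU sample taken at each poll. */
struct sketch_cpu_stat {
	unsigned long new_intrs;	/* interrupts since the last poll */
	unsigned long capacity;		/* current capacity, 1..1024 */
};

/* Interrupt count scaled so that a weaker CPU looks proportionally busier. */
static unsigned long sketch_scaled_intrs(const struct sketch_cpu_stat *s)
{
	assert(s->capacity > 0);
	return s->new_intrs * SKETCH_CAPACITY_SCALE / s->capacity;
}

/*
 * Pick the most and least interrupt-heavy CPUs. Returns 1 and fills *src
 * and *dst when the gap reaches the threshold, or 0 when balancing should
 * be skipped for this run.
 */
static int sketch_pick_cpus(const struct sketch_cpu_stat *stats, size_t nr_cpus,
			    unsigned long threshold, size_t *src, size_t *dst)
{
	size_t max_i = 0, min_i = 0;
	unsigned long max_v, min_v;

	assert(nr_cpus > 0);
	max_v = min_v = sketch_scaled_intrs(&stats[0]);

	for (size_t i = 1; i < nr_cpus; i++) {
		unsigned long v = sketch_scaled_intrs(&stats[i]);

		if (v > max_v) {
			max_v = v;
			max_i = i;
		}
		if (v < min_v) {
			min_v = v;
			min_i = i;
		}
	}

	if (max_v - min_v < threshold)
		return 0;	/* gap too small, skip this balance run */

	*src = max_i;
	*dst = min_i;
	return 1;
}
```

In this model a CPU with 600 new interrupts at capacity 512 counts as heavier
than a CPU with 1000 new interrupts at capacity 1024, which is the asymmetry
awareness the message describes.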

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Dark-Matter7232 <me@const.eu.org>
2023-09-10 13:11:15 +03:00
kondors1995
74e28df591 Revert "sched/core: Use SCHED_RR in place of SCHED_FIFO for all users"
This reverts commit d397be20ce.
2023-09-10 13:11:15 +03:00
kondors1995
7457652193 treewide: Remove perf critical API & its usage 2023-09-10 13:11:06 +03:00
Xiaoming Ni
e537e4f3d6 kernel/notifier.c: intercept duplicate registrations to avoid infinite loops
[ Upstream commit 1a50cb80f219c44adb6265f5071b81fc3c1deced ]

Registering the same notifier to a hook repeatedly can cause the hook
list to form a ring or lose other members of the list.

  case1: An infinite loop in notifier_chain_register() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);

  case2: An infinite loop in notifier_call_chain() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_call_chain(&test_notifier_list, 0, NULL);

  case3: lose other hook test2
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);
          atomic_notifier_chain_register(&test_notifier_list, &test1);

  case4: Unregister returns 0, but the hook is still in the linked list,
         and it is not really registered. If you call
         notifier_call_chain after ko is unloaded, it will trigger oops.

If the system is configured with softlockup_panic and the same hook is
repeatedly registered on the panic_notifier_list, it will cause a loop
panic.

Add a check in notifier_chain_register(), intercepting duplicate
registrations to avoid infinite loops.
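
A minimal userspace model of such a check is sketched below. It is not the
kernel's notifier code: the sketch_* names are hypothetical, there is no
locking or RCU, and returning -EEXIST for a duplicate is a choice made for
this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down notifier entry. */
struct sketch_notifier {
	int priority;
	struct sketch_notifier *next;
};

/*
 * Insert n into the priority-ordered list at *nl. If n is already on the
 * list, refuse the registration instead of linking it a second time, which
 * would otherwise turn the list into a ring or drop other entries.
 */
static int sketch_chain_register(struct sketch_notifier **nl,
				 struct sketch_notifier *n)
{
	while (*nl != NULL) {
		if (*nl == n)
			return -EEXIST;	/* duplicate registration detected */
		if (n->priority > (*nl)->priority)
			break;
		nl = &((*nl)->next);
	}

	n->next = *nl;
	*nl = n;
	return 0;
}

/* Count entries; only terminates if the list is not a ring. */
static size_t sketch_chain_length(const struct sketch_notifier *head)
{
	size_t len = 0;

	for (; head != NULL; head = head->next)
		len++;
	return len;
}
```

Replaying case3 against this model (test1, test2, then test1 again), the third
call is rejected and test2 stays on the list.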

Link: http://lkml.kernel.org/r/1568861888-34045-2-git-send-email-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-03 09:33:10 +03:00
Vasily Averin
d35c369633 kernel/notifier.c: double register detection
By design, notifiers can be registered only once; a second register attempt
made by mistake silently corrupts the notifier list.

A few years ago I investigated the described problem: the host was power
cycled because of notifier list corruption.  I prepared this patch,
applied it to the OpenVZ kernel and sent it out, but nobody commented
on it.  Later it helped us to detect a similar problem in the OpenVZ
kernel.

Mistakes with notifier registration can happen for example during
subsystem initialization from different namespaces, or because of a lost
unregister in the roll-back path on initialization failures.

The proposed check cannot prevent the described problem, however it
allows us to detect its cause quickly without coredump analysis.

Link: http://lkml.kernel.org/r/04127e71-4782-9bbb-fe5a-7c01e93a99b0@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-03 09:33:05 +03:00