704995 Commits

Author SHA1 Message Date
Christian Hoffmann
bd1b170f1d arch: arm64: vdso32: Drop -no-integrated-as
Signed-off-by: Onelots <onelots@onelots.fr>
2024-12-24 00:25:57 +01:00
LuK1337
0a39bd9334 ARM64: vdso32: Hardcode toolchain target
Fixes the following error when building with clang r530567:
error: version 'kernel' in target triple 'arm-unknown-linux-androidkernel' is invalid

Change-Id: I5a2d27bf0e8a22b2fe752c64efc0cc91c790b5f0
2024-12-23 23:32:25 +01:00
Chung-Hsien Hsu
895eaac708 nl80211: add WPA3 definition for SAE authentication
Add definition of WPA version 3 for SAE authentication.

Change-Id: I19ca34b8965168f011cc1352eba420f2d54b0258
Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-12-23 23:32:25 +01:00
Jonglin Lee
43e39d2a76 cpuidle: lpm_levels: Don't print parent clocks during suspend
Calling clock_debug_print_enabled with print_parent = true
during suspend may cause a scheduling while atomic violation.
Call with print_parent = false instead to prevent the violation.

Bug: 132511008
Change-Id: I80f646d77d0cc98b4004084022ce1dce0e80cc93
Signed-off-by: Jonglin Lee <jonglin@google.com>
Signed-off-by: GeoPD <geoemmanuelpd2001@gmail.com>
2024-08-15 08:22:52 +05:30
Wei Wang
6400cd3b94 sched: restrict iowait boost to tasks with prefer_idle
Currently iowait boost doesn't distinguish background from foreground tasks, and
we have seen cases where a device ramps to high frequency unnecessarily when
running some background I/O. This patch limits iowait boost to tasks with
prefer_idle set. Specifically, on Pixel, those are foreground and top
app tasks.

Bug: 130308826
Test: Boot and trace
Change-Id: I2d892beeb4b12b7e8f0fb2848c23982148648a10
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Lau <laststandrighthere@gmail.com>
2024-08-15 08:22:43 +05:30
Maria Yu
b5c22baa21 sched: core: Clear walt rq request in cpu starting
Clear walt rq request in cpu starting.

Change-Id: Id3004337f3924984b8b812151a6ba01c6f1c013e
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 32df8f93e147dd54331161e9180d7ea488b750f9)
2024-08-15 08:22:18 +05:30
Pavankumar Kondeti
74a8607aa7 sched/walt: Fix the memory leak of idle task load pointers
The memory for task load pointers is allocated twice for each
idle thread except for the boot CPU. This happens during boot
from idle_threads_init()->idle_init() in the following 2 paths.

1. idle_init()->fork_idle()->copy_process()->
		sched_fork()->init_new_task_load()

2. idle_init()->fork_idle()-> init_idle()->init_new_task_load()

The memory allocation for all tasks happens through the 1st path,
so use the same for idle tasks and kill the 2nd path. Since
the idle thread of boot CPU does not go through fork_idle(),
allocate the memory for it separately.

Change-Id: I4696a414ffe07d4114b56d326463026019e278f1
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit eb58f47212c9621be82108de57bcf3e94ce1035a)
2024-08-15 07:11:04 +05:30
DhineshCool
c0dd3261ad Revert "sched: Do not reduce perceived CPU capacity while idle"
This reverts commit 20dfb57cb1.
2024-08-15 06:33:57 +05:30
DhineshCool
f99e24746b Revert "cpufreq: schedutil: Enforce realtime priority"
This reverts commit 970b81bf75.
2024-08-15 06:17:11 +05:30
DhineshCool
158bbf5f52 Revert "binder: Reserve caches for small, high-frequency memory allocations"
This reverts commit fab295b5c9.
2024-08-14 19:17:42 +05:30
DhineshCool
2ca4eb547b defconfig: cybertron-v10
2024-08-13 23:44:43 +05:30
ExactExampl
0d2f0ead2b defconfig: b1c1: enable zram deduplication feature
* Unset CONFIG_ZRAM_WRITEBACK while at it as writeback isn't being used
2024-08-13 23:41:21 +05:30
Juhyung Park
277f74bc1c zram: switch to 64-bit hash for dedup
The original dedup code does not handle collisions, based on the observation
that they practically do not happen.

For additional peace of mind, use a bigger hash size to reduce the
possibility of collision even further.

Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:44 +05:30
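The wider hash above can be sketched with a 64-bit FNV-1a function; this is purely illustrative of the digest width, and the function name and hash choice here are assumptions, not the driver's actual hash:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* 64-bit FNV-1a as a stand-in for the dedup page hash; the driver's
 * real hash function may differ, this only illustrates the width. */
static uint64_t page_hash64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}
```

By the birthday bound, a 64-bit digest makes a collision among millions of stored pages astronomically unlikely, whereas with 32 bits it would be close to certain at that scale.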
Charan Teja Reddy
9864f4c40c zram: fix race condition while returning zram_entry refcount
With deduplication enabled, duplicated zram objects are tracked
using a zram_entry backed by a refcount. The race condition when
decrementing the refcount through zram_dedup_put() is as follows:
Say task A and task B share the same object, and thus
zram_entry->refcount = 2.
Task A				Task B

zram_dedup_put  		zram_dedup_put
				spin_lock(&hash->lock);
				entry->refcount--; (Now it is 1)
				spin_unlock(&hash->lock);
spin_lock(&hash->lock);
entry->refcount--; (Now it is 0)
spin_unlock(&hash->lock);

return entry->refcount		return entry->refcount

Both tasks return 0 in the steps above, leading to a double free of the
handle, which is a slab object.

Change-Id: I8dd9bad27140a6e3a295905bf4411050d8eac931
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:44 +05:30
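The fix in miniature: the decrement and the value the caller acts on must come from one indivisible step. A sketch using a C11 atomic in place of the driver's hash->lock critical section (names are illustrative, not the driver's):

```c
#include <stdatomic.h>
#include <assert.h>

struct entry_sketch {
    atomic_int refcount;
};

/* Decrement and sample in one atomic step, so of two racing put()s
 * exactly one observes 0 and frees the handle. The driver gets the
 * same effect by reading the count inside the hash->lock section
 * rather than after dropping it. */
static int entry_put(struct entry_sketch *e)
{
    return atomic_fetch_sub(&e->refcount, 1) - 1;
}
```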
Joonsoo Kim
222effc77a zram: compare all the entries with same checksum for deduplication
Until now, we compare just one entry with same checksum when
checking duplication since it is the simplest way to implement.
However, for the completeness, checking all the entries is better
so this patch implement to compare all the entries with same checksum.
Since this event would be rare so there would be no performance loss.

Change-Id: Ie7d61c14d127a28f5a06d85b0ca66b9fada20cbb
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lore.kernel.org/patchwork/patch/787163/
Patch-mainline: linux-kernel@ Thu, 11 May 2017 22:30:29
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:43 +05:30
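The change above can be sketched as a bucket walk: every chained entry whose checksum matches is memcmp'd against the incoming page, instead of only the first one. Structure and names are illustrative, not the driver's:

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

#define PG 8   /* tiny stand-in for PAGE_SIZE */

struct dent {
    unsigned long checksum;
    unsigned char data[PG];
    struct dent *next;        /* hash-bucket chain */
};

/* A checksum hit alone is no longer enough: compare the payload of
 * every entry in the chain whose checksum matches. */
static struct dent *find_dup(struct dent *bucket, unsigned long csum,
                             const unsigned char *page)
{
    for (struct dent *e = bucket; e; e = e->next)
        if (e->checksum == csum && memcmp(e->data, page, PG) == 0)
            return e;
    return NULL;
}
```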
Joonsoo Kim
0dfcc58d8d zram: make deduplication feature optional
The benefit of deduplication depends on the workload, so it's not
preferable to always enable it. Therefore, make it optional in Kconfig
and via a device param. The default is 'off'. This option will be beneficial
for users who use zram as a blockdev and store build output on it.

Change-Id: If282bb8aa15c5749859a87cf36db7eb9edb3b1ed
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lore.kernel.org/patchwork/patch/787164/
Patch-mainline: linux-kernel@ Thu, 11 May 2017 22:30:52
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:43 +05:30
Joonsoo Kim
3343551e25 zram: implement deduplication in zram
This patch implements a deduplication feature in zram. The purpose
of this work is naturally to reduce the amount of memory used by zram.

Android is one of the biggest users of zram as swap, and reducing its
memory usage there is really important. There is a paper
reporting that the duplication ratio of Android's memory content is
rather high [1]. And there is similar work on zswap that
reports that experiments have shown that around 10-15% of pages
stored in zswap are duplicates, and deduplicating them provides some
benefit [2].

Also, there is a different kind of workload that uses zram as a blockdev
and stores build outputs on it to reduce the wear-out problem of a real
blockdev. In this workload, the deduplication hit rate is very high due to
temporary files and intermediate object files. Detailed analysis is
at the bottom of this description.

Anyway, if we can detect duplicated content and avoid storing it
at different memory locations, we can save memory. This patch
tries to do that.

The implementation is simple and intuitive, but I should note
one thing about an implementation detail.

To check duplication, this patch uses a checksum of the page, and
collisions of this checksum are possible. There are
many ways to handle this situation, but this patch chooses
to allow entries with a duplicated checksum to be added to the hash,
and not to compare all entries with a duplicated checksum
when checking duplication. I guess that checksum collision is a quite
rare event and we don't need to pay much attention to such a case.
Therefore, I decided on the simplest way to implement the feature.
If there is a different opinion, I can accept it and go that way.

Following is the result of this patch.

Test result #1 (Swap):
Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18

orig_data_size: 145297408
compr_data_size: 32408125
mem_used_total: 32276480
dup_data_size: 3188134
meta_data_size: 1444272

The last two metrics added to mm_stat are related to this work.
The first one, dup_data_size, is the amount of memory saved by avoiding
the storage of duplicated pages. The second one, meta_data_size, is the
amount of data structure overhead needed to support deduplication.
If dup > meta, we can judge that the patch improves memory usage.

In Android, we can save 5% of memory usage with this work.

Test result #2 (Blockdev):
build the kernel and store output to ext4 FS on zram

<no-dedup>
Elapsed time: 249 s
mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0

<dedup>
Elapsed time: 250 s
mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038  3945792

There is no performance degradation, and it saves 23% memory.

Test result #3 (Blockdev):
copy android build output dir(out/host) to ext4 FS on zram

<no-dedup>
Elapsed time: out/host: 88 s
mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0

<dedup>
Elapsed time: out/host: 100 s
mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336

It shows roughly 13% performance degradation and saves 24% memory. Maybe
this is due to the overhead of calculating checksums and comparing pages.

Test result #4 (Blockdev):
copy android build output dir(out/target/common) to ext4 FS on zram

<no-dedup>
Elapsed time: out/host: 203 s
mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0

<dedup>
Elapsed time: out/host: 201 s
mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336

Memory is saved by 42% and performance is the same. Even though there is
overhead from calculating checksums and comparisons, the large hit ratio
compensates for it, since a hit leads to fewer compression attempts.

I checked the detailed reasons for the savings on the kernel build workload,
and there are several cases where deduplication happens.

1) *.cmd
Build commands are usually similar within one directory, so the contents of
these files are very similar. On my system, more than 789 lines in
fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same, out of 944 and
938 lines in those files, respectively.

2) intermediate object files
built-in.o and temporary object files have similar contents. More than
50% of fs/ext4/ext4.o is the same as fs/ext4/built-in.o.

3) vmlinux
.tmp_vmlinux1, .tmp_vmlinux2 and arch/x86/boot/compressed/vmlinux.bin
have similar contents.

The Android test has a similar case in which some object files (.class
and .so) are similar to one another
(./host/linux-x86/lib/libartd.so and
./host/linux-x86/lib/libartd-compiler.so).

Anyway, the benefit seems to be largely dependent on the workload, so a
following patch will make this feature optional. However, this feature
can help some use cases, so it deserves to be merged.

[1]: MemScope: Analyzing Memory Duplication on Android Systems,
dl.acm.org/citation.cfm?id=2797023
[2]: zswap: Optimize compressed pool memory utilization,
lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2

Change-Id: I8fe80c956c33f88a6af337d50d9e210e5c35ce37
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lore.kernel.org/patchwork/patch/787162/
Patch-mainline: linux-kernel@ Thu, 11 May 2017 22:30:26
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:43 +05:30
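The dup > meta judgment described above can be checked directly against the Test result #1 numbers (3188134 bytes saved vs 1444272 bytes of metadata):

```c
#include <assert.h>

/* Net benefit rule from the message: dedup pays off when the bytes it
 * saves (dup_data_size) exceed its bookkeeping cost (meta_data_size). */
static long dedup_net_saving(long dup_data_size, long meta_data_size)
{
    return dup_data_size - meta_data_size;
}
```

For Test result #1 this gives 3188134 - 1444272 = 1743862 bytes net, roughly 5% of the compr_data_size of 32408125, consistent with the 5% figure quoted for Android.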
Joonsoo Kim
28405fd009 zram: introduce zram_entry to prepare dedup functionality
A following patch will implement deduplication functionality
in zram, and it requires an indirection layer to manage
the life cycle of the zsmalloc handle. To prepare for that, this patch
introduces zram_entry, which can be used to manage the life cycle
of a zsmalloc handle. Many lines are changed due to the rename, but the
core change is just the simple introduction of the new data structure.

Change-Id: Ibf9912397c8c7dbcf1465550bc83a71f904e41c7
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lore.kernel.org/patchwork/patch/787161/
Patch-mainline: linux-kernel@ Thu, 11 May 2017 22:30:21
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Marco Zanin <mrczn.bb@gmail.com>
Signed-off-by: snnbyyds <snnbyyds@gmail.com>
2024-08-13 23:40:43 +05:30
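The indirection can be pictured as follows; the field names and helper are a sketch of the idea (a refcounted owner of the zsmalloc handle), not the driver's actual layout:

```c
#include <assert.h>

/* Table slots point at a refcounted entry that owns the zsmalloc
 * handle, so the later dedup work can let several slots share one
 * compressed object. */
struct zentry {
    unsigned long handle;     /* zsmalloc handle */
    unsigned int refcount;
};

static struct zentry *zentry_share(struct zentry *e)
{
    e->refcount++;            /* a second slot now references the entry */
    return e;
}
```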
Tengfei Fan
ed6abdb80f ANDROID: cpufreq: times: Have two spinlock in different cache line
task_time_in_state_lock and uid_lock are currently
very possibly in the same cache line, which can cause a
livelock if 2 cores are in contention.

Change-Id: I644687c4d610af5e84a43f422a711d386d6d5181
Signed-off-by: Tengfei Fan <tengfeif@codeaurora.org>
(cherry picked from commit bfea4ae591301043498a214cf6ef4e6250106316)
2024-08-13 23:40:43 +05:30
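The fix amounts to forcing the two locks into different cache lines. A sketch with a userspace alignment attribute (the kernel spells this ____cacheline_aligned; the 64-byte line size is an assumption, the real value is per-SoC):

```c
#include <stddef.h>
#include <assert.h>

#define CACHELINE 64   /* assumed line size */

/* Each lock gets its own cache line, so contention on one no longer
 * bounces the line holding the other between cores. */
struct uid_time_locks {
    int task_time_in_state_lock __attribute__((aligned(CACHELINE)));
    int uid_lock                __attribute__((aligned(CACHELINE)));
};
```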
Maria Yu
d6631fffef sched/fair: Consider others if target cpu overutilized
If the target cpu is overutilized, it's better to consider
cpus in other groups. This avoids unnecessarily waiting on an
overutilized cpu until load balance finally lets the task run.

Change-Id: I6f8bccb611d2f11471254cf2795fb5bf3f122292
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit b9f8fdc34eeb61fcc7c770b6277a83fd30fc7d8e)
2024-08-13 23:40:43 +05:30
Chris Redpath
9314b62205 FROMLIST: sched/fair: Don't move tasks to lower capacity cpus unless necessary
When lower capacity CPUs are load balancing and considering pulling
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Change-Id: Ib86570abdd453a51be885b086c8d80be2773a6f2
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Git-commit: 07e7ce6c8459defc34e63ae0f0334e811d223990
Git-repo: https://android.googlesource.com/kernel/common/
[clingutla@codeaurora.org: Resolved merge conflicts.]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 779459e3fffda001181cfd6b1be2ffd3da25002c)
2024-08-13 23:40:43 +05:30
Joonwoo Park
ef01112e02 sched: ceil idle index to prevent from out of bound accessing
It's possible that the size of the given idle cost table is smaller than a
CPU's possible idle index.  Ceil (clamp) the CPU's idle index to prevent
out-of-bound accesses.

Change-Id: Idecb4f68758dd0183886ea74d0e9da3d236b0062
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit ecedc7afd841c8d7ef0145924620304608d269ef)
2024-08-13 23:40:42 +05:30
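The guard in miniature: clamp the CPU's idle index to the last valid entry of the idle-cost table before using it as a subscript. This is a sketch of the bound, not the scheduler's actual code:

```c
#include <assert.h>

/* "Ceil" the index as the message describes: never index past the
 * end of the idle-cost table. */
static int clamp_idle_idx(int idx, int table_len)
{
    return idx >= table_len ? table_len - 1 : idx;
}
```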
Joonwoo Park
12312cb361 sched: prevent out of bound access in sched_group_energy()
group_idle_state() can return INT_MAX + 1, which triggers undefined behaviour,
when there are no CPUs in the sched_group.  Prevent this by handling the
error case correctly.

Change-Id: If9796c829c091e461231569dc38c5e5456f58037
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
[clingutla@codeaurora.org: Fixed trivial merge conflicts and squashed
  msm-4.14 change]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit bb5b0e61527011e4ebfc4058713a9068da9e7492)
2024-08-13 23:40:42 +05:30
Maria Yu
57d6066272 cpufreq: schedutil: Queue sugov irq work on policy online cpu
If the sugov irq work is scheduled on an offlined cpu it stays
pending forever, so the frequency is never updated.
Queue the sugov irq work on an online cpu of the policy if the
current cpu is offline.

Change-Id: I33fc691917b5866488b6aeb11ed902a2753130b2
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 1d2db9ab99a9abd0d9dcb320e6e0d266e21884f9)
2024-08-13 23:40:42 +05:30
Maria Yu
aa4a0a2807 sched/walt: Avoid walt irq work in offlined cpu
Avoid walt irq work on an offlined cpu.

Change-Id: Ia4410562f66bfa57daa15d8c0a785a2c7a95f2a0
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit 702cec976c863388c784eff37a71fa3ee8bb84d7)
2024-08-13 23:40:42 +05:30
Pavankumar Kondeti
37a5c34f00 Revert "sched: Remove sched_ktime_clock()"
This reverts 'commit 24c18127e9 ("sched: Remove sched_ktime_clock()")'

WALT accounting uses ktime_get() as time source to keep windows in
align with the tick. ktime_get() API should not be called while the
timekeeping subsystem is suspended during the system suspend. The
code before the reverted patch has a wrapper around ktime_get() to
avoid calling ktime_get() when timekeeping subsystem is suspended.

The reverted patch removed this wrapper with the assumption that there
will not be any scheduler activity while timekeeping subsystem is
suspended. The timekeeping subsystem is resumed very early even before
non-boot CPUs are brought online. However it is possible that tasks
can wake up from the idle notifiers which gets called before timekeeping
subsystem is resumed.

When this happens, the time read from ktime_get() will not be consistent.
We see a jump from the values that would be returned later when timekeeping
subsystem is resumed. The rq->window_start update happens with incorrect
time. This rq->window_start becomes inconsistent with the other
CPUs' rq->window_start and with the wallclock time after the timekeeping
subsystem is resumed. This results in WALT accounting bugs.

Change-Id: I9c3b2fb9ffbf1103d1bd78778882450560dac09f
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit faa04442e7a31357724dbb8e49ba64372ef37862)
2024-08-13 23:40:42 +05:30
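The restored wrapper behaves roughly like this sketch: while timekeeping is suspended it hands back the last cached timestamp instead of reading the clock, so window_start stays consistent. The names and the injected `ktime_now` parameter are illustrative, not the kernel's:

```c
#include <assert.h>

static unsigned long long last_ns;
static int timekeeping_suspended_sketch;

/* Stand-in for the sched_ktime_clock() wrapper: never consult
 * ktime_get() while the timekeeping subsystem is suspended. */
static unsigned long long sched_ktime_sketch(unsigned long long ktime_now)
{
    if (timekeeping_suspended_sketch)
        return last_ns;            /* frozen value, no clock read */
    last_ns = ktime_now;
    return last_ns;
}
```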
Pavankumar Kondeti
e8e661152f sched/fair: Fix redundant load balancer reattempt due to LBF_ALL_PINNED
The LBF_ALL_PINNED flag should be cleared in can_migrate_task() if the task
can run on the destination CPU during load balance. In the current code,
can_migrate_task() returns without clearing this flag
when the task can't be migrated to the destination CPU due to
cumulative window demand constraints. Since the LBF_ALL_PINNED flag
is not cleared, the load balancer thinks that none of the tasks running
on the busiest group can be migrated to the destination CPU due
to affinity settings, and tries to find another busiest group. Prevent
this incorrect reattempt of load balance by clearing the LBF_ALL_PINNED
flag right after the task affinity check in can_migrate_task().

Change-Id: Iad1cf42b1aaf70106ee5ecfbd9499ccb6eb7497e
[clingutla@codeaurora.org: Resolved merge conflicts]
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit 5ee367fc9386d4e36af644942d9d10f97827bab1)
2024-08-13 23:40:41 +05:30
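The check order after the fix can be sketched like this (a sketch of the control flow, not the fair.c code; the two boolean parameters stand in for the real checks):

```c
#include <assert.h>

#define LBF_ALL_PINNED 0x01

/* The flag is cleared as soon as affinity passes, before any later
 * bail-out such as the window-demand constraint, so a non-affinity
 * failure no longer masquerades as "everything is pinned". */
static int can_migrate_sketch(int allowed_on_dst, int fits_demand,
                              unsigned int *lb_flags)
{
    if (!allowed_on_dst)
        return 0;                    /* flag stays set: truly pinned  */
    *lb_flags &= ~LBF_ALL_PINNED;    /* affinity ok: clear right here */
    if (!fits_demand)
        return 0;                    /* migration fails, but not "pinned" */
    return 1;
}
```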
Maria Yu
a01e3aaaff sched/fair: Avoid unnecessary active load balance
When finding the busiest group, load balance is avoided if
only 1 task is running on the src cpu. Consider the race when
different cpus do newly idle load balance at the same time, and
check the src cpu's nr_running to avoid an unnecessary active load
balance.
See the race condition example here:
  1) cpu2 has 2 tasks, so cpu2 rq->nr_running == 2 and cfs.h_nr_running
      == 2.
  2) cpu4 and cpu5 do newly idle load balance at the same time.
  3) cpu4 and cpu5 both see cpu2's sched_load_balance_sg_stats sum_nr_run == 2,
     so they both see cpu2 as the busiest rq.
  4) cpu5 successfully migrates a task from cpu2, so cpu2 has only 1 task
     left: cpu2 rq->nr_running == 1 and cfs.h_nr_running == 1.
  5) cpu4 surely goes to no_move because cpu4 currently has only 1 task,
     which is currently running.
  6) cpu4 then goes here to check whether cpu2 needs active load balance.

Change-Id: Ia9539a43e9769c4936f06ecfcc11864984c50c29
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
(cherry picked from commit fc61703628de002e2a5bf88e09933dbc3552d156)
2024-08-13 23:40:41 +05:30
Pavankumar Kondeti
9efe3c5438 sched/walt: Fix stale max_capacity issue during CPU hotplug
Scheduler keeps track of the maximum capacity among all online CPUs
in max_capacity. This is useful in checking if a given cluster/CPU
is a max capacity CPU or not. The capacity of a CPU gets updated
when its max frequency is limited by cpufreq and/or thermal. The
CPUfreq limits notifications are received via CPUfreq policy
notifier. However CPUfreq keeps the policy intact even when all
of the CPUs governed by the policy are hotplugged out. So the
CPUFREQ_REMOVE_POLICY notification never arrives and scheduler's
notion of max_capacity becomes stale. The max_capacity may get
corrected at some point later when CPUFREQ_NOTIFY notification
comes for other online CPUs. But when the hotplugged CPUs come
back online, max_capacity is not updated, since CPUFREQ_ADD_POLICY
is not sent by cpufreq.

For example consider a system with 4 BIG and 4 little CPUs. Their
original capacities are 2048 and 1024 respectively. The max_capacity
points to 2048 when all CPUs are online. Now,

1. All 4 BIG CPUs are hotplugged out. Since there is no notification,
the max_capacity still points to 2048, which is incorrect.
2. User clips the little CPUs' max_freq by 50%. CPUFREQ_NOTIFY arrives
and max_capacity is updated by iterating all the online CPUs. At this
point max_capacity becomes 512 which is correct.
3. User removes the above limits of little CPUs. The max_capacity
becomes 1024 which is correct.
4. Now, BIG CPUs are hotplugged in. Since there is no notification,
the max_capacity still points to 1024, which is incorrect.

Fix this issue by wiring the max_capacity updates in WALT to the scheduler
hotplug callbacks. Ideally we want cpufreq-domain hotplug callbacks,
but such notifiers are not present. So the max_capacity update is
forced even when it is not necessary, but that should not be a concern,
because CPU hotplug is supposed to be a rare event.

The scheduler hotplug callbacks happen even before the hotplug CPU is
removed from cpu_online_mask, so use cpu_active() check while evaluating
the max_capacity. Since cpu_active_mask is a subset of cpu_online_mask,
this is sufficient.

Change-Id: I97b1974e2de1a9730285715858f1ada416d92a7a
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
(cherry picked from commit 3cd81b52aedf6802aaf7b41f3550b1850c7a09a4)
2024-08-13 23:40:41 +05:30
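The hotplug-callback fix boils down to recomputing max_capacity over the active CPUs every time, rather than trusting cpufreq notifications. A sketch with arrays standing in for per-CPU data, replaying the BIG/little example from the message:

```c
#include <assert.h>

/* Recompute the maximum capacity over cpu_active() CPUs; forced on
 * every hotplug callback, which is cheap because hotplug is rare. */
static unsigned long recompute_max_capacity(const unsigned long *capacity,
                                            const int *active, int nr_cpus)
{
    unsigned long max = 0;

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (active[cpu] && capacity[cpu] > max)
            max = capacity[cpu];
    return max;
}
```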
tip-bot for Jacob Shin
2bc84a0ac1 sched/fair: Force balancing on NOHZ balance if local group has capacity
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.

The original commit that adds it:

  fab476228b ("sched: Force balancing on newidle balance if local group has capacity")

explains:

    Under certain situations, such as a niced down task (i.e. nice =
    -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
    on a sched group and kicks away other tasks because of its large
    weight. This leads to sub-optimal utilization of the
    machine. Even though the sched group has capacity, it does not
    pull tasks because sds.this_load >> sds.max_load, and f_b_g()
    returns NULL.

A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".

The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.

Change-Id: I6b0db178c0707603c8fd764fd3e44524c5345241
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 583ffd99d7657755736d831bbc182612d1d2697d
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 3d9aec71e139bce6d592b56afaa30f02c344e80e)
2024-08-13 23:40:41 +05:30
Lingutla Chandrasekhar
70e5add1e9 sched: energy: rebuild sched_domains with actual capacities
During sched initialization, sched_domains might have been built
with default capacity values, and the max_{min_}cap_org_cpus
were updated based on them. After the energy probe is called,
these capacities change, but the max_{min_}cap_org_cpus
still hold the old values, and using these stale cpus could give the
wrong start_cpu when finding an energy-efficient cpu.

So rebuild the sched_domains, which updates all cpu group capacities
with the actual capacities and then builds the domains again, and
update the max_{min_}cap_org_cpus as well.

Change-Id: I07d58bc849de363c5ed8fb743ab98d3fba727130
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
(cherry picked from commit 5b2c99599d1dcf79ef7dec93c7935d6fc48869db)
2024-08-13 23:40:41 +05:30
Sultan Alsawaf
dd9658622e msm: kgsl: Avoid dynamically allocating small command buffers
Most command buffers here are rather small (fewer than 256 words); it's
a waste of time to dynamically allocate memory for such a small buffer
when it could easily fit on the stack.

Conditionally using an on-stack command buffer when the size is small
enough eliminates the need for using a dynamically-allocated buffer most
of the time, reducing GPU command submission latency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2024-08-13 23:40:30 +05:30
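The shape of the optimization, under the assumption (from the message) that most command lists fit in 256 words: build small lists in an on-stack array and fall back to the allocator only for large ones. malloc() stands in for the kernel allocation the driver avoids, and the function is a sketch, not kgsl's code:

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define SMALL_CMDS 256   /* threshold from the message: "fewer than 256 words" */

/* Returns 1 if the heap path was taken, 0 for the stack path, -1 on
 * allocation failure; returned only so the sketch is observable. */
static int submit_cmds(const unsigned int *cmds, size_t ndwords)
{
    unsigned int stackbuf[SMALL_CMDS];
    unsigned int *buf = stackbuf;
    int heap = 0;

    if (ndwords > SMALL_CMDS) {           /* rare: too big for the stack */
        buf = malloc(ndwords * sizeof(*buf));
        if (!buf)
            return -1;
        heap = 1;
    }
    memcpy(buf, cmds, ndwords * sizeof(*buf));
    /* ... hand buf to the ringbuffer here ... */
    if (heap)
        free(buf);
    return heap;
}
```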
Sultan Alsawaf
388342b609 msm: kgsl: Don't allocate memory dynamically for temp command buffers
The temporary command buffer in _set_pagetable_gpu is only the size of a
single page, and _set_pagetable_gpu is never executed concurrently. It
is therefore easy to replace the dynamic command buffer allocation with
a static one to improve performance by avoiding the latency of dynamic
memory allocation.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2024-08-13 23:40:30 +05:30
Prakash Kamliya
a892b85bfb msm: kgsl: Relax adreno spin idle tight loop
The tight loop in adreno_spin_idle() causes RT throttling.
Relax the tight loop by giving other threads a chance to run.

Change-Id: Ic23d4551c0cc0b5f2fa7844ca73444d1412d480c
Signed-off-by: Prakash Kamliya <pkamliya@codeaurora.org>
Signed-off-by: Raphiel Rollerscaperers <rapherion@raphielgang.org>
2024-08-13 23:40:30 +05:30
Martin KaFai Lau
a6710190e0 bpf: Refactor codes handling percpu map
Refactor the code that populates the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:22 +05:30
Martin KaFai Lau
84360a36df bpf: Add percpu LRU list
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute.  The map attribute will be added in the later
patch.

While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).

Each percpu LRU is isolated from each other.  The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.

Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
 1 cpus: 2934082 updates
 4 cpus: 7391434 updates
 8 cpus: 6500576 updates

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
  1 cpus: 2896553 updates
  4 cpus: 9766395 updates
  8 cpus: 17460553 updates

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:22 +05:30
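The selection between the common list and a per-CPU list described above reduces to a sketch like this (the attribute flag, structure names, and fixed CPU count of 8 are illustrative assumptions):

```c
#include <assert.h>

#define NR_CPUS_SKETCH 8

struct lru_list_sketch { int nr_nodes; };

struct lru_sketch {
    int percpu;                                   /* the map attribute */
    struct lru_list_sketch common;
    struct lru_list_sketch per_cpu[NR_CPUS_SKETCH];
};

/* With the attribute set, each CPU updates only its own list and
 * nodes never migrate between lists; otherwise all CPUs contend on
 * one common list. */
static struct lru_list_sketch *lru_for_cpu(struct lru_sketch *l, int cpu)
{
    return l->percpu ? &l->per_cpu[cpu] : &l->common;
}
```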
Martin KaFai Lau
d15c5c69e6 bpf: LRU List
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.

* General Thoughts:
1. Target use case.  Reads happen more often than updates
   (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
   If a bpf_prog does a bpf_lookup_elem() first and then an in-place
   update, it still counts as a read operation as far as the LRU list
   is concerned.
2. It may be useful to think of it as an LRU cache.
3. Optimize the read case
   3.1 No lock in read case
   3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it loses the system-wide LRU
   property.  A completely isolated percpu LRU list has the best
   performance but the memory utilization is not ideal considering
   the workload may be imbalanced.
5. Hence, this patch starts the LRU implementation with a global LRU
   list, with batched operations before accessing the global LRU list.
   As an LRU cache where #read >> #update/#insert operations, it will
   work well.
6. There is a local list (for each cpu) which is named
   'struct bpf_lru_locallist'.  This local list is not used to sort
   the LRU property.  Instead, the local list is to batch enough
   operations before acquiring the lock of the global LRU list.  More
   details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
   map-attribute for scalability reason and for use cases that need to
   prepare for the worst (and pathological) case like DoS attack.
   The percpu LRU list is completely isolated from each other and the
   LRU nodes (including free nodes) cannot be moved across the list.  The
   following description is for the global LRU list but mostly applicable
   to the percpu LRU list also.

* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two-list idea, active and inactive, is borrowed from the
   page cache.
3. All nodes are pre-allocated and all sit on the free-list (of the
   global LRU list) at the beginning.  The pre-allocation reasoning
   is similar to the existing BPF_MAP_TYPE_HASH.  However,
   opting out of preallocation (BPF_F_NO_PREALLOC) is not supported in
   the LRU map.
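
The node and sub-list layout described above might be sketched in plain C as follows (all names here are illustrative stand-ins, not the kernel's actual definitions):

```c
#include <assert.h>
#include <stddef.h>

/* One pre-allocated LRU node; the ref-bit is set on lookup and
 * consumed later by rotation (illustrative sketch only). */
struct lru_node {
    struct lru_node *next, *prev;
    unsigned int ref;
};

/* The global LRU with its three sub-lists, each a circular list
 * anchored by a sentinel node. */
struct bpf_lru_sketch {
    struct lru_node active;    /* hot working set */
    struct lru_node inactive;  /* eviction candidates */
    struct lru_node freelist;  /* pre-allocated, unused nodes */
    size_t nr_active, nr_inactive;
};

/* A lookup only sets the ref-bit -- no list manipulation, no lock. */
static void lru_node_ref(struct lru_node *n)
{
    n->ref = 1;
}
```

This is why the read path can stay lock-free: marking the ref-bit is the only LRU work a lookup does.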

* Active/Inactive List (of the global LRU list):
1. The active list, as its name says, maintains the active set of
   the nodes.  We can think of it as the working set or more frequently
   accessed nodes.  The access frequency is approximated by a ref-bit.
   The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says, maintains a less
   active set of nodes.  They are the candidates to be removed
   from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists acts as a rough clock.
   The tail of the inactive list holds the older nodes, which
   should be released first when the bpf_htab needs a free element.

* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
   the global list.
2. The active list is only rotated when the inactive list is running
   low.  This idea is similar to the current page cache.
   Inactive running low is currently defined as
   "# of inactive < # of active".
3. The active list rotation always starts from the tail.  It moves
   nodes without the ref-bit set to the head of the inactive list.
   It moves nodes with the ref-bit set back to the head of the active
   list and then clears their ref-bit.
4. The inactive rotation is pretty simple.
   It walks the inactive list and moves nodes back to the head of the
   active list if their ref-bit is set.  The ref-bit is cleared after
   moving to the active list.
   If a node does not have the ref-bit set, it is left as is,
   because it is already in the inactive list.
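
The active-list rotation above might be sketched in self-contained C like this (the list helpers and names are simplified stand-ins for the kernel's list_head API, not the actual implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, standing in for list_head. */
struct node { struct node *next, *prev; int ref; };

static void list_init(struct node *h) { h->next = h->prev = h; }
static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
static void list_add_head(struct node *n, struct node *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}
static int list_len(const struct node *h)
{
    int len = 0;
    for (const struct node *p = h->next; p != h; p = p->next)
        len++;
    return len;
}

/* Active rotation (point 3 above): scan up to nscan nodes from the
 * tail; demote nodes without the ref-bit to the inactive head,
 * re-promote nodes with it and clear the bit. */
static void rotate_active(struct node *active, struct node *inactive,
                          int nscan)
{
    while (nscan-- > 0 && active->prev != active) {
        struct node *n = active->prev;          /* oldest node */
        list_del(n);
        if (n->ref) {
            n->ref = 0;
            list_add_head(n, active);           /* keep it hot */
        } else {
            list_add_head(n, inactive);         /* demote */
        }
    }
}
```

Note the nscan bound: scanning a limited number of tail nodes per rotation keeps the cost of each pass bounded.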

* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
   full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it walks the inactive list from the tail and
   deletes the nodes without the ref-bit set from the bpf_htab.
4. If no free node is found after step (3), it forcefully takes
   one node from the tail of the inactive or active list.  "Forcefully"
   means that it ignores the ref-bit.
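
Steps (3) and (4) above can be sketched as a single eviction helper (again with simplified stand-ins for the kernel's list primitives, not the actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, standing in for list_head. */
struct node { struct node *next, *prev; int ref; };

static void list_init(struct node *h) { h->next = h->prev = h; }
static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
static void list_add_head(struct node *n, struct node *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

/* Shrink: walk the inactive list from the tail looking for a node
 * without the ref-bit; if none is found, forcefully take the tail
 * node, ignoring its ref-bit. */
static struct node *shrink_one(struct node *inactive)
{
    for (struct node *n = inactive->prev; n != inactive; n = n->prev) {
        if (!n->ref) {
            list_del(n);
            return n;
        }
    }
    if (inactive->prev != inactive) {           /* force-evict */
        struct node *n = inactive->prev;
        list_del(n);
        return n;
    }
    return NULL;                                /* list empty */
}
```

The forced fallback guarantees forward progress: even a fully referenced inactive list still yields a node when the htab is out of free elements.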

* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
   batch enough operations before acquiring the lock of the
   global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it first tries to get a node from the
   free-list of the current CPU's local list.
4. If the local free-list is empty, it acquires from the
   global LRU list.  The global LRU list can satisfy it either
   from its global free-list or by shrinking the global inactive
   list.  Since the global LRU list lock has been acquired anyway,
   it grabs at most LOCAL_FREE_TARGET elements at once
   for the local free-list.
5. When a new element is added to the bpf_htab, it first sits
   on the pending-list (of the local list).
   The pending-list will be flushed to the global LRU list
   when it needs to acquire free nodes from the global list
   next time.
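
The batching effect of the local free-list can be illustrated with a simplified singly-linked sketch (the TARGET constant, the counter standing in for the lru_lock, and all names here are illustrative; the kernel's LOCAL_FREE_TARGET and flush logic differ in detail):

```c
#include <assert.h>
#include <stddef.h>

#define TARGET 4                 /* illustrative batch size */

struct snode { struct snode *next; };

struct global_lru {
    struct snode *free;          /* global free-list */
    int lock_takes;              /* how often "lru_lock" was taken */
};

struct local_list {
    struct snode *free;          /* per-cpu free-list */
};

static struct snode *pop(struct snode **head)
{
    struct snode *n = *head;
    if (n)
        *head = n->next;
    return n;
}

static void push(struct snode **head, struct snode *n)
{
    n->next = *head;
    *head = n;
}

/* Get a free node, refilling the local list in batches of TARGET so
 * the global lock is taken once per TARGET allocations, not once per
 * allocation. */
static struct snode *lru_get_free(struct local_list *l, struct global_lru *g)
{
    struct snode *n = pop(&l->free);
    if (n)
        return n;                /* fast path: no global lock */
    g->lock_takes++;             /* slow path: take the lock once... */
    for (int i = 0; i < TARGET; i++) {
        n = pop(&g->free);       /* ...and grab up to TARGET nodes */
        if (!n)
            break;
        push(&l->free, n);
    }
    return pop(&l->free);
}
```

With eight allocations and TARGET = 4, the global lock is taken only twice: the batching amortizes the lock cost across TARGET allocations.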

* Lock Consideration:
The LRU list has a lock (lru_lock).  Each bucket of htab has a
lock (buck_lock).  If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.

In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:21 +05:30
Michal Hocko
d9af72efb8 bpf: do not use KMALLOC_SHIFT_MAX
Commit 01b3f52157 ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocatable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.

While this is not incorrect, it is not very clean, because we already
have KMALLOC_MAX_SIZE for this very reason, so let's change both checks
to use KMALLOC_MAX_SIZE instead.

The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".

Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:21 +05:30
Yaroslav Furman
9ca25208f4 Revert "ANDROID: dm-crypt: run in a WQ_HIGHPRI workqueue"
This reverts commit e97c4ed917.

We don't need dm-crypt (FDE, which shouldn't ever be used on b4s4
anyway) to compete with touch, boosting, and similar important things.

Signed-off-by: kdrag0n <dragon@khronodragon.com>
2024-08-13 23:40:19 +05:30
Tyler Nijmeh
20dfb57cb1 sched: Do not reduce perceived CPU capacity while idle
CPUs that are idle are excellent candidates for latency sensitive or
high-performance tasks. Decrementing their capacity while they are idle
will result in these CPUs being chosen less, and they will prefer to
schedule smaller tasks instead of large ones. Disable this.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:36:15 +05:30
Tyler Nijmeh
f5daa9d7ec sched: Enable NEXT_BUDDY for better cache locality
By scheduling the last woken task first, we can increase cache locality
since that task is likely to touch the same data as before.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
2024-08-13 23:36:15 +05:30
Tyler Nijmeh
970b81bf75 cpufreq: schedutil: Enforce realtime priority
Even the interactive governor utilizes a realtime priority. It is
beneficial for schedutil to process its workload at a priority greater
than or equal to that of mundane tasks (KGSL/AUDIO/ETC).

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: clarencelol <clarencekuiek@icloud.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
2024-08-13 23:36:14 +05:30
Sultan Alsawaf
26a793cb28 Revert "mutex: Add a delay into the SPIN_ON_OWNER wait loop."
This reverts commit c8de3f45ee.

This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?

Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.

Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Albert I <kras@raphielgang.org>
2024-08-13 23:36:11 +05:30
Yaroslav Furman
b463d71ac6 drm/sde: sde_fence: don't copy fence names
This is in the screen rendering path. Calling snprintf there is unwise.
This also has the advantage of reducing the size of struct sde_fence from 152b to 128b.

Change-Id: I26f54537fc13a69a1f726d018a93bde5ef3477ac
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
2024-08-13 23:36:08 +05:30
Sultan Alsawaf
58ceb150b3 msm: kgsl: Use lock-less list for page pools
Page pool additions and removals are very hot during GPU workloads, so
they should be optimized accordingly. We can use a lock-less list for
storing the free pages in order to speed things up. The lock-less list
allows for one llist_del_first() user and unlimited llist_add() users to
run concurrently, so only a spin lock around the llist_del_first() is
needed; everything else is lock-free. The per-pool page count is now an
atomic to make it lock-free as well.
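
The pattern described above might be sketched in userspace C11 as follows (a simplified analogue of the kernel's llist_add()/llist_del_first(), not the kernel code itself; the single-consumer restriction on the delete side is exactly why the spin lock is still needed there):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode { struct lnode *next; };

struct lhead { _Atomic(struct lnode *) first; };

/* Lock-free push, analogous to llist_add(): any number of producers
 * may call this concurrently. */
static void llist_push(struct lhead *h, struct lnode *n)
{
    struct lnode *old = atomic_load(&h->first);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Pop, analogous to llist_del_first(): only ONE consumer may run at
 * a time (ABA hazard otherwise), which is why the delete side keeps
 * a spin lock while push stays lock-free. */
static struct lnode *llist_pop(struct lhead *h)
{
    struct lnode *old = atomic_load(&h->first);
    while (old && !atomic_compare_exchange_weak(&h->first, &old, old->next))
        ;
    return old;
}
```

In the pool, a freed page goes in with the push from any context, the allocation path locks only around the pop, and the per-pool page count becomes a plain atomic counter.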

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[jjpprrrr: adapted _kgsl_pool_get_page() because k4.9 does not update
vmstat counter for memory held in pools]
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
2024-08-13 23:36:07 +05:30
Sultan Alsawaf
bb2b4a801f msm: kgsl: Don't try to wait for fences that have been signaled
Trying to wait for fences that have already been signaled incurs a high
setup cost, since dynamic memory allocation must be used. Avoiding this
overhead when it isn't needed improves performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: palaych <palaychm@yandex.ru>
Change-Id: Iea6f84553c4c3d053858021948b18f2421a4d26e
2024-08-13 23:36:07 +05:30
Sultan Alsawaf
ecafe799d0 msm: kgsl: Dispatch commands using a master kthread
Instead of coordinating with a worker when dispatching commands and
abusing a mutex lock for synchronization, it's faster to keep a single
kthread dispatching commands whenever needed. This reduces GPU
processing latency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[0ctobot: Adapted for msm-4.9, this reverts commit:
2eb74d7 ("msm: kgsl: Defer issue commands to worker thread")]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>

Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Raphiel Rollerscaperers <raphielscape@outlook.com>
2024-08-13 23:36:07 +05:30
Sultan Alsawaf
1119fd06fe defconfig: b1c1: Remove MSM event-timer
The event timer driver is accessed directly from CPU idle and is not
RT-friendly. Since the event timer is only used by the old MDSS driver,
just remove it since it's unused on sdm670.
2024-08-13 23:35:12 +05:30
Sultan Alsawaf
90e9c6d036 defconfig: bonito: Remove MSM event-timer
The event timer driver is accessed directly from CPU idle and is not
RT-friendly. Since the event timer is only used by the old MDSS driver,
just remove it since it's unused on sdm670.
2024-08-13 23:32:56 +05:30
Sultan Alsawaf
0e39b53ee6 PM / freezer: Reduce freeze timeout to 1 second for Android
Freezing processes on Android usually takes less than 100 ms, and if it
takes longer than that to the point where the 20 second freeze timeout is
reached, it's because the remaining processes to be frozen are deadlocked
waiting for something from a process which is already frozen. There's no
point in burning power trying to freeze for that long, so reduce the freeze
timeout to a very generous 1 second for Android and don't let anything mess
with it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: GhostMaster69-dev <rathore6375@gmail.com>
2024-08-13 23:32:53 +05:30