Xiaomi's default segment reclamation value (4096), like that of other OEMs,
is tuned for the heavy I/O patterns (background services, frequent writes)
of MIUI, HyperOS, and other stock ROMs.
For AOSP:
1. Lower overhead: smaller garbage collection (GC) batches reduce
   CPU/storage load.
2. Better responsiveness: avoids long GC pauses during light usage.
3. Battery savings: less aggressive reclaiming reduces power spikes.
Change-Id: 710531a8b778ce9bc190b87ffe22fa3a52e51499
Signed-off-by: Kanishk <kanishkthederp@gmail.com>
Signed-off-by: TogoFire <togofire@mailfence.com>
- This is a heavily modified version of susfs v1.5.12
- It does not comply with the upstream official susfs v1.5.12
- sus_mount functionality still remains at v1.5.5, as updating it to the latest version would cause a mount detection leak in some apps/detectors
- Increase susfs_open_redirect UID limit to <11000
- susfs magic mount support is still implemented and enabled
- sus_map is implemented and complies with the upstream v1.5.12 codebase
This commit requires several backported commits from v4.19 and v5.x to make sus_map work:
0a8cbf3725edbacc5f1ead33eeae7e4d78823b5a proc: less memory for /proc/*/map_files readdir
37ae2444584654f6785f2cc49181f05af788c9b2 mm: smaps: split PSS into components
49a5115e11350ee68f6a5fbd56b3e817bf9e5aac fs/task_mmu: add pkeys header
6f94042bed51121f8f28a5e572cda20c21fed2e1 mm/pkeys: Add an empty arch_pkeys_enabled()
bbd5aec12b32097a71dc6a0097194a18f3ee9a17 mm/pkeys, powerpc, x86: Provide an empty vma_pkey() in linux/pkeys.h
849ca8ce954d9dbb082dcf83c98af861e98e5635 mm: /proc/pid/smaps_rollup: convert to single value seq_file
6071a482c8e603be25895cc2cac5f0eab61c4051 mm: /proc/pid/smaps: factor out common stats printing
03fd2fbe9c40da8128cec5c69ef54755c0f38c6c mm: /proc/pid/smaps: factor out mem stats gathering
95f8be4c8a86a491a1c2ac9bfe470aef9e1baa8f mm: /proc/pid/*maps remove is_pid and related wrappers
27956d255e3b012372951dd6131e07c106d2daae procfs: add seq_put_hex_ll to speed up /proc/pid/maps
7f2847d02cdc4491b5ee6d4a0043854cbd6c7a1a proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps
For the KernelSU-side patches for this commit, you need sidex15's KernelSU-Next fork:
https://github.com/sidex15/KernelSU-Next/tree/n3x7g3n-kernel
Or, if you want to patch it yourself, here is the susfs commit in KernelSU-Next:
13b1dfd6e2
Co-authored-by: simonpunk <simonpunk2016@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Report separate components (anon, file, and shmem) for PSS in
smaps_rollup.
This helps understand and tune the memory manager behavior in consumer
devices, particularly mobile devices. Many of them (e.g. chromebooks and
Android-based devices) use zram for anon memory, and perform disk reads
for discarded file pages. The difference in latency is large (e.g.
reading a single page from SSD is 30 times slower than decompressing a
zram page on one popular device), thus it is useful to know how much of
the PSS is anon vs. file.
All the information is already present in /proc/pid/smaps, but much more
expensive to obtain because of the large size of that procfs entry.
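As a minimal sketch of how a userspace tool might consume the split
values, the following parses smaps_rollup-style text and reports the
anon/file/shmem breakdown. It assumes the new fields are named Pss_Anon,
Pss_File and Pss_Shmem and are reported in kB; the helper name and the
sample text are made up for illustration.

```python
def pss_components(rollup_text):
    """Return {'anon': kB, 'file': kB, 'shmem': kB} from smaps_rollup text."""
    # Assumed field names for the per-component PSS lines.
    wanted = {"Pss_Anon": "anon", "Pss_File": "file", "Pss_Shmem": "shmem"}
    result = {}
    for line in rollup_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            # The value is the first token after the colon, e.g. "301 kB".
            result[wanted[key]] = int(rest.split()[0])
    return result


# Made-up sample; on a real system read /proc/<pid>/smaps_rollup instead.
SAMPLE = """\
Rss:                 884 kB
Pss:                 385 kB
Pss_Anon:            301 kB
Pss_File:             80 kB
Pss_Shmem:             4 kB
"""

print(pss_components(SAMPLE))
```

Reading the single rollup entry this way avoids walking the much larger
/proc/pid/smaps file just to split PSS by backing type.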
This patch also removes a small code duplication in smaps_account, which
would have gotten worse otherwise.
Also updated Documentation/filesystems/proc.txt (the smaps section was a
bit stale, and I added a smaps_rollup section) and
Documentation/ABI/testing/procfs-smaps_rollup.
[semenzato@chromium.org: v5]
Link: http://lkml.kernel.org/r/20190626234333.44608-1-semenzato@chromium.org
Link: http://lkml.kernel.org/r/20190626180429.174569-1-semenzato@chromium.org
Signed-off-by: Luigi Semenzato <semenzato@chromium.org>
Acked-by: Yu Zhao <yuzhao@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Yu Zhao <yuzhao@chromium.org>
Cc: Brian Geffon <bgeffon@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's. However, the rollup file doesn't print anything
for each vma; it only accumulates the stats.
There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.
Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.
This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally. This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.
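The idea can be illustrated outside the kernel. In this hedged sketch
(plain Python, not kernel code; the Vma type and rollup() helper are
hypothetical), the per-vma iteration happens inside one call, so every
read recomputes the totals from scratch instead of adding to state that
survives across repeated start()/show() calls.

```python
from collections import namedtuple

# Hypothetical stand-in for a vma and the stats gathered from it.
Vma = namedtuple("Vma", ["rss_kb", "pss_kb"])


def rollup(vmas):
    """Walk every vma internally and return one set of totals."""
    totals = {"Rss": 0, "Pss": 0}
    for vma in vmas:
        totals["Rss"] += vma.rss_kb
        totals["Pss"] += vma.pss_kb
    return totals


vmas = [Vma(100, 60), Vma(40, 40), Vma(8, 2)]
first = rollup(vmas)
second = rollup(vmas)  # a repeated read starts from zero again
print(first, first == second)
```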
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
[vbabka@suse.cz: use seq_file infrastructure]
Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Patch series "cleanups and refactor of /proc/pid/smaps*".
The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfully) to mark thread
stacks in the output.
Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.
So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
This patch (of 4):
Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").
Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.
Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.
It is equivalent to seq_printf(m, "%s%*d", str, width, val), but it
works much faster.
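To illustrate the equivalence (not the kernel implementation, which this
message does not show), here is a small Python sketch that builds the same
output by converting digits and padding by hand, with no format string to
parse. The function name put_decimal_width is hypothetical, and it only
handles non-negative values.

```python
def put_decimal_width(prefix, val, width):
    """Return prefix followed by val right-aligned in at least `width` columns."""
    digits = []
    while True:  # emit digits least-significant first
        val, rem = divmod(val, 10)
        digits.append(chr(ord("0") + rem))
        if val == 0:
            break
    pad = " " * max(0, width - len(digits))
    return prefix + pad + "".join(reversed(digits))


line = put_decimal_width("Rss:", 884, 8)
# Same result as the printf-style form quoted above.
print(repr(line), line == "%s%*d" % ("Rss:", 8, 884))
```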
== test_smaps.py
num = 0
with open("/proc/1/smaps") as f:
    for x in xrange(10000):
        data = f.read()
        f.seek(0, 0)
==
== Before patch ==
$ time python test_smaps.py
real 0m4.593s
user 0m0.398s
sys 0m4.158s
== After patch ==
$ time python test_smaps.py
real 0m3.828s
user 0m0.413s
sys 0m3.408s
$ perf record -g python test_smaps.py
== Before patch ==
-   79.01%     3.36%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 75.65% show_smap.isra.33
      + 48.85% seq_printf
      + 15.75% __walk_page_range
      + 9.70% show_map_vma.isra.23
        0.61% seq_puts
== After patch ==
-   75.51%     4.62%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_w
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts
[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
[ Upstream commit aaf8c0b9ae042494cb4585883b15c1332de77840 ]
We may trigger frequent checkpoints in the case below:
1. mkdir /mnt/dir1; set dir1 encrypted
2. touch /mnt/file1; fsync /mnt/file1
3. mkdir /mnt/dir2; set dir2 encrypted
4. touch /mnt/file2; fsync /mnt/file2
...
Although the newly created dir and file are not related, due to
commit bbf156f7af ("f2fs: fix lost xattrs of directories"), we will
trigger a checkpoint whenever fsync() comes after a new encrypted dir
is created.
In order to avoid this performance regression, let's record an
entry including the directory's ino in a global cache whenever we update
the directory's xattr data, and then trigger a checkpoint only if
the xattr metadata of the target file's parent was updated.
This patch also covers the no-encryption case below:
1) parent is checkpointed
2) set_xattr(dir) w/ new xnid
3) create(file)
4) fsync(file)
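The decision rule can be sketched as follows. This is a simplified model
in Python, not the f2fs code: the class and method names are hypothetical,
and clearing the cache when a checkpoint completes is an assumption made
for the sake of the example.

```python
class XattrDirCache:
    """Hypothetical stand-in for the global cache of directory inode numbers."""

    def __init__(self):
        self._dirty_dirs = set()

    def record_xattr_update(self, dir_ino):
        # Called whenever a directory's xattr data is updated.
        self._dirty_dirs.add(dir_ino)

    def fsync_needs_checkpoint(self, parent_ino):
        # Only a file whose parent had its xattr updated forces a checkpoint.
        return parent_ino in self._dirty_dirs

    def checkpoint_done(self):
        # Assumption: a completed checkpoint makes all recorded entries stale.
        self._dirty_dirs.clear()


cache = XattrDirCache()
cache.record_xattr_update(dir_ino=10)    # e.g. dir1 set encrypted
print(cache.fsync_needs_checkpoint(10))  # file created inside dir1
print(cache.fsync_needs_checkpoint(2))   # unrelated file elsewhere in /mnt
```

With this rule, the fsync of /mnt/file1 in the sequence above no longer
pays for a checkpoint just because an unrelated encrypted dir was created.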
Change-Id: Id7c4c5b70c239458b74f92edca537dd844b0be6f
Fixes: bbf156f7af ("f2fs: fix lost xattrs of directories")
Reported-by: wangzijie <wangzijie1@honor.com>
Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reported-by: Yunlei He <heyunlei@hihonor.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
There are cases where EXT4 is a bit too conservative in sending barriers down to
the disk; there are cases where the transaction in progress is not the one
that sent the barrier (in other words: the fsync is for a file for which the
IO happened some time ago and all data was already sent to the disk).
For that case, a better-performing tradeoff can be made on SSD devices (which
have the ability to flush their DRAM caches in a hurry on a power-fail event)
where the barrier gets sent to the disk, but we don't need to wait for the
barrier to complete. Any subsequent IO will block on the barrier correctly.
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
That way we can also poll non blk-mq queues. Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
These get allocated and freed millions of times on this kernel tree.
Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
These get allocated and freed millions of times on this kernel tree.
Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
MIUI-1428085
The discard thread can only process 8 requests at a time by default,
so fstrim needs to handle the remaining discard requests when the
discard option is in use.
Change-Id: I5eac38c34182607e8dceeb13273522b10ce02af8
Signed-off-by: liuchao12 <liuchao12@xiaomi.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
This deadlock is hitting Android users (Pixel 3/3a/4) with Magisk, due
to frequent umount/mount operations that trigger quota_sync, hitting
the race. See https://github.com/topjohnwu/Magisk/issues/3171 for
additional impact discussion.
In commit db6ec53b7e03, we added a semaphore to protect quota flags.
As part of this commit, we changed f2fs_quota_sync to call
f2fs_lock_op, in an attempt to prevent an AB/BA type deadlock with
quota_sem locking in block_operations. However, rwsem in Linux is not
recursive. Therefore, the following deadlock can occur:
f2fs_quota_sync
  down_read(cp_rwsem) // f2fs_lock_op
  filemap_fdatawrite
  f2fs_write_data_pages
  ...
                                   block_operations
                                   down_write(cp_rwsem) - marks rwsem as
                                                          "writer pending"
  down_read_trylock(cp_rwsem) - fails as there is
                                a writer pending.
                                Code keeps on trying,
                                live-locking the filesystem.
We solve this by creating a new rwsem, used specifically to
synchronize this case, instead of attempting to reuse an existing
lock.
Signed-off-by: Shachar Raindel <shacharr@gmail.com>
Fixes: db6ec53b7e03 ("f2fs: add a rw_sem to cover quota flag changes")
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Following commit c23401e6e15f73150f45e67287be679e4deb58f4,
we need to protect this node from Android writing to it.
Change-Id: I19ee51f06c9e373acf886d83026ade290645e243
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Useful when we need to set a node read-only to keep Android from
overriding custom values.
Change-Id: Iad8cf81504d55b8ed75e6b5563f7cf397595ec1a
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Android sets the value to 50ms via vold's IdleMaint service, since
500ms is too long for GC to collect all invalid segments in time,
which results in performance degradation.
On unencrypted devices, vold fails to set this value to 50ms, so
performance degrades over time.
Based on [1].
[1] https://github.com/topjohnwu/Magisk/pull/5462
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Change-Id: I80f2c29558393d726d5e696aaf285096c8108b23
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
On high fs utilization, congestion is hit quite frequently, and waiting for a
whopping 20ms is too expensive, especially on critical paths.
Reduce it to an amount that is unlikely to affect UI rendering paths.
The new times are as follows:
100 Hz => 1 jiffy (effective: 10 ms)
250 Hz => 2 jiffies (effective: 8 ms)
300 Hz => 2 jiffies (effective: 6 ms)
1000 Hz => 6 jiffies (effective: 6 ms)
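The table is consistent with rounding a 6 ms target up to a whole number
of jiffies. That derivation is an assumption (the message only lists the
results), and the helper names below are hypothetical; the sketch simply
reproduces the numbers.

```python
def timeout_jiffies(hz, target_ms=6):
    """Round target_ms up to a whole number of jiffies at tick rate hz."""
    return (target_ms * hz + 999) // 1000


def effective_ms(hz, jiffies):
    """Whole milliseconds covered by `jiffies` ticks (truncated)."""
    return jiffies * 1000 // hz


for hz in (100, 250, 300, 1000):
    j = timeout_jiffies(hz)
    print(hz, "Hz =>", j, "jiffies, effective:", effective_ms(hz, j), "ms")
```

Note that at 300 Hz two jiffies are about 6.67 ms; the table's 6 ms
matches the truncated value.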
Co-authored-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
GC should run as conservatively as possible to reduce latency spikes to the user.
Setting ioprio to the idle class lets the kernel schedule the GC thread's I/O
so that it does not affect any other process's I/O requests.
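For reference, an I/O priority value packs the scheduling class into the
upper bits. The sketch below mirrors Linux's ioprio encoding (class
shifted left by 13, idle class = 3) in Python; how the patch applies the
value to the GC thread is not shown in this message, so gc_prio here is
only an illustration.

```python
IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = range(4)


def ioprio_value(ioclass, data=0):
    """Pack a scheduling class and per-class priority level into one value."""
    return (ioclass << IOPRIO_CLASS_SHIFT) | data


def ioprio_class(value):
    """Extract the scheduling class from a packed value."""
    return value >> IOPRIO_CLASS_SHIFT


# Illustrative only: the value an idle-class thread would carry.
gc_prio = ioprio_value(IOPRIO_CLASS_IDLE)
print(gc_prio, ioprio_class(gc_prio) == IOPRIO_CLASS_IDLE)
```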
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
In OPPO's kernel:
enlarge min_fsync_blocks to optimize performance
- yanwu@TECH.Storage.FS.oF2FS, 2019/08/12
Huawei is also doing this in their production kernel.
If this optimization is good for them and shipped
with their devices, it should be good for us.
Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
To allow for easier build test coverage and run-time testing, this allows
multiple compression algorithms to be built into pstore. Still only one
is supported to operate at a time (which can be selected at build time
or at boot time, similar to how LSMs are selected).
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I5956061c215db5d3d7846b11b399ab101feaceb9
[dereference23: Backport to 4.14]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Now that pstore_register() can correctly pass max_reason to the kmesg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.
The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are now automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.
For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.
Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: If2ed5c5786a9c572aa1eb4683eca1a0b292bb143
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Add a new member to struct pstore_info for passing information about
kmesg dump maximum reason. This allows a finer control of what kmesg
dumps are sent to pstore storage backends.
Those backends that do not explicitly set this field (keeping it equal to
0), get the default behavior: store only Oopses and Panics, or everything
if the printk.always_kmsg_dump boot param is set.
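The default described above can be modelled like this. It is a hedged
Python sketch, not the kernel code: the Reason enum and its ordering
(lower values are dumped first, 0 meaning "not set") are assumptions, and
should_dump() is a hypothetical helper.

```python
from enum import IntEnum


class Reason(IntEnum):  # assumed ordering, least to most inclusive
    UNDEF = 0
    PANIC = 1
    OOPS = 2
    EMERG = 3
    SHUTDOWN = 4
    MAX = 5


def should_dump(reason, max_reason=Reason.UNDEF, always_kmsg_dump=False):
    """Decide whether a kmsg dump of `reason` reaches the backend."""
    if max_reason == Reason.UNDEF:
        # Backend left the field at 0: fall back to the default behavior.
        max_reason = Reason.MAX if always_kmsg_dump else Reason.OOPS
    return reason <= max_reason


print(should_dump(Reason.OOPS))                           # default: kept
print(should_dump(Reason.EMERG))                          # default: dropped
print(should_dump(Reason.EMERG, always_kmsg_dump=True))   # everything kept
```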
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I6bdb11d3b3d74624b4f7b3b3da5811bb9ef23608
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Since the header is a fixed small maximum size, just use a stack variable
to avoid memory allocation in the write path.
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I97974d792d079775d1e17dd47fa2135db99a69b2
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
If ramoops_write_kmsg_hdr() produces a zero-length header, we will not
be able to read the dmesg record back later, since it will be
treated as an invalid header in ramoops_pstore_read(). So we should
return an error instead of executing the code that follows.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: Ic14e781085ef0359d21204d4b4ef9902b801e457
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Since only a single ramoops area is allowed at a time, other probes
(like device tree) are meaningless and only waste CPU resources.
So let's check for being already initialized first.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I81905fc676487984e7ee18f6ea788100d1b60675
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Sometimes pstore_console_write() writes zero-size records to the
persistent RAM zone, which is unnecessary and only increases
resource consumption. Also adjust ramoops_write_kmsg_hdr() to apply the
same logic if memory allocation fails.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: Ibf780aa7c1446c2a8c8520ba345621177d160383
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Asynchronous I/O latency to a solid-state disk greatly increased between the 2.6.32 and 3.0 kernels.
By removing the plug from do_io_submit(), we observed a 34% improvement in the I/O latency.
Unfortunately, at this level, we don't know if the request is to
a rotating disk or not.
Change-Id: I7101df956473ed9fd5dcff18e473dd93b688a5c1
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: linux-aio@kvack.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
timerfd doesn't create any wakelocks; eventpoll can, and is creating the
wakelocks we see called "[timerfd]". eventpoll creates two kinds of
wakelocks: a single top-level lock associated with the eventpoll fd
itself, and one additional lock for each fd it is polling that needs such
a lock (e.g. those using EPOLLWAKEUP). Current code names the per-fd
locks using the undecorated names of the fds' associated files (hence
"[timerfd]"), and is naming the top-level lock after the PID of the caller
and the name of the file behind the first fd for which a per-fd lock is
created. To make things clearer, the top-level lock is now named using
the caller PID and an "epollfd" designation, while the per-fd locks are
also named with the caller's PID (to associate them with the top-level
lock) and their respective fds' file names.
Port of fix already applied to previous 2 generations. Note that this
set of changes does not fully solve the problem of eventpoll/timerfd
wakelock attribution to the original process, since most activity is
relayed through system_server, but it does at least ensure that different
eventpoll wakelocks - and their stats - are properly disambiguated.
Test: Ran on device and observed new wakelock naming in
/d/wakeup_sources and (file naming in) lsof output.
Bug: 116363986
Change-Id: I34bada5ddab04cf3830762c745f46bfcd1549cb8
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Kelly Rossmoyer <krossmo@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>