kernel_realme_nemo

Author	SHA1	Message	Date
Tim Zimmermann	5f89a2e6ff	syscall: Increase bpf fake uname to 5.4.299 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Gabriel Krisman Bertazi	7f11ef316a	UPSTREAM: unicode: Don't special case ignorable code points We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Change-Id: I01c3f2c98ae4d84269586cec09f18239cbee0abb Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> (cherry picked from commit 5c26d2f1d3f5e4be3e196526bead29ecb139cf91) Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Marco Elver	edb0e223fa	BACKPORT: panic: use error_report_end tracepoint on warnings Introduce the error detector "warning" to the error_report event and use the error_report_end tracepoint at the end of a warning report. This allows in-kernel tests but also userspace to more easily determine if a warning occurred without polling kernel logs. [akpm@linux-foundation.org: add comma to enum list, per Andy] Link: https://lkml.kernel.org/r/20211115085630.1756817-1-elver@google.com Change-Id: Ia82127785563994f9a8b07a4c9e5c2483242f9f0 Signed-off-by: Marco Elver <elver@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Popov <alex.popov@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Alexander Potapenko	0a38a31948	tracing: add error_report_end trace point Patch series "Add error_report_end tracepoint to KFENCE and KASAN", v3. This patchset adds a tracepoint, error_repor_end, that is to be used by KFENCE, KASAN, and potentially other bug detection tools, when they print an error report. One of the possible use cases is userspace collection of kernel error reports: interested parties can subscribe to the tracing event via tracefs, and get notified when an error report occurs. This patch (of 3): Introduce error_report_end tracepoint. It can be used in debugging tools like KASAN, KFENCE, etc. to provide extensions to the error reporting mechanisms (e.g. allow tests hook into error reporting, ease error report collection from production kernels). Another benefit would be making use of ftrace for debugging or benchmarking the tools themselves. Should we need it, the tracepoint name leaves us with the possibility to introduce a complementary error_report_start tracepoint in the future. Link: https://lkml.kernel.org/r/20210121131915.1331302-1-glider@google.com Link: https://lkml.kernel.org/r/20210121131915.1331302-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Peter Zijlstra	79efce0b75	sched: Provide sched_set_fifo() SCHED_FIFO (or any static priority scheduler) is a broken scheduler model; it is fundamentally incapable of resource management, the one thing an OS is actually supposed to do. It is impossible to compose static priority workloads. One cannot take two well designed and functional static priority workloads and mash them together and still expect them to work. Therefore it doesn't make sense to expose the priority field; the kernel is fundamentally incapable of setting a sensible value, it needs systems knowledge that it doesn't have. Take away sched_setschedule() / sched_setattr() from modules and replace them with: - sched_set_fifo(p); create a FIFO task (at prio 50) - sched_set_fifo_low(p); create a task higher than NORMAL, which ends up being a FIFO task at prio 1. - sched_set_normal(p, nice); (re)set the task to normal This stops the proliferation of randomly chosen, and irrelevant, FIFO priorities that dont't really mean anything anyway. The system administrator/integrator, whoever has insight into the actual system design and requirements (userspace) can set-up appropriate priorities if and when needed. Cc: airlied@redhat.com Cc: alexander.deucher@amd.com Cc: awalls@md.metrocast.net Cc: axboe@kernel.dk Cc: broonie@kernel.org Cc: daniel.lezcano@linaro.org Cc: gregkh@linuxfoundation.org Cc: hannes@cmpxchg.org Cc: herbert@gondor.apana.org.au Cc: hverkuil@xs4all.nl Cc: john.stultz@linaro.org Cc: nico@fluxnic.net Cc: paulmck@kernel.org Cc: rafael.j.wysocki@intel.com Cc: rmk+kernel@arm.linux.org.uk Cc: sudeep.holla@arm.com Cc: tglx@linutronix.de Cc: ulf.hansson@linaro.org Cc: wim@linux-watchdog.org Change-Id: I52ed4f1253e82ba3e8f40f3aa1aff62580163f25 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Michael Bestas	f6aad5a591	sched: Provide sched_setattr_nocheck() Based on upstream commit 794a56ebd9a57db12abaec63f038c6eb073461f7 Change-Id: I2a6be669c847da253f09e72c6f41437a9c0f11ef Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:41 +00:00
Matthew Wilcox	bf5fefeaa6	xarray: add the xa_lock to the radix_tree_root This results in no change in structure size on 64-bit machines as it fits in the padding between the gfp_t and the void *. 32-bit machines will grow the structure from 8 to 12 bytes. Almost all radix trees are protected with (at least) a spinlock, so as they are converted from radix trees to xarrays, the data structures will shrink again. Initialising the spinlock requires a name for the benefit of lockdep, so RADIX_TREE_INIT() now needs to know the name of the radix tree it's initialising, and so do IDR_INIT() and IDA_INIT(). Also add the xa_lock() and xa_unlock() family of wrappers to make it easier to use the lock. If we could rely on -fplan9-extensions in the compiler, we could avoid all of this syntactic sugar, but that wasn't added until gcc 4.6. Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [RealJohnGalt: adapt to 4.14] Change-Id: Iea510d7f531356e14be1b9c002df075629c2f729 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Matthew Wilcox	6387ecd76c	idr: Add idr_alloc_u32 helper All current users of idr_alloc_ext() actually want to allocate a u32 and idr_alloc_u32() fits their needs better. Like idr_get_next(), it uses a 'nextid' argument which serves as both a pointer to the start ID and the assigned ID (instead of a separate minimum and pointer-to-assigned-ID argument). It uses a 'max' argument rather than 'end' because the semantics that idr_alloc has for 'end' don't work well for unsigned types. Since idr_alloc_u32() returns an errno instead of the allocated ID, mark it as __must_check to help callers use it correctly. Include copious kernel-doc. Chris Mi <chrism@mellanox.com> has promised to contribute test-cases for idr_alloc_u32. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Change-Id: Id32ab4ba4c31d1ebb12945fffeb2b7ea7b146b4f Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Matthew Wilcox	11384d6312	fscache: use appropriate radix tree accessors Don't open-code accesses to data structure internals. Link: http: //lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I08de5ac61e1a660cad05de8af6b214ed42a00171 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Goldwyn Rodrigues	6454054b9d	fs: export generic_file_buffered_read() Export generic_file_buffered_read() to be used to supplement incomplete direct reads. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba> Change-Id: I5577fca5f9a836348d71999839f5de47e72caf98 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Fangrui Song	521d224afa	BACKPORT: x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem_64.S Commit `393f203f5f` ("x86_64: kasan: add interceptors for memset/memmove/memcpy functions") added .weak directives to arch/x86/lib/mem_64.S instead of changing the existing ENTRY macros to WEAK. This can lead to the assembly snippet .weak memcpy ... .globl memcpy which will produce a STB_WEAK memcpy with GNU as but STB_GLOBAL memcpy with LLVM's integrated assembler before LLVM 12. LLVM 12 (since https://reviews.llvm.org/D90108) will error on such an overridden symbol binding. Commit ef1e03152cb0 ("x86/asm: Make some functions local") changed ENTRY in arch/x86/lib/memcpy_64.S to SYM_FUNC_START_LOCAL, which was ineffective due to the preceding .weak directive. Use the appropriate SYM_FUNC_START_WEAK instead. Fixes: `393f203f5f` ("x86_64: kasan: add interceptors for memset/memmove/memcpy functions") Fixes: ef1e03152cb0 ("x86/asm: Make some functions local") Reported-by: Sami Tolvanen <samitolvanen@google.com> Change-Id: I77420199abb62cacbed4de8d3b244f77e43a7f38 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103012358.168682-1-maskray@google.com Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Peiyong Wang	cbdea98273	BACKPORT: net: core: enable SO_BINDTODEVICE for non-root users Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a non-root user to bind a socket to an interface if it is not already bound. This is useful to allow an application to bind itself to a specific VRF for outgoing or incoming connections. Currently, an application wanting to manage connections through several VRF need to be privileged. Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for Wine (`76e21053b5` and `c4062dfc42`) specifically for use by non-root processes. However, they are restricted to sendmsg() and not usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to get the same privilege. As for TCP servers, outside the VRF use case, SO_BINDTODEVICE would only further restrict connections a server could accept. When an application is restricted to a VRF (with `ip vrf exec`), the socket is bound to an interface at creation and therefore, a non-privileged call to SO_BINDTODEVICE to escape the VRF fails. When an application bound a socket to SO_BINDTODEVICE and transmit it to a non-privileged process through a Unix socket, a tentative to change the bound device also fails. Before: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted After: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted Bug: 323792489 Signed-off-by: Vincent Bernat <vincent@bernat.ch> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit c427bfec18f2190b8f4718785ee8ed2db4f84ee6) Change-Id: Ie3f4c536b78da12dd961dc681c6dfb0cdf1a06b7 Signed-off-by: Peiyong Wang <wangpeiyong@xiaomi.corp-partner.google.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Andrey Konovalov	cb10151c06	UPSTREAM: usb: raw-gadget: properly handle interrupted requests Currently, if a USB request that was queued by Raw Gadget is interrupted (via a signal), wait_for_completion_interruptible returns -ERESTARTSYS. Raw Gadget then attempts to propagate this value to userspace as a return value from its ioctls. However, when -ERESTARTSYS is returned by a syscall handler, the kernel internally restarts the syscall. This doesn't allow userspace applications to interrupt requests queued by Raw Gadget (which is required when the emulated device is asked to switch altsettings). It also violates the implied interface of Raw Gadget that a single ioctl must only queue a single USB request. Instead, make Raw Gadget do what GadgetFS does: check whether the request was interrupted (dequeued with status == -ECONNRESET) and report -EINTR to userspace. Bug: 254441685 Fixes: f2c2e717642c ("usb: gadget: add raw-gadget interface") Cc: stable <stable@kernel.org> Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/0db45b1d7cc466e3d4d1ab353f61d63c977fbbc5.1698350424.git.andreyknvl@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit e8033bde451eddfb9b1bbd6e2d848c1b5c277222) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I7c684cc6079d2ec31986633c29e5a41954b80c84 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:40 +00:00
Zi Yan	e2703cfb68	UPSTREAM: mm/cma: use nth_page() in place of direct struct page manipulation Patch series "Use nth_page() in place of direct struct page manipulation", v3. On SPARSEMEM without VMEMMAP, struct page is not guaranteed to be contiguous, since each memory section's memmap might be allocated independently. hugetlb pages can go beyond a memory section size, thus direct struct page manipulation on hugetlb pages/subpages might give wrong struct page. Kernel provides nth_page() to do the manipulation properly. Use that whenever code can see hugetlb pages. This patch (of 5): When dealing with hugetlb pages, manipulating struct page pointers directly can get to wrong struct page, since struct page is not guaranteed to be contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle it properly. Without the fix, page_kasan_tag_reset() could reset wrong page tags, causing a wrong kasan result. No related bug is reported. The fix comes from code inspection. Bug: 254441685 Link: https://lkml.kernel.org/r/20230913201248.452081-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20230913201248.452081-2-zi.yan@sent.com Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 2e7cfe5cd5b6b0b98abf57a3074885979e187c1c) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: Ib62455867ec73728b47f7f93e809bd6d0131208a Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Ruidong Tian	3afd1b776f	UPSTREAM: coresight: tmc: Explicit type conversions to prevent integer overflow Perf cs_etm session executed unexpectedly when AUX buffer > 1G. perf record -C 0 -m ,2G -e cs_etm// -- <workload> [ perf record: Captured and wrote 2.615 MB perf.data ] Perf only collect about 2M perf data rather than 2G. This is becasuse the operation, "nr_pages << PAGE_SHIFT", in coresight tmc driver, will overflow when nr_pages >= 0x80000(correspond to 1G AUX buffer). The overflow cause buffer allocation to fail, and TMC driver will alloc minimal buffer size(1M). You can just get about 2M perf data(1M AUX buffer + perf data header) at least. Explicit convert nr_pages to 64 bit to avoid overflow. Bug: 254441685 Fixes: 22f429f19c41 ("coresight: etm-perf: Add support for ETR backend") Fixes: 99443ea19e8b ("coresight: Add generic TMC sg table framework") Fixes: `2e499bbc1a` ("coresight: tmc: implementing TMC-ETF AUX space API") Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230804081514.120171-2-tianruidong@linux.alibaba.com (cherry picked from commit fd380097cdb305582b7a1f9476391330299d2c59) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I0ebd4afabf2b6bef525712416135ffb34d1f7cd3 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Kees Cook	f3d5089d73	UPSTREAM: kheaders: Use array declaration instead of char Under CONFIG_FORTIFY_SOURCE, memcpy() will check the size of destination and source buffers. Defining kernel_headers_data as "char" would trip this check. Since these addresses are treated as byte arrays, define them as arrays (as done everywhere else). This was seen with: $ cat /sys/kernel/kheaders.tar.xz >> /dev/null detected buffer overflow in memcpy kernel BUG at lib/string_helpers.c:1027! ... RIP: 0010:fortify_panic+0xf/0x20 [...] Call Trace: <TASK> ikheaders_read+0x45/0x50 [kheaders] kernfs_fop_read_iter+0x1a4/0x2f0 ... Bug: 254441685 Reported-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20230302112130.6e402a98@kernel.org/ Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Jakub Kicinski <kuba@kernel.org> Fixes: 43d8ce9d65a5 ("Provide in-kernel headers to make extending kernel easier") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230302224946.never.243-kees@kernel.org (cherry picked from commit b69edab47f1da8edd8e7bfdf8c70f51a2a5d89fb) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I73c7530b9c558c1c8dac5f8962dbc31c553c0be7 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Munehisa Kamata	28cd8d4b2a	UPSTREAM: sched/psi: Fix use-after-free in ep_remove_wait_queue() If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path: do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit: fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue This results in use-after-free as pasted below. The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it is in line with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this. BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace: <TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 kasan_report+0xc3/0xf0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Bug: 254441685 Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Mengchi Cheng <mengcc@amazon.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com (cherry picked from commit c2dbe32d5db5c4ead121cf86dabd5ab691fb47fe) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I9677499b2885149a1070f508931113ad8a02277a Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Eric Dumazet	94e103c9b6	UPSTREAM: xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr() int type = nla_type(nla); if (type > XFRMA_MAX) { return -EOPNOTSUPP; } @type is then used as an array index and can be used as a Spectre v1 gadget. if (nla_len(nla) < compat_policy[type].len) { array_index_nospec() can be used to prevent leaking content of kernel memory to malicious users. Bug: 254441685 Fixes: 5106f4a8acff ("xfrm/compat: Add 32=>64-bit messages translator") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Safonov <dima@arista.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> (cherry picked from commit b6ee896385380aa621102e8ea402ba12db1cabff) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: Iac8d61100685ad513e04d2623fe0b79ba331167a Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Namhyung Kim	5dce631799	UPSTREAM: perf/core: Call LSM hook after copying perf_event_attr It passes the attr struct to the security_perf_event_open() but it's not initialized yet. Bug: 254441685 Fixes: da97e18458fb ("perf_event: Add support for LSM and SELinux checks") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221220223140.4020470-1-namhyung@kernel.org (cherry picked from commit 0a041ebca4956292cadfb14a63ace3a9c1dcb0a3) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I0bda5722f4cff80252e2ec483b5b1f18b1194356 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Mukesh Ojha	26c4558a2a	UPSTREAM: gcov: clang: fix the buffer overflow issue Currently, in clang version of gcov code when module is getting removed gcov_info_add() incorrectly adds the sfn_ptr->counter to all the dst->functions and it result in the kernel panic in below crash report. Fix this by properly handling it. [ 8.899094][ T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000 [ 8.899100][ T599] Mem abort info: [ 8.899102][ T599] ESR = 0x9600004f [ 8.899103][ T599] EC = 0x25: DABT (current EL), IL = 32 bits [ 8.899105][ T599] SET = 0, FnV = 0 [ 8.899107][ T599] EA = 0, S1PTW = 0 [ 8.899108][ T599] FSC = 0x0f: level 3 permission fault [ 8.899110][ T599] Data abort info: [ 8.899111][ T599] ISV = 0, ISS = 0x0000004f [ 8.899113][ T599] CM = 0, WnR = 1 [ 8.899114][ T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000 [ 8.899116][ T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787 [ 8.899124][ T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP [ 8.899265][ T599] Skip md ftrace buffer dump for: 0x1609e0 .... .., [ 8.899544][ T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S OE 5.15.41-android13-8-g38e9b1af6bce #1 [ 8.899547][ T599] Hardware name: XXX (DT) [ 8.899549][ T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [ 8.899551][ T599] pc : gcov_info_add+0x9c/0xb8 [ 8.899557][ T599] lr : gcov_event+0x28c/0x6b8 [ 8.899559][ T599] sp : ffffffc00e733b00 [ 8.899560][ T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470 [ 8.899563][ T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000 [ 8.899566][ T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000 [ 8.899569][ T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058 [ 8.899572][ T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785 [ 8.899575][ T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785 [ 8.899577][ T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80 [ 8.899580][ T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000 [ 8.899583][ T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f [ 8.899586][ T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0 [ 8.899589][ T599] Call trace: [ 8.899590][ T599] gcov_info_add+0x9c/0xb8 [ 8.899592][ T599] gcov_module_notifier+0xbc/0x120 [ 8.899595][ T599] blocking_notifier_call_chain+0xa0/0x11c [ 8.899598][ T599] do_init_module+0x2a8/0x33c [ 8.899600][ T599] load_module+0x23cc/0x261c [ 8.899602][ T599] __arm64_sys_finit_module+0x158/0x194 [ 8.899604][ T599] invoke_syscall+0x94/0x2bc [ 8.899607][ T599] el0_svc_common+0x1d8/0x34c [ 8.899609][ T599] do_el0_svc+0x40/0x54 [ 8.899611][ T599] el0_svc+0x94/0x2f0 [ 8.899613][ T599] el0t_64_sync_handler+0x88/0xec [ 8.899615][ T599] el0t_64_sync+0x1b4/0x1b8 [ 8.899618][ T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c) [ 8.899620][ T599] ---[ end trace ed5218e9e5b6e2e6 ]--- Bug: 254441685 Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com Fixes: e178a5beb369 ("gcov: clang support") Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> [5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit a6f810efabfd789d3bbafeacb4502958ec56c5ce) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: If73014531a63392cda8b1ce2607573b85978be30 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
GONG, Ruiqi	2409e26769	BACKPORT: selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context() The following warning was triggered on a hardware environment: SELinux: Converting 162 SID table entries... BUG: sleeping function called from invalid context at __might_sleep+0x60/0x74 0x0 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 5943, name: tar CPU: 7 PID: 5943 Comm: tar Tainted: P O 5.10.0 #1 Call trace: dump_backtrace+0x0/0x1c8 show_stack+0x18/0x28 dump_stack+0xe8/0x15c ___might_sleep+0x168/0x17c __might_sleep+0x60/0x74 __kmalloc_track_caller+0xa0/0x7dc kstrdup+0x54/0xac convert_context+0x48/0x2e4 sidtab_context_to_sid+0x1c4/0x36c security_context_to_sid_core+0x168/0x238 security_context_to_sid_default+0x14/0x24 inode_doinit_use_xattr+0x164/0x1e4 inode_doinit_with_dentry+0x1c0/0x488 selinux_d_instantiate+0x20/0x34 security_d_instantiate+0x70/0xbc d_splice_alias+0x4c/0x3c0 ext4_lookup+0x1d8/0x200 [ext4] __lookup_slow+0x12c/0x1e4 walk_component+0x100/0x200 path_lookupat+0x88/0x118 filename_lookup+0x98/0x130 user_path_at_empty+0x48/0x60 vfs_statx+0x84/0x140 vfs_fstatat+0x20/0x30 __se_sys_newfstatat+0x30/0x74 __arm64_sys_newfstatat+0x1c/0x2c el0_svc_common.constprop.0+0x100/0x184 do_el0_svc+0x1c/0x2c el0_svc+0x20/0x34 el0_sync_handler+0x80/0x17c el0_sync+0x13c/0x140 SELinux: Context system_u:object_r:pssp_rsyslog_log_t:s0:c0 is not valid (left unmapped). It was found that within a critical section of spin_lock_irqsave in sidtab_context_to_sid(), convert_context() (hooked by sidtab_convert_params.func) might cause the process to sleep via allocating memory with GFP_KERNEL, which is problematic. As Ondrej pointed out [1], convert_context()/sidtab_convert_params.func has another caller sidtab_convert_tree(), which is okay with GFP_KERNEL. Therefore, fix this problem by adding a gfp_t argument for convert_context()/sidtab_convert_params.func and pass GFP_KERNEL/_ATOMIC properly in individual callers. Bug: 254441685 Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221018120111.1474581-1-gongruiqi1@huawei.com/ [1] Reported-by: Tan Ninghao <tanninghao1@huawei.com> Fixes: ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve performance") Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> [PM: wrap long BUG() output lines, tweak subject line] Signed-off-by: Paul Moore <paul@paul-moore.com> (cherry picked from commit abe3c631447dcd1ba7af972fe6f054bee6f136fa) [Lee: Trivial white-space differences] Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: Id5d763b0e858d917629c95005ce982d421a3f54f Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Johannes Berg	6cc485a1e1	UPSTREAM: wifi: mac80211_hwsim: set virtio device ready in probe() Just like a similar commit to arch/um/drivers/virt-pci.c, call virtio_device_ready() to make this driver work after commit b4ec69d7e09 ("virtio: harden vring IRQ"), since the driver uses the virtqueues in the probe function. (The virtio core sets the device ready when probe returns.) Bug: 254441685 Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ") Fixes: 5d44fe7c9808 ("mac80211_hwsim: add frame transmission support over virtio") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20220613210401.327958-1-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> (cherry picked from commit 3f3558c8054f82950b6decf928738306f556edf3) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I570a33f2f49de46a46005faa772e5aaec4ef6be6 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:39 +00:00
Jérôme Pouiller	5bc94649a1	UPSTREAM: dma-buf: fix use of DMA_BUF_SET_NAME_{A,B} in userspace The typedefs u32 and u64 are not available in userspace. Thus user get an error he try to use DMA_BUF_SET_NAME_A or DMA_BUF_SET_NAME_B: $ gcc -Wall -c -MMD -c -o ioctls_list.o ioctls_list.c In file included from /usr/include/x86_64-linux-gnu/asm/ioctl.h:1, from /usr/include/linux/ioctl.h:5, from /usr/include/asm-generic/ioctls.h:5, from ioctls_list.c:11: ioctls_list.c:463:29: error: ‘u32’ undeclared here (not in a function) 463 \| { "DMA_BUF_SET_NAME_A", DMA_BUF_SET_NAME_A, -1, -1 }, // linux/dma-buf.h \| ^~~~~~~~~~~~~~~~~~ ioctls_list.c:464:29: error: ‘u64’ undeclared here (not in a function) 464 \| { "DMA_BUF_SET_NAME_B", DMA_BUF_SET_NAME_B, -1, -1 }, // linux/dma-buf.h \| ^~~~~~~~~~~~~~~~~~ The issue was initially reported here[1]. [1]: https://github.com/jerome-pouiller/ioctl/pull/14 Bug: 254441685 Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com> Reviewed-by: Christian König <christian.koenig@amd.com> Fixes: a5bff92eaac4 ("dma-buf: Fix SET_NAME ioctl uapi") CC: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/20220517072708.245265-1-Jerome.Pouiller@silabs.com Signed-off-by: Christian König <christian.koenig@amd.com> (cherry picked from commit 7c3e9fcad9c7d8bb5d69a576044fb16b1d2e8a01) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: If83a6fecc7ef885ca070214b4c03d317851f207a Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Jann Horn	018ddec939	UPSTREAM: usb: raw-gadget: fix handling of dual-direction-capable endpoints Under dummy_hcd, every available endpoint is either IN or OUT capable. But with some real hardware, there are endpoints that support both IN and OUT. In particular, the PLX 2380 has four available endpoints that each support both IN and OUT. raw-gadget currently gets confused and thinks that any endpoint that is usable as an IN endpoint can never be used as an OUT endpoint. Fix it by looking at the direction in the configured endpoint descriptor instead of looking at the hardware capabilities. With this change, I can use the PLX 2380 with raw-gadget. Bug: 254441685 Fixes: f2c2e717642c ("usb: gadget: add raw-gadget interface") Cc: stable <stable@vger.kernel.org> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20220126205214.2149936-1-jannh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 292d2c82b105d92082c2120a44a58de9767e44f1) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I7e24ca4777f3aa2a53e2d85947a1a469282f9ee9 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Wei Yongjun	2781dc2d53	UPSTREAM: mac80211_hwsim: use GFP_ATOMIC under spin lock A spin lock is taken here so we should use GFP_ATOMIC. Bug: 254441685 Fixes: 5d44fe7c9808 ("mac80211_hwsim: add frame transmission support over virtio") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Link: https://lore.kernel.org/r/20200422020154.112088-1-weiyongjun1@huawei.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> (cherry picked from commit 0379861217dc2dd46e3bc517010060065b0dd6fc) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I60bb1e94123f511827c53cbe3f3706c005652758 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Lee Jones	5234756182	ANDROID: sign-file: Fix warning when OPENSSL_NO_ENGINE is set Place drain_openssl_errors() function under the same build constraints as the code that calls it. scripts/sign-file.c:96:13: warning: unused function 'drain_openssl_errors' [-Wunused-function] static void drain_openssl_errors(void) ^ 1 warning generated. For some reason this wasn't picked-up on during automated testing. Fixes: e9d39639a529 ("FROMLIST: sign-file: Use OpenSSL provided define to compile out deprecated APIs") Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I3b337a9deac4ee83cb780792ece8f5f701a01f5f (cherry picked from commit fad17703b529eeb423eab346ffb8e8fd16baf745) Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Lee Jones	acfd5720b4	FROMLIST: sign-file: Use OpenSSL provided define to compile out deprecated APIs OpenSSL's ENGINE API is deprecated in OpenSSL v3.0. Use OPENSSL_NO_ENGINE to disallow its use and fall back on the BIO API. This is required for fully hermetic builds in android-kernel. Link: https://lore.kernel.org/lkml/20211005161833.1522737-1-lee.jones@linaro.org/ Fixes: bce40b72a381b ("ANDROID: Disable hermetic toolchain for allmodconfig builds") Co-developed-by: Adam Langley <agl@google.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I5ecac477c274ef040934710fd4a042c133942e34 (cherry picked from commit e9d39639a5297c1601f025c8fddd30a936fedc16) Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Tim Zimmermann	a997cbe640	Revert "kbuild: avoid static_assert for genksyms" This reverts commit `436b01c`. Reason for revert: This patch was not included in mainline. Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
jsdizkcksv	46b2e111a2	ARM64: defconfig: add cgroup blk support * Android16 QPR1 Apex mounting depends on it, otherwise it will cause apexed-failed failure and boot loop Change-Id: 35fbbabda99a5ef82684ed6e83542bfac28f0a70 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Al Viro	79318131b6	UPSTREAM: define __poll_t, annotate constants Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:33 +00:00
Magnus Karlsson	ddcd04f65d	BACKPORT: dev: packet: make packet_direct_xmit a common function The new dev_direct_xmit will be used by AF_XDP in later commits. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:42:12 +00:00
Magnus Karlsson	93a00e6193	UPSTREAM: net: add umem reference in netdev{_rx}_queue These references to the umem will be used to store information on what kind of AF_XDP umem that is bound to a queue id, if any. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:41:52 +00:00
Heiko Carstens	77ee6fb69f	BACKPORT: epoll: fix compat syscall wire up of epoll_pwait2 Commit b0a0c2615f6f ("epoll: wire up syscall epoll_pwait2") wired up the 64 bit syscall instead of the compat variant in a couple of places. Fixes: b0a0c2615f6f ("epoll: wire up syscall epoll_pwait2") Change-Id: Ida2267e953240c036d64bc61177ee72bb20f35a3 Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:40:47 +00:00
Willem de Bruijn	ae0c687aad	BACKPORT: epoll: wire up syscall epoll_pwait2 Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com Change-Id: I48dfae6f721b24ebc53de603e393289954a95908 Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:40:24 +00:00
Willem de Bruijn	a4ca32b58e	UPSTREAM: epoll: add syscall epoll_pwait2 Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise. int epoll_pwait2(int fd, struct epoll_event events, int maxevents, const struct timespec timeout, const sigset_t *sigset); The underlying hrtimer is already programmed with nsec resolution. pselect and ppoll also set nsec resolution timeout with timespec. The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs the same. For timespec, only support this new interface on 2038 aware platforms that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME. Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com Change-Id: I8cb4e756aacb4bee7cbe1c2cb5320a59e07626f8 Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:35 +00:00
Willem de Bruijn	b144b3a80f	BACKPORT: epoll: convert internal api to timespec64 Patch series "add epoll_pwait2 syscall", v4. Enable nanosecond timeouts for epoll. Analogous to pselect and ppoll, introduce an epoll_wait syscall variant that takes a struct timespec instead of int timeout. This patch (of 4): Make epoll more consistent with select/poll: pass along the timeout as timespec64 pointer. In anticipation of additional changes affecting all three polling mechanisms: - add epoll_pwait2 syscall with timespec semantics, and share poll_select_set_timeout implementation. - compute slack before conversion to absolute time, to save one ktime_get_ts64 call. Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com Change-Id: Id5851ad620849d202d18a99351a98b8bc3b6820a Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Dominik Brodowski	08c6ece41a	UPSTREAM: fs: add do_epoll_() helpers; remove internal calls to sys_epoll_() Using the helper functions do_epoll_create() and do_epoll_wait() allows us to remove in-kernel calls to the related syscall functions. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Alexander Viro <viro@zeniv.linux.org.uk> Change-Id: I73c8f43d7db845e0b7ec5ec00fbc575fbcb8ca5e Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Al Viro	17e8f8bca2	UPSTREAM: close_range(): fix the logics in descriptor table trimming commit 678379e1d4f7443b170939525d3312cfc37bf86b upstream. Cloning a descriptor table picks the size that would cover all currently opened files. That's fine for clone() and unshare(), but for close_range() there's an additional twist - we clone before we close, and it would be a shame to have close_range(3, ~0U, CLOSE_RANGE_UNSHARE) leave us with a huge descriptor table when we are not going to keep anything past stderr, just because some large file descriptor used to be open before our call has taken it out. Unfortunately, it had been dealt with in an inherently racy way - sane_fdtable_size() gets a "don't copy anything past that" argument (passed via unshare_fd() and dup_fd()), close_range() decides how much should be trimmed and passes that to unshare_fd(). The problem is, a range that used to extend to the end of descriptor table back when close_range() had looked at it might very well have stuff grown after it by the time dup_fd() has allocated a new files_struct and started to figure out the capacity of fdtable to be attached to that. That leads to interesting pathological cases; at the very least it's a QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes a snapshot of descriptor table one might have observed at some point. Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination of unshare(CLONE_FILES) with plain close_range(), ending up with a weird state that would never occur with unshare(2) is confusing, to put it mildly. It's not hard to get rid of - all it takes is passing both ends of the range down to sane_fdtable_size(). There we are under ->files_lock, so the race is trivially avoided. So we do the following: * switch close_files() from calling unshare_fd() to calling dup_fd(). * undo the calling convention change done to unshare_fd() in 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE" * introduce struct fd_range, pass a pointer to that to dup_fd() and sane_fdtable_size() instead of "trim everything past that point" they are currently getting. NULL means "we are not going to be punching any holes"; NR_OPEN_MAX is gone. * make sane_fdtable_size() use find_last_bit() instead of open-coding it; it's easier to follow that way. * while we are at it, have dup_fd() report errors by returning ERR_PTR(), no need to use a separate int *errorp argument. Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE" Cc: stable@vger.kernel.org Change-Id: I6782a2edf98970b6c2d662048061e28f7e57b9c9 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Linus Torvalds	98e936fcbf	UPSTREAM: fs: fix fd table size alignment properly [ Upstream commit d888c83fcec75194a8a48ccd283953bdba7b2550 ] Jason Donenfeld reports that my commit 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG") doesn't work, and the reason is an embarrassing brown-paper-bag bug. Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the reason they might not be aligned is because the incoming 'max_fd' argument might not be aligned. But aligining the argument - while simple - will cause a "infinitely big" maxfd (eg NR_OPEN_MAX) to just overflow to zero. Which most definitely isn't what we want either. The obvious fix was always just to do the alignment last, but I had moved it earlier just to make the patch smaller and the code look simpler. Duh. It certainly made _me_ look simple. Fixes: 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG") Reported-and-tested-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Fedor Pchelkin <aissur0002@gmail.com> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Change-Id: I6d3a0c28896e8cf8cab30f60679aab785aeee193 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Linus Torvalds	18456e65a9	UPSTREAM: fs: fd tables have to be multiples of BITS_PER_LONG [ Upstream commit 1c24a186398f59c80adb9a967486b65c1423a59d ] This has always been the rule: fdtables have several bitmaps in them, and as a result they have to be sized properly for bitmaps. We walk those bitmaps in chunks of 'unsigned long' in serveral cases, but even when we don't, we use the regular kernel bitops that are defined to work on arrays of 'unsigned long', not on some byte array. Now, the distinction between arrays of bytes and 'unsigned long' normally only really ends up being noticeable on big-endian systems, but Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps() could be called with an argument that wasn't even a multiple of BITS_PER_BYTE. And then it fails to do the proper copy even on little-endian machines. The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which didn't actually sanitize the fdtable size sufficiently, and never made sure it had the proper BITS_PER_LONG alignment. That's partly because the alignment historically came not from having to explicitly align things, but simply from previous fdtable sizes, and from count_open_files(), which counts the file descriptors by walking them one 'unsigned long' word at a time and thus naturally ends up doing sizing in the proper 'chunks of unsigned long'. But with the introduction of close_range(), we now have an external source of "this is how many files we want to have", and so sane_fdtable_size() needs to do a better job. This also adds that explicit alignment to alloc_fdtable(), although there it is mainly just for documentation at a source code level. The arithmetic we do there to pick a reasonable fdtable size already aligns the result sufficiently. In fact,clang notices that the added ALIGN() in that function doesn't actually do anything, and does not generate any extra code for it. It turns out that gcc ends up confusing itself by combining a previous constant-sized shift operation with the variable-sized shift operations in roundup_pow_of_two(). And probably due to that doesn't notice that the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't matter. Having the explicit alignment makes sense, and would actually matter on a 128-bit architecture if we ever go there. This also adds big comments above both functions about how fdtable sizes have to have that BITS_PER_LONG alignment. Fixes: 60997c3d45d9 ("close_range: add CLOSE_RANGE_UNSHARE") Reported-by: Fedor Pchelkin <aissur0002@gmail.com> Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/ Tested-and-acked-by: Christian Brauner <brauner@kernel.org> Change-Id: Ib64319f20fecec1c367bf0167e1b14b47751537a Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Al Viro	a9ff632e95	UPSTREAM: alloc_fdtable(): change calling conventions. [ Upstream commit 1d3b4bec3ce55e0c46cdce7d0402dbd6b4af3a3d ] First of all, tell it how many slots do we want, not which slot is wanted. It makes one caller (dup_fd()) more straightforward and doesn't harm another (expand_fdtable()). Furthermore, make it return ERR_PTR() on failure rather than returning NULL. Simplifies the callers. Simplify the size calculation, while we are at it - note that we always have slots_wanted greater than BITS_PER_LONG. What the rules boil down to is * use the smallest power of two large enough to give us that many slots * on 32bit skip 64 and 128 - the minimal capacity we want there is 256 slots (i.e. 1Kb fd array). * on 64bit don't skip anything, the minimal capacity is 128 - and we'll never be asked for 64 or less. 128 slots means 1Kb fd array, again. * on 128bit, if that ever happens, don't skip anything - we'll never be asked for 128 or less, so the fd array allocation will be at least 2Kb. Reviewed-by: Christian Brauner <brauner@kernel.org> Change-Id: Id6ae21da951bdb42fdb4361285ada3eb377f3a6b Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Ulrich Hecht <uli@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Christian Brauner	607629b0d4	UPSTREAM: file: simplify logic in __close_range() It never looked too pleasant and it doesn't really buy us anything anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the current maximum under the lock for it anyway. This also makes the logic easier to follow. Cc: Christoph Hellwig <hch@lst.de> Cc: Giuseppe Scrivano <gscrivan@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Change-Id: I72be5b078d7edc061318b0eda36a351e1dc01e56 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Christian Brauner	a14e424f63	UPSTREAM: file: fix close_range() for unshare+cloexec syzbot reported a bug when putting the last reference to a tasks file descriptor table. Debugging this showed we didn't recalculate the current maximum fd number for CLOSE_RANGE_UNSHARE \| CLOSE_RANGE_CLOEXEC after we unshared the file descriptors table. So max_fd could exceed the current fdtable maximum causing us to set excessive bits. As a concrete example, let's say the user requested everything from fd 4 to ~0UL to be closed and their current fdtable size is 256 with their highest open fd being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new fdtable which has room for 64 file descriptors since that is the lowest fdtable size we accept. But now max_fd will still point to 255 and needs to be adjusted. Fix this by retrieving the correct maximum fd value in __range_cloexec(). Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC") Fixes: fec8a6a69103 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE \| CLOSE_RANGE_CLOEXEC") Cc: Christoph Hellwig <hch@lst.de> Cc: Giuseppe Scrivano <gscrivan@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Change-Id: Ib6aeb02d2b4556ba048cd4ca8150f9d2c29abdb3 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:34 +00:00
Christian Brauner	17f0e04489	UPSTREAM: close_range: unshare all fds for CLOSE_RANGE_UNSHARE \| CLOSE_RANGE_CLOEXEC After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE. When CLOSE_RANGE_UNSHARE is specified the caller will receive a private file descriptor table in case their file descriptor table is currently shared. For the case where the caller has requested all file descriptors to be actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that the caller does not need any of the file descriptors anymore and will optimize the close operation by only copying all files in the range from 0 to 3 and no others. However, if the caller requested CLOSE_RANGE_CLOEXEC together with CLOSE_RANGE_UNSHARE the caller wants to still make use of the file descriptors so the kernel needs to copy all of them and can't optimize. The original patch didn't account for this and thus could cause oopses as evidenced by the syzbot report because it assumed that all fds had been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case. syzbot reported ================================================================== BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline] BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline] BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274 Read of size 8 at addr 0000000000000077 by task syz-executor511/8522 CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 __kasan_report mm/kasan/report.c:549 [inline] kasan_report.cold+0x5/0x37 mm/kasan/report.c:562 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline] atomic_long_read include/asm-generic/atomic-long.h:29 [inline] filp_close+0x22/0x170 fs/open.c:1274 close_files fs/file.c:402 [inline] put_files_struct fs/file.c:417 [inline] put_files_struct+0x1cc/0x350 fs/file.c:414 exit_files+0x12a/0x170 fs/file.c:435 do_exit+0xb4f/0x2a00 kernel/exit.c:818 do_group_exit+0x125/0x310 kernel/exit.c:920 get_signal+0x428/0x2100 kernel/signal.c:2792 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x447039 Code: Unable to access opcode bytes at RIP 0x44700f. RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039 RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c ================================================================== syzbot has tested the proposed patch and the reproducer did not trigger any issue: Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com Tested on: commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN.. git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3 dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93 compiler: gcc (GCC) 10.1.0-syz 20200507 Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC") Cc: Giuseppe Scrivano <gscrivan@redhat.com> Cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com Change-Id: I92120c3f81e2811d3a2a1a76fa60cf2e75a56f42 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Giuseppe Scrivano	726b0b1d9e	UPSTREAM: fs, close_range: add flag CLOSE_RANGE_CLOEXEC When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't immediately close the files but it sets the close-on-exec bit. It is useful for e.g. container runtimes that usually install a seccomp profile "as late as possible" before execv'ing the container process itself. The container runtime could either do: 1 2 - install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0); - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile(); - execve(...); - execve(...); Both alternative have some disadvantages. In the first variant the seccomp_profile cannot block the close_range syscall, as well as opendir/read/close/... for the fallback on older kernels. In the second variant, close_range() can be used only on the fds that are not going to be needed by the runtime anymore, and it must be potentially called multiple times to account for the different ranges that must be closed. Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues. The runtime is able to use the existing open fds, the seccomp profile can block close_range() and the syscalls used for its fallback. Change-Id: I1c84a733698c2853a0126cd22960ada25b229c5a Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Christian Brauner	f84e9989d8	BACKPORT: arch: wire-up close_range() This wires up the close_range() syscall into all arches at once. Suggested-by: Arnd Bergmann <arnd@arndb.de> Change-Id: Ib962b01f3a490c901b0e0f51e436df1a26887ca4 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Christian Brauner	8276673b59	BACKPORT: close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before execve(). This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE. This expands {dup,unshare)_fd() to take a max_fds argument that indicates the maximum number of file descriptors to copy from the old struct files. When the user requests that all file descriptors are supposed to be closed via close_range(min, max) then we can cap via unshare_fd(min) and hence don't need to do any of the heavy fput() work for everything above min. The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in fact currently share our file descriptor table we create a new private copy. We then close all fds in the requested range and finally after we're done we install the new fd table. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I0813045886501e40a45693ee1edad50bdf2b66e5 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Christian Brauner	9c4c8a6623	BACKPORT: open: add close_range() This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. I was contacted by FreeBSD as they wanted to have the same close_range() syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to merge close_range() into FreeBSD already in April: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on us to merge this to switch on Linux as well: https://bugs.python.org/issue38061 The syscall came up in a recent discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this manner has been requested by various people over time. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive / unshare(CLONE_FILES); / we don't want anything past stderr here / close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). (Please note, unshare(CLONE_FILES) should only be needed if the calling task is multi-threaded and shares the file descriptor table with another thread in which case two threads could race with one thread allocating file descriptors and the other one closing them via close_range(). For the general case close_range() before the execve() is sufficient.) Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc/<pid>/fd/ and calling close() on each file descriptor. From looking at various large(ish) userspace code bases this or similar patterns are very common in: - service managers (cf. [4]) - libcs (cf. [6]) - container runtimes (cf. [5]) - programming language runtimes/standard libraries - Python (cf. [2]) - Rust (cf. [7], [8]) As Dmitry pointed out there's even a long-standing glibc bug about missing kernel support for this task (cf. [3]). In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery (cf. comment [8] on Rust). The performance is striking. For good measure, comparing the following simple close_all_fds() userspace implementation that is essentially just glibc's version in [6]: static int close_all_fds(void) { int dir_fd; DIR dir; struct dirent direntp; dir = opendir("/proc/self/fd"); if (!dir) return -1; dir_fd = dirfd(dir); while ((direntp = readdir(dir))) { int fd; if (strcmp(direntp->d_name, ".") == 0) continue; if (strcmp(direntp->d_name, "..") == 0) continue; fd = atoi(direntp->d_name); if (fd == dir_fd \|\| fd == 0 \|\| fd == 1 \|\| fd == 2) continue; close(fd); } closedir(dir); return 0; } to close_range() yields: 1. closing 4 open files: - close_all_fds(): ~280 us - close_range(): ~24 us 2. closing 1000 open files: - close_all_fds(): ~5000 us - close_range(): ~800 us close_range() is designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task. Instead, callers can specify an upper bound. This is e.g. useful for scenarios where specific file descriptors are created with well-known numbers that are supposed to be excluded from getting closed. For extra paranoia close_range() comes with a flags argument. This can e.g. be used to implement extension. Once can imagine userspace wanting to stop at the first error instead of ignoring errors under certain circumstances. There might be other valid ideas in the future. In any case, a flag argument doesn't hurt and keeps us on the safe side. From an implementation side this is kept rather dumb. It saw some input from David and Jann but all nonsense is obviously my own! - Errors to close file descriptors are currently ignored. (Could be changed by setting a flag in the future if needed.) - __close_range() is a rather simplistic wrapper around __close_fd(). My reasoning behind this is based on the nature of how __close_fd() needs to release an fd. But maybe I misunderstood specifics: We take the files_lock and rcu-dereference the fdtable of the calling task, we find the entry in the fdtable, get the file and need to release files_lock before calling filp_close(). In the meantime the fdtable might have been altered so we can't just retake the spinlock and keep the old rcu-reference of the fdtable around. Instead we need to grab a fresh reference to the fdtable. If my reasoning is correct then there's really no point in fancyfying __close_range(): We just need to rcu-dereference the fdtable of the calling task once to cap the max_fd value correctly and then go on calling __close_fd() in a loop. /* References */ [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/ [2]: `9e4f2f3a6b/Modules/_posixsubprocess.c (L220)` [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7 [4]: `5238e95759/src/basic/fd-util.c (L217)` [5]: `ddf4b77e11/src/lxc/start.c (L236)` [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17 Note that this is an internal implementation that is not exported. Currently, libc seems to not provide an exported version of this because of missing kernel support to do this. Note, in a recent patch series Florian made grantpt() a nop thereby removing the code referenced here. [7]: https://github.com/rust-lang/rust/issues/12148 [8]: `5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)` Rust's solution is slightly different but is equally unperformant. Rust calls getdtablesize() which is a glibc library function that simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then goes on to call close() on each fd. That's obviously overkill for most tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or OPEN_MAX. Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set to 1024. Even in this case, there's a very high chance that in the common case Rust is calling the close() syscall 1021 times pointlessly if the task just has 0, 1, and 2 open. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I2f7abcf9210a1e79855837d1b7c580cc7f7a38e2 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kyle Evans <self@kyle-evans.net> Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:03 +00:00
mcdofrenchfreis	b0fa8cb47c	nemo_defconfig: Slightly lower SLMK timeout Signed-off-by: mcdofrenchfreis <xyzevan@androidist.net>	2026-01-07 18:51:30 +00:00
RoHaN	d7f4dde444	nemo_defconfig: Enable SLMK	2026-01-07 18:51:30 +00:00

1 2 3 4 5 ...

752138 Commits