kernel_realme_nemo

Author	SHA1	Message	Date
Jérôme Pouiller	5bc94649a1	UPSTREAM: dma-buf: fix use of DMA_BUF_SET_NAME_{A,B} in userspace The typedefs u32 and u64 are not available in userspace. Thus user get an error he try to use DMA_BUF_SET_NAME_A or DMA_BUF_SET_NAME_B: $ gcc -Wall -c -MMD -c -o ioctls_list.o ioctls_list.c In file included from /usr/include/x86_64-linux-gnu/asm/ioctl.h:1, from /usr/include/linux/ioctl.h:5, from /usr/include/asm-generic/ioctls.h:5, from ioctls_list.c:11: ioctls_list.c:463:29: error: ‘u32’ undeclared here (not in a function) 463 \| { "DMA_BUF_SET_NAME_A", DMA_BUF_SET_NAME_A, -1, -1 }, // linux/dma-buf.h \| ^~~~~~~~~~~~~~~~~~ ioctls_list.c:464:29: error: ‘u64’ undeclared here (not in a function) 464 \| { "DMA_BUF_SET_NAME_B", DMA_BUF_SET_NAME_B, -1, -1 }, // linux/dma-buf.h \| ^~~~~~~~~~~~~~~~~~ The issue was initially reported here[1]. [1]: https://github.com/jerome-pouiller/ioctl/pull/14 Bug: 254441685 Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com> Reviewed-by: Christian König <christian.koenig@amd.com> Fixes: a5bff92eaac4 ("dma-buf: Fix SET_NAME ioctl uapi") CC: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/20220517072708.245265-1-Jerome.Pouiller@silabs.com Signed-off-by: Christian König <christian.koenig@amd.com> (cherry picked from commit 7c3e9fcad9c7d8bb5d69a576044fb16b1d2e8a01) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: If83a6fecc7ef885ca070214b4c03d317851f207a Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:38 +00:00
Al Viro	79318131b6	UPSTREAM: define __poll_t, annotate constants Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-20 14:21:33 +00:00
Willem de Bruijn	ae0c687aad	BACKPORT: epoll: wire up syscall epoll_pwait2 Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com Change-Id: I48dfae6f721b24ebc53de603e393289954a95908 Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:40:24 +00:00
Giuseppe Scrivano	726b0b1d9e	UPSTREAM: fs, close_range: add flag CLOSE_RANGE_CLOEXEC When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't immediately close the files but it sets the close-on-exec bit. It is useful for e.g. container runtimes that usually install a seccomp profile "as late as possible" before execv'ing the container process itself. The container runtime could either do: 1 2 - install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0); - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile(); - execve(...); - execve(...); Both alternative have some disadvantages. In the first variant the seccomp_profile cannot block the close_range syscall, as well as opendir/read/close/... for the fallback on older kernels. In the second variant, close_range() can be used only on the fds that are not going to be needed by the runtime anymore, and it must be potentially called multiple times to account for the different ranges that must be closed. Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues. The runtime is able to use the existing open fds, the seccomp profile can block close_range() and the syscalls used for its fallback. Change-Id: I1c84a733698c2853a0126cd22960ada25b229c5a Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Christian Brauner	f84e9989d8	BACKPORT: arch: wire-up close_range() This wires up the close_range() syscall into all arches at once. Suggested-by: Arnd Bergmann <arnd@arndb.de> Change-Id: Ib962b01f3a490c901b0e0f51e436df1a26887ca4 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Christian Brauner	8276673b59	BACKPORT: close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before execve(). This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE. This expands {dup,unshare)_fd() to take a max_fds argument that indicates the maximum number of file descriptors to copy from the old struct files. When the user requests that all file descriptors are supposed to be closed via close_range(min, max) then we can cap via unshare_fd(min) and hence don't need to do any of the heavy fput() work for everything above min. The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in fact currently share our file descriptor table we create a new private copy. We then close all fds in the requested range and finally after we're done we install the new fd table. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I0813045886501e40a45693ee1edad50bdf2b66e5 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2026-01-18 09:39:33 +00:00
Song Liu	fa9dfc9318	BACKPORT: perf, bpf: Introduce PERF_RECORD_BPF_EVENT For better performance analysis of BPF programs, this patch introduces PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program load/unload information to user space. Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs. The following example shows kernel symbols for a BPF program with 7 sub programs: ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed for simple profiling. For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and gather more information about these (sub) programs via sys_bpf. Change-Id: I8ed02f808501c32f406108c282c853a56d0dcc25 Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradeaed.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:54 +00:00
Song Liu	d71e3a985e	UPSTREAM: perf, bpf: Introduce PERF_RECORD_KSYMBOL For better performance analysis of dynamically JITed and loaded kernel functions, such as BPF programs, this patch introduces PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol register/unregister information to user space. The following data structure is used for PERF_RECORD_KSYMBOL. /* * struct { * struct perf_event_header header; * u64 addr; * u32 len; * u16 ksym_type; * u16 flags; * char name[]; * struct sample_id sample_id; * }; */ Change-Id: I3e6901ef579878015f6a75d15699230882f79e1f Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:54 +00:00
Stanislav Fomichev	3d10143f13	BACKPORT: bpf/flow_dissector: support ipv6 flow_label and BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL Add support for exporting ipv6 flow label via bpf_flow_keys. Export flow label from bpf_flow.c and also return early when BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed. Acked-by: Petar Penkov <ppenkov@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Petar Penkov <ppenkov@google.com> Change-Id: I6b4c3771022f19c184867fb6045351d59cfde68b Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:52 +00:00
Stanislav Fomichev	0a5f0608b3	UPSTREAM: bpf/flow_dissector: pass input flags to BPF flow dissector program C flow dissector supports input flags that tell it to customize parsing by either stopping early or trying to parse as deep as possible. Pass those flags to the BPF flow dissector so it can make the same decisions. In the next commits I'll add support for those flags to our reference bpf_flow.c v3: * Export copy of flow dissector flags instead of moving (Alexei Starovoitov) Acked-by: Petar Penkov <ppenkov@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Petar Penkov <ppenkov@google.com> Change-Id: I46a68f8b2249915fff5d97a1394ea662d9a0ac46 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:51 +00:00
Lorenz Bauer	07ac4380c7	UPSTREAM: bpf: respect size hint to BPF_PROG_TEST_RUN if present Use data_size_out as a size hint when copying test output to user space. ENOSPC is returned if the output buffer is too small. Callers which so far did not set data_size_out are not affected. Change-Id: Ic1a42d1903e96a26a27a56489b75be05c58996ff Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:50 +00:00
Alan Maguire	2f08c496aa	BACKPORT: bpf: fix whitespace for ENCAP_L2 defines in bpf.h replace tab after #define with space in line with rest of definitions Change-Id: I29e1364abd94abe1e251816032890f895e0159f0 Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:50 +00:00
Stanislav Fomichev	2663f489f2	BACKPORT: bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT Performance impact should be minimal because it's under a new BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled. Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Change-Id: I19814edf78101f87e6dc364343f8173d9e230850 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:49 +00:00
Stanislav Fomichev	306e5caca9	UPSTREAM: bpf: support cloning sk storage on accept() Add new helper bpf_sk_storage_clone which optionally clones sk storage and call it from sk_clone_lock. Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Change-Id: Iee30d2442b76f6fd7904829314e63f3391e7d811 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:49 +00:00
Jakub Sitnicki	f65ba6958c	UPSTREAM: bpf: Make dst_port field in struct bpf_sock 16-bit wide [ Upstream commit 4421a582718ab81608d8486734c18083b822390d ] Menglong Dong reports that the documentation for the dst_port field in struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the field is a zero-padded 16-bit integer in network byte order. The value appears to the BPF user as if laid out in memory as so: offsetof(struct bpf_sock, dst_port) + 0 <port MSB> + 8 <port LSB> +16 0x00 +24 0x00 32-, 16-, and 8-bit wide loads from the field are all allowed, but only if the offset into the field is 0. 32-bit wide loads from dst_port are especially confusing. The loaded value, after converting to host byte order with bpf_ntohl(dst_port), contains the port number in the upper 16-bits. Remove the confusion by splitting the field into two 16-bit fields. For backward compatibility, allow 32-bit wide loads from offsetof(struct bpf_sock, dst_port). While at it, allow loads 8-bit loads at offset [0] and [1] from dst_port. Reported-by: Menglong Dong <imagedong@tencent.com> Change-Id: Id86817d538b4f552ca112639c0a40fb2d8bd9eb9 Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:42 +00:00
Willem de Bruijn	87626b49ff	UPSTREAM: bpf: Add gso_size to __sk_buff BPF programs may want to know whether an skb is gso. The canonical answer is skb_is_gso(skb), which tests that gso_size != 0. Expose this field in the same manner as gso_segs. That field itself is not a sufficient signal, as the comment in skb_shared_info makes clear: gso_segs may be zero, e.g., from dodgy sources. Also prepare net/bpf/test_run for upcoming BPF_PROG_TEST_RUN tests of the feature. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200303200503.226217-2-willemdebruijn.kernel@gmail.com Note: backported without changes to net/bpf/test_run.c (cherry picked from commit cf62089b0edd7e74a1f474844b4d9f7b5697fb5c) Signed-off-by: Maciej Żenczykowski <maze@google.com> Change-Id: I1f7d1b49e5ac35f18546d468e3847deaae5056ca Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:41 +00:00
Petar Penkov	e05bdae21f	BACKPORT: bpf: add bpf_tcp_gen_syncookie helper This helper function allows BPF programs to try to generate SYN cookies, given a reference to a listener socket. The function works from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a socket in both cases. Change-Id: Iac961811f33901dc0a63365669a79dcf2762fecf Signed-off-by: Petar Penkov <ppenkov@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:40 +00:00
Stanislav Fomichev	74d36e67b8	UPSTREAM: bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and msg_src_ip6 Add explicit check for u64 loads of user_ip6 and msg_src_ip6 and update the comment. Cc: Yonghong Song <yhs@fb.com> Change-Id: Id82bfd77c5a0297ef3473fd2576d125baaed1b02 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:40 +00:00
Stanislav Fomichev	8bb03b088d	UPSTREAM: bpf: add icsk_retransmits to bpf_tcp_sock Add some inet_connection_sock fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Change-Id: I94a94df91b77033bea7d1581b03273b778fd54e7 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:40 +00:00
Stanislav Fomichev	17edf92d9e	UPSTREAM: bpf: add dsack_dups/delivered{, _ce} to bpf_tcp_sock Add more fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Change-Id: I17a774bdb0e6b2f77f08ec80d5fcc1fba11cca95 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:40 +00:00
Wei Wang	84257900a0	BACKPORT: tcp: add dsack blocks received stats Introduce a new TCP stat to record the number of DSACK blocks received (RFC4989 tcpEStatsStackDSACKDups) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Change-Id: I794c95d3782f2e3ca1a875b50241151c86ad995b Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:39 +00:00
Wei Wang	c904e0b051	BACKPORT: tcp: add data bytes retransmitted stats Introduce a new TCP stat to record the number of bytes retransmitted (RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Change-Id: Iae6e0688405c758c84a6e77b6fe2139867493d3d Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:39 +00:00
Wei Wang	c1c7619d6a	BACKPORT: tcp: add data bytes sent stats Introduce a new TCP stat to record the number of bytes sent (RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Change-Id: Ie3c46706662494214863782d78cde8efbb362942 Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:39 +00:00
Yuchung Cheng	6299ef6adf	BACKPORT: tcp: export packets delivery info Export data delivered and delivered with CE marks to 1) SNMP TCPDelivered and TCPDeliveredCE 2) getsockopt(TCP_INFO) 3) Timestamping API SOF_TIMESTAMPING_OPT_STATS Note that for SCM_TSTAMP_ACK, the delivery info in SOF_TIMESTAMPING_OPT_STATS is reported before the info was fully updated on the ACK. These stats help application monitor TCP delivery and ECN status on per host, per connection, even per message level. Change-Id: I8d647905926e63412d579374da3323512a0428e0 Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:39 +00:00
Yousuk Seung	38dbf7edc1	UPSTREAM: tcp: add snd_ssthresh stat in SCM_TIMESTAMPING_OPT_STATS This patch adds TCP_NLA_SND_SSTHRESH stat into SCM_TIMESTAMPING_OPT_STATS that reports tcp_sock.snd_ssthresh. Change-Id: Ib6e0da3409e70da275de094fa7705d825f1f4db9 Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:39 +00:00
Priyaranjan Jha	a82a01b8bf	UPSTREAM: tcp: add ca_state stat in SCM_TIMESTAMPING_OPT_STATS This patch adds TCP_NLA_CA_STATE stat into SCM_TIMESTAMPING_OPT_STATS. It reports ca_state of socket, when timestamp is generated. Change-Id: Icaffeb5d78bd4fdd30d2d8ccf92ba44069b1d33c Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:38 +00:00
Priyaranjan Jha	ebe01d662f	UPSTREAM: tcp: add send queue size stat in SCM_TIMESTAMPING_OPT_STATS This patch adds TCP_NLA_SENDQ_SIZE stat into SCM_TIMESTAMPING_OPT_STATS. It reports no. of bytes present in send queue, when timestamp is generated. Change-Id: I6c015aaa83bc6a693435c0837bf9403ac27f0749 Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:38 +00:00
Stanislav Fomichev	f17e24c578	UPSTREAM: bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type And let it use bpf_sk_storage_{get,delete} helpers to access socket storage. Kernel context (struct bpf_sock_ops_kern) already has sk member, so I just expose it to the BPF hooks. I use PTR_TO_SOCKET_OR_NULL and return NULL in !is_fullsock case. I also export bpf_tcp_sock to make it possible to access tcp socket stats. Cc: Martin Lau <kafai@fb.com> Change-Id: Ic77add758c1d4cb0e2745834749ee796c673c742 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:38 +00:00
Stanislav Fomichev	a7fa09d7c4	UPSTREAM: bpf: export bpf_sock for BPF_PROG_TYPE_CGROUP_SOCK_ADDR prog type And let it use bpf_sk_storage_{get,delete} helpers to access socket storage. Kernel context (struct bpf_sock_addr_kern) already has sk member, so I just expose it to the BPF hooks. Using PTR_TO_SOCKET instead of PTR_TO_SOCK_COMMON should be safe because the hook is called on bind/connect. Cc: Martin Lau <kafai@fb.com> Change-Id: I8ebe12a2f03f15386d5d1288157509053ca123ed Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:38 +00:00
Viet Hoang Tran	0071a81412	UPSTREAM: bpf: allow clearing all sock_ops callback flags The helper function bpf_sock_ops_cb_flags_set() can be used to both set and clear the sock_ops callback flags. However, its current behavior is not consistent. BPF program may clear a flag if more than one were set, or replace a flag with another one, but cannot clear all flags. This patch also updates the documentation to clarify the ability to clear flags of this helper function. Change-Id: Ib0a4971ca5a99c9e1832d7e85ae9bbe7297bdd55 Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:37 +00:00
Alan Maguire	effd8d4066	UPSTREAM: bpf: add layer 2 encap support to bpf_skb_adjust_room commit 868d523535c2 ("bpf: add bpf_skb_adjust_room encap flags") introduced support to bpf_skb_adjust_room for GSO-friendly GRE and UDP encapsulation. For GSO to work for skbs, the inner headers (mac and network) need to be marked. For L3 encapsulation using bpf_skb_adjust_room, the mac and network headers are identical. Here we provide a way of specifying the inner mac header length for cases where L2 encap is desired. Such an approach can support encapsulated ethernet headers, MPLS headers etc. For example to convert from a packet of form [eth][ip][tcp] to [eth][ip][udp][inner mac][ip][tcp], something like the following could be done: headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen; ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_ENCAP_L4_UDP \| BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 \| BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen)); Change-Id: I451ddb130eb13f3e0c2f90fca379f7b931506c33 Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:36 +00:00
Lorenz Bauer	3eab4f19da	BACKPORT: bpf: add helper to check for a valid SYN cookie Using bpf_skc_lookup_tcp it's possible to ascertain whether a packet belongs to a known connection. However, there is one corner case: no sockets are created if SYN cookies are active. This means that the final ACK in the 3WHS is misclassified. Using the helper, we can look up the listening socket via bpf_skc_lookup_tcp and then check whether a packet is a valid SYN cookie ACK. Change-Id: If6df241e53af7fe53f842932fdcfd5afcc5aefd6 Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:36 +00:00
Peter Oskolkov	cd1d5cf2ba	UPSTREAM: bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap This patch adds all needed plumbing in preparation to allowing bpf programs to do IP encapping via bpf_lwt_push_encap. Actual implementation is added in the next patch in the patchset. Of note: - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT prog types in addition to BPF_PROG_TYPE_LWT_IN; - if the skb being encapped has GSO set, encapsulation is limited to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6); - as route lookups are different for ingress vs egress, the single external bpf_lwt_push_encap BPF helper is routed internally to either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs, depending on prog type. v8 changes: fixed a typo. Change-Id: I32bdc99d964398db6535b2fce6aa7b1d7e6262ea Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:35 +00:00
Martin KaFai Lau	1c8826d21f	UPSTREAM: bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the bpf_sock. The userspace has already been using "state", e.g. inet_diag (ss -t) and getsockopt(TCP_INFO). This patch also allows narrow load on the following existing fields: "family", "type", "protocol" and "src_port". Unlike IP address, the load offset is resticted to the first byte for them but it can be relaxed later if there is a use case. This patch also folds __sock_filter_check_size() into bpf_sock_is_valid_access() since it is not called by any where else. All bpf_sock checking is in one place. Change-Id: I4523d9c76caf86351662a65197d27ca69615ed63 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:35 +00:00
John Fastabend	72a4bf702b	UPSTREAM: bpf: sockmap, metadata support for reporting size of msg This adds metadata to sk_msg_md for BPF programs to read the sk_msg size. When the SK_MSG program is running under an application that is using sendfile the data is not copied into sk_msg buffers by default. Rather the BPF program uses sk_msg_pull_data to read the bytes in. This avoids doing the costly memcopy instructions when they are not in fact needed. However, if we don't know the size of the sk_msg we have to guess if needed bytes are available by doing a pull request which may fail. By including the size of the sk_msg BPF programs can check the size before issuing sk_msg_pull_data requests. Additionally, the same applies for sendmsg calls when the application provides multiple iovs. Here the BPF program needs to pull in data to update data pointers but its not clear where the data ends without a size parameter. In many cases "guessing" is not easy to do and results in multiple calls to pull and without bounded loops everything gets fairly tricky. Clean this up by including a u32 size field. Note, all writes into sk_msg_md are rejected already from sk_msg_is_valid_access so nothing additional is needed there. Change-Id: I678f88a9d07c5dbdb593c5b85209764ea37e8efb Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:33 +00:00
John Fastabend	e189d1503c	BACKPORT: bpf: helper to pop data from messages This adds a BPF SK_MSG program helper so that we can pop data from a msg. We use this to pop metadata from a previous push data call. Change-Id: Idcd8ce6393b152481de7d042994a795a10424bec Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:33 +00:00
Dave Watson	ca0aac418d	UPSTREAM: tls: RX path for ktls Add rx path for tls software implementation. recvmsg, splice_read, and poll implemented. An additional sockopt TLS_RX is added, with the same interface as TLS_TX. Either TLX_RX or TLX_TX may be provided separately, or together (with two different setsockopt calls with appropriate keys). Control messages are passed via CMSG in a similar way to transmit. If no cmsg buffer is passed, then only application data records will be passed to userspace, and EIO is returned for other types of alerts. EBADMSG is passed for decryption errors, and EMSGSIZE is passed for framing too big, and EBADMSG for framing too small (matching openssl semantics). EINVAL is returned for TLS versions that do not match the original setsockopt call. All are unrecoverable. strparser is used to parse TLS framing. Decryption is done directly in to userspace buffers if they are large enough to support it, otherwise sk_cow_data is called (similar to ipsec), and buffers are decrypted in place and copied. splice_read always decrypts in place, since no buffers are provided to decrypt in to. sk_poll is overridden, and only returns POLLIN if a full TLS message is received. Otherwise we wait for strparser to finish reading a full frame. Actual decryption is only done during recvmsg or splice_read calls. Change-Id: I7118bc992c60dd0d92ac1a1a5bf8d189d6ff303a Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:22 +00:00
Dmitry V. Levin	d1c7b05785	UPSTREAM: uapi: fix linux/tls.h userspace compilation error Move inclusion of a private kernel header <net/tcp.h> from uapi/linux/tls.h to its only user - net/tls.h, to fix the following linux/tls.h userspace compilation error: /usr/include/linux/tls.h:41:21: fatal error: net/tcp.h: No such file or directory As to this point uapi/linux/tls.h was totaly unusuable for userspace, cleanup this header file further by moving other redundant includes to net/tls.h. Fixes: `3c4d755915` ("tls: kernel TLS support") Cc: <stable@vger.kernel.org> # v4.13+ Change-Id: Icc7f01d1a027534a803acc3979381ba6b106b549 Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:20 +00:00
Tim Zimmermann	0a5b023298	Squashed revert of 4.14 tls backports Revert "net: generalize sk_alloc_sg to work with scatterlist rings" This reverts commit e351d8782539f93a7aea83055b03de0585653f1d. Revert "sock: make static tls function alloc_sg generic sock helper" This reverts commit 1a4d78e879a1f8234444af3cb7c207d422873843. Revert "net/tls: Fixed return value when tls_complete_pending_work() fails" This reverts commit `39d9e1c62e`. Revert "tls: Use correct sk->sk_prot for IPV6" This reverts commit `2a0f5919e1`. Revert "tls: don't override sk_write_space if tls_set_sw_offload fails." This reverts commit `2b8b2e7622`. Revert "tls: Avoid copying crypto_info again after cipher_type check." This reverts commit `93f16446c8`. Revert "tls: Fix TLS ulp context leak, when TLS_TX setsockopt is not used." This reverts commit `797b8bb47f`. Revert "tls: Add function to update the TLS socket configuration" This reverts commit `25f03991a5`. Revert "tls: possible hang when do_tcp_sendpages hits sndbuf is full case" This reverts commit `f0a8c1257f`. Revert "tls: clear key material from kernel memory when do_tls_setsockopt_conf fails" This reverts commit `18fef87e05`. Revert "tls: zero the crypto information from tls_context before freeing" This reverts commit `0c0334299a`. Revert "tls: don't copy the key out of tls12_crypto_info_aes_gcm_128" This reverts commit `10cacaf131`. Revert "net/tls: Set count of SG entries if sk_alloc_sg returns -ENOSPC" This reverts commit `04f625fc5a`. Revert "tcp, ulp: add alias for all ulp modules" This reverts commit `0c02e0c3fd`. Revert "sock: fix sg page frag coalescing in sk_alloc_sg" This reverts commit `464e2326a7`. Revert "tls: Stricter error checking in zerocopy sendmsg path" This reverts commit `30a7a7b04f`. Revert "tls: fix use-after-free in tls_push_record" This reverts commit `5e8a5c3054`. Revert "tls: retrun the correct IV in getsockopt" This reverts commit `94203f213c`. Revert "net/tls: Fix connection stall on partial tls record" This reverts commit `8e1b8e3279`. Revert "net/tls: Don't recursively call push_record during tls_write_space callbacks" This reverts commit `3ac0f3e0b8`. Revert "tls: reset crypto_info when do_tls_setsockopt_tx fails" This reverts commit `ed10b9affb`. Revert "tls: return -EBUSY if crypto_info is already set" This reverts commit `2f54941c88`. Revert "tls: fix sw_ctx leak" This reverts commit `3a28f04bc4`. Revert "net/tls: Only attach to sockets in ESTABLISHED state" This reverts commit `a022bbe393`. Revert "net/tls: Fix inverted error codes to avoid endless loop" This reverts commit `d3048a12f3`. Revert "tls: Use kzalloc for aead_request allocation" This reverts commit `f0e1cd056e`. Revert "uapi: fix linux/tls.h userspace compilation error" This reverts commit `33e58deefa`. Change-Id: Iecd555c5b8723b77b18551d9bb944215eb04f053 Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:20 +00:00
John Fastabend	483025411b	BACKPORT: bpf: sk_msg program helper bpf_msg_push_data This allows user to push data into a msg using sk_msg program types. The format is as follows, bpf_msg_push_data(msg, offset, len, flags) this will insert 'len' bytes at offset 'offset'. For example to prepend 10 bytes at the front of the message the user can, bpf_msg_push_data(msg, 0, 10, 0); This will invalidate data bounds so BPF user will have to then recheck data bounds after calling this. After this the msg size will have been updated and the user is free to write into the added bytes. We allow any offset/len as long as it is within the (data, data_end) range. However, a copy will be required if the ring is full and its possible for the helper to fail with ENOMEM or EINVAL errors which need to be handled by the BPF program. This can be used similar to XDP metadata to pass data between sk_msg layer and lower layers. Change-Id: Ib70acf2419e2941d0bb67c3331b1dd007688e4e8 Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:50:19 +00:00
Alexei Starovoitov	085bcb55bc	UPSTREAM: bpf: introduce verifier internal test flag Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain and state pruning. Change-Id: I89a8f21d4436a181045b56779a2213bd8b5d071b Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:44 +00:00
Quentin Monnet	c436811591	UPSTREAM: bpf: add new BPF_BTF_GET_NEXT_ID syscall command Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used to cycle through all BTF objects loaded on the system. The motivation is to be able to inspect (list) all BTF objects presents on the system. Change-Id: I9b766d4c70048ff2f6c910f61820cb85f874dfdd Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:43 +00:00
Toke Høiland-Jørgensen	890e559021	UPSTREAM: xdp: Add devmap_hash map type for looking up devices by hashed index A common pattern when using xdp_redirect_map() is to create a device map where the lookup key is simply ifindex. Because device maps are arrays, this leaves holes in the map, and the map has to be sized to fit the largest ifindex, regardless of how many devices actually are actually needed in the map. This patch adds a second type of device map where the key is looked up using a hashmap, instead of being used as an array index. This allows maps to be densely packed, so they can be smaller. Change-Id: Ibf2e0d0722ca22b53349df8019115ac1c263a286 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:41 +00:00
Toke Høiland-Jørgensen	d4e1e1eef5	UPSTREAM: bpf_xdp_redirect_map: Perform map lookup in eBPF helper The bpf_redirect_map() helper used by XDP programs doesn't return any indication of whether it can successfully redirect to the map index it was given. Instead, BPF programs have to track this themselves, leading to programs using duplicate maps to track which entries are populated in the devmap. This patch fixes this by moving the map lookup into the bpf_redirect_map() helper, which makes it possible to return failure to the eBPF program. The lower bits of the flags argument is used as the return code, which means that existing users who pass a '0' flag argument will get XDP_ABORTED. With this, a BPF program can check the return code from the helper call and react by, for instance, substituting a different redirect. This works for any type of map used for redirect. Change-Id: I5392fcb48de313d573b8d23e3c9f8e3a07bed5f1 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:41 +00:00
Stanislav Fomichev	7948f7715b	UPSTREAM: bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided that it can do a single u64 store into user_ip6[2] instead of two separate u32 ones: # 17: (18) r2 = 0x100000000000000 # ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2); # 19: (7b) (u64 )(r1 +16) = r2 # invalid bpf_context access off=16 size=8 >From the compiler point of view it does look like a correct thing to do, so let's support it on the kernel side. Credit to Andrii Nakryiko for a proper implementation of bpf_ctx_wide_store_ok. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Yonghong Song <yhs@fb.com> Fixes: cd17d7770578 ("bpf/tools: sync bpf.h") Reported-by: kernel test robot <rong.a.chen@intel.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Change-Id: I60bf910a9f10a870028e3a8484f8c6638aa62e43 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:40 +00:00
Stanislav Fomichev	934ce717e9	UPSTREAM: bpf: implement getsockopt and setsockopt hooks Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks. BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before passing them down to the kernel or bypass kernel completely. BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that kernel returns. Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure. The buffer memory is pre-allocated (because I don't think there is a precedent for working with __user memory from bpf). This might be slow to do for each {s,g}etsockopt call, that's why I've added __cgroup_bpf_prog_array_is_empty that exits early if there is nothing attached to a cgroup. Note, however, that there is a race between __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup program layout might have changed; this should not be a problem because in general there is a race between multiple calls to {s,g}etsocktop and user adding/removing bpf progs from a cgroup. The return code of the BPF program is handled as follows: * 0: EPERM * 1: success, continue with next BPF program in the cgroup chain v9: * allow overwriting setsockopt arguments (Alexei Starovoitov): * use set_fs (same as kernel_setsockopt) * buffer is always kzalloc'd (no small on-stack buffer) v8: * use s32 for optlen (Andrii Nakryiko) v7: * return only 0 or 1 (Alexei Starovoitov) * always run all progs (Alexei Starovoitov) * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov) (decided to use optval=-1 instead, optval=0 might be a valid input) * call getsockopt hook after kernel handlers (Alexei Starovoitov) v6: * rework cgroup chaining; stop as soon as bpf program returns 0 or 2; see patch with the documentation for the details * drop Andrii's and Martin's Acked-by (not sure they are comfortable with the new state of things) v5: * skip copy_to_user() and put_user() when ret == 0 (Martin Lau) v4: * don't export bpf_sk_fullsock helper (Martin Lau) * size != sizeof(__u64) for uapi pointers (Martin Lau) * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau) v3: * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko) * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii Nakryiko) * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau) * use BPF_FIELD_SIZEOF() for consistency (Martin Lau) * new CG_SOCKOPT_ACCESS macro to wrap repeated parts v2: * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau) * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau) * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau) * added [0,2] return code check to verifier (Martin Lau) * dropped unused buf[64] from the stack (Martin Lau) * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau) * dropped bpf_target_off from ctx rewrites (Martin Lau) * use return code for kernel bypass (Martin Lau & Andrii Nakryiko) Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Change-Id: I2eb79aea6a353ca33381dcf7859f382affadf205 Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:39 +00:00
Deepa Dinamani	2dd20cd306	UPSTREAM: time: Add new y2038 safe __kernel_timespec The new struct __kernel_timespec is similar to current internal kernel struct timespec64 on 64 bit architecture. The compat structure however is similar to below on little endian systems (padding and tv_nsec are switched for big endian systems): typedef s32 compat_long_t; typedef s64 compat_kernel_time64_t; struct compat_kernel_timespec { compat_kernel_time64_t tv_sec; compat_long_t tv_nsec; compat_long_t padding; }; This allows for both the native and compat representations to be the same and syscalls using this type as part of their ABI can have a single entry point to both. Note that the compat define is not included anywhere in the kernel explicitly to avoid confusion. These types will be used by the new syscalls that will be introduced in the consequent patches. Most of the new syscalls are just an update to the existing native ones with this new type. Hence, put this new type under an ifdef so that the architectures can define CONFIG_64BIT_TIME when they are ready to handle this switch. Cc: linux-arch@vger.kernel.org Change-Id: I06fbf455714d0f1bae69b9ac9def451e9a5abd53 Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:35 +00:00
Arnd Bergmann	a042a2c294	UPSTREAM: y2038: Introduce struct __kernel_old_timeval Dealing with 'struct timeval' users in the y2038 series is a bit tricky: We have two definitions of timeval that are visible to user space, one comes from glibc (or some other C library), the other comes from linux/time.h. The kernel copy is what we want to be used for a number of structures defined by the kernel itself, e.g. elf_prstatus (used it core dumps), sysinfo and rusage (used in system calls). These generally tend to be used for passing time intervals rather than absolute (epoch-based) times, so they do not suffer from the y2038 overflow. Some of them could be changed to use 64-bit timestamps by creating new system calls, others like the core files cannot easily be changed. An application using these interfaces likely also uses gettimeofday() or other interfaces that use absolute times, and pass 'struct timeval' pointers directly into kernel interfaces, so glibc must redefine their timeval based on a 64-bit time_t when they introduce their y2038-safe interfaces. The only reasonable way forward I see is to remove the 'timeval' definion from the kernel's uapi headers, and change the interfaces that we do not want to (or cannot) duplicate for 64-bit times to use a new __kernel_old_timeval definition instead. This type should be avoided for all new interfaces (those can use 64-bit nanoseconds, or the 64-bit version of timespec instead), and should be used with great care when converting existing interfaces from timeval, to be sure they don't suffer from the y2038 overflow, and only with consensus for the particular user that using __kernel_old_timeval is better than moving to a 64-bit based interface. The structure name is intentionally chosen to not conflict with user space types, and to be ugly enough to discourage its use. Note that ioctl based interfaces that pass a bare 'timeval' pointer cannot change to '__kernel_old_timeval' because the user space source code refers to 'timeval' instead, and we don't want to modify the user space sources if possible. However, any application that relies on a structure to contain an embedded 'timeval' (e.g. by passing a pointer to the member into a function call that expects a timeval pointer) is broken when that structure gets converted to __kernel_old_timeval. I don't see any way around that, and we have to rely on the compiler to produce a warning or compile failure that will alert users when they recompile their sources against a new libc. Change-Id: I89fee6b13307424100b8a630f75414c77129ee5c Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20180315161739.576085-1-arnd@arndb.de Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:35 +00:00
Jonathan Lemon	66c8405efc	UPSTREAM: bpf: Allow bpf_map_lookup_elem() on an xskmap Currently, the AF_XDP code uses a separate map in order to determine if an xsk is bound to a queue. Instead of doing this, have bpf_map_lookup_elem() return a xdp_sock. Rearrange some xdp_sock members to eliminate structure holes. Remove selftest - will be added back in later patch. Change-Id: Icfad5e0d4f996eb52bc36e87344e57343354e834 Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:30 +00:00
Jakub Kicinski	d59ae27bf5	BACKPORT: xdp: support simultaneous driver and hw XDP attachment Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Change-Id: Ic313491b122cabdd17c7b46f04b61af634358ec6 Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>	2025-12-24 11:49:28 +00:00

1 2 3 4 5 ...

5884 Commits