53396 Commits

Author SHA1 Message Date
basamaryan
8d60ed7eac wireguard: compat: guard __kernel_timespec to <4.14
Change-Id: Iac483a5dd995b3e25e6f10c7fc1c6f3ebd6b7796
2026-01-15 15:45:46 +05:30
Arnd Bergmann
224943ee8f UPSTREAM: libceph: use timespec64 in for keepalive2 and ticket validity
ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts().  Replacing this with timespec64
based interfaces  lets us remove that deprecated API.

I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.

The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.

Change-Id: I638be3e081770e6b766b972343ed7652a473383d
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-01-15 15:45:44 +05:30
theshaenix
9fbf85f1ff tcp: restore missing OPLUS modem TCP hook declarations
Restore OPLUS modem network power hook declarations in tcp_output.c
that were accidentally removed during previous cleanups.

These extern declarations and headers are required when
OPLUS_FEATURE_MODEM_DATA_NWPOWER is enabled to provide the
corresponding modem wakeup and TCP output hook infrastructure.

Fixes build failures caused by unresolved OPLUS TCP hook symbols.
2025-12-28 19:20:51 +05:30
Taehee Yoo
efcf368e9e UPSTREAM: net: bpfilter: disallow to remove bpfilter module while being used
The bpfilter.ko module can be removed while functions of the bpfilter.ko
are executing. so panic can occurred. in order to protect that, locks can
be used. a bpfilter_lock protects routines in the
__bpfilter_process_sockopt() but it's not enough because __exit routine
can be executed concurrently.

Now, the bpfilter_umh can not run in parallel.
So, the module do not removed while it's being used and it do not
double-create UMH process.
The members of the umh_info and the bpfilter_umh_ops are protected by
the bpfilter_umh_ops.lock.

test commands:
   while :
   do
	iptables -I FORWARD -m string --string ap --algo kmp &
	modprobe -rv bpfilter &
   done

splat looks like:
[  298.623435] BUG: unable to handle kernel paging request at fffffbfff807440b
[  298.628512] #PF error: [normal kernel read fault]
[  298.633018] PGD 124327067 P4D 124327067 PUD 11c1a3067 PMD 119eb2067 PTE 0
[  298.638859] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  298.638859] CPU: 0 PID: 2997 Comm: iptables Not tainted 4.20.0+ #154
[  298.638859] RIP: 0010:__mutex_lock+0x6b9/0x16a0
[  298.638859] Code: c0 00 00 e8 89 82 ff ff 80 bd 8f fc ff ff 00 0f 85 d9 05 00 00 48 8b 85 80 fc ff ff 48 bf 00 00 00 00 00 fc ff df 48 c1 e8 03 <80> 3c 38 00 0f 85 1d 0e 00 00 48 8b 85 c8 fc ff ff 49 39 47 58 c6
[  298.638859] RSP: 0018:ffff88810e7777a0 EFLAGS: 00010202
[  298.638859] RAX: 1ffffffff807440b RBX: ffff888111bd4d80 RCX: 0000000000000000
[  298.638859] RDX: 1ffff110235ff806 RSI: ffff888111bd5538 RDI: dffffc0000000000
[  298.638859] RBP: ffff88810e777b30 R08: 0000000080000002 R09: 0000000000000000
[  298.638859] R10: 0000000000000000 R11: 0000000000000000 R12: fffffbfff168a42c
[  298.638859] R13: ffff888111bd4d80 R14: ffff8881040e9a05 R15: ffffffffc03a2000
[  298.638859] FS:  00007f39e3758700(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
[  298.638859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  298.638859] CR2: fffffbfff807440b CR3: 000000011243e000 CR4: 00000000001006f0
[  298.638859] Call Trace:
[  298.638859]  ? mutex_lock_io_nested+0x1560/0x1560
[  298.638859]  ? kasan_kmalloc+0xa0/0xd0
[  298.638859]  ? kmem_cache_alloc+0x1c2/0x260
[  298.638859]  ? __alloc_file+0x92/0x3c0
[  298.638859]  ? alloc_empty_file+0x43/0x120
[  298.638859]  ? alloc_file_pseudo+0x220/0x330
[  298.638859]  ? sock_alloc_file+0x39/0x160
[  298.638859]  ? __sys_socket+0x113/0x1d0
[  298.638859]  ? __x64_sys_socket+0x6f/0xb0
[  298.638859]  ? do_syscall_64+0x138/0x560
[  298.638859]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  298.638859]  ? __alloc_file+0x92/0x3c0
[  298.638859]  ? init_object+0x6b/0x80
[  298.638859]  ? cyc2ns_read_end+0x10/0x10
[  298.638859]  ? cyc2ns_read_end+0x10/0x10
[  298.638859]  ? hlock_class+0x140/0x140
[  298.638859]  ? sched_clock_local+0xd4/0x140
[  298.638859]  ? sched_clock_local+0xd4/0x140
[  298.638859]  ? check_flags.part.37+0x440/0x440
[  298.638859]  ? __lock_acquire+0x4f90/0x4f90
[  298.638859]  ? set_rq_offline.part.89+0x140/0x140
[ ... ]

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Change-Id: I3186f2ec6cbf8263aed16d562021f741ab25bf86
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:12 +05:30
Taehee Yoo
fd18f996d9 UPSTREAM: net: bpfilter: restart bpfilter_umh when error occurred
The bpfilter_umh will be stopped via __stop_umh() when the bpfilter
error occurred.
The bpfilter_umh() couldn't start again because there is no restart
routine.

The section of the bpfilter_umh_{start/end} is no longer .init.rodata
because these area should be reused in the restart routine. hence
the section name is changed to .bpfilter_umh.

The bpfilter_ops->start() is restart callback. it will be called when
bpfilter_umh is stopped.
The stop bit means bpfilter_umh is stopped. this bit is set by both
start and stop routine.

Before this patch,
Test commands:
   $ iptables -vnL
   $ kill -9 <pid of bpfilter_umh>
   $ iptables -vnL
   [  480.045136] bpfilter: write fail -32
   $ iptables -vnL

All iptables commands will fail.

After this patch,
Test commands:
   $ iptables -vnL
   $ kill -9 <pid of bpfilter_umh>
   $ iptables -vnL
   $ iptables -vnL

Now, all iptables commands will work.

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Change-Id: I4c1fa7fd3af172967e450f3db8fab77434b341b6
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:12 +05:30
Taehee Yoo
cccaf12d45 UPSTREAM: net: bpfilter: use cleanup callback to release umh_info
Now, UMH process is killed, do_exit() calls the umh_info->cleanup callback
to release members of the umh_info.
This patch makes bpfilter_umh's cleanup routine to use the
umh_info->cleanup callback.

Change-Id: I865aa89eb58d9bba883ed59e4b921fe401413853
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:11 +05:30
Eric W. Biederman
274c4e577c UPSTREAM: signal/bpfilter: Fix bpfilter_kernl to use send_sig not force_sig
[ Upstream commit 1dfd1711de2952fd1bfeea7152bd1687a4eea771 ]

The locking in force_sig_info is not prepared to deal with
a task that exits or execs (as sighand may change).  As force_sig
is only built to handle synchronous exceptions.

Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal.  The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.

So using force_sig rather than send_sig for SIGKILL is pointless.

Because it won't impact the sending of the signal and and because
using force_sig is wrong, replace force_sig with send_sig.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Change-Id: Id62ece4eade762ecc94dd3e4a940907e25c48fab
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:42:11 +05:30
Taehee Yoo
24bcc1cd34 UPSTREAM: net: bpfilter: use get_pid_task instead of pid_task
pid_task() dereferences rcu protected tasks array.
But there is no rcu_read_lock() in shutdown_umh() routine so that
rcu_read_lock() is needed.
get_pid_task() is wrapper function of pid_task. it holds rcu_read_lock()
then calls pid_task(). if task isn't NULL, it increases reference count
of task.

test commands:
   %modprobe bpfilter
   %modprobe -rv bpfilter

splat looks like:
[15102.030932] =============================
[15102.030957] WARNING: suspicious RCU usage
[15102.030985] 4.19.0-rc7+ #21 Not tainted
[15102.031010] -----------------------------
[15102.031038] kernel/pid.c:330 suspicious rcu_dereference_check() usage!
[15102.031063]
	       other info that might help us debug this:

[15102.031332]
	       rcu_scheduler_active = 2, debug_locks = 1
[15102.031363] 1 lock held by modprobe/1570:
[15102.031389]  #0: 00000000580ef2b0 (bpfilter_lock){+.+.}, at: stop_umh+0x13/0x52 [bpfilter]
[15102.031552]
               stack backtrace:
[15102.031583] CPU: 1 PID: 1570 Comm: modprobe Not tainted 4.19.0-rc7+ #21
[15102.031607] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[15102.031628] Call Trace:
[15102.031676]  dump_stack+0xc9/0x16b
[15102.031723]  ? show_regs_print_info+0x5/0x5
[15102.031801]  ? lockdep_rcu_suspicious+0x117/0x160
[15102.031855]  pid_task+0x134/0x160
[15102.031900]  ? find_vpid+0xf0/0xf0
[15102.032017]  shutdown_umh.constprop.1+0x1e/0x53 [bpfilter]
[15102.032055]  stop_umh+0x46/0x52 [bpfilter]
[15102.032092]  __x64_sys_delete_module+0x47e/0x570
[ ... ]

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Change-Id: Ideb9fdae191efb594fa2a258d5e559fb0bba39ce
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:10 +05:30
Shanthosh RK
927a815854 UPSTREAM: net: bpfilter: Fix type cast and pointer warnings
Fixes the following Sparse warnings:

net/bpfilter/bpfilter_kern.c:62:21: warning: cast removes address space
of expression
net/bpfilter/bpfilter_kern.c:101:49: warning: Using plain integer as
NULL pointer

Change-Id: I18ad265abd911bab943a3c84a3dac540e28a1da6
Signed-off-by: Shanthosh RK <shanthosh.rk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:10 +05:30
Masahiro Yamada
635b27ce03 BACKPORT: bpfilter: include bpfilter_umh in assembly instead of using objcopy
What we want here is to embed a user-space program into the kernel.
Instead of the complex ELF magic, let's simply wrap it in the assembly
with the '.incbin' directive.

Change-Id: I8a532c9f9c5d2bc853821becb40832d56cdbb72c
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:10 +05:30
Alexei Starovoitov
0e165cc9cc UPSTREAM: bpfilter: fix race in pipe access
syzbot reported the following crash
[  338.293946] bpfilter: read fail -512
[  338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  338.311863] general protection fault: 0000 [#1] SMP KASAN
[  338.344360] RIP: 0010:__vfs_write+0x4a6/0x960
[  338.426363] Call Trace:
[  338.456967]  __kernel_write+0x10c/0x380
[  338.460928]  __bpfilter_process_sockopt+0x1d8/0x35b
[  338.487103]  bpfilter_mbox_request+0x4d/0xb0
[  338.491492]  bpfilter_ip_get_sockopt+0x6b/0x90

This can happen when multiple cpus trying to talk to user mode process
via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to
sleep on the same mutex. Then former cpu sees that umh pipe is down and
shuts down the pipes. Later cpu finally acquires the mutex and crashes
on freed pipe.
Fix the race by using info.pid as an indicator that umh and pipes are healthy
and check it after acquiring the mutex.

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Reported-by: syzbot+7ade6c94abb2774c0fee@syzkaller.appspotmail.com
Change-Id: Ic3f5aa80143601b532f6ca2feb4ef97b249805f0
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:09 +05:30
Arnd Bergmann
ef17e27535 UPSTREAM: bpfilter: fix building without CONFIG_INET
bpfilter_process_sockopt is a callback that gets called from
ip_setsockopt() and ip_getsockopt(). However, when CONFIG_INET is
disabled, it never gets called at all, and assigning a function to the
callback pointer results in a link failure:

net/bpfilter/bpfilter_kern.o: In function `__stop_umh':
bpfilter_kern.c:(.text.unlikely+0x3): undefined reference to `bpfilter_process_sockopt'
net/bpfilter/bpfilter_kern.o: In function `load_umh':
bpfilter_kern.c:(.init.text+0x73): undefined reference to `bpfilter_process_sockopt'

Since there is no caller in this configuration, I assume we can
simply make the assignment conditional.

Change-Id: Ia233a7b3be4b060f653aad9d4a3f045ba3b8a07e
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:09 +05:30
Stanislav Fomichev
d693449e3e BACKPORT: bpf/flow_dissector: support ipv6 flow_label and BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
Add support for exporting ipv6 flow label via bpf_flow_keys.
Export flow label from bpf_flow.c and also return early when
BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Change-Id: I6b4c3771022f19c184867fb6045351d59cfde68b
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-28 18:42:07 +05:30
Stanislav Fomichev
9287f9616f UPSTREAM: bpf: Move skb->len == 0 checks into __bpf_redirect
[ Upstream commit 114039b342014680911c35bd6b72624180fd669a ]

To avoid potentially breaking existing users.

Both mac/no-mac cases have to be amended; mac_header >= network_header
is not enough (verified with a new test, see next patch).

Fixes: fd1894224407 ("bpf: Don't redirect packets with invalid pkt_len")
Change-Id: If63ce98321997875863f3fce255061e8a64bba5e
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221121180340.1983627-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:42:06 +05:30
Zhengchao Shao
457d766adf UPSTREAM: bpf: Don't redirect packets with invalid pkt_len
commit fd1894224407c484f652ad456e1ce423e89bb3eb upstream.

Syzbot found an issue [1]: fq_codel_drop() try to drop a flow whitout any
skbs, that is, the flow->head is null.
The root cause, as the [2] says, is because that bpf_prog_test_run_skb()
run a bpf prog which redirects empty skbs.
So we should determine whether the length of the packet modified by bpf
prog or others like bpf_prog_test is valid before forwarding it directly.

LINK: [1] https://syzkaller.appspot.com/bug?id=0b84da80c2917757915afa89f7738a9d16ec96c5
LINK: [2] https://www.spinics.net/lists/netdev/msg777503.html

Reported-by: syzbot+7a12909485b94426aceb@syzkaller.appspotmail.com
Change-Id: Ieba6ad52b2eb657fb02ef7d3e414932d7743f073
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220715115559.139691-1-shaozhengchao@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-28 18:42:06 +05:30
Stanislav Fomichev
0e87beec78 UPSTREAM: bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
This will allow us to write tests for those flags.

v2:
* Swap kfree(data) and kfree(user_ctx) (Song Liu)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Change-Id: Ic1b6920aa27ee21945df37585af2b0dfe37dca6e
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-28 18:42:05 +05:30
Stanislav Fomichev
4815ef912a UPSTREAM: bpf/flow_dissector: pass input flags to BPF flow dissector program
C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible. Pass
those flags to the BPF flow dissector so it can make the same
decisions. In the next commits I'll add support for those flags to
our reference bpf_flow.c

v3:
* Export copy of flow dissector flags instead of moving (Alexei Starovoitov)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Change-Id: I46a68f8b2249915fff5d97a1394ea662d9a0ac46
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-28 18:42:05 +05:30
Stanislav Fomichev
3807c70734 UPSTREAM: flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
This flag is not used by any caller, remove it.

Change-Id: I280df0c749d35d6259fdcf3c7f9d3d142e6289a4
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:04 +05:30
Stanislav Fomichev
bca0f440f0 UPSTREAM: flow_dissector: handle no-skb use case
When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Change-Id: Ice7d27b1adaeb9adb02ea52fda999450a6d765af
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:42:04 +05:30
Stanislav Fomichev
13988727f4 UPSTREAM: net: plumb network namespace into __skb_flow_dissect
This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls flow dissector
with only data (without skb) so there is currently no way to
pull attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass network namespace
so we can use attached BPF program.

Change-Id: I07dbe25c77a13f8cc33b037e63c532979d290ad0
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:42:04 +05:30
Paolo Abeni
b0f67d760b BACKPORT: flow_dissector: do not rely on implicit casts
This change fixes a couple of type mismatch reported by the sparse
tool, explicitly using the requested type for the offending arguments.

Change-Id: I73920d0a210f1ef2229a406c49844970208d6ada
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:03 +05:30
Paolo Abeni
036ccf95fb BACKPORT: net: core: rework basic flow dissection helper
When the core networking needs to detect the transport offset in a given
packet and parse it explicitly, a full-blown flow_keys struct is used for
storage.
This patch introduces a smaller keys store, rework the basic flow dissect
helper to use it, and apply this new helper where possible - namely in
skb_probe_transport_header(). The used flow dissector data structures
are renamed to match more closely the new role.

The above gives ~50% performance improvement in micro benchmarking around
skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
to the smaller memset. Small, but measurable improvement is measured also
in macro benchmarking.

v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
  as per DaveM suggestion

[Linux4: Apply to include/linux/virtio_net.h]

Suggested-by: David Miller <davem@davemloft.net>
Change-Id: Iebe280d3793028c9dd640b5dc59d046601f02cca
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:02 +05:30
Eric Dumazet
80f5987ed0 UPSTREAM: flow_dissector: disable preemption around BPF calls
Various things in eBPF really require us to disable preemption
before running an eBPF program.

syzbot reported :

BUG: assuming atomic context at net/core/flow_dissector.c:737
in_atomic(): 0, irqs_disabled(): 0, pid: 24710, name: syz-executor.3
2 locks held by syz-executor.3/24710:
 #0: 00000000e81a4bf1 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x168e/0x3ff0 drivers/net/tun.c:1850
 #1: 00000000254afebd (rcu_read_lock){....}, at: __skb_flow_dissect+0x1e1/0x4bb0 net/core/flow_dissector.c:822
CPU: 1 PID: 24710 Comm: syz-executor.3 Not tainted 5.1.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 __cant_sleep kernel/sched/core.c:6165 [inline]
 __cant_sleep.cold+0xa3/0xbb kernel/sched/core.c:6142
 bpf_flow_dissect+0xfe/0x390 net/core/flow_dissector.c:737
 __skb_flow_dissect+0x362/0x4bb0 net/core/flow_dissector.c:853
 skb_flow_dissect_flow_keys_basic include/linux/skbuff.h:1322 [inline]
 skb_probe_transport_header include/linux/skbuff.h:2500 [inline]
 skb_probe_transport_header include/linux/skbuff.h:2493 [inline]
 tun_get_user+0x2cfe/0x3ff0 drivers/net/tun.c:1940
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
 call_write_iter include/linux/fs.h:1872 [inline]
 do_iter_readv_writev+0x5fd/0x900 fs/read_write.c:693
 do_iter_write fs/read_write.c:970 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:951
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
 do_writev+0x15b/0x330 fs/read_write.c:1058
 __do_sys_writev fs/read_write.c:1131 [inline]
 __se_sys_writev fs/read_write.c:1128 [inline]
 __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
 do_syscall_64+0x103/0x670 arch/x86/entry/common.c:298
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Change-Id: I326ce1e572ae9c2dc060c172dbd0227bb8ee9eef
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Petar Penkov <ppenkov@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-12-28 18:42:01 +05:30
Matt Mullins
632a7711e4 UPSTREAM: selftests: bpf: test writable buffers in raw tps
This tests that:
  * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
    uses either:
    * a variable offset to the tracepoint buffer, or
    * an offset beyond the size of the tracepoint buffer
  * a tracer can modify the buffer provided when attached to a writable
    tracepoint in bpf_prog_test_run

Change-Id: I3fdf650c1d9a68d64d08b170b14e46a35d180dd7
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-28 18:42:01 +05:30
Stanislav Fomichev
540ab7d6c0 UPSTREAM: bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN
Now that we use skb-less flow dissector let's return true nhoff and
thoff. We used to adjust them by ETH_HLEN because that's how it was
done in the skb case. For VLAN tests that looks confusing: nhoff is
pointing to vlan parts :-\

Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
if you think that it's too late at this point to fix it.

Change-Id: Icae82218f62c91b7860981bc62269eba8e58c25c
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:42:00 +05:30
Stanislav Fomichev
e21018e474 UPSTREAM: bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
Now that we have bpf_flow_dissect which can work on raw data,
use it when doing BPF_PROG_TEST_RUN for flow dissector.

Simplifies bpf_prog_test_run_flow_dissector and allows us to
test no-skb mode.

Note, that previously, with bpf_flow_dissect_skb we used to call
eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
skb_reset_network_header. That means flow_keys->nhoff would be
initialized to 0 (skb_network_offset) in init_flow_keys.
Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
to undo it once the dissection is done to preserve the existing behavior.

Change-Id: Icaf9c2a76c6176c647081535f7d48d94fff2e126
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:42:00 +05:30
Stanislav Fomichev
fc4ff96a42 UPSTREAM: bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUN
This should allow us later to extend BPF_PROG_TEST_RUN for non-skb case
and be sure that nobody is erroneously setting ctx_{in,out}.

Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: Ibe85a9217cd265966ef57e1298b37a6b9baa50c4
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:42:00 +05:30
Bo YU
c556d97c4d UPSTREAM: bpf: fix warning about using plain integer as NULL
Sparse warning below:

sudo make C=2 CF=-D__CHECK_ENDIAN__ M=net/bpf/
CHECK   net/bpf//test_run.c
net/bpf//test_run.c:19:77: warning: Using plain integer as NULL pointer
./include/linux/bpf-cgroup.h:295:77: warning: Using plain integer as NULL pointer

Fixes: 8bad74f9840f ("bpf: extend cgroup bpf core to allow multiple cgroup storage types")
Acked-by: Yonghong Song <yhs@fb.com>
Change-Id: I1dba06e69eb1e3b4a3163dc32b48c4fc6bdc755d
Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:59 +05:30
Stanislav Fomichev
027a301e01 UPSTREAM: bpf/test_run: fix unkillable BPF_PROG_TEST_RUN for flow dissector
Syzbot found out that running BPF_PROG_TEST_RUN with repeat=0xffffffff
makes process unkillable. The problem is that when CONFIG_PREEMPT is
enabled, we never see need_resched() return true. This is due to the
fact that preempt_enable() (which we do in bpf_test_run_one on each
iteration) now handles resched if it's needed.

Let's disable preemption for the whole run, not per test. In this case
we can properly see whether resched is needed.
Let's also properly return -EINTR to the userspace in case of a signal
interrupt.

This is a follow up for a recently fixed issue in bpf_test_run, see
commit df1a2cb7c74b ("bpf/test_run: fix unkillable
BPF_PROG_TEST_RUN").

Reported-by: syzbot <syzkaller@googlegroups.com>
Change-Id: I38d1d9553d1ee281666de7000d06684962658397
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:59 +05:30
Stanislav Fomichev
8bb7a7489c UPSTREAM: bpf/test_run: fix unkillable BPF_PROG_TEST_RUN
Syzbot found out that running BPF_PROG_TEST_RUN with repeat=0xffffffff
makes process unkillable. The problem is that when CONFIG_PREEMPT is
enabled, we never see need_resched() return true. This is due to the
fact that preempt_enable() (which we do in bpf_test_run_one on each
iteration) now handles resched if it's needed.

Let's disable preemption for the whole run, not per test. In this case
we can properly see whether resched is needed.
Let's also properly return -EINTR to the userspace in case of a signal
interrupt.

See recent discussion:
http://lore.kernel.org/netdev/CAH3MdRWHr4N8jei8jxDppXjmw-Nw=puNDLbu1dQOFQHxfU2onA@mail.gmail.com

I'll follow up with the same fix bpf_prog_test_run_flow_dissector in
bpf-next.

Reported-by: syzbot <syzkaller@googlegroups.com>
Change-Id: I6744e83dc918d20f44cdf9f37cbb0f223a970aa4
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:58 +05:30
Lorenz Bauer
85c8c0d034 UPSTREAM: bpf: respect size hint to BPF_PROG_TEST_RUN if present
Use data_size_out as a size hint when copying test output to user space.
ENOSPC is returned if the output buffer is too small.
Callers which so far did not set data_size_out are not affected.

Change-Id: Ic1a42d1903e96a26a27a56489b75be05c58996ff
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-28 18:41:58 +05:30
Stanislav Fomichev
0d382b1663 BACKPORT: bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT
Performance impact should be minimal because it's under a new
BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled.

Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Priyaranjan Jha <priyarjha@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Change-Id: I19814edf78101f87e6dc364343f8173d9e230850
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:57 +05:30
Stanislav Fomichev
d9d8a2e741 UPSTREAM: bpf: support cloning sk storage on accept()
Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Change-Id: Iee30d2442b76f6fd7904829314e63f3391e7d811
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:56 +05:30
Stanislav Fomichev
510d66ef1a UPSTREAM: bpf: fix missing bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUN
Commit b0b9395d865e ("bpf: support input __sk_buff context in
BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in
BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined
for !CONFIG_BPF_SYSCALL:

net/bpf/test_run.c: In function ‘bpf_ctx_init’:
net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration]
   err = bpf_check_uarg_tail_zero(data_in, max_size, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~

Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set.

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Change-Id: Iddc8eb00de3fbde87f765f533b91d2fbd3b41603
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-12-28 18:41:55 +05:30
Willem de Bruijn
a34192bb10 UPSTREAM: bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags
[ Upstream commit d4bac0288a2b444e468e6df9cb4ed69479ddf14a ]

Classic BPF socket filters with SKB_NET_OFF and SKB_LL_OFF fail to
read when these offsets extend into frags.

This has been observed with iwlwifi and reproduced with tun with
IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port,
applied to a RAW socket, will silently miss matching packets.

    const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt);
    const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest);
    struct sock_filter filter_code[] = {
            BPF_STMT(BPF_LD  + BPF_B   + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4),
            BPF_STMT(BPF_LD  + BPF_B   + BPF_ABS, SKF_NET_OFF + offset_proto),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2),
            BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, SKF_NET_OFF + offset_dport),

This is unexpected behavior. Socket filter programs should be
consistent regardless of environment. Silent misses are
particularly concerning as hard to detect.

Use skb_copy_bits for offsets outside linear, same as done for
non-SKF_(LL|NET) offsets.

Offset is always positive after subtracting the reference threshold
SKB_(LL|NET)_OFF, so is always >= skb_(mac|network)_offset. The sum of
the two is an offset against skb->data, and may be negative, but it
cannot point before skb->head, as skb_(mac|network)_offset would too.

This appears to go back to when frag support was introduced to
sk_run_filter in linux-2.4.4, before the introduction of git.

The amount of code change and 8/16/32 bit duplication are unfortunate.
But any attempt I made to be smarter saved very few LoC while
complicating the code.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/20250122200402.3461154-1-maze@google.com/
Link: https://elixir.bootlin.com/linux/2.4.4/source/net/core/filter.c#L244
Reported-by: Matt Moeller <moeller.matt@gmail.com>
Co-developed-by: Maciej Żenczykowski <maze@google.com>
Change-Id: Ia1ff7be23c39d872596848731818740696c86c29
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250408132833.195491-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:55 +05:30
Ben Dooks
a576634920 UPSTREAM: bpf: Add endian modifiers to fix endian warnings
[ Upstream commit 96a233e600df351bcb06e3c20efe048855552926 ]

A couple of the syscalls which load values (bpf_skb_load_helper_16() and
bpf_skb_load_helper_32()) are using u16/u32 types which are triggering
warnings as they are then converted from big-endian to CPU-endian. Fix
these by making the types __be instead.

Fixes the following sparse warnings:

  net/core/filter.c:246:32: warning: cast to restricted __be16
  net/core/filter.c:246:32: warning: cast to restricted __be16
  net/core/filter.c:246:32: warning: cast to restricted __be16
  net/core/filter.c:246:32: warning: cast to restricted __be16
  net/core/filter.c:273:32: warning: cast to restricted __be32
  net/core/filter.c:273:32: warning: cast to restricted __be32
  net/core/filter.c:273:32: warning: cast to restricted __be32
  net/core/filter.c:273:32: warning: cast to restricted __be32
  net/core/filter.c:273:32: warning: cast to restricted __be32
  net/core/filter.c:273:32: warning: cast to restricted __be32

Change-Id: I6486ac70a70ffae777a6cd01f4f4b6926f44797f
Signed-off-by: Ben Dooks <ben.dooks@sifive.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220714105101.297304-1-ben.dooks@sifive.com
Stable-dep-of: d4bac0288a2b ("bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:54 +05:30
Cong Wang
ca4bde13e5 UPSTREAM: bpf: Check negative offsets in __bpf_skb_min_len()
[ Upstream commit 9ecc4d858b92c1bb0673ad9c327298e600c55659 ]

skb_network_offset() and skb_transport_offset() can be negative when
they are called after we pull the transport header, for example, when
we use eBPF sockmap at the point of ->sk_data_ready().

__bpf_skb_min_len() uses an unsigned int to get these offsets, this
leads to a very large number which then causes bpf_skb_change_tail()
failed unexpectedly.

Fix this by using a signed int to get these offsets and ensure the
minimum is at least zero.

Fixes: 5293efe62d ("bpf: add bpf_skb_change_tail helper")
Change-Id: I6267f793ec92ff120f2a5b689976ba19443dae9a
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241213034057.246437-2-xiyou.wangcong@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:54 +05:30
Zijian Zhang
3051bf87f5 UPSTREAM: bpf, sockmap: Fix sk_msg_reset_curr
[ Upstream commit 955afd57dc4bf7e8c620a0a9e3af3c881c2c6dff ]

Found in the test_txmsg_pull in test_sockmap,
```
txmsg_cork = 512; // corking is importrant here
opt->iov_length = 3;
opt->iov_count = 1;
opt->rate = 512; // sendmsg will be invoked 512 times
```
The first sendmsg will send an sk_msg with size 3, and bpf_msg_pull_data
will be invoked the first time. sk_msg_reset_curr will reset the copybreak
from 3 to 0. In the second sendmsg, since we are in the stage of corking,
psock->cork will be reused in func sk_msg_alloc. msg->sg.copybreak is 0
now, the second msg will overwrite the first msg. As a result, we could
not pass the data integrity test.

The same problem happens in push and pop test. Thus, fix sk_msg_reset_curr
to restore the correct copybreak.

Fixes: bb9aefde5bba ("bpf: sockmap, updating the sg structure should also update curr")
Change-Id: Ibbac29c62d1b26f03d5f3588c14316829aac017f
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Link: https://lore.kernel.org/r/20241106222520.527076-9-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:53 +05:30
Zijian Zhang
1ebe3da854 UPSTREAM: bpf, sockmap: Several fixes to bpf_msg_pop_data
[ Upstream commit 5d609ba262475db450ba69b8e8a557bd768ac07a ]

Several fixes to bpf_msg_pop_data,
1. In sk_msg_shift_left, we should put_page
2. if (len == 0), return early is better
3. pop the entire sk_msg (last == msg->sg.size) should be supported
4. Fix for the value of variable "a"
5. In sk_msg_shift_left, after shifting, i has already pointed to the next
element. Addtional sk_msg_iter_var_next may result in BUG.

Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages")
Change-Id: I03032e0683aec3b938bb1340b493e280c8e871b8
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20241106222520.527076-8-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:53 +05:30
Zijian Zhang
d1493947c0 UPSTREAM: bpf, sockmap: Several fixes to bpf_msg_push_data
[ Upstream commit 15ab0548e3107665c34579ae523b2b6e7c22082a ]

Several fixes to bpf_msg_push_data,
1. test_sockmap has tests where bpf_msg_push_data is invoked to push some
data at the end of a message, but -EINVAL is returned. In this case, in
bpf_msg_push_data, after the first loop, i will be set to msg->sg.end, add
the logic to handle it.
2. In the code block of "if (start - offset)", it's possible that "i"
points to the last of sk_msg_elem. In this case, "sk_msg_iter_next(msg,
end)" might still be called twice, another invoking is in "if (!copy)"
code block, but actually only one is needed. Add the logic to handle it,
and reconstruct the code to make the logic more clear.

Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Change-Id: Ie497b74e5b7e5ff9b8332096e8b4611109ff51be
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Link: https://lore.kernel.org/r/20241106222520.527076-7-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:52 +05:30
Fred Li
f3f691f1ba UPSTREAM: bpf: Fix a segment issue when downgrading gso_size
[ Upstream commit fa5ef655615a01533035c6139248c5b33aa27028 ]

Linearize the skb when downgrading gso_size because it may trigger a
BUG_ON() later when the skb is segmented as described in [1,2].

Fixes: 2be7e212d5 ("bpf: add bpf_skb_adjust_room helper")
Change-Id: I130565140af5a4ca59c1742d7cff5fa2dacb7817
Signed-off-by: Fred Li <dracodingfly@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/all/20240626065555.35460-2-dracodingfly@gmail.com [1]
Link: https://lore.kernel.org/all/668d5cf1ec330_1c18c32947@willemb.c.googlers.com.notmuch [2]
Link: https://lore.kernel.org/bpf/20240719024653.77006-1-dracodingfly@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:52 +05:30
John Fastabend
401e6f7e7f UPSTREAM: bpf: sockmap, updating the sg structure should also update curr
[ Upstream commit bb9aefde5bbaf6c168c77ba635c155b4980c2287 ]

Curr pointer should be updated when the sg structure is shifted.

Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
Change-Id: I3c5d1dd94c2b874b6f0c34010f8e2a766f6081c1
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231206232706.374377-3-john.fastabend@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:51 +05:30
Martin KaFai Lau
e04cccd4fc BACKPORT: bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
commit 1fe4850b34ab512ff911e2c035c75fb6438f7307 upstream.

The bpf_fib_lookup() helper does not only look up the fib (ie. route)
but it also looks up the neigh. Before returning the neigh, the helper
does not check for NUD_VALID. When a neigh state (neigh->nud_state)
is in NUD_FAILED, its dmac (neigh->ha) could be all zeros. The helper
still returns SUCCESS instead of NO_NEIGH in this case. Because of the
SUCCESS return value, the bpf prog directly uses the returned dmac
and ends up filling all zero in the eth header.

This patch checks for NUD_VALID and returns NO_NEIGH if the neigh is
not valid.

Change-Id: I1d81fbe4e026d2572099c684582a30a0e62e5f03
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217004150.2980689-3-martin.lau@linux.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-28 18:41:51 +05:30
Stanislav Fomichev
c534323a39 UPSTREAM: bpf: Move skb->len == 0 checks into __bpf_redirect
[ Upstream commit 114039b342014680911c35bd6b72624180fd669a ]

To avoid potentially breaking existing users.

Both mac/no-mac cases have to be amended; mac_header >= network_header
is not enough (verified with a new test, see next patch).

Fixes: fd1894224407 ("bpf: Don't redirect packets with invalid pkt_len")
Change-Id: I7f16f664a202d1f962bd9e2d8d8e60064f62ef4f
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221121180340.1983627-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:50 +05:30
Jon Maxwell
afda5c7962 UPSTREAM: bpf: Fix request_sock leak in sk lookup helpers
[ Upstream commit 3046a827316c0e55fc563b4fb78c93b9ca5c7c37 ]

A customer reported a request_socket leak in a Calico cloud environment. We
found that a BPF program was doing a socket lookup with takes a refcnt on
the socket and that it was finding the request_socket but returning the parent
LISTEN socket via sk_to_full_sk() without decrementing the child request socket
1st, resulting in request_sock slab object leak. This patch retains the
existing behaviour of returning full socks to the caller but it also decrements
the child request_socket if one is present before doing so to prevent the leak.

Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
thanks to Antoine Tenart for the reproducer and patch input.

v2 of this patch contains, refactor as per Daniel Borkmann's suggestions to
validate RCU flags on the listen socket so that it balances with bpf_sk_release()
and update comments as per Martin KaFai Lau's suggestion. One small change to
Daniels suggestion, put "sk = sk2" under "if (sk2 != sk)" to avoid an extra
instruction.

Fixes: f7355a6c0497 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()")
Fixes: edbf8c01de5a ("bpf: add skc_lookup_tcp helper")
Co-developed-by: Antoine Tenart <atenart@kernel.org>
Change-Id: I4867220345184d2c036b76988b3267f5bc6aeb63
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/56d6f898-bde0-bb25-3427-12a330b29fb8@iogearbox.net
Link: https://lore.kernel.org/bpf/20220615011540.813025-1-jmaxwell37@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:49 +05:30
Maxim Mikityanskiy
6f28e767ab UPSTREAM: bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
[ Upstream commit 2e8702cc0cfa1080f29fd64003c00a3e24ac38de ]

bpf_tcp_gen_syncookie looks at the IP version in the IP header and
validates the address family of the socket. It supports IPv4 packets in
AF_INET6 dual-stack sockets.

On the other hand, bpf_tcp_check_syncookie looks only at the address
family of the socket, ignoring the real IP version in headers, and
validates only the packet size. This implementation has some drawbacks:

1. Packets are not validated properly, allowing a BPF program to trick
   bpf_tcp_check_syncookie into handling an IPv6 packet on an IPv4
   socket.

2. Dual-stack sockets fail the checks on IPv4 packets. IPv4 clients end
   up receiving a SYNACK with the cookie, but the following ACK gets
   dropped.

This patch fixes these issues by changing the checks in
bpf_tcp_check_syncookie to match the ones in bpf_tcp_gen_syncookie. IP
version from the header is taken into account, and it is validated
properly with address family.

Fixes: 399040847084 ("bpf: add helper to check for a valid SYN cookie")
Change-Id: Id130e436818a58b9775e61f9d11a029561b1e7a4
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Arthur Fabre <afabre@cloudflare.com>
Link: https://lore.kernel.org/bpf/20220406124113.2795730-1-maximmi@nvidia.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:49 +05:30
Jakub Sitnicki
1afd149e04 UPSTREAM: bpf: Make dst_port field in struct bpf_sock 16-bit wide
[ Upstream commit 4421a582718ab81608d8486734c18083b822390d ]

Menglong Dong reports that the documentation for the dst_port field in
struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the
field is a zero-padded 16-bit integer in network byte order. The value
appears to the BPF user as if laid out in memory as so:

  offsetof(struct bpf_sock, dst_port) + 0  <port MSB>
                                      + 8  <port LSB>
                                      +16  0x00
                                      +24  0x00

32-, 16-, and 8-bit wide loads from the field are all allowed, but only if
the offset into the field is 0.

32-bit wide loads from dst_port are especially confusing. The loaded value,
after converting to host byte order with bpf_ntohl(dst_port), contains the
port number in the upper 16-bits.

Remove the confusion by splitting the field into two 16-bit fields. For
backward compatibility, allow 32-bit wide loads from offsetof(struct
bpf_sock, dst_port).

While at it, allow loads 8-bit loads at offset [0] and [1] from dst_port.

Reported-by: Menglong Dong <imagedong@tencent.com>
Change-Id: Id86817d538b4f552ca112639c0a40fb2d8bd9eb9
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:48 +05:30
Felix Maurer
171f00c51c UPSTREAM: bpf: Do not try bpf_msg_push_data with len 0
commit 4a11678f683814df82fca9018d964771e02d7e6d upstream.

If bpf_msg_push_data() is called with len 0 (as it happens during
selftests/bpf/test_sockmap), we do not need to do anything and can
return early.

Calling bpf_msg_push_data() with len 0 previously lead to a wrong ENOMEM
error: we later called get_order(copy + len); if len was 0, copy + len
was also often 0 and get_order() returned some undefined value (at the
moment 52). alloc_pages() caught that and failed, but then bpf_msg_push_data()
returned ENOMEM. This was wrong because we are most probably not out of
memory and actually do not need any additional memory.

Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
Change-Id: I7d37f4ebfd344bb8c372a3376f33e02afef024e9
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/df69012695c7094ccb1943ca02b4920db3537466.1644421921.git.fmaurer@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-28 18:41:48 +05:30
Kuniyuki Iwashima
3c7b131958 UPSTREAM: bpf: Fix SO_RCVBUF/SO_SNDBUF handling in _bpf_setsockopt().
[ Upstream commit 04c350b1ae6bdb12b84009a4d0bf5ab4e621c47b ]

The commit 4057765f2dee ("sock: consistent handling of extreme
SO_SNDBUF/SO_RCVBUF values") added a change to prevent underflow
in setsockopt() around SO_SNDBUF/SO_RCVBUF.

This patch adds the same change to _bpf_setsockopt().

Fixes: 4057765f2dee ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values")
Change-Id: I797792499bbaee6cc5058519bfb6e33633d76af5
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220104013153.97906-2-kuniyu@amazon.co.jp
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:47 +05:30
John Fastabend
910788a608 UPSTREAM: bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
[ Upstream commit e0dc3b93bd7bcff8c3813d1df43e0908499c7cf0 ]

Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.

But, more importantly strparser is using the following to access its metadata.

  (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

Where _strp_msg is defined as:

  struct _strp_msg {
        struct strp_msg            strp;                 /*     0     8 */
        int                        accum_len;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 2 */
        /* last cacheline: 12 bytes */
  };

So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.

In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Change-Id: I2512e377df9d6f73136b43e59c364c6c87be5d1d
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-28 18:41:46 +05:30