Changes in 5.15.64
wifi: rtlwifi: remove always-true condition pointed out by GCC 12
eth: sun: cassini: remove dead code
audit: fix potential double free on error path from fsnotify_add_inode_mark
cgroup: Fix race condition at rebind_subsystems()
parisc: Make CONFIG_64BIT available for ARCH=parisc64 only
parisc: Fix exception handler for fldw and fstw instructions
kernel/sys_ni: add compat entry for fadvise64_64
x86/entry: Move CLD to the start of the idtentry macro
block: add a bdev_max_zone_append_sectors helper
block: add bdev_max_segments() helper
btrfs: zoned: revive max_zone_append_bytes
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
btrfs: convert count_max_extents() to use fs_info->max_extent_size
Input: i8042 - move __initconst to fix code styling warning
Input: i8042 - merge quirk tables
Input: i8042 - add TUXEDO devices to i8042 quirk tables
Input: i8042 - add additional TUXEDO devices to i8042 quirk tables
drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist
scsi: qla2xxx: Fix response queue handler reading stale packets
scsi: qla2xxx: edif: Fix dropped IKE message
btrfs: put initial index value of a directory in a constant
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc()
riscv: lib: uaccess: fold fixups into body
riscv: lib: uaccess: fix CSR_STATUS SR_SUM bit
xfrm: fix refcount leak in __xfrm_policy_check()
xfrm: clone missing x->lastused in xfrm_do_migrate
af_key: Do not call xfrm_probe_algs in parallel
xfrm: policy: fix metadata dst->dev xmit null pointer dereference
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
net: use eth_hw_addr_set() instead of ether_addr_copy()
Revert "net: macsec: update SCI upon MAC address change."
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4.2 fix problems with __nfs42_ssc_open
SUNRPC: RPC level errors should set task->tk_rpc_status
mm/smaps: don't access young/dirty bit if pte unpresent
ntfs: fix acl handling
rose: check NULL rose_loopback_neigh->loopback
r8152: fix the units of some registers for RTL8156A
r8152: fix the RX FIFO settings when suspending
nfc: pn533: Fix use-after-free bugs caused by pn532_cmd_timeout
ice: xsk: Force rings to be sized to power of 2
ice: xsk: prohibit usage of non-balanced queue id
net/mlx5e: Properly disable vlan strip on non-UL reps
net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
net/mlx5e: Fix wrong application of the LRO state
net/mlx5e: Fix wrong tc flag used when set hw-tc-offload off
net: ipa: don't assume SMEM is page-aligned
net: phy: Don't WARN for PHY_READY state in mdio_bus_phy_resume()
net: moxa: get rid of asymmetry in DMA mapping/unmapping
bonding: 802.3ad: fix no transmission of LACPDUs
net: ipvtap - add __init/__exit annotations to module init/exit funcs
netfilter: ebtables: reject blobs that don't provide all entry points
bnxt_en: fix NQ resource accounting during vf creation on 57500 chips
netfilter: nf_tables: disallow updates of implicit chain
netfilter: nf_tables: make table handle allocation per-netns friendly
netfilter: nft_payload: report ERANGE for too long offset and length
netfilter: nft_payload: do not truncate csum_offset and csum_type
netfilter: nf_tables: do not leave chain stats enabled on error
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
netfilter: nft_tunnel: restrict it to netdev family
netfilter: nf_tables: consolidate rule verdict trace call
netfilter: nft_cmp: optimize comparison for 16-bytes
netfilter: bitwise: improve error goto labels
netfilter: nf_tables: upfront validation of data via nft_data_init()
netfilter: nf_tables: disallow jump to implicit chain from set element
netfilter: nf_tables: disallow binding to already bound chain
netfilter: flowtable: add function to invoke garbage collection immediately
netfilter: flowtable: fix stuck flows on cleanup due to pending work
net: Fix data-races around sysctl_[rw]mem_(max|default).
net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
net: Fix data-races around netdev_max_backlog.
net: Fix data-races around netdev_tstamp_prequeue.
ratelimit: Fix data-races in ___ratelimit().
net: Fix data-races around sysctl_optmem_max.
net: Fix a data-race around sysctl_tstamp_allow_data.
net: Fix a data-race around sysctl_net_busy_poll.
net: Fix a data-race around sysctl_net_busy_read.
net: Fix a data-race around netdev_budget.
tcp: expose the tcp_mark_push() and tcp_skb_entail() helpers
mptcp: stop relying on tcp_tx_skb_cache
net: Fix data-races around sysctl_max_skb_frags.
net: Fix a data-race around netdev_budget_usecs.
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
net: Fix data-races around sysctl_devconf_inherit_init_net.
net: Fix a data-race around sysctl_somaxconn.
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
i40e: Fix incorrect address type for IPv6 flow rules
rxrpc: Fix locking in rxrpc's sendmsg
ionic: widen queue_lock use around lif init and deinit
ionic: clear broken state on generation change
ionic: fix up issues with handling EAGAIN on FW cmds
ionic: VF initial random MAC address if no assigned mac
net: stmmac: work around sporadic tx issue on link-up
btrfs: fix silent failure when deleting root reference
btrfs: replace: drop assert for suspended replace
btrfs: add info when mount fails due to stale replace target
btrfs: check if root is readonly while setting security xattr
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
perf/x86/lbr: Enable the branch type for the Arch LBR by default
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
x86/bugs: Add "unknown" reporting for MMIO Stale Data
x86/nospec: Unwreck the RSB stuffing
loop: Check for overflow while configuring loop
writeback: avoid use-after-free after removing device
asm-generic: sections: refactor memory_intersects
mm/damon/dbgfs: avoid duplicate context directory creation
s390/mm: do not trigger write fault when vma does not allow VM_WRITE
bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
s390: fix double free of GS and RI CBs on fork() failure
fbdev: fbcon: Properly revert changes when vc_resize() failed
Revert "memcg: cleanup racy sum avoidance code"
ACPI: processor: Remove freq Qos request for all CPUs
nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf
smb3: missing inode locks in punch hole
xen/privcmd: fix error exit of privcmd_ioctl_dm_op()
riscv: traps: add missing prototype
io_uring: fix issue with io_write() not always undoing sb_start_write()
Revert "usbnet: smsc95xx: Fix deadlock on runtime resume"
Revert "usbnet: smsc95xx: Forward PHY interrupts to PHY driver to avoid polling"
mm/hugetlb: fix hugetlb not supporting softdirty tracking
Revert "md-raid: destroy the bitmap after destroying the thread"
md: call __md_stop_writes in md_stop
mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skb
arm64: Fix match_list for erratum 1286807 on Arm Cortex-A76
binder_alloc: add missing mmap_lock calls when using the VMA
x86/nospec: Fix i386 RSB stuffing
Documentation/ABI: Mention retbleed vulnerability info file for sysfs
blk-mq: fix io hung due to missing commit_rqs
perf python: Fix build when PYTHON_CONFIG is user supplied
perf/x86/intel/uncore: Fix broken read_counter() for SNB IMC PMU
perf/x86/intel/ds: Fix precise store latency handling
perf stat: Clear evsel->reset_group for each stat run
scsi: ufs: core: Enable link lost interrupt
scsi: storvsc: Remove WQ_MEM_RECLAIM from storvsc_error_wq
bpf: Don't use tnum_range on array range checking for poke descriptors
Linux 5.15.64
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iaba96c173ad668df1c20b3bee08ce0e34f1068e1
[ Upstream commit a5612ca10d1aa05624ebe72633e0c8c792970833 ]
While reading sysctl_devconf_inherit_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 856c395cfa ("net: introduce a knob to control whether to inherit devconf config")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af67508ea6cbf0e4ea27f8120056fa2efce127dd ]
While reading sysctl_fb_tunnels_only_for_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 79134e6ce2 ("net: do not create fallback tunnels for non-default namespaces")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.15.53
Revert "drm/amdgpu/display: set vblank_disable_immediate for DC"
drm/amdgpu: To flush tlb for MMHUB of RAVEN series
ksmbd: set the range of bytes to zero without extending file size in FSCTL_ZERO_DATA
ksmbd: check invalid FileOffset and BeyondFinalZero in FSCTL_ZERO_DATA
ksmbd: use vfs_llseek instead of dereferencing NULL
ipv6: take care of disable_policy when restoring routes
net: phy: Don't trigger state machine while in suspend
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG SX6000LNP (AKA SPECTRIX S40G)
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA IM2P33F8ABR1
nvdimm: Fix badblocks clear off-by-one error
powerpc/prom_init: Fix kernel config grep
powerpc/book3e: Fix PUD allocation size in map_kernel_page()
powerpc/bpf: Fix use of user_pt_regs in uapi
dm raid: fix accesses beyond end of raid member array
dm raid: fix KASAN warning in raid5_add_disks
s390/archrandom: simplify back to earlier design and initialize earlier
SUNRPC: Fix READ_PLUS crasher
net: rose: fix UAF bugs caused by timer handler
net: usb: ax88179_178a: Fix packet receiving
virtio-net: fix race between ndo_open() and virtio_device_ready()
selftests/net: pass ipv6_args to udpgso_bench's IPv6 TCP test
net: dsa: bcm_sf2: force pause link settings
net: tun: unlink NAPI from device on destruction
net: tun: stop NAPI when detaching queues
net: dp83822: disable false carrier interrupt
net: dp83822: disable rx error interrupt
RDMA/qedr: Fix reporting QP timeout attribute
RDMA/cm: Fix memory leak in ib_cm_insert_listen
linux/dim: Fix divide by 0 in RDMA DIM
net: usb: asix: do not force pause frames support
usbnet: fix memory allocation in helpers
selftests: mptcp: more stable diag tests
net: ipv6: unexport __init-annotated seg6_hmac_net_init()
NFSD: restore EINVAL error translation in nfsd_commit()
vfs: fix copy_file_range() regression in cross-fs copies
caif_virtio: fix race between virtio_device_ready() and ndo_open()
PM / devfreq: exynos-ppmu: Fix refcount leak in of_get_devfreq_events
vdpa/mlx5: Update Control VQ callback information
s390: remove unneeded 'select BUILD_BIN2C'
netfilter: nft_dynset: restore set element counter when failing to update
net/dsa/hirschmann: Add missing of_node_get() in hellcreek_led_setup()
net/sched: act_api: Notify user space if any actions were flushed before error
net: asix: fix "can't send until first packet is send" issue
net: bonding: fix possible NULL deref in rlb code
net: phy: ax88772a: fix lost pause advertisement configuration
net: bonding: fix use-after-free after 802.3ad slave unbind
powerpc/memhotplug: Add add_pages override for PPC
nfc: nfcmrvl: Fix irq_of_parse_and_map() return value
NFC: nxp-nci: Don't issue a zero length i2c_master_read()
tipc: move bc link creation back to tipc_node_create
epic100: fix use after free on rmmod
io_uring: ensure that send/sendmsg and recv/recvmsg check sqe->ioprio
ACPI: video: Change how we determine if brightness key-presses are handled
tunnels: do not assume mac header is set in skb_tunnel_check_pmtu()
ipv6/sit: fix ipip6_tunnel_get_prl return value
ipv6: fix lockdep splat in in6_dump_addrs()
mlxsw: spectrum_router: Fix rollback in tunnel next hop init
net: tun: avoid disabling NAPI twice
MAINTAINERS: add Leah as xfs maintainer for 5.15.y
tcp: add a missing nf_reset_ct() in 3WHS handling
selftests/bpf: Add test_verifier support to fixup kfunc call insns
selftests/rseq: remove ARRAY_SIZE define from individual tests
selftests/rseq: introduce own copy of rseq uapi header
selftests/rseq: Remove useless assignment to cpu variable
selftests/rseq: Remove volatile from __rseq_abi
selftests/rseq: Introduce rseq_get_abi() helper
selftests/rseq: Introduce thread pointer getters
selftests/rseq: Uplift rseq selftests for compatibility with glibc-2.35
selftests/rseq: Fix ppc32: wrong rseq_cs 32-bit field pointer on big endian
selftests/rseq: Fix ppc32 missing instruction selection "u" and "x" for load/store
selftests/rseq: Fix ppc32 offsets by using long rather than off_t
selftests/rseq: Fix warnings about #if checks of undefined tokens
selftests/rseq: Remove arm/mips asm goto compiler work-around
selftests/rseq: Fix: work-around asm goto compiler bugs
selftests/rseq: x86-64: use %fs segment selector for accessing rseq thread area
selftests/rseq: x86-32: use %gs segment selector for accessing rseq thread area
selftests/rseq: Change type of rseq_offset to ptrdiff_t
xen/blkfront: fix leaking data in shared pages
xen/netfront: fix leaking data in shared pages
xen/netfront: force data bouncing when backend is untrusted
xen/blkfront: force data bouncing when backend is untrusted
xen-netfront: restore __skb_queue_tail() positioning in xennet_get_responses()
xen/arm: Fix race in RB-tree based P2M accounting
net: usb: qmi_wwan: add Telit 0x1070 composition
clocksource/drivers/ixp4xx: remove EXPORT_SYMBOL_GPL from ixp4xx_timer_setup()
fsi: occ: Force sequence numbering per OCC
net: fix IFF_TX_SKB_NO_LINEAR definition
drm/i915/gem: add missing else
drm/msm/gem: Fix error return on fence id alloc fail
drivers: cpufreq: Add missing of_node_put() in qoriq-cpufreq.c
platform/x86: panasonic-laptop: de-obfuscate button codes
platform/x86: panasonic-laptop: sort includes alphabetically
platform/x86: panasonic-laptop: revert "Resolve hotkey double trigger bug"
platform/x86: panasonic-laptop: don't report duplicate brightness key-presses
platform/x86: panasonic-laptop: filter out duplicate volume up/down/mute keypresses
drm/fourcc: fix integer type usage in uapi header
hwmon: (occ) Remove sequence numbering and checksum calculation
hwmon: (occ) Prevent power cap command overwriting poll response
hwmon: (ibmaem) don't call platform_device_del() if platform_device_add() fails
Linux 5.15.53
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia725fa748b108ad71b6d90bdf7d704aa269c5976
Try to mitigate potential future driver core api changes by adding a
padding to a lot of different networking structures:
struct ipv6_devconf
struct proto_ops
struct header_ops
struct napi_struct
struct netdev_queue
struct netdev_rx_queue
struct xfrmdev_ops
struct net_device_ops
struct net_device
struct packet_type
struct sk_buff
struct tlsdev_ops
Based on a change made to the RHEL/CENTOS 8 kernel.
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I590f004754dbc8beafa40e71cac70a0938c38b4a
[ Upstream commit cf2df74e202d81b09f09d84c2d8903e0e87e9274 ]
When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx
Fixes: f6efc675c9 ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit f96b2e0672)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I08f671d1cf16862b024fc940c45c6887ac8cdd2c
[ Upstream commit cf2df74e202d81b09f09d84c2d8903e0e87e9274 ]
When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx
Fixes: f6efc675c9 ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 47934e06b65637c88a762d9c98329ae6e3238888 upstream.
In one net namespace, after creating a packet socket without binding
it to a device, users in other net namespaces can observe the new
`packet_type` added by this packet socket by reading `/proc/net/ptype`
file. This is minor information leakage as packet socket is
namespace aware.
Add a net pointer in `packet_type` to keep the net namespace of
of corresponding packet socket. In `ptype_seq_show`, this net pointer
must be checked when it is not NULL.
Fixes: 2feb27dbe0 ("[NETNS]: Minor information leak via /proc/net/ptype file.")
Signed-off-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Recent work on converting address list to a tree made it obvious
we need an abstraction around writing netdev->dev_addr. Without
such abstraction updating the main device address is invisible
to the core.
Introduce a number of helpers which for now just wrap memcpy()
but in the future can make necessary changes to the address
tree.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common implementation of isatty(3) involves calling a ioctl passing
a dummy struct argument and checking whether the syscall failed --
bionic and glibc use TCGETS (passing a struct termios), and musl uses
TIOCGWINSZ (passing a struct winsize). If the FD is a socket, we will
copy sizeof(struct ifreq) bytes of data from the argument and return
-EFAULT if that fails. The result is that the isatty implementations
may return a non-POSIX-compliant value in errno in the case where part
of the dummy struct argument is inaccessible, as both struct termios
and struct winsize are smaller than struct ifreq (at least on arm64).
Although there is usually enough stack space following the argument
on the stack that this did not present a practical problem up to now,
with MTE stack instrumentation it's more likely for the copy to fail,
as the memory following the struct may have a different tag.
Fix the problem by adding an early check for whether the ioctl is a
valid socket ioctl, and return -ENOTTY if it isn't.
Fixes: 44c02a2c3d ("dev_ioctl(): move copyin/copyout to callers")
Link: https://linux-review.googlesource.com/id/I869da6cf6daabc3e4b7b82ac979683ba05e27d4d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: <stable@vger.kernel.org> # 4.19
Signed-off-by: David S. Miller <davem@davemloft.net>
The source of most of the slow down is the `dev_addr_lists.c` module,
which mainatins a linked list of HW addresses.
When using IPv6, this list grows for each IPv6 address added on a
VLAN, since each IPv6 address has a multicast HW address associated with
it.
When performing any modification to the involved links, this list is
traversed many times, often for nothing, all while holding the RTNL
lock.
Instead, this patch adds an auxilliary rbtree which cuts down
traversal time significantly.
Performance can be seen with the following script:
#!/bin/bash
ip netns del test || true 2>/dev/null
ip netns add test
echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null
set -e
ip -n test link add foo type veth peer name bar
ip -n test link add b1 type bond
ip -n test link add florp type vrf table 10
ip -n test link set bar master b1
ip -n test link set foo up
ip -n test link set bar up
ip -n test link set b1 up
ip -n test link set florp up
VLAN_COUNT=1500
BASE_DEV=b1
echo Creating vlans
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done"
echo Bringing them up
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i up; done"
echo Assiging IPv6 Addresses
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test address add dev foo.\$i 2000::\$i/64; done"
echo Attaching to VRF
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i master florp; done"
On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance
before the patch is (truncated):
Creating vlans
real 108.35
Bringing them up
real 4.96
Assiging IPv6 Addresses
real 19.22
Attaching to VRF
real 458.84
After the patch:
Creating vlans
real 5.59
Bringing them up
real 5.07
Assiging IPv6 Addresses
real 5.64
Attaching to VRF
real 25.37
Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Lu Wei <luwei32@huawei.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.
The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
Add the case if dev is NULL in dev_{put, hold}, so the caller doesn't
need to care whether dev is NULL or not.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
can fail which breaks drivers trying to implement reconfiguration
in a way that can't leave the device half-broken. In other words
those functions are incompatible with prepare/commit approach.
Luckily setting real number of queues can fail only if the number
is increased, meaning that if we order operations correctly we
can guarantee ending up with either new config (success), or
the old one (on error).
Provide a helper implementing such logic so that drivers don't
have to duplicate it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is now only used by a handful of old ISA drivers,
and can be moved into the file they already all depend on.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds the infrastructure for managing MCTP netdevices; we add
a pointer to the AF_MCTP-specific data to struct netdevice, and hook up
the rtnetlink operations for adding and removing addresses.
Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
All other user triggered operations are gone from ndo_ioctl, so move
the SIOCBOND family into a custom operation as well.
The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now,
but there are still a few definitions in obsolete wireless drivers as well
as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR
helpers from inside the kernel.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to further reduce the scope of ndo_do_ioctl(), move
out the SIOCWANDEV handling into a new network device operation
function.
Adjust the prototype to only pass the if_settings sub-structure
in place of the ifreq, and remove the redundant 'cmd' argument
in the process.
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz>
Cc: Kevin Curtis <kevin.curtis@farsite.co.uk>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: linux-x25@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compat handlers for SIOCDEVPRIVATE are incorrect for any driver that
passes data as part of struct ifreq rather than as an ifr_data pointer, or
that passes data back this way, since the compat_ifr_data_ioctl() helper
overwrites the ifr_data pointer and does not copy anything back out.
Since all drivers using devprivate commands are now converted to the
new .ndo_siocdevprivate callback, fix this by adding the missing piece
and passing the pointer separately the whole way.
This further unifies the native and compat logic for socket ioctls,
as the new code now passes the correct pointer as well as the correct
data for both native and compat ioctls.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCDEVPRIVATE ioctl commands are mainly used in really old
drivers, and they have a number of problems:
- They hide behind the normal .ndo_do_ioctl function that
is also used for other things in modern drivers, so it's
hard to spot a driver that actually uses one of these
- Since drivers use a number different calling conventions,
it is impossible to support compat mode for them in
a generic way.
- With all drivers using the same 16 commands codes, there
is no way to introspect the data being passed through
things like strace.
Add a new net_device_ops callback pointer, to address the
first two of these. Separating them from .ndo_do_ioctl
makes it easy to grep for drivers with a .ndo_siocdevprivate
callback, and the unwieldy name hopefully makes it easier
to spot in code review.
By passing the ifreq structure and the ifr_data pointer
separately, it is no longer necessary to overload these,
and the driver can use either one for a given command.
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
compat_ifreq_ioctl() is one of the last users of copy_in_user() and
compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
couple of socket family specific interpretations.
The current implementation works correctly when calling dev_ioctl(),
inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
and packet_ioctl(). The ioctl handlers for x25, netrom, rose and x25 do
not interpret the arguments and only block the corresponding commands,
so they do not care.
For af_inet6 and af_decnet however, the compat conversion is slightly
incorrect, as it will copy more data than the native handler accesses,
both of them use a structure that is shorter than ifreq.
Replace the copy_in_user() conversion with a pair of accessor functions
to read and write the ifreq data in place with the correct length where
needed, while leaving the other ones to copy the (already compatible)
structures directly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_ifconf() calling conventions make compat handling
more complicated than necessary, simplify this by moving
the in_compat_syscall() check into the function.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since dynamic registration of the gifconf() helper is only used for
IPv4, and this can not be in a loadable module, this can be simplified
noticeably by turning it into a direct function call as a preparation
for cleaning up the compat handling.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper can later be utilized in code that runs cpumap and devmap
programs in generic redirect mode and adjust skb based on changes made
to xdp_buff.
When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
generic redirect path invokes devmap/cpumap prog if set, it must
__skb_pull again as we expect mac header to be pulled.
It also drops the skb_reset_mac_len call after do_xdp_generic, as the
mac_header and network_header are advanced by the same offset, so the
difference (mac_len) remains constant.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The goal is to keep the mark during a bpf_redirect(), like it is done for
legacy encapsulation / decapsulation, when there is no x-netns.
This was initially done in commit 213dd74aee ("skbuff: Do not scrub skb
mark within the same name space").
When the call to skb_scrub_packet() was added in dev_forward_skb() (commit
8b27f27797 ("skb: allow skb_scrub_packet() to be used by tunnels")), the
second argument (xnet) was set to true to force a call to skb_orphan(). At
this time, the mark was always cleanned up by skb_scrub_packet(), whatever
xnet value was.
This call to skb_orphan() was removed later in commit
9c4c325252 ("skbuff: preserve sock reference when scrubbing the skb.").
But this 'true' stayed here without any real reason.
Let's correctly set xnet in ____dev_forward_skb(), this function has access
to the previous interface and to the new interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
breaks link flap detection based on Linux carrier state transition
count as netif_carrier_on() does nothing if carrier is already on.
Make sure we count such events.
netif_carrier_event() increments the counters and fires the linkwatch
events. The latter is not necessary for the use case but seems like
the right thing to do.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Here is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.
Fixes: eeb85a14ee ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we can specify ifindex on link creation. This change allows
to specify ifindex when a device is moved to another network namespace.
Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.
CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they have before C/R.
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch might have already added the VLAN tag through PVID hardware
offload. Keep this extra VLAN in the flowtable but skip it on egress.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for dsa slave port devices
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass on the PPPoE session ID, destination hardware address and the real
device.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depending on the VLAN settings of the bridge and the port, the bridge can
either add or remove a tag. When vlan filtering is enabled, the fdb lookup
also needs to know the VLAN tag/proto for the destination address
To provide this, keep track of the stack of VLAN tags for the path in the
lookup context
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for bridge devices.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for vlan devices.
For instance, assuming the following topology:
IP forwarding
/ \
eth0.100 eth0
|
eth0
.
.
.
ethX
ab:cd:ef:ab:cd:ef
For packets going through IP forwarding to eth0.100 whose destination
MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:
eth0.100 -> eth0
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.
For instance, assuming the following topology:
IP forwarding
/ \
br0 eth0
/ \
eth1 eth2
.
.
.
ethX
ab:cd:ef:ab:cd:ef
where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.
For packets going through IP forwarding to br0 whose destination MAC
address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:
br0 -> eth1
.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.
This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.
fast path
.------------------------.
/ \
| IP forwarding |
| / \ \/
| br0 eth0
. / \
-> eth1 eth2
.
.
.
ethX
ab:cd:ef:ab:cd:ef
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While 10 second wait generally should not happen
under normal workload in normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.
When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4
Fix is to change initial (and final) refcount to be 1.
Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.
Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ptype_all and ptype_base are declared in net/core/dev.c as non-static,
because they are used by net-procfs.c too. However, a "make W=1" build
complains that there was no previous declaration of ptype_all and
ptype_base in a header file, so this way of declaring things constitutes
a violation of coding style.
Let's move the extern declarations of ptype_all and ptype_base to the
linux/netdevice.h file, which is included by net-procfs.c too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to set the dynamic queue limit minimum value.
Some specific drivers might have legitimate reasons to configure
dql.min_limit to a given value. Typically, this is the case when the
PDU of the protocol is smaller than the packet size to used to
carry those frames to the device.
Concrete example: a CAN (Control Area Network) device with an USB 2.0
interface. The PDU of classical CAN protocol are roughly 16 bytes but
the USB packet size (which is used to carry the CAN frames to the
device) might be up to 512 bytes. Wen small traffic burst occurs, BQL
algorithm is not able to immediately adjust and this would result in
having to send many small USB packets (i.e packet of 16 bytes for each
CAN frame). Filling up the USB packet with CAN frames is relatively
fast (small latency issue) but the gain of not having to send several
small USB packets is huge (big throughput increase). In this case,
forcing dql.min_limit to a given value that would allow to stuff the
USB packet is always a win.
This function is to be used by network drivers which are able to prove
through a rationale and through empirical tests on several environment
(with other applications, heavy context switching, virtualization...),
that they constantly reach better performances with a specific
predefined dql.min_limit value with no noticeable latency impact.
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1
unregister_netdevice: waiting for sit0 to become free. Usage count = -1
It would be nice if syzbot could trigger a warning at the time
this reference count became negative.
This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.
v2: free_dev label in alloc_netdev_mqs() is moved to avoid
a compiler warning (-Wunused-label), as reported
by kernel test robot <lkp@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>