Changes in 5.15.73
Makefile.extrawarn: Move -Wcast-function-type-strict to W=1
docs: update mediator information in CoC docs
xsk: Inherit need_wakeup flag for shared sockets
mm: gup: fix the fast GUP race against THP collapse
powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
fs: fix UAF/GPF bug in nilfs_mdt_destroy
firmware: arm_scmi: Improve checks in the info_get operations
firmware: arm_scmi: Harden accesses to the sensor domains
firmware: arm_scmi: Add SCMI PM driver remove routine
dmaengine: xilinx_dma: Fix devm_platform_ioremap_resource error handling
dmaengine: xilinx_dma: cleanup for fetching xlnx,num-fstores property
dmaengine: xilinx_dma: Report error in case of dma_set_mask_and_coherent API failure
ARM: dts: fix Moxa SDIO 'compatible', remove 'sdhci' misnomer
scsi: qedf: Fix a UAF bug in __qedf_probe()
net/ieee802154: fix uninit value bug in dgram_sendmsg
net: marvell: prestera: add support for Aldrin2
ALSA: hda/hdmi: Fix the converter reuse for the silent stream
um: Cleanup syscall_handler_t cast in syscalls_32.h
um: Cleanup compiler warning in arch/x86/um/tls_32.c
arch: um: Mark the stack non-executable to fix a binutils warning
net: atlantic: fix potential memory leak in aq_ndev_close()
drm/amd/display: Fix double cursor on non-video RGB MPO
drm/amd/display: Assume an LTTPR is always present on fixed_vs links
drm/amd/display: update gamut remap if plane has changed
drm/amd/display: skip audio setup when audio stream is enabled
mmc: core: Replace with already defined values for readability
mmc: core: Terminate infinite loop in SD-UHS voltage switch
perf parse-events: Identify broken modifiers
mm/huge_memory: minor cleanup for split_huge_pages_all
mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
wifi: cfg80211: fix MCS divisor value
net/mlx5: Disable irq when locking lag_lock
usb: mon: make mmapped memory read only
USB: serial: ftdi_sio: fix 300 bps rate for SIO
rpmsg: qcom: glink: replace strncpy() with strscpy_pad()
Revert "clk: ti: Stop using legacy clkctrl names for omap4 and 5"
Linux 5.15.73
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id86ceac9e522f6289ba84ee3439638d70e24376e
Changes in 5.15.54
mm/slub: add missing TID updates on slab deactivation
mm/filemap: fix UAF in find_lock_entries
Revert "selftests/bpf: Add test for bpf_timer overwriting crash"
ALSA: usb-audio: Workarounds for Behringer UMC 204/404 HD
ALSA: hda/realtek: Add quirk for Clevo L140PU
ALSA: cs46xx: Fix missing snd_card_free() call at probe error
can: bcm: use call_rcu() instead of costly synchronize_rcu()
can: grcan: grcan_probe(): remove extra of_node_get()
can: gs_usb: gs_usb_open/close(): fix memory leak
can: m_can: m_can_chip_config(): actually enable internal timestamping
can: m_can: m_can_{read_fifo,echo_tx_event}(): shift timestamp to full 32 bits
can: mcp251xfd: mcp251xfd_regmap_crc_read(): improve workaround handling for mcp2517fd
can: mcp251xfd: mcp251xfd_regmap_crc_read(): update workaround broken CRC on TBC register
bpf: Fix incorrect verifier simulation around jmp32's jeq/jne
bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals
usbnet: fix memory leak in error case
net: rose: fix UAF bug caused by rose_t0timer_expiry
netfilter: nft_set_pipapo: release elements in clone from abort path
netfilter: nf_tables: stricter validation of element data
btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk
btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
btrfs: fix invalid delayed ref after subvolume creation failure
btrfs: fix warning when freeing leaf after subvolume creation failure
Input: cpcap-pwrbutton - handle errors from platform_get_irq()
Input: goodix - change goodix_i2c_write() len parameter type to int
Input: goodix - add a goodix.h header file
Input: goodix - refactor reset handling
Input: goodix - try not to touch the reset-pin on x86/ACPI devices
dma-buf/poll: Get a file reference for outstanding fence callbacks
btrfs: fix deadlock between chunk allocation and chunk btree modifications
drm/i915: Disable bonding on gen12+ platforms
drm/i915/gt: Register the migrate contexts with their engines
drm/i915: Replace the unconditional clflush with drm_clflush_virt_range()
PCI/portdrv: Rename pm_iter() to pcie_port_device_iter()
PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
media: ir_toy: prevent device from hanging during transmit
memory: renesas-rpc-if: Avoid unaligned bus access for HyperFlash
ath11k: add hw_param for wakeup_mhi
qed: Improve the stack space of filter_config()
platform/x86: wmi: introduce helper to convert driver to WMI driver
platform/x86: wmi: Replace read_takes_no_args with a flags field
platform/x86: wmi: Fix driver->notify() vs ->probe() race
mt76: mt7921: get rid of mt7921_mac_set_beacon_filter
mt76: mt7921: introduce mt7921_mcu_set_beacon_filter utility routine
mt76: mt7921: fix a possible race enabling/disabling runtime-pm
bpf: Stop caching subprog index in the bpf_pseudo_func insn
bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC
riscv: defconfig: enable DRM_NOUVEAU
RISC-V: defconfigs: Set CONFIG_FB=y, for FB console
net/mlx5e: Check action fwd/drop flag exists also for nic flows
net/mlx5e: Split actions_match_supported() into a sub function
net/mlx5e: TC, Reject rules with drop and modify hdr action
net/mlx5e: TC, Reject rules with forward and drop actions
ASoC: rt5682: Avoid the unexpected IRQ event during going to suspend
ASoC: rt5682: Re-detect the combo jack after resuming
ASoC: rt5682: Fix deadlock on resume
netfilter: nf_tables: convert pktinfo->tprot_set to flags field
netfilter: nft_payload: support for inner header matching / mangling
netfilter: nft_payload: don't allow th access for fragments
s390/boot: allocate amode31 section in decompressor
s390/setup: use physical pointers for memblock_reserve()
s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
ibmvnic: init init_done_rc earlier
ibmvnic: clear fop when retrying probe
ibmvnic: Allow queueing resets during probe
virtio-blk: avoid preallocating big SGL for data
io_uring: ensure that fsnotify is always called
block: use bdev_get_queue() in bio.c
block: only mark bio as tracked if it really is tracked
block: fix rq-qos breakage from skipping rq_qos_done_bio()
stddef: Introduce struct_group() helper macro
media: omap3isp: Use struct_group() for memcpy() region
media: davinci: vpif: fix use-after-free on driver unbind
mt76: mt76_connac: fix MCU_CE_CMD_SET_ROC definition error
mt76: mt7921: do not always disable fw runtime-pm
cxl/port: Hold port reference until decoder release
clk: renesas: r9a07g044: Update multiplier and divider values for PLL2/3
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook
scsi: qla2xxx: Move heartbeat handling from DPC thread to workqueue
scsi: qla2xxx: Fix laggy FC remote port session recovery
scsi: qla2xxx: edif: Replace list_for_each_safe with list_for_each_entry_safe
scsi: qla2xxx: Fix crash during module load unload test
gfs2: Fix gfs2_file_buffered_write endless loop workaround
vdpa/mlx5: Avoid processing works if workqueue was destroyed
btrfs: handle device lookup with btrfs_dev_lookup_args
btrfs: add a btrfs_get_dev_args_from_path helper
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
btrfs: remove device item and update super block in the same transaction
drbd: add error handling support for add_disk()
drbd: Fix double free problem in drbd_create_device
drbd: fix an invalid memory access caused by incorrect use of list iterator
drm/amd/display: Set min dcfclk if pipe count is 0
drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw
NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
NFSD: COMMIT operations must not return NFS?ERR_INVAL
riscv/mm: Add XIP_FIXUP for riscv_pfn_base
iio: accel: mma8452: use the correct logic to get mma8452_data
batman-adv: Use netif_rx().
mtd: spi-nor: Skip erase logic when SPI_NOR_NO_ERASE is set
Compiler Attributes: add __alloc_size() for better bounds checking
mm: vmalloc: introduce array allocation functions
KVM: use __vcalloc for very large allocations
btrfs: don't access possibly stale fs_info data in device_list_add
KVM: s390x: fix SCK locking
scsi: qla2xxx: Fix loss of NVMe namespaces after driver reload test
powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
powerpc: flexible GPR range save/restore macros
powerpc/tm: Fix more userspace r13 corruption
serial: sc16is7xx: Clear RS485 bits in the shutdown
bus: mhi: core: Use correctly sized arguments for bit field
bus: mhi: Fix pm_state conversion to string
stddef: Introduce DECLARE_FLEX_ARRAY() helper
uapi/linux/stddef.h: Add include guards
ASoC: rt5682: move clk related code to rt5682_i2c_probe
ASoC: rt5682: fix an incorrect NULL check on list iterator
drm/amd/vcn: fix an error msg on vcn 3.0
KVM: Don't create VM debugfs files outside of the VM directory
tty: n_gsm: Modify CR,PF bit when config requester
tty: n_gsm: Save dlci address open status when config requester
tty: n_gsm: fix frame reception handling
ALSA: usb-audio: add mapping for MSI MPG X570S Carbon Max Wifi.
ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
tty: n_gsm: fix missing update of modem controls after DLCI open
btrfs: zoned: encapsulate inode locking for zoned relocation
btrfs: zoned: use dedicated lock for data relocation
KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
mm/hwpoison: mf_mutex for soft offline and unpoison
mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
mm/memory-failure.c: fix race with changing page compound again
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
tty: n_gsm: fix invalid use of MSC in advanced option
tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output()
serial: 8250_mtk: Make sure to select the right FEATURE_SEL
tty: n_gsm: fix invalid gsmtty_write_room() result
drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
drm/i915: Fix a race between vma / object destruction and unbinding
drm/mediatek: Use mailbox rx_callback instead of cmdq_task_cb
drm/mediatek: Remove the pointer of struct cmdq_client
drm/mediatek: Detect CMDQ execution timeout
drm/mediatek: Add cmdq_handle in mtk_crtc
drm/mediatek: Add vblank register/unregister callback functions
Bluetooth: protect le accept and resolv lists with hdev->lock
Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event
io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling
irqchip/gic-v3: Refactor ISB + EOIR at ack time
rxrpc: Fix locking issue
dt-bindings: soc: qcom: smd-rpm: Add compatible for MSM8953 SoC
dt-bindings: soc: qcom: smd-rpm: Fix missing MSM8936 compatible
module: change to print useful messages from elf_validity_check()
module: fix [e_shstrndx].sh_size=0 OOB access
iommu/vt-d: Fix PCI bus rescan device hot add
fbdev: fbmem: Fix logo center image dx issue
fbmem: Check virtual screen sizes in fb_set_var()
fbcon: Disallow setting font bigger than screen size
fbcon: Prevent that screen size is smaller than font size
PM: runtime: Redefine pm_runtime_release_supplier()
memregion: Fix memregion_free() fallback definition
video: of_display_timing.h: include errno.h
powerpc/powernv: delay rng platform device creation until later in boot
net: dsa: qca8k: reset cpu port on MTU change
can: kvaser_usb: replace run-time checks with struct kvaser_usb_driver_info
can: kvaser_usb: kvaser_usb_leaf: fix CAN clock frequency regression
can: kvaser_usb: kvaser_usb_leaf: fix bittiming limits
xfs: remove incorrect ASSERT in xfs_rename
Revert "serial: sc16is7xx: Clear RS485 bits in the shutdown"
btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
virtio-blk: modify the value type of num in virtio_queue_rq()
btrfs: fix use of uninitialized variable at rm device ioctl
tty: n_gsm: fix encoding of command/response bit
ARM: meson: Fix refcount leak in meson_smp_prepare_cpus
pinctrl: sunxi: a83t: Fix NAND function name for some pins
ASoC: rt711: Add endianness flag in snd_soc_component_driver
ASoC: rt711-sdca: Add endianness flag in snd_soc_component_driver
ASoC: codecs: rt700/rt711/rt711-sdca: resume bus/codec in .set_jack_detect
arm64: dts: qcom: msm8994: Fix CPU6/7 reg values
arm64: dts: qcom: sdm845: use dispcc AHB clock for mdss node
ARM: mxs_defconfig: Enable the framebuffer
arm64: dts: imx8mp-evk: correct mmc pad settings
arm64: dts: imx8mp-evk: correct the uart2 pinctl value
arm64: dts: imx8mp-evk: correct gpio-led pad settings
arm64: dts: imx8mp-evk: correct vbus pad settings
arm64: dts: imx8mp-evk: correct eqos pad settings
arm64: dts: imx8mp-evk: correct I2C1 pad settings
arm64: dts: imx8mp-evk: correct I2C3 pad settings
arm64: dts: imx8mp-phyboard-pollux-rdk: correct uart pad settings
arm64: dts: imx8mp-phyboard-pollux-rdk: correct eqos pad settings
arm64: dts: imx8mp-phyboard-pollux-rdk: correct i2c2 & mmc settings
pinctrl: sunxi: sunxi_pconf_set: use correct offset
arm64: dts: qcom: msm8992-*: Fix vdd_lvs1_2-supply typo
ARM: at91: pm: use proper compatible for sama5d2's rtc
ARM: at91: pm: use proper compatibles for sam9x60's rtc and rtt
ARM: at91: pm: use proper compatibles for sama7g5's rtc and rtt
ARM: dts: at91: sam9x60ek: fix eeprom compatible and size
ARM: dts: at91: sama5d2_icp: fix eeprom compatibles
ARM: at91: fix soc detection for SAM9X60 SiPs
xsk: Clear page contiguity bit when unmapping pool
i2c: piix4: Fix a memory leak in the EFCH MMIO support
i40e: Fix dropped jumbo frames statistics
i40e: Fix VF's MAC Address change on VM
ARM: dts: stm32: use usbphyc ck_usbo_48m as USBH OHCI clock on stm32mp151
ARM: dts: stm32: add missing usbh clock and fix clk order on stm32mp15
ibmvnic: Properly dispose of all skbs during a failover.
selftests: forwarding: fix flood_unicast_test when h2 supports IFF_UNICAST_FLT
selftests: forwarding: fix learning_test when h1 supports IFF_UNICAST_FLT
selftests: forwarding: fix error message in learning_test
r8169: fix accessing unset transport header
i2c: cadence: Unregister the clk notifier in error path
dmaengine: imx-sdma: Allow imx8m for imx7 FW revs
misc: rtsx_usb: fix use of dma mapped buffer for usb bulk transfer
misc: rtsx_usb: use separate command and response buffers
misc: rtsx_usb: set return value in rsp_buf alloc err path
Revert "mm/memory-failure.c: fix race with changing page compound again"
Revert "serial: 8250_mtk: Make sure to select the right FEATURE_SEL"
dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo
ida: don't use BUG_ON() for debugging
dmaengine: pl330: Fix lockdep warning about non-static key
dmaengine: lgm: Fix an error handling path in intel_ldma_probe()
dmaengine: at_xdma: handle errors of at_xdmac_alloc_desc() correctly
dmaengine: ti: Fix refcount leak in ti_dra7_xbar_route_allocate
dmaengine: qcom: bam_dma: fix runtime PM underflow
dmaengine: ti: Add missing put_device in ti_dra7_xbar_route_allocate
dmaengine: idxd: force wq context cleanup on device disable path
selftests/net: fix section name when using xdp_dummy.o
Linux 5.15.54
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5434cb4ec4ac9e6d1f619f47a5139a11352b98e1
Changes in 5.15.51
random: schedule mix_interrupt_randomness() less often
random: quiet urandom warning ratelimit suppression message
ALSA: hda/via: Fix missing beep setup
ALSA: hda/conexant: Fix missing beep setup
ALSA: hda/realtek: Add mute LED quirk for HP Omen laptop
ALSA: hda/realtek - ALC897 headset MIC no sound
ALSA: hda/realtek: Apply fixup for Lenovo Yoga Duet 7 properly
ALSA: hda/realtek: Add quirk for Clevo PD70PNT
ALSA: hda/realtek: Add quirk for Clevo NS50PU
net: openvswitch: fix parsing of nw_proto for IPv6 fragments
9p: Fix refcounting during full path walks for fid lookups
9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl
9p: fix fid refcount leak in v9fs_vfs_get_link
btrfs: fix hang during unmount when block group reclaim task is running
btrfs: prevent remounting to v1 space cache for subpage mount
btrfs: add error messages to all unrecognized mount options
scsi: ibmvfc: Store vhost pointer during subcrq allocation
scsi: ibmvfc: Allocate/free queue resource only during probe/remove
mmc: sdhci-pci-o2micro: Fix card detect by dealing with debouncing
mmc: mediatek: wait dma stop bit reset to 0
xen/gntdev: Avoid blocking in unmap_grant_pages()
MAINTAINERS: Add new IOMMU development mailing list
mtd: rawnand: gpmi: Fix setting busy timeout setting
ata: libata: add qc->flags in ata_qc_complete_template tracepoint
dm era: commit metadata in postsuspend after worker stops
dm mirror log: clear log bits up to BITS_PER_LONG boundary
tracing/kprobes: Check whether get_kretprobe() returns NULL in kretprobe_dispatcher()
drm/i915: Implement w/a 22010492432 for adl-s
USB: serial: pl2303: add support for more HXN (G) types
USB: serial: option: add Telit LE910Cx 0x1250 composition
USB: serial: option: add Quectel EM05-G modem
USB: serial: option: add Quectel RM500K module support
drm/msm: Ensure mmap offset is initialized
drm/msm: Fix double pm_runtime_disable() call
netfilter: use get_random_u32 instead of prandom
scsi: scsi_debug: Fix zone transition to full condition
drm/msm: Switch ordering of runpm put vs devfreq_idle
scsi: iscsi: Exclude zero from the endpoint ID range
xsk: Fix generic transmit when completion queue reservation fails
drm/msm: use for_each_sgtable_sg to iterate over scatterlist
bpf: Fix request_sock leak in sk lookup helpers
drm/sun4i: Fix crash during suspend after component bind failure
bpf, x86: Fix tail call count offset calculation on bpf2bpf call
scsi: storvsc: Correct reporting of Hyper-V I/O size limits
phy: aquantia: Fix AN when higher speeds than 1G are not advertised
KVM: arm64: Prevent kmemleak from accessing pKVM memory
net: Write lock dev_base_lock without disabling bottom halves.
net: fix data-race in dev_isalive()
tipc: fix use-after-free Read in tipc_named_reinit
igb: fix a use-after-free issue in igb_clean_tx_ring
bonding: ARP monitor spams NETDEV_NOTIFY_PEERS notifiers
ethtool: Fix get module eeprom fallback
net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms
drm/msm/mdp4: Fix refcount leak in mdp4_modeset_init_intf
drm/msm/dp: check core_initialized before disable interrupts at dp_display_unbind()
drm/msm/dp: Drop now unused hpd_high member
drm/msm/dp: dp_link_parse_sink_count() return immediately if aux read failed
drm/msm/dp: do not initialize phy until plugin interrupt received
drm/msm/dp: force link training for display resolution change
perf arm-spe: Don't set data source if it's not a memory operation
erspan: do not assume transport header is always set
net/tls: fix tls_sk_proto_close executed repeatedly
udmabuf: add back sanity check
selftests: netfilter: correct PKTGEN_SCRIPT_PATHS in nft_concat_range.sh
xen-blkfront: Handle NULL gendisk
x86/xen: Remove undefined behavior in setup_features()
MIPS: Remove repetitive increase irq_err_count
afs: Fix dynamic root getattr
ice: ethtool: advertise 1000M speeds properly
regmap-irq: Fix a bug in regmap_irq_enable() for type_in_mask chips
regmap-irq: Fix offset/index mismatch in read_sub_irq_data()
igb: Make DMA faster when CPU is active on the PCIe link
virtio_net: fix xdp_rxq_info bug after suspend/resume
Revert "net/tls: fix tls_sk_proto_close executed repeatedly"
sock: redo the psock vs ULP protection check
nvme-pci: add NO APST quirk for Kioxia device
nvme: move the Samsung X5 quirk entry to the core quirks
gpio: winbond: Fix error code in winbond_gpio_get()
s390/cpumf: Handle events cycles and instructions identical
iio: mma8452: fix probe fail when device tree compatible is used.
iio: magnetometer: yas530: Fix memchr_inv() misuse
iio: adc: vf610: fix conversion mode sysfs node name
usb: typec: wcove: Drop wrong dependency to INTEL_SOC_PMIC
xhci: turn off port power in shutdown
xhci-pci: Allow host runtime PM as default for Intel Raptor Lake xHCI
xhci-pci: Allow host runtime PM as default for Intel Meteor Lake xHCI
usb: gadget: Fix non-unique driver names in raw-gadget driver
USB: gadget: Fix double-free bug in raw_gadget driver
usb: chipidea: udc: check request status before setting device address
dt-bindings: usb: ohci: Increase the number of PHYs
dt-bindings: usb: ehci: Increase the number of PHYs
btrfs: don't set lock_owner when locking extent buffer for reading
btrfs: fix deadlock with fsync+fiemap+transaction commit
f2fs: attach inline_data after setting compression
iio:humidity:hts221: rearrange iio trigger get and register
iio:chemical:ccs811: rearrange iio trigger get and register
iio:accel:kxcjk-1013: rearrange iio trigger get and register
iio:accel:bma180: rearrange iio trigger get and register
iio:accel:mxc4005: rearrange iio trigger get and register
iio: accel: mma8452: ignore the return value of reset operation
iio: gyro: mpu3050: Fix the error handling in mpu3050_power_up()
iio: trigger: sysfs: fix use-after-free on remove
iio: adc: stm32: fix maximum clock rate for stm32mp15x
iio: imu: inv_icm42600: Fix broken icm42600 (chip id 0 value)
iio: afe: rescale: Fix boolean logic bug
iio: adc: stm32: Fix ADCs iteration in irq handler
iio: adc: stm32: Fix IRQs on STM32F4 by removing custom spurious IRQs message
iio: adc: axp288: Override TS pin bias current for some models
iio: adc: rzg2l_adc: add missing fwnode_handle_put() in rzg2l_adc_parse_properties()
iio: adc: adi-axi-adc: Fix refcount leak in adi_axi_adc_attach_client
iio: adc: ti-ads131e08: add missing fwnode_handle_put() in ads131e08_alloc_channels()
xtensa: xtfpga: Fix refcount leak bug in setup
xtensa: Fix refcount leak bug in time.c
parisc/stifb: Fix fb_is_primary_device() only available with CONFIG_FB_STI
parisc: Enable ARCH_HAS_STRICT_MODULE_RWX
powerpc/microwatt: wire up rng during setup_arch()
powerpc: Enable execve syscall exit tracepoint
powerpc/rtas: Allow ibm,platform-dump RTAS call with null buffer address
powerpc/powernv: wire up rng during setup_arch
drm/msm/dp: Always clear mask bits to disable interrupts at dp_ctrl_reset_irq_ctrl()
ARM: dts: imx7: Move hsic_phy power domain to HSIC PHY node
ARM: dts: imx6qdl: correct PU regulator ramp delay
arm64: dts: ti: k3-am64-main: Remove support for HS400 speed mode
ARM: exynos: Fix refcount leak in exynos_map_pmu
soc: bcm: brcmstb: pm: pm-arm: Fix refcount leak in brcmstb_pm_probe
ARM: Fix refcount leak in axxia_boot_secondary
memory: samsung: exynos5422-dmc: Fix refcount leak in of_get_dram_timings
ARM: cns3xxx: Fix refcount leak in cns3xxx_init
modpost: fix section mismatch check for exported init/exit sections
ARM: dts: bcm2711-rpi-400: Fix GPIO line names
random: update comment from copy_to_user() -> copy_to_iter()
perf build-id: Fix caching files with a wrong build ID
dma-direct: use the correct size for dma_set_encrypted()
kbuild: link vmlinux only once for CONFIG_TRIM_UNUSED_KSYMS (2nd attempt)
powerpc/pseries: wire up rng during setup_arch()
Linux 5.15.51
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iac87c7937a5517ad126a1b4512680a2bbfb6447d
This reverts commit 761b4fa752 which is
commit d1bc532e99becf104635ed4da6fefa306f452321 upstream.
It breaks the Android kernel ABI and is not needed for Android devices,
so it is safe to revert for now. If it is determined that it is needed
in the future, it can be brought back in an abi-preserving way.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id228897a297ea667590bf555928722e3585ccbef
This reverts commit 71afd0ceb5 which is
commit d678cbd2f867a564a3c5b276c454e873f43f02f8 upstream.
It breaks the Android kernel ABI and is not needed for Android devices,
so it is safe to revert for now. If it is determined that it is needed
in the future, it can be brought back in an abi-preserving way.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5e5415f719b4db2e10c7ca45ccb02c89836613da
This reverts commit f7019562f1 which is
commit ba3beec2ec1d3b4fd8672ca6e781dac4b3267f6e upstream.
It breaks the Android kernel ABI and is not needed for Android devices,
so it is safe to revert for now. If it is determined that it is needed
in the future, it can be brought back in an abi-preserving way.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifa3fa4da4879274d2a983e0e6ea9dca3cb3bd72e
[ Upstream commit 512d1999b8e94a5d43fba3afc73e774849674742 ]
When an XSK pool gets mapped, xp_check_dma_contiguity() adds bit 0x1
to pages' DMA addresses that go in ascending order and at 4K stride.
The problem is that the bit does not get cleared before doing unmap.
As a result, a lot of warnings from iommu_dma_unmap_page() are seen
in dmesg, which indicates that lookups by iommu_iova_to_phys() fail.
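The tagging and the missing clear can be illustrated with a small
userspace model. This is only a sketch: the names PAGE_SZ, CONTIG_BIT,
mark_contiguity() and addr_for_unmap() are hypothetical stand-ins and not
the kernel's own helpers, and a 4K page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ    4096u          /* assumed 4K stride */
#define CONTIG_BIT ((uint64_t)1)  /* hypothetical name for the 0x1 flag */

/* Tag every address whose successor lies exactly one page above it. */
static void mark_contiguity(uint64_t *pages, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		uint64_t base = pages[i] & ~CONTIG_BIT;

		if (base + PAGE_SZ == (pages[i + 1] & ~CONTIG_BIT))
			pages[i] = base | CONTIG_BIT;
		else
			pages[i] = base;
	}
}

/* Strip the flag so that the unmap side sees the address that was mapped. */
static uint64_t addr_for_unmap(uint64_t tagged)
{
	return tagged & ~CONTIG_BIT;
}
```

Passing the tagged value straight to an unmap routine would hand it an
address that is off by one, which matches the failed lookups described
above; masking first restores the original address.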
Fixes: 2b43470add ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220628091848.534803-1-ivan.malov@oktetlabs.ru
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a6e944f25cdbe6b82275402b8bc9a55ad7aac10b ]
Two points of potential failure in the generic transmit function are:
1. completion queue (cq) reservation failure.
2. skb allocation failure.
Originally the cq reservation was performed first, followed by the skb
allocation. Commit 675716400d ("xdp: fix possible cq entry leak")
reversed the order because, at the time, there was no mechanism to undo
the cq reservation, which could have leaked cq entries if the skb
allocation failed. However, if the skb allocation is performed first and
the cq reservation then fails, the xsk skb destructor is called, which
blindly adds the skb address to the already full cq, leading to undefined
behavior.
This commit restores the original order (cq reservation followed by skb
allocation) and uses the xskq_prod_cancel helper to undo the cq reserve
in the event of skb allocation failure.
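The reserve-then-allocate ordering with an explicit undo step can be
sketched in userspace as follows. The struct cq_model type and the
cq_reserve(), cq_cancel() and xmit_one() functions are hypothetical
illustrations; malloc() stands in for the skb allocation and a flag
simulates its failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical completion ring: only the slot accounting is modelled. */
struct cq_model {
	uint32_t reserved;  /* slots currently reserved */
	uint32_t capacity;  /* total number of slots */
};

static bool cq_reserve(struct cq_model *cq)
{
	if (cq->reserved == cq->capacity)
		return false;
	cq->reserved++;
	return true;
}

/* Undo one earlier successful cq_reserve(). */
static void cq_cancel(struct cq_model *cq)
{
	assert(cq->reserved > 0);
	cq->reserved--;
}

/* Returns a buffer standing in for the skb, or NULL on failure. */
static void *xmit_one(struct cq_model *cq, size_t len, bool fail_alloc)
{
	void *buf;

	if (!cq_reserve(cq))
		return NULL;  /* nothing allocated yet, nothing to undo */

	buf = fail_alloc ? NULL : malloc(len);
	if (!buf) {
		cq_cancel(cq);  /* give the reserved slot back */
		return NULL;
	}
	return buf;
}
```

With this ordering, a buffer only ever exists while a completion slot is
held for it, so a destructor never has to write into a full ring.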
Fixes: 675716400d ("xdp: fix possible cq entry leak")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220614070746.8871-1-ciara.loftus@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d678cbd2f867a564a3c5b276c454e873f43f02f8 ]
Running xdpxceiver on an AF_XDP ZC-enabled driver revealed a problem with
the XSK Tx batching API. There is a test that checks how invalid Tx
descriptors are handled by AF_XDP. Each valid descriptor is followed by an
invalid one on the Tx side, whereas the Rx side expects to receive only a
set of valid descriptors.
In the current xsk_tx_peek_release_desc_batch() function, the number of
available descriptors is hidden inside xskq_cons_peek_desc_batch(). This
is problematic when invalid descriptors are present, because
xskq_cons_peek_desc_batch() returns only a count of valid descriptors.
This means that it is impossible to properly update the XSK ring state
when calling xskq_cons_release_n().
To address this issue, pull out the contents of
xskq_cons_peek_desc_batch() so that callers (currently only
xsk_tx_peek_release_desc_batch()) can always update the ring state
properly: the total count of entries is now available and is passed as
the argument to xskq_cons_release_n(). With that in place,
xskq_cons_peek_desc_batch() can be dropped altogether.
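The difference between the count of valid descriptors and the count of
ring entries consumed can be shown with a simplified model. Everything
here is an assumption made for illustration: the struct layouts, the
validity rule (non-zero length that fits in a chunk), and the absence of
index wrapping are not taken from the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct desc_model {
	uint64_t addr;
	uint32_t len;
};

/* Hypothetical ring: indices are used directly, without wrapping. */
struct ring_model {
	const struct desc_model *descs;
	uint32_t cons;
	uint32_t prod;
};

/* Assumed validity rule, for illustration only. */
static bool desc_valid(const struct desc_model *d, uint32_t chunk_size)
{
	return d->len != 0 && d->len <= chunk_size;
}

/*
 * Copy up to max valid descriptors into out. Returns the number of valid
 * descriptors and reports through consumed how many ring entries were
 * walked, including the invalid ones that were skipped.
 */
static uint32_t read_batch(const struct ring_model *r, struct desc_model *out,
			   uint32_t max, uint32_t chunk_size,
			   uint32_t *consumed)
{
	uint32_t valid = 0, pos = r->cons;

	while (pos != r->prod && valid < max) {
		const struct desc_model *d = &r->descs[pos++];

		if (!desc_valid(d, chunk_size))
			continue;
		out[valid++] = *d;
	}
	*consumed = pos - r->cons;
	return valid;
}

static void release_n(struct ring_model *r, uint32_t n)
{
	r->cons += n;
}
```

Releasing only the returned valid count would leave the consumer index
behind the entries that were actually read; releasing the consumed count
keeps the ring state in step with what was walked.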
Fixes: 9349eb3a9d ("xsk: Introduce batched Tx descriptor interfaces")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220607142200.576735-1-maciej.fijalkowski@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d1bc532e99becf104635ed4da6fefa306f452321 ]
Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.
i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8de8b71b787f38983d414d2dba169a3bfefa668a ]
While checking AF_XDP copy mode combined with busy poll, strange
results were observed. rxdrop and txonly scenarios worked fine, but
l2fwd broke immediately.
After a deeper look, it turned out that for l2fwd, the Tx side was exiting
early due to xsk_no_wakeup() returning true, so in the end
xsk_generic_xmit() was never called. Note that AF_XDP Tx in copy mode
is syscall steered, so the current behavior is broken.
The txonly scenario worked only because sk_mark_napi_id_once_xdp() was
never called: the Rx side is not in the picture for this case, and that
function is called in xsk_rcv_check(), so sk::sk_napi_id was never set,
which in turn meant that xsk_no_wakeup() was returning false (see the
sk->sk_napi_id >= MIN_NAPI_ID check in there).
To fix this, prefer busy poll in xsk_sendmsg() only when zero copy is
enabled on a given AF_XDP socket. By doing so, busy poll in copy mode
does not exit early on the Tx side and xsk_generic_xmit() eventually
gets called.
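The resulting decision can be modelled as a small pure function. This is
a sketch under stated assumptions: struct sk_model, model_no_wakeup(),
model_sendmsg(), the tx_path values and the MODEL_MIN_NAPI_ID threshold
are hypothetical, and the prefer_busy_poll flag is an assumed input
alongside the napi_id check quoted above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MIN_NAPI_ID 1u  /* hypothetical threshold for a valid NAPI id */

enum tx_path {
	TX_RETURN_EARLY,  /* busy poll preferred, no wakeup issued */
	TX_ZC_WAKEUP,     /* zero-copy: ask the driver to wake up */
	TX_GENERIC_XMIT   /* copy mode: transmit from the syscall */
};

struct sk_model {
	bool zc;                /* zero-copy enabled on the socket */
	bool prefer_busy_poll;  /* assumed socket preference flag */
	unsigned int napi_id;   /* 0 models "never marked" */
};

static bool model_no_wakeup(const struct sk_model *sk)
{
	return sk->prefer_busy_poll && sk->napi_id >= MODEL_MIN_NAPI_ID;
}

/* The early exit is taken only when zero copy is enabled. */
static enum tx_path model_sendmsg(const struct sk_model *sk)
{
	if (sk->zc && model_no_wakeup(sk))
		return TX_RETURN_EARLY;
	return sk->zc ? TX_ZC_WAKEUP : TX_GENERIC_XMIT;
}
```

In this model a copy-mode socket reaches the generic transmit path no
matter whether its NAPI id was ever marked, which removes the difference
between the txonly and l2fwd cases.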
Fixes: a0731952d9 ("xsk: Add busy-poll support for {recv,send}msg()")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220406155804.434493-1-maciej.fijalkowski@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 18b1ab7aa76bde181bdb1ab19a87fa9523c32f21 ]
Fix a race in the xsk socket teardown code that can lead to a NULL pointer
dereference splat. The current xsk unbind code in xsk_unbind_dev() starts by
setting xs->state to XSK_UNBOUND, sets xs->dev to NULL and then waits for any
NAPI processing to terminate using synchronize_net(). After that, the release
code starts to tear down the socket state and free allocated memory.
BUG: kernel NULL pointer dereference, address: 00000000000000c0
PGD 8000000932469067 P4D 8000000932469067 PUD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G I 5.16.0+ #2
Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
RIP: 0010:__xsk_sendmsg+0x2c/0x690
[...]
RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
FS: 00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
Call Trace:
<TASK>
? aa_sk_perm+0x43/0x1b0
xsk_sendmsg+0xf0/0x110
sock_sendmsg+0x65/0x70
__sys_sendto+0x113/0x190
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x23/0x50
? exit_to_user_mode_prepare+0xa5/0x1d0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
There are two problems with the current code. First, setting xs->dev to NULL
before waiting for all users to stop using the socket is not correct. Entry
to the data plane functions xsk_poll(), xsk_sendmsg(), and xsk_recvmsg() is
guarded by a test that xs->state is XSK_BOUND; if it is not, the function
returns right away. But one process might have passed this test without yet
having reached the point at which it uses xs->dev. In this interim, a second
process executing xsk_unbind_dev() might have set xs->dev to NULL, which
leads to a crash for the first process. The solution here is just to get rid
of this NULL assignment since it is not used anymore. Before
commit 42fddcc7c6 ("xsk: use state member for socket synchronization"),
xs->dev was the gatekeeper to admit processes into the data plane functions,
but it was replaced with the state variable xs->state in the aforementioned
commit.
The second problem is that synchronize_net() does not wait for any process in
xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete, which means that the
state they rely on might be cleaned up prematurely. This can happen when the
notifier gets called (at driver unload for example) as it uses xsk_unbind_dev().
Solve this by extending the RCU critical region from just the ndo_xsk_wakeup
to the whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU critical
section. This will guarantee that when synchronize_net() completes, there will
be no processes left executing xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() and
state can be cleaned up safely. Note that we need to drop the RCU lock for the
skb xmit path as it uses functions that might sleep. Due to this, we have to
retest the xs->state after we grab the mutex that protects the skb xmit code
from, among a number of things, an xsk_unbind_dev() being executed from the
notifier at the same time.
Fixes: 42fddcc7c6 ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <elza.mathew@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20220228094552.10134-1-magnus.karlsson@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0706a78f31c4217ca144f630063ec9561a21548d upstream.
This reverts commit bd0687c18e635b63233dc87f38058cd728802ab4.
This patch causes a Tx only workload to go to sleep even when it does
not have to, leading to miserable performance in skb mode. It fixed
one rare problem but created a much worse one, so this needs to be
reverted while I try to craft a proper solution to the original
problem.
Fixes: bd0687c18e63 ("xsk: Do not sleep in poll() when need_wakeup set")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211217145646.26449-1-magnus.karlsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd0687c18e635b63233dc87f38058cd728802ab4 upstream.
Do not sleep in poll() when the need_wakeup flag is set. When this
flag is set, the application needs to explicitly wake up the driver
with a syscall (poll, recvmsg, sendmsg, etc.) to guarantee that Rx
and/or Tx processing will happen promptly. But the current code
in poll() sleeps first and then wakes up the driver. This means that no
driver processing will occur (barring any interrupts) until the timeout
has expired.
Fix this by checking the need_wakeup flag first and if set, wake the
driver and return to the application. Only if need_wakeup is not set
should the process sleep if there is a timeout set in the poll() call.
Fixes: 77cd0d7b3f ("xsk: add support for need_wakeup flag in AF_XDP rings")
Reported-by: Keith Wiles <keith.wiles@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20211214102607.7677-1-magnus.karlsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP_REDIRECT works by a three-step process: the bpf_redirect() and
bpf_redirect_map() helpers will look up the target of the redirect and store
it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
Next, when the program returns the XDP_REDIRECT return code, the driver
will call xdp_do_redirect() which will use the information thus stored to
actually enqueue the frame into a bulk queue structure (that differs
slightly by map type, but shares the same principle). Finally, before
exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
flush all the different bulk queues, thus completing the redirect.
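The three steps above can be pictured with a small user-space model. This is only a sketch: every sim_* name, the number of targets and the bulk-queue capacity are hypothetical and chosen for illustration; the kernel's per-CPU struct bpf_redirect_info and its per-map bulk queues are more involved.

```c
#include <stddef.h>

#define SIM_TARGETS   4   /* assumed number of redirect targets */
#define SIM_BULK_SIZE 16  /* assumed bulk-queue capacity */

/* Hypothetical stand-in for the per-CPU redirect state. */
struct sim_redirect_info {
	int target;  /* redirect target stored by the helper */
	int valid;   /* non-zero once a target has been stored */
};

/* Hypothetical stand-in for one bulk queue. */
struct sim_bulk_queue {
	int frames[SIM_BULK_SIZE];
	size_t count;    /* frames waiting to be sent */
	size_t flushed;  /* frames sent so far */
};

/* Step 1: the helper looks up the target and stores it. */
static void sim_redirect(struct sim_redirect_info *ri, int target)
{
	ri->target = target;
	ri->valid = 1;
}

/* Sends everything currently waiting in one queue. */
static void sim_flush(struct sim_bulk_queue *bq)
{
	bq->flushed += bq->count;
	bq->count = 0;
}

/* Step 2: the driver enqueues the frame using the stored target. */
static int sim_do_redirect(struct sim_redirect_info *ri,
			   struct sim_bulk_queue *queues, int frame)
{
	struct sim_bulk_queue *bq;

	if (!ri->valid || ri->target < 0 || ri->target >= SIM_TARGETS)
		return -1;
	bq = &queues[ri->target];
	if (bq->count == SIM_BULK_SIZE)
		sim_flush(bq);  /* queue full: flush early */
	bq->frames[bq->count++] = frame;
	ri->valid = 0;
	return 0;
}

/* Step 3: before leaving the poll loop, flush every queue. */
static void sim_flush_all(struct sim_bulk_queue *queues)
{
	for (size_t i = 0; i < SIM_TARGETS; i++)
		sim_flush(&queues[i]);
}
```

In this model a frame is only counted as sent after sim_flush_all(), which mirrors why the stored map entry has to stay valid for the whole sequence.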
Pointers to the map entries will be kept around for this whole sequence of
steps, protected by RCU. However, there is no top-level rcu_read_lock() in
the core code; instead drivers add their own rcu_read_lock() around the XDP
portions of the code, but somewhat inconsistently as Martin discovered[0].
However, things still work because everything happens inside a single NAPI
poll sequence, which means it's between a pair of calls to
local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
document this intention by using rcu_dereference_check() with
rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
lockdep to verify that everything is done correctly.
This patch does just that: we add an __rcu annotation to the map entry
pointers and remove the various comments explaining the NAPI poll assurance
strewn through devmap.c in favour of a longer explanation in filter.c. The
goal is to have one coherent documentation of the entire flow, and rely on
the RCU annotations as a "standard" way of communicating the flow in the
map code (which can additionally be understood by sparse and lockdep).
The RCU annotation replacements result in a fairly straightforward
replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
becomes rcu_assign_pointer() and xchg() and cmpxchg() get wrapped in the
proper constructs to cast the pointer back and forth between __rcu and
__kernel address space (for the benefit of sparse). The one complication is
that xskmap has a few constructions where double-pointers are passed back
and forth; these simply all gain __rcu annotations, and only the final
reference/dereference to the inner-most pointer gets changed.
With this, everything can be run through sparse without eliciting
complaints, and lockdep can verify correctness even without the use of
rcu_read_lock() in the drivers. Subsequent patches will clean these up from
the drivers.
[0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
[1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
Fix broken Tx ring validation for AF_XDP. The commit under the Fixes
tag fixed an off-by-one error in the validation but introduced
another error. Descriptors are now let through even if they straddle a
chunk boundary, which they are not allowed to do in aligned mode. Worse
is that they are let through even if they straddle the end of the umem
itself, tricking the kernel into reading data outside the allowed umem
region, which might or might not be mapped at all.
Fix this by reintroducing the old code, but subtract the length by one
to fix the off-by-one error that the original patch was
addressing. The test chunk != chunk_end makes sure packets do not
straddle chunk boundaries. Note that packets of zero length are
allowed in the interface, hence the test that the length is
non-zero.
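A minimal user-space sketch of the check described above, assuming a power-of-two chunk size. struct sim_umem and sim_aligned_validate_desc() are hypothetical names for illustration, not the kernel's xp_aligned_validate_desc().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an aligned-mode umem. */
struct sim_umem {
	uint32_t chunk_shift;  /* log2 of the chunk size */
	uint64_t chunk_count;  /* number of chunks in the umem */
};

static bool sim_aligned_validate_desc(const struct sim_umem *umem,
				      uint64_t addr, uint32_t len)
{
	uint64_t chunk = addr >> umem->chunk_shift;

	/* Zero-length packets are allowed, so only compare when len != 0. */
	if (len) {
		/* The last byte is at addr + len - 1, not addr + len. */
		uint64_t chunk_end = (addr + len - 1) >> umem->chunk_shift;

		if (chunk != chunk_end)
			return false;  /* straddles a chunk boundary */
	}

	/* Reject descriptors that start beyond the umem. */
	return chunk < umem->chunk_count;
}
```

With 2048-byte chunks, a descriptor at address 0 with length 2048 passes, while one byte more fails, and so does a descriptor in the last chunk that runs past the end of the umem.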
Fixes: ac31565c21 ("xsk: Fix for xp_aligned_validate_desc() when len == chunk_size")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20210618075805.14412-1-magnus.karlsson@gmail.com
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.
With BPF_F_BROADCAST the packet will be broadcast to all the interfaces
in the map. With BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when broadcasting.
When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.
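The array-map walk can be sketched as below. This is an illustration only: sim_broadcast() is a hypothetical helper, the map is modelled as a plain array of ifindexes, and an ifindex of 0 is assumed to mark an empty slot.

```c
#include <stddef.h>

/*
 * Walks every slot of a hypothetical device array map and collects the
 * ifindexes that would receive a copy of the packet. Returns the number
 * of targets written to out, which must have room for max_entries items.
 */
static size_t sim_broadcast(const int *map, size_t max_entries,
			    int ingress_ifindex, int exclude_ingress,
			    int *out)
{
	size_t n = 0;

	for (size_t i = 0; i < max_entries; i++) {
		if (map[i] == 0)
			continue;  /* empty slot */
		if (exclude_ingress && map[i] == ingress_ifindex)
			continue;  /* do not send back out the ingress port */
		out[n++] = map[i];
	}
	return n;
}
```

Walking every slot once, rather than chasing "next key" lookups, is what keeps a concurrent removal from restarting the iteration and duplicating packets.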
Function bpf_clear_redirect_map() was removed in
commit ee75aef23a ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.
With test topology:
+-------------------+             +-------------------+
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
+-------------------+             |                   |
                                  |   Host B          |
+-------------------+             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
+-------------------+             |                   |
                                  |          +------+ |
                                  | veth0 -- | Peer | |
                                  | veth1 -- |      | |
                                  | veth2 -- |  NS  | |
                                  |          +------+ |
                                  +-------------------+
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modified to 4.
Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):
5.12 rc4 | redirect_map i40e->i40e | 2.0M | 9.7M
5.12 rc4 | redirect_map i40e->veth | 1.7M | 11.8M
5.12 rc4 + patch | redirect_map i40e->i40e | 2.0M | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth | 1.7M | 11.7M
Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:
5.12 rc4 + patch | redirect_map multi i40e->veth (x1) | 1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi i40e->veth (x2) | 1.1M | 4.3M
5.12 rc4 + patch | redirect_map multi i40e->veth (x3) | 0.8M | 2.6M
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
A desc->len equal to chunk_size is legal. But xp_aligned_validate_desc()
computed chunk_end from desc->addr + desc->len, which in that case
points into the next chunk, so the check wrongly failed.
This problem was first introduced in bbff2f321a ("xsk: new descriptor
addressing scheme"). Later in 2b43470add ("xsk: Introduce AF_XDP buffer
allocation API") this piece of code was moved into the new function called
xp_aligned_validate_desc(). This function was then moved into xsk_queue.h
via 26062b185e ("xsk: Explicitly inline functions and move definitions").
Fixes: bbff2f321a ("xsk: new descriptor addressing scheme")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210428094424.54435-1-xuanzhuo@linux.alibaba.com
The XDP_REDIRECT implementations for maps and non-maps are fairly
similar, but obviously need to take different code paths depending on
if the target is using a map or not. Today, the redirect targets for
XDP either use a map or are based on ifindex.
Here, the map type and id are added to bpf_redirect_info, instead of
the actual map. Map type, map item/ifindex, and the map_id (if any) are
passed to xdp_do_redirect().
For ifindex-based redirect, used by the bpf_redirect() XDP BPF helper,
a special map type/id is used. Map type of UNSPEC together with map id
equal to INT_MAX has the special meaning of an ifindex based
redirect. Note that valid map ids are 1 inclusive, INT_MAX exclusive
([1,INT_MAX[).
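The encoding can be sketched as follows. The sim_* types and helpers are hypothetical and only mirror the rule stated above (UNSPEC plus INT_MAX means "ifindex redirect", valid map ids lie in [1, INT_MAX)).

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical map types; only UNSPEC matters for this sketch. */
enum sim_map_type {
	SIM_MAP_TYPE_UNSPEC = 0,
	SIM_MAP_TYPE_DEVMAP,
};

/* Hypothetical subset of the redirect state. */
struct sim_redirect_info {
	enum sim_map_type map_type;
	unsigned int map_id;
	unsigned int tgt_index;  /* map item index, or ifindex */
};

/* UNSPEC together with INT_MAX marks an ifindex-based redirect. */
static bool sim_is_ifindex_redirect(const struct sim_redirect_info *ri)
{
	return ri->map_type == SIM_MAP_TYPE_UNSPEC &&
	       ri->map_id == (unsigned int)INT_MAX;
}

/* Valid map ids are 1 inclusive, INT_MAX exclusive. */
static bool sim_map_id_valid(unsigned int id)
{
	return id >= 1 && id < (unsigned int)INT_MAX;
}
```

Because INT_MAX is outside the valid id range, the sentinel can never collide with a real map.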
In addition to making the code easier to follow, using explicit type
and id in bpf_redirect_info has a slight positive performance impact
by avoiding a pointer indirection for the map type lookup, and instead
using the cacheline for bpf_redirect_info.
Since the actual map is not passed via bpf_redirect_info anymore, the
map lookup is only done in the BPF helper. This means that the
bpf_clear_redirect_map() function can be removed. The actual map item
is RCU protected.
The bpf_redirect_info flags member is not used by XDP, and not
read/written any more. The map member is only written to when
required/used, and not unconditionally.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-3-bjorn.topel@gmail.com
Currently the bpf_redirect_map() implementation dispatches to the
correct map-lookup function via a switch-statement. To avoid the
dispatching, this change adds bpf_redirect_map() as a map
operation. Each map provides its bpf_redirect_map() version, and the
correct function is automatically selected by the BPF verifier.
A nice side-effect of the code movement is that the map lookup
functions are now local to the map implementation files, which removes
one additional function call.
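The idea of making the redirect a per-map operation can be sketched with an ops table. All names here are hypothetical, and the sketch selects the function through an ordinary indirect call at run time, whereas the commit relies on the BPF verifier to pick the function.

```c
#include <stddef.h>

#define SIM_ABORTED  0  /* illustrative return codes */
#define SIM_REDIRECT 4

struct sim_map;

/* Each map type supplies its own redirect implementation. */
struct sim_map_ops {
	long (*map_redirect)(struct sim_map *map, unsigned int key);
};

struct sim_map {
	const struct sim_map_ops *ops;
	unsigned int max_entries;  /* used by the array flavour */
	unsigned int keys[4];      /* used by the toy hash flavour */
	size_t nkeys;
};

/* Array flavour: any index below max_entries is a valid target. */
static long sim_array_redirect(struct sim_map *map, unsigned int key)
{
	return key < map->max_entries ? SIM_REDIRECT : SIM_ABORTED;
}

/* Toy "hash" flavour: only keys stored in the table are valid. */
static long sim_hash_redirect(struct sim_map *map, unsigned int key)
{
	for (size_t i = 0; i < map->nkeys; i++)
		if (map->keys[i] == key)
			return SIM_REDIRECT;
	return SIM_ABORTED;
}

static const struct sim_map_ops sim_array_ops = {
	.map_redirect = sim_array_redirect,
};

static const struct sim_map_ops sim_hash_ops = {
	.map_redirect = sim_hash_redirect,
};

/* No switch on the map type: the map itself knows what to call. */
static long sim_redirect_map(struct sim_map *map, unsigned int key)
{
	return map->ops->map_redirect(map, key);
}
```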
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
Currently, the AF_XDP rings use general smp_{r,w,}mb() barriers on
the kernel-side. On most modern architectures
load-acquire/store-release barriers perform better, and result in
simpler code for circular ring buffers.
This change updates the XDP socket rings to use
load-acquire/store-release barriers.
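As a rough user-space analogy, a single-producer/single-consumer ring using C11 acquire/release atomics looks like the sketch below. The sim_* names and the ring size are hypothetical, and C11 memory orders are only an approximation of the kernel's smp_load_acquire()/smp_store_release().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_RING_SIZE 8u  /* must be a power of two */

struct sim_ring {
	_Atomic uint32_t producer;  /* written only by the producer */
	_Atomic uint32_t consumer;  /* written only by the consumer */
	uint64_t slots[SIM_RING_SIZE];
};

static bool sim_ring_produce(struct sim_ring *r, uint64_t value)
{
	uint32_t prod = atomic_load_explicit(&r->producer,
					     memory_order_relaxed);
	/* Pairs with the consumer's release: the slot is free to reuse. */
	uint32_t cons = atomic_load_explicit(&r->consumer,
					     memory_order_acquire);

	if (prod - cons == SIM_RING_SIZE)
		return false;  /* ring full */
	r->slots[prod & (SIM_RING_SIZE - 1)] = value;
	/* Publish the slot: the data write is ordered before the index. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

static bool sim_ring_consume(struct sim_ring *r, uint64_t *value)
{
	uint32_t cons = atomic_load_explicit(&r->consumer,
					     memory_order_relaxed);
	/* Pairs with the producer's release: the data is visible. */
	uint32_t prod = atomic_load_explicit(&r->producer,
					     memory_order_acquire);

	if (cons == prod)
		return false;  /* ring empty */
	*value = r->slots[cons & (SIM_RING_SIZE - 1)];
	/* Hand the slot back: the data read is ordered before the index. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}
```

Each side needs one acquire load and one release store per operation, with no full barrier on either path.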
It is important to note that changing from the old smp_{r,w,}mb()
barriers, to load-acquire/store-release barriers does not break
compatibility. The old semantics work with the new one, and vice
versa.
As pointed out by "Documentation/memory-barriers.txt" in the "SMP
BARRIER PAIRING" section:
"General barriers pair with each other, though they also pair with
most other types of barriers, albeit without multicopy atomicity.
An acquire barrier pairs with a release barrier, but both may also
pair with other barriers, including of course general barriers."
How different barriers behave and pair is outlined in
"tools/memory-model/Documentation/cheatsheet.txt".
In order to make sure that compatibility is not broken, LKMM herd7
based litmus tests can be constructed and verified.
We generalize the XDP socket ring to a one entry ring, and create two
scenarios: one where the ring is full, where only the consumer can
proceed, followed by the producer, and one where the ring is empty, where
only the producer can proceed, followed by the consumer. Each scenario
is then expanded to four different tests: general producer/general
consumer, general producer/acqrel consumer, acqrel producer/general
consumer, acqrel producer/acqrel consumer. In total eight tests.
The empty ring test:
C spsc-rb+empty
// Simple one entry ring:
// prod cons allowed action prod cons
// 0 0 => prod => 1 0
// 0 1 => cons => 0 0
// 1 0 => cons => 1 1
// 1 1 => prod => 0 1
{}
// We start at prod==0, cons==0, data==0, i.e. nothing has been
// written to the ring. From here only the producer can start, and
// should write 1. Afterwards, consumer can continue and read 1 to
// data. Can we enter state prod==1, cons==1, but consumer observed
// the incorrect value of 0?
P0(int *prod, int *cons, int *data)
{
... producer
}
P1(int *prod, int *cons, int *data)
{
... consumer
}
exists( 1:d=0 /\ prod=1 /\ cons=1 );
The full ring test:
C spsc-rb+full
// Simple one entry ring:
// prod cons allowed action prod cons
// 0 0 => prod => 1 0
// 0 1 => cons => 0 0
// 1 0 => cons => 1 1
// 1 1 => prod => 0 1
{ prod = 1; }
// We start at prod==1, cons==0, data==0, i.e. producer has
// written 0, so from here only the consumer can start, and should
// consume 0. Afterwards, producer can continue and write 1 to
// data. Can we enter state prod==0, cons==1, but consumer observed
// the write of 1?
P0(int *prod, int *cons, int *data)
{
... producer
}
P1(int *prod, int *cons, int *data)
{
... consumer
}
exists( 1:d=1 /\ prod=0 /\ cons=1 );
where P0 and P1 are:
// Producer using general barriers
P0(int *prod, int *cons, int *data)
{
    int p;

    p = READ_ONCE(*prod);
    if (READ_ONCE(*cons) == p) {
        WRITE_ONCE(*data, 1);
        smp_wmb();
        WRITE_ONCE(*prod, p ^ 1);
    }
}

// Producer using store-release
P0(int *prod, int *cons, int *data)
{
    int p;

    p = READ_ONCE(*prod);
    if (READ_ONCE(*cons) == p) {
        WRITE_ONCE(*data, 1);
        smp_store_release(prod, p ^ 1);
    }
}

// Consumer using general barriers
P1(int *prod, int *cons, int *data)
{
    int c;
    int d = -1;

    c = READ_ONCE(*cons);
    if (READ_ONCE(*prod) != c) {
        smp_rmb();
        d = READ_ONCE(*data);
        smp_mb();
        WRITE_ONCE(*cons, c ^ 1);
    }
}

// Consumer using load-acquire/store-release
P1(int *prod, int *cons, int *data)
{
    int c;
    int d = -1;

    c = READ_ONCE(*cons);
    if (smp_load_acquire(prod) != c) {
        d = READ_ONCE(*data);
        smp_store_release(cons, c ^ 1);
    }
}
The full LKMM litmus tests are found at [1].
On x86-64 systems the l2fwd AF_XDP xdpsock sample performance
increases by 1%. This is mostly because the smp_mb() is removed,
which is a relatively expensive operation on these
platforms. Weakly-ordered platforms, such as ARM64, might benefit even
more.
[1] https://github.com/bjoto/litmus-xsk
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210305094113.413544-2-bjorn.topel@gmail.com
This patch constructs the skb directly from pages, which saves the
memory copy overhead.
This is implemented on top of IFF_TX_SKB_NO_LINEAR. Only network
cards whose priv_flags include IFF_TX_SKB_NO_LINEAR will have the skb
constructed directly from pages. If this feature is not supported, the
data still has to be copied to construct the skb.
---------------- Performance Testing ------------
The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t -S -s <msg size>
```
Test result data:
size         64      512     1024     1500
copy    1916747  1775988  1600203  1440054
page    1974058  1953655  1945463  1904478
percent    3.0%    10.0%   21.58%    32.3%
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210218204908.5455-6-alobakin@pm.me
xsk_generic_xmit() allocates a new skb and then queues it for
xmitting. The new skb is sized to exactly desc->len, so it comes
to the driver/device with no reserved headroom and/or tailroom.
Lots of drivers need some headroom (and sometimes tailroom) to
prepend (and/or append) some headers or data, e.g. CPU tags,
device-specific headers/descriptors (LSO, TLS etc.), and in case
of no available space skb_cow_head() will reallocate the skb.
Reallocations are unwanted on fast-path, especially when it comes
to XDP, so generic XSK xmit should reserve the spaces declared in
dev->needed_headroom and dev->needed_tailroom to avoid them.
Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):
Usually, output functions reserve LL_RESERVED_SPACE(dev), which
consists of dev->hard_header_len + dev->needed_headroom, aligned
by 16.
However, on XSK xmit the hard header is already here in the chunk, so
hard_header_len is not needed. But it'd still be better to align
data up to cacheline, while reserving no less than the driver requests
for headroom. NET_SKB_PAD here is to doubly ensure there will be
no reallocations even when the driver advertises no needed_headroom
but in fact needs it (a not so rare case).
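The headroom rule from the note can be sketched in plain C. The two constants are assumptions for illustration (a 64-byte cache line and a 64-byte pad); the kernel's real L1_CACHE_BYTES and NET_SKB_PAD depend on the architecture. sim_xmit_alloc_len() is a hypothetical helper showing one way the reserved spaces could add up.

```c
#include <stddef.h>

#define SIM_L1_CACHE_BYTES 64u  /* assumed cache line size */
#define SIM_NET_SKB_PAD    64u  /* assumed minimum pad */

/* Rounds x up to the next multiple of the cache line size. */
static size_t sim_l1_cache_align(size_t x)
{
	return (x + SIM_L1_CACHE_BYTES - 1) &
	       ~((size_t)SIM_L1_CACHE_BYTES - 1);
}

/* max(NET_SKB_PAD, L1_CACHE_ALIGN(needed_headroom)) */
static size_t sim_xmit_headroom(size_t needed_headroom)
{
	size_t aligned = sim_l1_cache_align(needed_headroom);

	return aligned > SIM_NET_SKB_PAD ? aligned : SIM_NET_SKB_PAD;
}

/* Total size to allocate: headroom, then payload, then tailroom. */
static size_t sim_xmit_alloc_len(size_t needed_headroom, size_t len,
				 size_t needed_tailroom)
{
	return sim_xmit_headroom(needed_headroom) + len + needed_tailroom;
}
```

Under these assumed constants a driver that advertises no headroom still gets 64 bytes, and one that asks for 65 bytes gets 128.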
Fixes: 35fcde7f8d ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210218204908.5455-5-alobakin@pm.me
The explicit_free parameter of the __xsk_rcv() function was used to
mark whether the call was via the generic XDP or the native XDP
path. Instead of cluttering the code with if-statements and "true/false"
parameters that are hard to understand, simply move the explicit free
to __xsk_map_redirect(), which is always called from the native XDP
path.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20210122105351.11751-2-bjorn.topel@gmail.com
The number of queues can change by means other than ethtool. For
example, attaching an mqprio qdisc with num_tc > 1 leads to creating
multiple sets of TX queues, which may be then destroyed when mqprio is
deleted. If an AF_XDP socket is created while mqprio is active,
dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may
decrease with deletion of mqprio, which will mean that the pool won't be
NULLed, and a further increase of the number of TX queues may expose a
dangling pointer.
To avoid any potential misbehavior, this commit clears pool for RX and
TX queues, regardless of real_num_*_queues, still taking into
consideration num_*_queues to avoid overflows.
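A sketch of the bounds logic, using hypothetical sim_* structures: the pool pointer is cleared whenever the queue id fits within the allocated queue count, regardless of how many queues are currently active.

```c
#include <stddef.h>

/* Hypothetical per-queue state holding a buffer pool pointer. */
struct sim_queue {
	void *pool;
};

/* Hypothetical subset of a net device. */
struct sim_netdev {
	struct sim_queue *rx;
	struct sim_queue *tx;
	unsigned int num_rx_queues;       /* allocated RX queues */
	unsigned int num_tx_queues;       /* allocated TX queues */
	unsigned int real_num_rx_queues;  /* currently active RX queues */
	unsigned int real_num_tx_queues;  /* currently active TX queues */
};

static void sim_clear_pool_at_qid(struct sim_netdev *dev, unsigned int qid)
{
	/*
	 * Compare against the allocated counts, not the active ones, so
	 * a pool set while more queues were active is still cleared.
	 */
	if (qid < dev->num_rx_queues)
		dev->rx[qid].pool = NULL;
	if (qid < dev->num_tx_queues)
		dev->tx[qid].pool = NULL;
}
```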
Fixes: 1c1efc2af1 ("xsk: Create and free buffer pool independently from umem")
Fixes: a41b4f3c58 ("xsk: simplify xdp_clear_umem_at_qid implementation")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com
Rollback the reservation in the completion ring when we get a
NETDEV_TX_BUSY. When this error is received from the driver, we are
supposed to let the user application retry the transmit. In
order to do this, we need to roll back the failed send so it can be
retried. Unfortunately, we did not cancel the reservation we had made
in the completion ring. By not doing this, we actually make the
completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
after enough of these errors the completion ring will be of size zero
and transmit will stop working.
Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
error.
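The leak and its fix can be modelled with a counter-based completion queue. All sim_* names and return values are hypothetical; the point is only that a reservation has to be undone when the driver reports busy.

```c
#include <stdint.h>

enum { SIM_TX_OK, SIM_TX_BUSY };

/* Hypothetical completion queue tracked by two counters. */
struct sim_cq {
	uint32_t cached_prod;  /* entries produced or reserved */
	uint32_t cons;         /* entries consumed by the application */
	uint32_t nentries;     /* ring size */
};

/* Reserves one completion entry; fails when the ring is full. */
static int sim_cq_reserve(struct sim_cq *q)
{
	if (q->cached_prod - q->cons == q->nentries)
		return -1;
	q->cached_prod++;
	return 0;
}

/* Gives back the most recent reservation. */
static void sim_cq_cancel(struct sim_cq *q)
{
	q->cached_prod--;
}

/* Returns 0 on success, -1 if the ring is full, -2 if a retry is needed. */
static int sim_xmit_one(struct sim_cq *q, int driver_status)
{
	if (sim_cq_reserve(q))
		return -1;
	if (driver_status == SIM_TX_BUSY) {
		/* Without this cancel the ring shrinks by one per error. */
		sim_cq_cancel(q);
		return -2;
	}
	return 0;
}
```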
Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.
Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.
Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
Fix a possible memory leak when a bind of an AF_XDP socket fails. When
the fill and completion rings are created, they are tied to the
socket. But when the buffer pool is later created at bind time, the
ownership of these two rings is transferred to the buffer pool as
they might be shared between sockets (and the buffer pool cannot be
created until we know what we are binding to). So, before the buffer
pool is created, these two rings are cleaned up with the socket, and
after they have been transferred they are cleaned up together with
the buffer pool.
The problem is that ownership was transferred before it was absolutely
certain that the buffer pool could be created and initialized
correctly, and when one of these errors occurred, the fill and
completion rings belonged to neither the socket nor the pool and
were therefore leaked. Solve this by moving the ownership transfer
to the point where the buffer pool has been completely set up and
there is no way it can fail.
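The ordering can be sketched as follows, with hypothetical sim_* types and a setup_fails flag standing in for any error during pool creation: the rings change owner only after nothing can fail any more.

```c
#include <stddef.h>

struct sim_ring {
	int id;
};

/* Before bind, the socket owns the fill and completion rings. */
struct sim_sock {
	struct sim_ring *fq_tmp;
	struct sim_ring *cq_tmp;
};

/* After a successful bind, the pool owns them. */
struct sim_pool {
	struct sim_ring *fq;
	struct sim_ring *cq;
};

static int sim_bind(struct sim_sock *xs, struct sim_pool *pool,
		    int setup_fails)
{
	/* Every step that can fail runs first. */
	if (setup_fails)
		return -1;  /* rings are still owned by the socket */

	/* Point of no return: transfer ownership to the pool. */
	pool->fq = xs->fq_tmp;
	pool->cq = xs->cq_tmp;
	xs->fq_tmp = NULL;
	xs->cq_tmp = NULL;
	return 0;
}
```

At every point exactly one of the two objects holds the rings, so whichever cleanup path runs will free them.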
Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-12-14
1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.
2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.
3) Support for finding BTF based kernel attach targets through libbpf's
bpf_program__set_attach_target() API, from Andrii Nakryiko.
4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.
5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.
6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.
7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.
8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.
9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
KP Singh, Jiri Olsa and various others.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
bpf: Permits pointers on stack for helper calls
libbpf: Expose libbpf ring_buffer epoll_fd
selftests/bpf: Add set_attach_target() API selftest for module target
libbpf: Support modules in bpf_program__set_attach_target() API
selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
selftests/bpf: Drop the need for LLVM's llc
selftests/bpf: fix bpf_testmod.ko recompilation logic
samples/bpf: Fix possible hang in xdpsock with multiple threads
selftests/bpf: Make selftest compilation work on clang 11
selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
selftests/bpf: Drop tcp-{client,server}.py from Makefile
selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
selftests/bpf: Xsk selftests framework
bpf: Only provide bpf_sock_from_file with CONFIG_NET
bpf: Return -ENOTSUPP when attaching to non-kernel BTF
xsk: Validate socket state in xsk_recvmsg, prior touching socket members
...
====================
Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a check for need wake up in sendmsg(), so that if a user calls
sendmsg() when no wakeup is needed, do not trigger a wakeup.
To simplify the need wakeup check in the syscall, unconditionally
enable the need wakeup flag for Tx. This has a side-effect for poll():
if poll() is called for a socket without enabled need wakeup, a Tx
wakeup is unconditionally performed.
The wakeup matrix for AF_XDP now looks like:
need wakeup | poll()       | sendmsg()   | recvmsg()
------------+--------------+-------------+------------
disabled    | wake Tx      | wake Tx     | nop
enabled     | check flag;  | check flag; | check flag;
            | wake Tx/Rx   | wake Tx     | wake Rx
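One way to read the matrix is as a small decision function. This is an interpretation for illustration only: the sim_* names are hypothetical, and ring_flags is assumed to say which directions currently have their need wakeup flag set.

```c
#include <stdbool.h>

#define SIM_WAKE_RX 1u
#define SIM_WAKE_TX 2u

enum sim_syscall { SIM_POLL, SIM_SENDMSG, SIM_RECVMSG };

/*
 * Returns the set of wakeups to perform. When need wakeup is disabled
 * the action is unconditional; when it is enabled, only the directions
 * whose flag is currently set are woken.
 */
static unsigned int sim_wakeup_action(enum sim_syscall sc, bool enabled,
				      unsigned int ring_flags)
{
	unsigned int wanted = 0;

	switch (sc) {
	case SIM_POLL:
		wanted = enabled ? (SIM_WAKE_RX | SIM_WAKE_TX) : SIM_WAKE_TX;
		break;
	case SIM_SENDMSG:
		wanted = SIM_WAKE_TX;
		break;
	case SIM_RECVMSG:
		wanted = enabled ? SIM_WAKE_RX : 0;
		break;
	}
	return enabled ? (wanted & ring_flags) : wanted;
}
```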
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
Commit 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.
If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered, the
BUG_ON(skb_shared(skb)) will trigger.
This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.
Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com