Get the total number of zones of a zoned block device. The device can
be a partition of the whole device. The number of zones is always 0 for
regular block devices.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get the zone size of a zoned block device, in number of 512 B sectors.
The zone size is always 0 for regular block devices.
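For illustration, a minimal userspace sketch of querying both values
follows. The text above does not name the ioctl requests; BLKGETNRZONES
and BLKGETZONESZ from <linux/blkzoned.h> are assumed here to be the
interfaces these two changes describe, and zoned_query() is a
hypothetical helper, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

/* Fallback definitions for older userspace headers (assumed values). */
#ifndef BLKGETZONESZ
#define BLKGETZONESZ  _IOR(0x12, 132, __u32)
#endif
#ifndef BLKGETNRZONES
#define BLKGETNRZONES _IOR(0x12, 133, __u32)
#endif

/*
 * Hypothetical helper: issue one of the two zone queries on a device node.
 * Returns 0 and stores the result in *value on success, or -1 with errno
 * set on failure. *value is left untouched on failure.
 */
int zoned_query(const char *path, unsigned long request, uint32_t *value)
{
    __u32 v = 0;
    int fd, ret, saved;

    assert(path != NULL && value != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ret = ioctl(fd, request, &v);
    saved = errno;      /* keep the ioctl error across close() */
    close(fd);
    errno = saved;

    if (ret < 0)
        return -1;
    *value = v;
    return 0;
}
```

A caller would pass a block device node, for example
zoned_query(path, BLKGETNRZONES, &nr) followed by
zoned_query(path, BLKGETZONESZ, &sectors). Per the descriptions above,
both values are expected to be 0 on a regular block device.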
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit ead7f9b8de65632ef8060b84b0c55049a33cfea1 upstream.
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_replace via a new flag.
When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:
    1:  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
    2:                                       __wsum diff, bool pseudohdr)
    3:  {
    4:      if (skb->ip_summed != CHECKSUM_PARTIAL) {
    5:          csum_replace_by_diff(sum, diff);
    6:          if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
    7:              skb->csum = ~csum_sub(diff, skb->csum);
    8:      } else if (pseudohdr) {
    9:          *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
    10:     }
    11: }
The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.
For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it so as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.
The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.
This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.
This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.
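The cancellation argument can be checked with a toy model. The sketch
below is not kernel code: it assumes a 12-word packet whose first eight
16-bit words stand in for an IPv6 address and whose ninth word is a
checksum covering the whole packet. All names are hypothetical. It shows
that rewriting the address and patching the checksum incrementally leaves
the sum over the whole packet unchanged, which is why a packet-wide sum
such as skb->csum needs no further adjustment in this case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layout: words 0..7 = address, word 8 = checksum, words 9..11 = payload. */
enum { ADDR_WORDS = 8, L4_CSUM = 8, PKT_WORDS = 12 };

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement sum over n 16-bit words. */
uint16_t csum_words(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += w[i];
    return fold(sum);
}

/* Fill in the checksum so the whole packet sums to 0xffff (ones' complement zero). */
void set_l4_csum(uint16_t *pkt)
{
    pkt[L4_CSUM] = 0;
    pkt[L4_CSUM] = (uint16_t)~csum_words(pkt, PKT_WORDS);
    assert(csum_words(pkt, PKT_WORDS) == 0xffff);
}

/*
 * Rewrite the address and patch the checksum incrementally:
 * new_csum = ~(~old_csum + sum(~old_word) + sum(new_word)).
 */
void rewrite_addr(uint16_t *pkt, const uint16_t *new_addr)
{
    uint32_t acc = (uint16_t)~pkt[L4_CSUM];
    int i;

    for (i = 0; i < ADDR_WORDS; i++) {
        acc += (uint16_t)~pkt[i];   /* remove the old word */
        acc += new_addr[i];         /* add the new word */
        pkt[i] = new_addr[i];
    }
    pkt[L4_CSUM] = (uint16_t)~fold(acc);
}
```

Calling csum_words() over the whole packet before and after
rewrite_addr() gives the same value in this model: the address change
and the checksum change cancel each other.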
Fixes: 7d672345ed ("bpf: add generic bpf_csum_diff helper")
Change-Id: Ia07e6770587fff91588ba133a9efadab92372ed9
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Note: Fixed conflict due to unrelated comment change. ]
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c32e8f8bc33a5f4b113a630857e46634e3e143b upstream.
Allow passing sk_lookup programs to PROG_TEST_RUN. User space
provides the full bpf_sk_lookup struct as context. Since the
context includes a socket pointer that can't be exposed
to user space, we define that PROG_TEST_RUN returns the cookie
of the selected socket, or zero, in place of the socket pointer.
We don't support testing programs that select a reuseport socket,
since this would mean running another (unrelated) BPF program
from the sk_lookup test handler.
Change-Id: I7af748e3f11804e4e1ad0c532685f0c3dfaf4816
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the label says "for internal use only", then it doesn't belong
in the 'uapi' subtree.
Change-Id: Ia10de797ba5e5977870ebadb67a870405934e76e
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip:
CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree
Update localversion-st, tree is up-to-date with 5.4.298.
f2fs: fix to do sanity check on ino and xnid
squashfs: fix memory leak in squashfs_fill_super
pNFS: Handle RPC size limit for layoutcommits
wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect
usb: core: usb_submit_urb: downgrade type check
udf: Verify partition map count
f2fs: fix to avoid panic in f2fs_evict_inode
usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm
Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS"
net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions
HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
HID: asus: fix UAF via HID_CLAIMED_INPUT validation
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
sctp: initialize more fields in sctp_v6_from_sk()
net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
net/mlx5e: Set local Xoff after FW update
net: dlink: fix multicast stats being counted incorrectly
atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().
net/atm: remove the atmdev_ops {get, set}sockopt methods
Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced
powerpc/kvm: Fix ifdef to remove build warning
net: ipv4: fix regression in local-broadcast routes
vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()
scsi: core: sysfs: Correct sysfs attributes access rights
ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
alloc_fdtable(): change calling conventions.
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit
ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add
ALSA: usb-audio: Fix size validation in convert_chmap_v3()
scsi: qla4xxx: Prevent a potential error pointer dereference
usb: xhci: Fix slot_id resource race conflict
nfs: fix UAF in direct writes
NFS: Fix up commit deadlocks
Bluetooth: fix use-after-free in device_for_each_child()
selftests: forwarding: tc_actions.sh: add matchall mirror test
codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
sch_qfq: make qfq_qlen_notify() idempotent
sch_hfsc: make hfsc_qlen_notify() idempotent
sch_drr: make drr_qlen_notify() idempotent
btrfs: populate otime when logging an inode item
media: venus: hfi: explicitly release IRQ during teardown
f2fs: fix to avoid out-of-boundary access in dnode page
media: venus: protect against spurious interrupts during probe
media: venus: vdec: Clamp param smaller than 1fps and bigger than 240.
drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt()
media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free()
ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig
usb: musb: omap2430: fix device leak at unbind
NFS: Fix the setting of capabilities when automounting a new filesystem
NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode()
NFSv4: Fix nfs4_bitmap_copy_adjust()
usb: typec: fusb302: cache PD RX state
cdc-acm: fix race between initial clearing halt and open
USB: cdc-acm: do not log successful probe on later errors
nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm()
tracing: Add down_write(trace_event_sem) when adding trace event
usb: hub: Don't try to recover devices lost during warm reset.
usb: hub: avoid warm port reset during USB3 disconnect
x86/mce/amd: Add default names for MCA banks and blocks
iio: hid-sensor-prox: Fix incorrect OFFSET calculation
mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()
net: usbnet: Fix the wrong netif_carrier_on() call
net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value
kbuild: Add KBUILD_CPPFLAGS to as-option invocation
kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS
kbuild: Add CLANG_FLAGS to as-instr
mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation
kbuild: Update assembler calls to use proper flags and language target
ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS
usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
usb: storage: realtek_cr: Use correct byte order for bcs->Residue
USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
iio: proximity: isl29501: fix buffered read on big-endian systems
ftrace: Also allocate and copy hash for reading of filter files
fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
fs/buffer: fix use-after-free when call bh_read() helper
drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
media: venus: Add a check for packet size after reading from shared memory
media: ov2659: Fix memory leaks in ov2659_probe()
media: usbtv: Lock resolution while streaming
media: gspca: Add bounds checking to firmware parser
jbd2: prevent softlockup in jbd2_log_do_checkpoint()
PCI: endpoint: Fix configfs group removal on driver teardown
PCI: endpoint: Fix configfs group list head handling
mtd: rawnand: fsmc: Add missing check after DMA map
wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table()
zynq_fpga: use sgtable-based scatterlist wrappers
ata: libata-scsi: Fix ata_to_sense_error() status handling
ext4: fix reserved gdt blocks handling in fsmap
ext4: fix fsmap end of range reporting with bigalloc
ext4: check fast symlink for ea_inode correctly
Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
vt: defkeymap: Map keycodes above 127 to K_HOLE
usb: gadget: udc: renesas_usb3: fix device leak at unbind
usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init()
m68k: Fix lost column on framebuffer debug console
serial: 8250: fix panic due to PSLVERR
media: uvcvideo: Do not mark valid metadata as invalid
media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format()
btrfs: fix log tree replay failure due to file with 0 links and extents
thunderbolt: Fix copy+paste error in match_service_id()
misc: rtsx: usb: Ensure mmc child device is active when card is present
scsi: lpfc: Remove redundant assignment to avoid memory leak
rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe
pNFS: Fix uninited ptr deref in block/scsi layout
pNFS: Fix disk addr range check in block/scsi layout
pNFS: Fix stripe mapping in block/scsi layout
ipmi: Fix strcpy source and destination the same
kconfig: lxdialog: fix 'space' to (de)select options
kconfig: gconf: fix potential memory leak in renderer_edited()
kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed()
scsi: aacraid: Stop using PCI_IRQ_AFFINITY
scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans
kconfig: nconf: Ensure null termination where strncpy is used
kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
PCI: pnv_php: Work around switches with broken presence detection
media: uvcvideo: Fix bandwidth issue for Alcor camera
media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar
media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb()
media: usb: hdpvr: disable zero-length read messages
media: tc358743: Increase FIFO trigger level to 374
media: tc358743: Return an appropriate colorspace from tc358743_set_fmt
media: tc358743: Check I2C succeeded during probe
pinctrl: stm32: Manage irq affinity settings
scsi: mpt3sas: Correctly handle ATA device errors
RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
jfs: upper bound check of tree index in dbAllocAG
jfs: Regular file corruption check
jfs: truncate good inode pages when hard link is 0
scsi: bfa: Double-free fix
MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free}
watchdog: dw_wdt: Fix default timeout
fs/orangefs: use snprintf() instead of sprintf()
scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated
ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
vhost: fail early when __vhost_add_used() fails
uapi: in6: restore visibility of most IPv6 socket options
net: ncsi: Fix buffer overflow in fetching version id
net: dsa: b53: fix b53_imp_vlan_setup for BCM5325
net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
wifi: iwlegacy: Check rate_idx range after addition
netmem: fix skb_frag_address_safe with unreadable skbs
wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`.
wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd()
net: fec: allow disable coalescing
(powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer
s390/stp: Remove udelay from stp_sync_clock()
wifi: iwlwifi: mvm: fix scan request validation
net: thunderx: Fix format-truncation warning in bgx_acpi_match_id()
net: ipv4: fix incorrect MTU in broadcast routes
wifi: cfg80211: Fix interface type validation
et131x: Add missing check after DMA map
be2net: Use correct byte order and format string for TCP seq and ack_seq
s390/time: Use monotonic clock in get_cycles()
wifi: cfg80211: reject HTC bit for management frames
ktest.pl: Prevent recursion of default variable options
ASoC: codecs: rt5640: Retry DEVICE_ID verification
ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros
ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control
platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches
pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop()
ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4
ASoC: hdac_hdmi: Rate limit logging on connection and disconnection
mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode()
ACPI: processor: fix acpi_object initialization
PM: sleep: console: Fix the black screen issue
thermal: sysfs: Return ENODATA instead of EAGAIN for reads
selftests: tracing: Use mutex_unlock for testing glob filter
ARM: tegra: Use I/O memcpy to write to IRAM
gpio: tps65912: check the return value of regmap_update_bits()
ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed
cpufreq: Exit governor when failed to start old governor
usb: xhci: Avoid showing errors during surprise removal
usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command
usb: xhci: Avoid showing warnings for dying controller
selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
usb: xhci: print xhci->xhc_state when queue_command failed
securityfs: don't pin dentries twice, once is enough...
hfs: fix not erasing deleted b-tree node issue
drbd: add missing kref_get in handle_write_conflicts
arm64: Handle KCOV __init vs inline mismatches
hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
hfs: fix slab-out-of-bounds in hfs_bnode_read()
sctp: linearize cloned gso packets in sctp_rcv
netfilter: ctnetlink: fix refcount leak on table dump
udp: also consider secpath when evaluating ipsec use for checksumming
fs: Prevent file descriptor table allocations exceeding INT_MAX
sunvdc: Balance device refcount in vdc_port_mpgroup_check
NFSD: detect mismatch of file handle and delegation stateid in OPEN op
net: dpaa: fix device leak when querying time stamp info
net: gianfar: fix device leak when querying time stamp info
netlink: avoid infinite retry looping in netlink_unicast()
ALSA: usb-audio: Validate UAC3 cluster segment descriptors
ALSA: usb-audio: Validate UAC3 power domain descriptors, too
usb: gadget : fix use-after-free in composite_dev_cleanup()
MIPS: mm: tlb-r4k: Uniquify TLB entries on init
USB: serial: option: add Foxconn T99W709
vsock: Do not allow binding to VMADDR_PORT_ANY
net/packet: fix a race in packet_set_ring() and packet_notifier()
perf/core: Prevent VMA split of buffer mappings
perf/core: Exit early on perf_mmap() fail
perf/core: Don't leak AUX buffer refcount on allocation failure
pptp: fix pptp_xmit() error path
smb: client: let recv_done() cleanup before notifying the callers.
benet: fix BUG when creating VFs
ipv6: reject malicious packets in ipv6_gso_segment()
pptp: ensure minimal skb length in pptp_xmit()
netpoll: prevent hanging NAPI when netcons gets enabled
NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
pci/hotplug/pnv-php: Wrap warnings in macro
pci/hotplug/pnv-php: Improve error msg on power state change failure
usb: chipidea: udc: fix sleeping function called from invalid context
f2fs: fix to avoid out-of-boundary access in devs.path
f2fs: fix to avoid UAF in f2fs_sync_inode_meta()
rtc: pcf8563: fix incorrect maximum clock rate handling
rtc: hym8563: fix incorrect maximum clock rate handling
rtc: ds1307: fix incorrect maximum clock rate handling
mtd: rawnand: atmel: set pmecc data setup time
mtd: rawnand: atmel: Fix dma_mapping_error() address
jfs: fix metapage reference count leak in dbAllocCtl
fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
crypto: qat - fix seq_file position update in adf_ring_next()
dmaengine: nbpfaxi: Add missing check after DMA map
dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
fs/orangefs: Allow 2 more characters in do_c_string()
crypto: img-hash - Fix dma_unmap_sg() nents value
scsi: isci: Fix dma_unmap_sg() nents value
scsi: mvsas: Fix dma_unmap_sg() nents value
scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value
perf tests bp_account: Fix leaked file descriptor
crypto: ccp - Fix crash when rebind ccp device for ccp.ko
pinctrl: sunxi: Fix memory leak on krealloc failure
power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set
clk: davinci: Add NULL check in davinci_lpsc_clk_register()
mtd: fix possible integer overflow in erase_xfer()
crypto: marvell/cesa - Fix engine load inaccuracy
PCI: rockchip-host: Fix "Unexpected Completion" log message
vrf: Drop existing dst reference in vrf_ip6_input_dst
netfilter: xt_nfacct: don't assume acct name is null-terminated
can: kvaser_usb: Assign netdev.dev_port based on device channel index
wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE
Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
mwl8k: Add missing check after DMA map
wifi: rtl8xxxu: Fix RX skb size for aggregation disabled
net/sched: Restrict conditions for adding duplicating netems to qdisc tree
arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
netfilter: nf_tables: adjust lockdep assertions handling
drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value
m68k: Don't unregister boot console needlessly
tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range
iwlwifi: Add missing check for alloc_ordered_workqueue
wifi: iwlwifi: Fix memory leak in iwl_mvm_init()
wifi: rtl818x: Kill URBs before clearing tx status queue
caif: reduce stack size, again
staging: nvec: Fix incorrect null termination of battery manufacturer
samples: mei: Fix building on musl libc
usb: early: xhci-dbc: Fix early_ioremap leak
Revert "vmci: Prevent the dispatching of uninitialized payloads"
pps: fix poll support
vmci: Prevent the dispatching of uninitialized payloads
staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc()
ARM: dts: vfxxx: Correctly use two tuples for timer address
ASoC: ops: dynamically allocate struct snd_ctl_elem_value
hfsplus: remove mutex_lock check in hfsplus_free_extents
ASoC: Intel: fix SND_SOC_SOF dependencies
ethernet: intel: fix building with large NR_CPUS
usb: phy: mxs: disconnect line when USB charger is attached
usb: chipidea: udc: protect usb interrupt enable
usb: chipidea: udc: add new API ci_hdrc_gadget_connect
comedi: comedi_test: Fix possible deletion of uninitialized timers
nilfs2: reject invalid file types when reading inodes
i2c: qup: jump out of the loop in case of timeout
net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class
net: appletalk: Fix use-after-free in AARP proxy probe
net: appletalk: fix kerneldoc warnings
RDMA/core: Rate limit GID cache warning messages
usb: hub: fix detection of high tier USB3 devices behind suspended hubs
net_sched: sch_sfq: reject invalid perturb period
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net_sched: sch_sfq: don't allow 1 packet limit
net_sched: sch_sfq: handle bigger packets
net_sched: sch_sfq: annotate data-races around q->perturb_period
power: supply: bq24190_charger: Fix runtime PM imbalance on error
xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS
virtio-net: ensure the received length does not exceed allocated size
usb: dwc3: qcom: Don't leave BCR asserted
usb: musb: fix gadget state on disconnect
net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree
net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime
Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
Bluetooth: SMP: If an unallowed command is received consider it a failure
Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
usb: net: sierra: check for no status endpoint
net/sched: sch_qfq: Fix race condition on qfq_aggregate
net: emaclite: Fix missing pointer increment in aligned_read()
comedi: Fix use of uninitialized data in insn_rw_emulate_bits()
comedi: Fix some signed shift left operations
comedi: das6402: Fix bit shift out of bounds
comedi: das16m1: Fix bit shift out of bounds
comedi: aio_iiro_16: Fix bit shift out of bounds
comedi: pcl812: Fix bit shift out of bounds
iio: adc: max1363: Reorder mode_list[] entries
iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[]
soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled
soc: aspeed: lpc-snoop: Cleanup resources in stack-order
mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models
memstick: core: Zero initialize id_reg in h_memstick_read_dev_id()
isofs: Verify inode mode when loading from disk
dmaengine: nbpfaxi: Fix memory corruption in probe()
af_packet: fix soft lockup issue caused by tpacket_snd()
af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()
phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept()
HID: core: do not bypass hid_hw_raw_request
HID: core: ensure __hid_request reserves the report ID as the first byte
HID: core: ensure the allocated report buffer can contain the reserved report ID
pch_uart: Fix dma_sync_sg_for_device() nents value
Input: xpad - set correct controller type for Acer NGR200
i2c: stm32: fix the device used for the DMA map
usb: gadget: configfs: Fix OOB read on empty string write
USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI
USB: serial: option: add Foxconn T99W640
USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition
dma-mapping: add generic helpers for mapping sgtable objects
usb: renesas_usbhs: Flush the notify_hotplug_work
gpio: rcar: Use raw_spinlock to protect register access
Change-Id: Ia6b8b00918487999c648f298d3550afc7eaaae03
Signed-off-by: bengris32 <bengris32@protonmail.ch>
PROPAGATED_from (CR)
Currently cgroup freezer is used to freeze the application threads, and
BINDER_FREEZE is used to freeze the corresponding binder interface.
There's already a mechanism in ioctl(BINDER_FREEZE) to wait for any
existing transactions to drain out before actually freezing the binder
interface.
But freezing an app requires two steps: freezing the binder interface with
ioctl(BINDER_FREEZE) and then freezing the application main threads with
cgroupfs. This is not an atomic operation, so the following race can
happen.
1) Binder interface is frozen by ioctl(BINDER_FREEZE);
2) Main thread A initiates a new sync binder transaction to process B;
3) Main thread A is frozen by "echo 1 > cgroup.freeze";
4) The response from process B reaches the frozen thread, which will
unexpectedly fail.
This patch provides a mechanism to check if there's any new pending
transaction happening between ioctl(BINDER_FREEZE) and freezing the
main thread. If there's any, the main thread freezing operation can
be rolled back to finish the pending transaction.
Furthermore, the response might reach the binder driver before the
rollback actually happens. That would still cause a failed transaction.
As the other process doesn't wait for another response to the response,
the response transaction failure can be fixed by treating the response
transaction like a oneway/async one, allowing it to reach the frozen
thread. It will then be consumed when the thread gets unfrozen later.
NOTE: This patch reuses the existing definition of struct
binder_frozen_status_info but expands the bit assignments of __u32
member sync_recv.
To ensure backward compatibility, bit 0 of sync_recv still indicates
there's an outstanding sync binder transaction. This patch adds new
information to bit 1 of sync_recv, indicating the binder transaction
happens exactly when there's a race.
If an existing userspace app runs on a new kernel, a sync binder call
will set bit 0 of sync_recv, so ioctl(BINDER_GET_FROZEN_INFO) still
returns the expected value (true). The app simply doesn't check bit 1,
so it cannot tell whether there was a race. This behavior is aligned
with what happens on an old kernel, which doesn't set bit 1 at all.
A new userspace app can 1) check bit 0 to know whether a sync binder
transaction happened while the process was frozen, same as before; and
2) check bit 1 to know whether that sync binder transaction happened
exactly when there was a race, which is new information for the
rollback decision.
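As a sketch of how a new userspace app might interpret the two bits, the
following decodes a sync_recv value. The macro, struct and function names
are hypothetical and only mirror the bit assignments described above; the
value itself would come from ioctl(BINDER_GET_FROZEN_INFO), which is not
called here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the two sync_recv bits described above. */
#define SYNC_RECV_TXN_SEEN (1u << 0)  /* outstanding sync transaction */
#define SYNC_RECV_RACED    (1u << 1)  /* it happened exactly during the race */

struct frozen_sync_status {
    bool txn_seen;
    bool raced;
};

/* Split sync_recv into its two flags; unknown higher bits are ignored. */
void decode_sync_recv(uint32_t sync_recv, struct frozen_sync_status *out)
{
    assert(out != NULL);
    out->txn_seen = (sync_recv & SYNC_RECV_TXN_SEEN) != 0;
    out->raced = (sync_recv & SYNC_RECV_RACED) != 0;
}

/* An old app only looks at bit 0; a new app can also use bit 1. */
bool should_roll_back_freeze(uint32_t sync_recv)
{
    struct frozen_sync_status st;

    decode_sync_recv(sync_recv, &st);
    return st.raced;
}
```

An app written before this change effectively evaluates only txn_seen,
so its behavior is the same on old and new kernels.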
Fixes: 432ff1e91694 ("binder: BINDER_FREEZE ioctl")
Acked-by: Todd Kjos <tkjos@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Li Li <dualli@google.com>
Test: stress test with apps being frozen and initiating binder calls at
the same time, confirmed the pending transactions succeeded.
Link: https://lore.kernel.org/r/20210910164210.2282716-2-dualli@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 198493121
(cherry picked from commit b564171ade70570b7f335fa8ed17adb28409e3ac
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
char-misc-linus)
Change-Id: I488ba75056f18bb3094ba5007027b76b5caebec9
Reviewed-on: https://gerrit.mot.com/2496991
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver
Tested-by: Jira Key
Reviewed-by: Zhangqing Huang <huangzq2@motorola.com>
Reviewed-by: Hua Tan <tanhua1@motorola.com>
Submit-Approved: Jira Key
Signed-off-by: Levy Gabriel da Silva Galvao <levy@motorola.com>
Reviewed-on: https://gerrit.mot.com/2704333
Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com>
Reviewed-by: Rafael Ortolan <rafones@motorola.com>
When the async binder buffer is exhausted, some normal oneway transactions
will also be discarded, which may cause system or application failures. By
that time, the binder debug information we dump may not be relevant to
the root cause, and the issue is difficult to debug without the
backtrace of the thread sending the spam.
This change sends BR_ONEWAY_SPAM_SUSPECT to userspace when oneway
spamming is detected, requesting a dump of the current backtrace. Oneway
spamming is reported only once when the threshold is exceeded (the target
process dips below 80% of its oneway space, and the current process is
responsible for either more than 50 transactions or more than 50% of the
oneway space). Detection restarts when the async buffer has returned to a
healthy state.
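The threshold logic can be modeled as a small state machine. This is a
sketch, not the driver code: the struct and function names are
hypothetical, and reading "dips below 80% of its oneway space" as "less
than 20% of the oneway space is still free" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct oneway_state {
    size_t async_total;  /* total oneway (async) space of the target */
    size_t async_free;   /* oneway space currently free */
    bool reported;       /* spam already reported in this low-space episode */
};

/* Assumption: "low" means less than 20% of the oneway space is free. */
static bool oneway_space_low(const struct oneway_state *s)
{
    return s->async_free * 5 < s->async_total;
}

/*
 * Returns true when the sender should be flagged as a spam suspect.
 * Reports at most once per low-space episode and re-arms once the
 * buffer is healthy again.
 */
bool oneway_spam_suspect(struct oneway_state *s, size_t sender_txns,
                         size_t sender_bytes)
{
    assert(s != NULL);

    if (!oneway_space_low(s)) {
        s->reported = false;    /* healthy again: restart detection */
        return false;
    }
    if (s->reported)
        return false;
    /* More than 50 transactions, or more than 50% of the oneway space. */
    if (sender_txns > 50 || sender_bytes * 2 > s->async_total) {
        s->reported = true;
        return true;
    }
    return false;
}
```

In this model a second call during the same low-space episode returns
false, matching the "reported only once" behavior described above.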
Acked-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Hang Lu <hangl@codeaurora.org>
Link: https://lore.kernel.org/r/1617961246-4502-3-git-send-email-hangl@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 181190340
Change-Id: Id3d2526099bc89f04d8ad3ad6e48141b2a8f2515
(cherry picked from commit a7dc1e6f99df59799ab0128d9c4e47bbeceb934d)
Signed-off-by: Hang Lu <hangl@codeaurora.org>
User space needs to know if binder transactions occurred to frozen
processes. Introduce a new BINDER_GET_FROZEN_INFO ioctl and keep track of
transactions occurring to frozen processes.
Bug: 180989544
(cherry picked from commit c55019c24b22d6770bd8e2f12fbddf3f83d37547
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git char-misc-testing)
Signed-off-by: Marco Ballesio <balejs@google.com>
Signed-off-by: Li Li <dualli@google.com>
Acked-by: Todd Kjos <tkjos@google.com>
Link: https://lore.kernel.org/r/20210316011630.1121213-4-dualli@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ie631f331ba4ca94a3bcdd43dec25fe9ba1306af2
Frozen tasks can't process binder transactions, so a way is required to
inform transmitting ends of communication failures due to the frozen
state of their receiving counterparts. Additionally, races are possible
between transitions to frozen state and binder transactions enqueued to
a specific process.
Implement BINDER_FREEZE ioctl for user space to inform the binder driver
about the intention to freeze or unfreeze a process. When the ioctl is
called, block the caller until any pending binder transactions toward
the target process are flushed. Return an error to transactions to
processes marked as frozen.
Bug: 180989544
(cherry picked from commit 15949c3cdd97bccdcd45c0c0f6c31058520b6494
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git char-misc-testing)
Co-developed-by: Todd Kjos <tkjos@google.com>
Acked-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Marco Ballesio <balejs@google.com>
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Li Li <dualli@google.com>
Link: https://lore.kernel.org/r/20210316011630.1121213-2-dualli@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ia1b5951cd99eeb98b59e06c3e27d59062dc725f6
Convert the zsfold filesystem to the new internal mount API as the old one
will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Change-Id: Ia3da9e9fd2ef5214c4e7fb71228f8b90c15f9e71
Signed-off-by: David Howells <dhowells@redhat.com>
[ Upstream commit b8e3a87a627b575896e448021e5c2f8a3bc19931 ]
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.
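The preemptive check can be summarized with a small standalone model.
This is only a sketch under stated assumptions: struct task and
check_stack_request() are hypothetical stand-ins, not the kernel's types
or the actual body of __bpf_get_stack.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; only identity matters here. */
struct task {
    int pid;
};

/*
 * Model of the check described above: a user stack can only be walked
 * for the current task, so a cross-task user stack request is refused
 * up front. Kernel stack requests are unaffected.
 */
int check_stack_request(const struct task *task, const struct task *current_task,
                        bool user_stack)
{
    bool crosstask;

    assert(task != NULL && current_task != NULL);

    crosstask = (task != current_task);
    if (crosstask && user_stack)
        return -EOPNOTSUPP;
    return 0;
}
```

A caller iterating over all tasks would see 0 for kernel stacks of any
task and -EOPNOTSUPP for user stacks of every task but the current one.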
This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.
It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.
Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()")
Change-Id: Iaaa9eecc1e9f9c4c7b4953e169725d1aa1808bc3
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7cb779a6867fea00b4209bcf6de2f178a743247d ]
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.
This conflicts with its UAPI doc, which says the following:
"0 on success, or a negative error in case of failure."
Update the UAPI to reflect the fact that bpf_clone_redirect can
return positive error numbers, but don't explicitly define
their meaning.
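For callers, the practical consequence is that a "ret < 0" test is no longer sufficient. A minimal sketch of a safer check follows; clone_redirect_failed() is a hypothetical wrapper name used only for illustration, and it deliberately does not interpret the positive values, since their meaning is left undefined.

```c
#include <stdbool.h>

/*
 * Treat any non-zero return value as a failure. Testing only for
 * "ret < 0" would miss positive error codes.
 */
static bool clone_redirect_failed(long ret)
{
	return ret != 0;
}
```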
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: I5b81b2b05cb8369feceae99f9fddaaf6a0121dd1
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ee2a098851bfbe8bcdd964c0121f4246f00ff41e upstream.
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
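The effect on the number of returned frames can be illustrated with a small userspace model. This is a sketch under simplifying assumptions, not the kernel implementation: depth stands for the real depth of the stack being walked, frames_kept_old() and frames_kept_new() are hypothetical names, and the global sysctl_perf_event_max_stack cap is ignored.

```c
#include <stddef.h>

/*
 * Old behaviour: the walk itself stops after num_elem frames, so the
 * skipped frames eat into the caller's buffer.
 */
static size_t frames_kept_old(size_t depth, size_t num_elem, size_t skip)
{
	size_t walked = depth < num_elem ? depth : num_elem;

	return walked > skip ? walked - skip : 0;
}

/*
 * New behaviour: the walk covers skip + num_elem frames, so up to
 * num_elem frames remain after the first skip frames are dropped.
 */
static size_t frames_kept_new(size_t depth, size_t num_elem, size_t skip)
{
	size_t max_depth = num_elem + skip;
	size_t walked = depth < max_depth ? depth : max_depth;

	return walked > skip ? walked - skip : 0;
}
```

With a deep stack, num_elem = 10 and skip = 3, the old model keeps 7 frames while the new one fills all 10 slots. For stacks shallower than num_elem the two models agree.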
Fixes: c195651e56 ("bpf: add bpf_get_stack helper")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com
Link: https://lore.kernel.org/bpf/20220314182042.71025-1-namhyung@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based-on-patch-by: Eugene Loh <eugene.loh@oracle.com>
Change-Id: If8b7c2f7705a2c5fc27208d472a4762b7525bfdd
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Let's
fix it as long as it is still possible before UAPI freezes on these helpers.
Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Change-Id: I67e5dde44023de422ec093935ce20647466db4b8
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on the discussion in [0], update the bpf_redirect_neigh() helper to
accept an optional parameter specifying the nexthop information. This makes
it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
incurring a duplicate FIB lookup - since the FIB lookup helper will return
the nexthop information even if no neighbour is present, this can simply
be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.
[0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/
Change-Id: I2c24b13263ccc6452023c6f76e635ab2114ce142
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
("bpf: Relax max_entries check for most of the inner map types") added support
for dynamic inner max elements for most map-in-map types. Exceptions were maps
like array or prog array where the map_gen_lookup() callback uses the maps'
max_entries field as a constant when emitting instructions.
We recently implemented Maglev consistent hashing into Cilium's load balancer
which uses map-in-map with an outer map being hash and inner being array holding
the Maglev backend table for each service. This has been designed this way in
order to reduce overall memory consumption given the outer hash map allows to
avoid preallocating a large, flat memory area for all services. Also, the
number of service mappings is not always known a-priori.
The use case for dynamic inner array map entries is to further reduce memory
overhead, for example, some services might just have a small number of back
ends while others could have a large number. Right now the Maglev backend table
for small and large number of backends would need to have the same inner array
map entries which adds a lot of unneeded overhead.
Dynamic inner array map entries can be realized by avoiding the inlined code
generation for their lookup. The lookup will still be efficient since it will
be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
inline code generation and relaxes array_map_meta_equal() check to ignore both
maps' max_entries. This also still allows to have faster lookups for map-in-map
when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.
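The relaxed comparison can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel's array_map_meta_equal(): struct map_meta, SKETCH_F_INNER_MAP and array_meta_equal_sketch() are hypothetical names, and the flag value is only illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit for this sketch; the real value is in the UAPI header. */
#define SKETCH_F_INNER_MAP (1U << 12)

/* Hypothetical subset of the metadata compared for an inner map. */
struct map_meta {
	uint32_t map_type;
	uint32_t key_size;
	uint32_t value_size;
	uint32_t map_flags;
	uint32_t max_entries;
};

/*
 * Compare an inner map against the template used when the outer map
 * was created. max_entries only has to match when the flag is absent.
 */
static bool array_meta_equal_sketch(const struct map_meta *tmpl,
				    const struct map_meta *m)
{
	if (tmpl->map_type != m->map_type ||
	    tmpl->key_size != m->key_size ||
	    tmpl->value_size != m->value_size ||
	    tmpl->map_flags != m->map_flags)
		return false;

	if (tmpl->map_flags & SKETCH_F_INNER_MAP)
		return true;

	return tmpl->max_entries == m->max_entries;
}
```

Under this model, two flagged arrays with 16 and 65536 entries are interchangeable as inner maps, while two unflagged arrays must have identical sizes.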
Example code generation where inner map is dynamic sized array:
# bpftool p d x i 125
int handle__sys_enter(void * ctx):
; int handle__sys_enter(void *ctx)
0: (b4) w1 = 0
; int key = 0;
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
;
3: (07) r2 += -4
; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
4: (18) r1 = map[id:468]
6: (07) r1 += 272
7: (61) r0 = *(u32 *)(r2 +0)
8: (35) if r0 >= 0x3 goto pc+5
9: (67) r0 <<= 3
10: (0f) r0 += r1
11: (79) r0 = *(u64 *)(r0 +0)
12: (15) if r0 == 0x0 goto pc+1
13: (05) goto pc+1
14: (b7) r0 = 0
15: (b4) w6 = -1
; if (!inner_map)
16: (15) if r0 == 0x0 goto pc+6
17: (bf) r2 = r10
;
18: (07) r2 += -4
; val = bpf_map_lookup_elem(inner_map, &key);
19: (bf) r1 = r0 | No inlining but instead
20: (85) call array_map_lookup_elem#149280 | call to array_map_lookup_elem()
; return val ? *val : -1; | for inner array lookup.
21: (15) if r0 == 0x0 goto pc+1
; return val ? *val : -1;
22: (61) r6 = *(u32 *)(r0 +0)
; }
23: (bc) w0 = w6
24: (95) exit
Change-Id: Ie3c14796539b261e3fb9433aa9ef46fe116294c4
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.
Pod to remote pod with redirect(), TCP_RR [1]:
# percpu_netperf 10.217.1.33
RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 )
MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 )
STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 )
MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 )
P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 )
P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 )
P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 )
TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 )
Pod to remote pod with redirect_peer(), TCP_RR:
# percpu_netperf 10.217.1.33
RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 )
MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 )
STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 )
MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 )
P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 )
P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 )
P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 )
TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 )
[0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
[1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
Change-Id: I17d75ffbb776ea4e36326b8fdd04b71441a1982b
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
helper always returns a valid pointer, therefore no need to check
returned value for NULL. Also note that all programs run with
preemption disabled, which means that the returned pointer is stable
during all the execution of the program.
Change-Id: Idd23453d28e221430cc067c6b6f812eab0ac738d
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
except that it may return NULL. This happens when the cpu parameter is
out of range. So the caller must check the returned value.
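The calling convention can be illustrated with a userspace model. This is a sketch only: per_cpu_ptr_sketch(), read_counter(), percpu_counter and SKETCH_NR_CPUS are hypothetical names, and a plain array stands in for a real percpu variable.

```c
#include <stddef.h>

#define SKETCH_NR_CPUS 4 /* hypothetical CPU count for this model */

/* A plain array stands in for a percpu variable. */
static long percpu_counter[SKETCH_NR_CPUS];

/* Model of the lookup: NULL when the cpu index is out of range. */
static long *per_cpu_ptr_sketch(long *base, unsigned int cpu)
{
	if (cpu >= SKETCH_NR_CPUS)
		return NULL;
	return &base[cpu];
}

/* The caller checks the result before dereferencing it. */
static long read_counter(unsigned int cpu)
{
	long *p = per_cpu_ptr_sketch(percpu_counter, cpu);

	return p ? *p : -1;
}
```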
Change-Id: I254209a7070abc34458020e4ee8ee73256e31886
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com
Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
ksym so that further dereferences on the ksym can use the BTF info
to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
the verifier reads the btf_id stored in the insn[0]'s imm field and
marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
and the mem_size is resolved to the size of the VAR's type.
From the VAR btf_id, the verifier can also read the address of the
ksym's corresponding kernel var from kallsyms and use that to fill
dst_reg.
Therefore, the proper functionality of pseudo_btf_id depends on (1)
kallsyms and (2) the encoding of kernel global VARs in pahole, which
should be available since pahole v1.18.
Change-Id: I8e81d1d9c2ed4c669e5f68fcf4246b9c1c705bb6
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com
Currently, perf event in perf event array is removed from the array when
the map fd used to add the event is closed. This behavior makes it
difficult to share perf events with a perf event array.
Introduce perf event map that keeps the perf event open with a new flag
BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not
removed when the original map fd is closed. Instead, the perf event will
stay in the map until 1) it is explicitly removed from the array; or 2)
the array is freed.
Change-Id: I7c0aaa5ae7431439acaaaaf7b5b689834b2313db
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com
Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.
This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.
Given that skb->sk neither gets orphaned when crossing the netns
as per 9c4c325252 ("skbuff: preserve sock reference when scrubbing
the skb.") the helper can then push the skbs directly to the phys
device where FQ scheduler can do its work and TCP stack gets proper
backpressure given we hold on to skb->sk as long as skb is still
residing in queues.
With the helper used in BPF data path to then push the skb to the
phys device, I observed a stable/consistent TCP_STREAM improvement
on veth devices for traffic going container -> host -> host ->
container from ~10Gbps to ~15Gbps for a single stream in my test
environment.
Change-Id: I9cca990db0d9091c889bf174a1de1c72e9a8d4de
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to 9c4c325252
("skbuff: preserve sock reference when scrubbing the skb."). This
is similar to bpf_skb_cgroup_id() which implements the same for v2.
Kubernetes ecosystem is still operating on v1 however, hence net_cls
needs to be used there until this can be dropped in with the v2
helper of bpf_skb_cgroup_id().
Change-Id: Iafb100978944174dd962d8409f229dfd0fd3780e
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net
This enables support for attaching freplace programs to multiple attach
points. It does this by amending the UAPI for bpf_link_create with a target
btf ID that can be used to supply the new attachment point along with the
target program fd. The target must be compatible with the target that was
supplied at program load time.
The implementation reuses the checks that were factored out of
check_attach_btf_id() to ensure compatibility between the BTF types of the
old and new attachment. If these match, a new bpf_tracing_link will be
created for the new attach target, allowing multiple attachments to
co-exist simultaneously.
The code could theoretically support multiple-attach of other types of
tracing programs as well, but since I don't have a use case for any of
those, there is no API support for doing so.
Change-Id: Ifeca634ba0c7ae9b1c5bc3c11a9d2b83fb0b5d23
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF. Its signature is
long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ief0f9b8b9d9ed5f725159d71d1f3eb26f28c27c1
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF). Its signature is
long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
struct btf_ptr * specifies
- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
are not to be confused with the BTF_F_* flags
below that control how the btf_ptr is displayed; the
flags member of the struct btf_ptr may be used to
disambiguate types in kernel versus module BTF, etc;
the main distinction is the flags relate to the type
and information needed in identifying it; not how it
is displayed.
For example a BPF program with a struct sk_buff *skb
could do the following:
static struct btf_ptr b = { };
b.ptr = skb;
b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);
Default output looks like this:
(struct sk_buff){
.transport_header = (__u16)65535,
.mac_header = (__u16)65535,
.end = (sk_buff_data_t)192,
.head = (unsigned char *)0x000000007524fd8b,
.data = (unsigned char *)0x000000007524fd8b,
.truesize = (unsigned int)768,
.users = (refcount_t){
.refs = (atomic_t){
.counter = (int)1,
},
},
}
Flags modifying display are as follows:
- BTF_F_COMPACT: no formatting around type information
- BTF_F_NONAME: no struct/union member names/types
- BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
equivalent to %px.
- BTF_F_ZERO: show zero-valued struct/union members;
they are not displayed by default
Change-Id: I77f2c2a0d41aee2f4f10e3288a36de475fd2cb46
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
the target program on a specific CPU. This is achieved by a new flag in
bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for
BPF programs that handle perf_event and other percpu resources, as the
program can access these resources locally.
Change-Id: Id8f992caff30d7b65df8195f8934bcb2d8b658cb
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
This patch changes the bpf_sk_assign() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
The bpf_sk_lookup_assign() is taking ARG_PTR_TO_SOCKET_"OR_NULL". Meaning
it specifically takes a literal NULL. ARG_PTR_TO_BTF_ID_SOCK_COMMON
does not allow a literal NULL, so another ARG type is required
for this purpose and another follow-up patch can be used if
there is such need.
Change-Id: Ia2747b017398c8dcc4454118dba03e71bdeba1dc
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200925000415.3857374-1-kafai@fb.com
This patch changes the bpf_tcp_*_syncookie() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
Change-Id: I4f7712acada428dab29b94f0cb9bbbb19d9eaea2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000409.3856725-1-kafai@fb.com
This patch changes the bpf_sk_storage_*() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
A micro benchmark has been done on a "cgroup_skb/egress" bpf program
which does a bpf_sk_storage_get(). It was driven by netperf doing
a 4096 connected UDP_STREAM test with 64bytes packet.
The stats from "kernel.bpf_stats_enabled" shows no meaningful difference.
The sk_storage_get_btf_proto, sk_storage_delete_btf_proto,
btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are
no longer needed, so they are removed.
Change-Id: Ifef1303f11952714d181a3c369e154a70a864fd0
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com
The previous patch allows the networking bpf prog to use the
bpf_skc_to_*() helpers to get a PTR_TO_BTF_ID socket pointer,
e.g. "struct tcp_sock *". It allows the bpf prog to read all the
fields of the tcp_sock.
This patch changes the bpf_sk_release() and bpf_sk_*cgroup_id()
to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will
work with the pointer returned by the bpf_skc_to_*() helpers
also. For example, the following will work:
sk = bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
if (!sk)
return;
tp = bpf_skc_to_tcp_sock(sk);
if (!tp) {
bpf_sk_release(sk);
return;
}
lsndtime = tp->lsndtime;
/* Pass tp to bpf_sk_release() will also work */
bpf_sk_release(tp);
Since PTR_TO_BTF_ID could be NULL, the helper taking
ARG_PTR_TO_BTF_ID_SOCK_COMMON has to check for NULL at runtime.
A btf_id of "struct sock" may not always mean a fullsock. Regardless of
whether the helper's running context may get a non-fullsock, the
fullsock check/handling is pretty cheap, so it is better to keep the
same verifier expectation: a helper that takes ARG_PTR_TO_BTF_ID*
will be able to handle the minisock situation. In the bpf_sk_*cgroup_id()
case, it will try to get a fullsock by using sk_to_full_sk() as its
skb variant bpf_sk"b"_*cgroup_id() has already been doing.
bpf_sk_release can already handle minisock, so nothing special has to
be done.
Change-Id: Ia9823f535d2a69b6ed4c16f599faa3f8c3ac3a42
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200925000356.3856047-1-kafai@fb.com
Fix a formatting error in the description of bpf_load_hdr_opt() (rst2man
complains about a wrong indentation, but what is missing is actually a
blank line before the bullet list).
Fix and harmonise the formatting for other helpers.
Change-Id: Ib42eedd569553e5bc7b730a6c598609546c9aca5
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200904161454.31135-3-quentin@isovalent.com
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.
The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.
There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.
When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();
Updates to trampoline now have to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.
This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.
Change-Id: I281471010348f21de4e0832dfc0265dfe85b4a0d
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
bpf_link_info.iter is used by link_query to return bpf_iter_link_info
to user space. Fields may be different, e.g., map_fd vs. map_id, so
we cannot reuse the exact structure. But make them similar, e.g.,
struct bpf_link_info {
/* common fields */
union {
struct { ... } raw_tracepoint;
struct { ... } tracing;
...
struct {
/* common fields for iter */
union {
struct {
__u32 map_id;
} map;
/* other structs for other targets */
};
};
};
};
so the structure is extensible the same way as bpf_iter_link_info.
Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators")
Change-Id: I5bb1481995ec2fa33ce79676b0cfb7594850bbd8
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200828051922.758950-1-yhs@fb.com
Adding d_path helper function that returns full path for
given 'struct path' object, which needs to be the kernel
BTF 'path' object. The path is returned in buffer provided
'buf' of size 'sz' and is zero terminated.
bpf_d_path(&file->f_path, buf, size);
The helper calls directly d_path function, so there's only
limited set of functions it can be called from. Adding just
very modest set for the start.
Updating also bpf.h tools uapi header and adding 'path' to
bpf_helpers_doc.py script.
Change-Id: If390ec6189a537b730b9ae595d374cbfa83f6f5b
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200825192124.710397-11-jolsa@kernel.org
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.
Since the UAPI helper signature for bpf_sk_storage expects a bpf_sock,
which leads to a compilation warning for LSM programs, it is also updated
to accept a void * pointer instead.
Change-Id: Ib0afe90a58aa586852d48d13c77bc91cbada9bd1
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
Similar to bpf_local_storage for sockets, add local storage for inodes.
The life-cycle of storage is managed with the life-cycle of the inode.
i.e. the storage is destroyed along with the owning inode.
The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the
security blobs, which are now stackable and can co-exist with other LSMs.
Change-Id: I078273c02202a6ee3837865b6d5b307989eb9bed
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org