version 4.19.325-cip124
* tag 'v4.19.325-cip124' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip:
CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree
Update localversion-st, tree is up-to-date with 5.4.298.
f2fs: fix to do sanity check on ino and xnid
squashfs: fix memory leak in squashfs_fill_super
pNFS: Handle RPC size limit for layoutcommits
wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect
usb: core: usb_submit_urb: downgrade type check
udf: Verify partition map count
f2fs: fix to avoid panic in f2fs_evict_inode
usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm
Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS"
net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions
HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
HID: asus: fix UAF via HID_CLAIMED_INPUT validation
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
sctp: initialize more fields in sctp_v6_from_sk()
net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
net/mlx5e: Set local Xoff after FW update
net: dlink: fix multicast stats being counted incorrectly
atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().
net/atm: remove the atmdev_ops {get, set}sockopt methods
Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced
powerpc/kvm: Fix ifdef to remove build warning
net: ipv4: fix regression in local-broadcast routes
vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()
scsi: core: sysfs: Correct sysfs attributes access rights
ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
alloc_fdtable(): change calling conventions.
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit
ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add
ALSA: usb-audio: Fix size validation in convert_chmap_v3()
scsi: qla4xxx: Prevent a potential error pointer dereference
usb: xhci: Fix slot_id resource race conflict
nfs: fix UAF in direct writes
NFS: Fix up commit deadlocks
Bluetooth: fix use-after-free in device_for_each_child()
selftests: forwarding: tc_actions.sh: add matchall mirror test
codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
sch_qfq: make qfq_qlen_notify() idempotent
sch_hfsc: make hfsc_qlen_notify() idempotent
sch_drr: make drr_qlen_notify() idempotent
btrfs: populate otime when logging an inode item
media: venus: hfi: explicitly release IRQ during teardown
f2fs: fix to avoid out-of-boundary access in dnode page
media: venus: protect against spurious interrupts during probe
media: venus: vdec: Clamp param smaller than 1fps and bigger than 240.
drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt()
media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free()
ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig
usb: musb: omap2430: fix device leak at unbind
NFS: Fix the setting of capabilities when automounting a new filesystem
NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode()
NFSv4: Fix nfs4_bitmap_copy_adjust()
usb: typec: fusb302: cache PD RX state
cdc-acm: fix race between initial clearing halt and open
USB: cdc-acm: do not log successful probe on later errors
nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm()
tracing: Add down_write(trace_event_sem) when adding trace event
usb: hub: Don't try to recover devices lost during warm reset.
usb: hub: avoid warm port reset during USB3 disconnect
x86/mce/amd: Add default names for MCA banks and blocks
iio: hid-sensor-prox: Fix incorrect OFFSET calculation
mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()
net: usbnet: Fix the wrong netif_carrier_on() call
net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value
kbuild: Add KBUILD_CPPFLAGS to as-option invocation
kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS
kbuild: Add CLANG_FLAGS to as-instr
mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation
kbuild: Update assembler calls to use proper flags and language target
ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS
usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
usb: storage: realtek_cr: Use correct byte order for bcs->Residue
USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
iio: proximity: isl29501: fix buffered read on big-endian systems
ftrace: Also allocate and copy hash for reading of filter files
fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
fs/buffer: fix use-after-free when call bh_read() helper
drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
media: venus: Add a check for packet size after reading from shared memory
media: ov2659: Fix memory leaks in ov2659_probe()
media: usbtv: Lock resolution while streaming
media: gspca: Add bounds checking to firmware parser
jbd2: prevent softlockup in jbd2_log_do_checkpoint()
PCI: endpoint: Fix configfs group removal on driver teardown
PCI: endpoint: Fix configfs group list head handling
mtd: rawnand: fsmc: Add missing check after DMA map
wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table()
zynq_fpga: use sgtable-based scatterlist wrappers
ata: libata-scsi: Fix ata_to_sense_error() status handling
ext4: fix reserved gdt blocks handling in fsmap
ext4: fix fsmap end of range reporting with bigalloc
ext4: check fast symlink for ea_inode correctly
Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
vt: defkeymap: Map keycodes above 127 to K_HOLE
usb: gadget: udc: renesas_usb3: fix device leak at unbind
usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init()
m68k: Fix lost column on framebuffer debug console
serial: 8250: fix panic due to PSLVERR
media: uvcvideo: Do not mark valid metadata as invalid
media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format()
btrfs: fix log tree replay failure due to file with 0 links and extents
thunderbolt: Fix copy+paste error in match_service_id()
misc: rtsx: usb: Ensure mmc child device is active when card is present
scsi: lpfc: Remove redundant assignment to avoid memory leak
rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe
pNFS: Fix uninited ptr deref in block/scsi layout
pNFS: Fix disk addr range check in block/scsi layout
pNFS: Fix stripe mapping in block/scsi layout
ipmi: Fix strcpy source and destination the same
kconfig: lxdialog: fix 'space' to (de)select options
kconfig: gconf: fix potential memory leak in renderer_edited()
kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed()
scsi: aacraid: Stop using PCI_IRQ_AFFINITY
scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans
kconfig: nconf: Ensure null termination where strncpy is used
kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
PCI: pnv_php: Work around switches with broken presence detection
media: uvcvideo: Fix bandwidth issue for Alcor camera
media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar
media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb()
media: usb: hdpvr: disable zero-length read messages
media: tc358743: Increase FIFO trigger level to 374
media: tc358743: Return an appropriate colorspace from tc358743_set_fmt
media: tc358743: Check I2C succeeded during probe
pinctrl: stm32: Manage irq affinity settings
scsi: mpt3sas: Correctly handle ATA device errors
RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
jfs: upper bound check of tree index in dbAllocAG
jfs: Regular file corruption check
jfs: truncate good inode pages when hard link is 0
scsi: bfa: Double-free fix
MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free}
watchdog: dw_wdt: Fix default timeout
fs/orangefs: use snprintf() instead of sprintf()
scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated
ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
vhost: fail early when __vhost_add_used() fails
uapi: in6: restore visibility of most IPv6 socket options
net: ncsi: Fix buffer overflow in fetching version id
net: dsa: b53: fix b53_imp_vlan_setup for BCM5325
net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
wifi: iwlegacy: Check rate_idx range after addition
netmem: fix skb_frag_address_safe with unreadable skbs
wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`.
wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd()
net: fec: allow disable coalescing
(powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer
s390/stp: Remove udelay from stp_sync_clock()
wifi: iwlwifi: mvm: fix scan request validation
net: thunderx: Fix format-truncation warning in bgx_acpi_match_id()
net: ipv4: fix incorrect MTU in broadcast routes
wifi: cfg80211: Fix interface type validation
et131x: Add missing check after DMA map
be2net: Use correct byte order and format string for TCP seq and ack_seq
s390/time: Use monotonic clock in get_cycles()
wifi: cfg80211: reject HTC bit for management frames
ktest.pl: Prevent recursion of default variable options
ASoC: codecs: rt5640: Retry DEVICE_ID verification
ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros
ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control
platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches
pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop()
ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4
ASoC: hdac_hdmi: Rate limit logging on connection and disconnection
mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode()
ACPI: processor: fix acpi_object initialization
PM: sleep: console: Fix the black screen issue
thermal: sysfs: Return ENODATA instead of EAGAIN for reads
selftests: tracing: Use mutex_unlock for testing glob filter
ARM: tegra: Use I/O memcpy to write to IRAM
gpio: tps65912: check the return value of regmap_update_bits()
ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed
cpufreq: Exit governor when failed to start old governor
usb: xhci: Avoid showing errors during surprise removal
usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command
usb: xhci: Avoid showing warnings for dying controller
selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
usb: xhci: print xhci->xhc_state when queue_command failed
securityfs: don't pin dentries twice, once is enough...
hfs: fix not erasing deleted b-tree node issue
drbd: add missing kref_get in handle_write_conflicts
arm64: Handle KCOV __init vs inline mismatches
hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
hfs: fix slab-out-of-bounds in hfs_bnode_read()
sctp: linearize cloned gso packets in sctp_rcv
netfilter: ctnetlink: fix refcount leak on table dump
udp: also consider secpath when evaluating ipsec use for checksumming
fs: Prevent file descriptor table allocations exceeding INT_MAX
sunvdc: Balance device refcount in vdc_port_mpgroup_check
NFSD: detect mismatch of file handle and delegation stateid in OPEN op
net: dpaa: fix device leak when querying time stamp info
net: gianfar: fix device leak when querying time stamp info
netlink: avoid infinite retry looping in netlink_unicast()
ALSA: usb-audio: Validate UAC3 cluster segment descriptors
ALSA: usb-audio: Validate UAC3 power domain descriptors, too
usb: gadget : fix use-after-free in composite_dev_cleanup()
MIPS: mm: tlb-r4k: Uniquify TLB entries on init
USB: serial: option: add Foxconn T99W709
vsock: Do not allow binding to VMADDR_PORT_ANY
net/packet: fix a race in packet_set_ring() and packet_notifier()
perf/core: Prevent VMA split of buffer mappings
perf/core: Exit early on perf_mmap() fail
perf/core: Don't leak AUX buffer refcount on allocation failure
pptp: fix pptp_xmit() error path
smb: client: let recv_done() cleanup before notifying the callers.
benet: fix BUG when creating VFs
ipv6: reject malicious packets in ipv6_gso_segment()
pptp: ensure minimal skb length in pptp_xmit()
netpoll: prevent hanging NAPI when netcons gets enabled
NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
pci/hotplug/pnv-php: Wrap warnings in macro
pci/hotplug/pnv-php: Improve error msg on power state change failure
usb: chipidea: udc: fix sleeping function called from invalid context
f2fs: fix to avoid out-of-boundary access in devs.path
f2fs: fix to avoid UAF in f2fs_sync_inode_meta()
rtc: pcf8563: fix incorrect maximum clock rate handling
rtc: hym8563: fix incorrect maximum clock rate handling
rtc: ds1307: fix incorrect maximum clock rate handling
mtd: rawnand: atmel: set pmecc data setup time
mtd: rawnand: atmel: Fix dma_mapping_error() address
jfs: fix metapage reference count leak in dbAllocCtl
fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
crypto: qat - fix seq_file position update in adf_ring_next()
dmaengine: nbpfaxi: Add missing check after DMA map
dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
fs/orangefs: Allow 2 more characters in do_c_string()
crypto: img-hash - Fix dma_unmap_sg() nents value
scsi: isci: Fix dma_unmap_sg() nents value
scsi: mvsas: Fix dma_unmap_sg() nents value
scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value
perf tests bp_account: Fix leaked file descriptor
crypto: ccp - Fix crash when rebind ccp device for ccp.ko
pinctrl: sunxi: Fix memory leak on krealloc failure
power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set
clk: davinci: Add NULL check in davinci_lpsc_clk_register()
mtd: fix possible integer overflow in erase_xfer()
crypto: marvell/cesa - Fix engine load inaccuracy
PCI: rockchip-host: Fix "Unexpected Completion" log message
vrf: Drop existing dst reference in vrf_ip6_input_dst
netfilter: xt_nfacct: don't assume acct name is null-terminated
can: kvaser_usb: Assign netdev.dev_port based on device channel index
wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE
Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
mwl8k: Add missing check after DMA map
wifi: rtl8xxxu: Fix RX skb size for aggregation disabled
net/sched: Restrict conditions for adding duplicating netems to qdisc tree
arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
netfilter: nf_tables: adjust lockdep assertions handling
drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value
m68k: Don't unregister boot console needlessly
tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range
iwlwifi: Add missing check for alloc_ordered_workqueue
wifi: iwlwifi: Fix memory leak in iwl_mvm_init()
wifi: rtl818x: Kill URBs before clearing tx status queue
caif: reduce stack size, again
staging: nvec: Fix incorrect null termination of battery manufacturer
samples: mei: Fix building on musl libc
usb: early: xhci-dbc: Fix early_ioremap leak
Revert "vmci: Prevent the dispatching of uninitialized payloads"
pps: fix poll support
vmci: Prevent the dispatching of uninitialized payloads
staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc()
ARM: dts: vfxxx: Correctly use two tuples for timer address
ASoC: ops: dynamically allocate struct snd_ctl_elem_value
hfsplus: remove mutex_lock check in hfsplus_free_extents
ASoC: Intel: fix SND_SOC_SOF dependencies
ethernet: intel: fix building with large NR_CPUS
usb: phy: mxs: disconnect line when USB charger is attached
usb: chipidea: udc: protect usb interrupt enable
usb: chipidea: udc: add new API ci_hdrc_gadget_connect
comedi: comedi_test: Fix possible deletion of uninitialized timers
nilfs2: reject invalid file types when reading inodes
i2c: qup: jump out of the loop in case of timeout
net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class
net: appletalk: Fix use-after-free in AARP proxy probe
net: appletalk: fix kerneldoc warnings
RDMA/core: Rate limit GID cache warning messages
usb: hub: fix detection of high tier USB3 devices behind suspended hubs
net_sched: sch_sfq: reject invalid perturb period
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net_sched: sch_sfq: don't allow 1 packet limit
net_sched: sch_sfq: handle bigger packets
net_sched: sch_sfq: annotate data-races around q->perturb_period
power: supply: bq24190_charger: Fix runtime PM imbalance on error
xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS
virtio-net: ensure the received length does not exceed allocated size
usb: dwc3: qcom: Don't leave BCR asserted
usb: musb: fix gadget state on disconnect
net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree
net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime
Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
Bluetooth: SMP: If an unallowed command is received consider it a failure
Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
usb: net: sierra: check for no status endpoint
net/sched: sch_qfq: Fix race condition on qfq_aggregate
net: emaclite: Fix missing pointer increment in aligned_read()
comedi: Fix use of uninitialized data in insn_rw_emulate_bits()
comedi: Fix some signed shift left operations
comedi: das6402: Fix bit shift out of bounds
comedi: das16m1: Fix bit shift out of bounds
comedi: aio_iiro_16: Fix bit shift out of bounds
comedi: pcl812: Fix bit shift out of bounds
iio: adc: max1363: Reorder mode_list[] entries
iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[]
soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled
soc: aspeed: lpc-snoop: Cleanup resources in stack-order
mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models
memstick: core: Zero initialize id_reg in h_memstick_read_dev_id()
isofs: Verify inode mode when loading from disk
dmaengine: nbpfaxi: Fix memory corruption in probe()
af_packet: fix soft lockup issue caused by tpacket_snd()
af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()
phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept()
HID: core: do not bypass hid_hw_raw_request
HID: core: ensure __hid_request reserves the report ID as the first byte
HID: core: ensure the allocated report buffer can contain the reserved report ID
pch_uart: Fix dma_sync_sg_for_device() nents value
Input: xpad - set correct controller type for Acer NGR200
i2c: stm32: fix the device used for the DMA map
usb: gadget: configfs: Fix OOB read on empty string write
USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI
USB: serial: option: add Foxconn T99W640
USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition
dma-mapping: add generic helpers for mapping sgtable objects
usb: renesas_usbhs: Flush the notify_hotplug_work
gpio: rcar: Use raw_spinlock to protect register access
Conflicts:
Makefile
fs/f2fs/inode.c
mm/zsmalloc.c
Change-Id: If00246b113234f4ee7e5bb72cffd5d6f195de087
commit ead7f9b8de65632ef8060b84b0c55049a33cfea1 upstream.
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.
When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:
1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
2: __wsum diff, bool pseudohdr)
3: {
4: if (skb->ip_summed != CHECKSUM_PARTIAL) {
5: csum_replace_by_diff(sum, diff);
6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
7: skb->csum = ~csum_sub(diff, skb->csum);
8: } else if (pseudohdr) {
9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
10: }
11: }
The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.
For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.
The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.
This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.
This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.
Fixes: 7d672345ed ("bpf: add generic bpf_csum_diff helper")
Change-Id: Ia07e6770587fff91588ba133a9efadab92372ed9
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Note: Fixed conflict due to unrelated comment change. ]
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c32e8f8bc33a5f4b113a630857e46634e3e143b upstream.
Allow to pass sk_lookup programs to PROG_TEST_RUN. User space
provides the full bpf_sk_lookup struct as context. Since the
context includes a socket pointer that can't be exposed
to user space we define that PROG_TEST_RUN returns the cookie
of the selected socket or zero in place of the socket pointer.
We don't support testing programs that select a reuseport socket,
since this would mean running another (unrelated) BPF program
from the sk_lookup test handler.
Change-Id: I7af748e3f11804e4e1ad0c532685f0c3dfaf4816
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the label says "for internal use only", then it doesn't belong
in the 'uapi' subtree.
Change-Id: Ia10de797ba5e5977870ebadb67a870405934e76e
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
[ Upstream commit 31557b3487b349464daf42bc4366153743c1e727 ]
A decade ago commit 6d08acd2d3 ("in6: fix conflict with glibc")
hid the definitions of IPV6 options, because GCC was complaining
about duplicates. The commit did not list the warnings seen, but
trying to recreate them now I think they are (building iproute2):
In file included from ./include/uapi/rdma/rdma_user_cm.h:39,
from rdma.h:16,
from res.h:9,
from res-ctx.c:7:
../include/uapi/linux/in6.h:171:9: warning: ‘IPV6_ADD_MEMBERSHIP’ redefined
171 | #define IPV6_ADD_MEMBERSHIP 20
| ^~~~~~~~~~~~~~~~~~~
In file included from /usr/include/netinet/in.h:37,
from rdma.h:13:
/usr/include/bits/in.h:233:10: note: this is the location of the previous definition
233 | # define IPV6_ADD_MEMBERSHIP IPV6_JOIN_GROUP
| ^~~~~~~~~~~~~~~~~~~
../include/uapi/linux/in6.h:172:9: warning: ‘IPV6_DROP_MEMBERSHIP’ redefined
172 | #define IPV6_DROP_MEMBERSHIP 21
| ^~~~~~~~~~~~~~~~~~~~
/usr/include/bits/in.h:234:10: note: this is the location of the previous definition
234 | # define IPV6_DROP_MEMBERSHIP IPV6_LEAVE_GROUP
| ^~~~~~~~~~~~~~~~~~~~
Compilers don't complain about redefinition if the defines
are identical, but here we have the kernel using the literal
value, and glibc using an indirection (defining to a name
of another define, with the same numerical value).
Problem is, the commit in question hid all the IPV6 socket
options, and glibc has a pretty sparse list. For instance
it lacks Flow Label related options. Willem called this out
in commit 3fb321fde22d ("selftests/net: ipv6 flowlabel"):
/* uapi/glibc weirdness may leave this undefined */
#ifndef IPV6_FLOWINFO
#define IPV6_FLOWINFO 11
#endif
More interestingly some applications (socat) use
a #ifdef IPV6_FLOWINFO to gate compilation of thier
rudimentary flow label support. (For added confusion
socat misspells it as IPV4_FLOWINFO in some places.)
Hide only the two defines we know glibc has a problem
with. If we discover more warnings we can hide more
but we should avoid covering the entire block of
defines for "IPV6 socket options".
Link: https://patch.msgid.link/20250609143933.1654417-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli@kernel.org>
version 4.19.325-cip123
* tag 'v4.19.325-cip123' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip: (2182 commits)
CIP: Bump version suffix to -cip123 after merge from cip/linux-4.19.y-st tree
Update localversion-st, tree is up-to-date with 5.4.296.
emulex/benet: Fix build by return mismatch in be_cmd_unlock()
net/sched: Abort __tc_modify_qdisc if parent class does not exist
mtk-sd: Prevent memory corruption from DMA map failure
mmc: mediatek: use data instead of mrq parameter from msdc_{un}prepare_data()
scsi: qla4xxx: Fix missing DMA mapping error in qla4xxx_alloc_pdu()
btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
VMCI: fix race between vmci_host_setup_notify and vmci_ctx_unset_notify
net: ipv6: Discard next-hop MTU less than minimum link MTU
Input: atkbd - do not skip atkbd_deactivate() when skipping ATKBD_CMD_GETID
HID: quirks: Add quirk for 2 Chicony Electronics HP 5MP Cameras
HID: Add IGNORE quirk for SMARTLINKTECHNOLOGY
vt: add missing notification when switching back to text mode
net: usb: qmi_wwan: add SIMCom 8230C composition
atm: idt77252: Add missing `dma_map_error()`
bnxt_en: Fix DCB ETS validation
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
net: appletalk: Fix device refcount leak in atrtr_create()
md/raid1: Fix stack memory use after return in raid1_reshape
...
Conflicts:
Documentation/devicetree/bindings/arm/shmobile.txt
Documentation/devicetree/bindings/ata/sata_rcar.txt
Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt
Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt
Documentation/devicetree/bindings/display/bridge/renesas,lvds.txt
Documentation/devicetree/bindings/display/renesas,du.txt
Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt
Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt
Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt
Documentation/devicetree/bindings/i2c/renesas,i2c.txt
Documentation/devicetree/bindings/i2c/renesas,iic.txt
Documentation/devicetree/bindings/interrupt-controller/renesas,irqc.txt
Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.txt
Documentation/devicetree/bindings/media/rcar_vin.txt
Documentation/devicetree/bindings/media/renesas,fcp.txt
Documentation/devicetree/bindings/media/renesas,rcar-csi2.txt
Documentation/devicetree/bindings/media/renesas,vsp1.txt
Documentation/devicetree/bindings/mmc/tmio_mmc.txt
Documentation/devicetree/bindings/net/can/rcar_can.txt
Documentation/devicetree/bindings/net/can/rcar_canfd.txt
Documentation/devicetree/bindings/net/renesas,ravb.txt
Documentation/devicetree/bindings/pci/rcar-pci.txt
Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb2.txt
Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt
Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt
Documentation/devicetree/bindings/power/renesas,rcar-sysc.txt
Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt
Documentation/devicetree/bindings/reset/renesas,rst.txt
Documentation/devicetree/bindings/serial/renesas,sci-serial.txt
Documentation/devicetree/bindings/sound/renesas,rsnd.txt
Documentation/devicetree/bindings/spi/sh-msiof.txt
Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.txt
Documentation/devicetree/bindings/thermal/rcar-thermal.txt
Documentation/devicetree/bindings/timer/renesas,cmt.txt
Documentation/devicetree/bindings/timer/renesas,tmu.txt
Documentation/devicetree/bindings/trivial-devices.txt
Documentation/devicetree/bindings/usb/renesas,usb3-peri.txt
Documentation/devicetree/bindings/usb/renesas,usbhs.txt
Documentation/devicetree/bindings/usb/usb-xhci.txt
Documentation/devicetree/bindings/vendor-prefixes.txt
Documentation/devicetree/bindings/watchdog/renesas,wdt.txt
drivers/clk/qcom/clk-alpha-pll.c
drivers/hid/hid-ids.h
drivers/irqchip/irq-gic-v3.c
drivers/mmc/host/sdhci.h
drivers/platform/x86/intel_cht_int33fe.c
drivers/usb/dwc3/core.c
drivers/usb/dwc3/gadget.c
drivers/usb/gadget/function/f_fs.c
drivers/usb/typec/hd3ss3220.c
drivers/usb/typec/mux.c
kernel/time/posix-timers.c
mm/oom_kill.c
Change-Id: I9d0df93b34a99e5a7071f60b3dfd2fb5d943e6c7
When async binder buffer got exhausted, some normal oneway transactions
will also be discarded and may cause system or application failures. By
that time, the binder debug information we dump may not be relevant to
the root cause. And this issue is difficult to debug if without the
backtrace of the thread sending spam.
This change will send BR_ONEWAY_SPAM_SUSPECT to userspace when oneway
spamming is detected, request to dump current backtrace. Oneway spamming
will be reported only once when exceeding the threshold (target process
dips below 80% of its oneway space, and current process is responsible
for either more than 50 transactions, or more than 50% of the oneway
space). And the detection will restart when the async buffer has
returned to a healthy state.
Acked-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Hang Lu <hangl@codeaurora.org>
Link: https://lore.kernel.org/r/1617961246-4502-3-git-send-email-hangl@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 181190340
Change-Id: Id3d2526099bc89f04d8ad3ad6e48141b2a8f2515
(cherry picked from commit a7dc1e6f99df59799ab0128d9c4e47bbeceb934d)
Signed-off-by: Hang Lu <hangl@codeaurora.org>
Convert the zsfold filesystem to the new internal mount API as the old one
will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Change-Id: Ia3da9e9fd2ef5214c4e7fb71228f8b90c15f9e71
Signed-off-by: David Howells <dhowells@redhat.com>
[ Upstream commit b8e3a87a627b575896e448021e5c2f8a3bc19931 ]
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.
This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.
It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.
Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()")
Change-Id: Iaaa9eecc1e9f9c4c7b4953e169725d1aa1808bc3
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7cb779a6867fea00b4209bcf6de2f178a743247d ]
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.
This is in the conflict with its UAPI doc which says the following:
"0 on success, or a negative error in case of failure."
Update the UAPI to reflect the fact that bpf_clone_redirect can
return positive error numbers, but don't explicitly define
their meaning.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: I5b81b2b05cb8369feceae99f9fddaaf6a0121dd1
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ee2a098851bfbe8bcdd964c0121f4246f00ff41e upstream.
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
Fixes: c195651e56 ("bpf: add bpf_get_stack helper")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com
Link: https://lore.kernel.org/bpf/20220314182042.71025-1-namhyung@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based-on-patch-by: Eugene Loh <eugene.loh@oracle.com>
Change-Id: If8b7c2f7705a2c5fc27208d472a4762b7525bfdd
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets
fix it as long as it is still possible before UAPI freezes on these helpers.
Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Change-Id: I67e5dde44023de422ec093935ce20647466db4b8
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on the discussion in [0], update the bpf_redirect_neigh() helper to
accept an optional parameter specifying the nexthop information. This makes
it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
incurring a duplicate FIB lookup - since the FIB lookup helper will return
the nexthop information even if no neighbour is present, this can simply
be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.
[0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/
Change-Id: I2c24b13263ccc6452023c6f76e635ab2114ce142
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
("bpf: Relax max_entries check for most of the inner map types") added support
for dynamic inner max elements for most map-in-map types. Exceptions were maps
like array or prog array where the map_gen_lookup() callback uses the maps'
max_entries field as a constant when emitting instructions.
We recently implemented Maglev consistent hashing into Cilium's load balancer
which uses map-in-map with an outer map being hash and inner being array holding
the Maglev backend table for each service. This has been designed this way in
order to reduce overall memory consumption given the outer hash map allows to
avoid preallocating a large, flat memory area for all services. Also, the
number of service mappings is not always known a-priori.
The use case for dynamic inner array map entries is to further reduce memory
overhead, for example, some services might just have a small number of back
ends while others could have a large number. Right now the Maglev backend table
for small and large number of backends would need to have the same inner array
map entries which adds a lot of unneeded overhead.
Dynamic inner array map entries can be realized by avoiding the inlined code
generation for their lookup. The lookup will still be efficient since it will
be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
inline code generation and relaxes array_map_meta_equal() check to ignore both
maps' max_entries. This also still allows to have faster lookups for map-in-map
when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.
Example code generation where inner map is dynamic sized array:
# bpftool p d x i 125
int handle__sys_enter(void * ctx):
; int handle__sys_enter(void *ctx)
0: (b4) w1 = 0
; int key = 0;
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
;
3: (07) r2 += -4
; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
4: (18) r1 = map[id:468]
6: (07) r1 += 272
7: (61) r0 = *(u32 *)(r2 +0)
8: (35) if r0 >= 0x3 goto pc+5
9: (67) r0 <<= 3
10: (0f) r0 += r1
11: (79) r0 = *(u64 *)(r0 +0)
12: (15) if r0 == 0x0 goto pc+1
13: (05) goto pc+1
14: (b7) r0 = 0
15: (b4) w6 = -1
; if (!inner_map)
16: (15) if r0 == 0x0 goto pc+6
17: (bf) r2 = r10
;
18: (07) r2 += -4
; val = bpf_map_lookup_elem(inner_map, &key);
19: (bf) r1 = r0 | No inlining but instead
20: (85) call array_map_lookup_elem#149280 | call to array_map_lookup_elem()
; return val ? *val : -1; | for inner array lookup.
21: (15) if r0 == 0x0 goto pc+1
; return val ? *val : -1;
22: (61) r6 = *(u32 *)(r0 +0)
; }
23: (bc) w0 = w6
24: (95) exit
Change-Id: Ie3c14796539b261e3fb9433aa9ef46fe116294c4
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.
Pod to remote pod with redirect(), TCP_RR [1]:
# percpu_netperf 10.217.1.33
RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 )
MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 )
STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 )
MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 )
P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 )
P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 )
P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 )
TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 )
Pod to remote pod with redirect_peer(), TCP_RR:
# percpu_netperf 10.217.1.33
RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 )
MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 )
STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 )
MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 )
P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 )
P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 )
P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 )
TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 )
[0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
[1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
Change-Id: I17d75ffbb776ea4e36326b8fdd04b71441a1982b
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
helper always returns a valid pointer, therefore no need to check
returned value for NULL. Also note that all programs run with
preemption disabled, which means that the returned pointer is stable
during all the execution of the program.
Change-Id: Idd23453d28e221430cc067c6b6f812eab0ac738d
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
except that it may return NULL. This happens when the cpu parameter is
out of range. So the caller must check the returned value.
Change-Id: I254209a7070abc34458020e4ee8ee73256e31886
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com
Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
ksym so that further dereferences on the ksym can use the BTF info
to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
the verifier reads the btf_id stored in the insn[0]'s imm field and
marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
which is encoded in btf_vminux by pahole. If the VAR is not of a struct
type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
and the mem_size is resolved to the size of the VAR's type.
>From the VAR btf_id, the verifier can also read the address of the
ksym's corresponding kernel var from kallsyms and use that to fill
dst_reg.
Therefore, the proper functionality of pseudo_btf_id depends on (1)
kallsyms and (2) the encoding of kernel global VARs in pahole, which
should be available since pahole v1.18.
Change-Id: I8e81d1d9c2ed4c669e5f68fcf4246b9c1c705bb6
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com
Currently, perf event in perf event array is removed from the array when
the map fd used to add the event is closed. This behavior makes it
difficult to the share perf events with perf event array.
Introduce perf event map that keeps the perf event open with a new flag
BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not
removed when the original map fd is closed. Instead, the perf event will
stay in the map until 1) it is explicitly removed from the array; or 2)
the array is freed.
Change-Id: I7c0aaa5ae7431439acaaaaf7b5b689834b2313db
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com
Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.
This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.
Given that skb->sk neither gets orphaned when crossing the netns
as per 9c4c325252 ("skbuff: preserve sock reference when scrubbing
the skb.") the helper can then push the skbs directly to the phys
device where FQ scheduler can do its work and TCP stack gets proper
backpressure given we hold on to skb->sk as long as skb is still
residing in queues.
With the helper used in BPF data path to then push the skb to the
phys device, I observed a stable/consistent TCP_STREAM improvement
on veth devices for traffic going container -> host -> host ->
container from ~10Gbps to ~15Gbps for a single stream in my test
environment.
Change-Id: I9cca990db0d9091c889bf174a1de1c72e9a8d4de
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to 9c4c325252
("skbuff: preserve sock reference when scrubbing the skb."). This
is similar to bpf_skb_cgroup_id() which implements the same for v2.
Kubernetes ecosystem is still operating on v1 however, hence net_cls
needs to be used there until this can be dropped in with the v2
helper of bpf_skb_cgroup_id().
Change-Id: Iafb100978944174dd962d8409f229dfd0fd3780e
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net
This enables support for attaching freplace programs to multiple attach
points. It does this by amending the UAPI for bpf_link_Create with a target
btf ID that can be used to supply the new attachment point along with the
target program fd. The target must be compatible with the target that was
supplied at program load time.
The implementation reuses the checks that were factored out of
check_attach_btf_id() to ensure compatibility between the BTF types of the
old and new attachment. If these match, a new bpf_tracing_link will be
created for the new attach target, allowing multiple attachments to
co-exist simultaneously.
The code could theoretically support multiple-attach of other types of
tracing programs as well, but since I don't have a use case for any of
those, there is no API support for doing so.
Change-Id: Ifeca634ba0c7ae9b1c5bc3c11a9d2b83fb0b5d23
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF. Its signature is
long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ief0f9b8b9d9ed5f725159d71d1f3eb26f28c27c1
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF). Its signature is
long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
struct btf_ptr * specifies
- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
are not to be confused with the BTF_F_* flags
below that control how the btf_ptr is displayed; the
flags member of the struct btf_ptr may be used to
disambiguate types in kernel versus module BTF, etc;
the main distinction is the flags relate to the type
and information needed in identifying it; not how it
is displayed.
For example a BPF program with a struct sk_buff *skb
could do the following:
static struct btf_ptr b = { };
b.ptr = skb;
b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0, 0);
Default output looks like this:
(struct sk_buff){
.transport_header = (__u16)65535,
.mac_header = (__u16)65535,
.end = (sk_buff_data_t)192,
.head = (unsigned char *)0x000000007524fd8b,
.data = (unsigned char *)0x000000007524fd8b,
.truesize = (unsigned int)768,
.users = (refcount_t){
.refs = (atomic_t){
.counter = (int)1,
},
},
}
Flags modifying display are as follows:
- BTF_F_COMPACT: no formatting around type information
- BTF_F_NONAME: no struct/union member names/types
- BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
equivalent to %px.
- BTF_F_ZERO: show zero-valued struct/union members;
they are not displayed by default
Change-Id: I77f2c2a0d41aee2f4f10e3288a36de475fd2cb46
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
the target program on a specific CPU. This is achieved by a new flag in
bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for
BPF programs that handle perf_event and other percpu resources, as the
program can access these resource locally.
Change-Id: Id8f992caff30d7b65df8195f8934bcb2d8b658cb
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
This patch changes the bpf_sk_assign() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
The bpf_sk_lookup_assign() is taking ARG_PTR_TO_SOCKET_"OR_NULL". Meaning
it specifically takes a literal NULL. ARG_PTR_TO_BTF_ID_SOCK_COMMON
does not allow a literal NULL, so another ARG type is required
for this purpose and another follow-up patch can be used if
there is such need.
Change-Id: Ia2747b017398c8dcc4454118dba03e71bdeba1dc
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200925000415.3857374-1-kafai@fb.com
This patch changes the bpf_tcp_*_syncookie() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
Change-Id: I4f7712acada428dab29b94f0cb9bbbb19d9eaea2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000409.3856725-1-kafai@fb.com
This patch changes the bpf_sk_storage_*() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
A micro benchmark has been done on a "cgroup_skb/egress" bpf program
which does a bpf_sk_storage_get(). It was driven by netperf doing
a 4096 connected UDP_STREAM test with 64bytes packet.
The stats from "kernel.bpf_stats_enabled" shows no meaningful difference.
The sk_storage_get_btf_proto, sk_storage_delete_btf_proto,
btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are
no longer needed, so they are removed.
Change-Id: Ifef1303f11952714d181a3c369e154a70a864fd0
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com
The previous patch allows the networking bpf prog to use the
bpf_skc_to_*() helpers to get a PTR_TO_BTF_ID socket pointer,
e.g. "struct tcp_sock *". It allows the bpf prog to read all the
fields of the tcp_sock.
This patch changes the bpf_sk_release() and bpf_sk_*cgroup_id()
to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will
work with the pointer returned by the bpf_skc_to_*() helpers
also. For example, the following will work:
sk = bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
if (!sk)
return;
tp = bpf_skc_to_tcp_sock(sk);
if (!tp) {
bpf_sk_release(sk);
return;
}
lsndtime = tp->lsndtime;
/* Pass tp to bpf_sk_release() will also work */
bpf_sk_release(tp);
Since PTR_TO_BTF_ID could be NULL, the helper taking
ARG_PTR_TO_BTF_ID_SOCK_COMMON has to check for NULL at runtime.
A btf_id of "struct sock" may not always mean a fullsock. Regardless
the helper's running context may get a non-fullsock or not,
considering fullsock check/handling is pretty cheap, it is better to
keep the same verifier expectation on helper that takes ARG_PTR_TO_BTF_ID*
will be able to handle the minisock situation. In the bpf_sk_*cgroup_id()
case, it will try to get a fullsock by using sk_to_full_sk() as its
skb variant bpf_sk"b"_*cgroup_id() has already been doing.
bpf_sk_release can already handle minisock, so nothing special has to
be done.
Change-Id: Ia9823f535d2a69b6ed4c16f599faa3f8c3ac3a42
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200925000356.3856047-1-kafai@fb.com
Fix a formatting error in the description of bpf_load_hdr_opt() (rst2man
complains about a wrong indentation, but what is missing is actually a
blank line before the bullet list).
Fix and harmonise the formatting for other helpers.
Change-Id: Ib42eedd569553e5bc7b730a6c598609546c9aca5
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200904161454.31135-3-quentin@isovalent.com
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.
The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.
There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.
When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();
Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.
This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.
Change-Id: I281471010348f21de4e0832dfc0265dfe85b4a0d
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
bpf_link_info.iter is used by link_query to return bpf_iter_link_info
to user space. Fields may be different, e.g., map_fd vs. map_id, so
we cannot reuse the exact structure. But make them similar, e.g.,
struct bpf_link_info {
/* common fields */
union {
struct { ... } raw_tracepoint;
struct { ... } tracing;
...
struct {
/* common fields for iter */
union {
struct {
__u32 map_id;
} map;
/* other structs for other targets */
};
};
};
};
so the structure is extensible the same way as bpf_iter_link_info.
Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators")
Change-Id: I5bb1481995ec2fa33ce79676b0cfb7594850bbd8
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200828051922.758950-1-yhs@fb.com
Adding d_path helper function that returns full path for
given 'struct path' object, which needs to be the kernel
BTF 'path' object. The path is returned in buffer provided
'buf' of size 'sz' and is zero terminated.
bpf_d_path(&file->f_path, buf, size);
The helper calls directly d_path function, so there's only
limited set of function it can be called from. Adding just
very modest set for the start.
Updating also bpf.h tools uapi header and adding 'path' to
bpf_helpers_doc.py script.
Change-Id: If390ec6189a537b730b9ae595d374cbfa83f6f5b
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200825192124.710397-11-jolsa@kernel.org
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.
Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.
Change-Id: Ib0afe90a58aa586852d48d13c77bc91cbada9bd1
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
Similar to bpf_local_storage for sockets, add local storage for inodes.
The life-cycle of storage is managed with the life-cycle of the inode.
i.e. the storage is destroyed along with the owning inode.
The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the
security blob which are now stackable and can co-exist with other LSMs.
Change-Id: I078273c02202a6ee3837865b6d5b307989eb9bed
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org
Refactor the functionality in bpf_sk_storage.c so that concept of
storage linked to kernel objects can be extended to other objects like
inode, task_struct etc.
Each new local storage will still be a separate map and provide its own
set of helpers. This allows for future object specific extensions and
still share a lot of the underlying implementation.
This includes the changes suggested by Martin in:
https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/
adding new map operations to support bpf_local_storage maps:
* storages for different kernel objects to optionally have different
memory charging strategy (map_local_storage_charge,
map_local_storage_uncharge)
* Functionality to extract the storage pointer from a pointer to the
owning object (map_owner_storage_ptr)
Co-developed-by: Martin KaFai Lau <kafai@fb.com>
Change-Id: I02da579d09e497018432efed3a876e384d0cd561
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org
This patch is adapted from Eric's patch in an earlier discussion [1].
The TCP_SAVE_SYN currently only stores the network header and
tcp header. This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.
It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option. Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".
The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.
[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
Suggested-by: Eric Dumazet <edumazet@google.com>
Change-Id: Id12b23200ccba9017f1792fda9c8b6aec2998d38
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.
The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.
For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].
This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.
Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.
By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.
Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.
The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.
sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.
The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.
Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.
3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.
* Passive side
When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.
Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]
When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.
The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.
There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.
The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).
The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.
Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.
* Fastopen
Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.
* Syncookie
For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.
* Active side
The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().
* Turn off header CB flags after 3WHS
If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.
[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
Change-Id: If5c8cad06573cc8f9712fb0b0761b10df24d180d
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog. This will be the only SYN packet available to the bpf
prog during syncookie. For other regular cases, the bpf prog can
also use the saved_syn.
When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes. It is done by the new
bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly. 4 byte alignment will
be included in the "*remaining" before this function returns. The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.
Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options. The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).
The bpf prog is the last one getting a chance to reserve header space
and writing the header option.
These two functions are half implemented to highlight the changes in
TCP stack. The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Change-Id: If81bbf6ba9cd4a9ee363a447e780e9b0833ee92a
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.
For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.
Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.
When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is unknown
option in the TCP header.
When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP header.
This function is half implemented to highlight the changes in
TCP stack. The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Change-Id: Ia70614c4388ce1eab849e426140f7c66bf68e784
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection. It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).
A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
Change-Id: I32e313838af527d12b7e5f16036481bcaefa1e7b
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com