msm-5.15

Author	SHA1	Message	Date
Greg Kroah-Hartman	ac2a7a141f	Merge 5.15.79 into android13-5.15-lts Changes in 5.15.79 thunderbolt: Tear down existing tunnels when resuming from hibernate thunderbolt: Add DP OUT resource when DP tunnel is discovered fuse: fix readdir cache race drm/amdkfd: avoid recursive lock in migrations back to RAM drm/amdkfd: handle CPU fault on COW mapping drm/amdkfd: Fix NULL pointer dereference in svm_migrate_to_ram() hwspinlock: qcom: correct MMIO max register for newer SoCs phy: stm32: fix an error code in probe wifi: cfg80211: silence a sparse RCU warning wifi: cfg80211: fix memory leak in query_regdb_file() soundwire: qcom: reinit broadcast completion soundwire: qcom: check for outanding writes before doing a read bpf, verifier: Fix memory leak in array reallocation for stack state bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues wifi: mac80211: Set TWT Information Frame Disabled bit as 1 bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE HID: hyperv: fix possible memory leak in mousevsc_probe() bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues bpf: Fix sockmap calling sleepable function in teardown path bpf, sock_map: Move cancel_work_sync() out of sock lock bpf: Add helper macro bpf_for_each_reg_in_vstate bpf: Fix wrong reg type conversion in release_reference() net: gso: fix panic on frag_list with mixed head alloc types macsec: delete new rxsc when offload fails macsec: fix secy->n_rx_sc accounting macsec: fix detection of RXSCs when toggling offloading macsec: clear encryption keys from the stack after setting up offload octeontx2-pf: Use hardware register for CQE count octeontx2-pf: NIX TX overwrites SQ_CTX_HW_S[SQ_INT] net: tun: Fix memory leaks of napi_get_frags bnxt_en: Fix possible crash in bnxt_hwrm_set_coal() bnxt_en: fix potentially incorrect return value for ndo_rx_flow_steer net: fman: Unregister ethernet device on removal capabilities: fix undefined behavior in bit shift for CAP_TO_MASK phy: ralink: mt7621-pci: add sentinel to quirks table KVM: s390: pv: don't allow userspace to set the clock under PV net: lapbether: fix issue of dev reference count leakage in lapbeth_device_event() hamradio: fix issue of dev reference count leakage in bpq_device_event() net: wwan: iosm: fix memory leak in ipc_wwan_dellink net: wwan: mhi: fix memory leak in mhi_mbim_dellink drm/vc4: Fix missing platform_unregister_drivers() call in vc4_drm_register() tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent ipv6: addrlabel: fix infoleak when sending struct ifaddrlblmsg to network can: af_can: fix NULL pointer dereference in can_rx_register() net: stmmac: dwmac-meson8b: fix meson8b_devm_clk_prepare_enable() net: broadcom: Fix BCMGENET Kconfig tipc: fix the msg->req tlv len check in tipc_nl_compat_name_table_dump_header dmaengine: pxa_dma: use platform_get_irq_optional dmaengine: mv_xor_v2: Fix a resource leak in mv_xor_v2_remove() dmaengine: ti: k3-udma-glue: fix memory leak when register device fail net: lapbether: fix issue of invalid opcode in lapbeth_open() drivers: net: xgene: disable napi when register irq failed in xgene_enet_open() perf stat: Fix printing os->prefix in CSV metrics output perf tools: Add the include/perf/ directory to .gitignore netfilter: nfnetlink: fix potential dead lock in nfnetlink_rcv_msg() netfilter: Cleanup nft_net->module_list from nf_tables_exit_net() net: marvell: prestera: fix memory leak in prestera_rxtx_switch_init() net: nixge: disable napi when enable interrupts failed in nixge_open() net: wwan: iosm: fix memory leak in ipc_pcie_read_bios_cfg net/mlx5: Bridge, verify LAG state when adding bond to bridge net/mlx5: Allow async trigger completion execution on single CPU systems net/mlx5e: E-Switch, Fix comparing termination table instance net: cpsw: disable napi in cpsw_ndo_open() net: cxgb3_main: disable napi when bind qsets failed in cxgb_up() stmmac: intel: Enable 2.5Gbps for Intel AlderLake-S stmmac: intel: Update PCH PTP clock rate from 200MHz to 204.8MHz mctp: Fix an error handling path in mctp_init() cxgb4vf: shut down the adapter when t4vf_update_port_info() failed in cxgb4vf_open() stmmac: dwmac-loongson: fix missing pci_disable_msi() while module exiting stmmac: dwmac-loongson: fix missing pci_disable_device() in loongson_dwmac_probe() stmmac: dwmac-loongson: fix missing of_node_put() while module exiting net: phy: mscc: macsec: clear encryption keys when freeing a flow net: atlantic: macsec: clear encryption keys from the stack ethernet: s2io: disable napi when start nic failed in s2io_card_up() net: mv643xx_eth: disable napi when init rxq or txq failed in mv643xx_eth_open() ethernet: tundra: free irq when alloc ring failed in tsi108_open() net: macvlan: fix memory leaks of macvlan_common_newlink riscv: process: fix kernel info leakage riscv: vdso: fix build with llvm riscv: fix reserved memory setup arm64: efi: Fix handling of misaligned runtime regions and drop warning MIPS: jump_label: Fix compat branch range check mmc: cqhci: Provide helper for resetting both SDHCI and CQHCI mmc: sdhci-of-arasan: Fix SDHCI_RESET_ALL for CQHCI mmc: sdhci_am654: Fix SDHCI_RESET_ALL for CQHCI mmc: sdhci-tegra: Fix SDHCI_RESET_ALL for CQHCI mmc: sdhci-esdhc-imx: use the correct host caps for MMC_CAP_8_BIT_DATA ALSA: hda/hdmi - enable runtime pm for more AMD display audio ALSA: hda/ca0132: add quirk for EVGA Z390 DARK ALSA: hda: fix potential memleak in 'add_widget_node' ALSA: hda/realtek: Add Positivo C6300 model quirk ALSA: usb-audio: Yet more regression for for the delayed card registration ALSA: usb-audio: Add quirk entry for M-Audio Micro ALSA: usb-audio: Add DSD support for Accuphase DAC-60 vmlinux.lds.h: Fix placement of '.data..decrypted' section ata: libata-scsi: fix SYNCHRONIZE CACHE (16) command failure nilfs2: fix deadlock in nilfs_count_free_blocks() nilfs2: fix use-after-free bug of ns_writer on remount drm/i915/dmabuf: fix sg_table handling in map_dma_buf drm/amdgpu: disable BACO on special BEIGE_GOBY card platform/x86: hp_wmi: Fix rfkill causing soft blocked wifi wifi: ath11k: avoid deadlock during regulatory update in ath11k_regd_update() btrfs: fix match incorrectly in dev_args_match_device btrfs: selftests: fix wrong error check in btrfs_free_dummy_root() btrfs: zoned: initialize device's zone info for seeding mms: sdhci-esdhc-imx: Fix SDHCI_RESET_ALL for CQHCI udf: Fix a slab-out-of-bounds write bug in udf_find_entry() mm/damon/dbgfs: check if rm_contexts input is for a real context mm/memremap.c: map FS_DAX device memory as decrypted mm/shmem: use page_mapping() to detect page cache for uffd continue can: j1939: j1939_send_one(): fix missing CAN header initialization cert host tools: Stop complaining about deprecated OpenSSL functions dmaengine: at_hdmac: Fix at_lli struct definition dmaengine: at_hdmac: Don't start transactions at tx_submit level dmaengine: at_hdmac: Start transfer for cyclic channels in issue_pending dmaengine: at_hdmac: Fix premature completion of desc in issue_pending dmaengine: at_hdmac: Do not call the complete callback on device_terminate_all dmaengine: at_hdmac: Protect atchan->status with the channel lock dmaengine: at_hdmac: Fix concurrency problems by removing atc_complete_all() dmaengine: at_hdmac: Fix concurrency over descriptor dmaengine: at_hdmac: Free the memset buf without holding the chan lock dmaengine: at_hdmac: Fix concurrency over the active list dmaengine: at_hdmac: Fix descriptor handling when issuing it to hardware dmaengine: at_hdmac: Fix completion of unissued descriptor in case of errors dmaengine: at_hdmac: Don't allow CPU to reorder channel enable dmaengine: at_hdmac: Fix impossible condition dmaengine: at_hdmac: Check return code of dma_async_device_register marvell: octeontx2: build error: unknown type name 'u64' drm/amdkfd: Migrate in CPU page fault use current mm net: tun: call napi_schedule_prep() to ensure we own a napi x86/cpu: Restore AMD's DE_CFG MSR after resume Linux 5.15.79 Change-Id: I6f77aa724b7aa43abcef3444af951c7c62d46303 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2022-12-15 07:25:07 +00:00
Greg Kroah-Hartman	2d19e77e73	Revert "bpf: Fix reference state management for synchronous callbacks" This reverts commit `4ed5155043` which is commit 9d9d00ac29d0ef7ce426964de46fa6b380357d0a upstream. It breaks the Android kernel ABI and is not needed for normal Android systems so it is safe to remove. If it is needed in the future, it can be brought back in an abi-safe way. Bug: 161946584 Change-Id: Idc9e43f0bc47de086cefd5cd572ce8295bbb7b08 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2022-11-28 16:45:42 +00:00
Greg Kroah-Hartman	b049ff121c	Merge 5.15.75 into android13-5.15-lts Changes in 5.15.75 Revert "fs: check FMODE_LSEEK to control internal pipe splicing" ALSA: oss: Fix potential deadlock at unregistration ALSA: rawmidi: Drop register_mutex in snd_rawmidi_free() ALSA: usb-audio: Fix potential memory leaks ALSA: usb-audio: Fix NULL dererence at error path ALSA: hda/realtek: remove ALC289_FIXUP_DUAL_SPK for Dell 5530 ALSA: hda/realtek: Correct pin configs for ASUS G533Z ALSA: hda/realtek: Add quirk for ASUS GV601R laptop ALSA: hda/realtek: Add Intel Reference SSID to support headset keys mtd: rawnand: atmel: Unmap streaming DMA mappings io_uring/net: don't update msg_name if not provided hv_netvsc: Fix race between VF offering and VF association message from host cifs: destage dirty pages before re-reading them for cache=none cifs: Fix the error length of VALIDATE_NEGOTIATE_INFO message iio: dac: ad5593r: Fix i2c read protocol requirements iio: ltc2497: Fix reading conversion results iio: adc: ad7923: fix channel readings for some variants iio: pressure: dps310: Refactor startup procedure iio: pressure: dps310: Reset chip after timeout xhci: dbc: Fix memory leak in xhci_alloc_dbc() usb: add quirks for Lenovo OneLink+ Dock can: kvaser_usb: Fix use of uninitialized completion can: kvaser_usb_leaf: Fix overread with an invalid command can: kvaser_usb_leaf: Fix TX queue out of sync after restart can: kvaser_usb_leaf: Fix CAN state after restart mmc: sdhci-sprd: Fix minimum clock limit i2c: designware: Fix handling of real but unexpected device interrupts fs: dlm: fix race between test_bit() and queue_work() fs: dlm: handle -EBUSY first in lock arg validation HID: multitouch: Add memory barriers quota: Check next/prev free block number after reading from quota file platform/chrome: cros_ec_proto: Update version on GET_NEXT_EVENT failure ASoC: wcd9335: fix order of Slimbus unprepare/disable ASoC: wcd934x: fix order of Slimbus unprepare/disable hwmon: (gsc-hwmon) Call of_node_get() before of_find_xxx API net: thunderbolt: Enable DMA paths only after rings are enabled regulator: qcom_rpm: Fix circular deferral regression arm64: topology: move store_cpu_topology() to shared code riscv: topology: fix default topology reporting RISC-V: Make port I/O string accessors actually work parisc: fbdev/stifb: Align graphics memory size to 4MB riscv: Allow PROT_WRITE-only mmap() riscv: Make VM_WRITE imply VM_READ riscv: always honor the CONFIG_CMDLINE_FORCE when parsing dtb riscv: Pass -mno-relax only on lld < 15.0.0 UM: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK nvmem: core: Fix memleak in nvmem_register() nvme-multipath: fix possible hang in live ns resize with ANA access nvme-pci: set min_align_mask before calculating max_hw_sectors Revert "drm/amdgpu: use dirty framebuffer helper" dmaengine: mxs: use platform_driver_register drm/virtio: Check whether transferred 2D BO is shmem drm/virtio: Unlock reservations on virtio_gpu_object_shmem_init() error drm/virtio: Use appropriate atomic state in virtio_gpu_plane_cleanup_fb() drm/udl: Restore display mode on resume arm64: errata: Add Cortex-A55 to the repeat tlbi list mm/damon: validate if the pmd entry is present before accessing mm/mmap: undo ->mmap() when arch_validate_flags() fails xen/gntdev: Prevent leaking grants xen/gntdev: Accommodate VMA splitting PCI: Sanitise firmware BAR assignments behind a PCI-PCI bridge serial: 8250: Let drivers request full 16550A feature probing serial: 8250: Request full 16550A feature probing for OxSemi PCIe devices NFSD: Protect against send buffer overflow in NFSv3 READDIR NFSD: Protect against send buffer overflow in NFSv2 READ NFSD: Protect against send buffer overflow in NFSv3 READ powercap: intel_rapl: Use standard Energy Unit for SPR Dram RAPL domain powerpc/boot: Explicitly disable usage of SPE instructions slimbus: qcom-ngd: use correct error in message of pdr_add_lookup() failure slimbus: qcom-ngd: cleanup in probe error path scsi: qedf: Populate sysfs attributes for vport gpio: rockchip: request GPIO mux to pinctrl when setting direction pinctrl: rockchip: add pinmux_ops.gpio_set_direction callback fbdev: smscufx: Fix use-after-free in ufx_ops_open() ksmbd: fix endless loop when encryption for response fails ksmbd: Fix wrong return value and message length check in smb2_ioctl() ksmbd: Fix user namespace mapping fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE btrfs: fix race between quota enable and quota rescan ioctl btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer f2fs: complete checkpoints during remount f2fs: flush pending checkpoints when freezing super f2fs: increase the limit for reserve_root f2fs: fix to do sanity check on destination blkaddr during recovery f2fs: fix to do sanity check on summary info hardening: Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO hardening: Remove Clang's enable flag for -ftrivial-auto-var-init=zero jbd2: wake up journal waiters in FIFO order, not LIFO jbd2: fix potential buffer head reference count leak jbd2: fix potential use-after-free in jbd2_fc_wait_bufs jbd2: add miss release buffer head in fc_do_one_pass() ext4: avoid crash when inline data creation follows DIO write ext4: fix null-ptr-deref in ext4_write_info ext4: make ext4_lazyinit_thread freezable ext4: fix check for block being out of directory size ext4: don't increase iversion counter for ea_inodes ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodate ext4: place buffer head allocation before handle start ext4: fix dir corruption when ext4_dx_add_entry() fails ext4: fix miss release buffer head in ext4_fc_write_inode ext4: fix potential memory leak in ext4_fc_record_modified_inode() ext4: fix potential memory leak in ext4_fc_record_regions() ext4: update 'state->fc_regions_size' after successful memory allocation livepatch: fix race between fork and KLP transition ftrace: Properly unset FTRACE_HASH_FL_MOD ring-buffer: Allow splice to read previous partially read pages ring-buffer: Have the shortest_full queue be the shortest not longest ring-buffer: Check pending waiters when doing wake ups as well ring-buffer: Add ring_buffer_wake_waiters() ring-buffer: Fix race between reset page and reading page tracing: Disable interrupt or preemption before acquiring arch_spinlock_t tracing: Wake up ring buffer waiters on closing of the file tracing: Wake up waiters when tracing is disabled tracing: Add ioctl() to force ring buffer waiters to wake up tracing: Move duplicate code of trace_kprobe/eprobe.c into header tracing: Add "(fault)" name injection to kernel probes tracing: Fix reading strings from synthetic events thunderbolt: Explicitly enable lane adapter hotplug events at startup efi: libstub: drop pointless get_memory_map() call media: cedrus: Set the platform driver data earlier media: cedrus: Fix endless loop in cedrus_h265_skip_bits() blk-wbt: call rq_qos_add() after wb_normal is initialized KVM: x86/emulator: Fix handing of POP SS to correctly set interruptibility KVM: nVMX: Unconditionally purge queued/injected events on nested "exit" KVM: nVMX: Don't propagate vmcs12's PERF_GLOBAL_CTRL settings to vmcs02 KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS staging: greybus: audio_helper: remove unused and wrong debugfs usage drm/nouveau/kms/nv140-: Disable interlacing drm/nouveau: fix a use-after-free in nouveau_gem_prime_import_sg_table() drm/i915: Fix watermark calculations for gen12+ RC CCS modifier drm/i915: Fix watermark calculations for gen12+ MC CCS modifier drm/i915: Fix watermark calculations for gen12+ CCS+CC modifier drm/amd/display: Fix vblank refcount in vrr transition smb3: must initialize two ACL struct fields to zero selinux: use "grep -E" instead of "egrep" ima: fix blocking of security.ima xattrs of unsupported algorithms userfaultfd: open userfaultfds with O_RDONLY ntfs3: rework xattr handlers and switch to POSIX ACL VFS helpers thermal: cpufreq_cooling: Check the policy first in cpufreq_cooling_register() sh: machvec: Use char[] for section boundaries MIPS: SGI-IP27: Free some unused memory MIPS: SGI-IP27: Fix platform-device leak in bridge_platform_create() ARM: 9244/1: dump: Fix wrong pg_level in walk_pmd() ARM: 9247/1: mm: set readonly for MT_MEMORY_RO with ARM_LPAE objtool: Preserve special st_shndx indexes in elf_update_symbol nfsd: Fix a memory leak in an error handling path SUNRPC: Fix svcxdr_init_decode's end-of-buffer calculation SUNRPC: Fix svcxdr_init_encode's buflen calculation NFSD: Protect against send buffer overflow in NFSv2 READDIR NFSD: Fix handling of oversized NFSv4 COMPOUND requests wifi: rtlwifi: 8192de: correct checking of IQK reload wifi: ath10k: add peer map clean up for peer delete in ath10k_sta_state() leds: lm3601x: Don't use mutex after it was destroyed bpf: Fix reference state management for synchronous callbacks wifi: mac80211: allow bw change during channel switch in mesh bpftool: Fix a wrong type cast in btf_dumper_int spi: mt7621: Fix an error message in mt7621_spi_probe() x86/resctrl: Fix to restore to original value when re-enabling hardware prefetch register xsk: Fix backpressure mechanism on Tx bpf: Disable preemption when increasing per-cpu map_locked bpf: Propagate error from htab_lock_bucket() to userspace bpf: Use this_cpu_{inc\|dec\|inc_return} for bpf_task_storage_busy Bluetooth: btusb: mediatek: fix WMT failure during runtime suspend wifi: rtl8xxxu: tighten bounds checking in rtl8xxxu_read_efuse() wifi: rtw88: add missing destroy_workqueue() on error path in rtw_core_init() selftests/xsk: Avoid use-after-free on ctx spi: qup: add missing clk_disable_unprepare on error in spi_qup_resume() spi: qup: add missing clk_disable_unprepare on error in spi_qup_pm_resume_runtime() wifi: rtl8xxxu: Fix skb misuse in TX queue selection spi: meson-spicc: do not rely on busy flag in pow2 clk ops bpf: btf: fix truncated last_member_type_id in btf_struct_resolve wifi: rtl8xxxu: gen2: Fix mistake in path B IQ calibration wifi: rtl8xxxu: Remove copy-paste leftover in gen2_update_rate_mask wifi: mt76: sdio: fix transmitting packet hangs wifi: mt76: mt7615: add mt7615_mutex_acquire/release in mt7615_sta_set_decap_offload wifi: mt76: mt7915: do not check state before configuring implicit beamform Bluetooth: RFCOMM: Fix possible deadlock on socket shutdown/release net: fs_enet: Fix wrong check in do_pd_setup bpf: Ensure correct locking around vulnerable function find_vpid() Bluetooth: hci_{ldisc,serdev}: check percpu_init_rwsem() failure netfilter: conntrack: fix the gc rescheduling delay netfilter: conntrack: revisit the gc initial rescheduling bias wifi: ath11k: fix number of VHT beamformee spatial streams x86/microcode/AMD: Track patch allocation size explicitly x86/cpu: Include the header of init_ia32_feat_ctl()'s prototype spi: dw: Fix PM disable depth imbalance in dw_spi_bt1_probe spi/omap100k:Fix PM disable depth imbalance in omap1_spi100k_probe skmsg: Schedule psock work if the cached skb exists on the psock i2c: mlxbf: support lock mechanism Bluetooth: hci_core: Fix not handling link timeouts propertly xfrm: Reinject transport-mode packets through workqueue netfilter: nft_fib: Fix for rpath check with VRF devices spi: s3c64xx: Fix large transfers with DMA wifi: rtl8xxxu: Fix AIFS written to REG_EDCA__PARAM vhost/vsock: Use kvmalloc/kvfree for larger packets. eth: alx: take rtnl_lock on resume mISDN: fix use-after-free bugs in l1oip timer handlers sctp: handle the error returned from sctp_auth_asoc_init_active_key tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited spi: Ensure that sg_table won't be used after being freed hwmon: (pmbus/mp2888) Fix sensors readouts for MPS Multi-phase mp2888 controller net: rds: don't hold sock lock when cancelling work from rds_tcp_reset_callbacks() bnx2x: fix potential memory leak in bnx2x_tpa_stop() net: wwan: iosm: Call mutex_init before locking it net/ieee802154: reject zero-sized raw_sendmsg() once: add DO_ONCE_SLOW() for sleepable contexts net: mvpp2: fix mvpp2 debugfs leak drm: bridge: adv7511: fix CEC power down control register offset drm: bridge: adv7511: unregister cec i2c device after cec adapter drm/bridge: Avoid uninitialized variable warning drm/mipi-dsi: Detach devices when removing the host drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling drm/bridge: parade-ps8640: Fix regulator supply order drm/dp_mst: fix drm_dp_dpcd_read return value checks drm:pl111: Add of_node_put() when breaking out of for_each_available_child_of_node() ASoC: mt6359: fix tests for platform_get_irq() failure platform/chrome: fix double-free in chromeos_laptop_prepare() platform/chrome: fix memory corruption in ioctl ASoC: tas2764: Allow mono streams ASoC: tas2764: Drop conflicting set_bias_level power setting ASoC: tas2764: Fix mute/unmute platform/x86: msi-laptop: Fix old-ec check for backlight registering platform/x86: msi-laptop: Fix resource cleanup platform/chrome: cros_ec_typec: Correct alt mode index drm/amdgpu: add missing pci_disable_device() in amdgpu_pmops_runtime_resume() drm/bridge: megachips: Fix a null pointer dereference bug ASoC: rsnd: Add check for rsnd_mod_power_on ALSA: hda: beep: Simplify keep-power-at-enable behavior drm/bochs: fix blanking drm/omap: dss: Fix refcount leak bugs drm/amdgpu: Fix memory leak in hpd_rx_irq_create_workqueue() mmc: au1xmmc: Fix an error handling path in au1xmmc_probe() ASoC: eureka-tlv320: Hold reference returned from of_find_xxx API drm/msm/dpu: index dpu_kms->hw_vbif using vbif_idx drm/msm/dp: correct 1.62G link rate at dp_catalog_ctrl_config_msa() drm/vmwgfx: Fix memory leak in vmw_mksstat_add_ioctl() ASoC: codecs: tx-macro: fix kcontrol put ASoC: da7219: Fix an error handling path in da7219_register_dai_clks() ALSA: dmaengine: increment buffer pointer atomically mmc: wmt-sdmmc: Fix an error handling path in wmt_mci_probe() ASoC: wm8997: Fix PM disable depth imbalance in wm8997_probe ASoC: wm5110: Fix PM disable depth imbalance in wm5110_probe ASoC: wm5102: Fix PM disable depth imbalance in wm5102_probe ASoC: mt6660: Fix PM disable depth imbalance in mt6660_i2c_probe ALSA: hda/hdmi: Don't skip notification handling during PM operation memory: pl353-smc: Fix refcount leak bug in pl353_smc_probe() memory: of: Fix refcount leak bug in of_get_ddr_timings() memory: of: Fix refcount leak bug in of_lpddr3_get_ddr_timings() locks: fix TOCTOU race when granting write lease soc: qcom: smsm: Fix refcount leak bugs in qcom_smsm_probe() soc: qcom: smem_state: Add refcounting for the 'state->of_node' ARM: dts: imx6qdl-kontron-samx6i: hook up DDC i2c bus ARM: dts: turris-omnia: Fix mpp26 pin name and comment ARM: dts: kirkwood: lsxl: fix serial line ARM: dts: kirkwood: lsxl: remove first ethernet port ia64: export memory_add_physaddr_to_nid to fix cxl build error soc/tegra: fuse: Drop Kconfig dependency on TEGRA20_APB_DMA arm64: dts: ti: k3-j7200: fix main pinmux range ARM: dts: exynos: correct s5k6a3 reset polarity on Midas family ARM: Drop CMDLINE_ dependency on ATAGS ext4: don't run ext4lazyinit for read-only filesystems arm64: ftrace: fix module PLTs with mcount ARM: dts: exynos: fix polarity of VBUS GPIO of Origen iio: adc: at91-sama5d2_adc: fix AT91_SAMA5D2_MR_TRACKTIM_MAX iio: adc: at91-sama5d2_adc: check return status for pressure and touch iio: adc: at91-sama5d2_adc: lock around oversampling and sample freq iio: adc: at91-sama5d2_adc: disable/prepare buffer on suspend/resume iio: inkern: only release the device node when done with it iio: inkern: fix return value in devm_of_iio_channel_get_by_name() iio: ABI: Fix wrong format of differential capacitance channel ABI. iio: magnetometer: yas530: Change data type of hard_offsets to signed RDMA/mlx5: Don't compare mkey tags in DEVX indirect mkey usb: common: debug: Check non-standard control requests clk: meson: Hold reference returned by of_get_parent() clk: oxnas: Hold reference returned by of_get_parent() clk: qoriq: Hold reference returned by of_get_parent() clk: berlin: Add of_node_put() for of_get_parent() clk: sprd: Hold reference returned by of_get_parent() clk: tegra: Fix refcount leak in tegra210_clock_init clk: tegra: Fix refcount leak in tegra114_clock_init clk: tegra20: Fix refcount leak in tegra20_clock_init HSI: omap_ssi: Fix refcount leak in ssi_probe HSI: omap_ssi_port: Fix dma_map_sg error check media: exynos4-is: fimc-is: Add of_node_put() when breaking out of loop tty: xilinx_uartps: Fix the ignore_status media: meson: vdec: add missing clk_disable_unprepare on error in vdec_hevc_start() media: uvcvideo: Fix memory leak in uvc_gpio_parse media: uvcvideo: Use entity get_cur in uvc_ctrl_set media: xilinx: vipp: Fix refcount leak in xvip_graph_dma_init RDMA/rxe: Fix "kernel NULL pointer dereference" error RDMA/rxe: Fix the error caused by qp->sk misc: ocxl: fix possible refcount leak in afu_ioctl() fpga: prevent integer overflow in dfl_feature_ioctl_set_irq() dmaengine: hisilicon: Disable channels when unregister hisi_dma dmaengine: hisilicon: Fix CQ head update dmaengine: hisilicon: Add multi-thread support for a DMA channel dyndbg: fix static_branch manipulation dyndbg: fix module.dyndbg handling dyndbg: let query-modname override actual module name dyndbg: drop EXPORTed dynamic_debug_exec_queries clk: qcom: sm6115: Select QCOM_GDSC mtd: devices: docg3: check the return value of devm_ioremap() in the probe phy: amlogic: phy-meson-axg-mipi-pcie-analog: Hold reference returned by of_get_parent() phy: phy-mtk-tphy: fix the phy type setting issue mtd: rawnand: intel: Read the chip-select line from the correct OF node mtd: rawnand: intel: Remove undocumented compatible string mtd: rawnand: fsl_elbc: Fix none ECC mode RDMA/irdma: Align AE id codes to correct flush code and event RDMA/srp: Fix srp_abort() RDMA/siw: Always consume all skbuf data in sk_data_ready() upcall. RDMA/siw: Fix QP destroy to wait for all references dropped. ata: fix ata_id_sense_reporting_enabled() and ata_id_has_sense_reporting() ata: fix ata_id_has_devslp() ata: fix ata_id_has_ncq_autosense() ata: fix ata_id_has_dipm() mtd: rawnand: meson: fix bit map use in meson_nfc_ecc_correct() md: Replace snprintf with scnprintf md/raid5: Ensure stripe_fill happens on non-read IO with journal md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() RDMA/cm: Use SLID in the work completion as the DLID in responder side IB: Set IOVA/LENGTH on IB_MR in core/uverbs layers xhci: Don't show warning for reinit on known broken suspend usb: gadget: function: fix dangling pnp_string in f_printer.c drivers: serial: jsm: fix some leaks in probe serial: 8250: Toggle IER bits on only after irq has been set up tty: serial: fsl_lpuart: disable dma rx/tx use flags in lpuart_dma_shutdown phy: qualcomm: call clk_disable_unprepare in the error handling staging: vt6655: fix some erroneous memory clean-up loops slimbus: qcom-ngd-ctrl: allow compile testing without QCOM_RPROC_COMMON firmware: google: Test spinlock on panic path to avoid lockups serial: 8250: Fix restoring termios speed after suspend scsi: libsas: Fix use-after-free bug in smp_execute_task_sg() scsi: iscsi: Rename iscsi_conn_queue_work() scsi: iscsi: Add recv workqueue helpers scsi: iscsi: Run recv path from workqueue scsi: iscsi: iscsi_tcp: Fix null-ptr-deref while calling getpeername() clk: qcom: apss-ipq6018: mark apcs_alias0_core_clk as critical clk: qcom: gcc-sm6115: Override default Alpha PLL regs RDMA/rxe: Fix resize_finish() in rxe_queue.c fsi: core: Check error number after calling ida_simple_get mfd: intel_soc_pmic: Fix an error handling path in intel_soc_pmic_i2c_probe() mfd: fsl-imx25: Fix an error handling path in mx25_tsadc_setup_irq() mfd: lp8788: Fix an error handling path in lp8788_probe() mfd: lp8788: Fix an error handling path in lp8788_irq_init() and lp8788_irq_init() mfd: fsl-imx25: Fix check for platform_get_irq() errors mfd: sm501: Add check for platform_driver_register() clk: mediatek: mt8183: mfgcfg: Propagate rate changes to parent dmaengine: ioat: stop mod_timer from resurrecting deleted timer in __cleanup() usb: mtu3: fix failed runtime suspend in host only mode spmi: pmic-arb: correct duplicate APID to PPID mapping logic clk: vc5: Fix 5P49V6901 outputs disabling when enabling FOD clk: baikal-t1: Fix invalid xGMAC PTP clock divider clk: baikal-t1: Add shared xGMAC ref/ptp clocks internal parent clk: baikal-t1: Add SATA internal ref clock buffer clk: bcm2835: fix bcm2835_clock_rate_from_divisor declaration clk: imx: scu: fix memleak on platform_device_add() fails clk: ti: dra7-atl: Fix reference leak in of_dra7_atl_clk_probe clk: ast2600: BCLK comes from EPLL mailbox: mpfs: fix handling of the reg property mailbox: mpfs: account for mbox offsets while sending mailbox: bcm-ferxrm-mailbox: Fix error check for dma_map_sg powerpc/configs: Properly enable PAPR_SCM in pseries_defconfig powerpc/math_emu/efp: Include module.h powerpc/sysdev/fsl_msi: Add missing of_node_put() powerpc/pci_dn: Add missing of_node_put() powerpc/powernv: add missing of_node_put() in opal_export_attrs() powerpc: Fix fallocate and fadvise64_64 compat parameter combination x86/hyperv: Fix 'struct hv_enlightened_vmcs' definition powerpc/64s: Fix GENERIC_CPU build flags for PPC970 / G5 powerpc: Fix SPE Power ISA properties for e500v1 platforms powerpc/kprobes: Fix null pointer reference in arch_prepare_kprobe() powerpc/pseries/vas: Pass hw_cpu_id to node associativity HCALL crypto: sahara - don't sleep when in softirq crypto: hisilicon/zip - fix mismatch in get/set sgl_sge_nr hwrng: arm-smccc-trng - fix NO_ENTROPY handling cgroup: Honor caller's cgroup NS when resolving path hwrng: imx-rngc - Moving IRQ handler registering after imx_rngc_irq_mask_clear() crypto: qat - fix default value of WDT timer crypto: hisilicon/qm - fix missing put dfx access cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset iommu/omap: Fix buffer overflow in debugfs crypto: akcipher - default implementation for setting a private key crypto: ccp - Release dma channels before dmaengine unrgister crypto: inside-secure - Change swab to swab32 crypto: qat - fix DMA transfer direction cifs: return correct error in ->calc_signature() iommu/iova: Fix module config properly tracing: kprobe: Fix kprobe event gen test module on exit tracing: kprobe: Make gen test module work in arm and riscv tracing/osnoise: Fix possible recursive locking in stop_per_cpu_kthreads kbuild: remove the target in signal traps when interrupted kbuild: rpm-pkg: fix breakage when V=1 is used crypto: marvell/octeontx - prevent integer overflows crypto: cavium - prevent integer overflow loading firmware thermal/drivers/qcom/tsens-v0_1: Fix MSM8939 fourth sensor hw_id ACPI: APEI: do not add task_work to kernel thread to avoid memory leak f2fs: fix race condition on setting FI_NO_EXTENT flag f2fs: fix to account FS_CP_DATA_IO correctly selftest: tpm2: Add Client.__del__() to close /dev/tpm* handle fs: dlm: fix race in lowcomms rcu: Avoid triggering strict-GP irq-work when RCU is idle rcu: Back off upon fill_page_cache_func() allocation failure rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() ACPI: video: Add Toshiba Satellite/Portege Z830 quirk ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode MIPS: BCM47XX: Cast memcmp() of function to (void *) powercap: intel_rapl: fix UBSAN shift-out-of-bounds issue thermal: intel_powerclamp: Use get_cpu() instead of smp_processor_id() to avoid crash ARM: decompressor: Include .data.rel.ro.local ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable x86/entry: Work around Clang __bdos() bug NFSD: Return nfserr_serverfault if splice_ok but buf->pages have data NFSD: fix use-after-free on source server when doing inter-server copy wifi: brcmfmac: fix invalid address access when enabling SCAN log level bpftool: Clear errno after libcap's checks ice: set tx_tstamps when creating new Tx rings via ethtool net: ethernet: ti: davinci_mdio: Add workaround for errata i2329 openvswitch: Fix double reporting of drops in dropwatch openvswitch: Fix overreporting of drops in dropwatch tcp: annotate data-race around tcp_md5sig_pool_populated x86/mce: Retrieve poison range from hardware wifi: ath9k: avoid uninit memory read in ath9k_htc_rx_msg() thunderbolt: Add back Intel Falcon Ridge end-to-end flow control workaround xfrm: Update ipcomp_scratches with NULL when freed iavf: Fix race between iavf_close and iavf_reset_task wifi: brcmfmac: fix use-after-free bug in brcmf_netdev_start_xmit() Bluetooth: btintel: Mark Intel controller to support LE_STATES quirk regulator: core: Prevent integer underflow wifi: mt76: mt7921: reset msta->airtime_ac while clearing up hw value Bluetooth: L2CAP: initialize delayed works at l2cap_chan_create() Bluetooth: hci_sysfs: Fix attempting to call device_add multiple times can: bcm: check the result of can_send() in bcm_can_tx() wifi: rt2x00: don't run Rt5592 IQ calibration on MT7620 wifi: rt2x00: set correct TX_SW_CFG1 MAC register for MT7620 wifi: rt2x00: set VGC gain for both chains of MT7620 wifi: rt2x00: set SoC wmac clock register wifi: rt2x00: correctly set BBP register 86 for MT7620 hwmon: (sht4x) do not overflow clamping operation on 32-bit platforms net: If sock is dead don't access sock's sk_wq in sk_stream_wait_memory Bluetooth: L2CAP: Fix user-after-free r8152: Rate limit overflow messages drm/nouveau/nouveau_bo: fix potential memory leak in nouveau_bo_alloc() drm: Use size_t type for len variable in drm_copy_field() drm: Prevent drm_copy_field() to attempt copying a NULL pointer drm/komeda: Fix handling of atomic commits in the atomic_commit_tail hook gpu: lontium-lt9611: Fix NULL pointer dereference in lt9611_connector_init() drm/amd/display: fix overflow on MIN_I64 definition udmabuf: Set ubuf->sg = NULL if the creation of sg table fails drm: bridge: dw_hdmi: only trigger hotplug event on link change ALSA: usb-audio: Register card at the last interface drm/vc4: vec: Fix timings for VEC modes drm: panel-orientation-quirks: Add quirk for Anbernic Win600 platform/chrome: cros_ec: Notify the PM of wake events during resume platform/x86: msi-laptop: Change DMI match / alias strings to fix module autoloading ASoC: SOF: pci: Change DMI match info to support all Chrome platforms drm/amdgpu: fix initial connector audio value drm/meson: reorder driver deinit sequence to fix use-after-free bug drm/meson: explicitly remove aggregate driver at module unload time mmc: sdhci-msm: add compatible string check for sdm670 drm/dp: Don't rewrite link config when setting phy test pattern drm/amd/display: Remove interface for periodic interrupt 1 ARM: dts: imx7d-sdb: config the max pressure for tsc2046 ARM: dts: imx6q: add missing properties for sram ARM: dts: imx6dl: add missing properties for sram ARM: dts: imx6qp: add missing properties for sram ARM: dts: imx6sl: add missing properties for sram ARM: dts: imx6sll: add missing properties for sram ARM: dts: imx6sx: add missing properties for sram kselftest/arm64: Fix validatation termination record after EXTRA_CONTEXT arm64: dts: imx8mq-librem5: Add bq25895 as max17055's power supply btrfs: dump extra info if one free space cache has more bitmaps than it should btrfs: scrub: try to fix super block errors btrfs: don't print information about space cache or tree every remount ARM: 9242/1: kasan: Only map modules if CONFIG_KASAN_VMALLOC=n clk: zynqmp: Fix stack-out-of-bounds in strncpy` media: cx88: Fix a null-ptr-deref bug in buffer_prepare() media: platform: fix some double free in meson-ge2d and mtk-jpeg and s5p-mfc clk: zynqmp: pll: rectify rate rounding in zynqmp_pll_round_rate usb: host: xhci-plat: suspend and resume clocks usb: host: xhci-plat: suspend/resume clks for brcm dmaengine: ti: k3-udma: Reset UDMA_CHAN_RT byte counters to prevent overflow scsi: 3w-9xxx: Avoid disabling device if failing to enable it nbd: Fix hung when signal interrupts nbd_start_device_ioctl() iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity power: supply: adp5061: fix out-of-bounds read in adp5061_get_chg_type() staging: vt6655: fix potential memory leak blk-throttle: prevent overflow while calculating wait time ata: libahci_platform: Sanity check the DT child nodes number bcache: fix set_at_max_writeback_rate() for multiple attached devices soundwire: cadence: Don't overwrite msg->buf during write commands soundwire: intel: fix error handling on dai registration issues HID: roccat: Fix use-after-free in roccat_read() eventfd: guard wake_up in eventfd fs calls as well md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d usb: host: xhci: Fix potential memory leak in xhci_alloc_stream_info() usb: musb: Fix musb_gadget.c rxstate overflow bug arm64: dts: imx8mp: Add snps,gfladj-refclk-lpm-sel quirk to USB nodes usb: dwc3: core: Enable GUCTL1 bit 10 for fixing termination error after resume bug Revert "usb: storage: Add quirk for Samsung Fit flash" staging: rtl8723bs: fix potential memory leak in rtw_init_drv_sw() staging: rtl8723bs: fix a potential memory leak in rtw_init_cmd_priv() scsi: tracing: Fix compile error in trace_array calls when TRACING is disabled ext2: Use kvmalloc() for group descriptor array nvme: copy firmware_rev on each init nvmet-tcp: add bounds check on Transfer Tag usb: idmouse: fix an uninit-value in idmouse_open clk: bcm2835: Make peripheral PLLC critical clk: bcm2835: Round UART input clock up perf intel-pt: Fix segfault in intel_pt_print_info() with uClibc io_uring/af_unix: defer registered files gc to io_uring release io_uring: correct pinned_vm accounting io_uring/rw: fix short rw error handling io_uring/rw: fix error'ed retry return values io_uring/rw: fix unexpected link breakage mm: hugetlb: fix UAF in hugetlb_handle_userfault net: ieee802154: return -EINVAL for unknown addr type ALSA: usb-audio: Fix last interface check for registration blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init() net: ethernet: ti: davinci_mdio: fix build for mdio bitbang uses Revert "net/ieee802154: reject zero-sized raw_sendmsg()" net/ieee802154: don't warn zero-sized raw_sendmsg() drm/amd/display: Fix build breakage with CONFIG_DEBUG_FS=n Kconfig.debug: simplify the dependency of DEBUG_INFO_DWARF4/5 Kconfig.debug: add toolchain checks for DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT lib/Kconfig.debug: Add check for non-constant .{s,u}leb128 support to DWARF5 ext4: continue to expand file system when the target size doesn't reach thermal: intel_powerclamp: Use first online CPU as control_cpu gcov: support GCC 12.1 and newer compilers io-wq: Fix memory leak in worker creation Linux 5.15.75 Change-Id: I5a3ef9688fb31003940d7e1828f863b9d50f1da9 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2022-11-16 16:28:57 +00:00
Kumar Kartikeya Dwivedi	35d8130f2a	bpf: Add helper macro bpf_for_each_reg_in_vstate [ Upstream commit b239da34203f49c40b5d656220c39647c3ff0b3c ] For a lot of use cases in future patches, we will want to modify the state of registers part of some same 'group' (e.g. same ref_obj_id). It won't just be limited to releasing reference state, but setting a type flag dynamically based on certain actions, etc. Hence, we need a way to easily pass a callback to the function that iterates over all registers in current bpf_verifier_state in all frames upto (and including) the curframe. While in C++ we would be able to easily use a lambda to pass state and the callback together, sadly we aren't using C++ in the kernel. The next best thing to avoid defining a function for each case seems like statement expressions in GNU C. The kernel already uses them heavily, hence they can passed to the macro in the style of a lambda. The statement expression will then be substituted in the for loop bodies. Variables __state and __reg are set to current bpf_func_state and reg for each invocation of the expression inside the passed in verifier state. Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers, release_reference, find_good_pkt_pointers, find_equal_scalars to use bpf_for_each_reg_in_vstate. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: f1db20814af5 ("bpf: Fix wrong reg type conversion in release_reference()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-11-16 09:58:15 +01:00
Kumar Kartikeya Dwivedi	4ed5155043	bpf: Fix reference state management for synchronous callbacks [ Upstream commit 9d9d00ac29d0ef7ce426964de46fa6b380357d0a ] Currently, verifier verifies callback functions (sync and async) as if they will be executed once, (i.e. it explores execution state as if the function was being called once). The next insn to explore is set to start of subprog and the exit from nested frame is handled using curframe > 0 and prepare_func_exit. In case of async callback it uses a customized variant of push_stack simulating a kind of branch to set up custom state and execution context for the async callback. While this approach is simple and works when callback really will be executed only once, it is unsafe for all of our current helpers which are for_each style, i.e. they execute the callback multiple times. A callback releasing acquired references of the caller may do so multiple times, but currently verifier sees it as one call inside the frame, which then returns to caller. Hence, it thinks it released some reference that the cb e.g. got access through callback_ctx (register filled inside cb from spilled typed register on stack). Similarly, it may see that an acquire call is unpaired inside the callback, so the caller will copy the reference state of callback and then will have to release the register with new ref_obj_ids. But again, the callback may execute multiple times, but the verifier will only account for acquired references for a single symbolic execution of the callback, which will cause leaks. Note that for async callback case, things are different. While currently we have bpf_timer_set_callback which only executes it once, even for multiple executions it would be safe, as reference state is NULL and check_reference_leak would force program to release state before BPF_EXIT. The state is also unaffected by analysis for the caller frame. Hence async callback is safe. Since we want the reference state to be accessible, e.g. for pointers loaded from stack through callback_ctx's PTR_TO_STACK, we still have to copy caller's reference_state to callback's bpf_func_state, but we enforce that whatever references it adds to that reference_state has been released before it hits BPF_EXIT. This requires introducing a new callback_ref member in the reference state to distinguish between caller vs callee references. Hence, check_reference_leak now errors out if it sees we are in callback_fn and we have not released callback_ref refs. Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2 etc. we need to also distinguish between whether this particular ref belongs to this callback frame or parent, and only error for our own, so we store state->frameno (which is always non-zero for callbacks). In short, callbacks can read parent reference_state, but cannot mutate it, to be able to use pointers acquired by the caller. They must only undo their changes (by releasing their own acquired_refs before BPF_EXIT) on top of caller reference_state before returning (at which point the caller and callback state will match anyway, so no need to copy it back to caller). Fixes: `69c087ba62` ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-10-26 12:34:39 +02:00
Hao Luo	13fc6550b0	UPSTREAM: bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX \| PTR_MAYBE_NULL commit c25b2ae136039ffa820c26138ed4a5e5f3ab3841 upstream. We have introduced a new type to make bpf_reg composable, by allocating bits in the type to represent flags. One of the flags is PTR_MAYBE_NULL which indicates a pointer may be NULL. This patch switches the qualified reg_types to use this flag. The reg_types changed in this patch include: 1. PTR_TO_MAP_VALUE_OR_NULL 2. PTR_TO_SOCKET_OR_NULL 3. PTR_TO_SOCK_COMMON_OR_NULL 4. PTR_TO_TCP_SOCK_OR_NULL 5. PTR_TO_BTF_ID_OR_NULL 6. PTR_TO_MEM_OR_NULL 7. PTR_TO_RDONLY_BUF_OR_NULL 8. PTR_TO_RDWR_BUF_OR_NULL [haoluo: backport notes There was a reg_type_may_be_null() in adjust_ptr_min_max_vals() in 5.15.x, but didn't exist in the upstream commit. This backport converted that reg_type_may_be_null() to type_may_be_null() as well.] Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com Cc: stable@vger.kernel.org # 5.15.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit `8d38cde47a`) Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Iad81be90619024512c0fd7f23e7115aec60244a4	2022-06-18 18:28:44 +00:00
Hao Luo	cc3b22bead	UPSTREAM: bpf: Introduce composable reg, ret and arg types. commit d639b9d13a39cf15639cbe6e8b2c43eb60148a73 upstream. There are some common properties shared between bpf reg, ret and arg values. For instance, a value may be a NULL pointer, or a pointer to a read-only memory. Previously, to express these properties, enumeration was used. For example, in order to test whether a reg value can be NULL, reg_type_may_be_null() simply enumerates all types that are possibly NULL. The problem of this approach is that it's not scalable and causes a lot of duplication. These properties can be combined, for example, a type could be either MAYBE_NULL or RDONLY, or both. This patch series rewrites the layout of reg_type, arg_type and ret_type, so that common properties can be extracted and represented as composable flag. For example, one can write ARG_PTR_TO_MEM \| PTR_MAYBE_NULL which is equivalent to the previous ARG_PTR_TO_MEM_OR_NULL The type ARG_PTR_TO_MEM are called "base type" in this patch. Base types can be extended with flags. A flag occupies the higher bits while base types sits in the lower bits. This patch in particular sets up a set of macro for this purpose. The following patches will rewrite arg_types, ret_types and reg_types respectively. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211217003152.48334-2-haoluo@google.com Cc: stable@vger.kernel.org # 5.15.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit `a76020980b`) Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I1827886452529c828de7819fbf9cbd238c2605bb	2022-06-18 18:28:44 +00:00
Greg Kroah-Hartman	ecda2085fd	Revert 5.15.37 merge into android13-5.15 This reverts the merge of 5.15.37 into the android13-5.15 There are lots of ABI issues, and many of the commits are not needed in the Android tree at this time. Revert the merge (except for the Makefile change), so that future merges will continue to work, and the needed individual changes from this release will be manually added to the tree at a later point in time. Fixes: f7dace75d276 ("Merge 5.15.37 into android13-5.15") Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I0632858e5c0fb94fc14c0f4216997330eca260a7	2022-05-18 08:59:14 +02:00
Greg Kroah-Hartman	ef5fed3c1e	Merge 5.15.37 into android13-5.15 Changes in 5.15.37 floppy: disable FDRAWCMD by default bpf: Introduce composable reg, ret and arg types. bpf: Replace ARG_XXX_OR_NULL with ARG_XXX \| PTR_MAYBE_NULL bpf: Replace RET_XXX_OR_NULL with RET_XXX \| PTR_MAYBE_NULL bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX \| PTR_MAYBE_NULL bpf: Introduce MEM_RDONLY flag bpf: Convert PTR_TO_MEM_OR_NULL to composable types. bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM. bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem. bpf/selftests: Test PTR_TO_RDONLY_MEM bpf: Fix crash due to out of bounds access into reg2btf_ids. spi: cadence-quadspi: fix write completion support ARM: dts: socfpga: change qspi to "intel,socfpga-qspi" mm: kfence: fix objcgs vector allocation gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable iov_iter: Introduce fault_in_iov_iter_writeable gfs2: Add wrapper for iomap_file_buffered_write gfs2: Clean up function may_grant gfs2: Introduce flag for glock holder auto-demotion gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Eliminate ip->i_gh gfs2: Fix mmap + page fault deadlocks for buffered I/O iomap: Fix iomap_dio_rw return value for user copies iomap: Support partial direct I/O on user copy failures iomap: Add done_before argument to iomap_dio_rw gup: Introduce FOLL_NOFAULT flag to disable page faults iov_iter: Introduce nofault flag to disable page faults gfs2: Fix mmap + page fault deadlocks for direct I/O btrfs: fix deadlock due to page faults during direct IO reads and writes btrfs: fallback to blocking mode when doing async dio over multiple extents mm: gup: make fault_in_safe_writeable() use fixup_user_fault() selftests/bpf: Add test for reg2btf_ids out of bounds access Linux 5.15.37 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ica39b8856d6e3928a82f4e34f8b401f1a5cba5ee	2022-05-18 08:59:02 +02:00
Greg Kroah-Hartman	521f2e62a3	ANDROID: add kabi padding for structures for the android13 release There are a lot of different structures that need to have a "frozen" abi for the next 5+ years. Add padding to a lot of them in order to be able to handle any future changes that might be needed due to LTS and security fixes that might come up. It's a best guess, based on what has happened in the past from the 5.10.0..5.10.110 release (1 1/2 years). Yes, past changes do not mean that future changes will also be needed in the same area, but that is a hint that those areas are both well maintained and looked after, and there have been previous problems found in them. Also the list of structures that are being required based on OEM usage in the android/ symbol lists were consulted as that's a larger list than what has been changed in the past. Hopefully we caught everything we need to worry about, only time will tell... Bug: 151154716 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I880bbcda0628a7459988eeb49d18655522697664	2022-05-04 13:39:14 -07:00
Hao Luo	8d38cde47a	bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX \| PTR_MAYBE_NULL commit c25b2ae136039ffa820c26138ed4a5e5f3ab3841 upstream. We have introduced a new type to make bpf_reg composable, by allocating bits in the type to represent flags. One of the flags is PTR_MAYBE_NULL which indicates a pointer may be NULL. This patch switches the qualified reg_types to use this flag. The reg_types changed in this patch include: 1. PTR_TO_MAP_VALUE_OR_NULL 2. PTR_TO_SOCKET_OR_NULL 3. PTR_TO_SOCK_COMMON_OR_NULL 4. PTR_TO_TCP_SOCK_OR_NULL 5. PTR_TO_BTF_ID_OR_NULL 6. PTR_TO_MEM_OR_NULL 7. PTR_TO_RDONLY_BUF_OR_NULL 8. PTR_TO_RDWR_BUF_OR_NULL [haoluo: backport notes There was a reg_type_may_be_null() in adjust_ptr_min_max_vals() in 5.15.x, but didn't exist in the upstream commit. This backport converted that reg_type_may_be_null() to type_may_be_null() as well.] Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com Cc: stable@vger.kernel.org # 5.15.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-01 17:22:24 +02:00
Hao Luo	a76020980b	bpf: Introduce composable reg, ret and arg types. commit d639b9d13a39cf15639cbe6e8b2c43eb60148a73 upstream. There are some common properties shared between bpf reg, ret and arg values. For instance, a value may be a NULL pointer, or a pointer to a read-only memory. Previously, to express these properties, enumeration was used. For example, in order to test whether a reg value can be NULL, reg_type_may_be_null() simply enumerates all types that are possibly NULL. The problem of this approach is that it's not scalable and causes a lot of duplication. These properties can be combined, for example, a type could be either MAYBE_NULL or RDONLY, or both. This patch series rewrites the layout of reg_type, arg_type and ret_type, so that common properties can be extracted and represented as composable flag. For example, one can write ARG_PTR_TO_MEM \| PTR_MAYBE_NULL which is equivalent to the previous ARG_PTR_TO_MEM_OR_NULL The type ARG_PTR_TO_MEM are called "base type" in this patch. Base types can be extended with flags. A flag occupies the higher bits while base types sits in the lower bits. This patch in particular sets up a set of macro for this purpose. The following patches will rewrite arg_types, ret_types and reg_types respectively. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211217003152.48334-2-haoluo@google.com Cc: stable@vger.kernel.org # 5.15.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-01 17:22:23 +02:00
Hou Tao	832d478ccd	bpf: Disallow BPF_LOG_KERNEL log level for bpf(BPF_BTF_LOAD) [ Upstream commit 866de407444398bc8140ea70de1dba5f91cc34ac ] BPF_LOG_KERNEL is only used internally, so disallow bpf_btf_load() to set log level as BPF_LOG_KERNEL. The same checking has already been done in bpf_check(), so factor out a helper to check the validity of log attributes and use it in both places. Fixes: `8580ac9404` ("bpf: Process in-kernel BTF") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211203053001.740945-1-houtao1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-01-27 11:03:27 +01:00
Jakub Kicinski	d2e11fd2b7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicting commits, all resolutions pretty trivial: drivers/bus/mhi/pci_generic.c `5c2c853159` ("bus: mhi: pci-generic: configurable network interface MRU") `56f6f4c4eb` ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean") drivers/nfc/s3fwrn5/firmware.c `a0302ff590` ("nfc: s3fwrn5: remove unnecessary label") `46573e3ab0` ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") `801e541c79` ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") MAINTAINERS `7d901a1e87` ("net: phy: add Maxlinear GPY115/21x/24x driver") `8a7b46fa79` ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-07-31 09:14:46 -07:00
Daniel Borkmann	2039f26f3a	bpf: Fix leakage due to insufficient speculative store bypass mitigation Spectre v4 gadgets make use of memory disambiguation, which is a set of techniques that execute memory access instructions, that is, loads and stores, out of program order; Intel's optimization manual, section 2.4.4.5: A load instruction micro-op may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache. Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed. `af86ca4e30` ("bpf: Prevent memory disambiguation attack") tried to mitigate this attack by sanitizing the memory locations through preemptive "fast" (low latency) stores of zero prior to the actual "slow" (high latency) store of a pointer value such that upon dependency misprediction the CPU then speculatively executes the load of the pointer value and retrieves the zero value instead of the attacker controlled scalar value previously stored at that location, meaning, subsequent access in the speculative domain is then redirected to the "zero page". The sanitized preemptive store of zero prior to the actual "slow" store is done through a simple ST instruction based on r10 (frame pointer) with relative offset to the stack location that the verifier has been tracking on the original used register for STX, which does not have to be r10. Thus, there are no memory dependencies for this store, since it's only using r10 and immediate constant of zero; hence `af86ca4e30` /assumed/ a low latency operation. However, a recent attack demonstrated that this mitigation is not sufficient since the preemptive store of zero could also be turned into a "slow" store and is thus bypassed as well: [...] // r2 = oob address (e.g. scalar) // r7 = pointer to map value 31: (7b) (u64 )(r10 -16) = r2 // r9 will remain "fast" register, r10 will become "slow" register below 32: (bf) r9 = r10 // JIT maps BPF reg to x86 reg: // r9 -> r15 (callee saved) // r10 -> rbp // train store forward prediction to break dependency link between both r9 // and r10 by evicting them from the predictor's LRU table. 33: (61) r0 = (u32 )(r7 +24576) 34: (63) (u32 )(r7 +29696) = r0 35: (61) r0 = (u32 )(r7 +24580) 36: (63) (u32 )(r7 +29700) = r0 37: (61) r0 = (u32 )(r7 +24584) 38: (63) (u32 )(r7 +29704) = r0 39: (61) r0 = (u32 )(r7 +24588) 40: (63) (u32 )(r7 +29708) = r0 [...] 543: (61) r0 = (u32 )(r7 +25596) 544: (63) (u32 )(r7 +30716) = r0 // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain // in hardware registers. rbp becomes slow due to push/pop latency. below is // disasm of bpf_ringbuf_output() helper for better visual context: // // ffffffff8117ee20: 41 54 push r12 // ffffffff8117ee22: 55 push rbp // ffffffff8117ee23: 53 push rbx // ffffffff8117ee24: 48 f7 c1 fc ff ff ff test rcx,0xfffffffffffffffc // ffffffff8117ee2b: 0f 85 af 00 00 00 jne ffffffff8117eee0 <-- jump taken // [...] // ffffffff8117eee0: 49 c7 c4 ea ff ff ff mov r12,0xffffffffffffffea // ffffffff8117eee7: 5b pop rbx // ffffffff8117eee8: 5d pop rbp // ffffffff8117eee9: 4c 89 e0 mov rax,r12 // ffffffff8117eeec: 41 5c pop r12 // ffffffff8117eeee: c3 ret 545: (18) r1 = map[id:4] 547: (bf) r2 = r7 548: (b7) r3 = 0 549: (b7) r4 = 4 550: (85) call bpf_ringbuf_output#194288 // instruction 551 inserted by verifier \ 551: (7a) (u64 )(r10 -16) = 0 \| /both/ are now slow stores here // storing map value pointer r7 at fp-16 \| since value of r10 is "slow". 552: (7b) (u64 )(r10 -16) = r7 / // following "fast" read to the same memory location, but due to dependency // misprediction it will speculatively execute before insn 551/552 completes. 553: (79) r2 = (u64 )(r9 -16) // in speculative domain contains attacker controlled r2. in non-speculative // domain this contains r7, and thus accesses r7 +0 below. 554: (71) r3 = (u8 )(r2 +0) // leak r3 As can be seen, the current speculative store bypass mitigation which the verifier inserts at line 551 is insufficient since /both/, the write of the zero sanitation as well as the map value pointer are a high latency instruction due to prior memory access via push/pop of r10 (rbp) in contrast to the low latency read in line 553 as r9 (r15) which stays in hardware registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally, fp-16 can still be r2. Initial thoughts to address this issue was to track spilled pointer loads from stack and enforce their load via LDX through r10 as well so that /both/ the preemptive store of zero /as well as/ the load use the /same/ register such that a dependency is created between the store and load. However, this option is not sufficient either since it can be bypassed as well under speculation. An updated attack with pointer spill/fills now _all_ based on r10 would look as follows: [...] // r2 = oob address (e.g. scalar) // r7 = pointer to map value [...] // longer store forward prediction training sequence than before. 2062: (61) r0 = (u32 )(r7 +25588) 2063: (63) (u32 )(r7 +30708) = r0 2064: (61) r0 = (u32 )(r7 +25592) 2065: (63) (u32 )(r7 +30712) = r0 2066: (61) r0 = (u32 )(r7 +25596) 2067: (63) (u32 )(r7 +30716) = r0 // store the speculative load address (scalar) this time after the store // forward prediction training. 2068: (7b) (u64 )(r10 -16) = r2 // preoccupy the CPU store port by running sequence of dummy stores. 2069: (63) (u32 )(r7 +29696) = r0 2070: (63) (u32 )(r7 +29700) = r0 2071: (63) (u32 )(r7 +29704) = r0 2072: (63) (u32 )(r7 +29708) = r0 2073: (63) (u32 )(r7 +29712) = r0 2074: (63) (u32 )(r7 +29716) = r0 2075: (63) (u32 )(r7 +29720) = r0 2076: (63) (u32 )(r7 +29724) = r0 2077: (63) (u32 )(r7 +29728) = r0 2078: (63) (u32 )(r7 +29732) = r0 2079: (63) (u32 )(r7 +29736) = r0 2080: (63) (u32 )(r7 +29740) = r0 2081: (63) (u32 )(r7 +29744) = r0 2082: (63) (u32 )(r7 +29748) = r0 2083: (63) (u32 )(r7 +29752) = r0 2084: (63) (u32 )(r7 +29756) = r0 2085: (63) (u32 )(r7 +29760) = r0 2086: (63) (u32 )(r7 +29764) = r0 2087: (63) (u32 )(r7 +29768) = r0 2088: (63) (u32 )(r7 +29772) = r0 2089: (63) (u32 )(r7 +29776) = r0 2090: (63) (u32 )(r7 +29780) = r0 2091: (63) (u32 )(r7 +29784) = r0 2092: (63) (u32 )(r7 +29788) = r0 2093: (63) (u32 )(r7 +29792) = r0 2094: (63) (u32 )(r7 +29796) = r0 2095: (63) (u32 )(r7 +29800) = r0 2096: (63) (u32 )(r7 +29804) = r0 2097: (63) (u32 )(r7 +29808) = r0 2098: (63) (u32 )(r7 +29812) = r0 // overwrite scalar with dummy pointer; same as before, also including the // sanitation store with 0 from the current mitigation by the verifier. 2099: (7a) (u64 )(r10 -16) = 0 \| /both/ are now slow stores here 2100: (7b) (u64 )(r10 -16) = r7 \| since store unit is still busy. // load from stack intended to bypass stores. 2101: (79) r2 = (u64 )(r10 -16) 2102: (71) r3 = (u8 )(r2 +0) // leak r3 [...] Looking at the CPU microarchitecture, the scheduler might issue loads (such as seen in line 2101) before stores (line 2099,2100) because the load execution units become available while the store execution unit is still busy with the sequence of dummy stores (line 2069-2098). And so the load may use the prior stored scalar from r2 at address r10 -16 for speculation. The updated attack may work less reliable on CPU microarchitectures where loads and stores share execution resources. This concludes that the sanitizing with zero stores from `af86ca4e30` ("bpf: Prevent memory disambiguation attack") is insufficient. Moreover, the detection of stack reuse from `af86ca4e30` where previously data (STACK_MISC) has been written to a given stack slot where a pointer value is now to be stored does not have sufficient coverage as precondition for the mitigation either; for several reasons outlined as follows: 1) Stack content from prior program runs could still be preserved and is therefore not "random", best example is to split a speculative store bypass attack between tail calls, program A would prepare and store the oob address at a given stack slot and then tail call into program B which does the "slow" store of a pointer to the stack with subsequent "fast" read. From program B PoV such stack slot type is STACK_INVALID, and therefore also must be subject to mitigation. 2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr) condition, for example, the previous content of that memory location could also be a pointer to map or map value. Without the fix, a speculative store bypass is not mitigated in such precondition and can then lead to a type confusion in the speculative domain leaking kernel memory near these pointer types. While brainstorming on various alternative mitigation possibilities, we also stumbled upon a retrospective from Chrome developers [0]: [...] For variant 4, we implemented a mitigation to zero the unused memory of the heap prior to allocation, which cost about 1% when done concurrently and 4% for scavenging. Variant 4 defeats everything we could think of. We explored more mitigations for variant 4 but the threat proved to be more pervasive and dangerous than we anticipated. For example, stack slots used by the register allocator in the optimizing compiler could be subject to type confusion, leading to pointer crafting. Mitigating type confusion for stack slots alone would have required a complete redesign of the backend of the optimizing compiler, perhaps man years of work, without a guarantee of completeness. [...] From BPF side, the problem space is reduced, however, options are rather limited. One idea that has been explored was to xor-obfuscate pointer spills to the BPF stack: [...] // preoccupy the CPU store port by running sequence of dummy stores. [...] 2106: (63) (u32 )(r7 +29796) = r0 2107: (63) (u32 )(r7 +29800) = r0 2108: (63) (u32 )(r7 +29804) = r0 2109: (63) (u32 )(r7 +29808) = r0 2110: (63) (u32 )(r7 +29812) = r0 // overwrite scalar with dummy pointer; xored with random 'secret' value // of 943576462 before store ... 2111: (b4) w11 = 943576462 2112: (af) r11 ^= r7 2113: (7b) (u64 )(r10 -16) = r11 2114: (79) r11 = (u64 )(r10 -16) 2115: (b4) w2 = 943576462 2116: (af) r2 ^= r11 // ... and restored with the same 'secret' value with the help of AX reg. 2117: (71) r3 = (u8 )(r2 +0) [...] While the above would not prevent speculation, it would make data leakage infeasible by directing it to random locations. In order to be effective and prevent type confusion under speculation, such random secret would have to be regenerated for each store. The additional complexity involved for a tracking mechanism that prevents jumps such that restoring spilled pointers would not get corrupted is not worth the gain for unprivileged. Hence, the fix in here eventually opted for emitting a non-public BPF_ST \| BPF_NOSPEC instruction which the x86 JIT translates into a lfence opcode. Inserting the latter in between the store and load instruction is one of the mitigations options [1]. The x86 instruction manual notes: [...] An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible. [...] The latter meaning that the preceding store instruction finished execution and the store is at minimum guaranteed to be in the CPU's store queue, but it's not guaranteed to be in that CPU's L1 cache at that point (globally visible). The latter would only be guaranteed via sfence. So the load which is guaranteed to execute after the lfence for that local CPU would have to rely on store-to-load forwarding. [2], in section 2.3 on store buffers says: [...] For every store operation that is added to the ROB, an entry is allocated in the store buffer. This entry requires both the virtual and physical address of the target. Only if there is no free entry in the store buffer, the frontend stalls until there is an empty slot available in the store buffer again. Otherwise, the CPU can immediately continue adding subsequent instructions to the ROB and execute them out of order. On Intel CPUs, the store buffer has up to 56 entries. [...] One small upside on the fix is that it lifts constraints from `af86ca4e30` where the sanitize_stack_off relative to r10 must be the same when coming from different paths. The BPF_ST \| BPF_NOSPEC gets emitted after a BPF_STX or BPF_ST instruction. This happens either when we store a pointer or data value to the BPF stack for the first time, or upon later pointer spills. The former needs to be enforced since otherwise stale stack data could be leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST \| BPF_NOSPEC mapping is currently optimized away, but others could emit a speculation barrier as well if necessary. For real-world unprivileged programs e.g. generated by LLVM, pointer spill/fill is only generated upon register pressure and LLVM only tries to do that for pointers which are not used often. The program main impact will be the initial BPF_ST \| BPF_NOSPEC sanitation for the STACK_INVALID case when the first write to a stack slot occurs e.g. upon map lookup. In future we might refine ways to mitigate the latter cost. [0] https://arxiv.org/pdf/1902.05178.pdf [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/ [2] https://arxiv.org/pdf/1905.05725.pdf Fixes: `af86ca4e30` ("bpf: Prevent memory disambiguation attack") Fixes: `f7cf25b202` ("bpf: track spill/fill of constants") Co-developed-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de> Acked-by: Alexei Starovoitov <ast@kernel.org>	2021-07-29 00:27:52 +02:00
Daniel Borkmann	e042aa532c	bpf: Fix pointer arithmetic mask tightening under state pruning In `7fedb63a83` ("bpf: Tighten speculative pointer arithmetic mask") we narrowed the offset mask for unprivileged pointer arithmetic in order to mitigate a corner case where in the speculative domain it is possible to advance, for example, the map value pointer by up to value_size-1 out-of- bounds in order to leak kernel memory via side-channel to user space. The verifier's state pruning for scalars leaves one corner case open where in the first verification path R_x holds an unknown scalar with an aux->alu_limit of e.g. 7, and in a second verification path that same register R_x, here denoted as R_x', holds an unknown scalar which has tighter bounds and would thus satisfy range_within(R_x, R_x') as well as tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3: Given the second path fits the register constraints for pruning, the final generated mask from aux->alu_limit will remain at 7. While technically not wrong for the non-speculative domain, it would however be possible to craft similar cases where the mask would be too wide as in `7fedb63a83`. One way to fix it is to detect the presence of unknown scalar map pointer arithmetic and force a deeper search on unknown scalars to ensure that we do not run into a masking mismatch. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2021-07-16 16:57:07 +02:00
Alexei Starovoitov	7ddc80a476	bpf: Teach stack depth check about async callbacks. Teach max stack depth checking algorithm about async callbacks that don't increase bpf program stack size. Also add sanity check that bpf_tail_call didn't sneak into async cb. It's impossible, since PTR_TO_CTX is not available in async cb, hence the program cannot contain bpf_tail_call(ctx,...); Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com	2021-07-15 22:31:10 +02:00
Alexei Starovoitov	bfc6bb74e4	bpf: Implement verifier support for validation of async callbacks. bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on PTR_TO_FUNC infra in the verifier to validate addresses to subprograms and pass them into the helpers as function callbacks. In case of bpf_for_each_map_elem() the callback is invoked synchronously and the verifier treats it as a normal subprogram call by adding another bpf_func_state and new frame in __check_func_call(). bpf_timer_set_callback() doesn't invoke the callback directly. The subprogram will be called asynchronously from bpf_timer_cb(). Teach the verifier to validate such async callbacks as special kind of jump by pushing verifier state into stack and let pop_stack() process it. Special care needs to be taken during state pruning. The call insn doing bpf_timer_set_callback has to be a prune_point. Otherwise short timer callbacks might not have prune points in front of bpf_timer_set_callback() which means is_state_visited() will be called after this call insn is processed in __check_func_call(). Which means that another async_cb state will be pushed to be walked later and the verifier will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit. Since push_async_cb() looks like another push_stack() branch the infinite loop detection will trigger false positive. To recognize this case mark such states as in_async_callback_fn. To distinguish infinite loop in async callback vs the same callback called with different arguments for different map and timer add async_entry_cnt to bpf_func_state. Enforce return zero from async callbacks. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com	2021-07-15 22:31:10 +02:00
Alexei Starovoitov	3e8ce29850	bpf: Prevent pointer mismatch in bpf_timer_init. bpf_timer_init() arguments are: 1. pointer to a timer (which is embedded in map element). 2. pointer to a map. Make sure that pointer to a timer actually belongs to that map. Use map_uid (which is unique id of inner map) to reject: inner_map1 = bpf_map_lookup_elem(outer_map, key1) inner_map2 = bpf_map_lookup_elem(outer_map, key2) if (inner_map1 && inner_map2) { timer = bpf_map_lookup_elem(inner_map1); if (timer) // mismatch would have been allowed bpf_timer_init(timer, inner_map2); } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com	2021-07-15 22:31:10 +02:00
Alexei Starovoitov	387544bfa2	bpf: Introduce fd_idx Typical program loading sequence involves creating bpf maps and applying map FDs into bpf instructions in various places in the bpf program. This job is done by libbpf that is using compiler generated ELF relocations to patch certain instruction after maps are created and BTFs are loaded. The goal of fd_idx is to allow bpf instructions to stay immutable after compilation. At load time the libbpf would still create maps as usual, but it wouldn't need to patch instructions. It would store map_fds into __u32 fd_array[] and would pass that pointer to sys_bpf(BPF_PROG_LOAD). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com	2021-05-19 00:33:40 +02:00
Lorenz Bauer	c9e73e3d2b	bpf: verifier: Allocate idmap scratch in verifier env func_states_equal makes a very short lived allocation for idmap, probably because it's too large to fit on the stack. However the function is called quite often, leading to a lot of alloc / free churn. Replace the temporary allocation with dedicated scratch space in struct bpf_verifier_env. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com	2021-05-10 16:13:01 -07:00
Daniel Borkmann	801c6058d1	bpf: Fix leakage of uninitialized bpf stack under speculation The current implemented mechanisms to mitigate data disclosure under speculation mainly address stack and map value oob access from the speculative domain. However, Piotr discovered that uninitialized BPF stack is not protected yet, and thus old data from the kernel stack, potentially including addresses of kernel structures, could still be extracted from that 512 bytes large window. The BPF stack is special compared to map values since it's not zero initialized for every program invocation, whereas map values /are/ zero initialized upon their initial allocation and thus cannot leak any prior data in either domain. In the non-speculative domain, the verifier ensures that every stack slot read must have a prior stack slot write by the BPF program to avoid such data leaking issue. However, this is not enough: for example, when the pointer arithmetic operation moves the stack pointer from the last valid stack offset to the first valid offset, the sanitation logic allows for any intermediate offsets during speculative execution, which could then be used to extract any restricted stack content via side-channel. Given for unprivileged stack pointer arithmetic the use of unknown but bounded scalars is generally forbidden, we can simply turn the register-based arithmetic operation into an immediate-based arithmetic operation without the need for masking. This also gives the benefit of reducing the needed instructions for the operation. Given after the work in `7fedb63a83` ("bpf: Tighten speculative pointer arithmetic mask"), the aux->alu_limit already holds the final immediate value for the offset register with the known scalar. Thus, a simple mov of the immediate to AX register with using AX as the source for the original instruction is sufficient and possible now in this case. Reported-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>	2021-05-03 11:56:23 +02:00
Toke Høiland-Jørgensen	441e8c66b2	bpf: Return target info when a tracing bpf_link is queried There is currently no way to discover the target of a tracing program attachment after the fact. Add this information to bpf_link_info and return it when querying the bpf_link fd. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com	2021-04-13 18:18:57 -07:00
Yonghong Song	69c087ba62	bpf: Add bpf_for_each_map_elem() helper The bpf_for_each_map_elem() helper is introduced which iterates all map elements with a callback function. The helper signature looks like long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags) and for each map element, the callback_fn will be called. For example, like hashmap, the callback signature may look like long callback_fn(map, key, val, callback_ctx) There are two known use cases for this. One is from upstream ([1]) where a for_each_map_elem helper may help implement a timeout mechanism in a more generic way. Another is from our internal discussion for a firewall use case where a map contains all the rules. The packet data can be compared to all these rules to decide allow or deny the packet. For array maps, users can already use a bounded loop to traverse elements. Using this helper can avoid using bounded loop. For other type of maps (e.g., hash maps) where bounded loop is hard or impossible to use, this helper provides a convenient way to operate on all elements. For callback_fn, besides map and map element, a callback_ctx, allocated on caller stack, is also passed to the callback function. This callback_ctx argument can provide additional input and allow to write to caller stack for output. If the callback_fn returns 0, the helper will iterate through next element if available. If the callback_fn returns 1, the helper will stop iterating and returns to the bpf program. Other return values are not used for now. Currently, this helper is only available with jit. It is possible to make it work with interpreter with so effort but I leave it as the future work. [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/ Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210226204925.3884923-1-yhs@fb.com	2021-02-26 13:23:52 -08:00
Dmitrii Banshchikov	e5069b9c23	bpf: Support pointers in global func args Add an ability to pass a pointer to a type with known size in arguments of a global function. Such pointers may be used to overcome the limit on the maximum number of arguments, avoid expensive and tricky workarounds and to have multiple output arguments. A referenced type may contain pointers but indirect access through them isn't supported. The implementation consists of two parts. If a global function has an argument that is a pointer to a type with known size then: 1) In btf_check_func_arg_match(): check that the corresponding register points to NULL or to a valid memory region that is large enough to contain the expected argument's type. 2) In btf_prepare_func_args(): set the corresponding register type to PTR_TO_MEM_OR_NULL and its size to the size of the expected type. Only global functions are supported because allowance of pointers for static functions might break validation. Consider the following scenario. A static function has a pointer argument. A caller passes pointer to its stack memory. Because the callee can change referenced memory verifier cannot longer assume any particular slot type of the caller's stack memory hence the slot type is changed to SLOT_MISC. If there is an operation that relies on slot type other than SLOT_MISC then verifier won't be able to infer safety of the operation. When verifier sees a static function that has a pointer argument different from PTR_TO_CTX then it skips arguments check and continues with "inline" validation with more information available. The operation that relies on the particular slot type now succeeds. Because global functions were not allowed to have pointer arguments different from PTR_TO_CTX it's not possible to break existing and valid code. Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210212205642.620788-4-me@ubique.spb.ru	2021-02-12 17:37:23 -08:00
Andrei Matei	01f810ace9	bpf: Allow variable-offset stack access Before this patch, variable offset access to the stack was dissalowed for regular instructions, but was allowed for "indirect" accesses (i.e. helpers). This patch removes the restriction, allowing reading and writing to the stack through stack pointers with variable offsets. This makes stack-allocated buffers more usable in programs, and brings stack pointers closer to other types of pointers. The motivation is being able to use stack-allocated buffers for data manipulation. When the stack size limit is sufficient, allocating buffers on the stack is simpler than per-cpu arrays, or other alternatives. In unpriviledged programs, variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them. Additionally, when writing through a variable-offset stack pointer, if any pointers are in the accessible range, there's possilibities of later leaking pointers because the write cannot be tracked precisely. Writes with variable offset mark the whole range as initialized, even though we don't know which stack slots are actually written. This is in order to not reject future reads to these slots. Note that this doesn't affect writes done through helpers; like before, helpers need the whole stack range to be initialized to begin with. All the stack slots are in range are considered scalars after the write; variable-offset register spills are not tracked. For reads, all the stack slots in the variable range needs to be initialized (but see above about what writes do), otherwise the read is rejected. All register spilled in stack slots that might be read are marked as having been read, however reads through such pointers don't do register filling; the target register will always be either a scalar or a constant zero. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com	2021-02-10 10:44:19 -08:00
Andrii Nakryiko	541c3bad8d	bpf: Support BPF ksym variables in kernel modules Add support for directly accessing kernel module variables from BPF programs using special ldimm64 instructions. This functionality builds upon vmlinux ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow specifying kernel module BTF's FD in insn[1].imm field. During BPF program load time, verifier will resolve FD to BTF object and will take reference on BTF object itself and, for module BTFs, corresponding module as well, to make sure it won't be unloaded from under running BPF program. The mechanism used is similar to how bpf_prog keeps track of used bpf_maps. One interesting change is also in how per-CPU variable is determined. The logic is to find .data..percpu data section in provided BTF, but both vmlinux and module each have their own .data..percpu entries in BTF. So for module's case, the search for DATASEC record needs to look at only module's added BTF types. This is implemented with custom search function. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/bpf/20210112075520.4103414-6-andrii@kernel.org	2021-01-12 17:24:30 -08:00
Andrii Nakryiko	22dc4a0f5e	bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead, wherever BTF type IDs are involved, also track the instance of struct btf that goes along with the type ID. This allows to gradually add support for kernel module BTFs and using/tracking module types across BPF helper calls and registers. This patch also renames btf_id() function to btf_obj_id() to minimize naming clash with using btf_id to denote BTF type ID, rather than BTF object's ID. Also, altough btf_vmlinux can't get destructed and thus doesn't need refcounting, module BTFs need that, so apply BTF refcounting universally when BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This makes for simpler clean up code. Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF trampoline key to include BTF object ID. To differentiate that from target program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are not allowed to take full 32 bits, so there is no danger of confusing that bit with a valid BTF type ID. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org	2020-12-03 17:38:21 -08:00
Alexei Starovoitov	6d94e741a8	bpf: Support for pointers beyond pkt_end. This patch adds the verifier support to recognize inlined branch conditions. The LLVM knows that the branch evaluates to the same value, but the verifier couldn't track it. Hence causing valid programs to be rejected. The potential LLVM workaround: https://reviews.llvm.org/D87428 can have undesired side effects, since LLVM doesn't know that skb->data/data_end are being compared. LLVM has to introduce extra boolean variable and use inline_asm trick to force easier for the verifier assembly. Instead teach the verifier to recognize that r1 = skb->data; r1 += 10; r2 = skb->data_end; if (r1 > r2) { here r1 points beyond packet_end and subsequent if (r1 > r2) // always evaluates to "true". } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jiri Olsa <jolsa@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20201111031213.25109-2-alexei.starovoitov@gmail.com	2020-11-13 01:42:11 +01:00
Hao Luo	4976b718c3	bpf: Introduce pseudo_btf_id Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a ksym so that further dereferences on the ksym can use the BTF info to validate accesses. Internally, when seeing a pseudo_btf_id ld insn, the verifier reads the btf_id stored in the insn[0]'s imm field and marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND, which is encoded in btf_vminux by pahole. If the VAR is not of a struct type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID and the mem_size is resolved to the size of the VAR's type. >From the VAR btf_id, the verifier can also read the address of the ksym's corresponding kernel var from kallsyms and use that to fill dst_reg. Therefore, the proper functionality of pseudo_btf_id depends on (1) kallsyms and (2) the encoding of kernel global VARs in pahole, which should be available since pahole v1.18. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com	2020-10-02 14:59:25 -07:00
Toke Høiland-Jørgensen	f7b12b6fea	bpf: verifier: refactor check_attach_btf_id() The check_attach_btf_id() function really does three things: 1. It performs a bunch of checks on the program to ensure that the attachment is valid. 2. It stores a bunch of state about the attachment being requested in the verifier environment and struct bpf_prog objects. 3. It allocates a trampoline for the attachment. This patch splits out (1.) and (3.) into separate functions which will perform the checks, but return the computed values instead of directly modifying the environment. This is done in preparation for reusing the checks when the actual attachment is happening, which will allow tracing programs to have multiple (compatible) attachments. This also fixes a bug where a bunch of checks were skipped if a trampoline already existed for the tracing target. Fixes: `6ba43b761c` ("bpf: Attachment verification for BPF_MODIFY_RETURN") Fixes: `1e6c62a882` ("bpf: Introduce sleepable BPF programs") Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-09-28 17:10:34 -07:00
Toke Høiland-Jørgensen	efc68158c4	bpf: change logging calls from verbose() to bpf_log() and use log pointer In preparation for moving code around, change a bunch of references to env->log (and the verbose() logging helper) to use bpf_log() and a direct pointer to struct bpf_verifier_log. While we're touching the function signature, mark the 'prog' argument to bpf_check_type_match() as const. Also enhance the bpf_verifier_log_needed() check to handle NULL pointers for the log struct so we can re-use the code with logging disabled. Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-09-28 17:09:59 -07:00
Alexei Starovoitov	09b28d76ea	bpf: Add abnormal return checks. LD_[ABS\|IND] instructions may return from the function early. bpf_tail_call pseudo instruction is either fallthrough or return. Allow them in the subprograms only when subprograms are BTF annotated and have scalar return types. Allow ld_abs and tail_call in the main program even if it calls into subprograms. In the past that was not ok to do for ld_abs, since it was JITed with special exit sequence. Since bpf_gen_ld_abs() was introduced the ld_abs looks like normal exit insn from JIT point of view, so it's safe to allow them in the main program. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-09-17 19:56:07 -07:00
Maciej Fijalkowski	ebf7d1f508	bpf, x64: rework pro/epilogue and tailcall handling in JIT This commit serves two things: 1) it optimizes BPF prologue/epilogue generation 2) it makes possible to have tailcalls within BPF subprogram Both points are related to each other since without 1), 2) could not be achieved. In [1], Alexei says: "The prologue will look like: nop5 xor eax,eax // two new bytes if bpf_tail_call() is used in this // function push rbp mov rbp, rsp sub rsp, rounded_stack_depth push rax // zero init tail_call counter variable number of push rbx,r13,r14,r15 Then bpf_tail_call will pop variable number rbx,.. and final 'pop rax' Then 'add rsp, size_of_current_stack_frame' jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp' This way new function will set its own stack size and will init tail call counter with whatever value the parent had. If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'. Instead it would need to have 'nop2' in there." Implement that suggestion. Since the layout of stack is changed, tail call counter handling can not rely anymore on popping it to rbx just like it have been handled for constant prologue case and later overwrite of rbx with actual value of rbx pushed to stack. Therefore, let's use one of the register (%rcx) that is considered to be volatile/caller-saved and pop the value of tail call counter in there in the epilogue. Drop the BUILD_BUG_ON in emit_prologue and in emit_bpf_tail_call_indirect where instruction layout is not constant anymore. Introduce new poke target, 'tailcall_bypass' to poke descriptor that is dedicated for skipping the register pops and stack unwind that are generated right before the actual jump to target program. For case when the target program is not present, BPF program will skip the pop instructions and nop5 dedicated for jmpq $target. An example of such state when only R6 of callee saved registers is used by program: ffffffffc0513aa1: e9 0e 00 00 00 jmpq 0xffffffffc0513ab4 ffffffffc0513aa6: 5b pop %rbx ffffffffc0513aa7: 58 pop %rax ffffffffc0513aa8: 48 81 c4 00 00 00 00 add $0x0,%rsp ffffffffc0513aaf: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0513ab4: 48 89 df mov %rbx,%rdi When target program is inserted, the jump that was there to skip pops/nop5 will become the nop5, so CPU will go over pops and do the actual tailcall. One might ask why there simply can not be pushes after the nop5? In the following example snippet: ffffffffc037030c: 48 89 fb mov %rdi,%rbx (...) ffffffffc0370332: 5b pop %rbx ffffffffc0370333: 58 pop %rax ffffffffc0370334: 48 81 c4 00 00 00 00 add $0x0,%rsp ffffffffc037033b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0370340: 48 81 ec 00 00 00 00 sub $0x0,%rsp ffffffffc0370347: 50 push %rax ffffffffc0370348: 53 push %rbx ffffffffc0370349: 48 89 df mov %rbx,%rdi ffffffffc037034c: e8 f7 21 00 00 callq 0xffffffffc0372548 There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall and jump target is not present. ctx is in %rbx register and BPF subprogram that we will call into on ffffffffc037034c is relying on it, e.g. it will pick ctx from there. Such code layout is therefore broken as we would overwrite the content of %rbx with the value that was pushed on the prologue. That is the reason for the 'bypass' approach. Special care needs to be taken during the install/update/remove of tailcall target. In case when target program is not present, the CPU must not execute the pop instructions that precede the tailcall. To address that, the following states can be defined: A nop, unwind, nop B nop, unwind, tail C skip, unwind, nop D skip, unwind, tail A is forbidden (lead to incorrectness). The state transitions between tailcall install/update/remove will work as follows: First install tail call f: C->D->B(f) * poke the tailcall, after that get rid of the skip Update tail call f to f': B(f)->B(f') * poke the tailcall (poke->tailcall_target) and do NOT touch the poke->tailcall_bypass Remove tail call: B(f')->C(f') * poke->tailcall_bypass is poked back to jump, then we wait the RCU grace period so that other programs will finish its execution and after that we are safe to remove the poke->tailcall_target Install new tail call (f''): C(f')->D(f'')->B(f''). * same as first step This way CPU can never be exposed to "unwind, tail" state. Last but not least, when tailcalls get mixed with bpf2bpf calls, it would be possible to encounter the endless loop due to clearing the tailcall counter if for example we would use the tailcall3-like from BPF selftests program that would be subprogram-based, meaning the tailcall would be present within the BPF subprogram. This test, broken down to particular steps, would do: entry -> set tailcall counter to 0, bump it by 1, tailcall to func0 func0 -> call subprog_tail (we are NOT skipping the first 11 bytes of prologue and this subprogram has a tailcall, therefore we clear the counter...) subprog -> do the same thing as entry and then loop forever. To address this, the idea is to go through the call chain of bpf2bpf progs and look for a tailcall presence throughout whole chain. If we saw a single tail call then each node in this call chain needs to be marked as a subprog that can reach the tailcall. We would later feed the JIT with this info and: - set eax to 0 only when tailcall is reachable and this is the entry prog - if tailcall is reachable but there's no tailcall in insns of currently JITed prog then push rax anyway, so that it will be possible to propagate further down the call chain - finally if tailcall is reachable, then we need to precede the 'call' insn with mov rax, [rbp - (stack_depth + 8)] Tail call related cases from test_verifier kselftest are also working fine. Sample BPF programs that utilize tail calls (sockex3, tracex5) work properly as well. [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/ Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-09-17 19:55:30 -07:00
Maciej Fijalkowski	7f6e4312e1	bpf: Limit caller's stack depth 256 for subprogs with tailcalls Protect against potential stack overflow that might happen when bpf2bpf calls get combined with tailcalls. Limit the caller's stack depth for such case down to 256 so that the worst case scenario would result in 8k stack size (32 which is tailcall limit * 256 = 8k). Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-09-17 19:19:20 -07:00
Andrey Ignatov	41c48f3a98	bpf: Support access to bpf map fields There are multiple use-cases when it's convenient to have access to bpf map fields, both `struct bpf_map` and map type specific struct-s such as `struct bpf_array`, `struct bpf_htab`, etc. For example while working with sock arrays it can be necessary to calculate the key based on map->max_entries (some_hash % max_entries). Currently this is solved by communicating max_entries via "out-of-band" channel, e.g. via additional map with known key to get info about target map. That works, but is not very convenient and error-prone while working with many maps. In other cases necessary data is dynamic (i.e. unknown at loading time) and it's impossible to get it at all. For example while working with a hash table it can be convenient to know how much capacity is already used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case). At the same time kernel knows this info and can provide it to bpf program. Fill this gap by adding support to access bpf map fields from bpf program for both `struct bpf_map` and map type specific fields. Support is implemented via btf_struct_access() so that a user can define their own `struct bpf_map` or map type specific struct in their program with only necessary fields and preserve_access_index attribute, cast a map to this struct and use a field. For example: struct bpf_map { __u32 max_entries; } __attribute__((preserve_access_index)); struct bpf_array { struct bpf_map map; __u32 elem_size; } __attribute__((preserve_access_index)); struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, __u32); __type(value, __u32); } m_array SEC(".maps"); SEC("cgroup_skb/egress") int cg_skb(void ctx) { struct bpf_array array = (struct bpf_array )&m_array; struct bpf_map map = (struct bpf_map )&m_array; / .. use map->max_entries or array->map.max_entries .. / } Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of corresponding struct. Only reading from map fields is supported. For btf_struct_access() to work there should be a way to know btf id of a struct that corresponds to a map type. To get btf id there should be a way to get a stringified name of map-specific struct, such as "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to `struct bpf_map_ops` to handle it: .map_btf_name keeps a btf name of a struct returned by map_alloc(); * .map_btf_id is used to cache btf id of that struct. To make btf ids calculation cheaper they're calculated once while preparing btf_vmlinux and cached same way as it's done for btf_id field of `struct bpf_func_proto` While calculating btf ids, struct names are NOT checked for collision. Collisions will be checked as a part of the work to prepare btf ids used in verifier in compile time that should land soon. The only known collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs net/core/sock_map.c) was fixed earlier. Both new fields .map_btf_name and .map_btf_id must be set for a map type for the feature to work. If neither is set for a map type, verifier will return ENOTSUPP on a try to access map_ptr of corresponding type. If just one of them set, it's verifier misconfiguration. Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be supported separately. The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by perfmon_capable() so that unpriv programs won't have access to bpf map fields. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com	2020-06-22 22:22:58 +02:00
Andrii Nakryiko	457f44363a	bpf: Implement BPF ring buffer and verifier support for it This commit adds a new MPSC ring buffer implementation into BPF ecosystem, which allows multiple CPUs to submit data to a single shared ring buffer. On the consumption side, only single consumer is assumed. Motivation ---------- There are two distinctive motivators for this work, which are not satisfied by existing perf buffer, which prompted creation of a new ring buffer implementation. - more efficient memory utilization by sharing ring buffer across CPUs; - preserving ordering of events that happen sequentially in time, even across multiple CPUs (e.g., fork/exec/exit events for a task). These two problems are independent, but perf buffer fails to satisfy both. Both are a result of a choice to have per-CPU perf ring buffer. Both can be also solved by having an MPSC implementation of ring buffer. The ordering problem could technically be solved for perf buffer with some in-kernel counting, but given the first one requires an MPSC buffer, the same solution would solve the second problem automatically. Semantics and APIs ------------------ Single ring buffer is presented to BPF programs as an instance of BPF map of type BPF_MAP_TYPE_RINGBUF. Two other alternatives considered, but ultimately rejected. One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make BPF_MAP_TYPE_RINGBUF could represent an array of ring buffers, but not enforce "same CPU only" rule. This would be more familiar interface compatible with existing perf buffer use in BPF, but would fail if application needed more advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses this with current approach. Additionally, given the performance of BPF ringbuf, many use cases would just opt into a simple single ring buffer shared among all CPUs, for which current approach would be an overkill. Another approach could introduce a new concept, alongside BPF map, to represent generic "container" object, which doesn't necessarily have key/value interface with lookup/update/delete operations. This approach would add a lot of extra infrastructure that has to be built for observability and verifier support. It would also add another concept that BPF developers would have to familiarize themselves with, new syntax in libbpf, etc. But then would really provide no additional benefits over the approach of using a map. BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but so doesn't few other map types (e.g., queue and stack; array doesn't support delete, etc). The approach chosen has an advantage of re-using existing BPF map infrastructure (introspection APIs in kernel, libbpf support, etc), being familiar concept (no need to teach users a new type of object in BPF program), and utilizing existing tooling (bpftool). For common scenario of using a single ring buffer for all CPUs, it's as simple and straightforward, as would be with a dedicated "container" object. On the other hand, by being a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to implement a wide variety of topologies, from one ring buffer for each CPU (e.g., as a replacement for perf buffer use cases), to a complicated application hashing/sharding of ring buffers (e.g., having a small pool of ring buffers with hashed task's tgid being a look up key to preserve order, but reduce contention). Key and value sizes are enforced to be zero. max_entries is used to specify the size of ring buffer and has to be a power of 2 value. There are a bunch of similarities between perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics: - variable-length records; - if there is no more space left in ring buffer, reservation fails, no blocking; - memory-mappable data area for user-space applications for ease of consumption and high performance; - epoll notifications for new incoming data; - but still the ability to do busy polling for new data to achieve the lowest latency, if necessary. BPF ringbuf provides two sets of APIs to BPF programs: - bpf_ringbuf_output() allows to copy data from one place to a ring buffer, similarly to bpf_perf_event_output(); - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs split the whole process into two steps. First, a fixed amount of space is reserved. If successful, a pointer to a data inside ring buffer data area is returned, which BPF programs can use similarly to a data inside array/hash maps. Once ready, this piece of memory is either committed or discarded. Discard is similar to commit, but makes consumer ignore the record. bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because record has to be prepared in some other place first. But it allows to submit records of the length that's not known to verifier beforehand. It also closely matches bpf_perf_event_output(), so will simplify migration significantly. bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory pointer directly to ring buffer memory. In a lot of cases records are larger than BPF stack space allows, so many programs have use extra per-CPU array as a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs completely. But in exchange, it only allows a known constant size of memory to be reserved, such that verifier can verify that BPF program can't access memory outside its reserved record space. bpf_ringbuf_output(), while slightly slower due to extra memory copy, covers some use cases that are not suitable for bpf_ringbuf_reserve(). The difference between commit and discard is very small. Discard just marks a record as discarded, and such records are supposed to be ignored by consumer code. Discard is useful for some advanced use-cases, such as ensuring all-or-nothing multi-record submission, or emulating temporary malloc()/free() within single BPF program invocation. Each reserved record is tracked by verifier through existing reference-tracking logic, similar to socket ref-tracking. It is thus impossible to reserve a record, but forget to submit (or discard) it. bpf_ringbuf_query() helper allows to query various properties of ring buffer. Currently 4 are supported: - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer; - BPF_RB_RING_SIZE returns the size of ring buffer; - BPF_RB_CONS_POS/BPF_RB_PROD_POS returns current logical possition of consumer/producer, respectively. Returned values are momentarily snapshots of ring buffer state and could be off by the time helper returns, so this should be used only for debugging/reporting reasons or for implementing various heuristics, that take into account highly-changeable nature of some of those characteristics. One such heuristic might involve more fine-grained control over poll/epoll notifications about new data availability in ring buffer. Together with BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers, it allows BPF program a high degree of control and, e.g., more efficient batched notifications. Default self-balancing strategy, though, should be adequate for most applications and will work reliable and efficiently already. Design and implementation ------------------------- This reserve/commit schema allows a natural way for multiple producers, either on different CPUs or even on the same CPU/in the same BPF program, to reserve independent records and work with them without blocking other producers. This means that if BPF program was interruped by another BPF program sharing the same ring buffer, they will both get a record reserved (provided there is enough space left) and can work with it and submit it independently. This applies to NMI context as well, except that due to using a spinlock during reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock, in which case reservation will fail even if ring buffer is not full. The ring buffer itself internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters (which might wrap around on 32-bit architectures, that's not a problem): - consumer counter shows up to which logical position consumer consumed the data; - producer counter denotes amount of data reserved by all producers. Each time a record is reserved, producer that "owns" the record will successfully advance producer counter. At that point, data is still not yet ready to be consumed, though. Each record has 8 byte header, which contains the length of reserved record, as well as two extra bits: busy bit to denote that record is still being worked on, and discard bit, which might be set at commit time if record is discarded. In the latter case, consumer is supposed to skip the record and move on to the next one. Record header also encodes record's relative offset from the beginning of ring buffer data area (in pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer to the record itself, without requiring also the pointer to ring buffer itself. Ring buffer memory location will be restored from record metadata header. This significantly simplifies verifier, as well as improving API usability. Producer counter increments are serialized under spinlock, so there is a strict ordering between reservations. Commits, on the other hand, are completely lockless and independent. All records become available to consumer in the order of reservations, but only after all previous records where already committed. It is thus possible for slow producers to temporarily hold off submitted records, that were reserved later. Reservation/commit/consumer protocol is verified by litmus tests in Documentation/litmus-test/bpf-rb. One interesting implementation bit, that significantly simplifies (and thus speeds up as well) implementation of both producers and consumers is how data area is mapped twice contiguously back-to-back in the virtual memory. This allows to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. See comment and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc(). Another feature that distinguishes BPF ringbuf from perf ring buffer is a self-pacing notifications of new data being availability. bpf_ringbuf_commit() implementation will send a notification of new record being available after commit only if consumer has already caught up right up to the record being committed. If not, consumer still has to catch up and thus will see new data anyways without needing an extra poll notification. Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that this allows to achieve a very high throughput without having to resort to tricks like "notify only every Nth sample", which are necessary with perf buffer. For extreme cases, when BPF program wants more manual control of notifications, commit/discard/output helpers accept BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data availability, but require extra caution and diligence in using this API. Comparison to alternatives -------------------------- Before considering implementing BPF ring buffer from scratch existing alternatives in kernel were evaluated, but didn't seem to meet the needs. They largely fell into few categores: - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations outlined above (ordering and memory consumption); - linked list-based implementations; while some were multi-producer designs, consuming these from user-space would be very complicated and most probably not performant; memory-mapping contiguous piece of memory is simpler and more performant for user-space consumers; - io_uring is SPSC, but also requires fixed-sized elements. Naively turning SPSC queue into MPSC w/ lock would have subpar performance compared to locked reserve + lockless commit, as with BPF ring buffer. Fixed sized elements would be too limiting for BPF programs, given existing BPF programs heavily rely on variable-sized perf buffer already; - specialized implementations (like a new printk ring buffer, [0]) with lots of printk-specific limitations and implications, that didn't seem to fit well for intended use with BPF programs. [0] https://lwn.net/Articles/779550/ Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 14:38:22 -07:00
Alexei Starovoitov	2c78ee898d	bpf: Implement CAP_BPF Implement permissions as stated in uapi/linux/capability.h In order to do that the verifier allow_ptr_leaks flag is split into four flags and they are set as: env->allow_ptr_leaks = bpf_allow_ptr_leaks(); env->bypass_spec_v1 = bpf_bypass_spec_v1(); env->bypass_spec_v4 = bpf_bypass_spec_v4(); env->bpf_capable = bpf_capable(); The first three currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon. 'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier. That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigation applied. Such networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com	2020-05-15 17:29:41 +02:00
John Fastabend	3f50f132d8	bpf: Verifier, do explicit ALU32 bounds tracking It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] The expected 64-bit and 32-bit bounds after each line are shown on the right. The current issue is without the w* bounds we are forced to use the worst case bound of [0, U32_MAX]. To resolve this type of case, jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit 32-bit register bounds s32_{min\|max}_value and u32_{min\|max}_value. Then from branch_taken logic creating new bounds we can track 32-bit bounds explicitly. The next case we observed is ALU ops after the jmp32, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] 55: w8 += 1 // r8 <- [0, U32_MAX+1] // w8 <- [0, X+1] In order to keep the bounds accurate at this point we also need to track ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU ops, mov, add, sub, etc. Finally there is a question of how and when to merge bounds. The cases enumerate here, 1. MOV ALU32 - zext 32-bit -> 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off \| upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) \| var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(<<var_off)). See __reg_combine_64_into_32(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower	2020-03-30 14:59:53 -07:00
Alexei Starovoitov	51c39bb1d5	bpf: Introduce function-by-function verification New llvm and old llvm with libbpf help produce BTF that distinguish global and static functions. Unlike arguments of static function the arguments of global functions cannot be removed or optimized away by llvm. The compiler has to use exactly the arguments specified in a function prototype. The argument type information allows the verifier validate each global function independently. For now only supported argument types are pointer to context and scalars. In the future pointers to structures, sizes, pointer to packet data can be supported as well. Consider the following example: static int f1(int ...) { ... } int f3(int b); int f2(int a) { f1(a) + f3(a); } int f3(int b) { ... } int main(...) { f1(...) + f2(...) + f3(...); } The verifier will start its safety checks from the first global function f2(). It will recursively descend into f1() because it's static. Then it will check that arguments match for the f3() invocation inside f2(). It will not descend into f3(). It will finish f2() that has to be successfully verified for all possible values of 'a'. Then it will proceed with f3(). That function also has to be safe for all possible values of 'b'. Then it will start subprog 0 (which is main() function). It will recursively descend into f1() and will skip full check of f2() and f3(), since they are global. The order of processing global functions doesn't affect safety, since all global functions must be proven safe based on their arguments only. Such function by function verification can drastically improve speed of the verification and reduce complexity. Note that the stack limit of 512 still applies to the call chain regardless whether functions were static or global. The nested level of 8 also still applies. The same recursion prevention checks are in place as well. The type information and static/global kind is preserved after the verification hence in the above example global function f2() and f3() can be replaced later by equivalent functions with the same types that are loaded and verified later without affecting safety of this main() program. Such replacement (re-linking) of global functions is a subject of future patches. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org	2020-01-10 17:20:07 +01:00
Daniel Borkmann	d2e4c1e6c2	bpf: Constant map key tracking for prog array pokes Add tracking of constant keys into tail call maps. The signature of bpf_tail_call_proto is that arg1 is ctx, arg2 map pointer and arg3 is a index key. The direct call approach for tail calls can be enabled if the verifier asserted that for all branches leading to the tail call helper invocation, the map pointer and index key were both constant and the same. Tracking of map pointers we already do from prior work via `c93552c443` ("bpf: properly enforce index mask to prevent out-of-bounds speculation") and `09772d92cd` ("bpf: avoid retpoline for lookup/update/ delete calls on maps"). Given the tail call map index key is not on stack but directly in the register, we can add similar tracking approach and later in fixup_bpf_calls() add a poke descriptor to the progs poke_tab with the relevant information for the JITing phase. We internally reuse insn->imm for the rewritten BPF_JMP \| BPF_TAIL_CALL instruction in order to point into the prog's poke_tab, and keep insn->imm as 0 as indicator that current indirect tail call emission must be used. Note that publishing to the tracker must happen at the end of fixup_bpf_calls() since adding elements to the poke_tab reallocates its memory, so we need to wait until its in final state. Future work can generalize and add similar approach to optimize plain array map lookups. Difference there is that we need to look into the key value that sits on stack. For clarity in bpf_insn_aux_data, map_state has been renamed into map_ptr_state, so we get map_{ptr,key}_state as trackers. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net	2019-11-24 17:04:11 -08:00
Alexei Starovoitov	8c1b6e69dc	bpf: Compare BTF types of functions arguments with actual types Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org	2019-11-15 23:45:02 +01:00
Alexei Starovoitov	9e15db6613	bpf: Implement accurate raw_tp context access via BTF libbpf analyzes bpf C program, searches in-kernel BTF for given type name and stores it into expected_attach_type. The kernel verifier expects this btf_id to point to something like: typedef void (btf_trace_kfree_skb)(void , struct sk_buff skb, void loc); which represents signature of raw_tracepoint "kfree_skb". Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb' and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint. In first case it passes btf_id of 'struct sk_buff ' back to the verifier core and 'void ' in second case. Then the verifier tracks PTR_TO_BTF_ID as any other pointer type. Like PTR_TO_SOCKET points to 'struct bpf_sock', PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on. PTR_TO_BTF_ID points to in-kernel structs. If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF then PTR_TO_BTF_ID#1234 points to one of in kernel skbs. When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = (u64 )r1 + 32) the btf_struct_access() checks which field of 'struct sk_buff' is at offset 32. Checks that size of access matches type definition of the field and continues to track the dereferenced type. If that field was a pointer to 'struct net_device' the r2's type will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device' in vmlinux's BTF. Such verifier analysis prevents "cheating" in BPF C program. The program cannot cast arbitrary pointer to 'struct sk_buff *' and access it. C compiler would allow type cast, of course, but the verifier will notice type mismatch based on BPF assembly and in-kernel BTF. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org	2019-10-17 16:44:35 +02:00
Alexei Starovoitov	8580ac9404	bpf: Process in-kernel BTF If in-kernel BTF exists parse it and prepare 'struct btf *btf_vmlinux' for further use by the verifier. In-kernel BTF is trusted just like kallsyms and other build artifacts embedded into vmlinux. Yet run this BTF image through BTF verifier to make sure that it is valid and it wasn't mangled during the build. If in-kernel BTF is incorrect it means either gcc or pahole or kernel are buggy. In such case disallow loading BPF programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org	2019-10-17 16:44:35 +02:00
Alexei Starovoitov	10d274e880	bpf: introduce verifier internal test flag Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain and state pruning. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-08-28 00:30:11 +02:00
David S. Miller	dca73a65a6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-06-19 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin. 2) BTF based map definition, from Andrii. 3) support bpf_map_lookup_elem for xskmap, from Jonathan. 4) bounded loops and scalar precision logic in the verifier, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-20 00:06:27 -04:00
Alexei Starovoitov	b5dc0163d8	bpf: precise scalar_value tracking Introduce precision tracking logic that helps cilium programs the most: old clang old clang new clang new clang with all patches with all patches bpf_lb-DLB_L3.o 1838 2283 1923 1863 bpf_lb-DLB_L4.o 3218 2657 3077 2468 bpf_lb-DUNKNOWN.o 1064 545 1062 544 bpf_lxc-DDROP_ALL.o 26935 23045 166729 22629 bpf_lxc-DUNKNOWN.o 34439 35240 174607 28805 bpf_netdev.o 9721 8753 8407 6801 bpf_overlay.o 6184 7901 5420 4754 bpf_lxc_jit.o 39389 50925 39389 50925 Consider code: 654: (85) call bpf_get_hash_recalc#34 655: (bf) r7 = r0 656: (15) if r8 == 0x0 goto pc+29 657: (bf) r2 = r10 658: (07) r2 += -48 659: (18) r1 = 0xffff8881e41e1b00 661: (85) call bpf_map_lookup_elem#1 662: (15) if r0 == 0x0 goto pc+23 663: (69) r1 = (u16 )(r0 +0) 664: (15) if r1 == 0x0 goto pc+21 665: (bf) r8 = r7 666: (57) r8 &= 65535 667: (bf) r2 = r8 668: (3f) r2 /= r1 669: (2f) r2 = r1 670: (bf) r1 = r8 671: (1f) r1 -= r2 672: (57) r1 &= 255 673: (25) if r1 > 0x1e goto pc+12 R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f)) 674: (67) r1 <<= 1 675: (0f) r0 += r1 At this point the verifier will notice that scalar R1 is used in map pointer adjustment. R1 has to be precise for later operations on R0 to be validated properly. The verifier will backtrack the above code in the following way: last_idx 675 first_idx 664 regs=2 stack=0 before 675: (0f) r0 += r1 // started backtracking R1 regs=2 is a bitmask regs=2 stack=0 before 674: (67) r1 <<= 1 regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12 regs=2 stack=0 before 672: (57) r1 &= 255 regs=2 stack=0 before 671: (1f) r1 -= r2 // now both R1 and R2 has to be precise -> regs=6 mask regs=6 stack=0 before 670: (bf) r1 = r8 // after this insn R8 and R2 has to be precise regs=104 stack=0 before 669: (2f) r2 = r1 // after this one R8, R2, and R1 regs=106 stack=0 before 668: (3f) r2 /= r1 regs=106 stack=0 before 667: (bf) r2 = r8 regs=102 stack=0 before 666: (57) r8 &= 65535 regs=102 stack=0 before 665: (bf) r8 = r7 regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21 // this is the end of verifier state. The following regs will be marked precised: R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0) parent didn't have regs=82 stack=0 marks // so backtracking continues into parent state last_idx 663 first_idx 655 regs=82 stack=0 before 663: (69) r1 = (u16 )(r0 +0) // R1 was assigned no need to track it further regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23 // keep tracking R7 regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1 // keep tracking R7 regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00 regs=80 stack=0 before 658: (07) r2 += -48 regs=80 stack=0 before 657: (bf) r2 = r10 regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29 regs=80 stack=0 before 655: (bf) r7 = r0 // here the assignment into R7 // mark R0 to be precise: R0_rw=invP(id=0) parent didn't have regs=1 stack=0 marks // regs=1 -> tracking R0 last_idx 654 first_idx 644 regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value // nothing further to backtrack Two scalar registers not marked precise are equivalent from state pruning point of view. More details in the patch comments. It doesn't support bpf2bpf calls yet and enabled for root only. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:52 +02:00
Alexei Starovoitov	2589726d12	bpf: introduce bounded loops Allow the verifier to validate the loops by simulating their execution. Exisiting programs have used '#pragma unroll' to unroll the loops by the compiler. Instead let the verifier simulate all iterations of the loop. In order to do that introduce parentage chain of bpf_verifier_state and 'branches' counter for the number of branches left to explore. See more detailed algorithm description in bpf_verifier.h This algorithm borrows the key idea from Edward Cree approach: https://patchwork.ozlabs.org/patch/877222/ Additional state pruning heuristics make such brute force loop walk practical even for large loops. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:51 +02:00
David S. Miller	a6cdeeb16b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-07 11:00:14 -07:00
Thomas Gleixner	25763b3c86	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 107 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171438.615055994@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-30 11:29:53 -07:00

1 2 3

122 Commits