76 Commits

Author SHA1 Message Date
Eric W. Biederman
62b004c983 fs: Better permission checking for submounts
commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream.

To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user is allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across their mountpoint and satisfies the
ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds
almost works.  It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately, using override_creds messes up the filesystem's
ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.

vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.

sget and sget_userns are modified to not perform any permission checks
on submounts.

follow_automount is modified to stop using override_creds as that
has proven problematic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never be able to specify it.

autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter.  To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.
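
As a rough illustration of the flag handling described above, the following
userspace sketch models the two rules: do_mount always clears the internal
flag on input from userspace, and a mount carrying the flag skips the
permission checks. This is not kernel code. The struct, the function names
and the bit value used for MODEL_MS_SUBMOUNT are assumptions made for this
sketch only.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit value chosen for this sketch only; treat it as an assumption. */
#define MODEL_MS_SUBMOUNT (1UL << 26)

/* Hypothetical stand-in for the two checks the commit message describes. */
struct model_user {
    bool may_mount_in_ns; /* allowed to create a mount in the mount namespace */
    bool may_access_fs;   /* allowed to access the specified filesystem */
};

/* Models do_mount(): the internal flag is always removed from userspace input. */
unsigned long model_sanitize_user_flags(unsigned long flags)
{
    return flags & ~MODEL_MS_SUBMOUNT;
}

/* Models the decision: a submount skips the checks, anything else needs both. */
bool model_mount_allowed(const struct model_user *user, unsigned long flags)
{
    if (flags & MODEL_MS_SUBMOUNT)
        return true;
    return user->may_mount_in_ns && user->may_access_fs;
}

/* Small self-check showing that userspace cannot smuggle the flag in. */
void model_submount_self_check(void)
{
    struct model_user unprivileged = { false, false };

    /* Kernel-internal submount: allowed regardless of the user's rights. */
    assert(model_mount_allowed(&unprivileged, MODEL_MS_SUBMOUNT));

    /* Same bit passed from userspace: stripped first, so the checks apply. */
    assert(!model_mount_allowed(&unprivileged,
                                model_sanitize_user_flags(MODEL_MS_SUBMOUNT)));
}
```

The point of the model is that permission to create a submount comes only
from the kernel-internal path, because the flag cannot survive a trip
through the mount system call.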

Fixes: 069d5ac9ae0d ("autofs:  Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I09cb1f35368fb8dc4a64b5ac5a35c9d2843ef95b
2022-11-15 21:35:32 +01:00
Thierry Strudel
b11ab24fe6 Merged linux-4.4.70 into android-msm-wahoo-4.4
Linux 4.4.70
    drivers: char: mem: Check for address space wraparound with mmap()
    nfsd: encoders mustn't use unitialized values in error cases
    drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
    PCI: Freeze PME scan before suspending devices
    PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
    tracing/kprobes: Enforce kprobes teardown after testing
    osf_wait4(): fix infoleak
    genirq: Fix chained interrupt data ordering
    uwb: fix device quirk on big-endian hosts
    metag/uaccess: Check access_ok in strncpy_from_user
    metag/uaccess: Fix access_ok()
    iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
    staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
    staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
    mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
    xc2028: Fix use-after-free bug properly
    arm64: documentation: document tagged pointer stack constraints
    arm64: uaccess: ensure extension of access_ok() addr
    arm64: xchg: hazard against entire exchange variable
    ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
    ARM: dts: at91: sama5d3_xplained: fix ADC vref
    powerpc/64e: Fix hang when debugging programs with relocated kernel
    powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
    powerpc/book3s/mce: Move add_taint() later in virtual mode
    cx231xx-cards: fix NULL-deref at probe
    cx231xx-audio: fix NULL-deref at probe
    cx231xx-audio: fix init error path
    dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
    zr364xx: enforce minimum size when reading header
    dib0700: fix NULL-deref at probe
    s5p-mfc: Fix unbalanced call to clock management
    gspca: konica: add missing endpoint sanity check
    ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
    iio: proximity: as3935: fix as3935_write
    ipx: call ipxitf_put() in ioctl error path
    USB: hub: fix non-SS hub-descriptor handling
    USB: hub: fix SS hub-descriptor handling
    USB: serial: io_ti: fix div-by-zero in set_termios
    USB: serial: mct_u232: fix big-endian baud-rate handling
    USB: serial: qcserial: add more Lenovo EM74xx device IDs
    usb: serial: option: add Telit ME910 support
    USB: iowarrior: fix info ioctl on big-endian hosts
    usb: musb: tusb6010_omap: Do not reset the other direction's packet size
    ttusb2: limit messages to buffer size
    mceusb: fix NULL-deref at probe
    usbvision: fix NULL-deref at probe
    net: irda: irda-usb: fix firmware name on big-endian hosts
    usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
    xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
    usb: host: xhci-plat: propagate return value of platform_get_irq()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    fscrypt: avoid collisions when presenting long encrypted filenames
    f2fs: check entire encrypted bigname when finding a dentry
    fscrypt: fix context consistency check when key(s) unavailable
    net: qmi_wwan: Add SIMCom 7230E
    ext4 crypto: fix some error handling
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
    USB: serial: ftdi_sio: fix setting latency for unprivileged users
    pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
    pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
    iio: dac: ad7303: fix channel description
    of: fix sparse warning in of_pci_range_parser_one
    proc: Fix unbalanced hard link numbers
    cdc-acm: fix possible invalid access when processing notification
    drm/nouveau/tmr: handle races with hw when updating the next alarm time
    drm/nouveau/tmr: avoid processing completed alarms when adding a new one
    drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
    drm/nouveau/tmr: ack interrupt before processing alarms
    drm/nouveau/therm: remove ineffective workarounds for alarm bugs
    drm/amdgpu: Make display watermark calculations more accurate
    drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
    ath9k_htc: fix NULL-deref at probe
    ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
    s390/cputime: fix incorrect system time
    s390/kdump: Add final note
    regulator: tps65023: Fix inverted core enable logic.
    KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
    KVM: x86: Fix load damaged SSEx MXCSR register
    ima: accept previously set IMA_NEW_FILE
    mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
    rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
    md: update slab_cache before releasing new stripes when stripes resizing
    dm space map disk: fix some book keeping in the disk space map
    dm thin metadata: call precommit before saving the roots
    dm bufio: make the parameter "retain_bytes" unsigned long
    dm cache metadata: fail operations if fail_io mode has been established
    dm bufio: check new buffer allocation watermark every 30 seconds
    dm bufio: avoid a possible ABBA deadlock
    dm raid: select the Kconfig option CONFIG_MD_RAID0
    dm btree: fix for dm_btree_find_lowest_key()
    infiniband: call ipv6 route lookup via the stub interface
    tpm_crb: check for bad response size
    ARM: tegra: paz00: Mark panel regulator as enabled on boot
    USB: core: replace %p with %pK
    char: lp: fix possible integer overflow in lp_setup()
    watchdog: pcwd_usb: fix NULL-deref at probe
    USB: ene_usb6250: fix DMA to the stack
    usb: misc: legousbtower: Fix memory leak
    usb: misc: legousbtower: Fix buffers on stack
Linux 4.4.69
    ipmi: Fix kernel panic at ipmi_ssif_thread()
    wlcore: Add RX_BA_WIN_SIZE_CHANGE_EVENT event
    wlcore: Pass win_size taken from ieee80211_sta to FW
    mac80211: RX BA support for sta max_rx_aggregation_subframes
    mac80211: pass block ack session timeout to to driver
    mac80211: pass RX aggregation window size to driver
    Bluetooth: hci_intel: add missing tty-device sanity check
    Bluetooth: hci_bcm: add missing tty-device sanity check
    Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
    tty: pty: Fix ldisc flush after userspace become aware of the data already
    serial: omap: suspend device on probe errors
    serial: omap: fix runtime-pm handling on unbind
    serial: samsung: Use right device for DMA-mapping calls
    arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
    padata: free correct variable
    CIFS: add misssing SFM mapping for doublequote
    cifs: fix CIFS_IOC_GET_MNT_INFO oops
    CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
    SMB3: Work around mount failure when using SMB3 dialect to Macs
    Set unicode flag on cifs echo request to avoid Mac error
    fs/block_dev: always invalidate cleancache in invalidate_bdev()
    ceph: fix memory leak in __ceph_setxattr()
    fs/xattr.c: zero out memory copied to userspace in getxattr
    ext4: evict inline data when writing to memory map
    IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
    IB/mlx4: Fix ib device initialization error flow
    IB/IPoIB: ibX: failed to create mcg debug file
    IB/core: Fix sysfs registration error flow
    vfio/type1: Remove locked page accounting workqueue
    dm era: save spacemap metadata root after the pre-commit
    crypto: algif_aead - Require setkey before accept(2)
    block: fix blk_integrity_register to use template's interval_exp if not 0
    KVM: arm/arm64: fix races in kvm_psci_vcpu_on
    KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
    um: Fix PTRACE_POKEUSER on x86_64
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
    x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
    usb: hub: Do not attempt to autosuspend disconnected devices
    usb: hub: Fix error loop seen after hub communication errors
    usb: Make sure usb/phy/of gets built-in
    usb: misc: add missing continue in switch
    staging: comedi: jr3_pci: cope with jiffies wraparound
    staging: comedi: jr3_pci: fix possible null pointer dereference
    staging: gdm724x: gdm_mux: fix use-after-free on module unload
    staging: vt6656: use off stack for out buffer USB transfers.
    staging: vt6656: use off stack for in buffer USB transfers.
    USB: Proper handling of Race Condition when two USB class drivers try to call init_usb_class simultaneously
    USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
    usb: host: xhci: print correct command ring address
    iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
    target: Convert ACL change queue_depth se_session reference usage
    target/fileio: Fix zero-length READ and WRITE handling
    target: Fix compare_and_write_callback handling for non GOOD status
    xen: adjust early dom0 p2m handling to xen hypervisor behavior
Linux 4.4.68
    block: get rid of blk_integrity_revalidate()
    drm/ttm: fix use-after-free races in vm fault handling
    f2fs: sanity check segment count
    bnxt_en: allocate enough space for ->ntp_fltr_bmap
    ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
    ipv6: initialize route null entry in addrconf_init()
    rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
    ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
    tcp: do not inherit fastopen_req from parent
    tcp: fix wraparound issue in tcp_lp
    bpf, arm64: fix jit branch offset related to ldimm64
    tcp: do not underestimate skb->truesize in tcp_trim_head()
    ALSA: hda - Fix deadlock of controller device lock at unbinding
    staging: emxx_udc: remove incorrect __init annotations
    staging: wlan-ng: add missing byte order conversion
    brcmfmac: Make skb header writable before use
    brcmfmac: Ensure pointer correctly set if skb data location changes
    MIPS: R2-on-R6 MULTU/MADDU/MSUBU emulation bugfix
    scsi: mac_scsi: Fix MAC_SCSI=m option when SCSI=m
    serial: 8250_omap: Fix probe and remove for PM runtime
    phy: qcom-usb-hs: Add depends on EXTCON
    USB: serial: io_edgeport: fix descriptor error handling
    USB: serial: mct_u232: fix modem-status error handling
    USB: serial: quatech2: fix control-message error handling
    USB: serial: ftdi_sio: fix latency-timer error handling
    USB: serial: ark3116: fix open error handling
    USB: serial: ti_usb_3410_5052: fix control-message error handling
    USB: serial: io_edgeport: fix epic-descriptor handling
    USB: serial: ssu100: fix control-message error handling
    USB: serial: digi_acceleport: fix incomplete rx sanity check
    USB: serial: keyspan_pda: fix receive sanity checks
    usb: chipidea: Handle extcon events properly
    usb: chipidea: Only read/write OTGSC from one place
    usb: host: ohci-exynos: Decrese node refcount on exynos_ehci_get_phy() error paths
    usb: host: ehci-exynos: Decrese node refcount on exynos_ehci_get_phy() error paths
    KVM: nVMX: do not leak PML full vmexit to L1
    KVM: nVMX: initialize PML fields in vmcs02
    Revert "KVM: nested VMX: disable perf cpuid reporting"
    x86/platform/intel-mid: Correct MSI IRQ line for watchdog device
    kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
    clk: Make x86/ conditional on CONFIG_COMMON_CLK
    x86/pci-calgary: Fix iommu_free() comparison of unsigned expression >= 0
    x86/ioapic: Restore IO-APIC irq_chip retrigger callback
    mwifiex: Avoid skipping WEP key deletion for AP
    mwifiex: remove redundant dma padding in AMSDU
    mwifiex: debugfs: Fix (sometimes) off-by-1 SSID print
    ARM: OMAP5 / DRA7: Fix HYP mode boot for thumb2 build
    leds: ktd2692: avoid harmless maybe-uninitialized warning
    power: supply: bq24190_charger: Handle fault before status on interrupt
    power: supply: bq24190_charger: Don't read fault register outside irq_handle_thread()
    power: supply: bq24190_charger: Call power_supply_changed() for relevant component
    power: supply: bq24190_charger: Install irq_handler_thread() at end of probe()
    power: supply: bq24190_charger: Call set_mode_host() on pm_resume()
    power: supply: bq24190_charger: Fix irq trigger to IRQF_TRIGGER_FALLING
    powerpc/powernv: Fix opal_exit tracepoint opcode
    cpupower: Fix turbo frequency reporting for pre-Sandy Bridge cores
    ARM: 8452/3: PJ4: make coprocessor access sequences buildable in Thumb2 mode
    9p: fix a potential acl leak
Linux 4.4.67
    dm ioctl: prevent stack leak in dm ioctl call
    nfsd: stricter decoding of write-like NFSv2/v3 ops
    nfsd4: minor NFSv2/v3 write decoding cleanup
    ext4/fscrypto: avoid RCU lookup in d_revalidate
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4 crypto: revalidate dentry after adding or removing the key
    ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
    IB/ehca: fix maybe-uninitialized warnings
    IB/qib: rename BITS_PER_PAGE to RVT_BITS_PER_PAGE
    netlink: Allow direct reclaim for fallback allocation
    8250_pci: Fix potential use-after-free in error path
    scsi: cxlflash: Improve EEH recovery time
    scsi: cxlflash: Fix to avoid EEH and host reset collisions
    scsi: cxlflash: Scan host only after the port is ready for I/O
    net: tg3: avoid uninitialized variable warning
    mtd: avoid stack overflow in MTD CFI code
    drbd: avoid redefinition of BITS_PER_PAGE
    ALSA: ppc/awacs: shut up maybe-uninitialized warning
    ASoC: intel: Fix PM and non-atomic crash in bytcr drivers
    Handle mismatched open calls
    timerfd: Protect the might cancel mechanism proper
Linux 4.4.66
    ftrace/x86: Fix triple fault with graph tracing and suspend-to-ram
    ARCv2: save r30 on kernel entry as gcc uses it for code-gen
    nfsd: check for oversized NFSv2/v3 arguments
    Input: i8042 - add Clevo P650RS to the i8042 reset list
    p9_client_readdir() fix
    MIPS: Avoid BUG warning in arch_check_elf
    MIPS: KGDB: Use kernel context for sleeping threads
    ALSA: seq: Don't break snd_use_lock_sync() loop by timeout
    ALSA: firewire-lib: fix inappropriate assignment between signed/unsigned type
    ipv6: check raw payload size correctly in ioctl
    ipv6: check skb->protocol before lookup for nexthop
    macvlan: Fix device ref leak when purging bc_queue
    ip6mr: fix notification device destruction
    netpoll: Check for skb->queue_mapping
    net: ipv6: RTF_PCPU should not be settable from userspace
    dp83640: don't recieve time stamps twice
    tcp: clear saved_syn in tcp_disconnect()
    sctp: listen on the sock only when it's state is listening or closed
    net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given
    l2tp: fix PPP pseudo-wire auto-loading
    l2tp: take reference on sessions being dumped
    net/packet: fix overflow in check for tp_reserve
    net/packet: fix overflow in check for tp_frame_nr
    l2tp: purge socket queues in the .destruct() callback
    net: phy: handle state correctly in phy_stop_machine
    net: neigh: guard against NULL solicit() method
    sparc64: Fix kernel panic due to erroneous #ifdef surrounding pmd_write()
    sparc64: kern_addr_valid regression
    xen/x86: don't lose event interrupts
    usb: gadget: f_midi: Fixed a bug when buflen was smaller than wMaxPacketSize
    regulator: core: Clear the supply pointer if enabling fails
    RDS: Fix the atomicity for congestion map update
    net_sched: close another race condition in tcf_mirred_release()
    net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata
    MIPS: Fix crash registers on non-crashing CPUs
    md:raid1: fix a dead loop when read from a WriteMostly disk
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
    drm/amdgpu: fix array out of bounds
    crypto: testmgr - fix out of bound read in __test_aead()
    clk: sunxi: Add apb0 gates for H3
    ARM: OMAP2+: timer: add probe for clocksources
    xc2028: unlock on error in xc2028_set_config()
    f2fs: do more integrity verification for superblock
Linux 4.4.65
    perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
    ping: implement proper locking
    staging/android/ion : fix a race condition in the ion driver
    vfio/pci: Fix integer overflows, bitmask check
    tipc: check minimum bearer MTU
    netfilter: nfnetlink: correctly validate length of batch messages
    xc2028: avoid use after free
    mnt: Add a per mount namespace limit on the number of mounts
    tipc: fix socket timer deadlock
    tipc: fix random link resets while adding a second bearer
    gfs2: avoid uninitialized variable warning
    hostap: avoid uninitialized variable use in hfa384x_get_rid
    tty: nozomi: avoid a harmless gcc warning
    tipc: correct error in node fsm
    tipc: re-enable compensation for socket receive buffer double counting
    tipc: make dist queue pernet
    tipc: make sure IPv6 header fits in skb headroom
Linux 4.4.64
    tipc: fix crash during node removal
    block: fix del_gendisk() vs blkdev_ioctl crash
    x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
    hv: don't reset hv_context.tsc_page on crash
    Drivers: hv: balloon: account for gaps in hot add regions
    Drivers: hv: balloon: keep track of where ha_region starts
    Tools: hv: kvp: ensure kvp device fd is closed on exec
    kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd
    x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
    powerpc/kprobe: Fix oops when kprobed on 'stdu' instruction
    ubi/upd: Always flush after prepared for an update
    mac80211: reject ToDS broadcast data frames
    mmc: sdhci-esdhc-imx: increase the pad I/O drive strength for DDR50 card
    ACPI / power: Avoid maybe-uninitialized warning
    Input: elantech - add Fujitsu Lifebook E547 to force crc_enabled
    VSOCK: Detach QP check should filter out non matching QPs.
    Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg()
    Drivers: hv: get rid of timeout in vmbus_open()
    Drivers: hv: don't leak memory in vmbus_establish_gpadl()
    s390/mm: fix CMMA vs KSM vs others
    CIFS: remove bad_network_name flag
    cifs: Do not send echoes before Negotiate is complete
    ring-buffer: Have ring_buffer_iter_empty() return true when empty
    tracing: Allocate the snapshot buffer before enabling probe
    KEYS: fix keyctl_set_reqkey_keyring() to not leak thread keyrings
    KEYS: Change the name of the dead type to ".dead" to prevent user access
    KEYS: Disallow keyrings beginning with '.' to be joined as session keyrings
Linux 4.4.63
    MIPS: fix Select HAVE_IRQ_EXIT_ON_IRQ_STACK patch.
    sctp: deny peeloff operation on asocs with threads sleeping on it
    net: ipv6: check route protocol when deleting routes
    tty/serial: atmel: RS485 half duplex w/DMA: enable RX after TX is done
    SUNRPC: fix refcounting problems with auth_gss messages.
    ibmveth: calculate gso_segs for large packets
    catc: Use heap buffer for memory size test
    catc: Combine failure cleanup code in catc_probe()
    rtl8150: Use heap buffers for all register access
    pegasus: Use heap buffers for all register access
    virtio-console: avoid DMA from stack
    dvb-usb-firmware: don't do DMA on stack
    dvb-usb: don't use stack for firmware load
    mm: Tighten x86 /dev/mem with zeroing reads
    rtc: tegra: Implement clock handling
    platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
    ext4: fix inode checksum calculation problem if i_extra_size is small
    dvb-usb-v2: avoid use-after-free
    ath9k: fix NULL pointer dereference
    crypto: ahash - Fix EINPROGRESS notification callback
    powerpc: Disable HFSCR[TM] if TM is not supported
    zram: do not use copy_page with non-page aligned address
    kvm: fix page struct leak in handle_vmon
    Revert "MIPS: Lantiq: Fix cascaded IRQ setup"
    char: lack of bool string made CONFIG_DEVPORT always on
    char: Drop bogus dependency of DEVPORT on !M68K
    ftrace: Fix removing of second function probe
    irqchip/irq-imx-gpcv2: Fix spinlock initialization
    libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
    xen, fbfront: fix connecting to backend
    scsi: sd: Fix capacity calculation with 32-bit sector_t
    scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
    scsi: sr: Sanity check returned mode data
    iscsi-target: Drop work-around for legacy GlobalSAN initiator
    iscsi-target: Fix TMR reference leak during session shutdown
    acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
    x86/vdso: Plug race between mapping and ELF header setup
    x86/vdso: Ensure vdso32_enabled gets set to valid values only
    perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
    Input: xpad - add support for Razer Wildcat gamepad
    CIFS: store results of cifs_reopen_file to avoid infinite wait
    drm/nouveau/mmu/nv4a: use nv04 mmu rather than the nv44 one
    drm/nouveau/mpeg: mthd returns true on success now
    thp: fix MADV_DONTNEED vs clear soft dirty race
    cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Linux 4.4.62
    ibmveth: set correct gso_size and gso_type
    net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
    net/mlx4_core: Fix racy CQ (Completion Queue) free
    net/mlx4_en: Fix bad WQE issue
    usb: hub: Wait for connection to be reestablished after port reset
    blk-mq: Avoid memory reclaim when remapping queues
    net/packet: fix overflow in check for priv area size
    crypto: caam - fix RNG deinstantiation error checking
    MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch
    MIPS: Select HAVE_IRQ_EXIT_ON_IRQ_STACK
    MIPS: Switch to the irq_stack in interrupts
    MIPS: Only change $28 to thread_info if coming from user mode
    MIPS: Stack unwinding while on IRQ stack
    MIPS: Introduce irq_stack
    mtd: bcm47xxpart: fix parsing first block after aligned TRX
    usb: dwc3: gadget: delay unmap of bounced requests
    drm/i915: Stop using RP_DOWN_EI on Baytrail
    drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
Linux 4.4.61
    mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
    MIPS: Flush wrong invalid FTLB entry for huge page
    MIPS: Lantiq: fix missing xbar kernel panic
    MIPS: End spinlocks with .insn
    MIPS: ralink: Fix typos in rt3883 pinctrl
    MIPS: Force o32 fp64 support on 32bit MIPS64r6 kernels
    s390/uaccess: get_user() should zero on failure (again)
    s390/decompressor: fix initrd corruption caused by bss clear
    nios2: reserve boot memory for device tree
    powerpc: Don't try to fix up misaligned load-with-reservation instructions
    powerpc/mm: Add missing global TLB invalidate if cxl is active
    metag/usercopy: Add missing fixups
    metag/usercopy: Fix src fixup in from user rapf loops
    metag/usercopy: Set flags before ADDZ
    metag/usercopy: Zero rest of buffer from copy_from_user
    metag/usercopy: Add early abort to copy_to_user
    metag/usercopy: Fix alignment error checking
    metag/usercopy: Drop unused macros
    ring-buffer: Fix return value check in test_ringbuffer()
    ptrace: fix PTRACE_LISTEN race corrupting task->state
    Reset TreeId to zero on SMB2 TREE_CONNECT
    iio: bmg160: reset chip when probing
    arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
    arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
    staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
    sysfs: be careful of error returns from ops->show()
    drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
    drm/vmwgfx: Remove getparam error message
    drm/ttm, drm/vmwgfx: Relax permission checking when opening surfaces
    drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
    drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
    drm/vmwgfx: Type-check lookups of fence objects

Bug: 62730977
Change-Id: I4458200bbc977cf55a134fd9fd08627604e36d95
Signed-off-by: Thierry Strudel <tstrudel@google.com>
2017-09-20 15:50:18 -07:00
Eric W. Biederman
c50fd34e10 mnt: Add a per mount namespace limit on the number of mounts
commit d29216842a85c7970c536108e093963f02714498 upstream.

CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

This will create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.
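
The arithmetic behind that choice can be sketched with a small userspace
model. It assumes, as a simplification, that the count starts at 1 and that
every bind mount in the loop above doubles it. The kernel's real accounting
is more involved, and the function name here is hypothetical.

```c
#include <assert.h>

/*
 * Simplified model: the mount count starts at 1 and each bind mount doubles
 * it. A limit of 0 means "no limit". A mount that would push the count past
 * the limit fails and leaves the count unchanged.
 */
unsigned long long model_mounts_after(unsigned int iterations,
                                      unsigned long long limit)
{
    unsigned long long count = 1;

    for (unsigned int i = 0; i < iterations; i++) {
        unsigned long long next = count * 2;

        if (limit != 0 && next > limit)
            break; /* every later attempt would fail the same way */
        count = next;
    }
    return count;
}

/* Self-check for the two cases discussed in the commit message. */
void model_mounts_self_check(void)
{
    /* Without a limit, 20 iterations give 2^20 mounts. */
    assert(model_mounts_after(20, 0) == 1048576ULL);

    /* With a limit of 100,000 the doubling stops at 2^16. */
    assert(model_mounts_after(20, 100000ULL) == 65536ULL);
}
```

Under this model the runaway loop is stopped after 16 doublings, well before
it reaches the 2^20 mounts of the unlimited case.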

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.
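
A minimal sketch of reading the current limit from userspace follows. The
helper name read_mount_limit is hypothetical. The only thing taken from this
commit is the sysctl path, which the caller passes in so that the helper can
also be pointed at an ordinary file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads a single unsigned decimal value from the file at path.
 * Returns 0 and stores the value in *limit on success, or -1 if the file
 * cannot be opened or does not start with a number.
 */
int read_mount_limit(const char *path, unsigned long *limit)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = fscanf(f, "%lu", limit);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}

/* Failure path: a missing file is reported as -1. */
void read_mount_limit_self_check(void)
{
    unsigned long value = 0;

    assert(read_mount_limit("/nonexistent-dir/mount-max", &value) == -1);
}
```

On a kernel that carries this change, calling
read_mount_limit("/proc/sys/fs/mount-max", &value) should return 0 and store
the configured limit. On a kernel without it the call returns -1.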

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-30 05:49:28 +02:00
Daniel Rosenberg
2357b85345 ANDROID: mnt: Add filesystem private data to mount points
This starts to add private data associated directly
to mount points. The intent is to give filesystems
a sense of where they have come from, as a means of
letting a filesystem take different actions based on
this information.

Change-Id: Ie769d7b3bb2f5972afe05c1bf16cf88c91647ab2
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-01-10 10:43:26 -08:00
Linus Torvalds
8f502d5b9e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull usernamespace mount fixes from Eric Biederman:
 "Way back in October Andrey Vagin reported that umount(MNT_DETACH)
  could be used to defeat MNT_LOCKED.  As I worked to fix this I
  discovered that combined with mount propagation and an appropriate
  selection of shared subtrees a reference to a directory on an
  unmounted filesystem is not necessary.

  That MNT_DETACH is allowed in user namespace in a form that can break
  MNT_LOCKED comes from my early misunderstanding of what MNT_DETACH does.

  To avoid breaking existing userspace the conflict between MNT_DETACH
  and MNT_LOCKED is fixed by leaving mounts that are locked to their
  parents in the mount hash table until the last reference goes away.

  While investigating this issue I also found an issue with
  __detach_mounts.  The code was unnecessarily and incorrectly
  triggering mount propagation.  Resulting in too many mounts going away
  when a directory is deleted, and too many cpu cycles are burned while
  doing that.

  Looking some more I realized that __detach_mounts by only keeping
  mounts connected that were MNT_LOCKED it had the potential to still
  leak information so I tweaked the code to keep everything locked
  together that possibly could be.

  This code was almost ready last cycle but Al invented fs_pin which
  slightly simplifies this code but required rewrites and retesting, and
  I have not been in top form for a while so it took me a while to get
  all of that done.  Similarly this pull request is late because I have
  been feeling absolutely miserable all week.

  The issue of being able to escape a bind mount has not yet been
  addressed, as the fixes are not yet mature"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  mnt: Update detach_mounts to leave mounts connected
  mnt: Fix the error check in __detach_mounts
  mnt: Honor MNT_LOCKED when detaching mounts
  fs_pin: Allow for the possibility that m_list or s_list go unused.
  mnt: Factor umount_mnt from umount_tree
  mnt: Factor out unhash_mnt from detach_mnt and umount_tree
  mnt: Fail collect_mounts when applied to unmounted mounts
  mnt: Don't propagate unmounts to locked mounts
  mnt: On an unmount propagate clearing of MNT_LOCKED
  mnt: Delay removal from the mount hash.
  mnt: Add MNT_UMOUNT flag
  mnt: In umount_tree reuse mnt_list instead of mnt_hash
  mnt: Don't propagate umounts in __detach_mounts
  mnt: Improve the umount_tree flags
  mnt: Use hlist_move_list in namespace_unlock
2015-04-18 11:20:31 -04:00
Dan Ehrenberg
e6e20a7a5f init: export name_to_dev_t and mark name argument as const
DM will switch its device lookup code to using name_to_dev_t() so it
must be exported.  Also, the @name argument should be marked const.

Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:18 -04:00
Eric W. Biederman
590ce4bcbf mnt: Add MNT_UMOUNT flag
In some instances it is necessary to know if the unmounting
process has begun on a mount.  Add MNT_UMOUNT to make that reliably
testable.

This fix gets used in fixing locked mounts in MNT_DETACH

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-04-02 20:34:18 -05:00
Miklos Szeredi
c771d683a6 vfs: introduce clone_private_mount()
Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24 00:14:36 +02:00
Linus Torvalds
f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
Al Viro
3064c3563b death to mnt_pinned
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone.  Then attach the pin to original
vfsmount.  Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out.  Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.

mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Eric W. Biederman
9566d67428 mnt: Correct permission checks in do_remount
While investigating the issue wherein "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.

In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked.  These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changeable if set by a more privileged user.

The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev  may not be clearable by a less privileged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.

The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled.  Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.

The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.

Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.
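
The restrictions listed above can be sketched as a single check, shown
here as a minimal userspace model rather than the kernel code.  The
SK_* names, their bit values, and sk_remount_allowed() are all
assumptions made for illustration; only the rule itself (a locked flag
may not be cleared, and locked atime settings may not change) comes
from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values for this sketch only. */
#define SK_NOSUID        0x01u
#define SK_NODEV         0x02u
#define SK_NOEXEC        0x04u
#define SK_NOATIME       0x08u
#define SK_NODIRATIME    0x10u
#define SK_RELATIME      0x20u
#define SK_READONLY      0x40u
#define SK_ATIME_MASK    (SK_NOATIME | SK_NODIRATIME | SK_RELATIME)

/* Hypothetical lock bits, one per restriction described in the text. */
#define SK_LOCK_ATIME    0x040000u
#define SK_LOCK_NOEXEC   0x080000u
#define SK_LOCK_NOSUID   0x100000u
#define SK_LOCK_NODEV    0x200000u
#define SK_LOCK_READONLY 0x400000u

/*
 * Return true if a remount from flags `cur` to requested flags `req`
 * respects every lock bit set in `cur`.
 */
static bool sk_remount_allowed(unsigned int cur, unsigned int req)
{
    /* A locked restriction must still be present in the request. */
    if ((cur & SK_LOCK_READONLY) && !(req & SK_READONLY))
        return false;
    if ((cur & SK_LOCK_NOSUID) && !(req & SK_NOSUID))
        return false;
    if ((cur & SK_LOCK_NODEV) && !(req & SK_NODEV))
        return false;
    if ((cur & SK_LOCK_NOEXEC) && !(req & SK_NOEXEC))
        return false;
    /* Locked atime settings may not change in either direction. */
    if ((cur & SK_LOCK_ATIME) &&
        (cur & SK_ATIME_MASK) != (req & SK_ATIME_MASK))
        return false;
    return true;
}
```

In this model a mount with SK_NOSUID | SK_LOCK_NOSUID accepts a request
that adds SK_NOEXEC but rejects one that omits SK_NOSUID.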

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:12:34 -07:00
Eric W. Biederman
a6138db815 mnt: Only change user settable mount flags in remount
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to remount a read-only mount read-write.

Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others.   This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
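
A minimal sketch of the "mask of flags that may be changed" idea
follows.  It is a userspace model, not the kernel code: the SK_* names,
their values, the contents of SK_USER_SETTABLE, and sk_remount_flags()
are assumptions chosen for illustration.

```c
#include <assert.h>

/* Illustrative bit values for this sketch only. */
#define SK_NOSUID        0x01u
#define SK_NODEV         0x02u
#define SK_NOEXEC        0x04u
#define SK_READONLY      0x40u
#define SK_LOCK_READONLY 0x400000u

/* Hypothetical mask: the only bits a remount request may alter. */
#define SK_USER_SETTABLE (SK_NOSUID | SK_NODEV | SK_NOEXEC | SK_READONLY)

/*
 * Take the user-settable bits from the request and keep every other
 * bit of the current flags untouched.  A flag added later and left out
 * of the mask simply never changes on remount.
 */
static unsigned int sk_remount_flags(unsigned int cur, unsigned int req)
{
    return (req & SK_USER_SETTABLE) | (cur & ~SK_USER_SETTABLE);
}
```

With this shape, a request cannot clear SK_LOCK_READONLY because that
bit is outside the settable mask, which is the failure mode the
preserve-mask approach allowed.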

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:11:54 -07:00
Al Viro
f2ebb3a921 smarter propagate_mnt()
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporarily) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M_{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives noticeably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:08 -04:00
Al Viro
48a066e72d RCU'd vfsmounts
* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode.  We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock.  It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:19 -05:00
Eric W. Biederman
5ff9d8a65c vfs: Lock in place mounts from more privileged users
When creating a less privileged mount namespace or propagating mounts
from a more privileged to a less privileged mount namespace lock the
submounts so they may not be unmounted individually in the child mount
namespace revealing what is under them.

This enforces the reasonable expectation that it is not possible to
see under a mount point.  Most of the time mounts are on empty
directories and revealing that does not matter, however I have seen an
occasionally sloppy configuration where there were interesting things
concealed under a mount point that probably should not be revealed.

Expirable submounts are not locked because they will eventually
unmount automatically so whatever is under them already needs
to be safe for unprivileged users to access.

From a practical standpoint these restrictions do not appear to be
significant for unprivileged users of the mount namespace.  Recursive
bind mounts and pivot_root continue to work, and mounts that are
created in a mount namespace may be unmounted there.  All of which
means that the common idiom of keeping a directory of interesting
files and using pivot_root to throw everything else away continues to
work just fine.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-07-24 09:14:46 -07:00
Eric W. Biederman
90563b198e vfs: Add a mount flag to lock read only bind mounts
When a read-only bind mount is copied from mount namespace in a higher
privileged user namespace to a mount namespace in a lesser privileged
user namespace, it should not be possible to remove the read-only
restriction.

Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
remain read-only.

CC: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-27 07:50:04 -07:00
Al Viro
c63181e6b6 vfs: move fsnotify junk to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:12 -05:00
Al Viro
52ba1621de vfs: move mnt_devname
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:11 -05:00
Al Viro
1a4eeaf2a8 vfs: move mnt_list to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:11 -05:00
Al Viro
863d684f94 vfs: move the rest of int fields to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:10 -05:00
Al Viro
15169fe784 vfs: mnt_id/mnt_group_id moved
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:10 -05:00
Al Viro
143c8c91ce vfs: mnt_ns moved to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:09 -05:00
Al Viro
6776db3d32 vfs: take mnt_share/mnt_slave/mnt_slave_list and mnt_expire to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:08 -05:00
Al Viro
d10e8def07 vfs: take mnt_master to struct mount
make IS_MNT_SLAVE take struct mount * at the same time

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:08 -05:00
Al Viro
6b41d536f7 vfs: take mnt_child/mnt_mounts to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:06 -05:00
Al Viro
68e8a9feab vfs: all counters taken to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:06 -05:00
Al Viro
a73324da7a vfs: move mnt_mountpoint to struct mount
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:05 -05:00
Al Viro
3376f34fff vfs: mnt_parent moved to struct mount
the second victim...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:04 -05:00
Al Viro
1b8e5564b9 vfs: the first spoils - mnt_hash moved
taken out of struct vfsmount into struct mount

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:57:02 -05:00
Al Viro
2a79f17e4a vfs: mnt_drop_write_file()
new helper (wrapper around mnt_drop_write()) to be used in pair with
mnt_want_write_file().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Al Viro
79e801a906 vfs: make do_kern_mount() static
the only user outside of fs/namespace.c has died

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:39 -05:00
Arun Sharma
60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Al Viro
f03c65993b sanitize vfsmount refcounting changes
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL ->mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-16 13:47:07 -05:00
David Howells
ea5b778a8b Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself.  follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer.  The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2.  One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used.  The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:48 -05:00
Nick Piggin
b3e19d924b fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and these lookups often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
  for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only take the global sum when local reaches 0 (this
  is difficult for vfsmounts, because we can't hold preempt off for the life of
  a reference, so a counter would need to be per-thread or tied strongly to a
  particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
  the total difference and hence find the refcount when summing all CPUs. Then,
  keep a single integer "long" refcount for slow and long lasting references,
  and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
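
The scheme can be sketched in a few lines.  This is a single-threaded
userspace model written for illustration only: struct sk_mount,
SK_NR_CPUS, and the sk_* helpers are hypothetical names, and real
per-CPU storage and locking are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_NR_CPUS 4  /* assumed CPU count for the model */

struct sk_mount {
    int percpu_delta[SK_NR_CPUS]; /* increments minus decrements per CPU */
    int longrefs;                 /* "long" references */
};

/* Short reference: touch only the local CPU's counter. */
static void sk_mntget(struct sk_mount *m, int cpu)
{
    m->percpu_delta[cpu]++;
}

/* Sum of all per-CPU differences, i.e. the short reference count. */
static int sk_short_total(const struct sk_mount *m)
{
    int sum = 0;
    for (int i = 0; i < SK_NR_CPUS; i++)
        sum += m->percpu_delta[i];
    return sum;
}

/*
 * Drop a short reference.  Returns true if it was the last reference.
 * The global sum is only taken when no long references remain.
 */
static bool sk_mntput(struct sk_mount *m, int cpu)
{
    m->percpu_delta[cpu]--;
    if (m->longrefs > 0)
        return false;             /* common case: cannot be the last */
    return sk_short_total(m) == 0;
}
```

Note that a reference taken on one CPU and dropped on another leaves
individual counters non-zero (even negative); only their sum is
meaningful.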

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:33 +11:00
Miklos Szeredi
532490f0a5 vfs: remove unused MNT_STRICTATIME
Commit d0adde574b added MNT_STRICTATIME
but it isn't actually used (MS_STRICTATIME clears MNT_RELATIME and
MNT_NOATIME rather than setting any mount flag).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-11 00:29:47 -04:00
Andreas Gruenbacher
2504c5d63b fsnotify/vfsmount: add fsnotify fields to struct vfsmount
This patch adds the list and mask fields needed to support vfsmount marks.
These are the same fields fsnotify needs on an inode.  They are not used,
just declared and we note where the cleanup hook should be (the function is
not yet defined)

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Linus Torvalds
0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Al Viro
8089352a13 Mirror MS_KERNMOUNT in ->mnt_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
47cd813f29 Take vfsmount_lock to fs/internal.h
no more users left outside of fs/*.c (and very few outside of
fs/namespace.c, actually)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Valerie Aurora
495d6c9c65 VFS: Clean up shared mount flag propagation
The handling of mount flags in set_mnt_shared() got a little tangled
up during previous cleanups, with the following problems:

* MNT_PNODE_MASK is defined as a literal constant when it should be a
bitwise OR of other MNT_* flags
* set_mnt_shared() clears and then sets MNT_SHARED (part of MNT_PNODE_MASK)
* MNT_PNODE_MASK could use a comment in mount.h
* MNT_PNODE_MASK is a terrible name, change to MNT_SHARED_MASK

This patch fixes these problems.
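
As a rough illustration of the cleanup, the masks can be built from the
individual flags, and marking a mount shared then only clears what it
must.  The SK_* names and values and sk_set_shared() are assumptions
for this sketch, not the definitions in mount.h.

```c
#include <assert.h>

/* Illustrative bit values for this sketch only. */
#define SK_SHARED     0x1000u
#define SK_UNBINDABLE 0x2000u

/* Flags to clear when a mount becomes shared (assumed contents). */
#define SK_SHARED_MASK      (SK_UNBINDABLE)
/* All propagation-related flags, composed instead of a literal. */
#define SK_PROPAGATION_MASK (SK_SHARED | SK_UNBINDABLE)

/* Mark a mount shared without first clearing SK_SHARED itself. */
static unsigned int sk_set_shared(unsigned int flags)
{
    flags &= ~SK_SHARED_MASK;
    return flags | SK_SHARED;
}
```

Composing the masks from named flags keeps them correct if a flag's
value changes, which a literal constant would not.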

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Tejun Heo
003cb608a2 percpu: add __percpu sparse annotations to fs
Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-02-17 11:17:38 +09:00
npiggin@suse.de
96029c4e09 fs: introduce mnt_clone_write
This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
 avg = 462.286
 std = 5.46106

After:
 avg = 453.12
 std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.
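
The split between the two paths might look like the following sketch.
It is a simplified userspace model under stated assumptions: the sk_*
types and functions are hypothetical, a plain integer stands in for the
write count, and the "heavyweight operations" are reduced to a
read-only check.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_mount {
    int  writers;   /* outstanding write references */
    bool readonly;
};

struct sk_file {
    struct sk_mount *mnt;
    bool opened_for_write;
};

/* Full path: must verify that the mount accepts writers. */
static int sk_want_write(struct sk_mount *m)
{
    if (m->readonly)
        return -1;
    m->writers++;
    return 0;
}

/* Cheap path: caller guarantees a write count is already held. */
static int sk_clone_write(struct sk_mount *m)
{
    assert(m->writers > 0);
    m->writers++;
    return 0;
}

/* A file opened for write already pins a write count on its mount. */
static int sk_want_write_file(struct sk_file *f)
{
    if (f->opened_for_write)
        return sk_clone_write(f->mnt);
    return sk_want_write(f->mnt);
}
```

The file variant takes only the file and derives the mount from it,
matching the note in the commit message that this is the one mount
known to be held for write.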

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
d3ef3d7351 fs: mnt_want_write speedup
This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark yes, but it exercises some important paths in the mm.

Before:
 avg = 501.9
 std = 14.7773

After:
 avg = 462.286
 std = 5.46106

(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)

It does this by removing the complex per-cpu locking and counter-cache and
replacing them with a percpu counter in struct vfsmount. This makes the code
much simpler, and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It results in about 900 bytes smaller code too. It
does increase the size of a vfsmount, however.

It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single threaded path performance for the moment.

[AV: minor cleanup]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
Matthew Garrett
d0adde574b Add a strictatime mount option
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26 10:56:35 -07:00
Adrian Bunk
693ac38932 include/linux/mount.h: remove CVS keyword
Remove a CVS keyword that wasn't updated for a long time from a comment.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:30 -07:00
Al Viro
8d66bf5481 [PATCH] pass struct path * to do_add_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:32 -04:00
Li Zefan
88b387824f [PATCH] vfs: use kstrdup() and check failing allocation
- use kstrdup() instead of kmalloc() + memcpy()
- return NULL if allocating ->mnt_devname failed
- mnt_devname should be const

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26 20:53:24 -04:00
Robert P. J. Day
735643ee6c Remove "#ifdef __KERNEL__" checks from unexported headers
Remove the "#ifdef __KERNEL__" tests from unexported header files in
linux/include whose entire contents are wrapped in that preprocessor
test.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:54 -07:00
Miklos Szeredi
719f5d7f0b [patch 4/7] vfs: mountinfo: add mount peer group ID
Add a unique ID to each peer group using the IDR infrastructure.  The
identifiers are reused after the peer group dissolves.

The IDR structures are protected by holding namespace_sem for write
while allocating or deallocating IDs.

IDs are allocated when a previously unshared vfsmount becomes the
first member of a peer group.  When a new member is added to an
existing group, the ID is copied from one of the old members.

IDs are freed when the last member of a peer group is unshared.

Setting the MNT_SHARED flag on members of a subtree is done as a
separate step, after all the IDs have been allocated.  This way an
allocation failure can be cleaned up easily, without affecting the
propagation state.
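
The ID lifecycle described above can be modelled as follows.  This is a
sketch only: a small member-count table stands in for the IDR
infrastructure, locking is omitted, and every sk_* name is hypothetical.

```c
#include <assert.h>

#define SK_MAX_GROUPS 8  /* assumed table size for the model */

/* Member count per group ID; ID 0 means "not in a peer group". */
static int sk_group_members[SK_MAX_GROUPS + 1];

struct sk_mount {
    int group_id;
};

/* First member of a new group: allocate the lowest free ID. */
static int sk_make_shared(struct sk_mount *m)
{
    if (m->group_id)
        return 0;                 /* already shared */
    for (int id = 1; id <= SK_MAX_GROUPS; id++) {
        if (sk_group_members[id] == 0) {
            sk_group_members[id] = 1;
            m->group_id = id;
            return 0;
        }
    }
    return -1;                    /* allocation failure */
}

/* New member of an existing group: copy the ID from a peer. */
static void sk_join_group(struct sk_mount *m, const struct sk_mount *peer)
{
    assert(peer->group_id != 0);
    m->group_id = peer->group_id;
    sk_group_members[m->group_id]++;
}

/* Leaving the group; the ID becomes reusable once the count hits 0. */
static void sk_unshare(struct sk_mount *m)
{
    if (!m->group_id)
        return;
    sk_group_members[m->group_id]--;
    m->group_id = 0;
}
```

Once the last member of a group is unshared, a later sk_make_shared()
call may hand out the same ID again, mirroring the reuse described in
the text.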

Based on design sketch by Al Viro.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:51 -04:00