Changes v5 => v6:
- Blk-crypto's kernel crypto API fallback is no longer restricted to
8-byte DUNs. It's also now separately configurable from blk-crypto, and
can be disabled entirely, while still allowing the kernel to use inline
encryption hardware. Further, struct bio_crypt_ctx takes up less space,
and no longer contains the information needed by the crypto API
fallback - the fallback allocates the required memory when necessary.
- Blk-crypto now supports all file content encryption modes supported by
fscrypt.
- Fixed bio merging logic in blk-merge.c
- Fscrypt now supports inline encryption with the direct key policy, since
blk-crypto now has support for larger DUNs.
- Keyslot manager now uses a hashtable to lookup which keyslot contains
any particular key (thanks Eric!)
- Fscrypt support for inline encryption now handles filesystems with
multiple underlying block devices (thanks Eric!)
- Numerous cleanups
Bug: 137270441
Test: refer to I26376479ee38259b8c35732cb3a1d7e15f9b05a3
Change-Id: I13e2e327e0b4784b394cb1e7cf32a04856d95f01
Link: https://lore.kernel.org/linux-block/20191218145136.172774-1-satyat@google.com/
Signed-off-by: Satya Tangirala <satyat@google.com>
(cherry picked from commit b01c73ea71)
(dropped change to abi_gki_aarch64.xml)
(changed ufshcd_release(hba) to ufshcd_release(hba, false))
Wire up ext4 to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for
this part, since ext4 sometimes uses ll_rw_block() on file data.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
Bug: 137270441
Test: tested as series; see I26aac0ac7845a9064f28bb1421eb2522828a6dec
Change-Id: I54a8efe388289918f4144d8138fb87aa507ae760
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Link: https://patchwork.kernel.org/patch/11214781/
(cherry picked from commit f2ca2620dd)
* refs/heads/tmp-d885da6:
Revert "coresight: etm4x: Add support to enable ETMv4.2"
Revert "usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded"
Linux 4.19.34
kprobes/x86: Blacklist non-attachable interrupt functions
bcache: fix potential div-zero error of writeback_rate_p_term_inverse
ACPI / video: Extend chassis-type detection with a "Lunch Box" check
net: stmmac: Avoid one more sometimes uninitialized Clang warning
drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
dmaengine: tegra: avoid overflow of byte tracking
clk: rockchip: fix frac settings of GPLL clock for rk3328
clk: meson: clean-up clock registration
drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
x86/build: Mark per-CPU symbols as absolute explicitly for LLD
wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
brcmfmac: Use firmware_request_nowarn for the clm_blob
selinux: do not override context on context mounts
x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
drm/nouveau: Stop using drm_crtc_force_disable
drm: Auto-set allow_fb_modifiers when given modifiers at plane init
pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
media: rcar-vin: Allow independent VIN link enablement
netfilter: physdev: relax br_netfilter dependency
dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
dmaengine: qcom_hidma: assign channel cookie correctly
dmaengine: imx-dma: fix warning comparison of distinct pointer types
cpu/hotplug: Mute hotplug lockdep during init
hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
f2fs: UBSAN: set boolean value iostat_enable correctly
HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
soc/tegra: fuse: Fix illegal free of IO base address
hwrng: virtio - Avoid repeated init of completion
media: mt9m111: set initial frame size other than 0x0
perf script python: Add trace_context extension module to sys.modules
perf script python: Use PyBytes for attr in trace-event-python
platform/x86: intel-hid: Missing power button release on some Dell models
usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
ALSA: dice: add support for Solid State Logic Duende Classic/Mini
drm/amd/display: Enable vblank interrupt during CRC capture
powerpc/pseries: Perform full re-add of CPU for topology update post-migration
tty: increase the default flip buffer limit to 2*640K
backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
powerpc/64s: Clear on-stack exception marker upon exception return
selftests/bpf: skip verifier tests for unsupported program types
bpf: fix missing prototype warnings
block, bfq: fix in-service-queue check for queue merging
ARM: avoid Cortex-A9 livelock on tight dmb loops
ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
mt7601u: bump supported EEPROM version
soc: qcom: gsbi: Fix error handling in gsbi_probe()
efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
drm/vkms: Bugfix extra vblank frame
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
efi/memattr: Don't bail on zero VA if it equals the region's PA
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
iwlwifi: mvm: fix RFH config command with >=10 CPUs
staging: spi: mt7621: Add return code check on device_reset()
i2c: of: Try to find an I2C adapter matching the parent
platform/x86: intel_pmc_core: Fix PCH IP sts reading
e1000e: Exclude device from suspend direct complete optimization
e1000e: fix cyclic resets at link up with active tx
perf/aux: Make perf_event accessible to setup_aux()
drm/amd/display: Disconnect mpcc when changing tg
drm/amd/display: Don't re-program planes for DPMS changes
drm: rcar-du: add missing of_node_put
cdrom: Fix race condition in cdrom_sysctl_register
fbdev: fbmem: fix memory access if logo is bigger than the screen
net: phy: consider latched link-down status in polling mode
iw_cxgb4: fix srqidx leak during connection abort
net: marvell: mvpp2: fix stuck in-band SGMII negotiation
genirq: Avoid summation loops for /proc/stat
bcache: improve sysfs_strtoul_clamp()
bcache: fix potential div-zero error of writeback_rate_i_term_inverse
bcache: fix input overflow to sequential_cutoff
bcache: fix input overflow to cache set sysfs file io_error_halflife
sched/topology: Fix percpu data types in struct sd_data & struct s_data
usb: f_fs: Avoid crash due to out-of-scope stack ptr access
ath10k: fix shadow register implementation for WCN3990
ALSA: PCM: check if ops are defined before suspending PCM
ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
ARM: 8833/1: Ensure that NEON code always compiles with Clang
netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
kprobes: Prohibit probing on RCU debug routine
kprobes: Prohibit probing on bsearch()
selftests: skip seccomp get_metadata test if not real root
ACPI / video: Refactor and fix dmi_is_desktop()
iwlwifi: pcie: fix emergency path
perf report: Add s390 diagnosic sampling descriptor size
leds: lp55xx: fix null deref on firmware load failure
jbd2: fix race when writing superblock
cgroup, rstat: Don't flush subtree root unless necessary
HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
xen/gntdev: Do not destroy context while dma-bufs are in use
mt76: usb: do not run mt76u_queues_deinit twice
media: mtk-jpeg: Correct return type for mem2mem buffer helpers
media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
media: s5p-g2d: Correct return type for mem2mem buffer helpers
media: rockchip/rga: Correct return type for mem2mem buffer helpers
media: s5p-jpeg: Correct return type for mem2mem buffer helpers
media: sh_veu: Correct return type for mem2mem buffer helpers
media: ov7740: fix runtime pm initialization
SoC: imx-sgtl5000: add missing put_device()
perf report: Don't shadow inlined symbol with different addr range
mwifiex: don't advertise IBSS features without FW support
perf test: Fix failure of 'evsel-tp-sched' test on s390
drm/amd/display: Clear stream->mode_changed after commit
scsi: fcoe: make use of fip_mode enum complete
scsi: megaraid_sas: return error when create DMA pool failed
s390/ism: ignore some errors during deregistration
efi: cper: Fix possible out-of-bounds access
cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
perf annotate: Fix getting source line failure
clk: fractional-divider: check parent rate only if flag is set
IB/mlx4: Increase the timeout for CM cache
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
platform/mellanox: mlxreg-hotplug: Fix KASAN warning
platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
mlxsw: spectrum: Avoid -Wformat-truncation warnings
e1000e: Fix -Wformat-truncation warnings
net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
mmc: omap: fix the maximum timeout setting
btrfs: qgroup: Make qgroup async transaction commit more aggressive
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
ARM: 8840/1: use a raw_spinlock_t in unwind
serial: 8250_pxa: honor the port number from devicetree
coresight: etm4x: Add support to enable ETMv4.2
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
usb: chipidea: Grab the (legacy) USB PHY by phandle first
crypto: cavium/zip - fix collision with generic cra_driver_name
crypto: crypto4xx - add missing of_node_put after of_device_is_available
mt76: fix a leaked reference by adding a missing of_node_put
wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
tools lib traceevent: Fix buffer overflow in arg_eval
fs: fix guard_bio_eod to check for real EOD errors
jbd2: fix invalid descriptor block checksum
netfilter: conntrack: tcp: only close if RST matches exact sequence
netfilter: nf_tables: check the result of dereferencing base_chain->stats
cifs: Fix NULL pointer dereference of devname
cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
f2fs: fix to check inline_xattr_size boundary correctly
dm thin: add sanity checks to thin-pool and external snapshot creation
cifs: use correct format characters
page_poison: play nicely with KASAN
fs/file.c: initialize init_files.resize_wait
f2fs: do not use mutex lock in atomic context
ocfs2: fix a panic problem caused by o2cb_ctl
mm/slab.c: kmemleak no scan alien caches
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
mm, mempolicy: fix uninit memory access
memcg: killed threads should not invoke memcg OOM killer
mm,oom: don't kill global init via memory.oom.group
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
mm/page_ext.c: fix an imbalance with kmemleak
mm/cma.c: cma_declare_contiguous: correct err handling
mm/sparse: fix a bad comparison
perf c2c: Fix c2c report for empty numa node
x86/hyperv: Fix kernel panic when kexec on HyperV
iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
scsi: hisi_sas: Set PHY linkrate when disconnected
libbpf: force fixdep compilation at the start of the build
enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
net: stmmac: Avoid sometimes uninitialized Clang warnings
sysctl: handle overflow for file-max
include/linux/relay.h: fix percpu annotation in struct rchan
gpio: gpio-omap: fix level interrupt idling
net/mlx5: Avoid panic when setting vport mac, getting vport config
net/mlx5: Avoid panic when setting vport rate
tracing: kdb: Fix ftdump to not sleep
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
CIFS: fix POSIX lock leak and invalid ptr deref
tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
tty/serial: atmel: Add is_half_duplex helper
ext4: cleanup bh release code in ext4_ind_remove_space()
arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
ANDROID: cuttlefish_defconfig: Enable CONFIG_OVERLAY_FS
ANDROID: cuttlefish: enable CONFIG_NET_SCH_INGRESS=y
Conflicts:
drivers/usb/gadget/function/f_fs.c
mm/page_alloc.c
Change-Id: Ia2a8e99bfdae84d3933749f45ba86f33c5acd713
Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these these types of failures make CMA significantly
less reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.
Change-Id: I253f4ee2069e190c1115afc421dadd27a7fa87dc
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[ Upstream commit dce30ca9e3b676fb288c33c1f4725a0621361185 ]
guard_bio_eod() can truncate a segment in bio to allow it to do IO on
odd last sectors of a device.
It already checks if the IO starts past EOD, but it does not consider
the possibility of an IO request starting within device boundaries can
contain more than one segment past EOD.
In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
underflow bvec->bv_len.
Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
This situation has been found on filesystems such as isofs and vfat,
which doesn't check the device size before mount, if the device is
smaller than the filesystem itself, a readahead on such filesystem,
which spans EOD, can trigger this situation, leading a call to
zero_user() with a wrong size possibly corrupting memory.
I didn't see any crash, or didn't let the system run long enough to
check if memory corruption will be hit somewhere, but adding
instrumentation to guard_bio_end() to check truncated_bytes size, was
enough to see the error.
The following script can trigger the error.
MNT=/mnt
IMG=./DISK.img
DEV=/dev/loop0
mkfs.vfat $IMG
mount $IMG $MNT
cp -R /etc $MNT &> /dev/null
umount $MNT
losetup -D
losetup --find --show --sizelimit 16247280 $IMG
mount $DEV $MNT
find $MNT -type f -exec cat {} + >/dev/null
Kudos to Eric Sandeen for coming up with the reproducer above
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 43636c804df0126da669c261fc820fb22f62bfc2 ]
When something let __find_get_block_slow() hit all_mapped path, it calls
printk() for 100+ times per a second. But there is no need to print same
message with such high frequency; it is just asking for stall warning, or
at least bloating log files.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.873324][T15342] b_state=0x00000029, b_size=512
[ 399.878403][T15342] device loop0 blocksize: 4096
[ 399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.890400][T15342] b_state=0x00000029, b_size=512
[ 399.895595][T15342] device loop0 blocksize: 4096
[ 399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.907471][T15342] b_state=0x00000029, b_size=512
[ 399.912506][T15342] device loop0 blocksize: 4096
This patch reduces frequency to up to once per a second, in addition to
concatenating three lines into one.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache. In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.
Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation. The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated. One
concrete example is memory reclaim. The reclaim can trigger I/O of
pages of any memcg on the system. So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.
[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In iomap_to_bh, not only mark buffer heads in IOMAP_UNWRITTEN maps as
new, but also buffer heads in IOMAP_MAPPED maps with the IOMAP_F_NEW
flag set. This will be used by filesystems like gfs2, which allocate
blocks in iomap->begin.
Minor corrections to the comment for IOMAP_UNWRITTEN maps.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Bits of the buffer.c based write_end implementations that don't know
about buffer_heads and can be reused by other implementations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull vfs thaw updates from Al Viro:
"An ancient series that has fallen through the cracks in the previous
cycle"
* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
buffer.c: call thaw_super during emergency thaw
vfs: factor sb iteration out of do_emergency_remount
There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thaw of only one of these does not thaw the other.
In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained
unaffected.
The patch is a hack which adds blind unfreezing.
In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is
employed.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull misc vfs updates from Al Viro:
"All kinds of misc stuff, without any unifying topic, from various
people.
Neil's d_anon patch, several bugfixes, introduction of kvmalloc
analogue of kmemdup_user(), extending bitfield.h to deal with
fixed-endians, assorted cleanups all over the place..."
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
alpha: osf_sys.c: use timespec64 where appropriate
alpha: osf_sys.c: fix put_tv32 regression
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
dcache: delete unused d_hash_mask
dcache: subtract d_hash_shift from 32 in advance
fs/buffer.c: fold init_buffer() into init_page_buffers()
fs: fold __inode_permission() into inode_permission()
fs: add RWF_APPEND
sctp: use vmemdup_user() rather than badly open-coding memdup_user()
snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
replace_user_tlv(): switch to vmemdup_user()
new primitive: vmemdup_user()
memdup_user(): switch to GFP_USER
eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
eventfd: fold eventfd_ctx_read() into eventfd_read()
eventfd: convert to use anon_inode_getfd()
nfs4file: get rid of pointless include of btrfs.h
uvc_v4l2: clean copyin/copyout up
vme_user: don't use __copy_..._user()
usx2y: don't bother with memdup_user() for 16-byte structure
...
Since commit e76004093d ("fs/buffer.c: remove unnecessary init
operation after allocating buffer_head"), there are no callers of
init_buffer() outside of init_page_buffers(). So just fold it into
init_page_buffers().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
Pull ext4 updates from Ted Ts'o:
- Add support for online resizing of file systems with bigalloc
- Fix a two data corruption bugs involving DAX, as well as a corruption
bug after a crash during a racing fallocate and delayed allocation.
- Finally, a number of cleanups and optimizations.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve smp scalability for inode generation
ext4: add support for online resizing with bigalloc
ext4: mention noload when recovering on read-only device
Documentation: fix little inconsistencies
ext4: convert timers to use timer_setup()
jbd2: convert timers to use timer_setup()
ext4: remove duplicate extended attributes defs
ext4: add ext4_should_use_dax()
ext4: add sanity check for encryption + DAX
ext4: prevent data corruption with journaling + DAX
ext4: prevent data corruption with inline data + DAX
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ext4: retry allocations conservatively
ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
ext4: Add iomap support for inline data
iomap: Add IOMAP_F_DATA_INLINE flag
iomap: Switch from blkno to disk offset
guard_bio_eod() needs to look at the partition capacity, not just the
capacity of the whole device, when determining if truncation is
necessary.
[ 60.268688] attempt to access beyond end of device
[ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
[ 60.268693] buffer_io_error: 2 callbacks suppressed
[ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read
Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Cc: stable@vger.kernel.org # v4.13
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.
Since this is also the last user of free_more_memory(), kill
it off completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.
After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace iomap->blkno, the sector number, with iomap->addr, the disk
offset in bytes. For invalid disk offsets, use the special value
IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK.
This allows to use iomap for mappings which are not block aligned, such
as inline data on ext4.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> # iomap, xfs
Reviewed-by: Jan Kara <jack@suse.cz>
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
We want only pages from given range in page_cache_seek_hole_data(). Use
pagevec_lookup_range() instead of pagevec_lookup() and remove
unnecessary code.
Note that the check for getting less pages than desired can be removed
because index gets updated by pagevec_lookup_range().
Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e64855c6cf ("fs: Add helper to clean bdev aliases under a bh
and use it") added a wrapper for clean_bdev_aliases() that invalidates
bdev aliases underlying a single buffer head.
However this has caused a performance regression for bonnie++ benchmark
on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
average of 3 runs:
Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%)
Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%)
The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because pagevec_lookup() it uses will end
up iterating through the radix tree until it finds a page (which may
take a while) but we are only interested whether there's a page at a
particular index.
Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.
Fixes: e64855c6cf ("fs: Add helper to clean bdev aliases under a bh and use it")
Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To install a buffer_head into the cpu's LRU queue, bh_lru_install()
would construct a new copy of the queue and then memcpy it over the real
queue. But it's easily possible to do the update in-place, which is
faster and simpler. Some work can also be skipped if the buffer_head
was already in the queue.
As a microbenchmark I timed how long it takes to run sb_getblk()
10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
Effectively, this benchmarks looking up buffer_heads that are in the
page cache but not in the LRU:
Before this patch: 1.758s
After this patch: 1.653s
This patch also removes about 350 bytes of compiled code (on x86_64),
partly due to removal of the memcpy() which was being inlined+unrolled.
Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull XFS updates from Darrick Wong:
"Here are some changes for you for 4.13. For the most part it's fixes
for bugs and deadlock problems, and preparation for online fsck in
some future merge window.
- Avoid quotacheck deadlocks
- Fix transaction overflows when bunmapping fragmented files
- Refactor directory readahead
- Allow admin to configure if ASSERT is fatal
- Improve transaction usage detail logging during overflows
- Minor cleanups
- Don't leak log items when the log shuts down
- Remove double-underscore typedefs
- Various preparation for online scrubbing
- Introduce new error injection configuration sysfs knobs
- Refactor dq_get_next to use extent map directly
- Fix problems with iterating the page cache for unwritten data
- Implement SEEK_{HOLE,DATA} via iomap
- Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
- Don't use MAXPATHLEN to check on-disk symlink target lengths"
* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
xfs: don't crash on unexpected holes in dir/attr btrees
xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
xfs: fix contiguous dquot chunk iteration livelock
xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
vfs: Add iomap_seek_hole and iomap_seek_data helpers
vfs: Add page_cache_seek_hole_data helper
xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
xfs: Check for m_errortag initialization in xfs_errortag_test
xfs: grab dquots without taking the ilock
xfs: fix semicolon.cocci warnings
xfs: Don't clear SGID when inheriting ACLs
xfs: free cowblocks and retry on buffered write ENOSPC
xfs: replace log_badcrc_factor knob with error injection tag
xfs: convert drop_writes to use the errortag mechanism
xfs: remove unneeded parameter from XFS_TEST_ERROR
xfs: expose errortag knobs via sysfs
xfs: make errortag a per-mountpoint structure
xfs: free uncommitted transactions during log recovery
xfs: don't allow bmap on rt files
...
Pull ext4 updates from Ted Ts'o:
"The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
directory entries (assuming ~64 byte file names; in practice, users
will run into practical performance limits first.) This feature was
originally written by the Lustre team, and credit goes to Artem
Blagodarenko from Seagate for getting this feature upstream.
The second major major feature allows ext4 to support extended
attribute values up to 64k. This feature was also originally from
Lustre, and has been enhanced by Tahsin Erdogan from Google with a
deduplication feature so that if multiple files have the same xattr
value (for example, Windows ACL's stored by Samba), only one copy will
be stored on disk for encoding and caching efficiency.
We also have the usual set of bug fixes, cleanups, and optimizations"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
ext4: fix spelling mistake: "prellocated" -> "preallocated"
ext4: fix __ext4_new_inode() journal credits calculation
ext4: skip ext4_init_security() and encryption on ea_inodes
fs: generic_block_bmap(): initialize all of the fields in the temp bh
ext4: change fast symlink test to not rely on i_blocks
ext4: require key for truncate(2) of encrypted file
ext4: don't bother checking for encryption key in ->mmap()
ext4: check return value of kstrtoull correctly in reserved_clusters_store
ext4: fix off-by-one fsmap error on 1k block filesystems
ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
ext4: return EIO on read error in ext4_find_entry
ext4: forbid encrypting root directory
ext4: send parallel discards on commit completions
ext4: avoid unnecessary stalls in ext4_evict_inode()
ext4: add nombcache mount option
ext4: strong binding of xattr inode references
ext4: eliminate xattr entry e_hash recalculation for removes
ext4: reserve space for xattr entries/names
quota: add get_inode_usage callback to transfer multi-inode charges
ext4: xattr inode deduplication
...
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.
The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.
For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.
Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.
But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.
This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing
2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.
3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.
To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.
Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:
1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.
2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.
This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:
https://lwn.net/Articles/718734/https://lwn.net/Articles/724307/"
* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty
I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.
The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.
Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.
This sets the error in the mapping earlier, at the time that it's first
detected.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
KMSAN (KernelMemorySanitizer, a new error detection tool) reports the
use of uninitialized memory in ext4_update_bh_state():
==================================================================
BUG: KMSAN: use of unitialized memory
CPU: 3 PID: 1 Comm: swapper/0 Tainted: G B 4.8.0-rc6+ #597
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
0000000000000282 ffff88003cc96f68 ffffffff81f30856 0000003000000008
ffff88003cc96f78 0000000000000096 ffffffff8169742a ffff88003cc96ff8
ffffffff812fc1fc 0000000000000008 ffff88003a1980e8 0000000100000000
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff81f30856>] dump_stack+0xa6/0xc0 lib/dump_stack.c:51
[<ffffffff812fc1fc>] kmsan_report+0x1ec/0x300 mm/kmsan/kmsan.c:?
[<ffffffff812fc33b>] __msan_warning+0x2b/0x40 ??:?
[< inline >] ext4_update_bh_state fs/ext4/inode.c:727
[<ffffffff8169742a>] _ext4_get_block+0x6ca/0x8a0 fs/ext4/inode.c:759
[<ffffffff81696d4c>] ext4_get_block+0x8c/0xa0 fs/ext4/inode.c:769
[<ffffffff814a2d36>] generic_block_bmap+0x246/0x2b0 fs/buffer.c:2991
[<ffffffff816ca30e>] ext4_bmap+0x5ee/0x660 fs/ext4/inode.c:3177
...
origin description: ----tmp@generic_block_bmap
==================================================================
(the line numbers are relative to 4.8-rc6, but the bug persists
upstream)
The local |tmp| is created in generic_block_bmap() and then passed into
ext4_bmap() => ext4_get_block() => _ext4_get_block() =>
ext4_update_bh_state(). Along the way tmp.b_page is never initialized
before ext4_update_bh_state() checks its value.
[ Use the approach suggested by Kees Cook of initializing the whole bh
structure.]
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Both ext4 and xfs implement seeking for the next hole or piece of data
in unwritten extents by scanning the page cache, and both versions share
the same bug when iterating the buffers of a page: the start offset into
the page isn't taken into account, so when a page fits more than two
filesystem blocks, things will go wrong. For example, on a filesystem
with a block size of 1k, the following command will fail:
xfs_io -f -c "falloc 0 4k" \
-c "pwrite 1k 1k" \
-c "pwrite 3k 1k" \
-c "seek -a -r 0" foo
In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.
Introduce a generic vfs helper for seeking in the page cache that gets
this right. The next commits will replace the filesystem specific
implementations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[hch: dropped the export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull misc vfs updates from Al Viro:
"Assorted bits and pieces from various people. No common topic in this
pile, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/affs: add rename exchange
fs/affs: add rename2 to prepare multiple methods
Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
fs: don't set *REFERENCED on single use objects
fs: compat: Remove warning from COMPATIBLE_IOCTL
remove pointless extern of atime_need_update_rcu()
fs: completely ignore unknown open flags
fs: add a VALID_OPEN_FLAGS
fs: remove _submit_bh()
fs: constify tree_descr arrays passed to simple_fill_super()
fs: drop duplicate header percpu-rwsem.h
fs/affs: bugfix: Write files greater than page size on OFS
fs/affs: bugfix: enable writes on OFS disks
fs/affs: remove node generation check
fs/affs: import amigaffs.h
fs/affs: bugfix: make symbolic links work again
_submit_bh() allowed submitting a buffer_head for I/O using custom
bio_flags. It used to be used by jbd to set BIO_SNAP_STABLE, introduced
by commit 7136851117 ("mm: make snapshotting pages for stable writes a
per-bio operation"). However, the code and flag has since been removed
and no _submit_bh() users remain.
These days, bio_flags are mostly used internally by the block layer to
track the state of bio's. As such, it doesn't really make sense for
filesystems to use them instead of op_flags when wanting special
behavior for block requests.
Therefore, remove _submit_bh() and trim the bio_flags argument from
submit_bh_wbc().
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.
This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.
Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The first block to be cleaned may start at a non-zero page offset. In
such a scenario clean_bdev_aliases() will end up cleaning blocks that
do not fall in the range of blocks to be cleaned. This commit fixes the
issue by skipping blocks that do not fall in valid block range.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>