d4414bc0e93d8da170fd0fc9fef65fe84015677d
168 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0d750eaafc |
Merge tag 'ASB-2024-08-05_4.19-stable' of https://android.googlesource.com/kernel/common into android-msm-pixel-4.19
https://source.android.com/docs/security/bulletin/2024-08-01 CVE-2024-36971 * tag 'ASB-2024-08-05_4.19-stable' of https://android.googlesource.com/kernel/common: (2363 commits) Linux 4.19.318 i2c: rcar: bring hardware to known state when probing nilfs2: fix kernel bug on rename operation of broken directory SUNRPC: Fix RPC client cleaned up the freed pipefs dentries tcp: avoid too many retransmit packets tcp: use signed arithmetic in tcp_rtx_probe0_timed_out() net: tcp: fix unexcepted socket die when snd_wnd is 0 tcp: refactor tcp_retransmit_timer() libceph: fix race between delayed_work() and ceph_monc_stop() hpet: Support 32-bit userspace USB: core: Fix duplicate endpoint bug by clearing reserved bits in the descriptor usb: gadget: configfs: Prevent OOB read/write in usb_string_copy() USB: Add USB_QUIRK_NO_SET_INTF quirk for START BP-850k USB: serial: option: add Rolling RW350-GL variants USB: serial: option: add Netprisma LCUK54 series modules USB: serial: option: add support for Foxconn T99W651 USB: serial: option: add Fibocom FM350-GL USB: serial: option: add Telit FN912 rmnet compositions USB: serial: option: add Telit generic core-dump composition ARM: davinci: Convert comma to semicolon ... Conflicts: Documentation/devicetree/bindings/sound/rt5645.txt android/abi_gki_aarch64.xml drivers/clk/qcom/clk-rcg2.c drivers/hwtracing/coresight/coresight-etm4x.c drivers/leds/leds-pwm.c drivers/mmc/core/host.c drivers/mmc/core/sdio.c drivers/mmc/host/cqhci.c drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c drivers/rpmsg/qcom_glink_native.c drivers/scsi/ufs/ufshcd.c drivers/thermal/thermal_core.c drivers/usb/dwc3/core.c drivers/usb/gadget/function/f_ncm.c fs/f2fs/gc.c fs/pstore/ram_core.c include/linux/fs.h include/linux/timer.h include/net/tcp.h init/initramfs.c kernel/events/core.c kernel/sched/idle.c kernel/time/timer.c mm/page_alloc.c net/wireless/scan.c scripts/checkpatch.pl Change-Id: Ice08f3ba5dc64a093bc381710ef2408d963cb983 |
||
|
|
ec0ad95612 |
Merge 4.19.312 into android-4.19-stable
Changes in 4.19.312
Documentation/hw-vuln: Update spectre doc
x86/cpu: Support AMD Automatic IBRS
x86/bugs: Use sysfs_emit()
timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps
timer/trace: Improve timer tracing
timers: Prepare support for PREEMPT_RT
timers: Update kernel-doc for various functions
timers: Use del_timer_sync() even on UP
timers: Rename del_timer_sync() to timer_delete_sync()
wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach
smack: Set SMACK64TRANSMUTE only for dirs in smack_inode_setxattr()
smack: Handle SMACK64TRANSMUTE in smack_inode_setsecurity()
ARM: dts: mmp2-brownstone: Don't redeclare phandle references
arm: dts: marvell: Fix maxium->maxim typo in brownstone dts
media: xc4000: Fix atomicity violation in xc4000_get_frequency
KVM: Always flush async #PF workqueue when vCPU is being destroyed
sparc64: NMI watchdog: fix return value of __setup handler
sparc: vDSO: fix return value of __setup handler
crypto: qat - fix double free during reset
crypto: qat - resolve race condition during AER recovery
fat: fix uninitialized field in nostale filehandles
ubifs: Set page uptodate in the correct place
ubi: Check for too small LEB size in VTBL code
ubi: correct the calculation of fastmap size
parisc: Do not hardcode registers in checksum functions
parisc: Fix ip_fast_csum
parisc: Fix csum_ipv6_magic on 32-bit systems
parisc: Fix csum_ipv6_magic on 64-bit systems
parisc: Strip upper 32 bit of sum in csum_ipv6_magic for 64-bit builds
PM: suspend: Set mem_sleep_current during kernel command line setup
clk: qcom: gcc-ipq8074: fix terminating of frequency table arrays
clk: qcom: mmcc-apq8084: fix terminating of frequency table arrays
clk: qcom: mmcc-msm8974: fix terminating of frequency table arrays
powerpc/fsl: Fix mfpmr build errors with newer binutils
USB: serial: ftdi_sio: add support for GMC Z216C Adapter IR-USB
USB: serial: add device ID for VeriFone adapter
USB: serial: cp210x: add ID for MGP Instruments PDS100
USB: serial: option: add MeiG Smart SLM320 product
USB: serial: cp210x: add pid/vid for TDK NC0110013M and MM0110113M
PM: sleep: wakeirq: fix wake irq warning in system suspend
mmc: tmio: avoid concurrent runs of mmc_request_done()
fuse: don't unhash root
PCI: Drop pci_device_remove() test of pci_dev->driver
PCI/PM: Drain runtime-idle callbacks before driver removal
Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""
dm-raid: fix lockdep waring in "pers->hot_add_disk"
mmc: core: Fix switch on gp3 partition
hwmon: (amc6821) add of_match table
ext4: fix corruption during on-line resize
slimbus: core: Remove usage of the deprecated ida_simple_xx() API
speakup: Fix 8bit characters from direct synth
kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
vfio/platform: Disable virqfds on cleanup
soc: fsl: qbman: Always disable interrupts when taking cgr_lock
soc: fsl: qbman: Add helper for sanity checking cgr ops
soc: fsl: qbman: Add CGR update function
soc: fsl: qbman: Use raw spinlock for cgr_lock
s390/zcrypt: fix reference counting on zcrypt card objects
drm/imx/ipuv3: do not return negative values from .get_modes()
drm/vc4: hdmi: do not return negative values from .get_modes()
memtest: use {READ,WRITE}_ONCE in memory scanning
nilfs2: fix failure to detect DAT corruption in btree and direct mappings
nilfs2: use a more common logging style
nilfs2: prevent kernel bug at submit_bh_wbc()
x86/CPU/AMD: Update the Zenbleed microcode revisions
ahci: asm1064: correct count of reported ports
ahci: asm1064: asm1166: don't limit reported ports
comedi: comedi_test: Prevent timers rescheduling during deletion
netfilter: nf_tables: disallow anonymous set with timeout flag
netfilter: nf_tables: reject constant set with timeout
xfrm: Avoid clang fortify warning in copy_to_user_tmpl()
ALSA: hda/realtek - Fix headset Mic no show at resume back for Lenovo ALC897 platform
USB: usb-storage: Prevent divide-by-0 error in isd200_ata_command
usb: gadget: ncm: Fix handling of zero block length packets
usb: port: Don't try to peer unused USB ports based on location
tty: serial: fsl_lpuart: avoid idle preamble pending if CTS is enabled
vt: fix unicode buffer corruption when deleting characters
vt: fix memory overlapping when deleting chars in the buffer
mm/memory-failure: fix an incorrect use of tail pages
mm/migrate: set swap entry values of THP tail pages properly.
wifi: mac80211: check/clear fast rx for non-4addr sta VLAN changes
exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
usb: cdc-wdm: close race between read and workqueue
ALSA: sh: aica: reorder cleanup operations to avoid UAF bugs
fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion
printk: Update @console_may_schedule in console_trylock_spinning()
btrfs: allocate btrfs_ioctl_defrag_range_args on stack
Revert "loop: Check for overflow while configuring loop"
loop: Call loop_config_discard() only after new config is applied
loop: Remove sector_t truncation checks
loop: Factor out setting loop device size
loop: Refactor loop_set_status() size calculation
loop: properly observe rotational flag of underlying device
perf/core: Fix reentry problem in perf_output_read_group()
efivarfs: Request at most 512 bytes for variable names
powerpc: xor_vmx: Add '-mhard-float' to CFLAGS
loop: Factor out configuring loop from status
loop: Check for overflow while configuring loop
loop: loop_set_status_from_info() check before assignment
usb: dwc2: host: Fix remote wakeup from hibernation
usb: dwc2: host: Fix hibernation flow
usb: dwc2: host: Fix ISOC flow in DDMA mode
usb: dwc2: gadget: LPM flow fix
usb: udc: remove warning when queue disabled ep
scsi: qla2xxx: Fix command flush on cable pull
x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled
scsi: lpfc: Correct size for wqe for memset()
USB: core: Fix deadlock in usb_deauthorize_interface()
nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet
mptcp: add sk_stop_timer_sync helper
tcp: properly terminate timers for kernel sockets
r8169: fix issue caused by buggy BIOS on certain boards with RTL8168d
Bluetooth: hci_event: set the conn encrypted before conn establishes
Bluetooth: Fix TOCTOU in HCI debugfs implementation
netfilter: nf_tables: disallow timeout for anonymous sets
net/rds: fix possible cp null dereference
Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
net/sched: act_skbmod: prevent kernel-infoleak
net: stmmac: fix rx queue priority assignment
selftests: reuseaddr_conflict: add missing new line at the end of the output
ipv6: Fix infinite recursion in fib6_dump_done().
i40e: fix vf may be used uninitialized in this function warning
staging: mmal-vchiq: Avoid use of bool in structures
staging: mmal-vchiq: Allocate and free components as required
staging: mmal-vchiq: Fix client_component for 64 bit kernel
staging: vc04_services: changen strncpy() to strscpy_pad()
staging: vc04_services: fix information leak in create_component()
initramfs: factor out a helper to populate the initrd image
fs: add a vfs_fchown helper
fs: add a vfs_fchmod helper
initramfs: switch initramfs unpacking to struct file based APIs
init: open /initrd.image with O_LARGEFILE
erspan: Add type I version 0 support.
erspan: make sure erspan_base_hdr is present in skb->head
ASoC: ops: Fix wraparound for mask in snd_soc_get_volsw
ata: sata_sx4: fix pdc20621_get_from_dimm() on 64-bit
ata: sata_mv: Fix PCI device ID table declaration compilation warning
ALSA: hda/realtek: Update Panasonic CF-SZ6 quirk to support headset with microphone
wifi: ath9k: fix LNA selection in ath_ant_try_scan()
VMCI: Fix memcpy() run-time warning in dg_dispatch_as_host()
arm64: dts: rockchip: fix rk3399 hdmi ports node
tools/power x86_energy_perf_policy: Fix file leak in get_pkg_num()
btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()
btrfs: export: handle invalid inode or root reference in btrfs_get_parent()
btrfs: send: handle path ref underflow in header iterate_inode_ref()
Bluetooth: btintel: Fix null ptr deref in btintel_read_version
Input: synaptics-rmi4 - fail probing if memory allocation for "phys" fails
sysv: don't call sb_bread() with pointers_lock held
scsi: lpfc: Fix possible memory leak in lpfc_rcv_padisc()
isofs: handle CDs with bad root inode but good Joliet root directory
media: sta2x11: fix irq handler cast
drm/amd/display: Fix nanosec stat overflow
SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned int
block: prevent division by zero in blk_rq_stat_sum()
Input: allocate keycode for Display refresh rate toggle
ktest: force $buildonly = 1 for 'make_warnings_file' test type
tools: iio: replace seekdir() in iio_generic_buffer
usb: sl811-hcd: only defined function checkdone if QUIRK2 is defined
fbdev: viafb: fix typo in hw_bitblt_1 and hw_bitblt_2
fbmon: prevent division by zero in fb_videomode_from_videomode()
tty: n_gsm: require CAP_NET_ADMIN to attach N_GSM0710 ldisc
drm/vkms: call drm_atomic_helper_shutdown before drm_dev_put()
virtio: reenable config if freezing device failed
x86/mm/pat: fix VM_PAT handling in COW mappings
Bluetooth: btintel: Fixe build regression
VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
erspan: Check IFLA_GRE_ERSPAN_VER is set.
ip_gre: do not report erspan version on GRE interface
initramfs: fix populate_initrd_image() section mismatch
amdkfd: use calloc instead of kzalloc to avoid integer overflow
Linux 4.19.312
Change-Id: Ic4c50de6afb4c88c8011be6cc93f960d2dc968e0
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
|
||
|
|
c82a659cc8 |
mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
commit 803de9000f334b771afacb6ff3e78622916668b0 upstream.
Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
__GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination
can happen in a suspend/resume context where a GFP_KERNEL allocation can
have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the
freelist. This fails because there is nothing free of that costly
order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
which bails out because a zone is ready to be compacted; it pretends
to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because
__GFP_IO is not set (it's not passed by the snd allocator, and even
if it were, we are suspending so the __GFP_IO flag would be cleared
anyway).
5. page alloc believes reclaim progress was made (because of the
pretense in item 3) and so it checks whether it should retry
compaction. The compaction retry logic thinks it should try again,
because:
a) reclaim is needed because of the early bail-out in item 4
b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is confusing the COMPACT_SKIPPED returned from
__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
indicating a lack of order-0 pages, and in step 5 evaluating that in
should_compact_retry() as a reason to retry, before incrementing and
limiting the number of retries. There are however other places that
wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
evaluation and switch the open-coded test in try_to_compact_pages() to use
it.
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so
there's at least one attempt to actually reclaim, even if chances are
small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim()
return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact,
which is then used to avoid retrying reclaim/compaction for costly
allocations (step 5) if we can't compact and also to skip the early
compaction attempt that we do in some cases
Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
Fixes:
|
||
|
|
5ef711393d |
GKI: cma: redirect page allocation to CMA (part deux)
In commit
|
||
|
|
c29070e5b9 |
ANDROID: GKI: cma: redirect page allocation to CMA
CMA pages are designed to be used as fallback for movable allocations
and cannot be used for non-movable allocations. If CMA pages are
utilized poorly, non-movable allocations may end up getting starved if
all regular movable pages are allocated and the only pages left are
CMA. Always using CMA pages first creates unacceptable performance
problems. As a midway alternative, use CMA pages for certain
userspace allocations. The userspace pages can be migrated or dropped
quickly which giving decent utilization.
Change-Id: I6165dda01b705309eebabc6dfa67146b7a95c174
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
[lauraa@codeaurora.org: Missing CONFIG_CMA guards, add commit text]
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
[lmark@codeaurora.org: resolve conflicts relating to MIGRATE_HIGHATOMIC]
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[swatsrid@codeaurora.org: Fix merge conflicts]
Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org>
(cherry picked from commit
|
||
|
|
b434e4bcd4 |
Merge android-4.19-q.84 (314ab78) into msm-4.19
* refs/heads/tmp-314ab78: Linux 4.19.84 kvm: x86: mmu: Recovery of shattered NX large pages kvm: Add helper function for creating VM worker threads kvm: mmu: ITLB_MULTIHIT mitigation KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active KVM: x86: add tracepoints around __direct_map and FNAME(fetch) KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON KVM: x86: remove now unneeded hugepage gfn adjustment KVM: x86: make FNAME(fetch) and __direct_map more similar kvm: mmu: Do not release the page inside mmu_set_spte() kvm: Convert kvm_lock to a mutex kvm: x86, powerpc: do not allow clearing largepages debugfs entry Documentation: Add ITLB_MULTIHIT documentation cpu/speculation: Uninline and export CPU mitigations helpers x86/cpu: Add Tremont to the cpu vulnerability whitelist x86/bugs: Add ITLB_MULTIHIT bug infrastructure x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs x86/tsx: Add config options to set tsx=on|off|auto x86/speculation/taa: Add documentation for TSX Async Abort x86/tsx: Add "auto" option to the tsx= cmdline parameter kvm/x86: Export MDS_NO=0 to guests when TSX is enabled x86/speculation/taa: Add sysfs reporting for TSX Async Abort x86/speculation/taa: Add mitigation for TSX Async Abort x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default x86/cpu: Add a helper function x86_read_arch_cap_msr() x86/msr: Add the IA32_TSX_CTRL MSR KVM: x86: use Intel speculation bugs and features as derived in generic x86 code drm/i915/cmdparser: Fix jump whitelist clearing drm/i915/gen8+: Add RC6 CTX corruption WA drm/i915: Lower RM timeout to avoid DSI hard hangs drm/i915/cmdparser: Ignore Length operands during command matching drm/i915/cmdparser: Add support for backward jumps drm/i915/cmdparser: Use explicit goto for error paths drm/i915: Add gen9 BCS cmdparsing drm/i915: Allow parsing of unsized batches drm/i915: Support ro ppgtt mapped cmdparser shadow buffers drm/i915: Add support for mandatory cmdparsing drm/i915: Remove Master tables from cmdparser drm/i915: Disable Secure Batches for gen6+ drm/i915: Rename gen7 cmdparser tables vsock/virtio: fix sock refcnt holding during the shutdown iio: imu: mpu6050: Fix FIFO layout for ICM20602 net: prevent load/store tearing on sk->sk_stamp netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 sets usbip: Fix free of unallocated memory in vhci tx cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead mm/filemap.c: don't initiate writeback if mapping has no dirty pages iio: imu: inv_mpu6050: fix no data on MPU6050 iio: imu: mpu6050: Add support for the ICM 20602 IMU blkcg: make blkcg_print_stat() print stats only for online blkgs pinctrl: cherryview: Fix irq_valid_mask calculation ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() pinctrl: intel: Avoid potential glitches if pin is in GPIO mode e1000: fix memory leaks igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare NFSv4: Don't allow a cached open with a revoked delegation usb: dwc3: gadget: fix race when disabling ep with cancelled xfers hv_netvsc: Fix error handling in netvsc_attach() drm/amd/display: Passive DP->HDMI dongle detection fix drm/amdgpu: If amdgpu_ib_schedule fails return back the error. iommu/amd: Apply the same IVRS IOAPIC workaround to Acer Aspire A315-41 net: mscc: ocelot: refuse to overwrite the port's native vlan net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is up net: hisilicon: Fix "Trying to free already-free IRQ" fjes: Handle workqueue allocation failure nvme-multipath: fix possible io hang after ctrl reconnect scsi: qla2xxx: stop timer in shutdown path RDMA/hns: Prevent memory leaks of eq->buf_list RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure case usbip: tools: Fix read_usb_vudc_device() error path handling USB: ldusb: use unsigned size format specifiers USB: Skip endpoints with 0 maxpacket length perf/x86/uncore: Fix event group support perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h) perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity usb: dwc3: remove the call trace of USBx_GFLADJ usb: gadget: configfs: fix concurrent issue between composite APIs usb: dwc3: pci: prevent memory leak in dwc3_pci_probe usb: gadget: composite: Fix possible double free memory bug usb: gadget: udc: atmel: Fix interrupt storm in FIFO mode. usb: fsl: Check memory resource before releasing it macsec: fix refcnt leak in module exit routine bonding: fix unexpected IFF_BONDING bit unset ipvs: move old_secure_tcp into struct netns_ipvs ipvs: don't ignore errors in case refcounting ip_vs module fails netfilter: nf_flow_table: set timeout before insertion into hashes scsi: qla2xxx: Initialized mailbox to prevent driver load failure scsi: lpfc: Honor module parameter lpfc_use_adisc net: openvswitch: free vport unless register_netdevice() succeeds RDMA/uverbs: Prevent potential underflow scsi: qla2xxx: fixup incorrect usage of host_byte net/mlx5: prevent memory leak in mlx5_fpga_conn_create_cq net/mlx5e: TX, Fix consumer index of error cqe dump RDMA/qedr: Fix reported firmware version iw_cxgb4: fix ECN check on the passive accept RDMA/mlx5: Clear old rate limit when closing QP HID: intel-ish-hid: fix wrong error handling in ishtp_cl_alloc_tx_ring() dmaengine: sprd: Fix the possible memory leak issue dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config HID: google: add magnemite/masterball USB ids PCI: tegra: Enable Relaxed Ordering only for Tegra20 & Tegra30 usbip: Implement SG support to vhci-hcd and stub driver usbip: Fix vhci_urb_enqueue() URB null transfer buffer error path sched/fair: Fix -Wunused-but-set-variable warnings sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices ALSA: usb-audio: Fix copy&paste error in the validator ALSA: usb-audio: remove some dead code ALSA: usb-audio: Fix possible NULL dereference at create_yamaha_midi_quirk() ALSA: usb-audio: Clean up check_input_term() ALSA: usb-audio: Remove superfluous bLength checks ALSA: usb-audio: Unify the release of usb_mixer_elem_info objects ALSA: usb-audio: Simplify parse_audio_unit() ALSA: usb-audio: More validations of descriptor units configfs: fix a deadlock in configfs_symlink() configfs: provide exclusion between IO and removals configfs: new object reprsenting tree fragments configfs_register_group() shouldn't be (and isn't) called in rmdirable parts configfs: stash the data we need into configfs_buffer at open time can: peak_usb: fix slab info leak can: mcba_usb: fix use-after-free on disconnect can: dev: add missing of_node_put() after calling of_get_child_by_name() can: gs_usb: gs_can_open(): prevent memory leak can: rx-offload: can_rx_offload_queue_sorted(): fix error handling, avoid skb mem leak can: peak_usb: fix a potential out-of-sync while decoding packets can: c_can: c_can_poll(): only read status register after status IRQ can: flexcan: disable completely the ECC mechanism can: usb_8dev: fix use-after-free on disconnect SMB3: Fix persistent handles reconnect x86/apic/32: Avoid bogus LDR warnings intel_th: pci: Add Jasper Lake PCH support intel_th: pci: Add Comet Lake PCH support netfilter: ipset: Fix an error code in ip_set_sockfn_get() netfilter: nf_tables: Align nft_expr private data to 64-bit ARM: sunxi: Fix CPU powerdown on A83T iio: srf04: fix wrong limitation in distance measuring iio: imu: adis16480: make sure provided frequency is positive iio: adc: stm32-adc: fix stopping dma ceph: add missing check in d_revalidate snapdir handling ceph: fix use-after-free in __ceph_remove_cap() arm64: Do not mask out PTE_RDONLY in pte_same() soundwire: bus: set initial value to port_status soundwire: depend on ACPI HID: wacom: generic: Treat serial number and related fields as unsigned drm/radeon: fix si_enable_smc_cac() failed issue perf tools: Fix time sorting tools: gpio: Use !building_out_of_srctree to determine srctree dump_stack: avoid the livelock of the dump_lock mm, vmstat: hide /proc/pagetypeinfo from normal users mm: thp: handle page cache THP correctly in PageTransCompoundMap mm, meminit: recalculate pcpu batch and high limits after init completes mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges ALSA: hda/ca0132 - Fix possible workqueue stall ALSA: bebob: fix to detect configured source of sampling clock for Focusrite Saffire Pro i/o series ALSA: timer: Fix incorrectly assigned timer instance net: hns: Fix the stray netpoll locks causing deadlock in NAPI path ipv6: fixes rt6_probe() and fib6_nh->last_probe init net: mscc: ocelot: fix NULL pointer on LAG slave removal net: mscc: ocelot: don't handle netdev events for other netdevs qede: fix NULL pointer deref in __qede_remove() NFC: st21nfca: fix double free nfc: netlink: fix double device reference drop NFC: fdp: fix incorrect free object net: usb: qmi_wwan: add support for DW5821e with eSIM support net: qualcomm: rmnet: Fix potential UAF when unregistering net: fix data-race in neigh_event_send() net: ethernet: octeon_mgmt: Account for second possible VLAN header ipv4: Fix table id reference in fib_sync_down_addr CDC-NCM: handle incomplete transfer of MTU bonding: fix state transition issue in link monitoring Linux 4.19.83 usb: gadget: udc: core: Fix segfault if udc_bind_to_driver() for pending driver fails arm64: dts: ti: k3-am65-main: Fix gic-its node unit-address ASoC: pcm3168a: The codec does not support S32_LE selftests/powerpc: Fix compile error on tlbie_test due to newer gcc selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 platform/x86: pmc_atom: Add Siemens SIMATIC IPC227E to critclk_systems DMI table wireless: Skip directory when generating certificates net/flow_dissector: switch to siphash r8152: add device id for Lenovo ThinkPad USB-C Dock Gen 2 net: dsa: fix switch tree list net: usb: lan78xx: Connect PHY before registering MAC net: bcmgenet: reset 40nm EPHY on energy detect net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: dsa: b53: Do not clear existing mirrored port mask net/mlx5e: Fix ethtool self test: link speed r8169: fix wrong PHY ID issue with RTL8168dp net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget selftests: fib_tests: add more tests for metric update ipv4: fix route update on metric change. net: add READ_ONCE() annotation in __skb_wait_for_more_packets() net: use skb_queue_empty_lockless() in busy poll contexts net: use skb_queue_empty_lockless() in poll() handlers udp: use skb_queue_empty_lockless() net: add skb_queue_empty_lockless() vxlan: check tun_info options_len properly udp: fix data-race in udp_set_dev_scratch() selftests: net: reuseport_dualstack: fix uninitalized parameter net: Zeroing the structure ethtool_wolinfo in ethtool_get_wol() net: usb: lan78xx: Disable interrupts before calling generic_handle_irq() netns: fix GFP flags in rtnl_net_notifyid() net/mlx4_core: Dynamically set guaranteed amount of counters per VF net: hisilicon: Fix ping latency when deal with high throughput net: fix sk_page_frag() recursion from memory reclaim net: ethernet: ftgmac100: Fix DMA coherency issue with SW checksum net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: annotate lockless accesses to sk->sk_napi_id net: annotate accesses to sk->sk_incoming_cpu inet: stop leaking jiffies on the wire erspan: fix the tun_info options_len check for erspan dccp: do not leak jiffies on the wire cxgb4: fix panic when attaching to ULD fail nbd: handle racing with error'ed out commands nbd: protect cmd->status with cmd->lock cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs i2c: stm32f7: remove warning when compiling with W=1 i2c: stm32f7: fix a race in slave mode with arbitration loss irq i2c: stm32f7: fix first byte to send in slave mode irqchip/gic-v3-its: Use the exact ITSList for VMOVP MIPS: bmips: mark exception vectors as char arrays of: unittest: fix memory leak in unittest_data_add ARM: 8926/1: v7m: remove register save to stack before svc tracing: Fix "gfp_t" format for synthetic events scsi: target: core: Do not overwrite CDB byte 1 drm/amdgpu: fix potential VM faults ARM: davinci: dm365: Fix McBSP dma_slave_map entry perf kmem: Fix memory leak in compact_gfp_flags() 8250-men-mcb: fix error checking when get_num_ports returns -ENODEV perf c2c: Fix memory leak in build_cl_output() ARM: dts: imx7s: Correct GPT's ipg clock source scsi: fix kconfig dependency warning related to 53C700_LE_ON_BE scsi: sni_53c710: fix compilation error scsi: scsi_dh_alua: handle RTPG sense code correctly during state transitions scsi: qla2xxx: fix a potential NULL pointer dereference ARM: mm: fix alignment handler faults under memory pressure pinctrl: ns2: Fix off by one bugs in ns2_pinmux_enable() ARM: dts: logicpd-torpedo-som: Remove twl_keypad ASoc: rockchip: i2s: Fix RPM imbalance ASoC: wm_adsp: Don't generate kcontrols without READ flags regulator: pfuze100-regulator: Variable "val" in pfuze100_regulator_probe() could be uninitialized ASoC: rt5682: add NULL handler to set_jack function regulator: ti-abb: Fix timeout in ti_abb_wait_txdone/ti_abb_clear_all_txdone arm64: dts: Fix gpio to pinmux mapping arm64: dts: allwinner: a64: sopine-baseboard: Add PHY regulator delay arm64: dts: allwinner: a64: pine64-plus: Add PHY regulator delay ASoC: wm8994: Do not register inapplicable controls for WM1811 regulator: of: fix suspend-min/max-voltage parsing kbuild: add -fcf-protection=none when using retpoline flags Linux 4.19.82 Revert "ALSA: hda: Flush interrupts on disabling" powerpc/powernv: Fix CPU idle to be called with IRQs disabled ALSA: usb-audio: Add DSD support for Gustard U16/X26 USB Interface ALSA: usb-audio: Update DSD support quirks for Oppo and Rotel ALSA: usb-audio: DSD auto-detection for Playback Designs ALSA: timer: Fix mutex deadlock at releasing card ALSA: timer: Simplify error path in snd_timer_open() sch_netem: fix rcu splat in netem_enqueue() net: usb: sr9800: fix uninitialized local variable bonding: fix potential NULL deref in bond_update_slave_arr NFC: pn533: fix use-after-free and memleaks rxrpc: Fix trace-after-put looking at the put peer record rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record rxrpc: Fix call ref leak llc: fix sk_buff leak in llc_conn_service() llc: fix sk_buff leak in llc_sap_state_process() batman-adv: Avoid free/alloc race when handling OGM buffer NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid() drm/amdgpu/powerplay/vega10: allow undervolting in p7 dmaengine: cppi41: Fix cppi41_dma_prep_slave_sg() when idle dmaengine: qcom: bam_dma: Fix resource leak rtlwifi: Fix potential overflow on P2P code arm64: Ensure VM_WRITE|VM_SHARED ptes are clean by default s390/idle: fix cpu idle time calculation s390/cmm: fix information leak in cmm_timeout_handler() nl80211: fix validation of mesh path nexthop HID: fix error message in hid_open_report() HID: Fix assumption that devices have inputs HID: i2c-hid: add Trekstor Primebook C11B to descriptor override scsi: target: cxgbit: Fix cxgbit_fw4_ack() USB: serial: whiteheat: fix line-speed endianness USB: serial: whiteheat: fix potential slab corruption usb: xhci: fix __le32/__le64 accessors in debugfs code USB: ldusb: fix control-message timeout USB: ldusb: fix ring-buffer locking usb-storage: Revert commit 747668dbc061 ("usb-storage: Set virt_boundary_mask to avoid SG overflows") USB: gadget: Reject endpoints with 0 maxpacket value UAS: Revert commit 3ae62a42090f ("UAS: fix alignment of scatter/gather segments") ALSA: hda/realtek - Add support for ALC623 ALSA: hda/realtek - Fix 2 front mics of codec 0x623 ALSA: bebob: Fix prototype of helper function to return negative value fuse: truncate pending writes on O_TRUNC fuse: flush dirty data/metadata before non-truncate setattr ath6kl: fix a NULL-ptr-deref bug in ath6kl_usb_alloc_urb_from_pipe() thunderbolt: Use 32-bit writes when writing ring producer/consumer USB: legousbtower: fix a signedness bug in tower_probe() nbd: verify socket is supported during setup iwlwifi: exclude GEO SAR support for 3168 ALSA: hda/realtek: Reduce the Headphone static noise on XPS 9350/9360 ARM: 8914/1: NOMMU: Fix exc_ret for XIP tracing: Initialize iter->seq after zeroing in tracing_read_pipe() s390/uaccess: avoid (false positive) compiler warnings NFSv4: Fix leak of clp->cl_acceptor string nbd: fix possible sysfs duplicate warning virt: vbox: fix memory leak in hgcm_call_preprocess_linaddr MIPS: fw: sni: Fix out of bounds init of o32 stack MIPS: include: Mark __xchg as __always_inline iio: imu: adis16400: release allocated memory on failure drm/amdgpu: fix memory leak perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp sched/vtime: Fix guest/system mis-accounting on task switch x86/cpu: Add Comet Lake to the Intel CPU models header arm64: armv8_deprecated: Checking return value for memory allocation fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() ocfs2: clear zero in unaligned direct IO x86/xen: Return from panic notifier MIPS: include: Mark __cmpxchg as __always_inline efi/x86: Do not clean dummy variable in kexec path efi/cper: Fix endianness of PCIe class code serial: mctrl_gpio: Check for NULL pointer fs: cifs: mute -Wunused-const-variable message gpio: max77620: Use correct unit for debounce times tty: n_hdlc: fix build on SPARC tty: serial: owl: Fix the link time qualifier of 'owl_uart_exit()' arm64: ftrace: Ensure synchronisation in PLT setup for Neoverse-N1 #1542419 nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request HID: hyperv: Use in-place iterator API in the channel callback RDMA/iwcm: Fix a lock inversion issue RDMA/hfi1: Prevent memory leak in sdma_init staging: rtl8188eu: fix null dereference when kzalloc fails perf annotate: Return appropriate error code for allocation failures perf annotate: Propagate the symbol__annotate() error return perf annotate: Fix the signedness of failure returns perf annotate: Propagate perf_env__arch() error perf tools: Propagate get_cpuid() error perf jevents: Fix period for Intel fixed counters perf script brstackinsn: Fix recovery from LBR/binary mismatch perf map: Fix overlapped map handling perf tests: Avoid raising SEGV using an obvious NULL dereference libsubcmd: Make _FORTIFY_SOURCE defines dependent on the feature iio: fix center temperature of bmc150-accel-core iio: adc: meson_saradc: Fix memory allocation order power: supply: max14656: fix potential use-after-free drm/amd/display: fix odm combine pipe reset PCI/PME: Fix possible use-after-free on remove net: dsa: mv88e6xxx: Release lock while requesting IRQ exec: load_script: Do not exec truncated interpreter path ext4: disallow files with EXT4_JOURNAL_DATA_FL from EXT4_IOC_SWAP_BOOT media: vimc: Remove unused but set variables ALSA: hda/realtek - Apply ALC294 hp init also for S4 resume cifs: add credits from unmatched responses/messages CIFS: Respect SMB2 hdr preamble size in read responses scsi: lpfc: Correct localport timeout duration error mlxsw: spectrum: Set LAG port collector only when active arm64: kpti: Whitelist HiSilicon Taishan v110 CPUs arm64: Add MIDR encoding for HiSilicon Taishan CPUs rtc: pcf8523: set xtal load capacitance from DT usb: handle warm-reset port requests on hub resume ALSA: usb-audio: Cleanup DSD whitelist usb: dwc3: gadget: clear DWC3_EP_TRANSFER_STARTED on cmd complete usb: dwc3: gadget: early giveback if End Transfer already completed samples: bpf: fix: seg fault with NULL pointer arg HID: steam: fix deadlock with input devices. HID: steam: fix boot loop with bluetooth firmware NFSv4: Ensure that the state manager exits the loop on SIGKILL HID: Add ASUS T100CHI keyboard dock battery quirks staging: mt7621-pinctrl: use pinconf-generic for 'dt_node_to_map' and 'dt_free_map' scripts/setlocalversion: Improve -dirty check with git-status --no-optional-locks clk: boston: unregister clks on failure in clk_boston_setup() ath10k: assign 'n_cipher_suites = 11' for WCN3990 to enable WPA3 platform/x86: Fix config space access for intel_atomisp2_pm platform/x86: Add the VLV ISP PCI ID to atomisp2_pm HID: i2c-hid: Add Odys Winbook 13 to descriptor override HID: i2c-hid: Ignore input report if there's no data present on Elan touchpanels HID: i2c-hid: Disable runtime PM for LG touchscreen netfilter: ipset: Make invalid MAC address checks consistent Btrfs: fix deadlock on tree root leaf when finding free extent PCI: Fix Switchtec DMA aliasing quirk dmesg noise bcache: fix input overflow to writeback_rate_minimum drm/msm/dpu: handle failures while initializing displays x86/cpu: Add Atom Tremont (Jacobsville) tools/power turbostat: fix goldmont C-state limit decoding usb: dwc2: fix unbalanced use of external vbus-supply HID: i2c-hid: add Direkt-Tek DTLAPY133-1 to descriptor override f2fs: fix to recover inode->i_flags of inode block during POR f2fs: fix to recover inode's i_gc_failures during POR powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages() sc16is7xx: Fix for "Unexpected interrupt: 8" scsi: lpfc: Fix a duplicate 0711 log message number. f2fs: flush quota blocks after turnning it off wil6210: fix freeing of rx buffers in EDMA mode btrfs: tracepoints: Fix wrong parameter order for qgroup events btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() Btrfs: fix memory leak due to concurrent append writes with fiemap Btrfs: fix inode cache block reserve leak on failure to allocate data space dm snapshot: rework COW throttling to fix deadlock dm snapshot: introduce account_start_copy() and account_end_copy() zram: fix race between backing_dev_show and backing_dev_store Conflicts: arch/arm64/include/asm/cputype.h drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c drivers/net/wireless/ath/wil6210/txrx_edma.c drivers/usb/dwc3/gadget.c include/linux/cpu.h kernel/cpu.c Following USB commits were reverted on importing android-4.19.57 into msm-4.19 due to BootTimeRunner failure. android-4.19-q.82 introduced new usb changes [1] that fixed the regression, hence it is safe to restore the reverts. It is done in this merge. 9c423fd89("usb: dwc3: Reset num_trbs after skipping") 385cacd95("usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup") 6edcdd0e6("usb: dwc3: gadget: remove wait_end_transfer") d7ff2e3ff("usb: dwc3: gadget: move requests to cancelled_list") bba5f9878("usb: dwc3: gadget: introduce cancelled_list") 65e1f3403("usb: dwc3: gadget: extract dwc3_gadget_ep_skip_trbs()") 56092bd50("usb: dwc3: gadget: use num_trbs when skipping TRBs on->dequeue()") 2a2b1c4dc("usb: dwc3: gadget: track number of TRBs per request") 420b1237c("usb: dwc3: gadget: combine unaligned and zero flags") 62805d319("Revert "usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup"") [1] a0608eec29("usb: dwc3: gadget: clear DWC3_EP_TRANSFER_STARTED on cmd complete") d0e8b35e91("usb: dwc3: gadget: early giveback if End Transfer already completed") Change-Id: I77c3490d2c1cf7c8233a7e797c6f217f737621a2 Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org> |
||
|
|
1d5cb12a25 |
net: fix sk_page_frag() recursion from memory reclaim
[ Upstream commit 20eb4f29b60286e0d6dc01d9c260b4bd383c58fb ]
sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user. The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.
Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.
[2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
...
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
sock_xmit.isra.24+0xa1/0x170 [nbd]
nbd_send_cmd+0x1d2/0x690 [nbd]
nbd_queue_rq+0x1b5/0x3b0 [nbd]
__blk_mq_try_issue_directly+0x108/0x1b0
blk_mq_request_issue_directly+0xbd/0xe0
blk_mq_try_issue_list_directly+0x41/0xb0
blk_mq_sched_insert_requests+0xa2/0xe0
blk_mq_flush_plug_list+0x205/0x2a0
blk_flush_plug_list+0xc3/0xf0
[1] blk_finish_plug+0x21/0x2e
_xfs_buf_ioapply+0x313/0x460
__xfs_buf_submit+0x67/0x220
xfs_buf_read_map+0x113/0x1a0
xfs_trans_read_buf_map+0xbf/0x330
xfs_btree_read_buf_block.constprop.42+0x95/0xd0
xfs_btree_lookup_get_block+0x95/0x170
xfs_btree_lookup+0xcc/0x470
xfs_bmap_del_extent_real+0x254/0x9a0
__xfs_bunmapi+0x45c/0xab0
xfs_bunmapi+0x15/0x30
xfs_itruncate_extents_flags+0xca/0x250
xfs_free_eofblocks+0x181/0x1e0
xfs_fs_destroy_inode+0xa8/0x1b0
destroy_inode+0x38/0x70
dispose_list+0x35/0x50
prune_icache_sb+0x52/0x70
super_cache_scan+0x120/0x1a0
do_shrink_slab+0x120/0x290
shrink_slab+0x216/0x2b0
shrink_node+0x1b6/0x4a0
do_try_to_free_pages+0xc6/0x370
try_to_free_mem_cgroup_pages+0xe3/0x1e0
try_charge+0x29e/0x790
mem_cgroup_charge_skmem+0x6a/0x100
__sk_mem_raise_allocated+0x18e/0x390
__sk_mem_schedule+0x2a/0x40
[0] tcp_sendmsg_locked+0x8eb/0xe10
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
___sys_sendmsg+0x26d/0x2b0
__sys_sendmsg+0x57/0xa0
do_syscall_64+0x42/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
In [0], tcp_send_msg_locked() was using current->page_frag when it
called sk_wmem_schedule(). It already calculated how many bytes can
be fit into current->page_frag. Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path. Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().
nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress. However, this
confused sk_page_frag() called from [2]. Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].
After [2] used current->page_frag, the offset would be increased by
the used amount. When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.
Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.
v2: Eric didn't like gfp flags being tested twice. Introduce a new
helper gfpflags_normal_context() and combine the two tests.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
0157fbd79d |
mm: Allow only __GFP_CMA allocations from Movable zone
Ensure that only allocations which include __GFP_CMA can be satisfied from zone Movable. This restriction helps reduces the likelihood of a page in the movable zone being pinned which would prevent memory from being offlined. Change-Id: I570970ff848d4a7147f3ccda2b294f08bcdf88d8 Signed-off-by: Liam Mark <lmark@codeaurora.org> |
||
|
|
46f8fca539 |
cma: redirect page allocation to CMA
CMA pages are designed to be used as fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. The userspace pages can be migrated or dropped quickly which giving decent utilization. Change-Id: I6165dda01b705309eebabc6dfa67146b7a95c174 Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Heesub Shin <heesub.shin@samsung.com [lauraa@codeaurora.org: Missing CONFIG_CMA guards, add commit text] Signed-off-by: Laura Abbott <lauraa@codeaurora.org> [lmark@codeaurora.org: resolve conflicts relating to MIGRATE_HIGHATOMIC] Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [swatsrid@codeaurora.org: Fix merge conflicts] Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org> |
||
|
|
263fade51f |
docs/mm: make GFP flags descriptions usable as kernel-doc
This patch adds DOC: headings for GFP flag descriptions and adjusts the formatting to fit sphinx expectations of paragraphs. Link: http://lkml.kernel.org/r/1532626360-16650-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4b33b69595 |
include/linux/gfp.h: fix the annotation of GFP_ZONE_TABLE
When bit is equal to 0x4, it means OPT_ZONE_DMA32 should be got from GFP_ZONE_TABLE. OPT_ZONE_DMA32 shall be equal to ZONE_DMA32 or ZONE_NORMAL according to the status of CONFIG_ZONE_DMA32. Similarly, when bit is equal to 0xc, that means OPT_ZONE_DMA32 should be got with an allocation policy GFP_MOVABLE. So ZONE_DMA32 or ZONE_NORMAL is the possible result value. Link: http://lkml.kernel.org/r/20180601163403.1032-1-yehs2007@zoho.com Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e67d4ca79a |
mm: save two stranded bits in gfp_mask
___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were stranded. Fill the gaps by moving the existing gfp masks around. Link: http://lkml.kernel.org/r/20180516211439.177440-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Thelen <gthelen@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8addc2d00f |
mm: do not warn on offline nodes unless the specific node is explicitly requested
Oscar has noticed that we splat WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9 [...] CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G W E 4.17.0-rc5-next-20180517-1-default+ #66 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Workqueue: kacpi_hotplug acpi_hotplug_work_fn Call Trace: vmemmap_populate+0xf2/0x2ae sparse_mem_map_populate+0x28/0x35 sparse_add_one_section+0x4c/0x187 __add_pages+0xe7/0x1a0 add_pages+0x16/0x70 add_memory_resource+0xa3/0x1d0 add_memory+0xe4/0x110 acpi_memory_device_add+0x134/0x2e0 acpi_bus_attach+0xd9/0x190 acpi_bus_scan+0x37/0x70 acpi_device_hotplug+0x389/0x4e0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x146/0x340 worker_thread+0x47/0x3e0 kthread+0xf5/0x130 ret_from_fork+0x35/0x40 when adding memory to a node that is currently offline. The VM_WARN_ON is just too loud without a good reason. In this particular case we are doing alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order) so we do not insist on allocating from the given node (it is more a hint) so we can fall back to any other populated node and moreover we explicitly ask to not warn for the allocation failure. Soften the warning only to cases when somebody asks for the given node explicitly by __GFP_THISNODE. Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Oscar Salvador <osalvador@techadventures.net> Tested-by: Oscar Salvador <osalvador@techadventures.net> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
453f85d43f |
mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold pages, there is no real useful ordering of pages in the free list that allocation requests can take advantage of. Juding from the users of __GFP_COLD, it is likely that a number of them are the result of copying other sites instead of actually measuring the impact. Remove the __GFP_COLD parameter which simplifies a number of paths in the page allocator. This is potentially controversial but bear in mind that the size of the per-cpu pagelists versus modern cache sizes means that the whole per-cpu list can often fit in the L3 cache. Hence, there is only a potential benefit for microbenchmarks that alloc/free pages in a tight loop. It's even worse when THP is taken into account which has little or no chance of getting a cache-hot page as the per-cpu list is bypassed and the zeroing of multiple pages will thrash the cache anyway. The truncate microbenchmarks are not shown as this patch affects the allocation path and not the free path. A page fault microbenchmark was tested but it showed no sigificant difference which is not surprising given that the __GFP_COLD branches are a miniscule percentage of the fault path. Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2d4894b5d2 |
mm: remove cold parameter from free_hot_cold_page*
Most callers users of free_hot_cold_page claim the pages being released are cache hot. The exception is the page reclaim paths where it is likely that enough pages will be freed in the near future that the per-cpu lists are going to be recycled and the cache hotness information is lost. As no one really cares about the hotness of pages being released to the allocator, just ditch the parameter. The APIs are renamed to indicate that it's no longer about hot/cold pages. It should also be less confusing as there are subtle differences between them. __free_pages drops a reference and frees a page when the refcount reaches zero. free_hot_cold_page handled pages whose refcount was already zero which is non-obvious from the name. free_unref_page should be more obvious. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. [mgorman@techsingularity.net: add pages to head, not tail] Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d8be75663c |
kmemcheck: remove whats left of NOTRACK flags
Now that kmemcheck is gone, we don't need the NOTRACK flags. Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
0ee931c4e3 |
mm: treewide: remove GFP_TEMPORARY allocation flag
GFP_TEMPORARY was introduced by commit
|
||
|
|
dcda9b0471 |
mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
04ec6264f2 |
mm, page_alloc: pass preferred nid instead of zonelist to allocator
The main allocator function __alloc_pages_nodemask() takes a zonelist pointer as one of its parameters. All of its callers directly or indirectly obtain the zonelist via node_zonelist() using a preferred node id and gfp_mask. We can make the code a bit simpler by doing the zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node id instead (gfp_mask is already another parameter). There are some code size benefits thanks to removal of inlined node_zonelist(): bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952) This will also make things simpler if we proceed with converting cpusets to zonelists. Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1bde33e051 |
include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
Igor Stoppa has noticed that __GFP_NOLOCKDEP can use a lower bit. At the time commit |
||
|
|
ac2e8e40ac |
mm: fix spelling error
Fix variable name error in comments. No code changes. Link: http://lkml.kernel.org/r/20170403161655.5081-1-haolee.swjtu@gmail.com Signed-off-by: Hao Lee <haolee.swjtu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7dea19f9ee |
mm: introduce memalloc_nofs_{save,restore} API
GFP_NOFS context is used for the following 5 reasons currently:
- to prevent from deadlocks when the lock held by the allocation
context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because the
allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on other
reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives
Unfortunately overuse of this allocation context brings some problems to
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.
In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.
Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.
Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.
This patch shouldn't introduce any functional changes.
Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
7e7844226f |
lockdep: allow to disable reclaim lockup detection
The current implementation of the reclaim lockup detection can lead to false positives and those even happen and usually lead to tweak the code to silence the lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. ================================= [ INFO: inconsistent lock state ] 4.5.0-rc2+ #4 Tainted: G O --------------------------------- inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs] {RECLAIM_FS-ON-R} state was registered at: mark_held_locks+0x79/0xa0 lockdep_trace_alloc+0xb3/0x100 kmem_cache_alloc+0x33/0x230 kmem_zone_alloc+0x81/0x120 [xfs] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] __xfs_refcount_find_shared+0x75/0x580 [xfs] xfs_refcount_find_shared+0x84/0xb0 [xfs] xfs_getbmap+0x608/0x8c0 [xfs] xfs_vn_fiemap+0xab/0xc0 [xfs] do_vfs_ioctl+0x498/0x670 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x12/0x6f CPU0 ---- lock(&xfs_nondir_ilock_class); <Interrupt> lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/543: stack backtrace: CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 Call Trace: lock_acquire+0xd8/0x1e0 down_write_nested+0x5e/0xc0 xfs_ilock+0x177/0x200 [xfs] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] evict+0xc5/0x190 dispose_list+0x39/0x60 prune_icache_sb+0x4b/0x60 super_cache_scan+0x14f/0x1a0 shrink_slab.part.63.constprop.79+0x1e9/0x4e0 shrink_zone+0x15e/0x170 kswapd+0x4f1/0xa80 kthread+0xf2/0x110 ret_from_fork+0x3f/0x70 To quote Dave: "Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode. So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock..." This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special usecase IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add "I know what I am doing flag" and mark problematic places. This would prevent from abusing GFP_NOFS flag which has a runtime effect even on configurations which have lockdep disabled. Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request. While we are at it also make sure that the radix tree doesn't accidentaly override tags stored in the upper part of the gfp_mask. Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ca96b62534 |
mm: alloc_contig_range: allow to specify GFP mask
Currently alloc_contig_range assumes that the compaction should be done with the default GFP_KERNEL flags. This is probably right for all current uses of this interface, but may change as CMA is used in more use-cases (including being the default DMA memory allocator on some platforms). Change the function prototype, to allow for passing through the GFP mask set by upper layers. Also respect global restrictions by applying memalloc_noio_flags to the passed in flags. Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alexander Graf <agraf@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2976db8018 |
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
This patch does two things. First it goes through and renames the __page_frag prefixed functions to __page_frag_cache so that we can be clear that we are draining or refilling the cache, not the frags themselves. Second we drop the order parameter from __page_frag_cache_drain since we don't actually need to pass it since all fragments are either order 0 or must be a compound page. Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8c2dd3e4a4 |
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
Patch series "Page fragment updates", v4. This patch series takes care of a few cleanups for the page fragments API. First we do some renames so that things are much more consistent. First we move the page_frag_ portion of the name to the front of the functions names. Secondly we split out the cache specific functions from the other page fragment functions by adding the word "cache" to the name. Finally I added a bit of documentation that will hopefully help to explain some of this. I plan to revisit this later as we get things more ironed out in the near future with the changes planned for the DMA setup to support eXpress Data Path. This patch (of 3): This patch renames the page frag functions to be more consistent with other APIs. Specifically we place the name page_frag first in the name and then have either an alloc or free call name that we append as the suffix. This makes it a bit clearer in terms of naming. In addition we drop the leading double underscores since we are technically no longer a backing interface and instead the front end that is called from the networking APIs. Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
41b6167e8f |
mm: get rid of __GFP_OTHER_NODE
The flag was introduced by commit
|
||
|
|
44fdffd705 |
mm: add support for releasing multiple instances of a page
Add a function that allows us to batch free a page that has multiple references outstanding. Specifically this function can be used to drop a page being used in the page frag alloc cache. With this drivers can make use of functionality similar to the page frag alloc cache without having to do any workarounds for the fact that there is no function that frees multiple references. Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> Cc: Helge Deller <deller@gmx.de> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Keguang Zhang <keguang.zhang@gmail.com> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Miao <realmz6@gmail.com> Cc: Tobias Klauser <tklauser@distanz.ch> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2516035499 |
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.
We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
The patch effectively changes the current GFP_TRANSHUGE users as
follows:
* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")
* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.
As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.
* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.
* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).
[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
4949148ad4 |
mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b11a7b9410 |
mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.
The GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build. We need to be careful to only build the table for zones that
have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
purpose. This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.
Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words
even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
consume another bit in page->flags (expand ZONES_WIDTH) with room to
spare.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes:
|
||
|
|
444eb2a449 |
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows;
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41
Which is substantial. Now, the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)
The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1
Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98
And once again we look at the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b14a1ef58e |
mm: remove unnecessary description about a non-exist gfp flag
Since __GFP_NOACCOUNT was removed by commit
|
||
|
|
7cf91a98e6 |
mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
There is a performance drop report due to hugepage allocation and in there half of cpu time are spent on pageblock_pfn_to_page() in compaction [1]. In that workload, compaction is triggered to make hugepage but most of pageblocks are un-available for compaction due to pageblock type and skip bit so compaction usually fails. Most costly operations in this case is to find valid pageblock while scanning whole zone range. To check if pageblock is valid to compact, valid pfn within pageblock is required and we can obtain it by calling pageblock_pfn_to_page(). This function checks whether pageblock is in a single zone and return valid pfn if possible. Problem is that we need to check it every time before scanning pageblock even if we re-visit it and this turns out to be very expensive in this workload. Although we have no way to skip this pageblock check in the system where hole exists at arbitrary position, we can use cached value for zone continuity and just do pfn_to_page() in the system where hole doesn't exist. This optimization considerably speeds up in above workload. Before vs After Max: 1096 MB/s vs 1325 MB/s Min: 635 MB/s 1015 MB/s Avg: 899 MB/s 1194 MB/s Avg is improved by roughly 30% [2]. [1]: http://www.spinics.net/lists/linux-mm/msg97378.html [2]: https://lkml.org/lkml/2015/12/9/23 [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
420adbe9fc |
mm, tracing: unify mm flags handling in tracepoints and printk
In tracepoints, it's possible to print gfp flags in a human-friendly format through a macro show_gfp_flags(), which defines a translation array and passes is to __print_flags(). Since the following patch will introduce support for gfp flags printing in printk(), it would be nice to reuse the array. This is not straightforward, since __print_flags() can't simply reference an array defined in a .c file such as mm/debug.c - it has to be a macro to allow the macro magic to communicate the format to userspace tools such as trace-cmd. The solution is to create a macro __def_gfpflag_names which is used both in show_gfp_flags(), and to define the gfpflag_names[] array in mm/debug.c. On the other hand, mm/debug.c also defines translation tables for page flags and vma flags, and desire was expressed (but not implemented in this series) to use these also from tracepoints. Thus, this patch also renames the events/gfpflags.h file to events/mmflags.h and moves the table definitions there, using the same macro approach as for gfpflags. This allows translating all three kinds of mm-specific flags both in tracepoints and printk. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
14e0a214d6 |
tools, perf: make gfp_compact_table up to date
When updating tracing's show_gfp_flags() I have noticed that perf's gfp_compact_table is also outdated. Fill in the missing flags and place a note in gfp.h to increase chance that future updates are synced. Convert the __GFP_X flags from "GFP_X" to "__GFP_X" strings in line with the previous patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1f7866b4ae |
mm, tracing: make show_gfp_flags() up to date
The show_gfp_flags() macro provides human-friendly printing of gfp flags in tracepoints. However, it is somewhat out of date and missing several flags. This patches fills in the missing flags, and distinguishes properly between GFP_ATOMIC and __GFP_ATOMIC which were both translated to "GFP_ATOMIC". More generally, all __GFP_X flags which were previously printed as GFP_X, are now printed as __GFP_X, since ommiting the underscores results in output that doesn't actually match the source code, and can only lead to confusion. Where both variants are defined equal (e.g. _DMA and _DMA32), the variant without underscores are preferred. Also add a note in gfp.h so hopefully future changes will be synced better. __GFP_MOVABLE is defined twice in include/linux/gfp.h with different comments. Leave just the newer one, which was intended to replace the old one. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
080fe2068e |
mm, hugetlb: don't require CMA for runtime gigantic pages
Commit
|
||
|
|
543dfb2df8 |
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
Running sparse on drivers/staging/lustre results in dozens of warnings: include/linux/gfp.h:281:41: warning: odd constant _Bool cast (400000 becomes 1) Use "!!" to explicitly convert to bool and get rid of the warning. Signed-off-by: Joshua Clayton <stillcompiling@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c00eb15a89 |
mm/zonelist: enumerate zonelists array index
Hardcoding index to zonelists array in gfp_zonelist() is not a good idea, let's enumerate it to improve readability. No functional change. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator] Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9bb7e620e |
memcg: only account kmem allocations marked as __GFP_ACCOUNT
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
20b5c30398 |
Revert "gfp: add __GFP_NOACCOUNT"
This reverts commit
|
||
|
|
21fa844279 |
mm: fix up sparse warning in gfpflags_allow_blocking
sparse says:
include/linux/gfp.h:274:26: warning: incorrect type in return expression (different base types)
include/linux/gfp.h:274:26: expected bool
include/linux/gfp.h:274:26: got restricted gfp_t
...add a forced cast to silence the warning.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
dd56b04642 |
mm: page_alloc: hide some GFP internals and document the bits and flag combinations
Andrew stated the following We have quite a history of remote parts of the kernel using weird/wrong/inexplicable combinations of __GFP_ flags. I tend to think that this is because we didn't adequately explain the interface. And I don't think that gfp.h really improved much in this area as a result of this patchset. Could you go through it some time and decide if we've adequately documented all this stuff? This patches first moves some GFP flag combinations that are part of the MM internals to mm/internal.h. The rest of the patch documents the __GFP_FOO bits under various headings and then documents the flag combinations. It will not help callers that are brain damaged but the clarity might motivate some fixes and avoid future mistakes. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71baba4b92 |
mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4011337083 |
mm: page_alloc: remove GFP_IOFS
GFP_IOFS was intended to be shorthand for clearing two flags, not a set of allocation flags. There is only one user of this flag combination now and there appears to be no reason why Lustre had to be protected from reclaim stalls. As none of the sites appear to be atomic, this patch simply deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or GFP_NOIO as appropriate. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d0164adc89 |
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
016c13daa5 |
mm, page_alloc: use masks and shifts when converting GFP flags to migrate types
This patch redefines which GFP bits are used for specifying mobility and the order of the migrate types. Once redefined it's possible to convert GFP flags to a migrate type with a simple mask and shift. The only downside is that readers of OOM kill messages and allocation failures may have been used to the existing values but scripts/gfp-translate will help. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
82c1fc7147 |
mm: use numa_mem_id() in alloc_pages_node()
alloc_pages_node() might fail when called with NUMA_NO_NODE and __GFP_THISNODE on a CPU belonging to a memoryless node. To make the local-node fallback more robust and prevent such situations, use numa_mem_id(), which was introduced for similar scenarios in the slab context. Suggested-by: Christoph Lameter <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |