bka
13739 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e8f74fb113 |
mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
Currently SLUB uses a work scheduled after an RCU grace period to
deactivate a non-root kmem_cache. This mechanism can be reused for
kmem_caches release, but requires generalization for SLAB case.
Introduce kmemcg_cache_deactivate() function, which calls
allocator-specific __kmem_cache_deactivate() and schedules execution of
__kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
context after an rcu grace period.
Here is the new calling scheme:
kmemcg_cache_deactivate()
__kmemcg_cache_deactivate() SLAB/SLUB-specific
kmemcg_rcufn() rcu
kmemcg_workfn() work
__kmemcg_cache_deactivate_after_rcu() SLAB/SLUB-specific
instead of:
__kmemcg_cache_deactivate() SLAB/SLUB-specific
slab_deactivate_memcg_cache_rcu_sched() SLUB-only
kmemcg_rcufn() rcu
kmemcg_workfn() work
kmemcg_cache_deact_after_rcu() SLUB-only
For consistency, all allocator-specific functions start with "__".
Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
bd9e610f6d |
mm/page_alloc: Disable pcp lists checks on !DEBUG_VM
Reference: https://lore.kernel.org/all/20230201162549.68384-1-halbuer@sra.uni-hannover.de/T/#m2d0dccbb7653a8761a657ee046766dcd56e35df9 Signed-off-by: Alexander Winkowski <dereference23@outlook.com> |
||
|
|
8b9fba7141 |
UPSTREAM: mm/vmscan.c: use update_lru_size() in update_lru_sizes()
We already defined the helper update_lru_size(). Let's use this to reduce code duplication. Moto-CRs-Fixed:(CR) Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200331221550.1011-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a892cb6b977ffe209683809e5e9d627656d20aa8) Bug: 228114874 Signed-off-by: Yu Zhao <yuzhao@google.org> Change-Id: I7f0b264466a00539302476b56ac9759d26f7de50 Reviewed-on: https://gerrit.mot.com/2268683 SME-Granted: SME Approvals Granted SLTApproved: Slta Waiver Tested-by: Jira Key Reviewed-by: Xiangpo Zhao <zhaoxp3@motorola.com> Submit-Approved: Jira Key Reviewed-on: https://gerrit.mot.com/2480990 |
||
|
|
411378c37f |
BACKPORT: mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()
Add new APIs to assert that mmap_sem is held. Using this instead of rwsem_is_locked and lockdep_assert_held[_write] makes the assertions more tolerant of future changes to the lock type. Change-Id: I96e548467787c2f71138cfe3e8139d2cbcba8b35 Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
172dec8a20 |
BACKPORT: locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:
b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")
@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.
Change-Id: Ic64f4522a848a31195e546ff531d59201e6e8deb
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
213670f0fe |
BACKPORT: mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table manipulation functions. Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and make the latter include asm/pgtable.h. Change-Id: I8a69883a0091366839170f569a44e12544327183 Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d4f39ef53d |
ANDROID: syscall_check: add vendor hook for mmap syscall
Through this vendor hook, we can get the timing to check current running task for the validation of its credential and related operations. Bug: 191291287 Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Change-Id: If20bd8bb8311ad10a374033734fbdc7ef61a7704 |
||
|
|
b8ba613d40 |
BACKPORT: mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Change-Id: If729000ea8cedab7079ccc1350db26ed71f0df92 Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c6aa1292ca |
Merge branch 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip into android-4.19.y-mediatek
* 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip: CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree Update localversion-st, tree is up-to-date with 5.4.298. f2fs: fix to do sanity check on ino and xnid squashfs: fix memory leak in squashfs_fill_super pNFS: Handle RPC size limit for layoutcommits wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect usb: core: usb_submit_urb: downgrade type check udf: Verify partition map count f2fs: fix to avoid panic in f2fs_evict_inode usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS" net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version() HID: asus: fix UAF via HID_CLAIMED_INPUT validation efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare sctp: initialize more fields in sctp_v6_from_sk() net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts net/mlx5e: Set local Xoff after FW update net: dlink: fix multicast stats being counted incorrectly atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control(). net/atm: remove the atmdev_ops {get, set}sockopt methods Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced powerpc/kvm: Fix ifdef to remove build warning net: ipv4: fix regression in local-broadcast routes vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put() scsi: core: sysfs: Correct sysfs attributes access rights ftrace: Fix potential warning in trace_printk_seq during ftrace_dump alloc_fdtable(): change calling conventions. ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add ALSA: usb-audio: Fix size validation in convert_chmap_v3() scsi: qla4xxx: Prevent a potential error pointer dereference usb: xhci: Fix slot_id resource race conflict nfs: fix UAF in direct writes NFS: Fix up commit deadlocks Bluetooth: fix use-after-free in device_for_each_child() selftests: forwarding: tc_actions.sh: add matchall mirror test codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog() sch_qfq: make qfq_qlen_notify() idempotent sch_hfsc: make hfsc_qlen_notify() idempotent sch_drr: make drr_qlen_notify() idempotent btrfs: populate otime when logging an inode item media: venus: hfi: explicitly release IRQ during teardown f2fs: fix to avoid out-of-boundary access in dnode page media: venus: protect against spurious interrupts during probe media: venus: vdec: Clamp param smaller than 1fps and bigger than 240. drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt() media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free() ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig usb: musb: omap2430: fix device leak at unbind NFS: Fix the setting of capabilities when automounting a new filesystem NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode() NFSv4: Fix nfs4_bitmap_copy_adjust() usb: typec: fusb302: cache PD RX state cdc-acm: fix race between initial clearing halt and open USB: cdc-acm: do not log successful probe on later errors nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm() tracing: Add down_write(trace_event_sem) when adding trace event usb: hub: Don't try to recover devices lost during warm reset. usb: hub: avoid warm port reset during USB3 disconnect x86/mce/amd: Add default names for MCA banks and blocks iio: hid-sensor-prox: Fix incorrect OFFSET calculation mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage() net: usbnet: Fix the wrong netif_carrier_on() call net: usbnet: Avoid potential RCU stall on LINK_CHANGE event PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value kbuild: Add KBUILD_CPPFLAGS to as-option invocation kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS kbuild: Add CLANG_FLAGS to as-instr mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation kbuild: Update assembler calls to use proper flags and language target ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS usb: dwc3: Ignore late xferNotReady event to prevent halt timeout USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles usb: storage: realtek_cr: Use correct byte order for bcs->Residue USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive iio: proximity: isl29501: fix buffered read on big-endian systems ftrace: Also allocate and copy hash for reading of filter files fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable() fs/buffer: fix use-after-free when call bh_read() helper drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3 media: venus: Add a check for packet size after reading from shared memory media: ov2659: Fix memory leaks in ov2659_probe() media: usbtv: Lock resolution while streaming media: gspca: Add bounds checking to firmware parser jbd2: prevent softlockup in jbd2_log_do_checkpoint() PCI: endpoint: Fix configfs group removal on driver teardown PCI: endpoint: Fix configfs group list head handling mtd: rawnand: fsmc: Add missing check after DMA map wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table() zynq_fpga: use sgtable-based scatterlist wrappers ata: libata-scsi: Fix ata_to_sense_error() status handling ext4: fix reserved gdt blocks handling in fsmap ext4: fix fsmap end of range reporting with bigalloc ext4: check fast symlink for ea_inode correctly Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()" vt: defkeymap: Map keycodes above 127 to K_HOLE usb: gadget: udc: renesas_usb3: fix device leak at unbind usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init() m68k: Fix lost column on framebuffer debug console serial: 8250: fix panic due to PSLVERR media: uvcvideo: Do not mark valid metadata as invalid media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format() btrfs: fix log tree replay failure due to file with 0 links and extents thunderbolt: Fix copy+paste error in match_service_id() misc: rtsx: usb: Ensure mmc child device is active when card is present scsi: lpfc: Remove redundant assignment to avoid memory leak rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe pNFS: Fix uninited ptr deref in block/scsi layout pNFS: Fix disk addr range check in block/scsi layout pNFS: Fix stripe mapping in block/scsi layout ipmi: Fix strcpy source and destination the same kconfig: lxdialog: fix 'space' to (de)select options kconfig: gconf: fix potential memory leak in renderer_edited() kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed() scsi: aacraid: Stop using PCI_IRQ_AFFINITY scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans kconfig: nconf: Ensure null termination where strncpy is used kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c PCI: pnv_php: Work around switches with broken presence detection media: uvcvideo: Fix bandwidth issue for Alcor camera media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb() media: usb: hdpvr: disable zero-length read messages media: tc358743: Increase FIFO trigger level to 374 media: tc358743: Return an appropriate colorspace from tc358743_set_fmt media: tc358743: Check I2C succeeded during probe pinctrl: stm32: Manage irq affinity settings scsi: mpt3sas: Correctly handle ATA device errors RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask() MIPS: Don't crash in stack_top() for tasks without ABI or vDSO jfs: upper bound check of tree index in dbAllocAG jfs: Regular file corruption check jfs: truncate good inode pages when hard link is 0 scsi: bfa: Double-free fix MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free} watchdog: dw_wdt: Fix default timeout fs/orangefs: use snprintf() instead of sprintf() scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr vhost: fail early when __vhost_add_used() fails uapi: in6: restore visibility of most IPv6 socket options net: ncsi: Fix buffer overflow in fetching version id net: dsa: b53: fix b53_imp_vlan_setup for BCM5325 net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs wifi: iwlegacy: Check rate_idx range after addition netmem: fix skb_frag_address_safe with unreadable skbs wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`. wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd() net: fec: allow disable coalescing (powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer s390/stp: Remove udelay from stp_sync_clock() wifi: iwlwifi: mvm: fix scan request validation net: thunderx: Fix format-truncation warning in bgx_acpi_match_id() net: ipv4: fix incorrect MTU in broadcast routes wifi: cfg80211: Fix interface type validation et131x: Add missing check after DMA map be2net: Use correct byte order and format string for TCP seq and ack_seq s390/time: Use monotonic clock in get_cycles() wifi: cfg80211: reject HTC bit for management frames ktest.pl: Prevent recursion of default variable options ASoC: codecs: rt5640: Retry DEVICE_ID verification ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop() ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4 ASoC: hdac_hdmi: Rate limit logging on connection and disconnection mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode() ACPI: processor: fix acpi_object initialization PM: sleep: console: Fix the black screen issue thermal: sysfs: Return ENODATA instead of EAGAIN for reads selftests: tracing: Use mutex_unlock for testing glob filter ARM: tegra: Use I/O memcpy to write to IRAM gpio: tps65912: check the return value of regmap_update_bits() ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed cpufreq: Exit governor when failed to start old governor usb: xhci: Avoid showing errors during surprise removal usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command usb: xhci: Avoid showing warnings for dying controller selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t usb: xhci: print xhci->xhc_state when queue_command failed securityfs: don't pin dentries twice, once is enough... hfs: fix not erasing deleted b-tree node issue drbd: add missing kref_get in handle_write_conflicts arm64: Handle KCOV __init vs inline mismatches hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file() hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc() hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read() hfs: fix slab-out-of-bounds in hfs_bnode_read() sctp: linearize cloned gso packets in sctp_rcv netfilter: ctnetlink: fix refcount leak on table dump udp: also consider secpath when evaluating ipsec use for checksumming fs: Prevent file descriptor table allocations exceeding INT_MAX sunvdc: Balance device refcount in vdc_port_mpgroup_check NFSD: detect mismatch of file handle and delegation stateid in OPEN op net: dpaa: fix device leak when querying time stamp info net: gianfar: fix device leak when querying time stamp info netlink: avoid infinite retry looping in netlink_unicast() ALSA: usb-audio: Validate UAC3 cluster segment descriptors ALSA: usb-audio: Validate UAC3 power domain descriptors, too usb: gadget : fix use-after-free in composite_dev_cleanup() MIPS: mm: tlb-r4k: Uniquify TLB entries on init USB: serial: option: add Foxconn T99W709 vsock: Do not allow binding to VMADDR_PORT_ANY net/packet: fix a race in packet_set_ring() and packet_notifier() perf/core: Prevent VMA split of buffer mappings perf/core: Exit early on perf_mmap() fail perf/core: Don't leak AUX buffer refcount on allocation failure pptp: fix pptp_xmit() error path smb: client: let recv_done() cleanup before notifying the callers. benet: fix BUG when creating VFs ipv6: reject malicious packets in ipv6_gso_segment() pptp: ensure minimal skb length in pptp_xmit() netpoll: prevent hanging NAPI when netcons gets enabled NFS: Fix filehandle bounds checking in nfs_fh_to_dentry() pci/hotplug/pnv-php: Wrap warnings in macro pci/hotplug/pnv-php: Improve error msg on power state change failure usb: chipidea: udc: fix sleeping function called from invalid context f2fs: fix to avoid out-of-boundary access in devs.path f2fs: fix to avoid UAF in f2fs_sync_inode_meta() rtc: pcf8563: fix incorrect maximum clock rate handling rtc: hym8563: fix incorrect maximum clock rate handling rtc: ds1307: fix incorrect maximum clock rate handling mtd: rawnand: atmel: set pmecc data setup time mtd: rawnand: atmel: Fix dma_mapping_error() address jfs: fix metapage reference count leak in dbAllocCtl fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref crypto: qat - fix seq_file position update in adf_ring_next() dmaengine: nbpfaxi: Add missing check after DMA map dmaengine: mv_xor: Fix missing check after DMA map and missing unmap fs/orangefs: Allow 2 more characters in do_c_string() crypto: img-hash - Fix dma_unmap_sg() nents value scsi: isci: Fix dma_unmap_sg() nents value scsi: mvsas: Fix dma_unmap_sg() nents value scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value perf tests bp_account: Fix leaked file descriptor crypto: ccp - Fix crash when rebind ccp device for ccp.ko pinctrl: sunxi: Fix memory leak on krealloc failure power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set clk: davinci: Add NULL check in davinci_lpsc_clk_register() mtd: fix possible integer overflow in erase_xfer() crypto: marvell/cesa - Fix engine load inaccuracy PCI: rockchip-host: Fix "Unexpected Completion" log message vrf: Drop existing dst reference in vrf_ip6_input_dst netfilter: xt_nfacct: don't assume acct name is null-terminated can: kvaser_usb: Assign netdev.dev_port based on device channel index wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()" mwl8k: Add missing check after DMA map wifi: rtl8xxxu: Fix RX skb size for aggregation disabled net/sched: Restrict conditions for adding duplicating netems to qdisc tree arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX netfilter: nf_tables: adjust lockdep assertions handling drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value m68k: Don't unregister boot console needlessly tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range iwlwifi: Add missing check for alloc_ordered_workqueue wifi: iwlwifi: Fix memory leak in iwl_mvm_init() wifi: rtl818x: Kill URBs before clearing tx status queue caif: reduce stack size, again staging: nvec: Fix incorrect null termination of battery manufacturer samples: mei: Fix building on musl libc usb: early: xhci-dbc: Fix early_ioremap leak Revert "vmci: Prevent the dispatching of uninitialized payloads" pps: fix poll support vmci: Prevent the dispatching of uninitialized payloads staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc() ARM: dts: vfxxx: Correctly use two tuples for timer address ASoC: ops: dynamically allocate struct snd_ctl_elem_value hfsplus: remove mutex_lock check in hfsplus_free_extents ASoC: Intel: fix SND_SOC_SOF dependencies ethernet: intel: fix building with large NR_CPUS usb: phy: mxs: disconnect line when USB charger is attached usb: chipidea: udc: protect usb interrupt enable usb: chipidea: udc: add new API ci_hdrc_gadget_connect comedi: comedi_test: Fix possible deletion of uninitialized timers nilfs2: reject invalid file types when reading inodes i2c: qup: jump out of the loop in case of timeout net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class net: appletalk: Fix use-after-free in AARP proxy probe net: appletalk: fix kerneldoc warnings RDMA/core: Rate limit GID cache warning messages usb: hub: fix detection of high tier USB3 devices behind suspended hubs net_sched: sch_sfq: reject invalid perturb period net_sched: sch_sfq: move the limit validation net_sched: sch_sfq: use a temporary work area for validating configuration net_sched: sch_sfq: don't allow 1 packet limit net_sched: sch_sfq: handle bigger packets net_sched: sch_sfq: annotate data-races around q->perturb_period power: supply: bq24190_charger: Fix runtime PM imbalance on error xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS virtio-net: ensure the received length does not exceed allocated size usb: dwc3: qcom: Don't leave BCR asserted usb: musb: fix gadget state on disconnect net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout Bluetooth: SMP: If an unallowed command is received consider it a failure Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb() usb: net: sierra: check for no status endpoint net/sched: sch_qfq: Fix race condition on qfq_aggregate net: emaclite: Fix missing pointer increment in aligned_read() comedi: Fix use of uninitialized data in insn_rw_emulate_bits() comedi: Fix some signed shift left operations comedi: das6402: Fix bit shift out of bounds comedi: das16m1: Fix bit shift out of bounds comedi: aio_iiro_16: Fix bit shift out of bounds comedi: pcl812: Fix bit shift out of bounds iio: adc: max1363: Reorder mode_list[] entries iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[] soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled soc: aspeed: lpc-snoop: Cleanup resources in stack-order mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models memstick: core: Zero initialize id_reg in h_memstick_read_dev_id() isofs: Verify inode mode when loading from disk dmaengine: nbpfaxi: Fix memory corruption in probe() af_packet: fix soft lockup issue caused by tpacket_snd() af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept() HID: core: do not bypass hid_hw_raw_request HID: core: ensure __hid_request reserves the report ID as the first byte HID: core: ensure the allocated report buffer can contain the reserved report ID pch_uart: Fix dma_sync_sg_for_device() nents value Input: xpad - set correct controller type for Acer NGR200 i2c: stm32: fix the device used for the DMA map usb: gadget: configfs: Fix OOB read on empty string write USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI USB: serial: option: add Foxconn T99W640 USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition dma-mapping: add generic helpers for mapping sgtable objects usb: renesas_usbhs: Flush the notify_hotplug_work gpio: rcar: Use raw_spinlock to protect register access Change-Id: Ia6b8b00918487999c648f298d3550afc7eaaae03 Signed-off-by: bengris32 <bengris32@protonmail.ch> |
||
|
|
5fa1416882 |
fixup! BACKPORT: FROMGIT: hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp
Commit
|
||
|
|
63b22e1fed |
BACKPORT: mm: memcontrol: handle div0 crash race condition in memory.low
Tejun reports seeing rare div0 crashes in memory.low stress testing:
RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
FS: 0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
Call Trace:
shrink_node+0x1e5/0x6c0
balance_pgdat+0x32d/0x5f0
kswapd+0x1d7/0x3d0
kthread+0x11c/0x160
ret_from_fork+0x1f/0x30
This happens when parent_usage == siblings_protected.
We check that usage is bigger than protected, which should imply
parent_usage being bigger than siblings_protected. However, we don't
read (or even update) these values atomically, and they can be out of
sync as the memory state changes under us. A bit of fluctuation around
the target protection isn't a big deal, but we need to handle the div0
case.
Check the parent state explicitly to make sure we have a reasonable
positive value for the divisor.
Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Change-Id: I107e15cad76f603a830ef97905c4c3678e784fec
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
6aad4e41d8 |
BACKPORT: mm: memcontrol: fix memory.low proportional distribution
Patch series "mm: memcontrol: recursive memory.low protection", v3.
The current memory.low (and memory.min) semantics require protection to be
assigned to a cgroup in an untinterrupted chain from the top-level cgroup
all the way to the leaf.
In practice, we want to protect entire cgroup subtrees from each other
(system management software vs. workload), but we would like the VM to
balance memory optimally *within* each subtree, without having to make
explicit weight allocations among individual components. The current
semantics make that impossible.
They also introduce unmanageable complexity into more advanced resource
trees. For example:
host root
`- system.slice
`- rpm upgrades
`- logging
`- workload.slice
`- a container
`- system.slice
`- workload.slice
`- job A
`- component 1
`- component 2
`- job B
At a host-level perspective, we would like to protect the outer
workload.slice subtree as a whole from rpm upgrades, logging etc. But for
that to be effective, right now we'd have to propagate it down through the
container, the inner workload.slice, into the job cgroup and ultimately
the component cgroups where memory is actually, physically allocated.
This may cross several tree delegation points and namespace boundaries,
which make such a setup near impossible.
CPU and IO on the other hand are already distributed recursively. The
user would simply configure allowances at the host level, and they would
apply to the entire subtree without any downward propagation.
To enable the above-mentioned usecases and bring memory in line with other
resource controllers, this patch series extends memory.low/min such that
settings apply recursively to the entire subtree. Users can still assign
explicit shares in subgroups, but if they don't, any ancestral protection
will be distributed such that children compete freely amongst each other -
as if no memory control were enabled inside the subtree - but enjoy
protection from neighboring trees.
In the above example, the user would then be able to configure shares of
CPU, IO and memory at the host level to comprehensively protect and
isolate the workload.slice as a whole from system.slice activity.
Patch #1 fixes an existing bug that can give a cgroup tree more protection
than it should receive as per ancestor configuration.
Patch #2 simplifies and documents the existing code to make it easier to
reason about the changes in the next patch.
Patch #3 finally implements recursive memory protection semantics.
Because of a risk of regressing legacy setups, the new semantics are
hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
More details in patch #3.
This patch (of 3):
When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own declared
protection:
low_usage = min(memory.low, memory.current)
elow = parent_elow * (low_usage / siblings_low_usage)
However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.
Consider three groups where two are in excess of their protection:
A/memory.low = 10G
A/A1/memory.low = 10G, memory.current = 20G
A/A2/memory.low = 10G, memory.current = 20G
A/A3/memory.low = 10G, memory.current = 8G
siblings_low_usage = 8G (only A3 contributes)
A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
(the 12.5G are capped to the explicit memory.low setting of 10G)
With that, the sum of all awarded protection below A is 30G, when A
only grants 10G for the entire subtree.
What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.
When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.
But why did we do it like this in the first place?
The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:
A/memory.low = 2G, memory.current = 4G
A/A1/memory.low = 3G, memory.current = 2G
A/A2/memory.low = 1G, memory.current = 2G
siblings_low_usage = 2G (only A1 contributes)
A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
While the children combined are overcomitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.
However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
[ If A weren't overcommitted, it wouldn't make a difference since each
cgroup would just get the protection it declares:
A/memory.low = 2G, memory.current = 3G
A/A1/memory.low = 1G, memory.current = 1G
A/A2/memory.low = 1G, memory.current = 2G
With the current calculation:
siblings_low_usage = 1G (only A1 contributes)
A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
Including excess groups in siblings_low_usage:
siblings_low_usage = 2G
A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.
After this patch, the effective memory.low distribution from the
example above would be as follows:
A/memory.low = 10G
A/A1/memory.low = 10G, memory.current = 20G
A/A2/memory.low = 10G, memory.current = 20G
A/A3/memory.low = 10G, memory.current = 8G
siblings_low_usage = 28G
A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
Fixes:
|
||
|
|
6218891e50 |
UPSTREAM: mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
In a rare case, flush_tlb_kernel_range() could be called with a start higher than the end. In vm_remove_mappings(), in case page_address() returns 0 for all pages (for example they were all in highmem), _vm_unmap_aliases() will be called with start = ULONG_MAX, end = 0 and flush = 1. If at the same time, the vmalloc purge operation is triggered by something else while the current operation is between remove_vm_area() and _vm_unmap_aliases(), then the vm mapping just removed will be already purged. In this case the call of vm_unmap_aliases() may not find any other mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So only set flush = true if we find something in the direct mapping that we need to flush, and this way this can't happen. Change-Id: Ie2f25a33a18853927ea7d181b3da24ee46c12250 Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Meelis Roos <mroos@linux.ee> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions") Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1aa1a3c33c |
UPSTREAM: mm/vmalloc: Fix calculation of direct map addr range
The calculation of the direct map address range to flush was wrong. This could cause the RO direct map alias to not get flushed. Today this shouldn't be a problem because this flush is only needed on x86 right now and the spurious fault handler will fix cached RO->RW translations. In the future though, it could cause the permissions to remain RO in the TLB for the direct map alias, and then the page would return from the page allocator to some other component as RO and cause a crash. So fix fix the address range calculation so the flush will include the direct map range. Change-Id: If4daf9963b60f1218bef480e42e334d294ec60e0 Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Meelis Roos <mroos@linux.ee> Cc: Nadav Amit <namit@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions") Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
d8090828f4 |
UPSTREAM: mm: remove vmalloc_user_node_flags
Open code it in __bpf_map_area_alloc, which is the only caller. Also clean up __bpf_map_area_alloc to have a single vmalloc call with slightly different flags instead of the current two different calls. For this to compile for the nommu case add a __vmalloc_node_range stub to nommu.c. [akpm@linux-foundation.org: fix nommu.c build] Change-Id: Ic44eb2e1ae04523a6d32533c8d5bb15a281f1914 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200414131348.444715-27-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1ee5f73d78 |
UPSTREAM: mm: remove __vmalloc_node_flags_caller
Just use __vmalloc_node instead which gets and extra argument. To be able to to use __vmalloc_node in all caller make it available outside of vmalloc and implement it in nommu.c. [akpm@linux-foundation.org: fix nommu build] Change-Id: Ie9839e4ad6222037f0e9697e77e0395f92a8ccef Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200414131348.444715-25-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
35712b66ec |
UPSTREAM: mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()
vmalloc_user*() calls differ from normal vmalloc() only in that they set VM_USERMAP flags for the area. During the whole history of vmalloc.c changes now it is possible simply to pass VM_USERMAP flags directly to __vmalloc_node_range() call instead of finding the area (which obviously takes time) after the allocation. Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de Change-Id: Icdaf97ed7d05d9473caddd722a4a9c535ff19f49 Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2b3bb6cde3 |
UPSTREAM: mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA
This patch repeats the original one from David S Miller:
|
||
|
|
b0f862608a |
UPSTREAM: mm: remove both instances of __vmalloc_node_flags
The real version just had a few callers that can open code it and remove one layer of indirection. The nommu stub was public but only had a single caller, so remove it and avoid a CONFIG_MMU ifdef in vmalloc.h. Change-Id: I9e313b1ab8aeca02afc7d45a9b1f51ff6bae51a3 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-24-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
fe82b0c6bb |
UPSTREAM: mm: remove the prot argument to __vmalloc_node
This is always PAGE_KERNEL now. Change-Id: I1491f2d985238d959f32aac8777426d1ee526d35 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-23-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b99a3a1e82 |
UPSTREAM: zsfold: Convert zsfold to use the new mount API
Convert the zsfold filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Change-Id: Ia3da9e9fd2ef5214c4e7fb71228f8b90c15f9e71 Signed-off-by: David Howells <dhowells@redhat.com> |
||
|
|
a818e2b380 |
UPSTREAM: z3fold: don't bother with dentry_operations
Don't bother with dentry_operations as no dentry is ever allocated. Change-Id: If12563c66cfa3711388546b9e6e445c3b794d871 Signed-off-by: David Howells <dhowells@redhat.com> |
||
|
|
046b82127a |
UPSTREAM: mm/z3fold.c: support page migration
Now that we are not using page address in handles directly, we can make z3fold pages movable to decrease the memory fragmentation z3fold may create over time. This patch starts advertising non-headless z3fold pages as movable and uses the existing kernel infrastructure to implement moving of such pages per memory management subsystem's request. It thus implements 3 required callbacks for page migration: * isolation callback: z3fold_page_isolate(): try to isolate the page by removing it from all lists. Pages scheduled for some activity and mapped pages will not be isolated. Return true if isolation was successful or false otherwise * migration callback: z3fold_page_migrate(): re-check critical conditions and migrate page contents to the new page provided by the memory subsystem. Returns 0 on success or negative error code otherwise * putback callback: z3fold_page_putback(): put back the page if z3fold_page_migrate() for it failed permanently (i. e. not with -EAGAIN code). [lkp@intel.com: z3fold_page_isolate() can be static] Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42 Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com Change-Id: I00f1ba74e97d70077f3a7620324a5fc5765509bb Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Signed-off-by: kbuild test robot <lkp@intel.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f1d6a7c688 |
UPSTREAM: mm/z3fold.c: add structure for buddy handles
For z3fold to be able to move its pages per request of the memory subsystem, it should not use direct object addresses in handles. Instead, it will create abstract handles (3 per page) which will contain pointers to z3fold objects. Thus, it will be possible to change these pointers when z3fold page is moved. Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com Change-Id: I809934ea92de55db3eccf1c12a546184ea570c2c Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
544ae325e2 |
UPSTREAM: mm/z3fold.c: introduce helper functions
Patch series "z3fold: support page migration", v2. This patchset implements page migration support and slightly better buddy search. To implement page migration support, z3fold has to move away from the current scheme of handle encoding. i. e. stop encoding page address in handles. Instead, a small per-page structure is created which will contain actual addresses for z3fold objects, while pointers to fields of that structure will be used as handles. Thus, it will be possible to change the underlying addresses to reflect page migration. To support migration itself, 3 callbacks will be implemented: 1: isolation callback: z3fold_page_isolate(): try to isolate the page by removing it from all lists. Pages scheduled for some activity and mapped pages will not be isolated. Return true if isolation was successful or false otherwise 2: migration callback: z3fold_page_migrate(): re-check critical conditions and migrate page contents to the new page provided by the system. Returns 0 on success or negative error code otherwise 3: putback callback: z3fold_page_putback(): put back the page if z3fold_page_migrate() for it failed permanently (i. e. not with -EAGAIN code). To make sure an isolated page doesn't get freed, its kref is incremented in z3fold_page_isolate() and decremented during post-migration compaction, if migration was successful, or by z3fold_page_putback() in the other case. Since the new handle encoding scheme implies slight memory consumption increase, better buddy search (which decreases memory consumption) is included in this patchset. This patch (of 4): Introduce a separate helper function for object allocation, as well as 2 smaller helpers to add a buddy to the list and to get a pointer to the pool from the z3fold header. No functional changes here. Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com Change-Id: I51fdbcd3d9ed44500e62cb9eb992bfa5d6b50432 Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b3cdc77bc0 |
UPSTREAM: vfs: Convert zsmalloc to use the new mount API
Convert the zsmalloc filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Change-Id: Ibce867d147a8cce9d2dcb4136233144340e0d54e Signed-off-by: David Howells <dhowells@redhat.com> cc: Minchan Kim <minchan@kernel.org> cc: Nitin Gupta <ngupta@vflare.org> cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> cc: linux-mm@kvack.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
57624cf29b |
BACKPORT: mount_pseudo(): drop 'name' argument, switch to d_make_root()
Once upon a time we used to set ->d_name of e.g. pipefs root so that d_path() on pipes would work. These days it's completely pointless - dentries of pipes are not even connected to pipefs root. However, mount_pseudo() had set the root dentry name (passed as the second argument) and callers kept inventing names to pass to it. Including those that didn't *have* any non-root dentries to start with... All of that had been pointless for about 8 years now; it's time to get rid of that cargo-culting... Change-Id: Ib7aa010e5e16a032ae8c7d6960b6feae8c826ae9 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
489214dc84 |
UPSTREAM: zsmalloc: don't bother with dentry_operations
Change-Id: Ifbee7d9b6ad67d2b724ac0c6f877502aef623c67 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
c8fd3ae554 |
BACKPORT: mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Change-Id: If8b02a44fb05102eea5f4b1ccd2cff20c59c41b3
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
751d700cc8 |
UPSTREAM: mm: memcontrol: clean up and document effective low/min calculations
The effective protection of any given cgroup is a somewhat complicated construct that depends on the ancestor's configuration, siblings' configurations, as well as current memory utilization in all these groups. It's done this way to satisfy hierarchical delegation requirements while also making the configuration semantics flexible and expressive in complex real life scenarios. Unfortunately, all the rules and requirements are sparsely documented, and the code is a little too clever in merging different scenarios into a single min() expression. This makes it hard to reason about the implementation and avoid breaking semantics when making changes to it. This patch documents each semantic rule individually and splits out the handling of the overcommit case from the regular case. Michal Koutný also points out that the points of equilibrium as described in the existing example scenarios aren't actually accurate. Delete these examples for now to avoid confusion. Change-Id: I5a09b1e6bcc474c8ee1279f0905ea2cd0e417fc5 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-3-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb4322c4b8 |
UPSTREAM: kernfs: Add removed_size out param for simple_xattr_set
This helps set up size accounting in the next commit. Without this out param, it's difficult to find out the removed xattr size without taking a lock for longer and walking the xattr linked list twice. Change-Id: Iaea5549f0a5d6e72b6e42b0bdc0b3abc55b376a1 Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
e3a2d58a95 |
BACKPORT: maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
This matches the naming of strncpy_from_user, and also makes it more clear what the function is supposed to do. Change-Id: If76207724081e94b4450d7311c3385e494e29e6e Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200521152301.2587579-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
344e97225f |
UPSTREAM: iov_iter: Use accessor function
Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Change-Id: I13097aadc08013f7ad6b94723f2d704b4fbd6c20 Signed-off-by: David Howells <dhowells@redhat.com> |
||
|
|
0fcb75291f |
BACKPORT: mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Change-Id: Iae5854c7005dec82942db58215d615a10bde1f31 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv] Acked-by: Gao Xiang <xiang@kernel.org> [erofs] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Wei Liu <wei.liu@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e13d6f5008 |
BACKPORT: sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Change-Id: Ic71fd778e4cea58adc51d634d9e53c1f9f90cdf2 Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
dc24631756 |
BACKPORT: bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY
Add ability to memory-map contents of BPF array map. This is extremely useful
for working with BPF global data from userspace programs. It allows to avoid
typical bpf_map_{lookup,update}_elem operations, improving both performance
and usability.
There had to be special considerations for map freezing, to avoid having
writable memory view into a frozen map. To solve this issue, map freezing and
mmap-ing is happening under mutex now:
- if map is already frozen, no writable mapping is allowed;
- if map has writable memory mappings active (accounted in map->writecnt),
map freezing will keep failing with -EBUSY;
- once number of writable memory mappings drops to zero, map freezing can be
performed again.
Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.
For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
to be mmap()'able. We also need to make sure that array data memory is
page-sized and page-aligned, so we over-allocate memory in such a way that
struct bpf_array is at the end of a single page of memory with array->value
being aligned with the start of the second page. On deallocation we need to
accomodate this memory arrangement to free vmalloc()'ed memory correctly.
One important consideration regarding how memory-mapping subsystem functions.
Memory-mapping subsystem provides few optional callbacks, among them open()
and close(). close() is called for each memory region that is unmapped, so
that users can decrease their reference counters and free up resources, if
necessary. open() is *almost* symmetrical: it's called for each memory region
that is being mapped, **except** the very first one. So bpf_map_mmap does
initial refcnt bump, while open() will do any extra ones after that. Thus
number of close() calls is equal to number of open() calls plus one more.
Change-Id: I16d3912dee0c18949866b2b136cd0a5d687159da
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
|
||
|
|
07d02cd68f |
UPSTREAM: uaccess: Add strict non-pagefault kernel-space read function
Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict()
helpers which by default alias to the __probe_kernel_read() and the
__strncpy_from_unsafe(), respectively, but can be overridden by archs
which have non-overlapping address ranges for kernel space and user
space in order to bail out with -EFAULT when attempting to probe user
memory including non-canonical user access addresses [0]:
4-level page tables:
user-space mem: 0x0000000000000000 - 0x00007fffffffffff
non-canonical: 0x0000800000000000 - 0xffff7fffffffffff
5-level page tables:
user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
non-canonical: 0x0100000000000000 - 0xfeffffffffffffff
The idea is that these helpers are complementary to the probe_user_read()
and strncpy_from_unsafe_user() which probe user-only memory. Both added
helpers here do the same, but for kernel-only addresses.
Both set of helpers are going to be used for BPF tracing. They also
explicitly avoid throwing the splat for non-canonical user addresses from
00c42373d397 ("x86-64: add warning for non-canonical user access address
dereferences").
For compat, the current probe_kernel_read() and strncpy_from_unsafe() are
left as-is.
[0] Documentation/x86/x86_64/mm.txt
Change-Id: I1256fa00e1869a14519e186a89d5ab65f9fcbc29
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: x86@kernel.org
Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
|
||
|
|
b2643fdc2d |
UPSTREAM: cgroup: memcg: net: do not associate sock with unrelated cgroup
[ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ] We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: |
||
|
|
b773976b49 |
UPSTREAM: mm/vmalloc: Add flag for freeing of special permsissions
Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to immediately clear executable TLB entries before freeing pages, and handle resetting permissions on the directmap. This flag is useful for any kind of memory with elevated permissions, or where there can be related permissions changes on the directmap. Today this is RO+X and RO memory. Although this enables directly vfreeing non-writeable memory now, non-writable memory cannot be freed in an interrupt because the allocation itself is used as a node on deferred free list. So when RO memory needs to be freed in an interrupt the code doing the vfree needs to have its own work queue, as was the case before the deferred vfree list was added to vmalloc. For architectures with set_direct_map_ implementations this whole operation can be done with one TLB flush when centralized like this. For others with directmap permissions, currently only arm64, a backup method using set_memory functions is used to reset the directmap. When arm64 adds set_direct_map_ functions, this backup can be removed. When the TLB is flushed to both remove TLB entries for the vmalloc range mapping and the direct map permissions, the lazy purge operation could be done to try to save a TLB flush later. However today vm_unmap_aliases could flush a TLB range that does not include the directmap. So a helper is added with extra parameters that can allow both the vmalloc address and the direct mapping to be flushed during this operation. The behavior of the normal vm_unmap_aliases function is unchanged. Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Will Deacon <will.deacon@arm.com> Change-Id: I914f605f3601bde21f194069dd31e658c9a22fcd Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
51086bf8c4 |
BACKPORT: mm: drop mmap_sem before calling balance_dirty_pages() in write fault
[ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]
One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:
A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():
balance_dirty_pages+0x724/0x905
balance_dirty_pages_ratelimited+0x254/0x390
fault_dirty_shared_page.isra.96+0x4a/0x90
do_wp_page+0x33e/0x400
__handle_mm_fault+0x6f0/0xfa0
handle_mm_fault+0xe4/0x200
__do_page_fault+0x22b/0x4a0
page_fault+0x45/0x50
Somebody tries to change the address space, contending for the mmap_sem in
write-mode:
call_rwsem_down_write_failed_killable+0x13/0x20
do_mprotect_pkey+0xa8/0x330
SyS_mprotect+0xf/0x20
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:
call_rwsem_down_read_failed+0x14/0x30
proc_pid_cmdline_read+0xa0/0x480
__vfs_read+0x23/0x140
vfs_read+0x87/0x130
SyS_read+0x42/0x90
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.
Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
Change-Id: I7d4d692b23b402e3aa283fa06b3fe2f09485035c
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
91eec772ff |
UPSTREAM: mm/userfaultfd: fix memory corruption due to writeprotect
Userfaultfd self-test fails occasionally, indicating a memory corruption.
Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy(). Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.
To make matters worse, memory-unprotection using userfaultfd also poses a
problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userrfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects memory region, it
unintentionally *clears* the RW-bit if it was already set. Note that this
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.
The scenario that happens in selftests/vm/userfaultfd is as follows:
cpu0 cpu1 cpu2
---- ---- ----
[ Writable PTE
cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()
change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
cow_user_page()
[ copy page ]
[ write to old
page ]
...
set_pte_at_notify()
A similar scenario can happen:
cpu0 cpu1 cpu2 cpu3
---- ---- ---- ----
[ Writable PTE
cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
cow_user_page()
[ copy page ]
... [ write to page ]
set_pte_at_notify()
This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became apparent
since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if page_count(page)
> 1.
To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.
Further optimizations will follow to avoid during uffd-write-unprotect
unnecassary PTE write-protection and TLB flushes.
Bug: 254441685
Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit <namit@vmware.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6ce64428d62026a10cb5d80138ff2f90cc21d367)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Ie35334aa739acfade88b74c9e5dde5c8a387925d
|
||
|
|
186bc1a4e0 |
UPSTREAM: mm/mremap: don't account pages in vma_to_resize()
All this vm_unacct_memory(charged) dance seems to complicate the life without a good reason. Furthermore, it seems not always done right on error-pathes in mremap_to(). And worse than that: this `charged' difference is sometimes double-accounted for growing MREMAP_DONTUNMAP mremap()s in move_vma(): if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT)) Let's not do this. Account memory in mremap() fast-path for growing VMAs or in move_vma() for actually moving things. The same simpler way as it's done by vm_stat_account(), but with a difference to call security_vm_enough_memory_mm() before copying/adjusting VMA. Originally noticed by Chen Wandun: https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com Link: https://lkml.kernel.org/r/20210721131320.522061-1-dima@arista.com Fixes: e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit fdbef61491359947753c13581098878e8d038286) Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: Id0833f3e84c4a6119e2ec069afff5122cb0c6b35 |
||
|
|
1ddcd9a8be |
UPSTREAM: mm/shmem: use page_mapping() to detect page cache for uffd continue
mfill_atomic_install_pte() checks page->mapping to detect whether one page is used in the page cache. However as pointed out by Matthew, the page can logically be a tail page rather than always the head in the case of uffd minor mode with UFFDIO_CONTINUE. It means we could wrongly install one pte with shmem thp tail page assuming it's an anonymous page. It's not that clear even for anonymous page, since normally anonymous pages also have page->mapping being setup with the anon vma. It's safe here only because the only such caller to mfill_atomic_install_pte() is always passing in a newly allocated page (mcopy_atomic_pte()), whose page->mapping is not yet setup. However that's not extremely obvious either. For either of above, use page_mapping() instead. Bug: 254441685 Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 93b0d9178743a68723babe8448981f658aebc58e) Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I03246130310cc7f3486843ed945ef92cab966cdc |
||
|
|
699cabe586 |
BACKPORT: FROMGIT: userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()
In a previous commit, we added the mfill_atomic_install_pte() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases. In other words, shmem_mfill_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly, to reduce code duplication. This requires that we refactor shmem_mfill_atomic_pte() a bit: Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it afterward. This frees up mfill_atomic_install_pte() from having to care about this accounting, and means we don't need to e.g. shmem_uncharge() in the error path. A side effect is this switches shmem_mfill_atomic_pte() to use lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add(). This wrapper does some extra accounting in an exceptional case, if appropriate, so it's actually the more correct thing to use. Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit 42660ee254d889478c4d12929875f4d0ab067ca4 https: //git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm) Link: https://lore.kernel.org/patchwork/patch/1420973/ Conflicts: 1. mm/shmem.c 2. mm/userfaultfd.c (1. Incorporated mem_cgroup_try_charge() into the patch 2. Manual rebase) Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: I130c9f7cdfe0404dbd7bed48f19eddec65ea1c48 |
||
|
|
1ae05f03e6 |
BACKPORT: FROMGIT: userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install a PTE for it. This commit introduces a new helper: mfill_atomic_install_pte. Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication). Why add a new mfill_atomic_install_pte helper? A problem we have with continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mfill_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte(). In a future commit, shmem_mfill_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change. Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit c9a4579a9f5320ff062f973476473242b551bacd https: //git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm) Link: https://lore.kernel.org/patchwork/patch/1420972/ Conflicts: mm/userfaultfd.c (1. Removed all 'wp_copy' usage as write-protect uffd feature doesn't exist in this kernel. 2. Due to lack of mem_cgroup_charge() in this kernel, worked with mem_cgroup_try_charge() instead in mcopy_atomic_pte() 3. Replaced lru_cache_add_inactive_or_unevictable() with lru_cache_add_active_or_unevictable()) Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: I46eb835849e7798a9d23cf53959bd93655b926d4 |
||
|
|
0f90cf1d45 |
BACKPORT: FROMGIT: userfaultfd/shmem: support minor fault registration for shmem
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag. This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults. Because of this, we also don't yet advertise this as a supported feature. That will be done in a separate commit when the feature is fully implemented. Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit 6867a29320b7d178feb9786856e5ea2cf40f6d33 https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm) Link: https://lore.kernel.org/patchwork/patch/1420970/ Conflicts: mm/shmem.c (Manual rebase) Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: Ib8e3fe202feab7404742814f4250a0981065368c |
||
|
|
eed519aab0 |
BACKPORT: FROMGIT: userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
Patch series "userfaultfd: add minor fault handling for shmem", v6. Overview ======== See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas. This series is structured as follows: - Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commit 5 advertises that the feature is now available since at this point it's fully implemented. - Commit 6 is a final cleanup, modifying an existing code path to re-use a new helper we've introduced. - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature. Use Case ======== In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series. Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope to optimize the Android Runtime garbage collector using this feature: "The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required." [1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t This patch (of 9): Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte). This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc. While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true. Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit f7e89f242f0dfcdd62e7aeecebdc2620e4792954 https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm) Link: https://lore.kernel.org/patchwork/patch/1420969/ Conflicts: 1. include/linux/shmem_fs.h 2. mm/shmem.c 3. mm/userfaultfd.c (All resolved by manual rebase) Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: I8b443809643339b407d0bc06bf96146ecfc5a9fa |
||
|
|
4ac21dfeb3 |
Revert "BACKPORT: FROMGIT: userfaultfd: support minor fault handling for shmem"
This reverts commit 0309b3f479b967acb644f99d214e2b25297a20b1 as an updated version of the patch-set will be merged later. Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: I765fe86a2dc0305482a0590c14143dee27840b8a |
||
|
|
0ddfd451df |
Revert "FROMLIST: userfaultfd/shmem: fix minor fault page leak"
This reverts commit 33a50fd21ddb04629ba4a262f3cdcdc2f43ee578 as an updated version of the patch-set will be merged later. Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Bug: 187930641 Change-Id: Ie6fe2a611d4cbb3bd103c62a90b84e6ba4e89af1 |
||
|
|
53db7537f2 |
FROMLIST: userfaultfd/shmem: fix minor fault page leak
This fix is analogous to Peter Xu's fix for hugetlb [0]. If we don't
put_page() after getting the page out of the page cache, we leak the
reference.
The fix can be verified by checking /proc/meminfo and running the
userfaultfd selftest in shmem mode. Without the fix, we see MemFree /
MemAvailable steadily decreasing with each run of the test. With the
fix, memory is correctly freed after the test program exits.
Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/patchwork/patch/1400686/
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: I599f1434e24fce6e31d0d73c7f9c4714e9875b63
|