kernel_google_redbull

Evolution-X-Devices/kernel_google_redbull

Author	SHA1	Message	Date
Christoph Hellwig	c5095805e9	BACKPORT: sysctl: pass kernel pointers to ->proc_handler Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Change-Id: Ic71fd778e4cea58adc51d634d9e53c1f9f90cdf2 Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-08 17:26:43 +03:00
Greg Kroah-Hartman	ce7025b713	Merge 4.19.238 into android-4.19-stable Changes in 4.19.238 USB: serial: pl2303: add IBM device IDs USB: serial: simple: add Nokia phone driver netdevice: add the case if dev is NULL xfrm: fix tunnel model fragmentation behavior virtio_console: break out of buf poll on remove ethernet: sun: Free the coherent when failing in probing spi: Fix invalid sgs value net:mcf8390: Use platform_get_irq() to get the interrupt spi: Fix erroneous sgs value with min_t() af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register fuse: fix pipe buffer lifetime for direct_io tpm: fix reference counting for struct tpm_chip block: Add a helper to validate the block size virtio-blk: Use blk_validate_block_size() to validate block size USB: usb-storage: Fix use of bitfields for hardware data in ene_ub6250.c xhci: make xhci_handshake timeout for xhci_reset() adjustable coresight: Fix TRCCONFIGR.QE sysfs interface iio: afe: rescale: use s64 for temporary scale calculations iio: inkern: apply consumer scale on IIO_VAL_INT cases iio: inkern: apply consumer scale when no channel scale is available iio: inkern: make a best effort on offset calculation clk: uniphier: Fix fixed-rate initialization ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE Documentation: add link to stable release candidate tree Documentation: update stable tree link SUNRPC: avoid race between mod_timer() and del_timer_sync() NFSD: prevent underflow in nfssvc_decode_writeargs() NFSD: prevent integer overflow on 32 bit systems f2fs: fix to unlock page correctly in error path of is_alive() pinctrl: samsung: drop pin banks references on error paths can: ems_usb: ems_usb_start_xmit(): fix double dev_kfree_skb() in error path jffs2: fix use-after-free in jffs2_clear_xattr_subsystem jffs2: fix memory leak in jffs2_do_mount_fs jffs2: fix memory leak in jffs2_scan_medium mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node mm: invalidate hwpoison page cache page in fault path mempolicy: mbind_range() set_policy() after vma_merge() scsi: libsas: Fix sas_ata_qc_issue() handling of NCQ NON DATA commands qed: display VF trust config qed: validate and restrict untrusted VFs vlan promisc mode Revert "Input: clear BTN_RIGHT/MIDDLE on buttonpads" ALSA: cs4236: fix an incorrect NULL check on list iterator ALSA: hda/realtek: Fix audio regression on Mi Notebook Pro 2020 mm,hwpoison: unmap poisoned page before invalidation drbd: fix potential silent data corruption powerpc/kvm: Fix kvm_use_magic_page ACPI: properties: Consistently return -ENOENT if there are no more references drivers: hamradio: 6pack: fix UAF bug caused by mod_timer() block: don't merge across cgroup boundaries if blkcg is enabled drm/edid: check basic audio support on CEA extension block video: fbdev: sm712fb: Fix crash in smtcfb_read() video: fbdev: atari: Atari 2 bpp (STe) palette bugfix ARM: dts: at91: sama5d2: Fix PMERRLOC resource size ARM: dts: exynos: fix UART3 pins configuration in Exynos5250 ARM: dts: exynos: add missing HDMI supplies on SMDK5250 ARM: dts: exynos: add missing HDMI supplies on SMDK5420 carl9170: fix missing bit-wise or operator for tx_params thermal: int340x: Increase bitmap size lib/raid6/test: fix multiple definition linking error DEC: Limit PMAX memory probing to R3k systems media: davinci: vpif: fix unbalanced runtime PM get brcmfmac: firmware: Allocate space for default boardrev in nvram brcmfmac: pcie: Replace brcmf_pcie_copy_mem_todev with memcpy_toio PCI: pciehp: Clear cmd_busy bit in polling mode regulator: qcom_smd: fix for_each_child.cocci warnings crypto: authenc - Fix sleep in atomic context in decrypt_tail crypto: mxs-dcp - Fix scatterlist processing spi: tegra114: Add missing IRQ check in tegra_spi_probe selftests/x86: Add validity check and allow field splitting spi: pxa2xx-pci: Balance reference count for PCI DMA device hwmon: (pmbus) Add mutex to regulator ops hwmon: (sch56xx-common) Replace WDOG_ACTIVE with WDOG_HW_RUNNING block: don't delete queue kobject before its children PM: hibernate: fix __setup handler error handling PM: suspend: fix return value of __setup handler hwrng: atmel - disable trng on failure path crypto: vmx - add missing dependencies clocksource/drivers/timer-of: Check return value of of_iomap in timer_of_base_init() ACPI: APEI: fix return value of __setup handlers crypto: ccp - ccp_dmaengine_unregister release dma channels hwmon: (pmbus) Add Vin unit off handling clocksource: acpi_pm: fix return value of __setup handler sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa perf/core: Fix address filter parser for multiple filters perf/x86/intel/pt: Fix address filter config for 32-bit kernel media: coda: Fix missing put_device() call in coda_get_vdoa_data video: fbdev: smscufx: Fix null-ptr-deref in ufx_usb_probe() video: fbdev: fbcvt.c: fix printing in fb_cvt_print_name() ARM: dts: qcom: ipq4019: fix sleep clock soc: ti: wkup_m3_ipc: Fix IRQ check in wkup_m3_ipc_probe media: em28xx: initialize refcount before kref_get media: usb: go7007: s2250-board: fix leak in probe() ASoC: rt5663: check the return value of devm_kzalloc() in rt5663_parse_dp() ASoC: ti: davinci-i2s: Add check for clk_enable() ALSA: spi: Add check for clk_enable() arm64: dts: ns2: Fix spi-cpol and spi-cpha property arm64: dts: broadcom: Fix sata nodename printk: fix return value of printk.devkmsg __setup handler ASoC: mxs-saif: Handle errors for clk_enable ASoC: atmel_ssc_dai: Handle errors for clk_enable memory: emif: Add check for setup_interrupts memory: emif: check the pointer temp in get_device_details() ALSA: firewire-lib: fix uninitialized flag for AV/C deferred transaction media: stk1160: If start stream fails, return buffers with VB2_BUF_STATE_QUEUED ASoC: atmel: Add missing of_node_put() in at91sam9g20ek_audio_probe ASoC: wm8350: Handle error for wm8350_register_irq ASoC: fsi: Add check for clk_enable video: fbdev: omapfb: Add missing of_node_put() in dvic_probe_of ASoC: dmaengine: do not use a NULL prepare_slave_config() callback ASoC: mxs: Fix error handling in mxs_sgtl5000_probe ASoC: imx-es8328: Fix error return code in imx_es8328_probe() ASoC: msm8916-wcd-digital: Fix missing clk_disable_unprepare() in msm8916_wcd_digital_probe mmc: davinci_mmc: Handle error for clk_enable drm/bridge: Fix free wrong object in sii8620_init_rcp_input_dev ath10k: fix memory overwrite of the WoWLAN wakeup packet pattern Bluetooth: hci_serdev: call init_rwsem() before p->open() mtd: onenand: Check for error irq drm/edid: Don't clear formats if using deep color drm/amd/display: Fix a NULL pointer dereference in amdgpu_dm_connector_add_common_modes() ath9k_htc: fix uninit value bugs KVM: PPC: Fix vmx/vsx mixup in mmio emulation power: reset: gemini-poweroff: Fix IRQ check in gemini_poweroff_probe ray_cs: Check ioremap return value power: supply: ab8500: Fix memory leak in ab8500_fg_sysfs_init HID: i2c-hid: fix GET/SET_REPORT for unnumbered reports iwlwifi: Fix -EIO error code that is never returned dm crypt: fix get_key_size compiler warning if !CONFIG_KEYS scsi: pm8001: Fix command initialization in pm80XX_send_read_log() scsi: pm8001: Fix command initialization in pm8001_chip_ssp_tm_req() scsi: pm8001: Fix payload initialization in pm80xx_set_thermal_config() scsi: pm8001: Fix abort all task initialization TOMOYO: fix __setup handlers return values ext2: correct max file size computing drm/tegra: Fix reference leak in tegra_dsi_ganged_probe power: supply: bq24190_charger: Fix bq24190_vbus_is_enabled() wrong false return drm/bridge: cdns-dsi: Make sure to to create proper aliases for dt powerpc/Makefile: Don't pass -mcpu=powerpc64 when building 32-bit KVM: x86: Fix emulation in writing cr8 KVM: x86/emulator: Defer not-present segment check in __load_segment_descriptor() hv_balloon: rate-limit "Unhandled message" warning i2c: xiic: Make bus names unique power: supply: wm8350-power: Handle error for wm8350_register_irq power: supply: wm8350-power: Add missing free in free_charger_irq PCI: Reduce warnings on possible RW1C corruption powerpc/sysdev: fix incorrect use to determine if list is empty mfd: mc13xxx: Add check for mc13xxx_irq_request vxcan: enable local echo for sent CAN frames MIPS: RB532: fix return value of __setup handler mtd: rawnand: atmel: fix refcount issue in atmel_nand_controller_init USB: storage: ums-realtek: fix error code in rts51x_read_mem() af_netlink: Fix shift out of bounds in group mask calculation i2c: mux: demux-pinctrl: do not deactivate a master that is not active selftests/bpf/test_lirc_mode2.sh: Exit with proper code tcp: ensure PMTU updates are processed during fastopen mfd: asic3: Add missing iounmap() on error asic3_mfd_probe mxser: fix xmit_buf leak in activate when LSR == 0xff pwm: lpc18xx-sct: Initialize driver data and hardware before pwmchip_add() staging:iio:adc:ad7280a: Fix handing of device address bit reversing. clk: qcom: ipq8074: Use floor ops for SDCC1 clock serial: 8250_mid: Balance reference count for PCI DMA device serial: 8250: Fix race condition in RTS-after-send handling iio: adc: Add check for devm_request_threaded_irq dma-debug: fix return value of __setup handlers clk: qcom: clk-rcg2: Update the frac table for pixel clock remoteproc: qcom_wcnss: Add missing of_node_put() in wcnss_alloc_memory_region clk: actions: Terminate clk_div_table with sentinel element clk: loongson1: Terminate clk_div_table with sentinel element clk: clps711x: Terminate clk_div_table with sentinel element clk: tegra: tegra124-emc: Fix missing put_device() call in emc_ensure_emc_driver NFS: remove unneeded check in decode_devicenotify_args() pinctrl: mediatek: Fix missing of_node_put() in mtk_pctrl_init pinctrl: nomadik: Add missing of_node_put() in nmk_pinctrl_probe pinctrl/rockchip: Add missing of_node_put() in rockchip_pinctrl_probe tty: hvc: fix return value of __setup handler kgdboc: fix return value of __setup handler kgdbts: fix return value of __setup handler jfs: fix divide error in dbNextAG netfilter: nf_conntrack_tcp: preserve liberal flag in tcp options clk: qcom: gcc-msm8994: Fix gpll4 width xen: fix is_xen_pmu() net: phy: broadcom: Fix brcm_fet_config_init() qlcnic: dcb: default to returning -EOPNOTSUPP net/x25: Fix null-ptr-deref caused by x25_disconnect NFSv4/pNFS: Fix another issue with a list iterator pointing to the head lib/test: use after free in register_test_dev_kmod() selinux: use correct type for context length loop: use sysfs_emit() in the sysfs xxx show() Fix incorrect type in assignment of ipv6 port for audit irqchip/qcom-pdc: Fix broken locking irqchip/nvic: Release nvic_base upon failure bfq: fix use-after-free in bfq_dispatch_request ACPICA: Avoid walking the ACPI Namespace if it is not there lib/raid6/test/Makefile: Use $(pound) instead of \# for Make 4.3 Revert "Revert "block, bfq: honor already-setup queue merges"" ACPI/APEI: Limit printable size of BERT table data PM: core: keep irq flags in device_pm_check_callbacks() spi: tegra20: Use of_device_get_match_data() ext4: don't BUG if someone dirty pages without asking ext4 first ntfs: add sanity check on allocation size video: fbdev: nvidiafb: Use strscpy() to prevent buffer overflow video: fbdev: w100fb: Reset global state video: fbdev: cirrusfb: check pixclock to avoid divide by zero video: fbdev: omapfb: acx565akm: replace snprintf with sysfs_emit ARM: dts: qcom: fix gic_irq_domain_translate warnings for msm8960 ARM: dts: bcm2837: Add the missing L1/L2 cache information video: fbdev: omapfb: panel-dsi-cm: Use sysfs_emit() instead of snprintf() video: fbdev: omapfb: panel-tpo-td043mtea1: Use sysfs_emit() instead of snprintf() video: fbdev: udlfb: replace snprintf in show functions with sysfs_emit ASoC: soc-core: skip zero num_dai component in searching dai name media: cx88-mpeg: clear interrupt status register before streaming video ARM: tegra: tamonten: Fix I2C3 pad setting ARM: mmp: Fix failure to remove sram device video: fbdev: sm712fb: Fix crash in smtcfb_write() media: Revert "media: em28xx: add missing em28xx_close_extension" media: hdpvr: initialize dev->worker at hdpvr_register_videodev mmc: host: Return an error when ->enable_sdio_irq() ops is missing powerpc/lib/sstep: Fix 'sthcx' instruction powerpc/lib/sstep: Fix build errors with newer binutils powerpc: Fix build errors with newer binutils scsi: qla2xxx: Fix stuck session in gpdb scsi: qla2xxx: Fix warning for missing error code scsi: qla2xxx: Check for firmware dump already collected scsi: qla2xxx: Suppress a kernel complaint in qla_create_qpair() scsi: qla2xxx: Fix incorrect reporting of task management failure scsi: qla2xxx: Fix hang due to session stuck scsi: qla2xxx: Reduce false trigger to login scsi: qla2xxx: Use correct feature type field during RFF_ID processing KVM: Prevent module exit until all VMs are freed KVM: x86: fix sending PV IPI ubifs: rename_whiteout: Fix double free for whiteout_ui->data ubifs: Fix deadlock in concurrent rename whiteout and inode writeback ubifs: Add missing iput if do_tmpfile() failed in rename whiteout ubifs: setflags: Make dirtied_ino_d 8 bytes aligned ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock() ubifs: rename_whiteout: correct old_dir size computing can: mcba_usb: mcba_usb_start_xmit(): fix double dev_kfree_skb in error path can: mcba_usb: properly check endpoint type gfs2: Make sure FITRIM minlen is rounded up to fs block size pinctrl: pinconf-generic: Print arguments for bias-pull-* ubi: Fix race condition between ctrl_cdev_ioctl and ubi_cdev_ioctl ACPI: CPPC: Avoid out of bounds access when parsing _CPC data mm/mmap: return 1 from stack_guard_gap __setup() handler mm/memcontrol: return 1 from cgroup.memory __setup() handler mm/usercopy: return 1 from hardened_usercopy __setup() handler bpf: Fix comment for helper bpf_current_task_under_cgroup() ubi: fastmap: Return error code if memory allocation fails in add_aeb() ASoC: topology: Allow TLV control to be either read or write ARM: dts: spear1340: Update serial node properties ARM: dts: spear13xx: Update SPI dma properties um: Fix uml_mconsole stop/go openvswitch: Fixed nd target mask field in the flow dump. KVM: x86: Forbid VMM to set SYNIC/STIMER MSRs when SynIC wasn't activated ubifs: Rectify space amount budget for mkdir/tmpfile operations rtc: wm8350: Handle error for wm8350_register_irq riscv module: remove (NOLOAD) ARM: 9187/1: JIVE: fix return value of __setup handler KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs drm: Add orientation quirk for GPD Win Max ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111 drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj ptp: replace snprintf with sysfs_emit powerpc: dts: t104xrdb: fix phy type for FMAN 4/5 scsi: mvsas: Replace snprintf() with sysfs_emit() scsi: bfa: Replace snprintf() with sysfs_emit() power: supply: axp20x_battery: properly report current when discharging powerpc: Set crashkernel offset to mid of RMA region PCI: aardvark: Fix support for MSI interrupts iommu/arm-smmu-v3: fix event handling soft lockup usb: ehci: add pci device support for Aspeed platforms PCI: pciehp: Add Qualcomm quirk for Command Completed erratum ipv4: Invalidate neighbour for broadcast address upon address addition dm ioctl: prevent potential spectre v1 gadget drm/amdkfd: make CRAT table missing message informational only scsi: pm8001: Fix pm8001_mpi_task_abort_resp() scsi: aha152x: Fix aha152x_setup() __setup handler return value net/smc: correct settings of RMB window update limit macvtap: advertise link netns via netlink bnxt_en: Eliminate unintended link toggle during FW reset MIPS: fix fortify panic when copying asm exception handlers scsi: libfc: Fix use after free in fc_exch_abts_resp() usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm xtensa: fix DTC warning unit_address_format Bluetooth: Fix use after free in hci_send_acl init/main.c: return 1 from handled __setup() functions minix: fix bug when opening a file with O_DIRECT w1: w1_therm: fixes w1_seq for ds28ea00 sensors NFSv4: Protect the state recovery thread against direct reclaim xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 clk: Enforce that disjoints limits are invalid SUNRPC/call_alloc: async tasks mustn't block waiting for memory NFS: swap IO handling is slightly different for O_DIRECT IO NFS: swap-out must always use STABLE writes. serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() virtio_console: eliminate anonymous module_init & module_exit jfs: prevent NULL deref in diFree parisc: Fix CPU affinity for Lasi, WAX and Dino chips net: add missing SOF_TIMESTAMPING_OPT_ID support mm: fix race between MADV_FREE reclaim and blkdev direct IO read KVM: arm64: Check arm64_get_bp_hardening_data() didn't return NULL drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire() Drivers: hv: vmbus: Fix potential crash on module unload scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one() net: stmmac: Fix unset max_speed difference between DT and non-DT platforms drm/imx: Fix memory leak in imx_pd_connector_get_modes net: openvswitch: don't send internal clone attribute to the userspace. rxrpc: fix a race in rxrpc_exit_net() qede: confirm skb is allocated before using spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op() drbd: Fix five use after free bugs in get_initial_state Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning" mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0) mm/mempolicy: fix mpol_new leak in shared_policy_replace x86/pm: Save the MSR validity status at context setup x86/speculation: Restore speculation related MSRs during S3 resume btrfs: fix qgroup reserve overflow the qgroup limit arm64: patch_text: Fixup last cpu should be master ata: sata_dwc_460ex: Fix crash due to OOB write perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator irqchip/gic-v3: Fix GICR_CTLR.RWP polling tools build: Filter out options and warnings not supported by clang tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error" mm: don't skip swap entry even if zap_details specified arm64: module: remove (NOLOAD) from linker script mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning cgroup: Use open-time credentials for process migraton perm checks cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv cgroup: Use open-time cgroup namespace for process migration perm checks selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644 selftests: cgroup: Test open-time credential usage for migration checks selftests: cgroup: Test open-time cgroup namespace usage for migration checks xfrm: policy: match with both mark and mask on user interfaces drm/amdgpu: Check if fd really is an amdgpu fd. drm/amdkfd: Use drm_priv to pass VM from KFD to amdgpu Linux 4.19.238 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I55a3615d2fbf9bde9ac152456701b36a6c9d20b6	2022-04-18 09:57:50 +02:00
Waiman Long	ff929e3abe	mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning commit a431dbbc540532b7465eae4fc8b56a85a9fc7d17 upstream. The gcc 12 compiler reports a "'mem_section' will never be NULL" warning on the following code: static inline struct mem_section __nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME if (!mem_section) return NULL; #endif if (!mem_section[SECTION_NR_TO_ROOT(nr)]) return NULL; : It happens with CONFIG_SPARSEMEM_EXTREME off. The mem_section definition is #ifdef CONFIG_SPARSEMEM_EXTREME extern struct mem_section *mem_section; #else extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif In the !CONFIG_SPARSEMEM_EXTREME case, mem_section is a static 2-dimensional array and so the check "!mem_section[SECTION_NR_TO_ROOT(nr)]" doesn't make sense. Fix this warning by moving the "!mem_section[SECTION_NR_TO_ROOT(nr)]" check up inside the CONFIG_SPARSEMEM_EXTREME block and adding an explicit NR_SECTION_ROOTS check to make sure that there is no out-of-bound array access. Link: https://lkml.kernel.org/r/20220331180246.2746210-1-longman@redhat.com Fixes: `3e347261a8` ("sparsemem extreme implementation") Signed-off-by: Waiman Long <longman@redhat.com> Reported-by: Justin Forbes <jforbes@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-04-15 14:15:07 +02:00
Greg Kroah-Hartman	2dce03a5c2	Merge 4.19.150 into android-4.19-stable Changes in 4.19.150 mmc: sdhci: Workaround broken command queuing on Intel GLK based IRBIS models USB: gadget: f_ncm: Fix NDP16 datagram validation gpio: mockup: fix resource leak in error path gpio: tc35894: fix up tc35894 interrupt configuration clk: socfpga: stratix10: fix the divider for the emac_ptp_free_clk vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock vsock/virtio: stop workers during the .remove() vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock() net: virtio_vsock: Enhance connection semantics Input: i8042 - add nopnp quirk for Acer Aspire 5 A515 ftrace: Move RCU is watching check after recursion check drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices drm/sun4i: mixer: Extend regmap max_register net: dec: de2104x: Increase receive ring size for Tulip rndis_host: increase sleep time in the query-response loop nvme-core: get/put ctrl and transport module in nvme_dev_open/release() drivers/net/wan/lapbether: Make skb->protocol consistent with the header drivers/net/wan/hdlc: Set skb->protocol before transmitting mac80211: do not allow bigger VHT MPDUs than the hardware supports spi: fsl-espi: Only process interrupts for expected events nvme-fc: fail new connections to a deleted host or remote port gpio: sprd: Clear interrupt when setting the type as edge pinctrl: mvebu: Fix i2c sda definition for 98DX3236 nfs: Fix security label length not being reset clk: samsung: exynos4: mark 'chipid' clock as CLK_IGNORE_UNUSED iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate() i2c: cpm: Fix i2c_ram structure Input: trackpoint - enable Synaptics trackpoints random32: Restore __latent_entropy attribute on net_rand_state mm: replace memmap_context by meminit_context mm: don't rely on system state to detect hot-plug operations net/packet: fix overflow in tpacket_rcv epoll: do not insert into poll queues until all sanity checks are done epoll: replace ->visited/visited_list with generation count epoll: EPOLL_CTL_ADD: close the race in decision to take fast path ep_create_wakeup_source(): dentry name can change under you... netfilter: ctnetlink: add a range check for l3/l4 protonum Linux 4.19.150 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib6f1b6fce01bec80efd4a905d03903ff20ca89be	2020-10-07 08:45:35 +02:00
Laurent Dufour	25eaea1b33	mm: replace memmap_context by meminit_context commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-10-07 08:00:08 +02:00
Minchan Kim	1cf3a7305e	ANDROID: GKI: fix ABI diffs caused by GPU heap and pool vmstat additions New nr_gpu_heap fields in vmstat file cause ABI differences. Fix them by adding the fields and we don't print it to vmstat now as it's out-of-tree. Bug: 128254966 Bug: 130198686 Test: boot and check /proc/vmstat has nr_gpu_heap Change-Id: Idb322f80dbfde785720f01b3ad9954eb4feb7326 Signed-off-by: Minchan Kim <minchan@google.com> Signed-off-by: Kyle Lin <kylelin@google.com> Signed-off-by: Martin Liu <liumartin@google.com>	2020-07-07 23:32:09 +08:00
Greg Kroah-Hartman	4d01d462e6	Merge 4.19.129 into android-4.19-stable Changes in 4.19.129 ipv6: fix IPV6_ADDRFORM operation logic net_failover: fixed rollback in net_failover_open() bridge: Avoid infinite loop when suppressing NS messages with invalid options vxlan: Avoid infinite loop when suppressing NS messages with invalid options tun: correct header offsets in napi frags mode selftests: bpf: fix use of undeclared RET_IF macro make 'user_access_begin()' do 'access_ok()' Fix 'acccess_ok()' on alpha and SH arch/openrisc: Fix issues with access_ok() x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user() btrfs: merge btrfs_find_device and find_device btrfs: Detect unbalanced tree with empty leaf before crashing btree operations crypto: talitos - fix ECB and CBC algs ivsize Input: mms114 - fix handling of mms345l ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook sched/fair: Don't NUMA balance for kthreads Input: synaptics - add a second working PNP_ID for Lenovo T470s drivers/net/ibmvnic: Update VNIC protocol version reporting powerpc/xive: Clear the page tables for the ESB IO mapping ath9k_htc: Silence undersized packet warnings RDMA/uverbs: Make the event_queue fds return POLLERR when disassociated x86/cpu/amd: Make erratum #1054 a legacy erratum perf probe: Accept the instance number of kretprobe event mm: add kvfree_sensitive() for freeing sensitive data objects aio: fix async fsync creds btrfs: tree-checker: Check level for leaves and nodes x86_64: Fix jiffies ODR violation x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs x86/speculation: Prevent rogue cross-process SSBD shutdown x86/reboot/quirks: Add MacBook6,1 reboot quirk efi/efivars: Add missing kobject_put() in sysfs entry creation error path ALSA: es1688: Add the missed snd_card_free() ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines ALSA: usb-audio: Fix inconsistent card PM state after resume ALSA: usb-audio: Add vendor, product and profile name for HP Thunderbolt Dock ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile() ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe() ACPI: GED: add support for _Exx / _Lxx handler methods ACPI: PM: Avoid using power resources if there are none for D0 cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages nilfs2: fix null pointer dereference at nilfs_segctor_do_construct() spi: dw: Fix controller unregister order spi: bcm2835aux: Fix controller unregister order spi: bcm-qspi: when tx/rx buffer is NULL set to 0 PM: runtime: clk: Fix clk_pm_runtime_get() error path crypto: cavium/nitrox - Fix 'nitrox_get_first_device()' when ndevlist is fully iterated ALSA: pcm: disallow linking stream to itself x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned KVM: x86: Fix APIC page invalidation race kvm: x86: Fix L1TF mitigation for shadow MMU KVM: x86/mmu: Consolidate "is MMIO SPTE" code KVM: x86: only do L1TF workaround on affected processors x86/speculation: Change misspelled STIPB to STIBP x86/speculation: Add support for STIBP always-on preferred mode x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS. x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches. spi: No need to assign dummy value in spi_unregister_controller() spi: Fix controller unregister order spi: pxa2xx: Fix controller unregister order spi: bcm2835: Fix controller unregister order spi: pxa2xx: Balance runtime PM enable/disable on error spi: pxa2xx: Fix runtime PM ref imbalance on probe error crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req() crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req() crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req() selftests/net: in rxtimestamp getopt_long needs terminating null entry ovl: initialize error in ovl_copy_xattr proc: Use new_inode not new_inode_pseudo video: fbdev: w100fb: Fix a potential double free. KVM: nSVM: fix condition for filtering async PF KVM: nSVM: leave ASID aside in copy_vmcb_control_area KVM: nVMX: Consult only the "basic" exit reason when routing nested exit KVM: MIPS: Define KVM_ENTRYHI_ASID to cpu_asid_mask(&boot_cpu_data) KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts scsi: megaraid_sas: TM command refire leads to controller firmware crash ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx ath9k: Fix use-after-free Write in ath9k_htc_rx_msg ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb Smack: slab-out-of-bounds in vsscanf drm/vkms: Hold gem object while still in-use mm/slub: fix a memory leak in sysfs_slab_add() fat: don't allow to mount if the FAT length == 0 perf: Add cond_resched() to task_function_call() agp/intel: Reinforce the barrier after GTT updates mmc: sdhci-msm: Clear tuning done flag while hs400 tuning ARM: dts: at91: sama5d2_ptc_ek: fix sdmmc0 node description mmc: sdio: Fix potential NULL pointer error in mmc_sdio_init_card() xen/pvcalls-back: test for errors when calling backend_connect() KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception ACPI: GED: use correct trigger type field in _Exx / _Lxx handling drm: bridge: adv7511: Extend list of audio sample rates crypto: ccp -- don't "select" CONFIG_DMADEVICES media: si2157: Better check for running tuner in init objtool: Ignore empty alternatives spi: pxa2xx: Apply CS clk quirk to BXT net: atlantic: make hw_get_regs optional net: ena: fix error returning in ena_com_get_hash_function() efi/libstub/x86: Work around LLVM ELF quirk build regression arm64: cacheflush: Fix KGDB trap detection spi: dw: Zero DMA Tx and Rx configurations on stack arm64: insn: Fix two bugs in encoding 32-bit logical immediates ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K MIPS: Loongson: Build ATI Radeon GPU driver as module Bluetooth: Add SCO fallback for invalid LMP parameters error kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb kgdb: Prevent infinite recursive entries to the debugger spi: dw: Enable interrupts in accordance with DMA xfer mode clocksource: dw_apb_timer: Make CPU-affiliation being optional clocksource: dw_apb_timer_of: Fix missing clockevent timers btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE batman-adv: Revert "disable ethtool link speed detection when auto negotiation off" mmc: meson-mx-sdio: trigger a soft reset after a timeout or CRC error spi: dw: Fix Rx-only DMA transfers x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss() staging: android: ion: use vmap instead of vm_map_ram brcmfmac: fix wrong location to get firmware feature tools api fs: Make xxx__mountpoint() more scalable e1000: Distribute switch variables for initialization dt-bindings: display: mediatek: control dpi pins mode to avoid leakage audit: fix a net reference leak in audit_send_reply() media: dvb: return -EREMOTEIO on i2c transfer failure. media: platform: fcp: Set appropriate DMA parameters MIPS: Make sparse_init() using top-down allocation Bluetooth: btbcm: Add 2 missing models to subver tables audit: fix a net reference leak in audit_list_rules_send() netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported selftests/bpf: Fix memory leak in extract_build_id() net: bcmgenet: set Rx mode before starting netif lib/mpi: Fix 64-bit MIPS build with Clang exit: Move preemption fixup up, move blocking operations down sched/core: Fix illegal RCU from offline CPUs drivers/perf: hisi: Fix typo in events attribute array net: lpc-enet: fix error return code in lpc_mii_init() media: cec: silence shift wrapping warning in __cec_s_log_addrs() net: allwinner: Fix use correct return type for ndo_start_xmit() powerpc/spufs: fix copy_to_user while atomic xfs: clean up the error handling in xfs_swap_extents Crypto/chcr: fix for ccm(aes) failed test MIPS: Truncate link address into 32bit for 32bit kernel mips: cm: Fix an invalid error code of INTVN__ERR kgdb: Fix spurious true from in_dbg_master() xfs: reset buffer write failure state on successful completion xfs: fix duplicate verification from xfs_qm_dqflush() platform/x86: intel-vbtn: Use acpi_evaluate_integer() platform/x86: intel-vbtn: Split keymap into buttons and switches parts platform/x86: intel-vbtn: Do not advertise switches to userspace if they are not there platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types nvme: refine the Qemu Identify CNS quirk ath10k: Remove msdu from idr when management pkt send fails wcn36xx: Fix error handling path in 'wcn36xx_probe()' net: qed: Reduce RX and TX default ring count when running inside kdump kernel mt76: avoid rx reorder buffer overflow md: don't flush workqueue unconditionally in md_open veth: Adjust hard_start offset on redirect XDP frames net/mlx5e: IPoIB, Drop multicast packets that this interface sent rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup() mwifiex: Fix memory corruption in dump_station x86/boot: Correct relocation destination on old linkers mips: MAAR: Use more precise address mask mips: Add udelay lpj numbers adjustment crypto: stm32/crc32 - fix ext4 chksum BUG_ON() crypto: stm32/crc32 - fix run-time self test issue. crypto: stm32/crc32 - fix multi-instance x86/mm: Stop printing BRK addresses m68k: mac: Don't call via_flush_cache() on Mac IIfx btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup macvlan: Skip loopback packets in RX handler PCI: Don't disable decoding when mmio_always_on is set MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe() bcache: fix refcount underflow in bcache_device_free() mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core ixgbe: fix signed-integer-overflow warning mmc: sdhci-esdhc-imx: fix the mask for tuning start point spi: dw: Return any value retrieved from the dma_transfer callback cpuidle: Fix three reference count leaks platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32() platform/x86: intel-hid: Add a quirk to support HP Spectre X2 (2015) platform/x86: intel-vbtn: Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type string.h: fix incompatibility between FORTIFY_SOURCE and KASAN btrfs: include non-missing as a qualifier for the latest_bdev btrfs: send: emit file capabilities after chown mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked() mm: initialize deferred pages with interrupts enabled ima: Fix ima digest hash table key calculation ima: Directly assign the ima_default_policy pointer to ima_rules evm: Fix possible memory leak in evm_calc_hmac_or_hash() ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max ext4: fix error pointer dereference ext4: fix race between ext4_sync_parent() and rename() PCI: Avoid Pericom USB controller OHCI/EHCI PME# defect PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0 PCI: Avoid FLR for AMD Starship USB 3.0 PCI: Add ACS quirk for iProc PAXB PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints PCI: Remove unused NFP32xx IDs pci:ipmi: Move IPMI PCI class id defines to pci_ids.h hwmon/k10temp, x86/amd_nb: Consolidate shared device IDs x86/amd_nb: Add PCI device IDs for family 17h, model 30h PCI: add USR vendor id and use it in r8169 and w6692 driver PCI: Move Synopsys HAPS platform device IDs PCI: Move Rohm Vendor ID to generic list misc: pci_endpoint_test: Add the layerscape EP device support misc: pci_endpoint_test: Add support to test PCI EP in AM654x PCI: Add Synopsys endpoint EDDA Device ID PCI: Add NVIDIA GPU multi-function power dependencies PCI: Enable NVIDIA HDA controllers PCI: mediatek: Add controller support for MT7629 x86/amd_nb: Add PCI device IDs for family 17h, model 70h ALSA: lx6464es - add support for LX6464ESe pci express variant PCI: Add Genesys Logic, Inc. Vendor ID PCI: Add Amazon's Annapurna Labs vendor ID PCI: vmd: Add device id for VMD device 8086:9A0B x86/amd_nb: Add Family 19h PCI IDs PCI: Add Loongson vendor ID serial: 8250_pci: Move Pericom IDs to pci_ids.h PCI: Make ACS quirk implementations more uniform PCI: Unify ACS quirk desired vs provided checking PCI: Generalize multi-function power dependency device links btrfs: fix error handling when submitting direct I/O bio btrfs: fix wrong file range cleanup after an error filling dealloc range ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init() PCI: Program MPS for RCiEP devices e1000e: Disable TSO for buffer overrun workaround e1000e: Relax condition to trigger reset for ME workaround carl9170: remove P2P_GO support media: go7007: fix a miss of snd_card_free Bluetooth: hci_bcm: fix freeing not-requested IRQ b43legacy: Fix case where channel status is corrupted b43: Fix connection problem with WPA3 b43_legacy: Fix connection problem with WPA3 media: ov5640: fix use of destroyed mutex igb: Report speed and duplex as unknown when device is runtime suspended power: vexpress: add suppress_bind_attrs to true pinctrl: samsung: Correct setting of eint wakeup mask on s5pv210 pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs gnss: sirf: fix error return code in sirf_probe() sparc32: fix register window handling in genregs32_[gs]et() sparc64: fix misuses of access_process_vm() in genregs32_[sg]et() dm crypt: avoid truncating the logical block size alpha: fix memory barriers so that they conform to the specification kernel/cpu_pm: Fix uninitted local in cpu_pm ARM: tegra: Correct PL310 Auxiliary Control Register initialization ARM: dts: exynos: Fix GPIO polarity for thr GalaxyS3 CM36651 sensor's bus ARM: dts: at91: sama5d2_ptc_ek: fix vbus pin ARM: dts: s5pv210: Set keep-power-in-suspend for SDHCI1 on Aries drivers/macintosh: Fix memleak in windfarm_pm112 driver powerpc/64s: Don't let DT CPU features set FSCR_DSCR powerpc/64s: Save FSCR to init_task.thread.fscr after feature init kbuild: force to build vmlinux if CONFIG_MODVERSION=y sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations. sunrpc: clean up properly in gss_mech_unregister() mtd: rawnand: brcmnand: fix hamming oob layout mtd: rawnand: pasemi: Fix the probe error path w1: omap-hdq: cleanup to add missing newline for some dev_dbg perf probe: Do not show the skipped events perf probe: Fix to check blacklist address correctly perf probe: Check address correctness by map instead of _etext perf symbols: Fix debuginfo search for Ubuntu Linux 4.19.129 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I7b1108d90ee1109a28fe488a4358b7a3e101d9c9	2020-06-22 10:50:54 +02:00
Pavel Tatashin	88afa532c1	mm: initialize deferred pages with interrupts enabled commit 3d060856adfc59afb9d029c233141334cfaba418 upstream. Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems. 1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading. We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See `3a2d7fa8a3`). Let's keep interrupts enabled. In case we ever encounter a scenario where an interrupt thread wants to allocate large amount of memory this early in boot we can deal with that by growing zone (see deferred_grow_zone()) by the needed amount before starting deferred_init_memmap() threads. Before: [ 1.232459] node 0 initialised, 12058412 pages in 1ms After: [ 1.632580] node 0 initialised, 12051227 pages in 436ms Fixes: `3a2d7fa8a3` ("mm: disable interrupts while initializing deferred pages") Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Yiqian Wei <yiwei@redhat.com> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-22 09:05:21 +02:00
Greg Kroah-Hartman	9dc35e84f8	ANDROID: GKI: mm: add Android ABI padding to some structures Try to mitigate potential future driver core api changes by adding a padding to stuct vm_area_struct and struct zone. Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel Bug: 151154716 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I81702aa833f419928e0e32e9609722b98592c171	2020-05-01 15:18:12 +02:00
Suren Baghdasaryan	c69ff7a87b	ANDROID: GKI: fix ABI diffs caused by ION heap and pool vmstat additions New nr_ion_heap and nr_ion_heap_pool fields in vmstat file cause ABI differences. Fix them by adding the fields. Bug: 110330255 Bug: 130198686 Bug: 153442668 Test: boot Signed-off-by: Minchan Kim <minchan@google.com> (cherry picked from commit 8f76a3ed2a76492deba9edebc990cac3b32f78be) Signed-off-by: Martin Liu <liumartin@google.com> [surenb: cherry picked and trimmed the original patch to include only necessary changes to resolve ABI diff for node_stat_item enum] Bug: 149182139 Test: build and boot Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I82b43bd3d94b950bd266aa53a3794182604a52d0	2020-04-17 06:43:05 +00:00
Mark Salyzyn	fbc355c16d	Revert "BACKPORT: mm: move zone watermark accesses behind an accessor" This reverts commit `acfb1c608b`. Reason for revert: revert customized code Bug: 140544941 Test: boot Signed-off-by: Minchan Kim <minchan@google.com> Signed-off-by: Martin Liu <liumartin@google.com> Signed-off-by: Mark Salyzyn <salyzyn@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I4988ddf1fc24579d4fc478de1de6e45b870f7bcd	2020-04-14 19:35:44 +08:00
Mark Salyzyn	35be952ae0	Revert "BACKPORT: mm: reclaim small amounts of memory when an external fragmentation event occurs" This reverts commit `5cbbeadd5a`. Reason for revert: revert customized code Bug: 140544941 Test: boot Signed-off-by: Minchan Kim <minchan@google.com> Signed-off-by: Martin Liu <liumartin@google.com> Signed-off-by: Mark Salyzyn <salyzyn@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I65735f27f6a44a112957bcec07e2f63f2d8ccff6	2020-04-14 19:35:27 +08:00
Mark Salyzyn	776eba0172	Revert "BACKPORT: mm, compaction: be selective about what pageblocks to clear skip hints" This reverts commit `fd9c71c06b`. Reason for revert: revert customized code Bug: 140544941 Test: boot Signed-off-by: Minchan Kim <minchan@google.com> Signed-off-by: Martin Liu <liumartin@google.com> Signed-off-by: Mark Salyzyn <salyzyn@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I58d978c6f74f8a1a04b17fa6d3890ddc4f1988c0	2020-04-14 19:35:13 +08:00
Vijayanand Jitta	c9b2421c4c	ANDROID: GKI: mm: introduce NR_UNRECLAIMABLE_PAGES Introduce NR_UNRECLAIMABLE_PAGES memory counter which accounts the pages that cannot be reclaimed under memory pressure. Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org> (cherry picked from commit `66db0cfc65`) [surenb: cherry-picked from: `66db0cfc65` "mm: introduce NR_UNRECLAIMABLE_PAGES" to resolve ABI diff for per_cpu_nodestat struct] Bug: 150808082 Test: build Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I8c99dcec04df3c9ead4ea0eebd11c193af9fbc75	2020-03-23 17:11:57 -07:00
Liam Mark	c98dd3b1df	ANDROID: GKI: mm: add cma pcp list Add a cma pcp list in order to increase cma memory utilization. Increased cma memory utilization will improve overall memory utilization because free cma pages are ignored when memory reclaim is done with gfp mask GFP_KERNEL. Since most memory reclaim is done by kswapd, which uses a gfp mask of GFP_KERNEL, by increasing cma memory utilization we are therefore ensuring that less aggressive memory reclaim takes place. Increased cma memory utilization will improve performance, for example it will increase app concurrency. Change-Id: I809589a25c6abca51f1c963f118adfc78e955cf9 Signed-off-by: Liam Mark <lmark@codeaurora.org> [vinmenon@codeaurora.org: fix !CONFIG_CMA compile time issues] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [swatsrid@codeaurora.org: Fix merge conflicts] Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org> (cherry picked from commit `0caf6be3e0`) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964	2020-03-04 12:01:17 -08:00
Mark Salyzyn	c29070e5b9	ANDROID: GKI: cma: redirect page allocation to CMA CMA pages are designed to be used as fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. The userspace pages can be migrated or dropped quickly which giving decent utilization. Change-Id: I6165dda01b705309eebabc6dfa67146b7a95c174 Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Heesub Shin <heesub.shin@samsung.com> [lauraa@codeaurora.org: Missing CONFIG_CMA guards, add commit text] Signed-off-by: Laura Abbott <lauraa@codeaurora.org> [lmark@codeaurora.org: resolve conflicts relating to MIGRATE_HIGHATOMIC] Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [swatsrid@codeaurora.org: Fix merge conflicts] Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org> (cherry picked from commit `46f8fca539`) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964	2020-03-04 12:01:17 -08:00
Mark Salyzyn	fd9c71c06b	BACKPORT: mm, compaction: be selective about what pageblocks to clear skip hints Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I27b3d1bf8100d0281ec297ef5ce79d100d0cb37e Git-commit: e332f741a8dd1ec9a6dc8aa997296ecbfe64323e Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> (cherry picked from commit e332f741a8dd1ec9a6dc8aa997296ecbfe64323e) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964	2020-03-04 12:01:17 -08:00
Mark Salyzyn	5cbbeadd5a	BACKPORT: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ied06272defcdbf3fff07b7ebccb46c68ce081e1e Git-commit: 1c30844d2dfe272d58c8fc000960b835d13aa2ac Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> (cherry picked from commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964	2020-03-04 12:01:17 -08:00
Mark Salyzyn	acfb1c608b	BACKPORT: mm: move zone watermark accesses behind an accessor This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Iccef5f02348ab59d75cbbe2b48f40017391b88b1 Git-commit: a921444382b49cc7fdeca3fba3e278bc09484a27 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> (cherry picked from commit a921444382b49cc7fdeca3fba3e278bc09484a27) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964	2020-03-04 12:01:16 -08:00
Greg Kroah-Hartman	654c66e990	Merge 4.19.100 into android-4.19 Changes in 4.19.100 can, slip: Protect tty->disc_data in write_wakeup and close with RCU firestream: fix memory leaks gtp: make sure only SOCK_DGRAM UDP sockets are accepted ipv6: sr: remove SKB_GSO_IPXIP6 on End.D* actions net: bcmgenet: Use netif_tx_napi_add() for TX NAPI net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM net: ip6_gre: fix moving ip6gre between namespaces net, ip6_tunnel: fix namespaces move net, ip_tunnel: fix namespaces move net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() net_sched: fix datalen for ematch net-sysfs: Fix reference count leak in rx\|netdev_queue_add_kobject net-sysfs: fix netdev_queue_add_kobject() breakage net-sysfs: Call dev_hold always in netdev_queue_add_kobject net-sysfs: Call dev_hold always in rx_queue_add_kobject net-sysfs: Fix reference count leak net: usb: lan78xx: Add .ndo_features_check Revert "udp: do rmem bulk free even if the rx sk queue is empty" tcp_bbr: improve arithmetic division in bbr_update_bw() tcp: do not leave dangling pointers in tp->highest_sack tun: add mutex_unlock() call and napi.skb clearing in tun_get_user() afs: Fix characters allowed into cell names hwmon: (adt7475) Make volt2reg return same reg as reg2volt input hwmon: (core) Do not use device managed functions for memory allocations PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken tracing: trigger: Replace unneeded RCU-list traversals Input: keyspan-remote - fix control-message timeouts Revert "Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers" ARM: 8950/1: ftrace/recordmcount: filter relocation types mmc: tegra: fix SDR50 tuning override mmc: sdhci: fix minimum clock rate for v3 controller Documentation: Document arm64 kpti control Input: pm8xxx-vib - fix handling of separate enable register Input: sur40 - fix interface sanity checks Input: gtco - fix endpoint sanity check Input: aiptek - fix endpoint sanity check Input: pegasus_notetaker - fix endpoint sanity check Input: sun4i-ts - add a check for devm_thermal_zone_of_sensor_register netfilter: nft_osf: add missing check for DREG attribute hwmon: (nct7802) Fix voltage limits to wrong registers scsi: RDMA/isert: Fix a recently introduced regression related to logout tracing: xen: Ordered comparison of function pointers do_last(): fetch directory ->i_mode and ->i_uid before it's too late net/sonic: Add mutual exclusion for accessing shared state net/sonic: Clear interrupt flags immediately net/sonic: Use MMIO accessors net/sonic: Fix interface error stats collection net/sonic: Fix receive buffer handling net/sonic: Avoid needless receive descriptor EOL flag updates net/sonic: Improve receive descriptor status flag check net/sonic: Fix receive buffer replenishment net/sonic: Quiesce SONIC before re-initializing descriptor memory net/sonic: Fix command register usage net/sonic: Fix CAM initialization net/sonic: Prevent tx watchdog timeout tracing: Use hist trigger's var_ref array to destroy var_refs tracing: Remove open-coding of hist trigger var_ref management tracing: Fix histogram code when expression has same var as value sd: Fix REQ_OP_ZONE_REPORT completion handling crypto: geode-aes - switch to skcipher for cbc(aes) fallback coresight: etb10: Do not call smp_processor_id from preemptible coresight: tmc-etf: Do not call smp_processor_id from preemptible libertas: Fix two buffer overflows at parsing bss descriptor media: v4l2-ioctl.c: zero reserved fields for S/TRY_FMT scsi: iscsi: Avoid potential deadlock in iscsi_if_rx func netfilter: ipset: use bitmap infrastructure completely netfilter: nf_tables: add __nft_chain_type_get() net/x25: fix nonblocking connect mm/memory_hotplug: make remove_memory() take the device_hotplug_lock mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() mm, sparse: pass nid instead of pgdat to sparse_add_one_section() drivers/base/memory.c: remove an unnecessary check on NR_MEM_SECTIONS mm, memory_hotplug: add nid parameter to arch_remove_memory mm/memory_hotplug: release memory resource after arch_remove_memory() drivers/base/memory.c: clean up relics in function parameters mm, memory_hotplug: update a comment in unregister_memory() mm/memory_hotplug: make unregister_memory_section() never fail mm/memory_hotplug: make __remove_section() never fail powerpc/mm: Fix section mismatch warning mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail s390x/mm: implement arch_remove_memory() mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE drivers/base/memory: pass a block_id to init_memory_block() mm/memory_hotplug: create memory block devices after arch_add_memory() mm/memory_hotplug: remove memory block devices before arch_remove_memory() mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section mm/hotplug: kill is_dev_zone() usage in __remove_pages() drivers/base/node.c: simplify unregister_memory_block_under_nodes() mm/memunmap: don't access uninitialized memmap in memunmap_pages() mm/memory_hotplug: fix try_offline_node() mm/memory_hotplug: shrink zones when offlining memory Linux 4.19.100 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I1664d6d4de9358bff5632c291a26e1401ec7b5f1	2020-01-29 17:10:45 +01:00
Wei Yang	b1dbaa1916	mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() commit 83af658898cb292a32d8b6cd9b51266d7cfc4b6a upstream. pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-29 16:43:24 +01:00
Vlastimil Babka	2b618a16e7	UPSTREAM: mm: rename and change semantics of nr_indirectly_reclaimable_bytes The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by commit `eb59254608` ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with the goal of accounting objects that can be reclaimed, but cannot be allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user is converted. The counter is however still useful for accounting direct page allocations (i.e. not slab) with a shrinker, such as the ION page pool. So keep it, and: - change granularity to pages to be more like other counters; sub-page allocations should be able to use kmalloc - rename the counter to NR_KERNEL_MISC_RECLAIMABLE - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can again remove the check for not printing "hidden" counters Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> cherry picked from commit b29940c1abd7a4c3abeb926df0a5ec84d6902d47) Bug: 138148041 Test: verify KReclaimable accounting after ION allocation+deallocation Change-Id: I6694939516e5dfec388e038131021e3885c2640b Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-12-13 14:04:35 -08:00
Sami Tolvanen	9c265248fa	FROMLIST: scs: add accounting This change adds accounting for the memory allocated for shadow stacks. Bug: 145210207 Change-Id: I51157fe0b23b4cb28bb33c86a5dfe3ac911296a4 (am from https://lore.kernel.org/patchwork/patch/1149055/) Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2019-11-27 12:37:25 -08:00
Johannes Weiner	b4a56abdf7	UPSTREAM: mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I71df060dce5590a3c654f9a0e8e54deeb74b64c2 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:26 -07:00
Srikar Dronamraju	e054637597	mm, sched/numa: Remove remaining traces of NUMA rate-limiting Remove the leftover pglist_data::numabalancing_migrate_lock and its initialization, we stopped using this lock with: `efaffc5e40` ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration") [ mingo: Rewrote the changelog. ] Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-10-09 08:30:51 +02:00
Mel Gorman	efaffc5e40	mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration Rate limiting of page migrations due to automatic NUMA balancing was introduced to mitigate the worst-case scenario of migrating at high frequency due to false sharing or slowly ping-ponging between nodes. Since then, a lot of effort was spent on correctly identifying these pages and avoiding unnecessary migrations and the safety net may no longer be required. Jirka Hladky reported a regression in 4.17 due to a scheduler patch that avoids spreading STREAM tasks wide prematurely. However, once the task was properly placed, it delayed migrating the memory due to rate limiting. Increasing the limit fixed the problem for him. Currently, the limit is hard-coded and does not account for the real capabilities of the hardware. Even if an estimate was attempted, it would not properly account for the number of memory controllers and it could not account for the amount of bandwidth used for normal accesses. Rather than fudging, this patch simply eliminates the rate limiting. However, Jirka reports that a STREAM configuration using multiple processes achieved similar performance to 4.16. In local tests, this patch improved performance of STREAM relative to the baseline but it is somewhat machine-dependent. Most workloads show little or not performance difference implying that there is not a heavily reliance on the throttling mechanism and it is safe to remove. STREAM on 2-socket machine 4.19.0-rc5 4.19.0-rc5 numab-v1r1 noratelimit-v1r1 MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%) MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%) MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%) MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24% Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-10-02 11:31:14 +02:00
Pavel Tatashin	c1093b746c	mm: access zone->node via zone_to_nid() and zone_set_nid() zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to have inline functions to access this field in order to avoid ifdef's in c files. Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 10:52:45 -07:00
Oscar Salvador	89696701ea	mm: remove zone_id() and make use of zone_idx() in is_dev_zone() is_dev_zone() is using zone_id() to check if the zone is ZONE_DEVICE. zone_id() looks pretty much the same as zone_idx(), and while the use of zone_idx() is quite spread in the kernel, zone_id() is only being used by is_dev_zone(). This patch removes zone_id() and makes is_dev_zone() use zone_idx() to check the zone, so we do not have two things with the same functionality around. Link: http://lkml.kernel.org/r/20180730133718.28683-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 10:52:45 -07:00
Joonsoo Kim	d3cda2337b	mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that important to reserve. When ZONE_MOVABLE is used, this problem would theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE allocation request which is mainly used for page cache and anon page allocation. So, fix it by setting 0 to sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM]. And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size makes code complex. For example, if there is highmem system, following reserve ratio is activated for NORMAL ZONE which would be easyily misleading people. #ifdef CONFIG_HIGHMEM 32 #endif This patch also fixes this situation by defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to right place. Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tony Lindgren <tony@atomide.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-11 10:28:32 -07:00
Roman Gushchin	eb59254608	mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES Patch series "indirectly reclaimable memory", v2. This patchset introduces the concept of indirectly reclaimable memory and applies it to fix the issue of when a big number of dentries with external names can significantly affect the MemAvailable value. This patch (of 3): Introduce a concept of indirectly reclaimable memory and adds the corresponding memory counter and /proc/vmstat item. Indirectly reclaimable memory is any sort of memory, used by the kernel (except of reclaimable slabs), which is actually reclaimable, i.e. will be released under memory pressure. The counter is in bytes, as it's not always possible to count such objects in pages. The name contains BYTES by analogy to NR_KERNEL_STACK_KB. Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-11 10:28:29 -07:00
David Rientjes	5ecd9d403a	mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory Kswapd will not wakeup if per-zone watermarks are not failing or if too many previous attempts at background reclaim have failed. This can be true if there is a lot of free memory available. For high- order allocations, kswapd is responsible for waking up kcompactd for background compaction. If the zone is not below its watermarks or reclaim has recently failed (lots of free memory, nothing left to reclaim), kcompactd does not get woken up. When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be woken up even if kswapd will not reclaim. This allows high-order allocations, such as thp, to still trigger background compaction even when the zone has an abundance of free memory. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-05 21:36:27 -07:00
Pavel Tatashin	3a2d7fa8a3	mm: disable interrupts while initializing deferred pages Vlastimil Babka reported about a window issue during which when deferred pages are initialized, and the current version of on-demand initialization is finished, allocations may fail. While this is highly unlikely scenario, since this kind of allocation request must be large, and must come from interrupt handler, we still want to cover it. We solve this by initializing deferred pages with interrupts disabled, and holding node_size_lock spin lock while pages in the node are being initialized. The on-demand deferred page initialization that comes later will use the same lock, and thus synchronize with deferred_init_memmap(). It is unlikely for threads that initialize deferred pages to be interrupted. They run soon after smp_init(), but before modules are initialized, and long before user space programs. This is why there is no adverse effect of having these threads running with interrupts disabled. [pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-05 21:36:24 -07:00
David Rientjes	fc5d1073ca	x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logic node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: `f03574f2d5` ("x86-32, mm: Rip out x86_32 NUMA remapping code") Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803262325540.256524@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-27 08:45:02 +02:00
Petr Tesarik	def9b71ee6	include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer The comment is confusing. On the one hand, it refers to 32-bit alignment (struct page alignment on 32-bit platforms), but this would only guarantee that the 2 lowest bits must be zero. On the other hand, it claims that at least 3 bits are available, and 3 bits are actually used. This is not broken, because there is a stronger alignment guarantee, just less obvious. Let's fix the comment to make it clear how many bits are available and why. Although memmap arrays are allocated in various places, the resulting pointer is encoded eventually, so I am adding a BUG_ON() here to enforce at runtime that all expected bits are indeed available. I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is sufficient, because this part of the calculation can be easily checked at build time. [ptesarik@suse.com: v2] Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.cz Signed-off-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemi Wang <kemi.wang@intel.com> Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-31 17:18:39 -08:00
Pavel Tatashin	d135e57502	mm/page_alloc.c: broken deferred calculation In reset_deferred_meminit() we determine number of pages that must not be deferred. We initialize pages for at least 2G of memory, but also pages for reserved memory in this node. The reserved memory is determined in this function: memblock_reserved_memory_within(), which operates over physical addresses, and returns size in bytes. However, reset_deferred_meminit() assumes that that this function operates with pfns, and returns page count. The result is that in the best case machine boots slower than expected due to initializing more pages than needed in single thread, and in the worst case panics because fewer than needed pages are initialized early. Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com Fixes: `864b9a393d` ("mm: consider memblock reservations for deferred memory initialization sizing") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:07 -08:00
Andrey Ryabinin	3a50d14d0d	mm: remove unused pgdat->inactive_ratio Since commit `59dc76b0d4` ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:03 -08:00
Ingo Molnar	b3d9a13681	Merge branch 'linus' into x86/asm, to pick up fixes and resolve conflicts Conflicts: arch/x86/kernel/cpu/Makefile Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-11-07 10:53:06 +01:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Kirill A. Shutemov	83e3c48729	mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y Size of the mem_section[] array depends on the size of the physical address space. In preparation for boot-time switching between paging modes on x86-64 we need to make the allocation of mem_section[] dynamic, because otherwise we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB for 4-level paging and 2MB for 5-level paging mode. The patch allocates the array on the first call to sparse_memory_present_with_active_regions(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@suse.de> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-10-20 13:07:09 +02:00
YASUAKI ISHIMATSU	1dd2bfc868	mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function pfn_to_section_nr() and section_nr_to_pfn() are defined as macro. pfn_to_section_nr() has no issue even if it is defined as macro. But section_nr_to_pfn() has overflow issue if sec is defined as int. section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value. But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit value. __remove_section() calculates start_pfn using section_nr_to_pfn() and scn_nr defined as int. So if hot-removed memory address is over 16TB, overflow issue occurs and section_nr_to_pfn() does not calculate correct pfn. To make callers use proper arg, the patch changes the macros to inline functions. Fixes: `815121d2b5` ("memory_hotplug: clear zone when removing the memory") Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-10-03 17:54:25 -07:00
Kemi Wang	1d90ca897c	mm: update NUMA counter threshold size There is significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (suggested by Dave Hansen). This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter(suggested by Ying Huang). The rationality is that these statistics counters don't affect the kernel's decision, unlike other VM counters, so it's not a problem to use a large threshold. With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default (base) 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Kemi Wang	3a321d2a3d	mm: change the call sites of numa statistics items Patch series "Separate NUMA statistics from zone statistics", v2. Each page allocation updates a set of per-zone statistics with a call to zone_statistics(). As discussed in 2017 MM summit, these are a substantial source of overhead in the page allocator and are very rarely consumed. This significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (pointed out by Dave Hansen). A link to the MM summit slides: http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf To mitigate this overhead, this patchset separates NUMA statistics from zone statistics framework, and update NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter (suggested by Ying Huang). The rationality is that these statistics counters don't need to be read often, unlike other VM counters, so it's not a problem to use a large threshold and make readers more expensive. With this patchset, we see 31.3% drop of CPU cycles(537-->369, see below) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Meanwhile, this patchset keeps the same style of virtual memory statistics with little end-user-visible effects (only move the numa stats to show behind zone page stats, see the first patch for details). I did an experiment of single page allocation and reclaim concurrently using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88 processors with 126G memory) with different size of threshold of pcp counter. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics This patch (of 3): In this patch, NUMA statistics is separated from zone statistics framework, all the call sites of NUMA stats are changed to use numa-stats-specific functions, it does not have any functionality change except that the number of NUMA stats is shown behind zone page stats when users read the zone info. E.g. cat /proc/zoneinfo *Base* *With this patch* nr_free_pages 3976 nr_free_pages 3976 nr_zone_inactive_anon 0 nr_zone_inactive_anon 0 nr_zone_active_anon 0 nr_zone_active_anon 0 nr_zone_inactive_file 0 nr_zone_inactive_file 0 nr_zone_active_file 0 nr_zone_active_file 0 nr_zone_unevictable 0 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_zone_write_pending 0 nr_mlock 0 nr_mlock 0 nr_page_table_pages 0 nr_page_table_pages 0 nr_kernel_stack 0 nr_kernel_stack 0 nr_bounce 0 nr_bounce 0 nr_zspages 0 nr_zspages 0 numa_hit 0 nr_free_cma 0 numa_miss 0 numa_hit 0 numa_foreign 0 numa_miss 0 numa_interleave 0 numa_foreign 0 numa_local 0 numa_interleave 0 numa_other 0 numa_local 0 nr_free_cma 0 numa_other 0 ... ... vm stats threshold: 10 vm stats threshold: 10 ... ... The next patch updates the numa stats counter size and threshold. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Michal Hocko	b93e0f329e	mm, memory_hotplug: get rid of zonelists_mutex zonelists_mutex was introduced by commit `4eaf3f6439` ("mem-hotplug: fix potential race while building zonelist for new populated zone") to protect zonelist building from races. This is no longer needed though because both memory online and offline are fully serialized. New users have grown since then. Notably setup_per_zone_wmarks wants to prevent from races between memory hotplug, khugepaged setup and manual min_free_kbytes update via sysctl (see `cfd3da1e49` ("mm: Serialize access to min_free_kbytes"). Let's add a private lock for that purpose. This will not prevent from seeing halfway through memory hotplug operation but that shouldn't be a big deal becuse memory hotplug will update watermarks explicitly so we will eventually get a full picture. The lock just makes sure we won't race when updating watermarks leading to weird results. Also __build_all_zonelists manipulates global data so add a private lock for it as well. This doesn't seem to be necessary today but it is more robust to have a lock there. While we are at it make sure we document that memory online/offline depends on a full serialization either via mem_hotplug_begin() or device_lock. Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Haicheng Li <haicheng.li@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:26 -07:00
Michal Hocko	72675e131e	mm, memory_hotplug: drop zone from build_all_zonelists build_all_zonelists gets a zone parameter to initialize zone's pagesets. There is only a single user which gives a non-NULL zone parameter and that one doesn't really need the rest of the build_all_zonelists (see commit `6dcd73d701` ("memory-hotplug: allocate zone's pcp before onlining pages")). Therefore remove setup_zone_pageset from build_all_zonelists and call it from its only user directly. This will also remove a pointless zonlists rebuilding which is always good. Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:25 -07:00
Michal Hocko	c9bff3eebc	mm, page_alloc: rip out ZONELIST_ORDER_ZONE Patch series "cleanup zonelists initialization", v1. This is aimed at cleaning up the zonelists initialization code we have but the primary motivation was bug report [2] which got resolved but the usage of stop_machine is just too ugly to live. Most patches are straightforward but 3 of them need a special consideration. Patch 1 removes zone ordered zonelists completely. I am CCing linux-api because this is a user visible change. As I argue in the patch description I do not think we have a strong usecase for it these days. I have kept sysctl in place and warn into the log if somebody tries to configure zone lists ordering. If somebody has a real usecase for it we can revert this patch but I do not expect anybody will actually notice runtime differences. This patch is not strictly needed for the rest but it made patch 6 easier to implement. Patch 7 removes stop_machine from build_all_zonelists without adding any special synchronization between iterators and updater which I _believe_ is acceptable as explained in the changelog. I hope I am not missing anything. Patch 8 then removes zonelists_mutex which is kind of ugly as well and not really needed AFAICS but a care should be taken when double checking my thinking. This patch (of 9): Supporting zone ordered zonelists costs us just a lot of code while the usefulness is arguable if existent at all. Mel has already made node ordering default on 64b systems. 32b systems are still using ZONELIST_ORDER_ZONE because it is considered better to fallback to a different NUMA node rather than consume precious lowmem zones. This argument is, however, weaken by the fact that the memory reclaim has been reworked to be node rather than zone oriented. This means that lowmem requests have to skip over all highmem pages on LRUs already and so zone ordering doesn't save the reclaim time much. So the only advantage of the zone ordering is under a light memory pressure when highmem requests do not ever hit into lowmem zones and the lowmem pressure doesn't need to reclaim. Considering that 32b NUMA systems are rather suboptimal already and it is generally advisable to use 64b kernel on such a HW I believe we should rather care about the code maintainability and just get rid of ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if somebody tries to set zone ordering either from kernel command line or the sysctl. [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate] Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:25 -07:00
Michal Hocko	9d1f4b3f5b	mm: disallow early_pfn_to_nid on configurations which do not implement it early_pfn_to_nid will return node 0 if both HAVE_ARCH_EARLY_PFN_TO_NID and HAVE_MEMBLOCK_NODE_MAP are disabled. It seems we are safe now because all architectures which support NUMA define one of them (with an exception of alpha which however has CONFIG_NUMA marked as broken) so this works as expected. It can get silently and subtly broken too easily, though. Make sure we fail the compilation if NUMA is enabled and there is no proper implementation for this function. If that ever happens we know that either the specific configuration is invalid and the fix should either disable NUMA or enable one of the above configs. Link: http://lkml.kernel.org/r/20170704075803.15979-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-10 16:32:33 -07:00
Nikolay Borisov	618b8c20d0	include/linux/mmzone.h: remove ancient/ambiguous comment Currently pg_data_t is just a struct which describes a NUMA node memory layout. Let's keep the comment simple and remove ambiguity. Link: http://lkml.kernel.org/r/1498220534-22717-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-10 16:32:32 -07:00
Johannes Weiner	385386cff4	mm: vmstat: move slab statistics from zone to node counters Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Michal Hocko	f1dd2cd13c	mm, memory_hotplug: do not associate hotadded memory to zones until online The current memory hotplug implementation relies on having all the struct pages associate with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since `9d99aaa31f` ("[PATCH] x86_64: Support memory hotadd without sparsemem") and it wasn't a big deal back then because movable onlining didn't exist yet. Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable onlining `511c2aba8f` ("mm, memory-hotplug: dynamic configure movable memory and portion memory") and then things got more complicated. Rather than reconsidering the zone association which was no longer needed (because the memory hotplug already depended on SPARSEMEM) a convoluted semantic of zone shifting has been developed. Only the currently last memblock or the one adjacent to the zone_movable can be onlined movable. This essentially means that the online type changes as the new memblocks are added. Let's simulate memory hot online manually $ echo 0x100000000 > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable $ echo online_movable > /sys/devices/system/memory/memory34/state $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This is an awkward semantic because an udev event is sent as soon as the block is onlined and an udev handler might want to online it based on some policy (e.g. association with a node) but it will inherently race with new blocks showing up. This patch changes the physical online phase to not associate pages with any zone at all. All the pages are just marked reserved and wait for the onlining phase to be associated with the zone as per the online request. There are only two requirements - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses the latter one is not an inherent requirement and can be changed in the future. It preserves the current behavior and made the code slightly simpler. This is subject to change in future. This means that the same physical online steps as above will lead to the following state: Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Implementation: The current move_pfn_range is reimplemented to check the above requirements (allow_online_pfn_range) and then updates the respective zone (move_pfn_range_to_zone), the pgdat and links all the pages in the pfn range with the zone/node. __add_pages is updated to not require the zone and only initializes sections in the range. This allowed to simplify the arch_add_memory code (s390 could get rid of quite some of code). devm_memremap_pages is the only user of arch_add_memory which relies on the zone association because it only hooks into the memory hotplug only half way. It uses it to associate the new memory with ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs. This means that this particular code path has to call move_pfn_range_to_zone explicitly. The original zone shifting code is kept in place and will be removed in the follow up patch for an easier review. Please note that this patch also changes the original behavior when offlining a memory block adjacent to another zone (Normal vs. Movable) used to allow to change its movable type. This will be handled later. [richard.weiyang@gmail.com: simplify zone_intersects()] Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com [richard.weiyang@gmail.com: remove duplicate call for set_page_links] Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com [akpm@linux-foundation.org: remove unused local `i'] Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:32 -07:00
Michal Hocko	2d070eab2e	mm: consider zone which is not fully populated to have holes __pageblock_pfn_to_page has two users currently, set_zone_contiguous which checks whether the given zone contains holes and pageblock_pfn_to_page which then carefully returns a first valid page from the given pfn range for the given zone. This doesn't handle zones which are not fully populated though. Memory pageblocks can be offlined or might not have been onlined yet. In such a case the zone should be considered to have holes otherwise pfn walkers can touch and play with offline pages. Current callers of pageblock_pfn_to_page in compaction seem to work properly right now because they only isolate PageBuddy (isolate_freepages_block) or PageLRU resp. __PageMovable (isolate_migratepages_block) which will be always false for these pages. It would be safer to skip these pages altogether, though. In order to do this patch adds a new memory section state (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or in online_pages_range during the memory hotplug. Similarly offline_mem_sections clears the bit and it is called when the memory range is offlined. pfn_to_online_page helper is then added which check the mem section and only returns a page if it is onlined already. Use the new helper in __pageblock_pfn_to_page and skip the whole page block in such a case. [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil), mark sections online after all struct pages are initialized in online_pages_range (Vlastimil)] Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:32 -07:00

1 2 3 4 5 ...

378 Commits