Commit Graph

30007 Commits

Author SHA1 Message Date
Rick Yiu
7a1b8ea81e sched/fair: use actual cpu capacity to calculate boosted util
Currently, the boosted util of a cpu is calculated with a fixed
value of 1024. So when top-app tasks are moved to LC, which has
much lower capacity than BC, the freq calculated will be high even
when the cpu util is low. This results in higher power consumption,
especially on architectures that have more little cores than big
cores. Replacing the fixed value of 1024 with the actual cpu
capacity reduces the freq calculated on LC.
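
As a rough illustration only (not the code of this patch), the effect can be sketched in user-space C. The helper names and the percentage-based margin formula below are assumptions made for the example; the point is just that the margin shrinks when the headroom is measured against the actual cpu capacity instead of 1024.

```c
#include <stdint.h>

/*
 * Hypothetical helper: margin added on top of util for a boost given in
 * percent. The headroom is measured against "capacity", which is either
 * the fixed 1024 or the actual capacity of the cpu.
 */
static inline unsigned long boost_margin(unsigned long util,
					 unsigned long capacity,
					 unsigned int boost_pct)
{
	if (util >= capacity)
		return 0;
	return (capacity - util) * boost_pct / 100;
}

/* Hypothetical helper: util plus its boost margin. */
static inline unsigned long boosted_util(unsigned long util,
					 unsigned long capacity,
					 unsigned int boost_pct)
{
	return util + boost_margin(util, capacity, boost_pct);
}
```

With util=100 and a 50% boost, a fixed capacity of 1024 gives a boosted util of 562, while an assumed little-core capacity of 400 gives 250.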

Bug: 152925197
Test: boosted util reduced on little cores
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I80cdd08a2c7fa5e674c43bfc132584d85c14622b
2020-06-16 13:20:18 +00:00
Rick Yiu
9ff77c2408 sched: separate capacity margin for boosted tasks
With the introduction of the placement hint patch, boosted tasks will not
be scheduled from big cores. We tune the capacity margin to let important
boosted tasks get scheduled on big cores. However, the capacity margin
affects all groups of tasks, so non-boosted tasks get more chances
to be scheduled on big cores, too. This can be solved by using a separate
capacity margin for boosted tasks.
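
A minimal sketch of the idea, assuming the usual "util * margin < capacity * 1024" fit check. The function names and the boosted margin value are hypothetical and chosen only for illustration; they are not taken from this patch.

```c
#include <stdbool.h>

#define CAPACITY_SCALE		1024UL
#define MARGIN_DEFAULT		1280UL	/* about 20% headroom */
#define MARGIN_BOOSTED		1536UL	/* hypothetical larger margin for boosted tasks */

/* A task fits a cpu if its util, scaled by the margin, stays below capacity. */
static inline bool task_fits(unsigned long util, unsigned long capacity,
			     unsigned long margin)
{
	return capacity * CAPACITY_SCALE > util * margin;
}

/* Pick the margin per task so that only boosted tasks are pushed up earlier. */
static inline bool task_fits_cpu(unsigned long util, unsigned long capacity,
				 bool boosted)
{
	return task_fits(util, capacity,
			 boosted ? MARGIN_BOOSTED : MARGIN_DEFAULT);
}
```

In this sketch a task with util=300 still fits a cpu of capacity 400 when it is not boosted, but no longer fits when it is boosted, so only the boosted task would be considered for a bigger cpu.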

Bug: 152925197
Test: margin set correctly
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I0e059c56efa9bc8513f0ef4b0f6ab8f5d04a592a
2020-06-16 13:20:00 +00:00
Wei Wang
ac1e356b2a sched: separate boost signal from placement hint
Test: build and boot
Bug: 144451857
Bug: 147785606
Bug: 152925197
Change-Id: Ib2d86a72cad12971a99c7105813473211a7fbd76
Signed-off-by: Wei Wang <wvw@google.com>
2020-06-16 13:18:33 +00:00
lucaswei
20f57bd70a Revert "vmscan: Support multiple kswapd threads per node"
This reverts commit 7e78bc0ad2.

Reason for revert: revert vendor customization patch
Bug: 157880566
Bug: 157858241
Change-Id: Id3c8f6c950ac01c3e85bea2b8ec0f9d6dce7af42
Signed-off-by: lucaswei <lucaswei@google.com>
2020-06-15 15:39:53 +08:00
lucaswei
56acc710a6 Merge LA.UM.9.12.R2.10.00.00.685.011 via branch 'qcom-msm-4.19-7250' into android-msm-pixel-4.19
Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs
	Documentation/filesystems/f2fs.txt
	Documentation/filesystems/fscrypt.rst
	Documentation/sysctl/vm.txt
	Makefile
	arch/arm64/boot/Makefile
	arch/arm64/configs/vendor/kona_defconfig
	arch/arm64/configs/vendor/lito_defconfig
	arch/arm64/kernel/vdso.c
	arch/arm64/mm/init.c
	arch/arm64/mm/mmu.c
	arch/ia64/mm/init.c
	arch/powerpc/mm/mem.c
	arch/s390/mm/init.c
	arch/sh/mm/init.c
	arch/x86/mm/init_32.c
	arch/x86/mm/init_64.c
	block/bio.c
	block/blk-crypto-fallback.c
	block/blk-crypto-internal.h
	block/blk-crypto.c
	block/blk-merge.c
	block/keyslot-manager.c
	build.config.common
	drivers/base/core.c
	drivers/base/power/wakeup.c
	drivers/char/adsprpc.c
	drivers/char/diag/diagchar_core.c
	drivers/crypto/Makefile
	drivers/crypto/msm/qcedev.c
	drivers/crypto/msm/qcrypto.c
	drivers/dma-buf/dma-buf.c
	drivers/input/input.c
	drivers/input/keycombo.c
	drivers/input/misc/gpio_input.c
	drivers/input/misc/gpio_matrix.c
	drivers/input/touchscreen/st/fts.c
	drivers/md/Kconfig
	drivers/md/dm-default-key.c
	drivers/md/dm.c
	drivers/mmc/host/Makefile
	drivers/mmc/host/sdhci-msm-ice.h
	drivers/net/phy/phy_device.c
	drivers/of/property.c
	drivers/pci/controller/pci-msm.c
	drivers/platform/msm/gsi/Makefile
	drivers/platform/msm/ipa/ipa_rm_inactivity_timer.c
	drivers/platform/msm/ipa/ipa_v3/ipa.c
	drivers/platform/msm/ipa/ipa_v3/ipa_pm.c
	drivers/platform/msm/ipa/ipa_v3/rmnet_ipa.c
	drivers/platform/msm/sps/spsi.h
	drivers/power/supply/qcom/qpnp-smb5.c
	drivers/power/supply/qcom/smb5-lib.h
	drivers/power/supply/qcom/step-chg-jeita.c
	drivers/scsi/ufs/Kconfig
	drivers/scsi/ufs/Makefile
	drivers/scsi/ufs/ufs-qcom-ice.c
	drivers/scsi/ufs/ufs-qcom.c
	drivers/scsi/ufs/ufs-qcom.h
	drivers/scsi/ufs/ufshcd-crypto.c
	drivers/scsi/ufs/ufshcd-crypto.h
	drivers/scsi/ufs/ufshcd.c
	drivers/scsi/ufs/ufshcd.h
	drivers/soc/qcom/Makefile
	drivers/soc/qcom/msm_bus/msm_bus_dbg.c
	drivers/soc/qcom/msm_bus/msm_bus_dbg_rpmh.c
	drivers/soc/qcom/msm_minidump.c
	drivers/soc/qcom/peripheral-loader.c
	drivers/soc/qcom/smp2p.c
	drivers/soc/qcom/smp2p_sleepstate.c
	drivers/soc/qcom/subsystem_restart.c
	drivers/spi/spi-geni-qcom.c
	drivers/thermal/tsens.h
	drivers/tty/serial/Kconfig
	drivers/tty/serial/msm_geni_serial.c
	drivers/usb/typec/tcpm/fusb302.c
	drivers/usb/typec/tcpm/tcpm.c
	fs/crypto/bio.c
	fs/crypto/crypto.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/inline_crypt.c
	fs/crypto/keyring.c
	fs/crypto/keysetup.c
	fs/crypto/keysetup_v1.c
	fs/crypto/policy.c
	fs/eventpoll.c
	fs/ext4/inode.c
	fs/ext4/ioctl.c
	fs/ext4/page-io.c
	fs/ext4/super.c
	fs/f2fs/Kconfig
	fs/f2fs/compress.c
	fs/f2fs/data.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/hash.c
	fs/f2fs/inline.c
	fs/f2fs/inode.c
	fs/f2fs/namei.c
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/ubifs/ioctl.c
	include/linux/bio-crypt-ctx.h
	include/linux/bio.h
	include/linux/blk-crypto.h
	include/linux/blk_types.h
	include/linux/fscrypt.h
	include/linux/gfp.h
	include/linux/keyslot-manager.h
	include/linux/memory_hotplug.h
	include/linux/usb/tcpm.h
	include/linux/usb/typec.h
	include/soc/qcom/socinfo.h
	include/trace/events/f2fs.h
	include/uapi/linux/fscrypt.h
	include/uapi/linux/sched/types.h
	kernel/memremap.c
	kernel/sched/core.c
	kernel/sched/cpufreq_schedutil.c
	kernel/sched/fair.c
	kernel/sched/psi.c
	kernel/sched/sched.h
	kernel/sysctl.c
	lib/Makefile
	lib/test_stackinit.c
	mm/filemap.c
	mm/hmm.c
	mm/memory_hotplug.c
	mm/page_alloc.c
	scripts/gen_autoksyms.sh

Bug: 157994070
Bug: 157858241
Bug: 157879992
Signed-off-by: lucaswei <lucaswei@google.com>
Change-Id: Ib43efc6464e484b85107587c2f770246b48ddee6
2020-06-15 15:36:42 +08:00
Rick Yiu
7c722a0cb3 sched/fair: consider boost margin for type FREQUENCY_UTIL
When computing energy while selecting a task runqueue, the boosted
cpu util is not used, so the result does not reflect the real freq
when a cpu has boosted tasks on it. Address this by adding the boost
margin if the type is FREQUENCY_UTIL in schedutil_freq_util().

Bug: 158637636
Test: boot to home
Change-Id: I13f4283f03c0962dfc82ca7da01319c98e7aa7bf
Signed-off-by: Rick Yiu <rickyiu@google.com>
2020-06-11 07:33:02 +00:00
Martin Liu
3b44513213 sched/psi: add psi trigger event trace
Add psi trigger events to help observe psi state.
Below is an example of the output.

<...>-577   [000] ....   213.883816: psi_update_trigger_growth: mem_some growth=208414044 threshold=15000000 elapsed=206674740 win=1000000000 last_event_diff=1033535524
<...>-577   [000] ....   213.883821: psi_update_trigger_wake_up: mem_some growth=208414044 threshold=15000000 win=1000000000 last_event_time=212850243878

Bug: 157840940
Test: check trace output
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I0a89eec867c11de518ead9e4844cfb7374c4e25b
2020-06-10 02:48:54 +00:00
Vincenzo Frascino
748ec8be6a UPSTREAM: timekeeping: Provide a generic update_vsyscall() implementation
The new generic VDSO library allows to unify the update_vsyscall[_tz]()
implementations.

Provide a generic implementation based on the x86 code and the bindings
which need to be implemented in architecture specific code.

[ tglx: Moved it into kernel/time where it belongs. Removed the pointless
  	line breaks in the stub functions. Massaged changelog ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shijith Thotton <sthotton@marvell.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-4-vincenzo.frascino@arm.com
(cherry picked from commit 44f57d788e7deecb504843534081d3449c2eede9)
Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 154668398
Change-Id: I2a85e391be80f58f6516eb7d8e6448f522fc3013
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
2020-06-09 17:51:52 +08:00
Martin Liu
52d2c35c9a Revert "mm: oom_kill: reap memory of a task that receives SIGKILL"
This reverts commit 97bf2fb571.

Reason to revert: The changes introduced in this commit are
causing an undesirable SELinux denial as a side-effect and
we do not enable the functionality that this commit adds.
Reverting the commit fixes the SELinux denial bug.

Bug: 152624411
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I149b66e6fa0e90e691436e1a83261ff1de233669
2020-06-08 17:26:41 +08:00
Jimmy Shiu
81827827c1 Use find_best_target to select cpu for a zero-util task
Always choosing the prev_cpu for a zero-utilization task
might lead to tasks competing for the same cpu and increase
the overall task execution time.
Instead, select the cpu with find_best_target to spread the
load onto other cpus.

Bug: 143857473
Test: https://paste.googleplex.com/5570415529295872
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
Change-Id: Ibeb766957d2dea5fee85c798d8a9f7b62c2c1a09
2020-05-30 02:29:34 +08:00
Woody Lin
b450d8c9b1 kdebuginfo: Interface to set buildinfo
Bug: 155246473
Change-Id: I2d6efccab9c8b0ee8a8f6c0d069205403d890296
Signed-off-by: Woody Lin <woodylin@google.com>
2020-05-30 02:29:09 +08:00
lucaswei
5b02fceb61 sched/fair: Fix compilation issues for !CONFIG_SCHED_WALT
Fix compilation issues for !CONFIG_SCHED_WALT caused by the following two
commits:

commit a80cf2007d ("sched: Add support to spread tasks")

Bug: 154086870
Bug: 153823050
Signed-off-by: lucaswei <lucaswei@google.com>
Change-Id: I89e224e18f6700ea2abcd162a5b9f3f938a7ad92
2020-05-30 02:28:22 +08:00
lucaswei
95ddbb8a09 Merge LA.UM.9.12.R1.10.00.00.597.042 via branch 'qcom-msm-4.19-7250' into android-msm-pixel-4.19
Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs
	Documentation/filesystems/f2fs.txt
	Documentation/filesystems/fscrypt.rst
	Documentation/filesystems/fsverity.rst
	Makefile
	arch/arm64/configs/vendor/kona_defconfig
	arch/arm64/configs/vendor/lito_defconfig
	block/blk-core.c
	build.config.common
	drivers/base/core.c
	drivers/base/power/main.c
	drivers/clk/clk.c
	drivers/clk/qcom/clk-alpha-pll.c
	drivers/dma-buf/dma-buf.c
	drivers/gpu/msm/kgsl_pool.c
	drivers/input/misc/qpnp-power-on.c
	drivers/iommu/dma-mapping-fast.c
	drivers/iommu/io-pgtable-fast.c
	drivers/iommu/io-pgtable-msm-secure.c
	drivers/iommu/io-pgtable.c
	drivers/of/property.c
	drivers/platform/msm/ipa/ipa_clients/ipa_gsb.c
	drivers/platform/msm/ipa/ipa_clients/ipa_mhi_client.c
	drivers/platform/msm/ipa/ipa_v3/ipa_mpm.c
	drivers/power/supply/power_supply_sysfs.c
	drivers/power/supply/qcom/qg-core.h
	drivers/power/supply/qcom/qpnp-qg.c
	drivers/soc/qcom/scm.c
	drivers/staging/android/ion/ion_page_pool.c
	drivers/tty/serial/msm_geni_serial.c
	fs/crypto/Kconfig
	fs/crypto/bio.c
	fs/crypto/crypto.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/hooks.c
	fs/crypto/keyinfo.c
	fs/crypto/policy.c
	fs/ext4/ext4.h
	fs/ext4/hash.c
	fs/ext4/inode.c
	fs/ext4/namei.c
	fs/ext4/page-io.c
	fs/ext4/readpage.c
	fs/ext4/super.c
	fs/ext4/verity.c
	fs/f2fs/Makefile
	fs/f2fs/data.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/hash.c
	fs/f2fs/inline.c
	fs/f2fs/namei.c
	fs/f2fs/segment.c
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/f2fs/verity.c
	fs/inode.c
	fs/ubifs/dir.c
	fs/unicode/utf8-core.c
	fs/verity/enable.c
	fs/verity/fsverity_private.h
	fs/verity/hash_algs.c
	fs/verity/open.c
	fs/verity/verify.c
	include/linux/coresight.h
	include/linux/device.h
	include/linux/dma-buf.h
	include/linux/f2fs_fs.h
	include/linux/fscrypt.h
	include/linux/fsverity.h
	include/linux/fwnode.h
	include/linux/leds-qpnp-flash.h
	include/linux/perf_event.h
	include/linux/power_supply.h
	include/linux/unicode.h
	include/soc/qcom/scm.h
	include/uapi/linux/nl80211.h
	kernel/events/core.c
	kernel/sched/core.c
	kernel/sched/fair.c
	lib/Kconfig.debug
	lib/Makefile
	lib/test_meminit.c
	mm/slub.c
	mm/swapfile.c
	mm/vmalloc.c
	net/wireless/nl80211.c
	security/selinux/include/security.h

Bug: 153823050
Bug: 153825378
Signed-off-by: lucaswei <lucaswei@google.com>
Change-Id: Ia2bfb56f0d48504ba600b52bdde958a76d5bff72
2020-05-30 02:28:19 +08:00
Eric W. Biederman
4a03fb3835 ANDROID: signal: Extend exec_id to 64bits
commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.

Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter.  With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent.  This
bypasses the signal sending checks if the parent changes their
credentials during exec.

The severity of this problem can be seen in that, in my limited testing
of a 32bit exec_id, it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id.  Adam Zabrocki has succeeded wrapping the self_exec_id in 7
days.  Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticeably faster self_exec_id won't even be a speed bump.
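
The 14-day figure follows from the measured rate: 2^32 execs at 65536 execs per 19s is 65536 * 19 = 1245184 seconds, or about 14.4 days. A small user-space sketch of that arithmetic (the helper name is made up for this example):

```c
#include <stdint.h>

/*
 * Seconds needed to wrap a 32-bit counter, given that "execs" increments
 * take "seconds" seconds. Hypothetical helper for illustration only.
 */
static inline uint64_t seconds_to_wrap_u32(uint64_t execs, uint64_t seconds)
{
	const uint64_t wrap = UINT64_C(1) << 32;

	return wrap * seconds / execs;
}
```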

Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions.  Which means that it is possible to hit
a window where the read value of exec_id does not match the written
value.  So with very lucky timing after this change this still
remains exploitable.

I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.

Bug: 154513111
Test: boot bramble, verify list of probed devices
Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a2a1be2de7)
Signed-off-by: Will McVicker <willmcvicker@google.com>
Change-Id: I55f74f593ea58a97c8bdea0769ebc93083a8f30d
2020-05-30 02:22:13 +08:00
Mark Salyzyn
b2ff09fbe0 GKI: devfreq: move trace definitions to the driver
move bw_hwmon and memlat governor traces from global kernel
definitions, to local module.  Removes the need to maintain
ABI in other kernels.

Test:
  insmod governor_bw_hwmon.ko
  insmod governor_memlat.ko
  mkdir /tmp/t
  mount -t tracefs tracefs /tmp/t
  find /tmp/t/events | grep 'power/bw_hwmon'
  find /tmp/t/events | grep 'power/memlat'

Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 142948174
Bug: 142905293
Change-Id: I98bba1d43cdaede74d9a631416288e2b8d6da9b3
2020-05-30 02:22:13 +08:00
Minchan Kim
6dd6325deb mm: introduce per-process mm event tracking feature
Linux supports /proc/meminfo and /proc/vmstat stats as memory health metrics.
Android uses them too. If users see something go wrong (e.g., sluggishness, jank)
on their system, they can capture and report system state to developers
for debugging.

It shows memory stats at the moment the bug is captured. However, it’s
not enough to investigate an application's jank problem caused by memory
shortage, because:

1. It just shows event counts, which don’t quantify the latency of the
application well. Jank can happen for various reasons, and one simple
scenario is dropping frames for a second. An app should draw a frame every
16ms. Just the number of events (e.g., allocstall or pgmajfault) can't
represent how much time the app spends handling the event.

2. At bugreport time, a dump of vmstat and meminfo is never helpful because it's
too late to capture the moment when the problem happens.
By the time the user notices the problem and tries to capture the system state,
the problem has already gone.

3. Even if we could capture MM stats at the moment the bug happens, it wouldn't
be helpful because MM stats usually fluctuate a lot, so we need historical
data rather than a one-time snapshot to see the MM trend.

To solve the above problems, this patch introduces a per-process, light-weight
mm event stat. Basically, it tracks minor/major faults and reclaim and compaction
latency of each process, as well as event counts, and records the data into a
global buffer.
To limit memory overhead, it doesn't record every MM event of the process
to the buffer but just drains the accumulated stats to the buffer every 0.5 sec.
If there isn't any event, it just skips the recording.
For latency data, it keeps the average/max latency of each event in that period.
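
The accumulate-then-drain scheme described above can be sketched in user-space C as follows. The struct and function names are hypothetical and are not taken from the patch; the caller is assumed to invoke the drain step once per period (0.5 sec in the description).

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-event accumulator kept per process (hypothetical layout). */
struct mm_event_acc {
	uint32_t count;
	uint64_t total_lat;
	uint64_t max_lat;
};

/* One record emitted to the buffer per period (hypothetical layout). */
struct mm_event_rec {
	uint32_t count;
	uint64_t avg_lat;
	uint64_t max_lat;
};

/* Account one event without touching the output buffer. */
static inline void mm_event_add(struct mm_event_acc *acc, uint64_t lat)
{
	acc->count++;
	acc->total_lat += lat;
	if (lat > acc->max_lat)
		acc->max_lat = lat;
}

/*
 * Drain the accumulated stats into one record and reset the accumulator.
 * Returns false, and records nothing, if no event happened in the period.
 */
static inline bool mm_event_drain(struct mm_event_acc *acc,
				  struct mm_event_rec *rec)
{
	if (acc->count == 0)
		return false;

	rec->count = acc->count;
	rec->avg_lat = acc->total_lat / acc->count;
	rec->max_lat = acc->max_lat;

	acc->count = 0;
	acc->total_lat = 0;
	acc->max_lat = 0;
	return true;
}
```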

With that, we can keep useful information in a small buffer so that
we don't miss precious information any longer even if the capture time
is rather late. This patch introduces the basic facility of the MM event stat.

After all patches in this patchset are applied, the output format is as follows;
dumpstate can use it for VM debugging in the future.

<...>-1665  [001] d...   217.575173: mm_event_record: min_flt count=203 avg_lat=3 max_lat=58
<...>-1665  [001] d...   217.575183: mm_event_record: maj_flt count=1 avg_lat=1994 max_lat=1994
<...>-1665  [001] d...   217.575184: mm_event_record: kern_alloc count=227 avg_lat=0 max_lat=0
<...>-626   [000] d...   217.578096: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0
<...>-6547  [000] ....   217.581913: mm_event_record: min_flt count=7 avg_lat=7 max_lat=20
<...>-6547  [000] ....   217.581955: mm_event_record: kern_alloc count=4 avg_lat=0 max_lat=0

This feature uses event trace for the output buffer so that we can use all the
general benefits of event trace (e.g., buffer size management, filtering and
so on). To prevent overflow of the ring buffer by other random event traces,
it is highly suggested to create a separate tracing instance
under /sys/kernel/debug/tracing/instances/

I had a concern about adding overhead. Actually, major fault/compaction/reclaim
are already heavy, so they should not be a concern. Rather,
minor fault and kern alloc overhead could be severe, so I ran a micro benchmark
to measure minor page fault overhead.

The test scenario creates 40 threads, and each of them does minor
page faults over a 25M range (ranges do not overlap).
I didn't see any noticeable regression.

Base:
fault/wsec avg: 758489.8288

minor faults=13123118, major faults=0 ctx switch=139234
    User   System     Wall        fault/wsec
  39.55s   41.73s   17.49s        749995.768
minor faults=13123135, major faults=0 ctx switch=139627
    User   System     Wall        fault/wsec
  34.59s   41.61s   16.95s        773906.976
minor faults=13123061, major faults=0 ctx switch=139254
    User   System     Wall        fault/wsec
  39.03s   41.55s   16.97s        772966.334
minor faults=13123131, major faults=0 ctx switch=139970
    User   System     Wall        fault/wsec
  36.71s   42.12s   17.04s        769941.019
minor faults=13123027, major faults=0 ctx switch=138524
    User   System     Wall        fault/wsec
  42.08s   42.24s   18.08s        725639.047

Base + MM event + event trace enable:
fault/wsec avg: 759626.1488

minor faults=13123488, major faults=0 ctx switch=140303
    User   System     Wall        fault/wsec
  37.66s   42.21s   17.48s        750414.257
minor faults=13123066, major faults=0 ctx switch=138119
    User   System     Wall        fault/wsec
  36.77s   42.14s   17.49s        750010.107
minor faults=13123505, major faults=0 ctx switch=140021
    User   System     Wall        fault/wsec
  38.51s   42.50s   17.54s        748022.219
minor faults=13123431, major faults=0 ctx switch=138517
    User   System     Wall        fault/wsec
  36.74s   41.49s   17.03s        770255.610
minor faults=13122955, major faults=0 ctx switch=137174
    User   System     Wall        fault/wsec
  40.68s   40.97s   16.83s        779428.551
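
The two averages can be re-derived from the five fault/wsec values of each run. The sketch below only repeats that arithmetic; the helper names are made up for the example. With the numbers above, the MM event average comes out roughly 0.15% higher than the base, which is well inside the run-to-run spread.

```c
#include <stddef.h>

/* fault/wsec of the five runs listed above */
static const double base_runs[5] = {
	749995.768, 773906.976, 772966.334, 769941.019, 725639.047,
};
static const double mm_event_runs[5] = {
	750414.257, 750010.107, 748022.219, 770255.610, 779428.551,
};

/* Arithmetic mean of n samples. */
static inline double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of "patched" against "base", in percent. */
static inline double relative_change_pct(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```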

Bug: 80168800
Bug: 116825053
Bug: 153442668
Test: boot
Change-Id: I4e69c994f47402766481c58ab5ec2071180964b8
Signed-off-by: Minchan Kim <minchan@google.com>
(cherry picked from commit 04ff5ec537a5f9f546dcb32257d8fbc1f4d9ca2d)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:21:52 +08:00
Greg Kroah-Hartman
05951af8e6 UPSTREAM: bpf: Explicitly memset some bpf info structures declared on the stack
Trying to initialize a structure with "= {};" will not always clean out
all padding locations in a structure. So be explicit and call memset to
initialize everything for a number of bpf information structures that
are then copied from userspace, sometimes from smaller memory locations
than the size of the structure.
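
As an illustration of the pattern (using a made-up structure, not one of the bpf structures touched by this patch), memset() clears every byte of the object, padding included, which a "= {};" initializer is not guaranteed to do:

```c
#include <stddef.h>
#include <string.h>

/* Made-up structure; most ABIs insert padding after "tag". */
struct info_example {
	char tag;
	unsigned long long value;
};

/* Clear the whole object, including any padding bytes. */
static inline void info_example_init(struct info_example *info)
{
	memset(info, 0, sizeof(*info));
}

/* Return 1 if all n bytes at p are zero, 0 otherwise. */
static inline int all_bytes_zero(const void *p, size_t n)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < n; i++)
		if (b[i] != 0)
			return 0;
	return 1;
}
```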

Bug: 153418162
Test: run vts VtsKernelNetTest pass (b/153418162#comment5)

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com
(cherry picked from commit 269efb7fc478563a7e7b22590d8076823f4ac82a)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I52a2cab20aa310085ec104bd811ac4f2b83657b6
Signed-off-by: Mars Lin <marslin@google.com>
2020-05-30 02:21:36 +08:00
Greg Kroah-Hartman
cb05993e6a UPSTREAM: bpf: Explicitly memset the bpf_attr structure
For the bpf syscall, we are relying on the compiler to properly zero out
the bpf_attr union that we copy userspace data into. Unfortunately that
doesn't always work properly, padding and other oddities might not be
correctly zeroed, and in some tests odd things have been found when the
stack is pre-initialized to other values.

Fix this by explicitly memsetting the structure to 0 before using it.

Bug: 153418162
Test: run vts VtsKernelNetTest pass (b/153418162#comment5)

Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Alexander Potapenko <glider@google.com>
Reported-by: Alistair Delva <adelva@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://android-review.googlesource.com/c/kernel/common/+/1235490
Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
(cherry picked from commit 8096f229421f7b22433775e928d506f0342e5907)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2dc28cd45024da5cc6861ff4a9b25fae389cc6d8
Signed-off-by: Mars Lin <marslin@google.com>
2020-05-30 02:21:36 +08:00
Saravana Kannan
ae46a7d4e7 GKI: sched: Add back the root_domain.overutilized field
This field is necessary to maintain ABI compatibility with ACK. Add it
back, but leave it unused.

Bug: 153905799
Change-Id: Ic9ef5640fa77c3aada023843658e7e4de3bada82
Signed-off-by: Saravana Kannan <saravanak@google.com>
2020-05-30 02:21:26 +08:00
Saravana Kannan
e8a84bbd89 GKI: sched: Compile out push_task field in struct rq
The push_task field is a WALT related field that shouldn't be needed
since we run PELT. So conditionally compile in the field only when WALT
is enabled. Also add #ifdefs around all the uses of this field.

Bug: 153905799
Change-Id: I12edd3f2180ebab14719ba2548e83519beffacc2
Signed-off-by: Saravana Kannan <saravanak@google.com>
2020-05-30 02:21:26 +08:00
Martin Liu
2f52110115 Revert "mm: reclaim small amounts of memory when an external fragmentation event occurs"
This reverts commit 68809fdd57.
also fix BB from 4165090057

Reason for revert: roll back to stable kernel
Bug: 140544941
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I61b51972eab01f328ce375111a3bd04670de670b
2020-05-30 02:21:22 +08:00
Martin Liu
afb52653fc Revert "mm: oom-kill: Add lmk_kill possible for ULMK"
This reverts commit aa9e75a9ff.

Reason for revert: remove customized code
Bug: 140544941
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I364b45e09f22c59f82fdd768c0a5ec86d69fee9c
2020-05-30 02:20:58 +08:00
Martin Liu
ff1201cd90 Revert "mm: introduce INIT_VMA()"
This reverts commit ead04c98fd.

Reason for revert: remove SPF non upstream code
Bug: 140544941
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I18ebe5d38d1ffb7a5a599b5be93eab71e0a5804f
2020-05-30 02:20:38 +08:00
Martin Liu
3e5f49e2cc Revert "mm: protect mm_rb tree with a rwlock"
This reverts commit 3f31f748a8.

Reason for revert: remove SPF non upstream code
Bug: 140544941
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: Ic6efff6d069d20badad9af11beec0dbe36c659f5
2020-05-30 02:20:37 +08:00
Martin Liu
18a8850202 Revert "mm: protect against PTE changes done by dup_mmap()"
This reverts commit 0c8a35f8dd.

Reason for revert: remove SPF non upstream code
Bug: 140544941
Test: boot
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I912a8891ac6cf3e72c7b7aa27df2922554b31491
2020-05-30 02:20:30 +08:00
Martin Liu
ef36c60fe7 mm: Revert previous mm revert list
This commit reverts e799c1b10c54...cfb042c6c5d1

Reason for revert: unblock GKI
Bug: 140544941
Test: boot
Change-Id: I4ebe6c01918788cdc2468ceabf101ef7c3e3c452
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:20:27 +08:00
Minchan Kim
355b8cff31 Revert "mm: introduce INIT_VMA()"
This reverts commit ead04c98fd.

Reason for revert: revert customized code
Bug: 140544941
Test: boot
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I05aace3dfeb65fdb47f650e5b93dccc72f2edee3
2020-05-30 02:20:20 +08:00
Martin Liu
029f472f91 Revert "mm: protect mm_rb tree with a rwlock"
This reverts commit 3f31f748a8.

Reason for revert: revert customized code
Bug: 140544941
Test: boot

Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I1b72d4595a89fc512ff22e49e61e3b8dfa47ede8
2020-05-30 02:20:19 +08:00
Minchan Kim
96f9319be9 Revert "mm: protect against PTE changes done by dup_mmap()"
This reverts commit 0c8a35f8dd.

Reason for revert: revert customized code
Bug: 140544941
Test: boot
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I3cbd23ebf4fb0bd92009d05f772f48d8f46e48f0
2020-05-30 02:20:13 +08:00
Minchan Kim
c48c177651 Revert "mm: reclaim small amounts of memory when an external fragmentation event occurs"
This reverts commit 68809fdd57.
also fix BB from porting 416509005

Reason for revert: revert customized code
Bug: 140544941
Test: boot
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I65735f27f6a44a112957bcec07e2f63f2d8ccff6
2020-05-30 02:20:04 +08:00
Minchan Kim
4a7f0b329a Revert "mm: oom-kill: Add lmk_kill possible for ULMK"
This reverts commit aa9e75a9ff.

Reason for revert: revert customized code
Bug: 140544941
Test: boot
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: I1475e099c72dcdd33fc2497dd30aa51d07bfa73d
2020-05-30 02:19:43 +08:00
Eric Biggers
f6ab3f8004 FROMLIST: kmod: make request_module() return an error when autoloading is disabled
It's long been possible to disable kernel module autoloading completely
(while still allowing manual module insertion) by setting
/proc/sys/kernel/modprobe to the empty string.  This can be preferable
to setting it to a nonexistent file since it avoids the overhead of an
attempted execve(), avoids potential deadlocks, and avoids the call to
security_kernel_module_request() and thus on SELinux-based systems
eliminates the need to write SELinux rules to dontaudit module_request.

However, when module autoloading is disabled in this way,
request_module() returns 0.  This is broken because callers expect 0 to
mean that the module was successfully loaded.

Apparently this was never noticed because this method of disabling
module autoloading isn't used much, and also most callers don't use the
return value of request_module() since it's always necessary to check
whether the module registered its functionality or not anyway.  But
improperly returning 0 can indeed confuse a few callers, for example
get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:

	if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
		fs = __get_fs_type(name, len);
		WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
	}

This is easily reproduced with:

	echo > /proc/sys/kernel/modprobe
	mount -t NONEXISTENT none /

It causes:

	request_module fs-NONEXISTENT succeeded, but still no fs?
	WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
	[...]

This should actually use pr_warn_once() rather than WARN_ONCE(), since
it's also user-reachable if userspace immediately unloads the module.
Regardless, request_module() should correctly return an error when it
fails.  So let's make it return -ENOENT, which matches the error when
the modprobe binary doesn't exist.
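
The behavior change can be modeled in user-space C as below. This is a simplified stand-in, not the kernel function: the name request_module_model() is hypothetical, and the "return 0" branch merely stands in for actually running the modprobe helper.

```c
#include <errno.h>

/*
 * Simplified model of the check: an empty modprobe path means
 * autoloading is disabled, so report an error instead of success.
 */
static inline int request_module_model(const char *modprobe_path)
{
	if (modprobe_path[0] == '\0')
		return -ENOENT;

	/* Stand-in for invoking the helper and returning its result. */
	return 0;
}
```

With this, a caller such as the get_fs_type() snippet above no longer takes the "== 0" branch when autoloading is disabled.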

I've also sent patches to document and test this case.

Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: NeilBrown <neilb@suse.com>
Link: https://lore.kernel.org/r/20200318230515.171692-2-ebiggers@kernel.org
Bug: 151690015
Change-Id: I5e04f85e12a4f85da23e53bc11da1ade565abcd6
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-30 02:19:27 +08:00
Suren Baghdasaryan
5b3cf9b841 UPSTREAM: sched/psi: Fix OOB write when writing 0 bytes to PSI files
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.
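
A user-space sketch of the pattern, with memcpy() standing in for copy_from_user(). The function name and the -EINVAL return value are assumptions for the example. Without the early check, nbytes == 0 would make buf_size - 1 wrap around and the NUL store would land far outside buf.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most buf_cap bytes from src and NUL-terminate the result.
 * buf_cap must be non-zero. Returns nbytes on success.
 */
static inline long copy_and_terminate(char *buf, size_t buf_cap,
				      const char *src, size_t nbytes)
{
	size_t buf_size;

	/* Reject empty writes before indexing with buf_size - 1. */
	if (nbytes == 0)
		return -EINVAL;

	buf_size = nbytes < buf_cap ? nbytes : buf_cap;
	memcpy(buf, src, buf_size);
	buf[buf_size - 1] = '\0';
	return (long)nbytes;
}
```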

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com

Bug: 152499875
Test: lmkd_unit_test
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I9ec7acfc6e1083c677a95b0ea1c559ab50152873
(cherry picked from commit 67e4408599)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:24 +08:00
Johannes Weiner
48711f4d27 UPSTREAM: psi: Fix a division error in psi poll()
The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.
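
The truncation can be reproduced in user-space C. The two helpers below only mimic the parameter types of div_u64() and div64_u64(); they are sketches, not the kernel implementations. A 10 second window is 10000000000 ns, which becomes 1410065408 once cut down to 32 bits, and a window that is an exact multiple of 2^32 ns would become 0.

```c
#include <stdint.h>

/* Mimics a helper that only accepts a 32-bit divisor. */
static inline uint64_t div_u64_model(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Mimics a helper that accepts a full 64-bit divisor. */
static inline uint64_t div64_u64_model(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}

/* What a 64-bit window size turns into when passed as a 32-bit divisor. */
static inline uint32_t truncated_divisor(uint64_t win)
{
	return (uint32_t)win;
}
```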

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Signed-off-by: Sasha Levin <sashal@kernel.org>

Bug: 152499875
Test: lmkd_unit_test
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I49fdfd55751d1a2cde19666624c9c5d76dc78dad
(cherry picked from commit cf46cf40bc)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:23 +08:00
Johannes Weiner
7064fd39b3 UPSTREAM: sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.
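
A minimal user-space sketch of the effect described above, assuming a
simplified model of the per-group state: struct group_model,
group_init_model() and the set_last_update switch are hypothetical names
used only for illustration, and the 2s interval is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLE_INTERVAL_NS 2000000000ULL	/* "around 2s" per the text */

/* Hypothetical stand-in for the per-group sampling state. */
struct group_model {
	uint64_t last_update;	/* ns timestamp of the previous sample */
	uint64_t next_update;	/* ns timestamp of the next scheduled sample */
};

/*
 * Model group creation at clock value 'now'. With set_last_update == 0 the
 * timestamp stays at its default of 0 (the bug); otherwise it is set to the
 * creation time (the fix).
 */
static void group_init_model(struct group_model *g, uint64_t now,
			     int set_last_update)
{
	g->last_update = set_last_update ? now : 0;
	g->next_update = now + SAMPLE_INTERVAL_NS;
}

/* Length of the first sampling period as the sampling code would see it. */
static uint64_t first_period(const struct group_model *g)
{
	assert(g->next_update >= g->last_update);
	return g->next_update - g->last_update;
}

/* Whether a period survives being passed as a 32-bit divisor. */
static int fits_in_u32(uint64_t v)
{
	return v <= UINT32_MAX;
}
```

In this model the unfixed path produces a first period roughly equal to the
clock value itself, which no longer fits into 32 bits once the clock has
advanced past about 4.3 seconds; the fixed path produces exactly the
sampling interval.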

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
Signed-off-by: Sasha Levin <sashal@kernel.org>

Bug: 152499875
Test: lmkd_unit_test
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iaada5c2f1a03cf38cbb053adde478f762ce40843
(cherry picked from commit 55013802e8)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:23 +08:00
Miles Chen
3b96c1807d UPSTREAM: sched/psi: Correct overly pessimistic size calculation
When a string of 32 bytes or more is passed to psi_write(),
psi_write() copies 31 bytes to its buf and overwrites buf[30]
with '\0', which makes the input string 1 byte shorter than
it should be.

Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf).

This does not cause problems in normal use cases such as
"some 500000 10000000" or "full 500000 10000000" because they
are less than 32 bytes in length.

	/* assuming nbytes == 35 */
	char buf[32];

	buf_size = min(nbytes, (sizeof(buf) - 1)); /* buf_size = 31 */
	if (copy_from_user(buf, user_buf, buf_size))
		return -EFAULT;

	buf[buf_size - 1] = '\0'; /* buf[30] = '\0' */

Before:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [   22.473497] nbytes=35,buf_size=31
 [   22.473775] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [   64.916162] nbytes=32,buf_size=31
 [   64.916331] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

After:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [  254.837863] nbytes=35,buf_size=32
 [  254.838541] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [ 9965.714935] nbytes=32,buf_size=32
 [ 9965.715096] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

Also remove the superfluous parentheses.
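
The before/after behaviour can be modelled in user space. This is a sketch
under stated assumptions, not the kernel function: copy_trigger() and its
'fixed' switch are hypothetical, memcpy() stands in for copy_from_user(),
and PSI_BUF_LEN mirrors the 32-byte buf from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PSI_BUF_LEN 32	/* same size as buf[] in the snippet above */

/*
 * Copy at most 'limit' bytes of user input and NUL-terminate in place.
 * fixed == 0 models the old limit, sizeof(buf) - 1;
 * fixed == 1 models the new limit, sizeof(buf).
 * Returns the length of the string that remains in buf.
 */
static size_t copy_trigger(char buf[PSI_BUF_LEN], const char *user_buf,
			   size_t nbytes, int fixed)
{
	size_t limit = fixed ? PSI_BUF_LEN : PSI_BUF_LEN - 1;
	size_t buf_size = nbytes < limit ? nbytes : limit;

	assert(buf_size > 0);		/* buf[buf_size - 1] needs buf_size >= 1 */
	memcpy(buf, user_buf, buf_size);
	buf[buf_size - 1] = '\0';	/* overwrites the last copied byte */
	return strlen(buf);
}
```

For the 35-byte input used in the transcripts, the old limit keeps 30
characters and the new limit keeps 31, matching the "print 30 chars" and
"print 31 chars" lines. Inputs shorter than 32 bytes behave the same under
both limits.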

Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Bug: 152499875
Test: lmkd_unit_test
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I9371b4d5e465bb8b84ff7adf5f40f30696c6ff70
(cherry picked from commit 88a47f1659)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:22 +08:00
Jason Xing
4694dbe19e UPSTREAM: psi: get poll_work to run when calling poll syscall next time
Only the first call to the poll syscall lets the user receive POLLPRI
correctly.  After that, the user always fails to acquire the event
signal.

Reproduce case:
 1. Get the monitor code in Documentation/accounting/psi.txt
 2. Run it, and wait for the event to trigger.
 3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0).  And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker.  The
cancel needs to pair with resetting the poll_scheduled flag.

Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 152499875
Test: lmkd_unit_test
Change-Id: Ieaa8284ef632ef06318a92d792b239d344bb29d1
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(cherry picked from commit e71f9c35ee)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:22 +08:00
Peter Zijlstra
7cec2c6125 UPSTREAM: sched/psi: Reduce psimon FIFO priority
PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.

FIFO-99 is the very highest priority available to SCHED_FIFO and
it is not a suitable default; it would indicate the psi work is the
most important work on the machine.

Since Real-Time tasks will have pre-allocated memory and locked it in
place, Real-Time tasks do not care about PSI. All psimon needs is to
be above SCHED_OTHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>

Bug: 152499875
Test: lmkd_unit_test
Change-Id: I52964915467577bfc3543700aec9b463f6f0ffe1
(cherry picked from commit 2a220bc9f2)
Signed-off-by: Martin Liu <liumartin@google.com>
2020-05-30 02:19:22 +08:00
Aneesh Kumar K.V
f98bddaf3c BACKPORT: GKI: mm/memunmap: don't access uninitialized memmap in memunmap_pages()
Patch series "mm/memory_hotplug: Shrink zones before removing memory",
v6.

This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory.  Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).

We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
of code to a minimum.  Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.

--------------------------------------------------------------------------

Zones are now properly shrunk when offlining memory blocks or when
onlining fails.  This allows zones to be shrunk properly on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.

Example:

  :/# cat /proc/zoneinfo
  Node 1, zone  Movable
          spanned  0
          present  0
          managed  0
  :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
  :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
  :/# cat /proc/zoneinfo
  Node 1, zone  Movable
          spanned  98304
          present  65536
          managed  65536
  :/# echo 0 > /sys/devices/system/memory/memory43/online
  :/# cat /proc/zoneinfo
  Node 1, zone  Movable
          spanned  32768
          present  32768
          managed  32768
  :/# echo 0 > /sys/devices/system/memory/memory41/online
  :/# cat /proc/zoneinfo
  Node 1, zone  Movable
          spanned  0
          present  0
          managed  0

This patch (of 10):

With an altmap, the memmaps falling into the reserved altmap space are not
initialized and, therefore, contain a garbage NID and a garbage zone.
Make sure to read the NID/zone from a memmap that was initialized.

This fixes a kernel crash that is observed when destroying a namespace:

  kernel BUG at include/linux/mm.h:1107!
  cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
      pc: c0000000004b9728: memunmap_pages+0x238/0x340
      lr: c0000000004b9724: memunmap_pages+0x234/0x340
  ...
      pid   = 3669, comm = ndctl
  kernel BUG at include/linux/mm.h:1107!
    devm_action_release+0x30/0x50
    release_nodes+0x268/0x2d0
    device_release_driver_internal+0x174/0x240
    unbind_store+0x13c/0x190
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x70/0xa0
    kernfs_fop_write+0x1ac/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xe4/0x200
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

The "page_zone(pfn_to_page(pfn))" was introduced by 69324b8f4833 ("mm,
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
think we will never have driver reserved memory with
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).

[david@redhat.com: minimize code changes, rephrase description]
Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Damian Tometzki <damian.tometzki@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 77e080e7680e1e615587352f70c87b9e98126d03)
Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 150378964
Change-Id: Ib3b09ddcbc8a42df0d596e6549fd3a40e6b998b1
2020-05-30 02:19:10 +08:00
Oscar Salvador
f88adb8eab BACKPORT: GKI: mm, memory_hotplug: add nid parameter to arch_remove_memory
-- snip --

Missing unification of mm/hmm.c and kernel/memremap.c

-- snip --

Patch series "Do not touch pages in hot-remove path", v2.

This patchset aims for two things:

 1) A better definition of the offline and hot-remove stages
 2) Solving bugs where we can access non-initialized pages
    during hot-remove operations [2] [3].

This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.

[1] https://patchwork.kernel.org/cover/10691415/
[2] https://patchwork.kernel.org/patch/10547445/
[3] https://www.spinics.net/lists/linux-mm/msg161316.html

This patch (of 5):

This is a preparation for the follow-up patches.  The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.

Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2c2a5af6fed20cf74401c9d64319c76c5ff81309)
Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 150378964
Change-Id: Ie66e53db21682a60d6eb8b269b6e0980a736c573
2020-05-30 02:19:09 +08:00
Wei Wang
9923768ca5 sched: restrict iowait boost to tasks with prefer_idle
Currently iowait boost doesn't distinguish background tasks from
foreground tasks, and we have seen cases where a device runs at a high
frequency unnecessarily while doing background I/O. This patch limits
iowait boost to tasks with prefer_idle only. Specifically, on Pixel,
those are foreground and top app tasks.

Bug: 130308826
Bug: 144961757
Test: Boot and trace
Change-Id: I2d892beeb4b12b7e8f0fb2848c23982148648a10
Signed-off-by: Wei Wang <wvw@google.com>
2020-05-30 02:18:46 +08:00
Peter Zijlstra
6b46ef99ae futex: Fix inode life-time issue
commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream.

As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Bug: 152809067
Test: compile, boot bramble, verify list of probed devices
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e6d506cd22)
Signed-off-by: Will McVicker <willmcvicker@google.com>
Change-Id: I237f35a875c07e5bda19df4fdf3318fae16e79d4
2020-05-30 02:18:43 +08:00
Robin Peng
58a95695aa Merge LA.UM.9.12.R1.10.00.00.597.032 via branch 'qcom-msm-4.19-7250' into android-msm-pixel-4.19
Conflicts:
	arch/arm64/configs/vendor/kona_defconfig
	arch/arm64/configs/vendor/lito_defconfig
	arch/arm64/include/asm/traps.h
	drivers/power/supply/qcom/qpnp-smb5.c
	kernel/sched/sched.h

Bug: 151568484
Change-Id: I6ed9ae8bc29d93e42b8527ae25074db334c640da
Signed-off-by: Robin Peng <robinpeng@google.com>
2020-05-30 02:18:38 +08:00
Will McVicker
66a2602997 GKI: trace: ipc_logging: modularize IPC logging
This change:
* adds exports and converts ifdef -> if IS_ENABLED(...)
* sets CONFIG_IPC_LOGGING as tristate
* properly handles ipc_logging_debug.c for debugfs
* stubs out ipc_log_* calls for built-in drivers

Signed-off-by: Will McVicker <willmcvicker@google.com>
Bug: 150231337
Test: compile and boot bramble
Change-Id: I0379cadbaf1dee5d358144b1b4f2dc374635021d
2020-05-30 02:16:53 +08:00
Tri Vo
cc81490830 BACKPORT: PM / wakeup: Show wakeup sources stats in sysfs
Add an ID and a device pointer to 'struct wakeup_source'. Use them to
expose wakeup sources statistics in sysfs under
/sys/class/wakeup/wakeup<ID>/*.

Bug: 129087298
Bug: 151789966

Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit c8377adfa78103be5380200eb9dab764d7ca890e)
Signed-off-by: Tri Vo <trong@google.com>
(cherry picked from commit 2c9f5fa9c3)
[sspatil: fix conflict in fs/eventpoll.c]
[sspatil: fix all in-tree usage of wakeup_source_register]
Signed-off-by: Sandeep Patil <sspatil@google.com>
Change-Id: Ie12200c8d439b08410961415d5899a390b82f5b0
2020-05-30 02:16:43 +08:00
Tri Vo
02ba99da90 UPSTREAM: PM / wakeup: Use wakeup_source_register() in wakelock.c
kernel/power/wakelock.c duplicates wakeup source creation and
registration code from drivers/base/power/wakeup.c.

Change struct wakelock's wakeup source to a pointer and use
wakeup_source_register() function to create and register said wakeup
source. Use wakeup_source_unregister() on cleanup path.

Signed-off-by: Tri Vo <trong@android.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 2434aea58e652a9fe114181ac90aa60e2f2e1b25)
Bug: 129087298
Signed-off-by: Tri Vo <trong@google.com>
Change-Id: I4e6b3c613c561fb382f17c3c31b6584aebabfb5d
(cherry picked from commit 5bc2bdfb22)
Signed-off-by: Sandeep Patil <sspatil@google.com>
2020-05-30 02:16:39 +08:00
Suren Baghdasaryan
26ca615b1a ANDROID: replace NR_INDIRECTLY_RECLAIMABLE_BYTES with NR_KERNEL_MISC_RECLAIMABLE
Use NR_KERNEL_MISC_RECLAIMABLE instead of
NR_INDIRECTLY_RECLAIMABLE_BYTES for kgsl allocations and in sysstats.

Bug: 150808082
Test: build
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ice5167bd9b380bb4c4b4d810aa685d211bcf2f80
2020-05-30 02:16:20 +08:00
Minchan Kim
81a5020848 attribute page lock and waitqueue functions as sched
trace_sched_blocked_trace in CFS is really useful for debugging via
trace because it tells where on the callstack the process was stuck.

For example,
           <...>-6143  ( 6136) [005] d..2    50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
           <...>-6136  ( 6136) [005] d..2    50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
           <...>-6142  ( 6136) [006] d..2    50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
           <...>-6144  ( 6136) [006] d..2    50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104

However, sometimes it gives pointless information like this.
    RenderThread-2322  ( 1805) [006] d.s3    50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
     logd.writer-594   (  587) [002] d.s3    50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
  kworker/u16:13-333   (  333) [007] d.s4    50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220

Callers such as wait_on_page_bit and __lock_page_killable are pointless
because they carry no higher-level information to identify the callstack.

The reason is that page lock and waitqueue are special synchronization
methods, unlike other normal locks (mutex, spinlock).
Let's mark them as "__sched" so that get_wchan, which is used in
trace_sched_blocked_trace, can detect and skip them. This produces a more
meaningful callstack function like this.

           <...>-2867  ( 1068) [002] d.h4   124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
           <...>-2867  ( 1068) [002] d.s3   124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
           <...>-2867  ( 1068) [002] d.s4   124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
 ScreenDecoratio-2364  ( 1867) [002] d.s3   124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
 ScreenDecoratio-2364  ( 1867) [002] d.s4   124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
           <...>-329   (  329) [000] d..3   124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
  kworker/u16:13-538   (  538) [007] d..3   124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470

Bug: 144961676
Bug: 144713689
Change-Id: I30397400c5d056946bdfbc86c9ef5f4d7e6c98fe
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
2020-05-30 02:16:09 +08:00
Wei Wang
fbabe265d9 trace: sched: add capacity change tracing
Add a new tracepoint, sched_capacity_update, that fires when the capacity
value is updated.

Bug: 144177658
Bug: 144961676
Test: Boot and grab trace to check
Change-Id: I30ee55bfcc2fb5a92dd448ad364768ee428f3cc4
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
2020-05-30 02:16:09 +08:00
Wei Wang
6a370c94e7 kernel: sched: fix cpu cpu_capacity_orig being capped incorrectly
update_cpu_capacity updates cpu_capacity_orig capped with thermal_cap.
In the non-WALT case, thermal_cap is the previous cpu_capacity_orig,
which caused cpu_capacity_orig to be capped incorrectly.
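
One way to read the problem is as a ratchet: if the cap is the previous
value of the field being updated, the field can go down but never back up.
The sketch below only illustrates that reading and is an assumption, not
the scheduler code; update_capped_by_previous() and its parameters are
hypothetical names.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the buggy update: the new "orig" capacity is the
 * freshly reported capacity capped by the previous "orig" capacity.
 */
static unsigned long update_capped_by_previous(unsigned long prev_orig,
					       unsigned long reported_capacity)
{
	assert(reported_capacity > 0);
	return min_ul(reported_capacity, prev_orig);
}
```

In this model, starting from 1024, one update with a reported capacity of
512 lowers the value to 512, and a later update with 1024 leaves it at 512.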

Test: Build
Bug: 144143594
Bug: 144961676
Change-Id: I1ff9d9c87554c2d2395d46b215276b7ab50585c0
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
2020-05-30 02:16:09 +08:00