kernel_xiaomi_cepheus

Evolution-X-Devices/kernel_xiaomi_cepheus

Author	SHA1	Message	Date
balgxmr	434f599332	Merge branch 'upstream-linux-4.14.y' of https://android.googlesource.com/kernel/common into rebase	2023-08-09 16:44:33 -05:00
Tejun Heo	b77600e26f	memcg: fix possible use-after-free in memcg_write_event_control() commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream. memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to `347c4a8747` ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: `347c4a8747` ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-12-14 11:26:13 +01:00
Tejun Heo	666d49b22b	UPSTREAM: cgroup: add cgroup_parse_float() cgroup already uses floating point for percent[ile] numbers and there are several controllers which want to take them as input. Add a generic parse helper to handle inputs. Update the interface convention documentation about the use of percentage numbers. While at it, also clarify the default time unit. Bug: 120440300 Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit a5e112e6424adb77d953eac20e6936b952fd6b32) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Change-Id: Ic1fcf21d7955eb8edd2e8e91517bca6aef41694f Signed-off-by: Quentin Perret <qperret@google.com>	2022-09-16 20:05:51 +03:00
Yu Zhao	54d9c5f519	FROMLIST: mm: multi-gen LRU: kill switch Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_has_hw_pte_young() returns true 0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the components above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously. Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD. [1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/ Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03	2022-06-27 14:32:38 +00:00
kondors1995	0cca5fb93e	Revert "Merge branch 'dev/mglru' into dev/12-2" This reverts commit `e3d1ce4d09`, reversing changes made to `d3cf1f4d72`.	2022-05-17 08:16:01 +00:00
Yu Zhao	db2f982ebb	FROMLIST: mm: multi-gen LRU: kill switch Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_has_hw_pte_young() returns true 0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the components above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously. Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD. [1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/ Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03	2022-05-03 11:10:21 +00:00
kondors1995	9fc8417d64	Merge branch 'android-4.14-stable' of github.com:aosp-mirror/kernel_common into 11.0	2021-10-25 16:54:02 +00:00
Suren Baghdasaryan	d6e8ea9269	BACKPORT: cgroup: make per-cgroup pressure stall tracking configurable PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This causes additional overhead with psi_avgs_work being called for each cgroup in the hierarchy. psi_avgs_work has been highly optimized, however on systems with large number of cgroups the overhead becomes noticeable. Systems which use PSI only at the system level could avoid this overhead if PSI can be configured to skip per-cgroup stall accounting. Add "cgroup_disable=pressure" kernel command-line option to allow requesting system-wide only pressure stall accounting. When set, it keeps system-wide accounting under /proc/pressure/ but skips accounting for individual cgroups and does not expose PSI nodes in cgroup hierarchy. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/patchwork/patch/1435705 (cherry picked from commit 3958e2d0c34e18c41b60dc01832bd670a59ef70f https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj) Conflicts: include/linux/cgroup-defs.h kernel/cgroup/cgroup.c 1. Trivial merge conflict in cgroup-defs.h due to missing CFTYPE_DEBUG 2. Changed flags to (CFTYPE_NOT_ON_ROOT \| CFTYPE_PRESSURE) in cgroup.c because in 4.14 psi files were allowed only in non-root cgroups. 3. Dropped changes in show_delegatable_files() because it's missing in 4.14. Bug: 178872719 Bug: 191734423 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360	2021-10-18 10:50:18 -07:00
Roman Gushchin	dfba8de103	BACKPORT: cgroup: cgroup v2 freezer Cgroup v1 implements the freezer controller, which provides an ability to stop the workload in a cgroup and temporarily free up some resources (cpu, io, network bandwidth and, potentially, memory) for some other tasks. Cgroup v2 lacks this functionality. This patch implements freezer for cgroup v2. Cgroup v2 freezer tries to put tasks into a state similar to jobctl stop. This means that tasks can be killed, ptraced (using PTRACE_SEIZE), and interrupted. It is possible to attach to a frozen task, get some information (e.g. read registers) and detach. It's also possible to migrate a frozen tasks to another cgroup. This differs cgroup v2 freezer from cgroup v1 freezer, which mostly tried to imitate the system-wide freezer. However uninterruptible sleep is fine when all tasks are going to be frozen (hibernation case), it's not the acceptable state for some subset of the system. Cgroup v2 freezer is not supporting freezing kthreads. If a non-root cgroup contains kthread, the cgroup still can be frozen, but the kthread will remain running, the cgroup will be shown as non-frozen, and the notification will not be delivered. PTRACE_ATTACH is not working because non-fatal signal delivery is blocked in frozen state. There are some interface differences between cgroup v1 and cgroup v2 freezer too, which are required to conform the cgroup v2 interface design principles: 1) There is no separate controller, which has to be turned on: the functionality is always available and is represented by cgroup.freeze and cgroup.events cgroup control files. 2) The desired state is defined by the cgroup.freeze control file. Any hierarchical configuration is allowed. 3) The interface is asynchronous. The actual state is available using cgroup.events control file ("frozen" field). There are no dedicated transitional states. 4) It's allowed to make any changes with the cgroup hierarchy (create new cgroups, remove old cgroups, move tasks between cgroups) no matter if some cgroups are frozen. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com> Cc: kernel-team@fb.com Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa (cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873) Bug: 154548692 Signed-off-by: Marco Ballesio <balejs@google.com> (cherry picked from commit 666d8913b8f1adef750ae86d9acb74c9cb84c4ef) Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>	2021-08-30 10:36:53 +00:00
Andrey Ignatov	921a3bda2c	BACKPORT: cgroup: Simplify cgroup_ancestor Simplify cgroup_ancestor function. This is follow-up for commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Change-Id: I9e96704713f34fbc68e92b9f91c01b593708220f Bug: 154548692 Signed-off-by: Marco Ballesio <balejs@google.com> (cherry picked from commit 808c43b7c7f70360ed7b9e43e2cf980f388e71fa) This cherry pick differs from the original in that cgroup_ancestor is added in place of being just modified. The patch originally introducing the function was 7723628101aae (bpf: Introduce bpf_skb_ancestor_cgroup_id helper) which also relied on bpf dependencies not present in android-4.14. cgroup_ancestor is independent from the bpf_skb code and can hence be taken alone (cherry picked from commit 22fe07d3a8cc54d4ade52a46776afbb9fbd13eee) Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>	2021-08-30 10:36:49 +00:00
Greg Kroah-Hartman	4437a4dfa7	Merge 4.14.189 into android-4.14-stable Changes in 4.14.189 KVM: s390: reduce number of IO pins to 1 spi: spi-fsl-dspi: Adding shutdown hook spi: spi-fsl-dspi: Fix lockup if device is removed during SPI transfer spi: spi-fsl-dspi: use IRQF_SHARED mode to request IRQ spi: spi-fsl-dspi: Fix external abort on interrupt in resume or exit paths ARM: dts: omap4-droid4: Fix spi configuration and increase rate gpu: host1x: Detach driver on unregister spi: spidev: fix a race between spidev_release and spidev_remove spi: spidev: fix a potential use-after-free in spidev_release() ixgbe: protect ring accesses with READ- and WRITE_ONCE s390/kasan: fix early pgm check handler execution cifs: update ctime and mtime during truncate ARM: imx6: add missing put_device() call in imx6q_suspend_init() scsi: mptscsih: Fix read sense data size nvme-rdma: assign completion vector correctly x86/entry: Increase entry_stack size to a full page net: cxgb4: fix return error value in t4_prep_fw smsc95xx: check return value of smsc95xx_reset smsc95xx: avoid memory leak in smsc95xx_bind ALSA: compress: fix partial_drain completion state arm64: kgdb: Fix single-step exception handling oops nbd: Fix memory leak in nbd_add_socket bnxt_en: fix NULL dereference in case SR-IOV configuration fails net: macb: mark device wake capable when "magic-packet" property present mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() ALSA: opl3: fix infoleak in opl3 ALSA: hda - let hs_mic be picked ahead of hp_mic ALSA: usb-audio: add quirk for MacroSilicon MS2109 KVM: arm64: Fix definition of PAGE_HYP_DEVICE KVM: arm64: Stop clobbering x0 for HVC_SOFT_RESTART KVM: x86: bit 8 of non-leaf PDPEs is not reserved KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit mode KVM: x86: Mark CR4.TSD as being possibly owned by the guest Revert "ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb" btrfs: fix fatal extent_buffer readahead vs releasepage race drm/radeon: fix double free dm: use noio when sending kobject event ARC: entry: fix potential EFA clobber when TIF_SYSCALL_TRACE ARC: elf: use right ELF_ARCH s390/mm: fix huge pte soft dirty copying genetlink: remove genl_bind ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg l2tp: remove skb_dst_set() from l2tp_xmit_skb() llc: make sure applications use ARPHRD_ETHER net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb net: usb: qmi_wwan: add support for Quectel EG95 LTE modem tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key() tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers tcp: md5: allow changing MD5 keys in all socket states net_sched: fix a memory leak in atm_tc_init() tcp: make sure listeners don't initialize congestion-control state tcp: md5: do not send silly options in SYNCOOKIES cgroup: fix cgroup_sk_alloc() for sk_clone_lock() cgroup: Fix sock_cgroup_data on big-endian. drm/exynos: fix ref count leak in mic_pre_enable arm64/alternatives: use subsections for replacement sequences tpm_tis: extra chip->ops check on error path in tpm_tis_core_init gfs2: read-only mounts should grab the sd_freeze_gl glock i2c: eg20t: Load module automatically if ID matches arm64: alternative: Use true and false for boolean values arm64/alternatives: don't patch up internal branches iio:magnetometer:ak8974: Fix alignment and data leak issues iio:humidity:hdc100x Fix alignment and data leak issues iio: magnetometer: ak8974: Fix runtime PM imbalance on error iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe() iio: pressure: zpa2326: handle pm_runtime_get_sync failure iio:pressure:ms5611 Fix buffer element alignment iio:health:afe4403 Fix timestamp alignment and prevent data leak. spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer spi: fix initial SPI_SR value in spi-fsl-dspi net: dsa: bcm_sf2: Fix node reference count of: of_mdio: Correct loop scanning logic Revert "usb/ohci-platform: Fix a warning when hibernating" Revert "usb/ehci-platform: Set PM runtime as active on resume" Revert "usb/xhci-plat: Set PM runtime as active on resume" doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode mmc: sdhci: do not enable card detect interrupt for gpio cd type ACPI: video: Use native backlight on Acer Aspire 5783z ACPI: video: Use native backlight on Acer TravelMate 5735Z iio:health:afe4404 Fix timestamp alignment and prevent data leak. phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked arm64: dts: meson: add missing gxl rng clock spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate usb: gadget: udc: atmel: fix uninitialized read in debug printk staging: comedi: verify array index is correct before using it Revert "thermal: mediatek: fix register index error" ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema copy_xstate_to_kernel: Fix typo which caused GDB regression perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode mtd: rawnand: brcmnand: fix CS0 layout mtd: rawnand: oxnas: Keep track of registered devices mtd: rawnand: oxnas: Unregister all devices on error mtd: rawnand: oxnas: Release all devices in the _remove() path HID: magicmouse: do not set up autorepeat ALSA: line6: Perform sanity check for each URB creation ALSA: usb-audio: Fix race against the error recovery URB submission USB: c67x00: fix use after free in c67x00_giveback_urb usb: dwc2: Fix shutdown callback in platform usb: chipidea: core: add wakeup support for extcon usb: gadget: function: fix missing spinlock in f_uac1_legacy USB: serial: iuu_phoenix: fix memory corruption USB: serial: cypress_m8: enable Simply Automated UPB PIM USB: serial: ch341: add new Product ID for CH340 USB: serial: option: add GosunCn GM500 series USB: serial: option: add Quectel EG95 LTE modem virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()" mei: bus: don't clean driver pointer Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list uio_pdrv_genirq: fix use without device tree and no interrupt timer: Fix wheel index calculation on last level MIPS: Fix build for LTS kernel caused by backporting lpj adjustment hwmon: (emc2103) fix unable to change fan pwm1_enable attribute intel_th: pci: Add Jasper Lake CPU support intel_th: pci: Add Tiger Lake PCH-H support intel_th: pci: Add Emmitsburg PCH support dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler misc: atmel-ssc: lock with mutex instead of spinlock thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power arm64: ptrace: Override SPSR.SS when single-stepping is enabled sched/fair: handle case of task_h_load() returning 0 x86/cpu: Move x86_cache_bits settings libceph: don't omit recovery_deletes in target_copy() rxrpc: Fix trace string Linux 4.14.189 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib5da2b58af11e2738c78990bf691a0211a55a40f	2020-07-24 10:08:39 +02:00
Cong Wang	82fd2138a5	cgroup: fix cgroup_sk_alloc() for sk_clone_lock() [ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ] When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit `d979a39d72` ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit 090e28b229af ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-07-22 09:22:21 +02:00
Michal Koutný	713f26696c	cgroup: Iterate tasks that did not finish do_exit() commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream. PF_EXITING is set earlier than actual removal from css_set when a task is exitting. This can confuse cgroup.procs readers who see no PF_EXITING tasks, however, rmdir is checking against css_set membership so it can transitionally fail with EBUSY. Fix this by listing tasks that weren't unlinked from css_set active lists. It may happen that other users of the task iterator (without CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This is equal to the state before commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") but it may be reviewed later. Reported-by: Suren Baghdasaryan <surenb@google.com> Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-20 10:54:14 +01:00
Michal Koutný	67ea9300c7	UPSTREAM: cgroup: Iterate tasks that did not finish do_exit() PF_EXITING is set earlier than actual removal from css_set when a task is exitting. This can confuse cgroup.procs readers who see no PF_EXITING tasks, however, rmdir is checking against css_set membership so it can transitionally fail with EBUSY. Fix this by listing tasks that weren't unlinked from css_set active lists. It may happen that other users of the task iterator (without CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This is equal to the state before commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") but it may be reviewed later. Reported-by: Suren Baghdasaryan <surenb@google.com> Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") Signed-off-by: Michal Koutný <mkoutny@suse.com> (cherry picked from commit 9c974c77246460fa6a92c18554c3311c8c83c160) Bug: 141213848 Bug: 146758430 Test: test_cgcore_destroy from linux-kselftest Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Iac57661b931129ed1e44b89675f8115bb89084ff (cherry picked from commit 21ee296526c70d6dc3c64639406f156f39b80fd0)	2020-03-14 22:40:51 +00:00
Greg Kroah-Hartman	57ac921eaf	Merge 4.14.138 into android-4.14 Changes in 4.14.138 scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD SOM-LV ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD torpedo tcp: be more careful in tcp_fragment() arm64: cpufeature: Fix feature comparison for CTR_EL0.{CWG,ERG} HID: wacom: fix bit shift for Cintiq Companion 2 HID: Add quirk for HP X1200 PIXART OEM mouse RDMA: Directly cast the sockaddr union to sockaddr IB: directly cast the sockaddr union to aockaddr objtool: Add machine_real_restart() to the noreturn list objtool: Add rewind_stack_do_exit() to the noreturn list atm: iphase: Fix Spectre v1 vulnerability ife: error out when nla attributes are empty ip6_tunnel: fix possible use-after-free on xmit net: bridge: delete local fdb on device init failure net: bridge: mcast: don't delete permanent entries when fast leave is enabled net: fix ifindex collision during namespace removal net/mlx5: Use reversed order when unregister devices net: phylink: Fix flow control for fixed-link net: sched: Fix a possible null-pointer dereference in dequeue_func() NFC: nfcmrvl: fix gpio-handling regression tipc: compat: allow tipc commands without arguments compat_ioctl: pppoe: fix PPPOEIOCSFWD handling net/mlx5e: Prevent encap flow counter update async to user query tun: mark small packets as owned by the tap sock mvpp2: refactor MTU change code bnx2x: Disable multi-cos feature. cgroup: Call cgroup_release() before __exit_signal() cgroup: Implement css_task_iter_skip() cgroup: Include dying leaders with live threads in PROCS iterations cgroup: css_task_iter_skip()'d iterators must be advanced before accessed cgroup: Fix css_task_iter_advance_css_set() cset skip condition spi: bcm2835: Fix 3-wire mode if DMA is enabled Linux 4.14.138 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-08-09 18:07:12 +02:00
Tejun Heo	feb6b123b7	cgroup: Include dying leaders with live threads in PROCS iterations commit c03cd7738a83b13739f00546166969342c8ff014 upstream. CSS_TASK_ITER_PROCS currently iterates live group leaders; however, this means that a process with dying leader and live threads will be skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't, which is confusing to say the least. Fix it by making cset track dying tasks and include dying leaders with live threads in PROCS iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-09 17:53:36 +02:00
Tejun Heo	b0af004fd5	cgroup: Implement css_task_iter_skip() commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream. When a task is moved out of a cset, task iterators pointing to the task are advanced using the normal css_task_iter_advance() call. This is fine but we'll be tracking dying tasks on csets and thus moving tasks from cset->tasks to (to be added) cset->dying_tasks. When we remove a task from cset->tasks, if we advance the iterators, they may move over to the next cset before we had the chance to add the task back on the dying list, which can allow the task to escape iteration. This patch separates out skipping from advancing. Skipping only moves the affected iterators to the next pointer rather than fully advancing it and the following advancing will recognize that the cursor has already been moved forward and do the rest of advancing. This ensures that when a task moves from one list to another in its cset, as long as it moves in the right direction, it's always visible to iteration. This doesn't cause any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-09 17:53:36 +02:00
Greg Kroah-Hartman	334aa9b115	Merge 4.14.128 into android-4.14 Changes in 4.14.128 drm/nouveau: add kconfig option to turn off nouveau legacy contexts. (v3) nouveau: Fix build with CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT disabled HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary ALSA: oxfw: allow PCM capture for Stanton SCS.1m ALSA: hda/realtek - Update headset mode for ALC256 ALSA: firewire-motu: fix destruction of data for isochronous resources libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node fs/ocfs2: fix race in ocfs2_dentry_attach_lock() mm/vmscan.c: fix trying to reclaim unevictable LRU page signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO ptrace: restore smp_rmb() in __ptrace_may_access() media: v4l2-ioctl: clear fields in s_parm iommu/arm-smmu: Avoid constant zero in TLBI writes i2c: acorn: fix i2c warning bcache: fix stack corruption by PRECEDING_KEY() cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() ASoC: cs42xx8: Add regcache mask dirty ASoC: fsl_asrc: Fix the issue about unsupported rate drm/i915/sdvo: Implement proper HDMI audio support for SDVO x86/uaccess, kcov: Disable stack protector ALSA: seq: Protect in-kernel ioctl calls with mutex ALSA: seq: Fix race of get-subscription call vs port-delete ioctls Revert "ALSA: seq: Protect in-kernel ioctl calls with mutex" s390/kasan: fix strncpy_from_user kasan checks Drivers: misc: fix out-of-bounds access in function param_set_kgdbts_var scsi: qedi: remove memset/memcpy to nfunc and use func instead scsi: qedi: remove set but not used variables 'cdev' and 'udev' scsi: lpfc: add check for loss of ndlp when sending RRQ arm64/mm: Inhibit huge-vmap with ptdump nvme: remove the ifdef around nvme_nvm_ioctl platform/x86: pmc_atom: Add Lex 3I380D industrial PC to critclk_systems DMI table platform/x86: pmc_atom: Add several Beckhoff Automation boards to critclk_systems DMI table scsi: bnx2fc: fix incorrect cast to u64 on shift operation libnvdimm: Fix compilation warnings with W=1 selftests/timers: Add missing fflush(stdout) calls usbnet: ipheth: fix racing condition KVM: x86/pmu: do not mask the value that is written to fixed PMUs KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION drm/vmwgfx: integer underflow in vmw_cmd_dx_set_shader() leading to an invalid read drm/vmwgfx: NULL pointer dereference from vmw_cmd_dx_view_define() usb: dwc2: Fix DMA cache alignment issues usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression) USB: Fix chipmunk-like voice when using Logitech C270 for recording audio. USB: usb-storage: Add new ID to ums-realtek USB: serial: pl2303: add Allied Telesis VT-Kit3 USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode USB: serial: option: add Telit 0x1260 and 0x1261 compositions RAS/CEC: Fix binary search function x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback x86/kasan: Fix boot with 5-level paging and KASAN rtc: pcf8523: don't return invalid date when battery is low Linux 4.14.128 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-06-19 10:16:09 +02:00
Tejun Heo	0b1c102e9d	cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream. A PF_EXITING task can stay associated with an offline css. If such task calls task_get_css(), it can get stuck indefinitely. This can be triggered by BSD process accounting which writes to a file with PF_EXITING set when racing against memcg disable as in the backtrace at the end. After this change, task_get_css() may return a css which was already offline when the function was called. None of the existing users are affected by this change. INFO: rcu_sched self-detected stall on CPU INFO: rcu_sched detected stalls on CPUs/tasks: ... NMI backtrace for cpu 0 ... Call Trace: <IRQ> dump_stack+0x46/0x68 nmi_cpu_backtrace.cold.2+0x13/0x57 nmi_trigger_cpumask_backtrace+0xba/0xca rcu_dump_cpu_stacks+0x9e/0xce rcu_check_callbacks.cold.74+0x2af/0x433 update_process_times+0x28/0x60 tick_sched_timer+0x34/0x70 __hrtimer_run_queues+0xee/0x250 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x56/0x110 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0 ... btrfs_file_write_iter+0x31b/0x563 __vfs_write+0xfa/0x140 __kernel_write+0x4f/0x100 do_acct_process+0x495/0x580 acct_process+0xb9/0xdb do_exit+0x748/0xa00 do_group_exit+0x3a/0xa0 get_signal+0x254/0x560 do_signal+0x23/0x5c0 exit_to_usermode_loop+0x5d/0xa0 prepare_exit_to_usermode+0x53/0x80 retint_user+0x8/0x8 Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.2+ Fixes: `ec438699a9` ("cgroup, block: implement task_get_css() and use it in bio_associate_current()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-06-19 08:20:55 +02:00
Greg Kroah-Hartman	171fc237b3	Merge 4.14.111 into android-4.14 Changes in 4.14.111 arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals ext4: cleanup bh release code in ext4_ind_remove_space() lib/int_sqrt: optimize initial value compute tty/serial: atmel: Add is_half_duplex helper tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified CIFS: fix POSIX lock leak and invalid ptr deref h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- f2fs: fix to avoid deadlock in f2fs_read_inline_dir() tracing: kdb: Fix ftdump to not sleep net/mlx5: Avoid panic when setting vport rate net/mlx5: Avoid panic when setting vport mac, getting vport config gpio: gpio-omap: fix level interrupt idling include/linux/relay.h: fix percpu annotation in struct rchan sysctl: handle overflow for file-max enic: fix build warning without CONFIG_CPUMASK_OFFSTACK scsi: hisi_sas: Set PHY linkrate when disconnected iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver perf c2c: Fix c2c report for empty numa node mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm, mempolicy: fix uninit memory access mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! mm/slab.c: kmemleak no scan alien caches ocfs2: fix a panic problem caused by o2cb_ctl f2fs: do not use mutex lock in atomic context fs/file.c: initialize init_files.resize_wait page_poison: play nicely with KASAN cifs: use correct format characters dm thin: add sanity checks to thin-pool and external snapshot creation cifs: Fix NULL pointer dereference of devname jbd2: fix invalid descriptor block checksum fs: fix guard_bio_eod to check for real EOD errors tools lib traceevent: Fix buffer overflow in arg_eval PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove() wil6210: check null pointer in _wil_cfg80211_merge_extra_ies crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name usb: chipidea: Grab the (legacy) USB PHY by phandle first scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc coresight: etm4x: Add support to enable ETMv4.2 serial: 8250_pxa: honor the port number from devicetree ARM: 8840/1: use a raw_spinlock_t in unwind iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback mmc: omap: fix the maximum timeout setting e1000e: Fix -Wformat-truncation warnings mlxsw: spectrum: Avoid -Wformat-truncation warnings IB/mlx4: Increase the timeout for CM cache clk: fractional-divider: check parent rate only if flag is set cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies efi: cper: Fix possible out-of-bounds access scsi: megaraid_sas: return error when create DMA pool failed scsi: fcoe: make use of fip_mode enum complete perf test: Fix failure of 'evsel-tp-sched' test on s390 SoC: imx-sgtl5000: add missing put_device() media: sh_veu: Correct return type for mem2mem buffer helpers media: s5p-jpeg: Correct return type for mem2mem buffer helpers media: s5p-g2d: Correct return type for mem2mem buffer helpers media: mx2_emmaprp: Correct return type for mem2mem buffer helpers media: mtk-jpeg: Correct return type for mem2mem buffer helpers vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 HID: intel-ish-hid: avoid binding wrong ishtp_cl_device jbd2: fix race when writing superblock leds: lp55xx: fix null deref on firmware load failure iwlwifi: pcie: fix emergency path ACPI / video: Refactor and fix dmi_is_desktop() kprobes: Prohibit probing on bsearch() netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm ARM: 8833/1: Ensure that NEON code always compiles with Clang ALSA: PCM: check if ops are defined before suspending PCM usb: f_fs: Avoid crash due to out-of-scope stack ptr access sched/topology: Fix percpu data types in struct sd_data & struct s_data bcache: fix input overflow to cache set sysfs file io_error_halflife bcache: fix input overflow to sequential_cutoff bcache: improve sysfs_strtoul_clamp() genirq: Avoid summation loops for /proc/stat iw_cxgb4: fix srqidx leak during connection abort fbdev: fbmem: fix memory access if logo is bigger than the screen cdrom: Fix race condition in cdrom_sysctl_register e1000e: fix cyclic resets at link up with active tx platform/x86: intel_pmc_core: Fix PCH IP sts reading ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK efi/memattr: Don't bail on zero VA if it equals the region's PA ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted soc: qcom: gsbi: Fix error handling in gsbi_probe() mt7601u: bump supported EEPROM version ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of ARM: avoid Cortex-A9 livelock on tight dmb loops bpf: fix missing prototype warnings cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state tty: increase the default flip buffer limit to 2640K powerpc/pseries: Perform full re-add of CPU for topology update post-migration usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded media: mt9m111: set initial frame size other than 0x0 hwrng: virtio - Avoid repeated init of completion soc/tegra: fuse: Fix illegal free of IO base address HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable cpu/hotplug: Mute hotplug lockdep during init dmaengine: imx-dma: fix warning comparison of distinct pointer types dmaengine: qcom_hidma: assign channel cookie correctly dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_ netfilter: physdev: relax br_netfilter dependency media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting drm: Auto-set allow_fb_modifiers when given modifiers at plane init drm/nouveau: Stop using drm_crtc_force_disable x86/build: Specify elf_i386 linker emulation explicitly for i386 objects selinux: do not override context on context mounts wlcore: Fix memory leak in case wl12xx_fetch_firmware failure x86/build: Mark per-CPU symbols as absolute explicitly for LLD clk: rockchip: fix frac settings of GPLL clock for rk3328 dmaengine: tegra: avoid overflow of byte tracking drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers ACPI / video: Extend chassis-type detection with a "Lunch Box" check Linux 4.14.111 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-04-05 22:40:00 +02:00
Oleg Nesterov	f3b3b54347	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ] The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which needs pids_free() to uncharge the pid. However, ->free() is called from __put_task_struct()->cgroup_free() and this is too late. Even the trivial program which does for (;;) { int pid = fork(); assert(pid >= 0); if (pid) wait(NULL); else exit(0); } can run out of limits because release_task()->call_rcu(delayed_put_task_struct) implies an RCU gp after the task/pid goes away and before the final put(). Test-case: mkdir -p /tmp/CG mount -t cgroup2 none /tmp/CG echo '+pids' > /tmp/CG/cgroup.subtree_control mkdir /tmp/CG/PID echo 2 > /tmp/CG/PID/pids.max perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' & echo $! > /tmp/CG/PID/cgroup.procs Without this patch the forking process fails soon after migration. Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite into the new helper, cgroup_release(), called by release_task() which actually frees the pid(s). Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:31:37 +02:00
Johannes Weiner	edd6f9faed	BACKPORT: psi: cgroup support On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60) Conflicts: Documentation/cgroup-v2.txt include/linux/psi.h kernel/cgroup/cgroup.c (1. manual merge from Documentation/admin-guide/cgroup-v2.rst 2. include <linux/cgroup-defs.h> into include/linux/psi.h 3. manual merge in css_free_work_fn to allow psi support only for cgroup v2 4. manual merge in cgroup_create to allow psi support only for cgroup v2) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:19:55 -07:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Linus Torvalds	a0725ab0c7	Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...	2017-09-07 11:59:42 -07:00
Tejun Heo	3e48930cc7	cgroup: misc changes Misc trivial changes to prepare for future changes. No functional difference. * Expose cgroup_get(), cgroup_tryget() and cgroup_parent(). * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp. * Rename cgroup_stats_show() to cgroup_stat_show() for consistency with the file name. Signed-off-by: Tejun Heo <tj@kernel.org>	2017-08-11 05:49:01 -07:00
Shaohua Li	69fd5c3917	blktrace: add an option to allow displaying cgroup path By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	121508df44	cgroup: export fhandle info for a cgroup Add an API to export cgroup fhandle info. We don't export a full 'struct file_handle', there are unrequired info. Sepcifically, cgroup is always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle, we only need export the inode number and generation number just like what generic_fh_to_dentry does. And we can avoid the overhead of getting an inode too, since kernfs_node_id (ino and generation) has all the info required. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	c53cd490b1	kernfs: introduce kernfs_node_id inode number and generation can identify a kernfs node. We are going to export the identification by exportfs operations, so put ino and generation into a separate structure. It's convenient when later patches use the identification. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Tejun Heo	450ee0c1fe	cgroup: implement CSS_TASK_ITER_THREADED cgroup v2 is in the process of growing thread granularity support. Once thread mode is enabled, the root cgroup of the subtree serves as the dom_cgrp to which the processes of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. In the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch implements a new task iterator flag CSS_TASK_ITER_THREADED, which, when used on a dom_cgrp, makes the iteration include the tasks on all the associated threaded css_sets. "cgroup.procs" read path is updated to use it so that reading the file on a proc_cgrp lists all processes. This will also be used by controller implementations which need to walk processes or tasks at the resource domain level. Task iteration is implemented nested in css_set iteration. If CSS_TASK_ITER_THREADED is specified, after walking tasks of each !threaded css_set, all the associated threaded css_sets are visited before moving onto the next !threaded css_set. v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new enable-threaded-per-cgroup behavior. Signed-off-by: Tejun Heo <tj@kernel.org>	2017-07-21 11:14:51 -04:00
Tejun Heo	454000adaa	cgroup: introduce cgroup->dom_cgrp and threaded css_set handling cgroup v2 is in the process of growing thread granularity support. A threaded subtree is composed of a thread root and threaded cgroups which are proper members of the subtree. The root cgroup of the subtree serves as the domain cgroup to which the processes (as opposed to threads / tasks) of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. Inside the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch introduces cgroup->dom_cgrp along with threaded css_set handling. * cgroup->dom_cgrp points to self for normal and thread roots. For proper thread subtree members, points to the dom_cgrp (the thread root). * css_set->dom_cset points to self if for normal and thread roots. If threaded, points to the css_set which belongs to the cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps the matching csses available. The dom_cset holds those csses and makes them easily accessible. * All threaded csets are linked on their dom_csets to enable iteration of all threaded tasks. * cgroup->nr_threaded_children keeps track of the number of threaded children. This patch adds the above but doesn't actually use them yet. The following patches will build on top. v4: ->nr_threaded_children added. v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new enable-threaded-per-cgroup behavior. v2: Added cgroup_is_threaded() helper. Signed-off-by: Tejun Heo <tj@kernel.org>	2017-07-21 11:14:51 -04:00
Tejun Heo	bc2fb7ed08	cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: Tejun Heo <tj@kernel.org>	2017-07-21 11:14:51 -04:00
Tejun Heo	788b950c62	cgroup: distinguish local and children populated states cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty. This patch splits the counter into two so that local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy which will be used to replace css_set walking local populated test. Signed-off-by: Tejun Heo <tj@kernel.org>	2017-07-16 21:44:42 -04:00
Tejun Heo	41c25707d2	cpuset: consider dying css as offline In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>	2017-05-24 12:43:30 -04:00
Linus Torvalds	9410091dd5	Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing major. Two notable fixes are Li's second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get(). All other changes are trivial" * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark cgroup_get() with __maybe_unused cgroup: avoid attaching a cgroup root to two different superblocks, take 2 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() cgroup: move cgroup_subsys_state parent field for cache locality cpuset: Remove cpuset_update_active_cpus()'s parameter. cgroup: switch to BUG_ON() cgroup: drop duplicate header nsproxy.h kernel: convert css_set.refcount from atomic_t to refcount_t kernel: convert cgroup_namespace.count from atomic_t to refcount_t	2017-05-01 13:52:24 -07:00
Geliang Tang	8f48cfabac	cgroup: drop duplicate header nsproxy.h Drop duplicate header nsproxy.h from linux/cgroup.h. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2017-03-24 11:36:16 -04:00
Tejun Heo	77f88796ce	cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups Creation of a kthread goes through a couple interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator then can further configure the kthread and then let it start doing its job by waking it up. In this configuration-by-creator stage, the creator is the only one that can wake it up but the kthread is visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things. Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation. This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrected CPU affinity breaking concurrency management and sometimes stalling workqueue execution. This patch adds task->no_cgroup_migration which disallows the task to be migrated by userland. kthreadd starts with the flag set making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup. It'd be better to wait for the initialization instead of failing but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything. v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-debugged-by: Chris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3) Signed-off-by: Tejun Heo <tj@kernel.org>	2017-03-17 10:18:47 -04:00
Elena Reshetova	387ad9674b	kernel: convert cgroup_namespace.count from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2017-03-06 14:55:22 -05:00
Geliang Tang	7b4632f048	cgroup: fix a comment typo Fix a comment typo in cgroup.h. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2016-12-27 14:52:47 -05:00
Linus Torvalds	f34d3606f7	Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()	2016-10-14 12:18:50 -07:00
Linus Torvalds	14986a34e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This set of changes is a number of smaller things that have been overlooked in other development cycles focused on more fundamental change. The devpts changes are small things that were a distraction until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an trivial regression fix to autofs for the unprivileged mount changes that went in last cycle. A pair of ioctls has been added by Andrey Vagin making it is possible to discover the relationships between namespaces when referring to them through file descriptors. The big user visible change is starting to add simple resource limits to catch programs that misbehave. With namespaces in general and user namespaces in particular allowing users to use more kinds of resources, it has become important to have something to limit errant programs. Because the purpose of these limits is to catch errant programs the code needs to be inexpensive to use as it always on, and the default limits need to be high enough that well behaved programs on well behaved systems don't encounter them. To this end, after some review I have implemented per user per user namespace limits, and use them to limit the number of namespaces. The limits being per user mean that one user can not exhause the limits of another user. The limits being per user namespace allow contexts where the limit is 0 and security conscious folks can remove from their threat anlysis the code used to manage namespaces (as they have historically done as it root only). At the same time the limits being per user namespace allow other parts of the system to use namespaces. Namespaces are increasingly being used in application sand boxing scenarios so an all or nothing disable for the entire system for the security conscious folks makes increasing use of these sandboxes impossible. There is also added a limit on the maximum number of mounts present in a single mount namespace. It is nontrivial to guess what a reasonable system wide limit on the number of mount structure in the kernel would be, especially as it various based on how a system is using containers. A limit on the number of mounts in a mount namespace however is much easier to understand and set. In most cases in practice only about 1000 mounts are used. Given that some autofs scenarious have the potential to be 30,000 to 50,000 mounts I have set the default limit for the number of mounts at 100,000 which is well above every known set of users but low enough that the mount hash tables don't degrade unreaonsably. These limits are a start. I expect this estabilishes a pattern that other limits for resources that namespaces use will follow. There has been interest in making inotify event limits per user per user namespace as well as interest expressed in making details about what is going on in the kernel more visible" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits) autofs: Fix automounts by using current_real_cred()->uid mnt: Add a per mount namespace limit on the number of mounts netns: move {inc,dec}_net_namespaces into #ifdef nsfs: Simplify __ns_get_path tools/testing: add a test to check nsfs ioctl-s nsfs: add ioctl to get a parent namespace nsfs: add ioctl to get an owning user namespace for ns file descriptor kernel: add a helper to get an owning user namespace for a namespace devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts devpts: Remove sync_filesystems devpts: Make devpts_kill_sb safe if fsi is NULL devpts: Simplify devpts_mount by using mount_nodev devpts: Move the creation of /dev/pts/ptmx into fill_super devpts: Move parse_mount_options into fill_super userns: When the per user per user namespace limit is reached return ENOSPC userns; Document per user per user namespace limits. mntns: Add a limit on the number of mount namespaces. netns: Add a limit on the number of net namespaces cgroupns: Add a limit on the number of cgroup namespaces ipcns: Add a limit on the number of ipc namespaces ...	2016-10-06 09:52:23 -07:00
Sargun Dhillon	aed704b7a6	cgroup: Add task_under_cgroup_hierarchy cgroup inline function to headers This commit adds an inline function to cgroup.h to check whether a given task is under a given cgroup hierarchy. This is to avoid having to put ifdefs in .c files to gate access to cgroups. When cgroups are disabled this always returns true. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-12 21:49:41 -07:00
Tejun Heo	4c737b41de	cgroup: make cgroup_path() and friends behave in the style of strlcpy() cgroup_path() and friends used to format the path from the end and thus the resulting path usually didn't start at the start of the passed in buffer. Also, when the buffer was too small, the partial result was truncated from the head rather than tail and there was no way to tell how long the full path would be. These make the functions less robust and more awkward to use. With recent updates to kernfs_path(), cgroup_path() and friends can be made to behave in strlcpy() style. * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now always return the length of the full path. If buffer is too small, it contains nul terminated truncated output. * All users updated accordingly. v2: cgroup_path() usage in kernel/sched/debug.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org>	2016-08-10 11:23:44 -04:00
Tejun Heo	3abb1d90f5	kernfs: make kernfs_path() behave in the style of strlcpy() kernfs_path() functions always return the length of the full path but the path content is undefined if the length is larger than the provided buffer. This makes its behavior different from strlcpy() and requires error handling in all its users even when they don't care about truncation. In addition, the implementation can actully be simplified by making it behave properly in strlcpy() style. * Update kernfs_path_from_node_locked() to always fill up the buffer with path. If the buffer is not large enough, the output is truncated and terminated. * kernfs_path() no longer needs error handling. Make it a simple inline wrapper around kernfs_path_from_node(). * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling. Updated accordingly. * cgroup_path()'s use of kernfs_path() updated to retain the old behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>	2016-08-10 11:23:44 -04:00
Eric W. Biederman	d08311dd6f	cgroupns: Add a limit on the number of cgroup namespaces Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-08-08 14:42:03 -05:00
Martin KaFai Lau	1f3fe7ebf6	cgroup: Add cgroup_get_from_fd Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:30:38 -04:00
Aditya Kali	a79a908fd2	cgroup: introduce cgroup namespaces Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2016-02-16 13:04:58 -05:00
Linus Torvalds	34a9304a96	Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup v2 interface is now official. It's no longer hidden behind a devel flag and can be mounted using the new cgroup2 fs type. Unfortunately, cpu v2 interface hasn't made it yet due to the discussion around in-process hierarchical resource distribution and only memory and io controllers can be used on the v2 interface at the moment. - The existing documentation which has always been a bit of mess is relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt is added as the authoritative documentation for the v2 interface. - Some features are added through for-4.5-ancestor-test branch to enable netfilter xt_cgroup match to use cgroup v2 paths. The actual netfilter changes will be merged through the net tree which pulled in the said branch. - Various cleanups * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rename cgroup documentations cgroup: fix a typo. cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX. cgroup: demote subsystem init messages to KERN_DEBUG cgroup: Fix uninitialized variable warning cgroup: put controller Kconfig options in meaningful order cgroup: clean up the kernel configuration menu nomenclature cgroup_pids: fix a typo. Subject: cgroup: Fix incomplete dd command in blkio documentation cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends cpuset: Replace all instances of time_t with time64_t cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/ cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type	2016-01-12 19:20:32 -08:00
David S. Miller	b3e0d3d7ba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-12-17 22:08:28 -05:00
Tejun Heo	bd1060a1d6	sock, cgroup: add sock->sk_cgroup In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Wagner <daniel.wagner@bmw-carit.de> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-12-08 22:02:33 -05:00
Tejun Heo	177493987c	Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.5 Signed-off-by: Tejun Heo <tj@kernel.org>	2015-12-07 17:24:10 -05:00

1 2 3 4 5 ...

386 Commits