cgroup_path() and friends used to format the path from the end and
thus the resulting path usually didn't start at the start of the
passed in buffer. Also, when the buffer was too small, the partial
result was truncated from the head rather than tail and there was no
way to tell how long the full path would be. These make the functions
less robust and more awkward to use.
With recent updates to kernfs_path(), cgroup_path() and friends can be
made to behave in strlcpy() style.
* cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
always return the length of the full path. If buffer is too small,
it contains nul terminated truncated output.
* All users updated accordingly.
v2: cgroup_path() usage in kernel/sched/debug.c converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
Change-Id: Ibb5c7544237e0ab6cde03c8bec80b3376917c0a8
Original patch was cherry-picked from 3.18 kernel,
commit 324181821c790ca02a2180ff01576d3c4d15fa6d
("Move sync logic to be run always") and reworked to account for
differences in want_affine calculations.
Lift sync logic out of select_energy_cpu_brute() and run that logic before
select_idle_sibling() or select_energy_cpu_brute().
At high CPU utilization, the scheduler falls back to
select_idle_sibling(). On Marlin, select_idle_sibling() wasn't
respecting the sync flag to force the task to be scheduled on the same
CPU. This caused the sync flag to be ignored and thrash on binder with
sync enabled. This fixes the thrash we saw.
Bug 38260530
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I1ecd8de6af010248ebc9daa5f80689f521500fc9
cherry-picked from 3.18 kernel, commit 324181821c790ca02a2180ff01576d3c4d15fa6d ("Move sync logic to be run always")
Lift sync logic out of select_energy_cpu_brute() and run that logic before
select_idle_sibling() or select_energy_cpu_brute().
At high CPU utilization, the scheduler falls back to
select_idle_sibling(). On Marlin, select_idle_sibling() wasn't
respecting the sync flag to force the task to be scheduled on the same
CPU. This caused the sync flag to be ignored and thrash on binder with
sync enabled. This fixes the thrash we saw.
Bug 38260530
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5b5552e87b76028328b6f6439f11dc7de43fdfc0
Signed-off-by: Joel Fernandes <joelaf@google.com>
Add and enable EAS support by bringing remote tracking branch
android-msm-8998-4.4-common in kernel/private/msm-qcom into
branch android-msm-muskie-4.4
Manual merges:
arch/arm/boot/dts/qcom/msm8998.dtsi (from msmcobalt.dtsi in android-msm-8998-4.4-common)
arch/arm64/configs/muskie_defconfig (from msmcobalt_defconfig in android-msm-8998-4.4-common)
include/linux/cpu.h (taken from android-msm-muskie-4.4)
drivers/staging/qca-wifi-host-cmn/hif/src/hif_napi.c (remove core_ctl_set_boost calls)
Change-Id: I8e0dd7233d4ab7122d5e3d7b5c8d5fc42165a494
Signed-off-by: Andres Oportus <andresoportus@google.com>
The previous patches in this rewrite of scheduler guided frequency
selection reintroduces the part-picture problem that we addressed in
our initial implementation. In that, when tasks migrate across CPUs
within a cluster, we end up losing the complete picture of the
sequential nature of the workload.
This patch aims to solve that problem slightly differently. We track
the top task on every CPU within a window. Top task is defined as the
task that runs the most in a given window. This enhances our ability
to detect the sequential nature of workloads. A single migrating task
executing for an entire window will cause 100% load to be reported
for frequency guidance instead of the maximum footprint left on any
individual CPU in the task's trail. There are cases, that this new
approach does not address. Namely, cases where the sum of two or more
tasks accurately reflects the true sequential nature of the workload.
Future optimizations might aim to tackle that problem.
To track top tasks, we first realize that there is no strict need to
maintain the task struct itself as long as we know the load exerted by
the top task. We also realize that to maintain top tasks on every CPU
we have to track the execution of every single task that runs during
the window. The load associated with a task needs to be migrated when
the task migrates from one CPU to another. When the top task migrates
away, we need to locate the second top task and so on.
Given the above realizations, we use hashmaps to track top task load
both for the current and the previous window. This hashmap is
implemented as an array of fixed size. The key of the hashmap is given
by task_execution_time_in_a_window / array_size. The size of the array
(number of buckets in the hashmap) dictate the load granularity of each
bucket. The value stored in each bucket is a refcount of all the tasks
that executed long enough to be in that bucket.
This approach has a few benefits. Firstly, any top task stats update
now take O(1) time. While task migration is also O(1), it does still
involve going through up to the size of the array to find the second
top task. Further patches will aim to optimize this behavior. Secondly,
and more importantly, not having to store the task struct itself saves
a lot of memory usage in that 1) there is no need to retrieve task
structs later causing cache misses and 2) we don't have to unnecessarily
hold up task memory for up to 2 full windows by calling get_task_struct()
after a task exits.
Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Code sections found either CONFIG_SCHED_HMP or !CONFIG_SCHED_HMP
have become quite fragmented over time. Some of these fragmented
sections are necessary because of the code dependencies. Others
fragmented sections can easily be consolidated. Do so in order
to make kernel upgrades a lot simpler.
Change-Id: I6be476834ce70274aec5a52fd9455b5f0065af87
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
PELT extensions for HMP have never been used since the early days
of the HMP scheduler. Furthermore, changes to PELT itself in newer
kernel versions render some of the code redundant or incorrect. These
extensions have not been tested for a long time and are practically
dead code. Remove it so that future upgrades become easier.
Change-Id: I029f327406ca00b2370c93134158b61dda3b81e3
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
The max_possible_efficiency and CPU's efficiency are fixed values which
are determined at cluster allocation time. Avoid division on the fast
by using precomputed scale factor.
Also update_cpu_busy_time() doesn't need to know how many full windows
have elapsed. Thus replace unneeded division with simple comparison.
Change-Id: I2be1aad3fb9b895e4f0917d05bd8eade985bbccf
Suggested-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
A cluster is set of CPUs sharing some power controls and an L2 cache.
This patch buids a list of clusters at bootup which are sorted by
their max_power_cost. Many cluster-shared attributes like cur_freq,
max_freq etc are needlessly maintained in per-cpu 'struct rq' currently.
Consolidate them in a cluster structure.
Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in
arch/arm64/kernel/topology.c. fixed conflict due to ommited changes for
CONFIG_SCHED_QHMP.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Add per-cpu tunable to set the extra cost to use a CPU that is idle.
Add the same for a cluster.
Change-Id: I4aa53f3c42c963df7abc7480980f747f0413d389
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: omitted changes for qhmp*.[c,h] stripped out
CONFIG_SCHED_QHMP in drivers/base/cpu.c and include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
For the fair sched class, update the select_best_cpu() policy to do
power based placement. The hope is to minimize the voltage at which
the CPU runs.
While RT tasks already do power based placement, their placement
preference has to now take into account the power cost of all tasks
on a given CPU. Also remove the check for sched_boost since
sched_boost no longer intends to elevate all tasks to the highest
capacity cluster.
Change-Id: Ic6a7625c97d567254d93b94cec3174a91727cb87
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Task packing will now be determined solely on the basis of the
power cost of task placement. All tasks are eligible for packing.
Remove the notion of "small" tasks from the scheduler.
Change-Id: I72d52d04b2677c6a8d0bc6aa7d50ff0f1a4f5ebb
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
CFS_BANDWIDTH feature is not currently well-supported by HMP
scheduler. Issues encountered include a kernel panic when
rq->nr_big_tasks count becomes negative. This patch fixes HMP
scheduler code to better handle CFS_BANDWIDTH feature. The most
prominent change introduced is maintenance of HMP stats (nr_big_tasks,
nr_small_tasks, cumulative_runnable_avg) per 'struct cfs_rq' in
addition to being maintained in each 'struct rq'. This allows HMP
stats to be updated easily when a group is throttled on a cpu.
Change-Id: Iad9f378b79ab5d9d76f86d1775913cc1941e266a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in dequeue_task_fair().]
Key hmp stats (nr_big_tasks, nr_small_tasks and
cumulative_runnable_average) are currently maintained per-cpu in
'struct rq'. Merge those stats in their own structure (struct
hmp_sched_stats) and modify impacted functions to deal with the newly
introduced structure. This cleanup is required for a subsequent patch
which fixes various issues with use of CFS_BANDWIDTH feature in HMP
scheduler.
Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in __update_load_avg().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.
This patch removes the global mostly_idle tunables and makes them
per-cpu, thus letting task packing behavior to be controlled in a
fine-grained manner.
Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
rq->curr/prev_runnable_sum counters represent cpu demand from various
tasks that have run on a cpu. Any task that runs on a cpu will have a
representation in rq->curr_runnable_sum. Their partial_demand value
will be included in rq->curr_runnable_sum. Since partial_demand is
derived from historical load samples for a task, rq->curr_runnable_sum
could represent "inflated/un-realistic" cpu usage. As an example, lets
say that task with partial_demand of 10ms runs for only 1ms on a cpu.
What is included in rq->curr_runnable_sum is 10ms (and not the actual
execution time of 1ms). This leads to cpu busy time being reported on
the upside causing frequency to stay higher than necessary.
This patch fixes cpu busy accounting scheme to strictly represent
actual usage. It also provides for conditional fixup of busy time upon
migration and upon heavy-task wakeup.
CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in init_task_load(),
se.avg.decay_count has deprecated.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Printing window size in /proc/sched_debug would provide useful
information to debug scheduler issues.
Change-Id: Ia12ab2cb544f41a61c8a1d87bf821b85a19e09fd
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
In the absence of a power driver providing real power values, the scheduler
currently defaults to using capacity of a CPU as a measure of power. This,
however, is not a good measure since the capacity of a CPU can change due
to thermal conditions and/or other hardware restrictions. These frequency
restrictions have no effect on the power efficiency of those CPUs.
Introduce max possible capacity of a CPU to track an absolute measure of
capacity which translates into a good absolute measure of power efficiency.
Max possible capacity takes the max possible frequency of CPUs into account
instead of max frequency.
Change-Id: Ia970b853e43a90eb8cc6fd990b5c47fca7e50db8
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Provide information in /proc/sched_debug on min_capacity, max_capacity
and whether pelt or window-based task load statistics is in use.
Change-Id: Ie4e9450652f4c83110dda75be3ead8aa5bb355c3
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Extend "sched" file in /proc for every task to provide information on
scaled load statistics and percentage-scaled based load (load_avg) for
a task. This will be valuable debug aid.
Change-Id: I6ee0394b409c77c7f79f5b9ac560da03dc879758
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Provide additional information in /proc/sched_debug for every cpu.
This will be a valuable debug aid.
Change-Id: If22ee530e880cd21719242be7bc2c41308ad4186
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Calls to sysrq_sched_debug_show() can yield rather verbose output
which contributes to log spew and, under heavy load, may increase
the chances of a watchdog bark.
Make printing of this data optional with the introduction of a
new Kconfig, CONFIG_SYSRQ_SCHED_DEBUG.
Change-Id: I5f54d901d0dea403109f7ac33b8881d967a899ed
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:
1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
updated at the granularity of an entity at a time, which results in the
cfs_rq's load average is stale or partially updated: at any time, only
one entity is up to date, all other entities are effectively lagging
behind. This is undesirable.
To illustrate, if we have n runnable entities in the cfs_rq, as time
elapses, they certainly become outdated:
t0: cfs_rq { e1_old, e2_old, ..., en_old }
and when we update:
t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
...
We solve this by combining all runnable entities' load averages together
in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
on the fact that if we regard the update as a function, then:
w * update(e) = update(w * e) and
update(e1) + update(e2) = update(e1 + e2), then
w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
therefore, by this rewrite, we have an entirely updated cfs_rq at the
time we update it:
t1: update cfs_rq { e1_new, e2_new, ..., en_new }
t2: update cfs_rq { e1_new, e2_new, ..., en_new }
...
2. cfs_rq's load average is different between top rq->cfs_rq and other
task_group's per CPU cfs_rqs in whether or not blocked_load_average
contributes to the load.
The basic idea behind runnable load average (the same for utilization)
is that the blocked state is taken into account as opposed to only
accounting for the currently runnable state. Therefore, the average
should include both the runnable/running and blocked load averages.
This rewrite does that.
In addition, we also combine runnable/running and blocked averages
of all entities into the cfs_rq's average, and update it together at
once. This is based on the fact that:
update(runnable) + update(blocked) = update(runnable + blocked)
This significantly reduces the code as we don't need to separately
maintain/update runnable/running load and blocked load.
3. How task_group entities' share is calculated is complex and imprecise.
We reduce the complexity in this rewrite to allow a very simple rule:
the task_group's load_avg is aggregated from its per CPU cfs_rqs's
load_avgs. Then group entity's weight is simply proportional to its
own cfs_rq's load_avg / task_group's load_avg. To illustrate,
if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.
As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 44dba3d5d6 ("sched: Refactor task_struct to use
numa_faults instead of numa_* pointers") modified the way
tsk->numa_faults stats are accounted.
However that commit never touched show_numa_stats() that is displayed
in /proc/pid/sched and thus the numbers displayed in /proc/pid/sched
don't match the actual numbers.
Fix it by making sure that /proc/pid/sched reflects the task
fault numbers. Also add group fault stats too.
Also couple of more modifications are added here:
1. Format changes:
- Previously we would list two entries per node, one for private
and one for shared. Also the home node info was listed in each entry.
- Now preferred node, total_faults and current node are
displayed separately.
- Now there is one entry per node, that lists private,shared task and
group faults.
2. Unit changes:
- p->numa_pages_migrated was getting reset after every read of
/proc/pid/sched. It's more useful to have absolute numbers since
differential migrations between two accesses can be more easily
calculated.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:
- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.
- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter
- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf
- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure
- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).
A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.
All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.
* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
* cgroup->name handling dropped from cgroup_rename().
* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.
v2: Comment above oom_info_lock updated as suggested by Michal.
v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.
v4: Rebased on top of 0ab02ca8f8 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.
Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.
(Also add some sched_debug lines for cfs_bandwidth.)
Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>