'group_id' member of the ioctl (KBASE_IOCTL_CS_TILER_HEAP_INIT) struct
must be validated before initializing CSF tiler heap.
Otherwise out-of-boundary of memory group pools array for the CSF tiler
heap could happen and will potentially lead to kernel panic.
TI2: 933204 (DDK Precommit)
TI2: 933199 (BASE_CSF_TEST)
Bug: 259061568
Test: verified fix using poc
Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/4766
Change-Id: I209a3d5152a34c278c17383e4aa9080aa9735822
(cherry picked from commit 55b44117111bf6a7e324301cbbf4f89669fa04c3)
Customer reported an issue where an unexpected GPU page fault happened
due to Tiler trying to access the chunk that was already freed by the
Userspace. The issue was root caused to cacheline sharing between the
HeapContexts of 2 Tiler heaps of the same Kbase context.
The page fault occurred for an Application that made use of more than 1
GPU queue group where one of the group, and its corresponding Tiler heap
instance, is created and destroyed multiple times over the lifetime of
Application.
Kbase sub-allocates memory for a HeapContext from a 4KB page that is
mapped as cached on GPU side, and the memory for HeapContext is zeroed
on allocation through an uncached CPU mapping.
Since the size of HeapContext is 32 bytes, 2 HeapContexts (corresponding
to 2 Tiler heaps of the same context) can end up sharing the same GPU
cacheline (which is 64 bytes in size).
GPU page fault occurred as FW found a non NULL or stale value for the
'free_list_head' pointer in the HeapContext, even though the Heap was
newly created, and so FW assumed a free chunk is available and passed
the address of it to the Tiler and didn't raise an OoM event for Host.
The stale value was found as the zeroing of new HeapContext's memory on
allocation got lost due to the eviction of cacheline from L2 cache.
The cacheline became dirty when FW had updated the contents of older
HeapContext (sharing the cacheline with new HeapContext) on CSG suspend
operation.
This commit makes the GPU VA of HeapContext to be GPU cacheline aligned
to avoid cacheline sharing. The alignment would suffice and there is no
explicit cache flush needed when HeapContext is freed, as whole GPU
cache would anyways be flushed on Odin & Turse GPUs when the initial
chunks are freed just before the HeapContext is freed.
Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/4724/
Test: Boot to home
Bug: 259523790
Change-Id: Ie9e8bffcadbd2ca7705dcd44f9be76754e28138d
Signed-off-by: Jeremy Kemp <jeremykemp@google.com>
There was an issue on Turse platform where sometimes Kbase misses the
CSG idle event notification from FW. This happens when FW sends back to
back notifications for the same CSG for events like SYNC_UPDATE and IDLE
but Kbase gets a single IRQ and observes only the first event.
The issue was root caused to a barrier missing on Kbase side between the
write to JOB_IRQ_CLEAR register and the read from interface memory, i.e.
CSG_ACK. Without the barrier there is no guarantee about the ordering,
the write to JOB_IRQ_CLEAR can take effect after the read from interface
memory.
The ordering is needed considering the way FW & Kbase writes to the
JOB_IRQ_RAWSTAT & JOB_IRQ_CLEAR registers without any synchronization.
This commit adds dmb(osh) barrier after the write to JOB_IRQ_CLEAR to
resolve the issue.
TI2: 896668 (PLAN-12467r490 TGT CS Nightly, few CSF scenarios)
Bug: 243913790
Test: SST ~4600 hours
Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/3841
Change-Id: I094a3b55c8ae28e8126057cdaf81990f62cd388e
(cherry picked from commit 220d89fd264b11a5b68290c3ca5a8c232e1d45db)
This commit prints a console message when cpulist_parse() reports a
bad list of CPUs, and sets all CPUs' bits in that case. The reason for
setting all CPUs' bits is that this is the safe(r) choice for real-time
workloads, which would normally be the ones using the rcu_nocbs= kernel
boot parameter. Either way, later RCU console log messages list the
actual set of CPUs whose RCU callbacks will be offloaded.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Fiqri Ardyansyah <fiqri15072019@gmail.com>
Signed-off-by: Edwiin Kusuma Jaya <kutemeikito0905@gmail.com>
Currently, the rcu_nocbs= kernel boot parameter requires that a specific
list of CPUs be specified, and has no way to say "all of them".
As noted by user RavFX in a comment to Phoronix topic 1002538, this
is an inconvenient side effect of the removal of the RCU_NOCB_CPU_ALL
Kconfig option. This commit therefore enables the rcu_nocbs= kernel boot
parameter to be given the string "all", as in "rcu_nocbs=all" to specify
that all CPUs on the system are to have their RCU callbacks offloaded.
Another approach would be to make cpulist_parse() check for "all", but
there are uses of cpulist_parse() that do other checking, which could
conflict with an "all". This commit therefore focuses on the specific
use of cpulist_parse() in rcu_nocb_setup().
Just a note to other people who would like changes to Linux-kernel RCU:
If you send your requests to me directly, they might get fixed somewhat
faster. RavFX's comment was posted on January 22, 2018 and I first saw
it on March 5, 2019. And the only reason that I found it -at- -all- was
that I was looking for projects using RCU, and my search engine showed
me that Phoronix comment quite by accident. Your choice, though! ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Fiqri Ardyansyah <fiqri15072019@gmail.com>
Signed-off-by: Edwiin Kusuma Jaya <kutemeikito0905@gmail.com>
Expose through sysfs the nr_zones field of struct request_queue.
Exposing this value helps in debugging disk issues as well as
facilitating scripts based use of the disk (e.g. blktests).
For zoned block devices, the nr_zones field indicates the total number
of zones of the device calculated using the known disk capacity and
zone size. This number of zones is always 0 for regular block devices.
Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
introduce the blk_queue_nr_zones() function to return the correct value
for any device, regardless if CONFIG_BLK_DEV_ZONED is set.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
necessary to reset a range of zones. Similarly to what is done for
discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
executed asynchronously and a synchronous call done only for the last
BIO of the chain.
Modify blkdev_reset_zones() to operate similarly to
blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
avoid code duplication of that function in blk_zoned.c, rename
next_bio() into blk_next_bio() and declare it as a block internal
function in blk.h.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get a zoned block device total number of zones. The device can be a
partition of the whole device. The number of zones is always 0 for
regular block devices.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get a zoned block device zone size in number of 512 B sectors.
The zone size is always 0 for regular block devices.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in allocating more zone descriptors than the number of
zones a block device has for doing a zone report. Avoid doing that in
blkdev_report_zones_ioctl() by limiting the number of zone decriptors
allocated internally to process the user request.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the blkdev_nr_zones() helper function to get the total
number of zones of a zoned block device. This number is always 0 for a
regular block device (q->limits.zoned == BLK_ZONED_NONE case).
Replace hard-coded number of zones calculation in dmz_get_zoned_device()
with a call to this helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
32-bit long. Make sure all of the targets are calculated as 64-bit
values.
Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When debugging Kyber, it's really useful to know what latencies we've
been having, how the domain depths have been adjusted, and if we've
actually been throttling. Add three tracepoints, kyber_latency,
kyber_adjust, and kyber_throttled, to record that.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kyber's current heuristics have a few flaws:
- It's based on the mean latency, but p99 latency tends to be more
meaningful to anyone who cares about latency. The mean can also be
skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
This works for steady, high load, but is more sensitive to outliers
with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
device, but the user cares about the time spent in the kernel, as
well.
These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:
- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
to the target latency (e.g., if the p99 latency is double the target
latency, we will double the queue depth; if the p99 latency is half of
the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
depths down: we will only scale down if any domain's I/O latency
exceeds the target latency, which is an indicator of congestion in the
device.
These new heuristics are just as scalable as the heuristics they
replace.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 4bc6339a58 ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently SLUB uses a work scheduled after an RCU grace period to
deactivate a non-root kmem_cache. This mechanism can be reused for
kmem_caches release, but requires generalization for SLAB case.
Introduce kmemcg_cache_deactivate() function, which calls
allocator-specific __kmem_cache_deactivate() and schedules execution of
__kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
context after an rcu grace period.
Here is the new calling scheme:
kmemcg_cache_deactivate()
__kmemcg_cache_deactivate() SLAB/SLUB-specific
kmemcg_rcufn() rcu
kmemcg_workfn() work
__kmemcg_cache_deactivate_after_rcu() SLAB/SLUB-specific
instead of:
__kmemcg_cache_deactivate() SLAB/SLUB-specific
slab_deactivate_memcg_cache_rcu_sched() SLUB-only
kmemcg_rcufn() rcu
kmemcg_workfn() work
kmemcg_cache_deact_after_rcu() SLUB-only
For consistency, all allocator-specific functions start with "__".
Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_SLUB_CPU_PARTIAL is not set
This causes load spikes when the per-CPU partial caches are filled and
need to be drained, which is bad for maintaining low latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
CONFIG_POWERCAP=y
It enables the power capping sysfs interface for
different power zone devices.
Bug: 220884335
Change-Id: I11bc3efe06d2a02dcc602d223d3e6757088ca771
Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
CONFIG_CFG80211_CRDA_SUPPORT is not set
Since CRDA is not supported, disable CONFIG_CFG80211_CRDA_SUPPORT
by default.
Change-Id: I01bde48aea21612b9d5c79b11931999e02d610b4
CRs-Fixed: 2946898
Signed-off-by: Paul Zhang <paulz@codeaurora.org>
When building with clang + -Wtautological-pointer-compare, these
instances pop up:
kernel/profile.c:339:6: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
if (prof_cpu_mask != NULL)
^~~~~~~~~~~~~ ~~~~
kernel/profile.c:376:6: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
if (prof_cpu_mask != NULL)
^~~~~~~~~~~~~ ~~~~
kernel/profile.c:406:26: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
if (!user_mode(regs) && prof_cpu_mask != NULL &&
^~~~~~~~~~~~~ ~~~~
3 warnings generated.
This can be addressed with the cpumask_available helper, introduced in
commit f7e30f0 ("cpumask: Add helper cpumask_available()") to fix
warnings like this while keeping the code the same.
Link: ClangBuiltLinux#747
Link: http://lkml.kernel.org/r/20191022191957.9554-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.
[ 148.133411][ T5509] =====================================================
[ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
[ 148.137819][ T5509]
[ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
[ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.143823][ T5509] =====================================================
Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
CONFIG_MEMCG is not set
Pixel doesn't use the memcg but it hurts 15% performance in minor
fault benchmark so disable it until we see strong reason.
Bug: 169443770
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ifd9ddcd54559c590260d52f60a2e5e4b79c5480f
The rcu_qs is disabling IRQs by self so no need to do the same in raise_softirq
but instead we can save some cycles using raise_softirq_irqoff directly.
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Replace the license boiler plate with a SPDX license identifier.
While in the area, update an email address.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
The name rcu_check_callbacks() arguably made sense back in the early
2000s when RCU was quite a bit simpler than it is today, but it has
become quite misleading, especially with the advent of dyntick-idle
and NO_HZ_FULL. The rcu_check_callbacks() function is RCU's hook into
the scheduling-clock interrupt, and is now but one of many ways that
callbacks get promoted to invocable state.
This commit therefore changes the name to rcu_sched_clock_irq(),
which is the same number of characters and clearly indicates this
function's relation to the rest of the Linux kernel. In addition, for
the sake of consistency, rcu_flavor_check_callbacks() is also renamed
to rcu_flavor_sched_clock_irq().
While in the area, the header comments for both functions are reworked.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Event tracing is moving to SRCU in order to take advantage of the fact
that SRCU may be safely used from idle and even offline CPUs. However,
event tracing can invoke call_srcu() very early in the boot process,
even before workqueue_init_early() is invoked (let alone rcu_init()).
Therefore, call_srcu()'s attempts to queue work fail miserably.
This commit therefore detects this situation, and refrains from attempting
to queue work before rcu_init() time, but does everything else that it
would have done, and in addition, adds the srcu_struct to a global list.
The rcu_init() function now invokes a new srcu_init() function, which
is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for
each srcu_struct on the list. This all happens early enough in boot
that there is but a single CPU with interrupts disabled, which allows
synchronization to be dispensed with.
Of course, the queued work won't actually be invoked until after
workqueue_init() is invoked, which happens shortly after the scheduler
is up and running. This means that although call_srcu() may be invoked
any time after per-CPU variables have been set up, there is still a very
narrow window when synchronize_srcu() won't work, and this window
extends from the time that the scheduler starts until the time that
workqueue_init() returns. This can be fixed in a manner similar to
the fix for synchronize_rcu_expedited() and friends, but until someone
actually needs to use synchronize_srcu() during this window, this fix
is added churn for no benefit.
Finally, note that Tree SRCU's new srcu_init() function invokes
queue_work() rather than the queue_delayed_work() function that is
invoked post-boot. The reason is that queue_delayed_work() will (as you
would expect) post a timer, and timers have not yet been initialized.
So use of queue_work() avoids the complaints about use of uninitialized
spinlocks that would otherwise result. Besides, some delay is already
provide by the aforementioned fact that the queued work won't actually
be invoked until after the scheduler is up and running.
Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a long-running CPU-bound in-kernel task invokes call_rcu(), the
callback won't be invoked until the next context switch. If there are
no other runnable tasks (which is not an uncommon situation on deep
embedded systems), the callback might never be invoked.
This commit therefore causes rcu_check_callbacks() to ask the scheduler
for a context switch if there are callbacks posted that are still waiting
for a grace period.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit renames Tiny RCU functions so that the lowest level of
functionality is RCU (e.g., synchronize_rcu()) rather than RCU-sched
(e.g., synchronize_sched()). This provides greater naming compatibility
with Tree RCU, which will in turn permit more LoC removal once
the RCU-sched and RCU-bh update-side API is removed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix Tiny call_rcu()'s EXPORT_SYMBOL() in response to a bug
report from kbuild test robot. ]
Now that RCU-preempt knows about preemption disabling, its implementation
of synchronize_rcu() works for synchronize_sched(), and likewise for the
other RCU-sched update-side API members. This commit therefore confines
the RCU-sched update-side code to CONFIG_PREEMPT=n builds, and defines
RCU-sched's update-side API members in terms of those of RCU-preempt.
This means that any given build of the Linux kernel has only one
update-side flavor of RCU, namely RCU-preempt for CONFIG_PREEMPT=y builds
and RCU-sched for CONFIG_PREEMPT=n builds. This in turn means that kernels
built with CONFIG_RCU_NOCB_CPU=y have only one rcuo kthread per CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Now that the main RCU API knows about softirq disabling and softirq's
quiescent states, the RCU-bh update code can be dispensed with.
This commit therefore removes the RCU-bh update-side implementation and
defines RCU-bh's update-side API in terms of that of either RCU-preempt or
RCU-sched, depending on the setting of the CONFIG_PREEMPT Kconfig option.
In kernels built with CONFIG_RCU_NOCB_CPU=y this has the knock-on effect
of reducing by one the number of rcuo kthreads per CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* Converted the observer polling to be suspend-aware so it stops pinging every
32 ms. Added a pob_timer_active flag, suspend/resume notifier, and bumped the
interval to 64 ms. The hrtimer callback now quietly backs off when disabled.
* Maybe now the CPU can finally enjoy its beauty sleep.
Signed-off-by: Aeron-Aeron <aeronrules2@gmail.com>
- This should improve the reliability of 2.4GHz hotspot connections.
Change-Id: Iea450301518d22701c35040a2581cb37d2d39ccf
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>