This regresses 3DMark scores by a small margin because the register write
is performed outside the spinlock, which introduces a potential race
condition. Hence, revert this.
This reverts commit 0cbd93ad24ea0eaf839ed151149a1180ecf23a57.
Reported-by: Kazuki H <kazukih0205@gmail.com>
Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: EmanuelCN <emanuelghub@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
This implies that there's an unseen ordering dependency between test_bit()
and set_bit(), which isn't true; it just adds memory barriers for no
reason. Therefore, revert this.
This reverts commit fec203b7346886e7ec96af8e697ef15b66b304f7.
Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
The allocated dmapool pages are never freed for the lifetime of the pool.
There is no need for the two level list+stack lookup for finding a free
block since nothing is ever removed from the list. Just use a simple
stack, reducing time complexity to constant.
The implementation inserts the stack linking elements and the dma handle
of the block within itself when freed. This means the smallest possible
dmapool block is increased to at most 16 bytes to accommodate these
fields, but there are no existing users requesting a dma pool smaller
than that anyway.
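A minimal sketch of the embedded free stack described above; the struct
layout and helper names are abridged, not the exact ones from the patch:

  struct dma_block {
          struct dma_block *next_block;   /* stack link, stored in the freed block */
          dma_addr_t dma;                 /* dma handle, also stored in the block */
  };

  static struct dma_block *pool_block_pop(struct dma_pool *pool)
  {
          struct dma_block *block = pool->next_block;

          if (block)
                  pool->next_block = block->next_block;
          return block;
  }

  static void pool_block_push(struct dma_pool *pool, struct dma_block *block,
                              dma_addr_t dma)
  {
          block->dma = dma;
          block->next_block = pool->next_block;
          pool->next_block = block;
  }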
Removing the list yields a significant performance improvement. Using the
kernel's micro-benchmarking self test:
Before:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:57282
dmapool test: size:64 blocks:8192 time:172562
dmapool test: size:256 blocks:8192 time:789247
dmapool test: size:1024 blocks:2048 time:371823
dmapool test: size:4096 blocks:1024 time:362237
After:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:24997
dmapool test: size:64 blocks:8192 time:26584
dmapool test: size:256 blocks:8192 time:33542
dmapool test: size:1024 blocks:2048 time:9022
dmapool test: size:4096 blocks:1024 time:6045
The module test allocates quite a few blocks that may not accurately
represent how these pools are used in real life. For a more macro-level
benchmark, running fio high-depth + high-batched on nvme, this patch shows
submission and completion latency reduced by ~100usec each, a 1% IOPS
improvement, and perf record's time spent in dma_pool_alloc/free reduced
by half.
[kbusch@kernel.org: push new blocks in ascending order]
Link: https://lkml.kernel.org/r/20230221165400.1595247-1-kbusch@meta.com
Link: https://lkml.kernel.org/r/20230126215125.4069751-12-kbusch@meta.com
Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device. But nobody
ended up using dma pools without a device, and trying to do so will result
in an oops. So remove the checks for pool->dev == NULL since they are
unneeded bloat.
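The create-time check added here amounts to something like this (sketch):

  /* in dma_pool_create(), before dev is dereferenced: */
  if (!dev)
          return NULL;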
[kbusch@kernel.org: add check for null dev on create]
Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com
Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
That's a LONG wait. Don't busy-wait there; sleeping instead saves power.
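The surrounding driver code isn't shown here, but the pattern is roughly
the following; the function and delay bounds are illustrative only:

  /* before: spins the CPU for the whole wait */
  udelay(10000);

  /* after: a wait this long in sleepable context should sleep instead */
  usleep_range(10000, 11000);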
Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Silences:
[41385.516876] GICv3: CPU1: found redistributor 100 region 0:0x0000000017a80000
[41385.519545] GICv3: CPU2: found redistributor 200 region 0:0x0000000017aa0000
[41385.522043] GICv3: CPU3: found redistributor 300 region 0:0x0000000017ac0000
[41385.525185] GICv3: CPU4: found redistributor 400 region 0:0x0000000017ae0000
[41385.527049] GICv3: CPU5: found redistributor 500 region 0:0x0000000017b00000
[41385.528764] GICv3: CPU6: found redistributor 600 region 0:0x0000000017b20000
[41385.530522] GICv3: CPU7: found redistributor 700 region 0:0x0000000017b40000
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
kgsl receives the entry id or the GPU address through vm_pgoff. It is used
during mmap and never needed again. But this pgoff has a different meaning
in other parts of the kernel, and not resetting it to zero leaves room for
wrong assumptions when a page is later unmapped from the vma.
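The fix boils down to clearing the field once mmap is done with it;
roughly:

  /* at the end of the kgsl mmap handler, after pgoff has been consumed */
  vma->vm_pgoff = 0;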
Change-Id: Ia81c64a77456caf168c6bd23bdf5755c3f3ee31c
Signed-off-by: Puranam V G Tejaswi <pvgtejas@codeaurora.org>
Signed-off-by: Rohan Sethi <rohsethi@codeaurora.org>
IO-coherent cached buffers can be reclaimed. There is a possibility that
an mmap() request for a reclaimed buffer results in a NULL pointer
dereference in vm_insert_page(). So, skip the VM page insert operation for
IO-coherent cached buffers in mmap(); these buffers can be handled at CPU
page fault time in the kgsl vmfault handler.
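A sketch of the intended control flow; entry_is_io_coherent_cached() is an
illustrative helper, not the real predicate:

  /* skip the eager vm_insert_page() loop for these buffers; the kgsl
   * vmfault handler will populate the pages on first CPU access */
  if (entry_is_io_coherent_cached(entry))
          return 0;

  for (i = 0; i < num_pages; i++)
          vm_insert_page(vma, vma->vm_start + (i << PAGE_SHIFT), pages[i]);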
Change-Id: I6cf29af2d37de736df27f745fc9bceb01cb097e6
Signed-off-by: Hareesh Gundu <quic_hareeshg@quicinc.com>
Signed-off-by: Rohan Sethi <quic_rohsethi@quicinc.com>
Currently, we don't trigger the dispatcher timer while doing an inline
submission. This breaks long IB detection. So, trigger the dispatcher
timer during an inline submission.
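Roughly, the inline submission path gains the same timer kick the normal
path already has; the names follow the adreno dispatcher but should be
treated as illustrative:

  /* arm the dispatcher timer so long-running IBs are still detected */
  mod_timer(&dispatcher->timer,
            jiffies + msecs_to_jiffies(adreno_drawobj_timeout));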
Change-Id: I36397cea3f6ea4393789cd4b54a2258e189f4b13
Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
[@RealJohnGalt] update idle timer usage for 4.14
Currently a workqueue is used to process the event work. In certain
scenarios, such as when most CPU cores are busy, there can be a
significant delay between the actual timestamp retire event and when the
work is processed by the events workqueue, as workqueues cannot have RT
priority. Hence, use a kthread instead of a workqueue for the event work.
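A sketch of the switch, assuming a dedicated RT kthread worker; the worker
name, priority value, and work field/function names are illustrative:

  struct sched_param param = { .sched_priority = 16 };
  struct kthread_worker *worker;

  worker = kthread_create_worker(0, "kgsl-events");
  if (IS_ERR(worker))
          return PTR_ERR(worker);
  sched_setscheduler(worker->task, SCHED_FIFO, &param);

  /* queue the event work on the RT kthread instead of a workqueue */
  kthread_init_work(&device->events_work, kgsl_events_work_fn);
  kthread_queue_work(worker, &device->events_work);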
Change-Id: Ib1ec7fa1ec3a133d03104c9a029dcc4c06180609
Signed-off-by: Puranam V G Tejaswi <quic_pvgtejas@quicinc.com>
[@RealJohnGalt] adapted to 4.14
Add some sugar to make kgsl_mem_entry_get() return the pointer it
just got, which makes the code cleaner.
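The sugar looks something like this (sketch):

  static inline struct kgsl_mem_entry *
  kgsl_mem_entry_get(struct kgsl_mem_entry *entry)
  {
          if (entry && kref_get_unless_zero(&entry->refcount))
                  return entry;
          return NULL;
  }

  /* callers can now chain the call: */
  return kgsl_mem_entry_get(entry);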
Change-Id: Ic0dedbadd3bb755a9ad1906eab04aeb02d5da53b
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
The sync fence callbacks are allocated in kernel context. Use the
GFP_KERNEL flag instead of GFP_ATOMIC to permit the allocation to
sleep if required.
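The change itself is a one-flag swap; kcb below is an illustrative
callback struct:

  /* before: needlessly atomic */
  kcb = kzalloc(sizeof(*kcb), GFP_ATOMIC);

  /* after: process context, so the allocation may sleep */
  kcb = kzalloc(sizeof(*kcb), GFP_KERNEL);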
Change-Id: I2099229cb1fb734e87e4bff0ddc38a2ced2c03ea
Signed-off-by: Lynus Vaz <quic_lvaz@quicinc.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
the call must not touch scheduling parameters (nice or priority). This
is particularly handy for uclamp when used in conjunction with
SCHED_FLAG_KEEP_POLICY, as that allows issuing a syscall that only
impacts uclamp values.
However, sched_setattr always checks whether the priorities and nice
values passed in sched_attr are valid first, even if those never get
used down the line. This is useless at best since userspace can
trivially bypass this check to set the uclamp values by specifying low
priorities. However, it is cumbersome to do so as there is no single
expression of this that skips both RT and CFS checks at once. As such,
userspace needs to query the task policy first with e.g. sched_getattr
and then set sched_attr.sched_priority accordingly. This is racy and
slower than a single call.
As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
is specified, simply inherit them in this case to match the policy
inheritance of SCHED_FLAG_KEEP_POLICY.
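The gist of the change, as a sketch; get_params() is modeled on the
upstream helper, details abridged:

  static void get_params(struct task_struct *p, struct sched_attr *attr)
  {
          if (task_has_dl_policy(p))
                  __getparam_dl(p, attr);
          else if (task_has_rt_policy(p))
                  attr->sched_priority = p->rt_priority;
          else
                  attr->sched_nice = task_nice(p);
  }

  /* in the sched_setattr() path, before the priority/nice validation: */
  if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS)
          get_params(p, &attr);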
Reported-by: Wei Wang <wvw@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-3-qperret@google.com
Bug: 190237315
(cherry picked from commit f4dddf9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ifdbc9262b82c7f5c0d34952ece07770a53e3f6a5
[panchajanya1999: adapt for k4.14]
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: mcdofrenchfreis <xyzevan@androidist.net>
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:
commit a5a0809 ("cpufreq: schedutil: Make iowait boost more energy
efficient")
where the boost value is expected to be:
- doubled at each successive wakeup from IO,
starting from the minimum frequency supported by a CPU
- reset when a CPU is not updated for more than one tick
by either disabling the IO wait boost or resetting its value to the
minimum frequency if this new update requires an IO boost.
This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequences of wakeups from IO operations.
However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more than one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.
This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.
Fix this issue in sugov_set_iowait_boost() by:
- first checking the IO wait boost reset conditions,
to possibly reset the boost value
- then applying the correct IO boost value,
if required by the caller
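A sketch of the reordered flow; field and helper names are abridged:

  static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
                                 bool set_iowait_boost)
  {
          s64 delta_ns = time - sg_cpu->last_update;

          /* reset only when the CPU went un-updated for more than a tick */
          if (delta_ns <= TICK_NSEC)
                  return false;

          sg_cpu->iowait_boost = set_iowait_boost ? sg_cpu->iowait_boost_min : 0;
          return true;
  }

  /* in sugov_set_iowait_boost(): check reset first, then apply/double */
  if (sugov_iowait_reset(sg_cpu, time, flags & SCHED_CPUFREQ_IOWAIT))
          return;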
Fixes: a5a0809 (cpufreq: schedutil: Make iowait boost more energy
efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
[ Upstream commit b89997aa88f0b07d8a6414c908af75062103b8c9 ]
Being called for each dequeue, util_est reduces the number of its updates
by filtering out when the EWMA signal is different from the task util_avg
by less than 1%. It is a problem for a sudden util_avg ramp-up. Due to the
decay from a previous high util_avg, EWMA might now be close enough to
the new util_avg. No update would then happen while it would leave
ue.enqueued with an out-of-date value.
Taking both util_est members, EWMA and enqueued, into consideration for
the filtering ensures an up-to-date value for both.
This is for now an issue only for the trace probe that might return the
stale value. Functional-wise, it isn't a problem, as the value is always
accessed through max(enqueued, ewma).
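Sketched against the upstream change, the filter now has to see both
deltas inside the margin before skipping the update (names abridged):

  /* skip the update only if BOTH signals are within the ~1% margin */
  last_enqueued_diff -= ue.enqueued;
  last_ewma_diff = ue.enqueued - ue.ewma;
  if (within_margin(last_ewma_diff, UTIL_EST_MARGIN) &&
      within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
          return;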
This problem has been observed using LISA's UtilConvergence:test_means on
the sd845c board.
No regression observed with Hackbench on sd845c and Perf-bench sched pipe
on hikey/hikey960.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
While calculating the utilization of a CPU during task placement in fbt(),
the current code doesn't take uclamp into account, which can lead to
selection of an incorrect CPU for the task when uclamp restrictions
are in place for the task.
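A sketch of the fix, assuming the usual uclamp helper is available in
this tree:

  /* clamp the raw CPU utilization with the task's uclamp constraints
   * before comparing it against CPU capacity in fbt() */
  util = uclamp_rq_util_with(cpu_rq(cpu), cpu_util_without(cpu, p), p);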
Change-Id: I8371affe3b37733d222e5c57953e53f91fc19a53
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
Make use of the existing need_idle feature to incorporate upstream's
latency-sensitive tasks.
Change-Id: Ie1513187d024b93c8b619d9e0a35d84195488696
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
* This fixes an issue with stuck scrolling specifically on the FOD area.
Test: Open Chrome, search for a long website, scroll through the
FOD area and check that it doesn't get stuck.