|
This is just a simple change to reflect the actual state. No rewording
yet, just a simple substitution in the most visible places: docs, README
and paths.
There are probably some leftovers here and there, but we can let them be
for now, this is already well overdue.
v2: fixed a couple of obvious leftovers pointed out by Petri
Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Acked-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
The dist tarball doesn't build otherwise, as indicated by distcheck.
Fixes: 9e55cca889cd ("wsim: Add rtavg balancer")
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
|
|
This reverts commit b348107351c14cc7371ca65eea067d9a88ab7048.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
|
|
Too used to kbuild where you only specify objects.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
A tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.
Unfinished, but it shows promise, so sending it out for early feedback.
v2:
* Load workload descriptors from files. (also -w)
* Help text.
* Calibration control if needed. (-t)
* NORELOC | LUT to eb flags.
* Added sample workload to wsim/workload1.
v3:
* Multiple parallel different workloads (-w -w ...).
* Multi-context workloads.
* Variable (random) batch length.
* Load balancing (round robin and queue depth estimation).
* Workloads delays and explicit sync steps.
* Workload frequency (period) control.
v4:
* Fixed queue-depth estimation by creating separate batches
per engine when qd load balancing is on.
* Dropped separate -s cmd line option. It can turn itself on
automatically when needed.
* Keep a single status page and lie about the write hazard
as suggested by Chris.
* Use batch_start_offset for controlling the batch duration.
(Chris)
* Set status page object cache level. (Chris)
* Moved workload description to a README.
* Tidied example workloads.
* Some other cleanups and refactorings.
v5:
* Master and background workloads (-W / -w).
* Single batch per step is enough even when balancing. (Chris)
* Use hars_petruska_f54_1_random IGT functions and seed to zero
at start. (Chris)
* Use WC cache domain when WC mapping. (Chris)
* Keep seqnos 64-bytes apart in the status page. (Chris)
* Add workload throttling and queue-depth throttling commands.
(Chris)
v6:
* Added two more workloads.
* Merged RT balancer from Chris.
v7:
* Merged NO_RELOC patch from Chris.
* Added missing RT balancer to help text.
TODO list:
* Fence support.
* Batch buffer caching (re-use pool).
* Better error handling.
* Less 1980's workload parsing.
* More workloads.
* Threads?
* ... ?
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
|
|
Just a silly benchmark to stress prime_fd_to_handle and
prime_handle_to_fd.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
And include poll(dmabuf) for comparison.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Makefile.sources
Replace the automake-specific name of listings in Makefile.sources
with something not automake-specific.
Signed-off-by: Robert Foss <robert.foss@collabora.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
Use the HAS_INTEL automake flag to avoid building benchmarks that won't
compile unless libdrm_intel is available in the build system.
Signed-off-by: Robert Foss <robert.foss@collabora.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
Primarily to check that we have the WC read/write disparity.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
If we specify an unobtainable alignment (e.g., 63 bits) the kernel will
evict the object from the GTT and fail to rebind it. We can use this
to measure how long it takes to move objects around in the GTT by
running execbuf followed by the unbind. For small objects, this will be
dominated by the nop execution time, but for larger objects this will be
ratelimited by how fast we can rewrite the PTE.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Instead of measuring the wakeup latency of a GEM client, we turn the
tables here and ask what is the wakeup latency of a normal process
competing with GEM. In particular, a realtime process that expects
deterministic latency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Superseded by gem_latency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The goal is to measure how long it takes for clients waiting on results to
wake up after a buffer completes, and in doing so ensure scalability of
the kernel to a large number of clients.
We spawn a number of producers. Each producer submits a busyload to the
system and records in the GPU the BCS timestamp of when the batch
completes. Then each producer spawns a number of waiters, who wait upon
the batch completion and measure the current BCS timestamp register and
compare against the recorded value.
By varying the number of producers and consumers, we can study different
aspects of the design, in particular how many wakeups the kernel does
for each interrupt (end of batch). The more wakeups on each batch, the
longer it takes for any one client to finish.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Benchmark the overhead of changing from GTT to CPU domains and vice
versa. Effectively this measures the cost of clflushes, and how well the
driver can avoid them.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
One scenario under recent discussion is that of having a thundering herd
in i915_wait_request - where the overhead of waking up every waiter for
every batchbuffer was significantly impacting customer throughput. This
benchmark tries to replicate something to that effect by having a large
number of consumers generating a busy load (a large copy followed by
lots of small copies to generate lots of interrupts) and tries to wait
upon all the consumers concurrently (to reproduce the thundering herd
effect). To measure the overhead, we have a bunch of cpu hogs - less
kernel overhead in waiting should allow more CPU throughput.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Execute N blits and time how long they take to complete, to measure both
GPU-limited bandwidth and submission overhead.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Allow specification of the many different busyness modes and relocation
interfaces, along with the number of buffers to use and relocations.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
These benchmarks are first-and-foremost development tools, not aimed at
general users. As such they should not be installed into the system-wide
bin/ directory, but installed into libexec/.
v2: Now actually install beneath ${libexec}
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
This slightly idealises the behaviour of clients with the aim of
measuring the kernel overhead of different workloads. This test focuses
on the cost of relocating batchbuffers.
A trace file is generated with an LD_PRELOAD intercept around
execbuffer, which we can then replay at our leisure. The replay replaces
the real buffers with a set of empty ones so the only thing that the
kernel has to do is parse the relocations. But without a real workload
we lose the impact of having to rewrite active buffers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
A basic measurement, how fast can we create and populate an object with
backing storage?
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Measure the overhead of execution when doing nothing, switching between
a pair of contexts, or creating a new context every time.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
By measuring both the query and the event round trip time, we can make a
reasonable estimate of how long it takes for the query to send the
vblank following an interrupt.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
This adds a small benchmark for the new userptr functionality.
Apart from basic surface creation and destruction, also tested is the
impact of having userptr surfaces in the process address space. The reason
for that is the impact of MMU notifiers on common address space
operations like munmap(), which are per-process.
v2:
* Moved to benchmarks.
* Added pointer read/write tests.
* Changed output to say iterations per second instead of
operations per second.
* Multiply result by batch size for multi-create* tests
for a more comparable number with create-destroy test.
v3:
* Use ALIGN macro.
* Catchup with big lib/ reorganization.
* Removed unused code and one global variable.
* Fixed up some warnings.
v4:
* Fixed feature test, does not matter here but makes it
consistent with gem_userptr_blits and clearer.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Brad Volkin <bradley.d.volkin@intel.com>
Reviewed-by: Brad Volkin <bradley.d.volkin@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
They build fine so give them some exposure.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Brad Volkin <bradley.d.volkin@intel.com>
Signed-off-by: Thomas Wood <thomas.wood@intel.com>
|