Workload descriptor format
==========================

ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
w|W.<uint>.<str>[/<str>]...
f

For the duration field a range can be given, from which a random value will be
picked before every submit. Since this, and seqno management, require CPU access
to objects, care needs to be taken to ensure the submit queue is deep enough
that these operations do not affect the execution speed, unless that is desired.

Additional workload steps are also supported:

 'd' - Adds a delay (in microseconds).
 'p' - Adds a delay relative to the start of the previous loop so that each loop
       starts execution with a given period.
 's' - Synchronises the pipeline to a batch relative to the step.
 't' - Throttle every n batches.
 'q' - Throttle to n max queue depth.
 'f' - Create a sync fence.
 'a' - Advance the previously created sync fence.
 'B' - Turn on context load balancing.
 'b' - Set up engine bonds.
 'M' - Set up engine map.
 'P' - Context priority.
 'S' - Context SSEU configuration.
 'T' - Terminate an infinite batch.
 'w' - Working set. (See Working sets section.)
 'W' - Shared working set.
 'X' - Context preemption control.

Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS
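
As an illustrative sketch (not taken from a real workload file), a loop which
submits a 1ms RCS batch, then idles for 500us, and limits itself to at most two
outstanding batches could look like:

  1.RCS.1000.0.0
  d.500
  q.2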

Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------

  1.VCS1.3000.0.1
  1.RCS.500-1000.-1.0
  1.RCS.3700.0.0
  1.RCS.1000.-2.0
  1.VCS2.2300.-2.0
  1.RCS.4700.-1.0
  1.VCS2.600.-1.1
  p.16000

Described in plain language, the above workload works like this:

  1.   A batch is sent to the VCS1 engine which will be executing for 3ms on the
       GPU and userspace will wait until it is finished before proceeding.
  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
       duration range), 3.7ms and 1ms respectively. The first batch has a data
       dependency on the preceding VCS1 batch, and the last of the group depends
       on the first from the group.
  5.   Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
       RCS batch.
  6.   This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
       VCS2 batch.
  7.   Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
       same step the tool is told to wait until the batch completes before
       proceeding.
  8.   Finally the tool is told to wait long enough to ensure the next iteration
       starts 16ms after the previous one has started.

When workload descriptors are provided on the command line, commas must be used
instead of new lines.
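
For example, the first two steps of the example above could be passed on the
command line as:

  1.VCS1.3000.0.1,1.RCS.500-1000.-1.0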

Multiple dependencies can be given separated by forward slashes.

Example:

  1.VCS1.3000.0.1
  1.RCS.3700.0.0
  1.VCS2.2300.-1/-2.0

In this case the last step has a data dependency on both the first and second
steps.

Batch durations can also be specified as infinite by using the '*' in the
duration field. Such batches must be ended by the terminate command ('T')
otherwise they will cause a GPU hang to be reported.
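
For example, assuming 'T' takes the same relative step offset as the other step
commands, an infinite RCS batch running for roughly one second before being
terminated could be sketched as:

  1.RCS.*.0.0
  d.1000000
  T.-2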

Sync (fd) fences
----------------

Sync fences are also supported as dependencies.

To use them put an "f<N>" token in the step dependency list. N in this case is
the same relative step offset to the dependee batch, but instead of the data
dependency an output fence will be emitted at the dependee step, and passed in
as a dependency in the current step.

Example:

  1.VCS1.3000.0.0
  1.RCS.500-1000.-1/f-1.0

In this case the second step will have both a data dependency and a sync fence
dependency on the previous step.

Example:

  1.RCS.500-1000.0.0
  1.VCS1.3000.f-1.0
  1.VCS2.3000.f-2.0

VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.

Example:

  1.RCS.500-1000.0.0
  f
  2.VCS1.3000.f-1.0
  2.VCS2.3000.f-2.0
  1.RCS.500-1000.0.1
  a.-4
  s.-4
  s.-4

VCS1 and VCS2 batches have an input sync fence dependency on the standalone
fence created at the second step. They are submitted ahead of time while still
not runnable. When the second RCS batch completes, the standalone fence is
signalled, which allows the two VCS batches to be executed. Finally we wait
until both VCS batches have completed before starting the (optional) next
iteration.

Submit fences
-------------

Submit fences are a type of input fence which are signalled when the originating
batch buffer is submitted to the GPU. (In contrast to normal sync fences, which
are signalled on completion.)

Submit fences use the same syntax as sync fences, with the lower-case 's'
selecting them. Eg:

  1.RCS.500-1000.0.0
  1.VCS1.3000.s-1.0
  1.VCS2.3000.s-2.0

Here the VCS1 and VCS2 batches will only be submitted for execution once the RCS
batch enters the GPU.

Context priority
----------------

  P.1.-1
  1.RCS.1000.0.0
  P.2.1
  2.BCS.1000.-2.0

Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer
is submitted against it which depends on the batch from context 1.

The context priority command is executed at workload runtime and is valid until
overridden by another (optional) priority change on the same context. Actual
driver ioctls are executed only if the priority level has changed for the
context.

Context preemption control
--------------------------

  X.1.0
  1.RCS.1000.0.0
  X.1.500
  1.RCS.1000.0.0

Context 1 is marked as having non-preemptible batches and a batch is submitted
against it. The same context is then marked to have batches which can be
preempted every 500us and another batch is submitted.

As with context priority, context preemption commands are valid until optionally
overridden by another preemption control change on the same context.

Engine maps
-----------

Engine maps are a per context feature which changes the way engine selection is
done in the driver.

Example:

  M.1.VCS1|VCS2

This sets up context 1 with an engine map containing the VCS1 and VCS2 engines.
Submissions to this context can now only reference these two engines.

Engine maps can also be defined based on class like VCS.

Example:

  M.1.VCS

This sets up the engine map to all available VCS class engines.

Context load balancing
----------------------

Context load balancing (aka Virtual Engine) is an i915 feature where the driver
will pick the best (most idle) engine to submit to, given the previously
configured engine map.

Example:

  B.1

This enables load balancing for context number one.

Engine bonds
------------

Engine bonds are an extension to load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied together with
submit fences. In other words, the rule expression tells the driver: "If you
pick this engine for context one, then you have to pick that engine for context
two".

Syntax is:
  b.<context>.<engine_list>.<master_engine>

Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").

There can be multiple bonds tied to the same context.

Example:

  M.1.RCS|VECS
  B.1
  M.2.VCS1|VCS2
  B.2
  b.2.VCS1.RCS
  b.2.VCS2.VECS

This tells the driver that if it picked RCS for context one, it has to pick VCS1
for context two. And if it picked VECS for context one, it has to pick VCS2 for
context two.

If we extend the above example with more workload directives:

  1.DEFAULT.1000.0.0
  2.DEFAULT.1000.s-1.0

We get to a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that valid engine pairs are either RCS + VCS1 (for two contexts respectively),
or VECS + VCS2.

This can also be extended using sync fences to reduce the chance of the first
submission reaching the hardware after the second one. The second block would
then look like:

  f
  1.DEFAULT.1000.f-1.0
  2.DEFAULT.1000.s-1.0
  a.-3

Context SSEU configuration
--------------------------

  S.1.1
  1.RCS.1000.0.0
  S.2.-1
  2.RCS.1000.0.0

Context 1 is configured to run with one enabled slice (slice mask 1) and a batch
is submitted against it. Context 2 is configured to run with all slices (this is
the default, so the command could also be omitted) and a batch is submitted
against it.

This shows the dynamic SSEU reconfiguration cost between two contexts competing
for the render engine.

A slice mask of -1 has the special meaning of "all slices". Otherwise any
integer can be specified as the slice mask, but beware that any value apart from
1 and -1 can make the workload not portable between different GPUs.

Working sets
------------

When used plainly, workload steps can create implicit data dependencies by
relatively referencing other workload steps of the batch buffer type. The fourth
field contains the relative data dependency. For example:

  1.RCS.1000.0.0
  1.BCS.1000.-1.0

This means the second batch buffer will be marked as having a read data
dependency on the first one. (The shared buffer is always marked as written to
by the dependency target buffer.) This will cause a serialization between the
two batch buffers.

Working sets are used where more complex data dependencies are required. Each
working set has an id, a list of buffers, and can either be local to the
workload or shared within the cloned workloads (-c command line option).

Lower-case 'w' command defines a local working set while upper-case 'W' defines
a shared version. Syntax is as follows:

  w.<id>.<size>[/<size>]...

For size, a byte count can be given, or the suffixes 'k', 'm' or 'g' can be used
(case insensitive). A prefix in the format "<int>n<size>" can also be given to
create multiple objects of the same size.

Ranges can also be specified using the <min>-<max> syntax.

Examples:

  w.1.4k - Working set 1 with a single 4KiB object in it.
  W.2.2M/32768 - Working set 2 with one 2MiB and one 32768 byte object.
  w.3.10n4k/2n20000 - Working set 3 with ten 4KiB and two 20000 byte objects.
  w.4.4n4k-1m - Working set 4 with four objects of random size between 4KiB and
		1MiB.

Working set objects can be referenced as data dependency targets using the new
'r'/'w' syntax. Simple example:

  w.1.4k
  W.2.1m
  1.RCS.1000.r1-0/w2-0.0
  1.BCS.1000.r2-0.0

In this example the RCS batch is reading from working set 1 object 0 and writing
to working set 2 object 0. BCS batch is reading from working set 2 object 0.

Because working set 2 is of the shared type, should two instances of the same
workload be executed (-c 2), the 1MiB buffer would be shared, and being written
and read by both clients it would create a serialization point.

Apart from single objects, ranges can also be given as dependencies:

  w.1.10n4k
  1.RCS.1000.r1-0-9.0

Here the RCS batch has a read dependency on working set 1 objects 0 to 9.