Workload descriptor format ========================== ctx.engine.duration_us.dependency.wait,... ..[-]|*.[/][...].<0|1>,... B. M..[|]... P|S|X.. d|p|s|t|q|a|T.,... b..[|]. w|W..[/]... f For duration a range can be given from which a random value will be picked before every submit. Since this and seqno management requires CPU access to objects, care needs to be taken in order to ensure the submit queue is deep enough these operations do not affect the execution speed unless that is desired. Additional workload steps are also supported: 'd' - Adds a delay (in microseconds). 'p' - Adds a delay relative to the start of previous loop so that the each loop starts execution with a given period. 's' - Synchronises the pipeline to a batch relative to the step. 't' - Throttle every n batches. 'q' - Throttle to n max queue depth. 'f' - Create a sync fence. 'a' - Advance the previously created sync fence. 'B' - Turn on context load balancing. 'b' - Set up engine bonds. 'M' - Set up engine map. 'P' - Context priority. 'S' - Context SSEU configuration. 'T' - Terminate an infinite batch. 'w' - Working set. (See Working sets section.) 'W' - Shared working set. 'X' - Context preemption control. Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS Example (leading spaces must not be present in the actual file): ---------------------------------------------------------------- 1.VCS1.3000.0.1 1.RCS.500-1000.-1.0 1.RCS.3700.0.0 1.RCS.1000.-2.0 1.VCS2.2300.-2.0 1.RCS.4700.-1.0 1.VCS2.600.-1.1 p.16000 The above workload described in human language works like this: 1. A batch is sent to the VCS1 engine which will be executing for 3ms on the GPU and userspace will wait until it is finished before proceeding. 2-4. Now three batches are sent to RCS with durations of 0.5-1.5ms (random duration range), 3.7ms and 1ms respectively. The first batch has a data dependency on the preceding VCS1 batch, and the last of the group depends on the first from the group. 5. Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms RCS batch. 6. This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms VCS2 batch. 7. Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the same step the tool is told to wait for the batch completes before proceeding. 8. Finally the tool is told to wait long enough to ensure the next iteration starts 16ms after the previous one has started. When workload descriptors are provided on the command line, commas must be used instead of new lines. Multiple dependencies can be given separated by forward slashes. Example: 1.VCS1.3000.0.1 1.RCS.3700.0.0 1.VCS2.2300.-1/-2.0 I this case the last step has a data dependency on both first and second steps. Batch durations can also be specified as infinite by using the '*' in the duration field. Such batches must be ended by the terminate command ('T') otherwise they will cause a GPU hang to be reported. Sync (fd) fences ---------------- Sync fences are also supported as dependencies. To use them put a "f" token in the step dependecy list. N is this case the same relative step offset to the dependee batch, but instead of the data dependency an output fence will be emitted at the dependee step, and passed in as a dependency in the current step. Example: 1.VCS1.3000.0.0 1.RCS.500-1000.-1/f-1.0 In this case the second step will have both a data dependency and a sync fence dependency on the previous step. Example: 1.RCS.500-1000.0.0 1.VCS1.3000.f-1.0 1.VCS2.3000.f-2.0 VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch. Example: 1.RCS.500-1000.0.0 f 2.VCS1.3000.f-1.0 2.VCS2.3000.f-2.0 1.RCS.500-1000.0.1 a.-4 s.-4 s.-4 VCS1 and VCS2 batches have an input sync fence dependecy on the standalone fence created at the second step. They are submitted ahead of time while still not runnable. When the second RCS batch completes the standalone fence is signaled which allows the two VCS batches to be executed. Finally we wait until the both VCS batches have completed before starting the (optional) next iteration. Submit fences ------------- Submit fences are a type of input fence which are signalled when the originating batch buffer is submitted to the GPU. (In contrary to normal sync fences, which are signalled when completed.) Submit fences have the identical syntax as the sync fences with the lower-case 's' being used to select them. Eg: 1.RCS.500-1000.0.0 1.VCS1.3000.s-1.0 1.VCS2.3000.s-2.0 Here VCS1 and VCS2 batches will only be submitted for executing once the RCS batch enters the GPU. Context priority ---------------- P.1.-1 1.RCS.1000.0.0 P.2.1 2.BCS.1000.-2.0 Context 1 is marked as low priority (-1) and then a batch buffer is submitted against it. Context 2 is marked as high priority (1) and then a batch buffer is submitted against it which depends on the batch from context 1. Context priority command is executed at workload runtime and is valid until overriden by another (optional) same context priority change. Actual driver ioctls are executed only if the priority level has changed for the context. Context preemption control -------------------------- X.1.0 1.RCS.1000.0.0 X.1.500 1.RCS.1000.0.0 Context 1 is marked as non-preemptable batches and a batch is sent against 1. The same context is then marked to have batches which can be preempted every 500us and another batch is submitted. Same as with context priority, context preemption commands are valid until optionally overriden by another preemption control change on the same context. Engine maps ----------- Engine maps are a per context feature which changes the way engine selection is done in the driver. Example: M.1.VCS1|VCS2 This sets up context 1 with an engine map containing VCS1 and VCS2 engine. Submission to this context can now only reference these two engines. Engine maps can also be defined based on class like VCS. Example: M.1.VCS This sets up the engine map to all available VCS class engines. Context load balancing ---------------------- Context load balancing (aka Virtual Engine) is an i915 feature where the driver will pick the best engine (most idle) to submit to given previously configured engine map. Example: B.1 This enables load balancing for context number one. Engine bonds ------------ Engine bonds are extensions on load balanced contexts. They allow expressing rules of engine selection between two co-operating contexts tied with submit fences. In other words, the rule expression is telling the driver: "If you pick this engine for context one, then you have to pick that engine for context two". Syntax is: b... Engine list is a list of one or more sibling engines separated by a pipe character (eg. "VCS1|VCS2"). There can be multiple bonds tied to the same context. Example: M.1.RCS|VECS B.1 M.2.VCS1|VCS2 B.2 b.2.VCS1.RCS b.2.VCS2.VECS This tells the driver that if it picked RCS for context one, it has to pick VCS1 for context two. And if it picked VECS for context one, it has to pick VCS1 for context two. If we extend the above example with more workload directives: 1.DEFAULT.1000.0.0 2.DEFAULT.1000.s-1.0 We get to a fully functional example where two batch buffers are submitted in a load balanced fashion, telling the driver they should run simultaneously and that valid engine pairs are either RCS + VCS1 (for two contexts respectively), or VECS + VCS2. This can also be extended using sync fences to improve chances of the first submission not getting on the hardware after the second one. Second block would then look like: f 1.DEFAULT.1000.f-1.0 2.DEFAULT.1000.s-1.0 a.-3 Context SSEU configuration -------------------------- S.1.1 1.RCS.1000.0.0 S.2.-1 2.RCS.1000.0.0 Context 1 is configured to run with one enabled slice (slice mask 1) and a batch is sumitted against it. Context 2 is configured to run with all slices (this is the default so the command could also be omitted) and a batch submitted against it. This shows the dynamic SSEU reconfiguration cost beween two contexts competing for the render engine. Slice mask of -1 has a special meaning of "all slices". Otherwise any integer can be specifying as the slice mask, but beware any apart from 1 and -1 can make the workload not portable between different GPUs. Working sets ------------ When used plainly workload steps can create implicit data dependencies by relatively referencing another workload steps of a batch buffer type. Fourth field contains the relative data dependncy. For example: 1.RCS.1000.0.0 1.BCS.1000.-1.0 This means the second batch buffer will be marked as having a read data dependency on the first one. (The shared buffer is always marked as written to by the dependency target buffer.) This will cause a serialization between the two batch buffers. Working sets are used where more complex data dependencies are required. Each working set has an id, a list of buffers, and can either be local to the workload or shared within the cloned workloads (-c command line option). Lower-case 'w' command defines a local working set while upper-case 'W' defines a shared version. Syntax is as follows: w..[/]... For size a byte size can be given, or suffix 'k', 'm' or 'g' can be used (case insensitive). Prefix in the format of "n" can also be given to create multiple objects of the same size. Ranges can also be specified using the - syntax. Examples: w.1.4k - Working set 1 with a single 4KiB object in it. W.2.2M/32768 - Working set 2 with one 2MiB and one 32768 byte object. w.3.10n4k/2n20000 - Working set 3 with ten 4KiB and two 20000 byte objects. w.4.4n4k-1m - Working set 4 with four objects of random size between 4KiB and 1MiB. Working set objects can be referenced as data dependency targets using the new 'r'/'w' syntax. Simple example: w.1.4k W.2.1m 1.RCS.1000.r1-0/w2-0.0 1.BCS.1000.r2-0.0 In this example the RCS batch is reading from working set 1 object 0 and writing to working set 2 object 0. BCS batch is reading from working set 2 object 0. Because working set 2 is of a shared type, should two instances of the same workload be executed (-c 2) then the 1MiB buffer would be shared and written and read by both clients creating a serialization point. Apart from single objects, ranges can also be given as depenencies: w.1.10n4k 1.RCS.1000.r1-0-9.0 Here the RCS batch has a read dependency on working set 1 objects 0 to 9.