|
PCI devices' B/D/F numbers are hexadecimal, so they can contain letters.
Fix the regexp so that B/D/F numbers containing hex letters also match.
Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Swathi Dhanavanthri <swathi.dhanavanthri@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
In a few cases, we hit a timeout where no process appears to be
deadlocked (i.e. tasks stuck in 'D' with intertwined stacks) but
everything appears to be running happily. Often, they appear to be
fighting over the shrinker, so one naturally presumes we are running low
on memory. But for tests that were designed to run with ample memory to
spare, that is a little disconcerting and I would like to know where the
memory actually went.
sysrq('m'): Will dump current memory info to your console
Sounds like that should do the trick.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Acked-by: Petri Latvala <petri.latvala@intel.com>
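For reference, triggering such a dump from userspace can be sketched as below; this is a minimal illustration of the /proc/sysrq-trigger interface, not the runner's actual code, and it assumes root privileges plus CONFIG_MAGIC_SYSRQ:

```c
#include <stdio.h>

/* Write a sysrq command key to /proc/sysrq-trigger; 'm' dumps memory
 * info to the console, 't' dumps all task states. Returns -1 if the
 * file cannot be opened (not root, or sysrq unavailable). */
static int trigger_sysrq(char key)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputc(key, f) == key) ? 0 : -1;
	fclose(f);
	return ret;
}
```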
|
|
To help verify correct deployment, add a --version flag to igt_runner
that just prints the IGT-version text, the same text the tests would print.
Note that only igt_runner gained the --version flag. igt_resume and
igt_results don't do fancy flag handling; they only accept the
directory to operate on as their single argument.
v2: Depend on version.h (CI)
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Include the reason why we are dumping the task state (test timeout) in
the kmsg log prior to the task state. Hopefully this helps when reading
the dump.
v2: Use asprintf to combine the strings into one, to avoid error-prone
manual string handling and to enjoy one single write() into the kmsg.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
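The single-write() idea can be sketched as follows; the message text and the <3> priority prefix are assumptions for illustration, not the runner's actual strings:

```c
#define _GNU_SOURCE  /* for asprintf */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Build the whole message with asprintf so that a single write() emits
 * one /dev/kmsg record, instead of stitching pieces together manually. */
static int announce_timeout(int fd, const char *testname)
{
	char *msg = NULL;
	int len = asprintf(&msg, "<3>runner: timeout, dumping task state (test: %s)\n",
			   testname);
	int ret;

	if (len < 0)
		return -1;
	ret = (write(fd, msg, (size_t)len) == len) ? 0 : -1;
	free(msg);
	return ret;
}
```

One write() matters because /dev/kmsg treats each write as one atomic log record, so interleaved partial writes cannot split the message.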
|
|
Add one missing fdatasync() when starting a subtest.
Fixes: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/81
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Instead of warning every single time we overflow the read from kmsg,
warning just once per test is enough.
v2: Just suppress the multiple s/underflow/overflow/ messages. Having a
buffer smaller than a single kmsg packet is unlikely.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
Now that the IGT tests have a mechanism for signaling broken testing
conditions, we can stop the run at the first test that has noticed,
and possibly has triggered, that state.
Traditionally the run would have continued with that test failing, and
the side effects would trickle down into the other tests, causing a
lot of skips/fails.
v2: extra explanations, small cleanup (Petri)
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
Make this bit of code more readable, and allow reusing it in the
following patch.
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
Instead of reading one record at a time between select() calls and
tainted-checks etc, use the same at-the-end dmesg dumper whenever
there's activity in /dev/kmsg. It's possible that the occasional chunk
of missing dmesg we're sometimes hitting is due to reading too slowly,
especially if there's a huge gem traceback.
Also print a clear message if we hit a log buffer underrun so we know
about it.
Reference: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/79
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
If a machine is hard-hanging or otherwise gets rebooted at just the
wrong time, intermediary output files get created but nothing ever
gets written to them. That yields results that are completely empty
and hard to categorize, or sometimes even to detect automatically.
Handle this corner case explicitly with a custom text explaining what
might have happened, to prod result analysis towards fixing the real
issue instead of wondering whether test result processing is faulty.
The race for getting empty files is easier to hit than it seems: the
files get created by the runner before calling exec(), and there's
plenty of time to hit a really hard crash.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
A new config option, --per-test-timeout, sets a time a single test
cannot exceed without getting itself killed. The time resets when
starting a subtest or a dynamic subtest, so an execution with
--per-test-timeout=20 can indeed go over 20 seconds as long as it
launches a dynamic subtest within that time.
As a bonus, verbose log level from runner now also prints dynamic
subtest begin/result.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Instead of aiming for inactivity_timeout and splitting that into
suitable intervals for watchdog pinging, replace the whole logic with
one-second select() timeouts and checking if we're reaching a timeout
condition based on current time and the time passed since a particular
event, be it the last activity or the time of signaling the child
processes.
With the refactoring, we gain a couple of new features for free:
- use-watchdog now makes sense even without
inactivity-timeout. Previously use-watchdog was silently ignored if
inactivity-timeout was not set. Now watchdogs will always be used if
so configured, effectively ensuring the device gets rebooted if
userspace dies without other timeout tracking.
- Killing tests early on kernel taint now happens even
earlier. Previously on an inactive system we possibly waited for some
tens of seconds before checking kernel taints.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
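The structure of the new loop can be sketched like this; `timed_out` and `wait_for_activity` are illustrative names, not the runner's actual functions:

```c
#include <sys/select.h>
#include <time.h>

/* Timeout decisions are made by comparing wall-clock time against the
 * moment of the last interesting event, instead of counting intervals. */
static int timed_out(time_t since, double limit_seconds)
{
	return difftime(time(NULL), since) >= limit_seconds;
}

/* One iteration of the monitoring loop: a fixed one-second select()
 * tick, after which timeouts (and watchdog pings) can be checked. */
static int wait_for_activity(int fd)
{
	fd_set rfds;
	struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };  /* the 1 s tick */

	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	return select(fd + 1, &rfds, NULL, NULL, &tv);  /* 0 = tick, >0 = data */
}
```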
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
In a very rudimentary and undocumented manner, testlist files can now
have dynamic subtests specified. This feature is intended for very
special cases, and the main supported mode of operation with testlist
files is still the CI-style "run it all no matter what".
The syntax for testlist files is:
igt@binary@subtestname@dynamicsubtestname
As dynamic subtests are not easily listable, no helpers for
generating such testlists are implemented.
If running in multiple-mode, subtests with dynamic subtests specified
will run in single-mode instead.
Closes: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/45
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
If we're checking for taints, we kill the test as soon as we notice a
taint. Out of the box, such killing will get marked as such and yields
a 'timeout' result, which is misleading. The test didn't spend too
much time, it just did nasties.
Make sure taint-killing results in an 'incomplete' result
instead. It's still not completely truthful for the state of the
testing but closer than a 'timeout'. And stands out more in CI result
analysis.
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
If the kernel is tainted, it stays tainted, so make sure the execution
monitoring still reaches the output collectors and other fd change
handlers.
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
If someone wants to execute tests without aborting when tainted, they
get all their tests just straight up killed on the first taint without
actually aborting execution. Obey their wishes and keep running.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
If the kernel OOPSed during the test, it is unlikely to ever complete.
Furthermore, we have the reason why it won't complete and so do not need
to burden ourselves with the full stacktrace of every process -- or at
least we have a more pressing bug to fix before worrying about the
system deadlock.
v2: Log the post-taint killing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
See also: commit 0e6457f1bfe2 ("lib: Don't dump log buffer when
dynamic subtest failure is inherited")
This is an explicit top-to-bottom test that we don't get an
incorrect warn result for an innocent dynamic subtest. It is tested
here in runner_test instead of testing in lib/tests that the extra
lines are not printed, because not printing the extra lines is an
implementation detail that might change later. The issue is, after
all, about test results parsed by igt_runner.
v2: Squash adding the new mockup test binary to this commit
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
We have some handcrafted test binaries in runner/testdata/ for runner
testing, and hardcoded numbers for the total number of subtests and
test binaries all over the runner's unit tests. Replace the magic
numbers with clear defines so new tests can easily be added.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
A simple test output with numbers from 1 to 255, both in plain text
form and as a single byte with that particular value.
Note that the json spec doesn't require \u-encoding for characters
other than '"', '\' and the range U+0000 to U+001F; the raw
non-\u-encoded UTF-8 in the reference.json file for bytes 128 and up
is what libjson-c outputs for those codepoints, and it is valid.
The validity of the json file can be verified with iconv, i.e.
$ iconv -f UTF-8 reference.json -o /dev/null && echo it is utf-8
v2: Rebase over dynamic subtest tests, trivial
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Sometimes tests output garbage (e.g. due to extreme occurrences of
https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/55) but we
need to present the garbage as results.
We already ignore any test output after the first \0, and for the rest
of the bytes that are not directly UTF-8 as-is, we can quite easily
represent them with two-byte UTF-8 encoding.
libjson-c already expects the strings fed to its
json_object_new_string* functions to be UTF-8.
v2: Rebase, adjust for dynamic subtest parsing
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
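The two-byte representation the commit mentions can be sketched as below: any raw byte from 128 up is emitted as the UTF-8 encoding of the codepoint U+0080..U+00FF (the helper name is an assumption, not the runner's):

```c
/* Represent one raw output byte as valid UTF-8. ASCII passes through;
 * anything >= 0x80 becomes the two-byte encoding of U+0080..U+00FF. */
static int encode_byte(unsigned char b, char out[2])
{
	if (b < 0x80) {
		out[0] = (char)b;
		return 1;
	}
	out[0] = (char)(0xC0 | (b >> 6));    /* leading byte of a 2-byte sequence */
	out[1] = (char)(0x80 | (b & 0x3F));  /* continuation byte */
	return 2;
}
```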
|
|
Multiple different subtests can have a dynamic subtest with the same
name. Add a test to make sure we correctly delimit the output parsing.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Dynamic subtests now include more output, change the dynamic subtest
parsing test accordingly.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
On an assertion failure, the string "Subtest xyz failed" is
printed. Make sure we don't match that for SUBTEST_RESULT, or the
equivalent for dynamic subtests.
Parsing the results already explicitly searched for the proper result
line; the difference is in how we delimit output up to "the next line
of interest".
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
The first dynamic part now starts its reported output at the beginning
of its subtest's output, and the last dynamic part ends its reported
output at the end of its subtest's output.
v2: Rebase
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
v2: Rebase
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
v2: Rebase
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
v2: Use an enum to select a pattern string for asprintf (Arek)
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
The default timeout is 30s.
Recently we have seen more `ninja test` invocations failing due to
runner tests timing out. This happens in slower environments, e.g. when
running the cross-compiled binaries through qemu.
Freedesktop CI/CD runners are shared machines with a lot of cores, and
they accept many parallel jobs coming from multiple projects. There
are no resource guarantees, which leads to the sporadic slowness.
Runner tests take 10x longer than the next slowest test invoked by
`ninja test` on a typical run. They also seem more prone to resource
thrashing by other processes on the machine, probably due to heavier
reliance on disk IO.
Let's just give them proportional leeway when it comes to timing out.
Cc: Petri Latvala <petri.latvala@intel.com>
Issue: https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/197
Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
While the originally chosen timeout for process killing (2 seconds)
was way too short, waiting indefinitely is suboptimal as well. We're
seeing cases where the test is stuck for possibly hours in
uninterruptible sleep (IO). Instead, wait a fixed, fairly long period
of 2 minutes: a process taking even that long to die means the machine
is in a bad enough state to require a good kicking and rebooting.
v2:
- Abort quicker if kernel is tainted (Chris)
- Correctly convert process-exists check with kill() to process-does-not-exist
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
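The corrected liveness check from v2 can be sketched as follows; `process_is_gone` is an illustrative name, while kill() with signal 0 is the standard existence probe:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* The wait loop must check that the process does NOT exist any more:
 * kill(pid, 0) failing with ESRCH means the pid is gone. Checking the
 * opposite condition ("process exists") inverts the loop's logic. */
static int process_is_gone(pid_t pid)
{
	return kill(pid, 0) == -1 && errno == ESRCH;
}
```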
|
|
The split into timeout intervals was made to accommodate watchdogs
that cannot use a timeout as high as we wanted. Actually using that
feature requires us to ping the watchdog every interval, even though
we handle the actual timeout only after all intervals are used up.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
If binary 'bin' has a subtest 'sub', which has dynamic subtests 'foo'
and 'bar', results.json will now have "subtests" by the names
igt@bin@sub@foo and igt@bin@sub@bar, with data as expected of normal
subtests.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Don't add timestamps when printing that we cannot execute a binary
from a child (post fork-failed-execv). Timestamps were meant for the
runner's direct output only; this path was accidentally converted as
well.
v2: Rephrase commit message (Arek)
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Since we actually include the output before the subtest begins now,
add it to the reference.jsons where applicable.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
Instead of searching back and forth for proper lines, first find all
lines that we could be interested in with one pass through the output,
and use the positions of found lines to delimit the extracted outputs.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
If a test was attempted but didn't actually exist, make it result in a
skip instead of a notrun. This is to differentiate them from the tests
that we didn't even attempt, like tests after getting a machine
hang. This will improve handling of subtests for GEM engines that
don't exist.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Acked-by: Martin Peres <martin.peres@linux.intel.com>
|
|
When our watchdog expires and we declare that the test has timed out,
we send it a signal to terminate. The test will produce a backtrace
upon receipt of that signal, but oftentimes (especially as we test and
debug the kernel), the test is hung inside the kernel. So we need the
kernel state to see where the live/deadlock is occurring. Enter sysrq-t
to show the backtraces of all processes (as the one we are searching
for may be sleeping).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
|
|
If an output file (out.txt or err.txt) is completely empty, we handle
the parsing just fine as is, but we end up assuming that if the
journal says we have a subtest, that subtest printed that it
started. We have one case where out.txt was empty and all other files
were intact (ran out of disk?).
All other paths that expect certain texts handle failures to find them
properly, apart from subtest result processing, which happily passed
along a NULL pointer as a string to json. After handling that case,
the processing of said weird case proceeded fine and produced correct
results.
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|
|
A minor refinement to remove the trailing spaces after converting the
NUL-terminators to spaces.
v2: Beware the crafty filename entirely composed of spaces.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
|
|
/proc/$pid/cmdline is the entire argv[] including NUL-terminators.
Replace the NULs with spaces so we get a better idea of who the
signaler was, as often it is a subprocess (such as a child of sudo
or, worse, java).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
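Combined with the trailing-space refinement from the adjacent commit, the conversion can be sketched as below (the function name is an assumption):

```c
#include <stddef.h>

/* /proc/<pid>/cmdline is argv[] joined with NULs: turn the separators
 * into spaces for printing, then trim the trailing ones (including the
 * NUL that terminates the last argument). */
static void cmdline_to_string(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] == '\0')
			buf[i] = ' ';
	while (len > 0 && buf[len - 1] == ' ')
		buf[--len] = '\0';
}
```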
|
|
We want to know who sent us the fatal signal, for there are plenty of
fingers to go around.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
|
|
If the network goes down while testing, CI tends to interpret that as
the device being down, cutting its power after a while. This causes an
incomplete result for an innocent test, increasing noise in the
results.
A new value for --abort-on-monitored-error, "ping", uses liboping to
ping a host configured in .igtrc once after each test execution, and
aborts the run if there is no reply within a hardcoded amount of time.
v2:
- Use a higher timeout
- Allow hostname configuration from environment
v3:
- Use runner_c_args for holding c args for runner
- Handle runner's meson options in runner/meson.build
- Instead of one ping with 20 second timeout, ping with 1 second timeout
for a duration of 20 seconds
v4:
- Rebase
- Use now-exported igt_load_igtrc instead of copypaste code
- Use define for timeout, clearer var name for single attempt timeout
Signed-off-by: Petri Latvala <petri.latvala@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
|