path: root/tests/core_hotunplug.c
Age  Commit message  Author
2022-05-18  lib/igt_kmod: make it less pedantic with audio driver removal  Mauro Carvalho Chehab
Current Linux kernels don't report whether the audio driver binds to the DRM driver. As this is CPU specific, allow the audio driver unload to fail without skipping the IGT tests on legacy kernels, as unloading may not be mandatory there. On new kernels, where lsmod properly displays the dependency between the audio and DRM drivers, skip the core hotunplug test if it fails to unload the audio driver, as this is unrelated to the DRM driver and could simply be because some userspace code is using the audio device while the IGT test is running. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
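The dependency that lsmod displays comes from the "used by" column of /proc/modules. As a rough illustration only (this is not the lib/igt_kmod code; the helper name `module_line_has_user` and the sample lines in the usage note are hypothetical), a check for one module listing another as its user could look like this:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if `user` appears in the "used by" column of one
 * /proc/modules line that describes `module`.
 *
 * Line layout: name size refcount users state address
 * where users is "-" or a list in which every entry ends with a comma.
 * Hypothetical helper, not part of IGT.
 */
static bool module_line_has_user(const char *line, const char *module,
				 const char *user)
{
	char name[64], users[256];
	size_t user_len = strlen(user);
	const char *p;

	/* Skip the size and refcount columns, keep name and users. */
	if (sscanf(line, "%63s %*s %*s %255s", name, users) != 2)
		return false;
	if (strcmp(name, module) != 0)
		return false;

	/* Walk the comma-separated user list entry by entry. */
	for (p = users; *p; ) {
		size_t len = strcspn(p, ",");

		if (len == user_len && strncmp(p, user, len) == 0)
			return true;
		p += len;
		if (*p == ',')
			p++;
	}

	return false;
}
```

Given a made-up line such as `i915 3000000 5 snd_hda_intel, Live 0x0000000000000000`, the helper reports snd_hda_intel as a user of i915. On kernels that do not expose the dependency the column would read `-` and the helper returns false, which is the case where the commit tolerates an unload failure.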
2022-05-18  core_hotunplug: fix audio unbind logic  Mauro Carvalho Chehab
The current audio unbind logic is wrong: it expects the audio driver to depend on i915 only on Haswell, Broadwell and DG1. That doesn't match the Kernel driver, where snd-hda-audio binds on i915 also on Skylake, Braswell and whenever it needs to use pm runtime from the DRM driver. Now that lib/igt-kmod has gained improved support for audio unbind, update core_hotunplug to benefit from the newer logic. Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2022-05-18  tests/core_hotunplug: properly finish processes using audio devices  Mauro Carvalho Chehab
Before unloading or unbinding an audio driver, all processes that are using it must be terminated. The current logic looks only for alsactl and ignores other processes, including pulseaudio. Make the logic more general, extending it to any process that could have an open device under /dev/snd. Note that some distros, like Fedora and openSUSE, are now migrating from pulseaudio to pipewire-pulse. Right now, there's no standard distribution-agnostic way to request pipewire-pulse to stop using audio devices, but there's a new patch upstream that will make things easier: https://gitlab.freedesktop.org/pipewire/pipewire/-/commit/6ad6300ec657c88322a8cd6f3548261d3dc05359 It should be available in pipewire-pulse versions 0.3.50 and later. Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
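One way to find "any process that could have an open device under /dev/snd" is to walk the symlinks in /proc/<pid>/fd. The sketch below assumes a Linux procfs and uses plain POSIX calls; the helper names `is_snd_path` and `pid_uses_snd` are hypothetical and this is not the code the commit added to IGT.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* True if `path` names a node below /dev/snd. */
static bool is_snd_path(const char *path)
{
	return strncmp(path, "/dev/snd/", strlen("/dev/snd/")) == 0;
}

/*
 * Return 1 if process `pid` holds an open file below /dev/snd,
 * 0 if it does not, -1 if its fd table cannot be read
 * (no such process, or insufficient permissions).
 */
static int pid_uses_snd(int pid)
{
	char dir_path[64], link_path[PATH_MAX], target[PATH_MAX];
	struct dirent *ent;
	int found = 0;
	DIR *dir;

	snprintf(dir_path, sizeof(dir_path), "/proc/%d/fd", pid);
	dir = opendir(dir_path);
	if (!dir)
		return -1;

	while (!found && (ent = readdir(dir))) {
		ssize_t len;

		if (ent->d_name[0] == '.')
			continue;	/* skip "." and ".." */

		snprintf(link_path, sizeof(link_path), "%s/%s",
			 dir_path, ent->d_name);
		/* Each entry is a symlink to the file the fd refers to. */
		len = readlink(link_path, target, sizeof(target) - 1);
		if (len < 0)
			continue;	/* fd closed in the meantime */
		target[len] = '\0';

		found = is_snd_path(target);
	}

	closedir(dir);
	return found;
}
```

A caller would iterate over the numeric entries of /proc and terminate every process for which `pid_uses_snd()` returns 1 before attempting the driver unload.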
2022-04-14  tests/core_hotunplug: apply audio unload for all i915 hw  Kai Vehmanen
The HDA audio driver does not support dynamic hotplug of a HDA codec and that applies also to the display HDA codec. To test unbind of i915, audio driver must be first unbound and/or unloaded. This is the only way to ensure clean unbind, and possibility to cleanly reestablish connection between audio and i915. Without this fix, the core_hotunplug test only works if the audio controller is runtime suspended when the test is run. This is very brittle and subject to timing races, differences in system configuration and so forth. A more predictable test is to explicitly unload the audio and in case some entity is using the audio driver, this is flagged with a clear, easily understandable error. References: https://gitlab.freedesktop.org/drm/intel/-/issues/1602 Cc: Uma Shankar <uma.shankar@intel.com> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
2021-11-26  tests/core_hotunplug: Show device PCI bus address on errors  Janusz Krzysztofik
Strange -ENODEV responses from the kernel to i915 driver rebind attempts have been sporadically observed. After successfully unbinding the driver from a device by writing a string representing its PCI bus address to /sys/bus/pci/drivers/i915/unbind, the test then fails while writing the same device PCI bus address string to /sys/bus/pci/drivers/i915/bind. It is unlikely that the device disappears from the bus when this happens: in that case the test would attempt to rescan the bus, which it doesn't. To shed more light on what may be going on, extend error messages emitted by the test so that they also print the device PCI bus address string it uses. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
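The unbind/bind sequence described above amounts to writing the device's PCI bus address to a driver control file. A minimal sketch of that write, with the address included in the error message as this commit proposes, might look as follows. The helper names `driver_ctl_path` and `write_bus_addr` are hypothetical, and the real test goes through IGT sysfs helpers rather than raw open()/write().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "/sys/bus/pci/drivers/<driver>/<ctl>"; return 0, or -1 if truncated. */
static int driver_ctl_path(char *buf, size_t size, const char *driver,
			   const char *ctl)
{
	int n = snprintf(buf, size, "/sys/bus/pci/drivers/%s/%s", driver, ctl);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Write the PCI bus address `addr` to the control file `ctl_path`.
 * Returns 0 on success or a negative errno value, and reports the
 * address together with the error so the failing device is identifiable.
 */
static int write_bus_addr(const char *ctl_path, const char *addr)
{
	size_t len = strlen(addr);
	int fd, err = 0;

	fd = open(ctl_path, O_WRONLY);
	if (fd < 0) {
		err = -errno;
	} else {
		ssize_t ret = write(fd, addr, len);

		if (ret < 0)
			err = -errno;	/* e.g. -ENODEV from the kernel */
		else if ((size_t)ret != len)
			err = -EIO;	/* short write */
		close(fd);
	}

	if (err)
		fprintf(stderr, "%s: writing \"%s\" failed: %s\n",
			ctl_path, addr, strerror(-err));

	return err;
}
```

With an address such as `0000:00:02.0`, a rebind would be `write_bus_addr()` on the `unbind` path followed by the same call on the `bind` path; an -ENODEV result from the second call is the symptom this commit tries to make easier to diagnose.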
2021-10-08  tests/core_hotunplug: Drop log level for preventively unloading snd driver  Jeevan B
Change igt_warn to igt_info when unloading the snd module before unbinding i915 until the WA is fixed, as this "todo reminder" is flagged as a BAT failure and shows up in all sorts of top-bugs lists for various platforms. v2: Update commit message Signed-off-by: Jeevan B <jeevan.b@intel.com> Reviewed-by: Uma Shankar <uma.shankar@intel.com>
2021-07-08  tests/core_hotplug: Convert to intel_ctx_t  Jason Ekstrand
Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Zbigniew Kempczyński <zbigniew.kempczynski@intel.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
2021-06-29  tests/core_hotunplug: Unload snd driver before i915 unbind  Uma Shankar
Unload the snd module before unbinding i915. Audio holds a wakeref which triggers a warning otherwise, resulting in below warning and test failure. Currently HSW/BDW and DG1 are the platforms affected, can be extended to other platforms as well.
<4> [137.001006] ------------[ cut here ]------------
<4> [137.001010] i915 0000:00:02.0: i915 raw-wakerefs=1 wakelocks=1 on cleanup
<4> [137.001076] WARNING: CPU: 0 PID: 1417 at drivers/gpu/drm/i915/intel_runtime_pm.c:619 intel_runtime_pm_driver_release+0x56/0x60 [i915]
<4> [137.001078] Modules linked in: snd_hda_intel i915 snd_hda_codec_hdmi mei_hdcp intel_pmt_telemetry intel_pmt_core x86_pkg_temp_thermal coretemp smsc75xx crct10dif_pclmul usbnet crc32_pclmul mii ghash_clmulni_intel kvm_intel e1000e snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core ptp pps_core mei_me snd_pcm mei prime_numbers intel_pmt [last unloaded: i915]
<4> [137.001095] CPU: 0 PID: 1417 Comm: kworker/u16:7 Tainted: G U 5.9.0-g79478e23b1878-DII_3204+ #1
<4> [137.001097] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.3197.A00.2005110542 05/11/2020
<4> [137.001102] Workqueue: events_unbound async_run_entry_fn
<4> [137.001140] RIP: 0010:intel_runtime_pm_driver_release+0x56/0x60 [i915]
<4> [137.001142] Code: fd 10 4c 8b 67 50 4d 85 e4 75 03 4c 8b 27 e8 91 59 58 e1 45 89 e8 89 e9 4c 89 e2 48 89 c6 48 c7 c7 b0 f3 48 a0 e8 55 25 ef e0 <0f> 0b eb b5 66 0f 1f 44 00 00 48 8b 87 88 45 ff ff b9 02 00 00 00
<4> [137.001144] RSP: 0018:ffffc900007dbd68 EFLAGS: 00010286
<4> [137.001147] RAX: 0000000000000000 RBX: ffff88847338bea8 RCX: 0000000000000001
<4> [137.001148] RDX: 0000000080000001 RSI: ffffffff823efa86 RDI: 00000000ffffffff
<4> [137.001150] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
<4> [137.001152] R10: 000000009bda34df R11: 00000000e2a8a89a R12: ffff88849b209880
<4> [137.001153] R13: 0000000000000001 R14: ffff88847338bea8 R15: ffff88847338fcc0
<4> [137.001155] FS: 0000000000000000(0000) GS:ffff8884a0600000(0000) knlGS:0000000000000000
<4> [137.001157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [137.001159] CR2: 00007fc03597dd88 CR3: 0000000006610005 CR4: 0000000000770ef0
<4> [137.001160] PKRU: 55555554
<4> [137.001162] Call Trace:
<4> [137.001199] i915_drm_suspend_late+0x102/0x120 [i915]
<4> [137.001204] ? pci_pm_poweroff_late+0x30/0x30
<4> [137.001209] dpm_run_callback+0x61/0x270
<4> [137.001214] __device_suspend_late+0x8b/0x180
<4> [137.001217] async_suspend_late+0x15/0x90
<4> [137.001220] async_run_entry_fn+0x34/0x160
<4> [137.001224] process_one_work+0x26c/0x5c0
<4> [137.001231] worker_thread+0x37/0x380
<4> [137.001235] ? process_one_work+0x5c0/0x5c0
<4> [137.001238] kthread+0x149/0x170
<4> [137.001241] ? kthread_park+0x80/0x80
<4> [137.001246] ret_from_fork+0x1f/0x30
<4> [137.001256] irq event stamp: 2329
Cc: Kai Vehmanen <kai.vehmanen@linux.intel.com> Cc: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Signed-off-by: Uma Shankar <uma.shankar@intel.com> Acked-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
2021-05-27  lib/i915/gem_create: Add gem_create_ext  Andrzej Turko
Add a wrapper for gem_create_ext ioctl (a version of gem_create that accepts extensions). In preparation for the driver change implementing it, a local definition of its id and necessary structs have been added, which are to be erased as soon as those definitions appear in the i915_drm.h file. The new ioctl wrapper is added to a separate file. For consistency the wrapper of the old ioctl, gem_create is moved from ioctl_wrappers to gem_create. Signed-off-by: Andrzej Turko <andrzej.turko@linux.intel.com> Cc: Zbigniew Kempczynski <zbigniew.kempczynski@intel.com> Cc: Dominik Grzegorzek <dominik.grzegorzek@intel.com> Cc: Petri Latvala <petri.latvala@intel.com> Cc: Chris P Wilson <chris.p.wilson@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Petri Latvala <petri.latvala@intel.com>
2021-04-14  tests/core_hotunplug: Add perf health check  Janusz Krzysztofik
Sometimes CI reports skips of perf subtests when run subsequently after core_hotunplug. That may be an indication of issues with restoring device perf features on driver (hot)rebind. Detect device perf support at test start and check if still available after driver rebind. If that fails, a post-subtest device recovery step restores the device perf support so no subsequently executed tests are affected. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Acked-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
2021-03-23  tests/core_hotunplug: Be more specific on sysfs vs. debugfs issues  Janusz Krzysztofik
Messages displayed on sysfs health check failures don't say which subtree actually failed: the device sysfs itself or the device debugfs. That information would make debugging easier. Be more specific when reporting sysfs health check failures. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Adam Miszczak <adam.miszczak@linux.intel.com>
2021-01-21  tests/core_hotunplug: Reduce debug noise on stdout  Janusz Krzysztofik
Since igt_fixture sections are processed unconditionally regardless of which subtest has been requested, they can now emit a lot of unrelated debug messages which can make the picture less clear. Avoid emitting those messages from outside igt_subtest sections. Move device close status prerequisite checks from igt_fixture sections into subtest execution paths. For simplicity, pass any device close errors, including those from health checks, to next sections via a single .fd.drm data structure field. Moreover, postpone initial device health check until first actually selected subtest is started. In order to let that subtest skip on unsuccessful initial health check, not fail, move the decision whether to fail or skip on error from the health check helper to its users. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
2020-10-15  tests/core_hotunplug: Take care of closing fences before failing  Janusz Krzysztofik
The test was designed to keep track of open device file descriptors for safe driver unbind on recovery from a failed subtest. In that context, fences introduced by commit 1fbd127bd4e1 ("core_hotplug: Teach the healthcheck how to check execution status") can affect device recovery as much as an open device file if not closed before unbind. Moreover, forced GPU reset which used to be applied on recovery from a failed i915 GPU health check is no longer reachable since a GPU hang hopefully detected by the new health check algorithm can now break the whole recovery procedure prematurely. Refactor local_i915_healthcheck() so it takes care of closing fences and returns a result to its caller instead of long jumping on failures believed to be recoverable. While avoiding use of igt_assert() and friends, report actual source and error code of failures via igt_warn_on_f(). Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk>
2020-10-15  tests/core_hotunplug: Restore i915 debugfs health check  Janusz Krzysztofik
Removal of igt_fork_hang_detector() from local_i915_healthcheck() by commit 1fbd127bd4e1 ("core_hotplug: Teach the healthcheck how to check execution status") resulted in unintentional removal of an important though implicit test feature of detecting, reporting as failures and recovering from potential misses of debugfs subdirs of hot rebound i915 devices. As a consequence, unexpected failures or skips of other unrelated but subsequently run tests have been observed on CI. On the other hand, removal of the debugfs issue detection and subtest failures from right after hot rebinding the driver enabled the better version of the i915 GPU health check fixed by the same commit to detect and report other issues potentially triggered by device late close. Restore the missing test feature by introducing an explicit sysfs health check, not limited to i915, that verifies existence of device sysfs and debugfs areas. Also, split hotrebind/hotreplug scenarios into a pair of each, one that performs the health check right after hot rebind/replug and delegates the device late close step to a follow up recovery phase, while the other one checks device health only after late closing it. v2: Give GPU health check a better chance to detect issues - run it before sysfs health checks. v3: Run sysfs health check on any hardware, not only i915. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
2020-10-06  core_hotplug: Teach the healthcheck how to check execution status  Chris Wilson
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2476 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
2020-09-14  tests/core_hotunplug: Duplicate debug messages in dmesg  Janusz Krzysztofik
The purpose of debug messages displayed by the test is to make identification of a failing subtest phase easier. Since issues exhibited by the test are mostly reported to dmesg, print those debug messages to /dev/kmsg as well. v2: Rebase on upstream. v3: Refresh. v4: Refresh. v5: Refresh. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
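Duplicating a message means emitting it once on the test's own output and once on a second descriptor that, in the real test, would refer to /dev/kmsg. The sketch below only illustrates the idea with plain POSIX calls; the helper name `debug_dup` and the `core_hotunplug: ` prefix are assumptions, and the actual test relies on IGT logging helpers instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Print `msg` on stderr and repeat it, prefixed with the test name,
 * on `log_fd` (an fd opened on /dev/kmsg in a real run, which needs
 * suitable privileges). Returns the number of bytes written to
 * `log_fd`, or -1 on error or if the message does not fit.
 */
static ssize_t debug_dup(int log_fd, const char *msg)
{
	char line[256];
	int n;

	fprintf(stderr, "%s\n", msg);

	n = snprintf(line, sizeof(line), "core_hotunplug: %s\n", msg);
	if (n < 0 || (size_t)n >= sizeof(line))
		return -1;

	/* One write per message keeps it a single record in the log. */
	return write(log_fd, line, (size_t)n);
}
```

Because each message lands in the kernel log next to the driver's own output, a WARNING in dmesg can be matched to the subtest phase that was running when it appeared.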
2020-09-14  tests/core_hotunplug: HSW/BDW audio issue workaround  Janusz Krzysztofik
Unbinding the i915 driver on some Haswell and Broadwell platforms with Azalia audio results in a kernel WARNING on "i915 raw-wakerefs=1 wakelocks=1 on cleanup". The issue can be worked around by manually enabling runtime power management for the conflicting audio adapter. Use that method but also display a warning to preserve visibility of the issue. Also tag the workaround with a FIXME comment. v2: Extend the scope of the workaround over Broadwell Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Petri Latvala <petri.latvala@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Check health both before and after late close  Janusz Krzysztofik
In hot rebind / hot replug subtests, device health is now checked only at the end of the subtest, after late close. If something fails, we may not be able to identify the failing phase of the subtest easily. Run health checks also before late closing the device, right after hot rebind / replug. To still be able to perform late close while also handling cleanup of potential device close misses in health checks, we need to maintain two separate device file descriptors in our private data structure. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Add 'lateclose before restore' variants  Janusz Krzysztofik
If a GPU gets wedged during driver rebind or device re-plug for some reason, current hotunbind/hotunplug test variants may time out before lateclose phase, resulting in incomplete CI reports. Add new test variants which close the device before restoring it. Also rename old variants to more adequate hotrebind/hotreplug-lateclose and perform health checks both before and after late close. v2: Rebase on upstream. v3: Refresh, - further rename hotunbind/hotunplug-lateclose to hotunbind-rebind and hotunplug-rescan respectively, then add two more variants under the old names which only exercise late close, leaving rebind / rescan to be cared of in the post-subtest recovery phase, - also update descriptions of unmodified subtests for consistency. v4: Refresh, - drop subtests with no health checks, adjust timeouts in successors, - perform health checks of hot restored devices also before late close, - in order to be able to safely run a health check while still keeping an unbound / unplugged device instance open, also preserve the open device fd, not only a close error, - adjust subtest descriptions. v5: Refresh, - split out pre-lateclose health checks and related changes, introduced in v4, to a separate patch. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: More thorough i915 healthcheck and recovery  Janusz Krzysztofik
The test now assumes the i915 driver is able to identify potential hardware or driver issues while rebinding to a device and indicate them by marking the GPU wedged. Should that assumption prove wrong, the health check phase of the test would happily succeed while potentially leaving the device in an unusable state. That would not only give us falsely positive test results but could also potentially affect subsequently run applications. Then, we should examine health of the exercised device more thoroughly and try harder to recover it from potentially detected stalls. We could use a gem_test_engine() library function which submits and asserts successful execution of a NOP batch on each physical engine. Unfortunately, on failure this function jumps out of an IGT test section it is called from, while we would like to continue with recovery steps, possibly not adding another level of test section group nesting. Moreover, the function opens the device again and doesn't close the extra file descriptor before the jump, while we care for being able to close the exercised device completely before running certain subtest operations. Then, reimplement the function locally with those issues fixed and use it as an i915 health check. Call it also on test startup so operations performed by the test are never blamed for driver or hardware issues which may potentially exist and be possible to detect on test start. Should the i915 GPU be found unresponsive by the health check called from a recovery section, try harder to recover it to a usable state with a global GPU reset. For still more effective detection of GPU hangs, use a hang detector provided by IGT library. However, replace the library signal handler with our own implementation that doesn't jump out of the current IGT test section on GPU hang so we are still able to perform the reset and retry.
v2: Skip i915 health check if a GPU hang has been already detected by a previous health check run and not yet recovered with a GPU reset, - take care of stopping a hang detector instance possibly left running by a failed health check attempt. v3: Re-run i915 health check as a first step of i915 recovery (use full GPU reset as a last resort), - prefix i915 health check debug messages with step indicators, - fix spelling error in a comment. v4: Unbind the driver from an unhealthy device before recovery, - drop caches on i915 health check completion. v5: Refresh on top of a new patch added to the series which already unbinds the driver from a device found unhealthy and runs health checks on test startup, - no need to drop caches from the i915 health check, it seems to do its job correctly without that. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Also check health of render device node  Janusz Krzysztofik
Failures of subsequent tests accessing the render device node have been observed on CI after late close of a hot rebound device. Extend our health check over the render node to detect that condition and start our recovery phase with unbinding the driver from the device found faulty. Also, check health of both device nodes before running any subtests. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Explicitly ignore unused return values  Janusz Krzysztofik
Some return values are not useful and can be ignored. Wrap those cases inside igt_ignore_warn(), not only to make sure compilers are happy but also to clearly document our decisions. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Assert expected device presence/absence  Janusz Krzysztofik
Don't rely on successful write to sysfs control files, assert existence / non-existence of a respective device sysfs node as well. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
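Checking for the existence of a device sysfs node can be done relative to an already open bus directory. The sketch below assumes the bus directory fd and the device bus address are available, as later commits in this log arrange; the helper name `device_present` is hypothetical and the code is an illustration, not the test's own implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * True if an entry named `addr` (e.g. a PCI bus address) exists below
 * the directory open as `bus_fd` (e.g. the PCI bus "devices" node).
 */
static bool device_present(int bus_fd, const char *addr)
{
	return faccessat(bus_fd, addr, F_OK, 0) == 0;
}
```

After a device remove the check is expected to return false, and after a bus rescan it is expected to return true, independently of whether the sysfs write itself reported success.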
2020-09-14  tests/core_hotunplug: Process return values of sysfs operations  Janusz Krzysztofik
Return values of driver bind/unbind / device remove/recover sysfs operations are now ignored. Assert their correctness. v2: Add trailing newlines missing from igt_assert messages. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Let the driver time out essential sysfs operations  Janusz Krzysztofik
The test now arms a timer before performing each driver unbind / rebind or device unplug / bus rescan sysfs operation. Then in case of issues we may prevent the driver from showing us if and how it can handle them. Don't arm the timer before sysfs operations which are essential for a subtest. v2: Refresh, - don't time out on hot driver rebind / hot device restore in *-lateclose variants, those operations haven't been covered by other subtests. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Fail subtests on device close errors  Janusz Krzysztofik
Since health checks are now run from follow-up fixture sections, it is safe to fail subtests without the need to abort the test execution. Do that on device close errors instead of just emitting warnings. v2: Rebase only. v3: Refresh. v4: Refresh. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Recover from subtest failures  Janusz Krzysztofik
Subtests now forcibly call or request igt_abort on failures in order to avoid silently leaving an exercised device in an unusable state. However, a failure inside a subtest doesn't always mean the device is no longer working correctly and reboot is needed. On the other hand, if a subtest just fails without aborting, that doesn't mean in turn the device is healthy. We should still perform a device health check in that case before deciding on next steps. Reuse the 'failure' structure field as a mark which is set before each critical operation is executed that must be followed by a successful health check in order to avoid aborting the test. Then, follow each subtest with its individual igt_fixture section, from where device file descriptors potentially left open are closed, device rediscover or driver rebind operation is run as needed, and finally the health check is run again if the preceding igt_subtest section has exited with the marker set. v2: Start each recovery phase from unconditionally closing file descriptors potentially left open by a subtest before it entered its critical section, - replace igt_require() with 'if() return;' construct in recover() to reduce noise, - replace "subtest failure" message used as a request for healthcheck with a more appropriate "need healthcheck" for clarity, - rebase on current upstream master. v3: Refresh, - move bus_rescan() and driver_bind() function calls back from healthcheck() to recover() so a pure health check can still be called from a subtest if essential, - move failure mark assignments back from subtests to helpers for more adequate abort reason reporting but clean the mark only on health check success, - call cleanup() also from post_healthcheck() in order to close a device file descriptor potentially left open by a failed health check, - reword commit message and update description.
v4: Close exercised device fd before failing a health check run, - don't drop health checks from subtest bodies, their results should always matter. v5: Refresh. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Skip selectively on sysfs close errors  Janusz Krzysztofik
Since we no longer open a device DRM sysfs node, only a PCI one, driver unbind operations are no longer affected by missed or unsuccessful sysfs file close attempts. Skip only affected subtests if that happens. v2: Rebase only. v3: Refresh. v4: Refresh. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Prepare invariant data once per test run  Janusz Krzysztofik
Each subtest now calls a prepare() helper which opens a couple of files required by that subtest. Those files are then closed after use, either directly from the subtest body, or indirectly from inside one of helper functions called during the subtest execution. That approach not only makes life cycle of individual file descriptors difficult to follow but also prevents us from re-running health checks on subtest failures from follow up igt_fixture sections since we may need to retry bus rescan or driver rebind operations. Two of those files - device bus and driver sysfs nodes - are not affected nor interfere with driver unbind / device unplug operations performed by subtests. Then, there is not much sense in closing and reopening those nodes. Open them once at the beginning of a test run, then close them as late as on test completion. The prepare() helper also populates a device bus address string used by driver unbind / rebind operations. Since the bus address of an exercised device never changes, also prepare that string only once at the beginning of a test run. Note that it is the same as the last component of a device filter string which is already resolved and installed from an initial igt_fixture section of the test. Then, initialize the device bus address field of a hotunplug structure instance with a pointer to the respective substring of that filter rather than resolving it again from the device sysfs node pathname. There is one more sysfs node - a DRM device node - now opened by the prepare() helper for subtests which perform device remove operations. That node can't be opened only once at the beginning of a test run because its open file descriptor is no longer usable as soon as a driver unbind operation is performed. On the other hand, it can't be opened easily from inside a device_remove() helper since some subtests just don't open the device so its file descriptor used by igt_sysfs_open() may just not be available. 
However, note that only a PCI sysfs node of the device, not necessarily the DRM one, is actually required for a successful device remove operation, and that node can be opened easily from a bus file descriptor using a device bus address string, both already available. Then, change the semantics of a .fd.sysfs_dev field of the hotunplug structure from DRM to PCI device sysfs file descriptor, then let the device_remove() helper open the device PCI node by itself and store its file descriptor in that field. Also, for still easier access to the device PCI node, use a 'subsystem/devices' sub-node of the PCI device as its bus sysfs location instead of just 'subsystem', then adjust a relative path to the bus 'rescan' function accordingly. A side benefit of using the PCI device sysfs node, not the DRM one, while removing the device is that a future subtest may now easily perform both driver unbind and device remove operations in a row. v2: Rebase only. v3: Refresh. v4: Refresh, still assert a device file descriptor closed cleanly on subtest start, a device sysfs file descriptor still before open. Suggested-by: Michał Winiarski <michal.winiarski@intel.com> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
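The commit message notes that the device bus address equals the last component of the device filter string, so the address field can simply point into that string instead of being resolved again. A minimal sketch of that idea follows; the helper name `bus_addr_from_filter` and the filter value in the usage note are hypothetical, not copied from the test.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the substring that follows the last '/' in
 * `filter`, or NULL if there is no '/' or nothing follows it.
 * No copy is made: the result aliases the filter string, so it stays
 * valid only as long as the filter does.
 */
static const char *bus_addr_from_filter(const char *filter)
{
	const char *slash = strrchr(filter, '/');

	if (!slash || slash[1] == '\0')
		return NULL;

	return slash + 1;
}
```

For an illustrative filter such as `sys:/sys/devices/pci0000:00/0000:00:02.0` the helper yields `0000:00:02.0`, which is the string later written to the driver's unbind and bind control files.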
2020-09-14  tests/core_hotunplug: Handle device close errors  Janusz Krzysztofik
The test now ignores device close errors. Those errors are believed to have no influence on device health so there is no need to process them the same way as we mostly do on errors, i.e., notify CI about a problem via igt_abort. However, those errors may indicate issues with the test itself. Moreover, impact of those errors on operations performed by subtests, like driver unbind or device remove, should be perceived as undefined. Then, we should fail as soon as a device or device sysfs node close error occurs in a subtest and also skip subsequent subtests. However, once a driver unbind or device unplug operation has been attempted by a subtest, we would still like to check the device health. When in a subtest, store results of device close operations for future reference. Reuse file descriptor fields of the hotunplug structure for that. Unless in between of a driver remove or device unplug operation and a successful device health check completion, fail current test section right after a device close error occurs, warn otherwise. If still running, examine device file descriptor fields in subsequent igt_fixture sections and skip on errors. v2: Fix a typo in post_healthcheck function name. v3: Don't fail on close error after successful health check, warn only, - move duplicated device close error messages to helpers. v4: Assert device file descriptors closed cleanly on start of each subtest. v5: Update device status on open for health check if not yet dirty, - move device close debug messages to helper. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Pass errors via a data structure field  Janusz Krzysztofik
A pointer to fatal error messages can be passed around via hotunplug structure, no need to declare it as global. v2: Rebase only. v3: Refresh. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14  tests/core_hotunplug: Maintain a single data structure instance  Janusz Krzysztofik
The following changes to the test are planned: - avoid global variables if possible, - prepare invariant data only once per test run, - skip subsequent subtests after device close errors, - allow subtests to fail on errors and try to recover from those failures in follow up igt_fixture sections instead of aborting. For that to be possible, maintain a single instance of hotunplug structure at igt_main level and pass it down to subtests. v2: Commit description refreshed. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14tests/core_hotunplug: Assert successful device filter applicationJanusz Krzysztofik
The return value of igt_device_filter_add(), which represents the number of successfully installed device filters, is currently ignored. Fail if it is not 1. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14tests/core_hotunplug: Consolidate duplicated debug messagesJanusz Krzysztofik
Some debug messages which designate specific test operations, or at least their greater parts, always read the same, no matter which subtest they are called from. Emit them, possibly updated with subtest-specified modifiers, from inside the respective helpers instead of duplicating them in subtest bodies. v2: Rebase only. v3: Refresh and extend over new case (local_drm_open_driver), - allow callers to specify a message suffix as well where applicable. v4: Rename prefix/suffix string arguments to the more meaningful when/why. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14tests/core_hotunplug: Clean up device open error handlingJanusz Krzysztofik
We don't use drm_open_driver() since in case of an i915 device it keeps an extra file descriptor of the exercised device open for exit handler use, while we would like to be able to close the device completely before running certain test operations. Instead, we call __drm_open_driver() and handle its result ourselves. Unlike drm_open_driver() which skips on device open errors, we always fail or abort the test in such case. Moreover, we don't ensure that the i915 driver is idle before starting subtests like drm_open_driver() does. Skip instead of failing on initial device open error. Also, call gem_quiescent_gpu() if an i915 device is detected. For subsequent device opens, define a local helper that fails on error and use it. If we think we need to abort the test execution on device open error, set our failure marker first to trigger the abort from a follow-up igt_fixture section. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-09-14tests/core_hotunplug: Constify dev_bus_addr stringJanusz Krzysztofik
Device bus address structure field is always initialized with a pointer to a substring of the device sysfs path and never used for its modification. Declare it as a constant string. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
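A minimal sketch of the idea described above, assuming the bus address is the last component of the device sysfs path. The helper name `bus_addr_from_sysfs_path` and the example path are hypothetical and not taken from the test; the point is only that a pointer into an existing string, never used to modify it, can be declared `const char *`.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the last path component of
 * sysfs_path. The result points into the caller's string, so it is
 * declared const and must not outlive or modify that string.
 */
static const char *bus_addr_from_sysfs_path(const char *sysfs_path)
{
	const char *slash;

	assert(sysfs_path);

	slash = strrchr(sysfs_path, '/');

	/* No slash at all: treat the whole string as the address. */
	return slash ? slash + 1 : sysfs_path;
}
```

Because the returned pointer aliases the sysfs path, the path buffer has to stay alive for as long as the bus address is in use.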
2020-09-14tests/core_hotunplug: Use igt_assert_fd()Janusz Krzysztofik
There is a new library helper that asserts validity of open file descriptors. Use it instead of open coding. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
2020-05-07lib/i915: Split igt_require_gem() into i915/Chris Wilson
igt_require_gem() is a peculiarity of i915/, move it out of the core. Similar opportunistic move of gem_reopen_driver() and gem_quiescent_gpu(). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
2020-05-05lib: Support multiple filtersArkadiusz Hiler
This patch brings back support for multiple filters that was in the original series by Zbyszek. We can now take multiple, semicolon separated filters. Right now the tests are using only the first filter. v2: drop unnecessary check before for-loop (Petri) Cc: Petri Latvala <petri.latvala@intel.com> Cc: Zbigniew Kempczyński <zbigniew.kempczynski@intel.com> Signed-off-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com> Reviewed-by: Petri Latvala <petri.latvala@intel.com>
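To illustrate what "multiple, semicolon separated filters" means for a consumer that only looks at the first one, here is a rough sketch. It is not the library's parser: `count_filters` and `first_filter_len` are hypothetical names, and skipping empty segments is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: count non-empty segments of a ';'-separated filter spec. */
static int count_filters(const char *spec)
{
	int count = 0;

	assert(spec);

	while (*spec) {
		size_t len = strcspn(spec, ";");

		if (len > 0)
			count++;

		spec += len;
		if (*spec == ';')
			spec++;		/* step over the separator */
	}

	return count;
}

/* Hypothetical: length of the first segment, i.e. the only filter used. */
static size_t first_filter_len(const char *spec)
{
	assert(spec);

	return strcspn(spec, ";");
}
```

For a spec of two filters joined by `;`, `count_filters()` returns 2 while `first_filter_len()` covers only the text before the first separator.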
2020-04-17tests: Add a test for device hot unplugJanusz Krzysztofik
There is a test which verifies unloading of i915 driver module but no test exists that checks how a driver behaves when it gets unbound from a device or when the device gets unplugged. Implement such a test using the sysfs interface. Two minimalistic subtests - "unbind-rebind" and "unplug-rescan" - perform the named operations on a DRM device which is believed to be not in use. Another pair of subtests named "hotunbind-lateclose" and "hotunplug-lateclose" do the same on a DRM device while keeping its file descriptor open and close it thereafter. v2: Run a subprocess with dummy_load instead of external command (Antonio). v3: Run dummy_load from the test process directly (Antonio). v4: Run dummy_load from inside subtests (Antonio). v5: Try to restore the device to a working state after each subtest (Petri, Daniel). v6: Run workload inside an igt helper subprocess so resources consumed by the workload are cleaned up automatically on workload subprocess crash, without affecting test results, - move the igt helper with workload back from subtests to initial fixture so workload crash also does not affect test results, - other cleanups suggested by Katarzyna and Chris. v7: No changes.
v8: Move workload functions back from fixture to subtests, - register different actions and different workloads in respective tables and iterate over those tables while enumerating subtests, - introduce new subtest flavors by simply omitting module unload step, - instead of simply requesting bus rescan or not, introduce action specific device recovery helpers, required specifically with those new subtests not touching the module, - split workload functions in two parts, one spawning the workload, the other waiting for its completion, - for the new subtests not requiring module unload, run workload functions directly from the test process and use new workload completion wait functions in place of subprocess completion wait, - take more control over logging, longjumps and exit codes in workload subprocesses, - add some debug messages for easy progress watching, - move function API descriptions on top of respective typedefs. v9: All changes after Daniel's comments - thanks! - flatten the code, don't try to create a midlayer (Daniel), - provide minimal subtests that don't even keep device open (Daniel), - don't use driver unbind in more advanced subtests (Daniel), - provide subtests with different level of resources allocated during device unplug (Daniel), - provide subtests which check driver behavior after device hot unplug (Daniel). v10: Rename variables and function arguments to something that indicates they're file descriptors (Daniel), - introduce a data structure that contains various file descriptors and a helper function to set them all (Daniel), - fix strange indentation (Daniel), - limit scope to first three subtests as the initial set of tests to merge (Daniel). v11: Fix typos in some comments, - use SPDX license identifier, - include a per-patch changelog in the commit message (Daniel).
v12: We don't use SPDX license identifiers nor GPL-2.0 in IGT (Petri), - avoid chipset, make sure we reopen the same device (Chris), - rename subtest "drm_open-hotunplug" to "hotunplug-lateclose", - add subtest "hotunbind-lateclose" (less affected by IOMMU issues), - move some redundant code to helpers, - reorder some helpers, - reword some messages and comments, - clean up headers. v13: Add test / subtest descriptions (patchwork). v14: Extract redundant device rescan and reopen code to a 'healthcheck' helper, - call igt_abort_on_f() on device reopen failure (Petri), - if any timeout set with igt_set_timeout() inside a subtest expires, call igt_abort_on_f() from a follow-up igt_fixture (Petri), - when running on an i915 device, require GEM and call igt_abort_on_f() if no usable GEM is detected on device reopen. v15: Add the test to CI blacklist (Martin). v16: Separate blacklist entry with a descriptive comment (Petri). Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com> Cc: Antonio Argenziano <antonio.argenziano@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Katarzyna Dec <katarzyna.dec@intel.com> Cc: Martin Peres <martin.peres@linux.intel.com> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Petri Latvala <petri.latvala@intel.com>
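The "unbind-rebind" operation described in this commit goes through the sysfs interface. The sketch below shows the general shape of such an operation under stated assumptions: the driver directory is already open as a directory file descriptor, it exposes the kernel's usual `unbind` and `bind` attributes, and the device is identified by its bus address string. The helper names `write_sysfs_attr` and `unbind_rebind` are hypothetical and are not the functions used by the test, which also performs health checks that are omitted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a string to an attribute below dirfd. */
static int write_sysfs_attr(int dirfd, const char *attr, const char *value)
{
	size_t len;
	ssize_t ret;
	int fd;

	assert(attr && value);

	fd = openat(dirfd, attr, O_WRONLY);
	if (fd < 0)
		return -1;

	len = strlen(value);
	ret = write(fd, value, len);

	/* Report close errors too instead of silently ignoring them. */
	if (close(fd) < 0)
		return -1;

	return ret == (ssize_t)len ? 0 : -1;
}

/* Hypothetical: unbind the driver from the device, then bind it again. */
static int unbind_rebind(int drv_dirfd, const char *bus_addr)
{
	if (write_sysfs_attr(drv_dirfd, "unbind", bus_addr))
		return -1;

	return write_sysfs_attr(drv_dirfd, "bind", bus_addr);
}
```

A caller would typically open the driver directory (for a PCI device, something under `/sys/bus/pci/drivers/`) with `O_RDONLY | O_DIRECTORY` and pass the resulting descriptor along with the device's bus address. Running this against real hardware requires root and detaches the driver from the device, so it is only suitable on a test machine.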