clusterduck - pixelcluster’s GPU blog

RADV Ray Tracing: Now ON by default

2023-06-13T00:00:00+00:00

Yes, you heard that right.

Ray Tracing Pipelines.

On RADV.

Enabled by default.

Now merged in Mesa main.

This has been in the works for a loooooooooong time. Probably the longest of any RADV feature so far.

But what makes ray tracing pipelines so complex that it takes this long to implement? Let’s take a short look at what it took for RADV to get its implementation off the ground.

Ray Tracing basics

For the purposes of this blog, ray tracing is the process of finding intersections between rays and some geometry.

Most of the time, this geometry will be made up of lots of triangles. We don’t want to test every single triangle for intersection separately, so Bounding Volume Hierarchies (BVHs) are used to speed up the process by skipping entire groups of triangles at once.

Hardware acceleration

Nowadays, GPUs have dedicated hardware to speed up the ray tracing process.

AMD’s hardware acceleration for ray tracing is very simple: It consists of a single instruction called image_bvh_intersect_ray (and its 64-bit variant).¹

Why is it called image_bvh_intersect_ray? Because the hardware sees the BVH as a 1D image and uses its memory subsystem for textures to fetch BVH data, of course.

This instruction takes care of calculating intersections between a ray and a single node in the BVH. But intersecting one node isn’t good enough: In order to find actual intersections between the ray and geometry, we need to traverse the BVH and check lots of nodes. The traversal loop that accomplishes this is implemented in software².
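To make the split between "intersect one node" and "traverse the tree" a bit more concrete, here is a minimal Python sketch of a stack-based traversal loop. It is a toy model, not RADV's code: the Node layout, the ray_hits_box slab test (standing in for what image_bvh_intersect_ray does per node) and the two-leaf example tree are all assumptions made up for illustration, and leaves only report a candidate triangle ID instead of doing a real ray-triangle test.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec3                                  # minimum corner of the bounding box
    hi: Vec3                                  # maximum corner of the bounding box
    children: List["Node"] = field(default_factory=list)
    triangle_id: Optional[int] = None         # only set on leaf nodes

def ray_hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: software stand-in for the per-node intersection step."""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far

def traverse(root: Node, origin: Vec3, direction: Vec3) -> List[int]:
    """Traversal loop: returns the IDs of all leaves whose box the ray enters."""
    hits = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue                          # skips the entire subtree at once
        if node.triangle_id is not None:
            hits.append(node.triangle_id)
        else:
            stack.extend(node.children)
    return hits

# Tiny example tree: a root box containing two leaf boxes.
left = Node((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), triangle_id=0)
right = Node((4.0, 0.0, 0.0), (5.0, 1.0, 1.0), triangle_id=1)
root = Node((0.0, 0.0, 0.0), (5.0, 1.0, 1.0), children=[left, right])

hit_ids = traverse(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0))
miss_ids = traverse(root, (10.0, 10.0, -1.0), (0.0, 0.0, 1.0))
```

In this sketch, swapping ray_hits_box for a hardware instruction would leave traverse untouched, which is roughly the property footnote 2 relies on for emulation.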

Ray Tracing Pipelines

In Vulkan, you can use ray tracing pipelines to utilize your GPU’s hardware-accelerated ray tracing capabilities. It might not seem like it, but ray tracing pipelines actually bring a whole lot of new features with them that make them quite complex to implement.

Ray tracing pipelines introduce a set of new shader stages:

Ray generation shaders calculate origins and directions of rays to trace and call traceRayEXT to start tracing
Any-hit shaders are responsible for confirming or rejecting potential intersections
Intersection shaders can be used to run custom ray-primitive intersection code, which can be used to do ray tracing on non-triangle geometry
Closest-hit shaders are responsible for handling rays that have hit geometry, calculating things like lighting for the traced ray
Miss shaders handle the case where no intersections were accepted (either there were no intersections, or all intersections were rejected).
Callable shaders can be invoked from the ray generation shader and can do arbitrary calculations, including recursive calls (calling callable shaders from callable shaders)

That’s right, as a small side effect, ray tracing pipelines also introduced full proper recursion from shaders. This doesn’t just apply to callable shaders: You can also trace new rays from a closest-hit shader, which can recursively invoke more closest-hit shaders, etc.

Also, ray tracing pipelines introduce a very dynamic, GPU-driven shader dispatch process: In traditional graphics and compute pipelines, once you bind a pipeline, you know exactly which shaders are going to execute once you do a draw or dispatch. In ray tracing pipelines, this depends on something called the Shader Binding Table, which is a piece of memory containing so-called “shader handles”. These shader handles identify the shader that is actually launched when vkCmdTraceRaysKHR is called.

In both graphics and compute pipelines, the concept of pipeline stages was quite simple: You have a bunch of shader stages (for graphics pipelines, it’s usually vertex and fragment, for compute pipelines it’s just compute). Each stage has exactly one shader: You don’t have one graphics pipeline with many vertex shaders. In ray tracing pipelines, there are no restrictions on how many shaders can exist for each stage.

In RT pipelines, there is also the concept of shaders dispatching other shaders: Every time traceRayEXT is called, more shaders (any-hit, intersection, closest-hit or miss shaders) are launched.

That’s lots of changes just for some ray tracing!

Hardware limitations

RT pipelines aren’t really a fitting representation of AMD hardware. There is no such thing as reading a memory location to determine which shader to launch, and the hardware has no concept of a callstack to implement recursion. RADV therefore has to do a bit of magic to transform RT pipelines in a way that will actually run.

Shader stages: All-in-one

The first approach RADV used to implement these ray tracing pipelines was essentially to pretend that the whole ray tracing pipeline is a normal compute shader: All shaders from the pipeline are assigned a unique ID. Then, all shaders are inserted into a humongous chain of if (idx == shader_id) { (paste shader code here) } statements.

If you wanted to call a shader, it was as simple as setting idx to the ID of the shader you wanted to call. You could even implement recursion by storing the ID of the shader to return to on a call stack.

Launching shaders according to the shader binding table wasn’t a problem either: You just read the shader binding table at the start and set idx to whatever value is in there.
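The idea can be sketched in a few lines of Python. This is only a toy model of the scheme described above, not RADV's actual code: the shader IDs, the RAYGEN_RESUME split (code after a call gets its own ID so it can be "returned" to) and the trace strings are assumptions invented for the example.

```python
# Hypothetical shader IDs; DONE is a sentinel meaning "nothing left to run".
RAYGEN, RAYGEN_RESUME, CLOSEST_HIT, MISS, DONE = range(5)

def run_megashader(first_shader: int, ray_hits: bool) -> list:
    """Runs one invocation of the combined shader and records what executed."""
    trace = []
    call_stack = [DONE]      # returning from the first shader ends the invocation
    idx = first_shader       # in the real thing, this comes from the shader binding table
    while idx != DONE:
        if idx == RAYGEN:
            trace.append("raygen: trace ray")
            call_stack.append(RAYGEN_RESUME)          # where to continue afterwards
            idx = CLOSEST_HIT if ray_hits else MISS   # a "call" is just setting idx
        elif idx == RAYGEN_RESUME:
            trace.append("raygen: write result")
            idx = call_stack.pop()
        elif idx == CLOSEST_HIT:
            trace.append("closest-hit")
            idx = call_stack.pop()                    # a "return" is popping idx
        elif idx == MISS:
            trace.append("miss")
            idx = call_stack.pop()
    return trace

hit_trace = run_megashader(RAYGEN, ray_hits=True)
miss_trace = run_megashader(RAYGEN, ray_hits=False)
```

Every shader in the pipeline becomes one more branch in that chain, which is exactly why the approach scales so badly once an app brings hundreds of shaders.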

But there was a problem.

Oh God there’s so many of them

As it turns out, if you don’t put any restrictions on how many shaders can exist in a stage, there’s going to be apps that use LOTS of them. We’re talking almost a thousand shaders in some cases. Ludicrously large code like that resulted in lots of ludicrous results (games spending over half an hour compiling shaders!). Clearly, the megashader solution wasn’t sustainable.

Also I forgot an important addition

Ray Tracing Pipelines also add pipeline libraries. You might have heard of them in the context of Graphics Pipeline Libraries, which were also really painful to implement in RADV.

Pipeline libraries essentially allow you to create parts of your ray tracing pipeline beforehand, and then re-use these created parts all over other ray tracing pipelines. But if we just paste all shaders into one chonker compute shader, we can’t compile it yet when creating a pipeline library, because other shaders will be added once a real pipeline is created from it!

This basically meant that we couldn’t do anything but copy the source code around, and start compiling only when the real pipeline is created. It also turned out that it’s valid behaviour to query the stack size used for recursion from pipeline libraries, but because RADV didn’t compile any code yet, it didn’t even know what stack size the shaders from that pipeline used.

Separate shader compilation

This is where separate shader compilation comes in. As the name suggests, most³ shaders are compiled independently. Instead of using shader IDs to select what shader is called, we store the VRAM addresses of the shaders and directly jump to whatever shaders we want to execute next.

Directly jumping to a shader is still impossible because reading the shader binding table is required. Instead, RADV creates a small piece of shader assembly that sets up necessary parameters, reads the shader binding table, and then directly jumps to the selected shader (like it is done for shader calls).
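As a rough illustration of the difference to the megashader, here is a small Python sketch where every shader is a separate function living at its own "address", and a prolog reads the shader binding table and jumps to whatever it finds there. Everything in it is an assumption made for the example: the VRAM dictionary, the addresses, the dictionary-shaped shader binding table and the convention that a shader returns the next address to jump to.

```python
from typing import Callable, Dict, List, Optional

# Each independently "compiled" shader returns the address to jump to next,
# or None once there is nothing left to execute.
Shader = Callable[[dict], Optional[int]]

def raygen(state: dict) -> Optional[int]:
    state["trace"].append("raygen")
    sbt = state["sbt"]
    return sbt["hit"] if state["ray_hits"] else sbt["miss"]

def closest_hit(state: dict) -> Optional[int]:
    state["trace"].append("closest-hit")
    return None

def miss(state: dict) -> Optional[int]:
    state["trace"].append("miss")
    return None

# Hypothetical "VRAM": shaders can be added here one at a time,
# without recompiling any of the others.
VRAM: Dict[int, Shader] = {0x1000: raygen, 0x2000: closest_hit, 0x3000: miss}

def prolog(sbt: Dict[str, int], ray_hits: bool) -> List[str]:
    """Sets up parameters, reads the shader binding table, then jumps."""
    state = {"sbt": sbt, "ray_hits": ray_hits, "trace": []}
    addr: Optional[int] = sbt["raygen"]
    while addr is not None:
        addr = VRAM[addr](state)   # "jump" to the shader at that address
    return state["trace"]

sbt = {"raygen": 0x1000, "hit": 0x2000, "miss": 0x3000}
hit_trace = prolog(sbt, ray_hits=True)
miss_trace = prolog(sbt, ray_hits=False)
```

Because nothing in one shader depends on the code of another, a pipeline library can compile its shaders up front in this model and simply contribute more entries later.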

This allows us to compile shaders immediately when creating pipeline libraries. It also pretty much resolves the problem of chonker compute shaders taking ludicrously long to compile. It also required basically reworking the entire ray tracing compilation infrastructure, but I think it forms a great basis for future work in the performance area.

FAQ

What apps/games does RADV ray tracing run?

Everything runs.

In case you disagree, please open an issue.

How well do ray queries run?

Pretty competitive with AMDVLK/the AMD Windows drivers! You’ll generally see similar, if not better, performance on RADV.

How well do pipelines run?

Not well (expect significantly less performance compared to AMDVLK/Windows drivers). This is being worked on.

Footnotes

¹ RDNA3 introduces another instruction that helps with BVH traversal stack management, but RADV doesn’t use it yet.
² This is also what makes it so easy to support ray tracing even when there is no hardware acceleration (using RADV_PERFTEST=emulate_rt): Most of the traversal code can be reused, only image_bvh_intersect_ray needs to be replaced with a software equivalent.
³ Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren’t that ludicrous anymore.

GPU Hang Exploration: Splitgate

2023-05-11T00:00:00+00:00

GPU hangs are one of the most common results of pretty much anything going wrong GPU-side, and finding out why they occur isn’t always easy. In this blog post, I’ll document my journey towards finding the cause of one specific hang in the game “Splitgate”.

Right off the bat, I noticed a few oddities with this particular hang. Firstly, the actual game always ran completely fine. The only place where it hung was on the first startup where the game automatically configures graphics settings (I’ll call it autoconfiguration from here).

Additionally, while I could reproduce the hang on my Steam Deck, I couldn’t get the hang to appear on my desktop. I have an RDNA2 graphics card, which is the same architecture as the Deck, so it seemed unlikely that specifics about the hardware architecture were the problem here.

API Validation

As a first step, I tried running the game with the Vulkan Validation Layers. If the game is using the API in an invalid way and that is the cause of the hangs, there’s a rather good chance the Validation Layers will catch it.

Even though there were a few errors from the validation layers, it seemed like none of the errors were actually relevant to the hang. Most importantly, the errors with autoconfiguration on were the same as the errors with autoconfiguration off.

Like any software, the Validation Layers aren’t perfect and can’t detect every possible invalid behaviour. At this point I was still unsure whether I’d have to search for the bug on the application side or on the driver side.

API dumping

With the validation layers being unable to detect any invalid behaviour by the app during the autoconfiguration phase, another question comes to mind: What is the application doing, actually?

To answer that, I utilized the API Dump Vulkan layer by LunarG. When this layer is activated, it dumps all the commands made by the application, including every parameter and return value to standard output.

While API dumps are good to have for debugging, large API dumps from large engines are often difficult to navigate (not just because the dump is an 800MB file and your text editor dies trying to scroll through it). Instead, it’s often best to extract just the work that hangs for further debugging. But which frame is that?

Finding the hanging submission

The CPU and GPU do work asynchronously, which means that the CPU is free to do more work while the GPU starts on its own. Somewhat unfortunately, this also means the CPU can make more Vulkan calls, which will show up in the API dump after the app has already submitted the hanging frame to the GPU. This means that I couldn’t just look at the last command in the API dump and assume that command caused the hang. Luckily, there were other hints towards what caused the hang.

In Vulkan, when you want to know when a particular work submission finishes, you give a VkFence to the submit function. Later, you can wait for the submission to finish with vkWaitForFences, or you can query whether the submission has already finished with vkGetFenceStatus.

I noticed that after work was submitted, the app seemed to call vkGetFenceStatus from time to time, polling whether that submission was finished. Usually, vkGetFenceStatus would return VK_SUCCESS after a few calls, indicating that the submission finished. However, there was one submission where vkGetFenceStatus seemed to always return VK_NOT_READY. It seemed very likely that the GPU was hanging while executing that submission.
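The pattern I was looking for can be expressed as a small Python sketch. The (call, fence, result) tuples below are a simplified, made-up record format and not the real output of the API Dump layer; only the function names and the VK_SUCCESS / VK_NOT_READY return values are taken from Vulkan itself.

```python
from typing import List, Tuple

# (call name, fence handle, return value): hypothetical, simplified records.
Event = Tuple[str, int, str]

def find_stuck_fences(events: List[Event]) -> List[int]:
    """Returns fences that were submitted but never reported VK_SUCCESS."""
    pending: List[int] = []            # fences in submission order
    for call, fence, result in events:
        if call == "vkQueueSubmit":
            pending.append(fence)
        elif call == "vkGetFenceStatus" and result == "VK_SUCCESS":
            if fence in pending:
                pending.remove(fence)  # this submission finished
    return pending

events = [
    ("vkQueueSubmit", 1, "VK_SUCCESS"),
    ("vkGetFenceStatus", 1, "VK_NOT_READY"),
    ("vkGetFenceStatus", 1, "VK_SUCCESS"),     # submission 1 completed
    ("vkQueueSubmit", 2, "VK_SUCCESS"),
    ("vkGetFenceStatus", 2, "VK_NOT_READY"),
    ("vkGetFenceStatus", 2, "VK_NOT_READY"),   # submission 2 never completes
]
stuck = find_stuck_fences(events)
```

A fence that stays in the list is only a hint: a submission made right before the dump ends would look the same, which is why a second check is still needed.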

To test my theory, I modified the implementation of vkQueueSubmit, which you call for submitting work, to call vkDeviceWaitIdle immediately after submitting the work. vkDeviceWaitIdle waits for all outstanding GPU work to finish. When the GPU hangs, the vkQueueSubmit which caused the hang should be the last line in the API dump¹.

This time, the API dump cut off at the vkQueueSubmit for which vkGetFenceStatus always returned VK_NOT_READY.

Bingo.

Going lower-level

Now we know which submission hangs, but that submission still contains a lot of commands. Even though the text editor now survives scrolling through the commands, finding what is wrong by inspection is highly unlikely.

Instead, I tried to answer the question: “What specific command is making the GPU hang?”

In order to find the answer, I needed to find out as much as possible about what state the GPU is in when it hangs. There are a few useful tools which helped me gather info:

umr

The first thing I did was use umr to query if any waves were active at the time of the hang. Waves (or Wavefronts) are groups of 32 or 64 shader invocations (or threads in DX terms) that the GPU executes at the same time². There were indeed quite a few waves currently executing. For each wave, umr can show a disassembly of the GPU code that is currently executing, as well as the values of all registers, and more.

In this case, I was especially interested in the halt and fatal_halt status bits for each wave. These bits are set when the wave encounters a fatal exception (for example dereferencing invalid pointers) and won’t continue execution. These bits were not set for any waves I inspected, so it was unlikely that exceptions in a shader were causing the hang.

Aside from exceptions, the other common way for shaders to trigger GPU hangs is by accidentally executing infinite loops. But the shader code currently executing was very simple and didn’t even have a jump instruction anywhere, so the hang couldn’t be caused by infinite loops either.

RADV_DEBUG=hang

Shaders aren’t the only thing that the GPU executes, and as such shaders aren’t the only thing that can cause GPU hangs.

In RADV, command buffers recorded in Vulkan are translated to a hardware-specific command buffer format called PKT3³. Commands encoded in this format are written to GPU-accessible memory, and executed by the GPU’s command processor (CP for short) when the command buffer is submitted.

These commands might also be involved in the hang, so I tried finding out which commands the CP was executing when the hang happened. RADV has integrated debug functionality that can help with exactly this, which can be enabled by setting an environment variable named RADV_DEBUG to "hang". But when I tried triggering the hang with this environment variable in place, it started up just fine!

This isn’t the first time I’ve seen this. RADV_DEBUG=hang has a funny side effect: It also inserts commands to wait for draws or dispatches to complete immediately after the dispatch is triggered. This immensely helps with figuring out which shader is faulty if there are multiple shaders executing concurrently. But it also prevents certain hangs from happening: namely those where things executing concurrently cause the hang in the first place.

In other words, we seem to be looking at a synchronization issue.

Synchronization boogaloo

Even though we know we’re dealing with a synchronization issue, the original question remains unsolved: What command causes the hang?

The “sync after every draw/dispatch” method of RADV_DEBUG=hang fixes the issue, but it has a very broad effect. Since the issue seems to reproduce very reliably (which in itself is a rarity for synchronization bugs), we can apply that sync selectively to only some draws or dispatches to narrow down what commands exactly cause the hangs.

First, I tried restricting the synchronization to only apply to dispatches (so no draws were synchronized). This made the hang appear again. Testing the other way around (restricting the synchronization to only draws) confirmed: All compute dispatches were fine, the issue was about draw synchronization only.

Next, I tried only synchronizing at the end of renderpasses. This also fixed the hang. However, synchronizing at the start of renderpasses fixed nothing. Therefore it was impossible that missing synchronization across renderpasses was the cause of the hang.

The last likely option was that there was missing synchronization in between the draws and something in between renderpasses.

At this point, the API dump of the hanging submission proved very helpful. Upon taking a closer look, it became clear that the commands in the submitted command buffer had a very simple pattern (some irrelevant commands omitted for brevity):

vkCmdBeginRenderPass to begin a new renderpass
vkCmdDraw
vkCmdEndRenderPass, ending the renderpass
vkCmdWriteTimestamp, writing the current elapsed time

What stuck out to me was that vkCmdWriteTimestamp was called with a pipelineStage of VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT. In simple terms, this means that the timestamp can be written before the preceding draw has finished.⁴

Further testing confirmed: If I insert synchronization before writing the timestamp, the hang is fixed. Inserting synchronization immediately after writing the timestamp makes the hang re-appear.
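The narrowing-down process boils down to "insert one sync at a time and see what changes". Here is a toy Python model of that loop. The hangs function is a made-up stand-in for actually launching the game: it simply encodes the observation that the problem went away whenever a full sync sat somewhere between the draw and the timestamp write. It says nothing about why that is the case.

```python
COMMANDS = [
    "vkCmdBeginRenderPass",
    "vkCmdDraw",
    "vkCmdEndRenderPass",
    "vkCmdWriteTimestamp",
]

def hangs(sync_before: set) -> bool:
    """Toy stand-in for 'run the game and check for a GPU reset'.

    sync_before holds command indices that get a full sync inserted in front
    of them; index len(COMMANDS) means 'after the last command'.
    """
    draw = COMMANDS.index("vkCmdDraw")
    timestamp = COMMANDS.index("vkCmdWriteTimestamp")
    # Assumption taken from the observations: a sync anywhere after the draw
    # and no later than the timestamp write makes the problem disappear.
    return not any(draw < i <= timestamp for i in sync_before)

def fixing_sync_points() -> list:
    """Tries every single insertion point and returns the ones that help."""
    return [i for i in range(len(COMMANDS) + 1) if not hangs({i})]

fixes = fixing_sync_points()
fix_names = [COMMANDS[i] for i in fixes]
```

In this model only the syncs in front of vkCmdEndRenderPass and vkCmdWriteTimestamp help, matching what I saw: syncing at the start of the renderpass or after the timestamp write changes nothing.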

How hard can writing a timestamp be?

By now, it has become pretty clear that timestamp queries are the problem here. But it just didn’t really make sense that the timestamp write itself would hang.

Timestamp writes on AMD hardware don’t require launching any shaders. They can be implemented using one PKT3 command called COPY_DATA⁵, which accepts many data sources other than memory. One of these data sources is the current timestamp. RADV uses COPY_DATA to write the timestamp to memory. The memory for these timestamps is managed by the driver, so it’s exceedingly unlikely the memory write would fail.

From the wave analysis with umr earlier I also knew that the in-flight shaders didn’t actually write or read any memory that might interfere with the timestamp write (somehow). The timestamp write itself being the cause of the hang seemed impossible.

Taking a step back

If timestamp writes can’t be the problem, what else can there be that might hang the GPU?

There is one other part to timestamp queries aside from writing the timestamp itself: In Vulkan, timestamps are always written to opaque “query pool” objects. In order to actually view the timestamp value, an app has to copy the results stored in the query pool to a buffer in CPU or GPU memory. Splitgate uses Unreal Engine 4, which has a known bug related to query pool copies that RADV has to work around.

It isn’t too far-fetched to think there might be other bugs in UE’s Vulkan RHI regarding query copies. Synchronizing the query copy didn’t do anything, but just commenting out the query copy fixed the hang as well.

????

Up until this point, I was pretty sure that something about the timestamp write must be the cause of the problems. Now it seemed like query copies might also influence the problem somehow? I was pretty unsure how to reconcile these two observations, so I tried finding out more about how exactly the query copy affected things.

Query copies on RADV are implemented using small compute shaders written directly in NIR. Having the simple driver-internal shaders in NIR is a nice and simple way of storing them inside the driver, but they’re a bit hard to read for people not used to the syntax. For demonstration purposes I’ll use a GLSL translation of the shader⁶. The copy shader for timestamp queries looks like this:

// Note: flags and global_id are inputs that are not declared in this simplified listing.
layout(binding = 0) buffer dst_buf;
layout(binding = 1) buffer src_buf;

void main() {
    uint32_t result_size = flags & VK_QUERY_RESULT_64_BIT ? sizeof(uint64_t) : sizeof(uint32_t);
    uint32_t dst_stride = result_size;
    if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
        dst_stride += sizeof(uint32_t);
    uint32_t src_stride = 8;

    uint64_t result = 0;
    bool available = false;
    uint64_t src_offset = src_stride * global_id.x;
    uint64_t dst_offset = dst_stride * global_id.x;
    uint64_t timestamp = src_buf[src_offset];
    if (timestamp != TIMESTAMP_NOT_READY) {
        result = timestamp;
        available = true;
    }
    if ((flags & VK_QUERY_RESULT_PARTIAL_BIT) || available) {
        if (flags & VK_QUERY_RESULT_64_BIT) {
            dst_buf[dst_offset] = result;
        } else {
            dst_buf[dst_offset] = (uint32_t)result;
        }
    }
    if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT) {
        dst_buf[dst_offset + result_size] = available;
    }
}
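To see what this does to actual bytes, here is a rough Python port of the listing above that processes a whole pool on the CPU. It mirrors the simplified listing and not necessarily the exact driver behaviour. The VK_QUERY_RESULT_* values are the ones from the Vulkan headers, while the all-ones value for TIMESTAMP_NOT_READY and the copy_timestamps helper itself are assumptions for this example.

```python
import struct

# Flag values as defined by Vulkan's VkQueryResultFlagBits.
VK_QUERY_RESULT_64_BIT = 0x1
VK_QUERY_RESULT_WITH_AVAILABILITY_BIT = 0x4
VK_QUERY_RESULT_PARTIAL_BIT = 0x8

# Assumption: "not written yet" is marked with an all-ones 64-bit value.
TIMESTAMP_NOT_READY = 0xFFFFFFFFFFFFFFFF

def copy_timestamps(src: bytes, count: int, flags: int) -> bytes:
    """CPU version of the copy shader: one loop iteration per invocation."""
    result_size = 8 if flags & VK_QUERY_RESULT_64_BIT else 4
    dst_stride = result_size
    if flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT:
        dst_stride += 4
    src_stride = 8
    dst = bytearray(dst_stride * count)
    for i in range(count):
        (timestamp,) = struct.unpack_from("<Q", src, src_stride * i)
        available = timestamp != TIMESTAMP_NOT_READY
        result = timestamp if available else 0
        dst_offset = dst_stride * i
        if (flags & VK_QUERY_RESULT_PARTIAL_BIT) or available:
            if flags & VK_QUERY_RESULT_64_BIT:
                struct.pack_into("<Q", dst, dst_offset, result)
            else:
                struct.pack_into("<I", dst, dst_offset, result & 0xFFFFFFFF)
        if flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT:
            struct.pack_into("<I", dst, dst_offset + result_size, int(available))
    return bytes(dst)

# Two queries: one finished, one still pending.
pool = struct.pack("<QQ", 1234, TIMESTAMP_NOT_READY)
copied = copy_timestamps(
    pool, 2, VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WITH_AVAILABILITY_BIT
)
```

With both flags set, each query occupies 12 bytes in the destination: the finished query ends up as its value followed by an availability word of 1, and the pending one stays all zeros.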

At first, I tried commenting out the stores to dst_buf, which made the hangs disappear again. This can indicate that dst_buf is the problem, but it’s not the only possibility: with the stores gone, the compiler can also optimize out the load from src_buf because its result is no longer used, so removing the stores could just as well be masking an invalid read. When I commented out the read and always stored a constant instead, it also didn’t hang!

But could it be that the shader was reading from an invalid address? Splitgate is far from the only app out there using timestamp queries, and those apps all work fine - so it can’t just be fundamentally broken, right?

To test this out, I modified the timestamp write command once again. Remember how PKT3_COPY_DATA is really versatile? Aside from copying memory and timestamps, it can also copy a 32/64-bit constant supplied as a parameter. I undid all the modifications to the copy shader and forced a constant to be written instead of timestamps. No hangs to be seen.

?????????

It seems like aside from the synchronization, the value that is written as the timestamp influences whether a hang happens or not. But that also means neither of the two things already investigated can actually be the source of the hang, can they?

It’s essentially the same question as in the beginning, still unanswered:
“What the heck is hanging here???”

RADV_DEBUG=hang (but useful this time)

Stabbing in the dark with more guesses won’t help here. The only thing that can is more info. I already had a small GPU buffer that I used for some other debugging I skipped over. To get definitive info on whether it hangs because of the timestamp write, the timestamp copy, or something else entirely, I modified the command buffer recording to write some magic numbers into that debug buffer whenever these operations happened. It went something along the lines of:

write 0xAAAAAAAA if timestamp write is complete
write 0xBBBBBBBB if timestamp copy is complete

However, I still needed to ensure I only read the magic numbers after the GPU had time to write them (without waiting forever during GPU hangs). This required a different intricate and elaborate synchronization algorithm.

// VERY COMPLICATED SYNCHRONIZATION
sleep(1);

With that out of the way, let’s take a look at the magic number of the hanging submission.

Magic: 0x0

what??? this means neither write nor copy have executed? Alright, what if I add another command writing a magic number right at the beginning of the command buffer?

Magic: 0x0

So… the hang happens before the command buffer starts executing? Something can’t be right here.⁷

At this point I started logging all submits that contained either timestamp writes or timestamp copies, and I noticed that there was another submission with the same pattern of commands right before the hanging one.

Multi-submit madness

This previous submission had executed just fine - all timestamps were written, all shaders finished without hangs. This meant that neither the way timestamps were written nor the way they were copied could be direct causes of hangs, because they worked just one submission prior.

I verified this theory by forcing full shader synchronization to happen before the timestamp write, but only for the submission that actually hangs. To my surprise, this did nothing to fix the hangs.

When I applied the synchronization trick to the previous submit (that always worked fine!), the hangs stopped appearing.

It seems like the cause of the hang is not in the hanging submission, but in a completely separate one that completed successfully.

What is the app doing?

Let’s rewind to the question that started this whole mess. “What is the app doing?”

Splitgate (as of today) uses Unreal Engine 4.27.2. Luckily, Epic Games make the source code of UE available to anyone registering for it with their Epic Games account. There was hope that the benchmark code they were using was built into Unreal, where I could examine what exactly it does.

Searching in the game logs from a run with the workaround enabled, I found this:

LogSynthBenchmark: Display: Graphics:
LogSynthBenchmark: Display:   Adapter Name: 'AMD Custom GPU 0405 (RADV VANGOGH)'
LogSynthBenchmark: Display:   (On Optimus the name might be wrong, memory should be ok)
LogSynthBenchmark: Display:   Vendor Id: 0x1002
LogSynthBenchmark: Display:   Device Id: 0x163F
LogSynthBenchmark: Display:   Device Revision: 0x0
LogSynthBenchmark: Display:   GPU first test: 0.06s
LogSynthBenchmark: Display:          ... 3.519 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 2.804 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 2.487 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 8.917 s/GigaPix, Confidence=100% 'FillOnly' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 0.330 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 0.951 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 6.053 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be very inaccurate)
LogSynthBenchmark: Display:   GPU second test: 0.54s
LogSynthBenchmark: Display:          ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 9.127 s/GigaPix, Confidence=100% 'FillOnly' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be inaccurate)
LogSynthBenchmark: Display:   GPU Final Results:
LogSynthBenchmark: Display:          ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise'
LogSynthBenchmark: Display:          ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy'
LogSynthBenchmark: Display:          ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy'
LogSynthBenchmark: Display:          ... 9.127 s/GigaPix, Confidence=100% 'FillOnly'
LogSynthBenchmark: Display:          ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth'
LogSynthBenchmark: Display:          ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1'
LogSynthBenchmark: Display:          ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2'

FSynthBenchmark indeed appears in the UE codebase as a benchmark tool to auto-calibrate settings. From reading its code, it seemed like it does 3 separate benchmark ru…

wait. 3??

We can clearly see from the logs there are only two benchmark runs. Maybe the third run hangs the GPU somehow?

Hang? Well yes, but actually no

While thinking about this, another possibility came to my mind. The GPU driver can’t actually detect if the GPU is hung because of some fatal error or if it just takes an obscenely long amount of time for some work. No matter what it is, if it isn’t finished in 10 seconds, the GPU will be reset.⁸

So what if the hang I’ve been chasing all this time isn’t actually a hang? How do I even find out?

The amdgpu kernel driver has a parameter named lockup_timeout for this exact purpose: You can modify this parameter to change the amount of time after which the GPU is reset if a job doesn’t finish, or disable this GPU reset entirely. To test this theory, I went with disabling the GPU reset.

After setting all the parameters up and rebooting, I started the game another time.

And it worked! It took a really long time, but eventually, the game started up fully. It was indeed just hammering the poor Deck’s GPU with work that took way too long.

Why does my workaround work?

Finally, things start clearing up a bit. There is still an open question, though: What does the workaround do to prevent this?

The code that runs the 3 benchmark passes doesn’t run all of them unconditionally. Instead, the 3 benchmarks have increasingly large workloads (each roughly 10x as much as the previous one). Comments nearby explain that this choice was made because the larger benchmark runs cause driver resets on low-end APUs (hey, that’s exactly the problem we’re having!). It measures the time it takes for the benchmark workloads to complete using the timestamp queries, and if the total benchmark time is beyond a certain point, it skips the other benchmark runs.

If you’ve been paying extremely close attention all the way until here, you might notice a small problem. UE4 interprets the timestamp values as the time until the benchmark workload completes. But as I pointed out all the way near the beginning, the timestamp can be written before the benchmark workload is even finished!

If the timestamp is written before the benchmark workload finishes, the measured benchmark time is much less than the workload actually took. In practice, this results in the benchmark results indicating a much faster GPU than there actually is. I assume this led to the third benchmark (which was too heavy for the Deck GPU) being launched. My desktop GPU seems to be powerful enough to get through the benchmark before the lockup timeout, which is why I couldn’t reproduce the issue there.
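Here is a small Python model of that chain of events. All numbers and names in it are hypothetical: the 0.2 second first pass, the 1 second time budget and the function names are chosen purely for illustration and do not come from Unreal Engine's code. Only the 10x scaling between passes and the 10 second reset timeout come from the description above.

```python
def measured_seconds(workload_s: float, top_of_pipe: bool) -> float:
    """Time between start and end timestamp, as the engine would see it."""
    # A top-of-pipe timestamp may be written as soon as the command is reached,
    # so in the worst case it captures none of the workload's duration.
    return 0.0 if top_of_pipe else workload_s

def run_benchmark(first_pass_s: float, time_budget_s: float,
                  gpu_timeout_s: float, top_of_pipe: bool) -> list:
    """Runs up to 3 passes of 10x increasing size and records each outcome."""
    outcomes = []
    workload_s = first_pass_s
    total_measured_s = 0.0
    for _ in range(3):
        if workload_s > gpu_timeout_s:
            outcomes.append("gpu reset")      # the job takes longer than allowed
            break
        outcomes.append("ok")
        total_measured_s += measured_seconds(workload_s, top_of_pipe)
        if total_measured_s > time_budget_s:
            break                             # GPU looks slow: skip heavier passes
        workload_s *= 10.0
    return outcomes

# Hypothetical slow GPU: 0.2 s first pass, 1 s budget, 10 s reset timeout.
with_sync = run_benchmark(0.2, 1.0, 10.0, top_of_pipe=False)
without_sync = run_benchmark(0.2, 1.0, 10.0, top_of_pipe=True)
```

With accurate timestamps the model stops after two passes, like the log from the run with the workaround shows. With timestamps that are written too early it happily launches the third pass and runs into the reset.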

In the end, the hack I originally found to work around the issue turned out to be a fitting workaround.

And I even got to make my first bugreport for Unreal Engine.

Footnotes

¹ If an app is using Vulkan on multiple threads, this might not always be the case. This is a rare case where I’m grateful that Unreal Engine has a single RHI thread.
² Nvidia also calls them “warps”.
³ Short for “Packet 3”. Packet 2, 1 and 0 also exist, although they aren’t widely used on newer AMD hardware.
⁴ If you insert certain pipeline barriers, writing the timestamp early would be disallowed, but these barriers weren’t there in this case.
⁵ This command writes the timestamp immediately when the CP executes it. There is another command which waits for previous commands to finish before writing the timestamp.
⁶ You can also view the original NIR and the GLSL translation here
⁷ As it turned out later, the debugging method was flawed. In actuality, both timestamp writes and copies completed successfully, but the writes indicating this seemed to be still in the write cache. Forcing the memory containing the magic number to be uncached solved this.
⁸ Usually, the kernel driver can also command the GPU to kill whatever job it is doing right now. For some reason, it didn’t work here though.