<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.9.3">Jekyll</generator><link href="https://pixelcluster.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://pixelcluster.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2023-09-05T11:14:23+00:00</updated><id>https://pixelcluster.github.io/feed.xml</id><title type="html">clusterduck - pixelcluster’s GPU blog</title><subtitle>🐸🐸🐸🐸🐸</subtitle><author><name>Friedrich Vock</name></author><entry><title type="html">RADV Ray Tracing: Now ON by default</title><link href="https://pixelcluster.github.io/RADV-Raytracing-ON/" rel="alternate" type="text/html" title="RADV Ray Tracing: Now ON by default" /><published>2023-06-13T00:00:00+00:00</published><updated>2023-06-13T00:00:00+00:00</updated><id>https://pixelcluster.github.io/RADV-Raytracing-ON</id><content type="html" xml:base="https://pixelcluster.github.io/RADV-Raytracing-ON/">&lt;p&gt;Yes, you heard that right.&lt;/p&gt;

&lt;p&gt;Ray Tracing Pipelines.&lt;/p&gt;

&lt;p&gt;On RADV.&lt;/p&gt;

&lt;p&gt;Enabled by default.&lt;/p&gt;

&lt;p&gt;Now merged in Mesa &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This has been in the works for a loooooooooong time. Probably longer than
any other RADV feature so far.&lt;/p&gt;

&lt;p&gt;But what makes ray tracing pipelines so complex that it takes this long to implement?
Let’s take a short look at what it took for RADV to get its implementation off the ground.&lt;/p&gt;

&lt;h2 id=&quot;ray-tracing-basics&quot;&gt;Ray Tracing basics&lt;/h2&gt;

&lt;p&gt;For the purposes of this blog, ray tracing is the process of finding intersections between rays and some geometry.&lt;/p&gt;

&lt;p&gt;Most of the time, this geometry will be made up of lots of triangles. We don’t want to test every single triangle for
intersection separately, so Bounding Volume Hierarchies (BVHs) are used to speed up the process by skipping entire
groups of triangles at once.&lt;/p&gt;

&lt;h2 id=&quot;hardware-acceleration&quot;&gt;Hardware acceleration&lt;/h2&gt;

&lt;p&gt;Nowadays, GPUs have dedicated hardware to speed up the ray tracing process.&lt;/p&gt;

&lt;p&gt;AMD’s hardware acceleration for ray tracing is very simple: It consists of a single instruction called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_bvh_intersect_ray&lt;/code&gt; (and its 64-bit variant).&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Why is it called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_bvh_intersect_ray&lt;/code&gt;? Because the hardware sees the BVH as a 1D image and uses its memory subsystem for textures to fetch BVH data, of course.&lt;/p&gt;

&lt;p&gt;This instruction takes care of calculating intersections between a ray and a single node in the BVH. But intersecting one node isn’t good enough:
In order to find actual intersections between the ray and geometry, we need to traverse the BVH and check lots of nodes.
The traversal loop that accomplishes this is implemented in software&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;ray-tracing-pipelines&quot;&gt;Ray Tracing Pipelines&lt;/h2&gt;

&lt;p&gt;In Vulkan, you can use ray tracing pipelines to utilize your GPU’s hardware-accelerated ray tracing capabilities. It might not seem like it, but ray tracing pipelines
actually bring a whole lot of new features with them that make them quite complex to implement.&lt;/p&gt;

&lt;p&gt;Ray tracing pipelines introduce a set of new shader stages:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Ray generation shaders calculate origins and directions of rays to trace and call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;traceRayEXT&lt;/code&gt; to start tracing&lt;/li&gt;
  &lt;li&gt;Any-hit shaders are responsible for confirming or rejecting potential intersections&lt;/li&gt;
  &lt;li&gt;Intersection shaders can be used to run custom ray-primitive intersection code, which can be used to do raytracing on non-triangle geometry&lt;/li&gt;
  &lt;li&gt;Closest-hit shaders are responsible for handling rays that have hit geometry, calculating things like lighting for the traced ray&lt;/li&gt;
  &lt;li&gt;Miss shaders handle the case where no accepted intersections were found (either there were no intersections, or all intersections were rejected).&lt;/li&gt;
  &lt;li&gt;Callable shaders can be invoked from the ray generation shader and can do arbitrary calculations, including recursive calls (calling callable shaders from callable shaders)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s right, as a small side effect, ray tracing pipelines also introduced full proper recursion from shaders. This doesn’t just apply to callable shaders:
You can also trace new rays from a closest-hit shader, which can recursively invoke more closest-hit shaders, etc.&lt;/p&gt;

&lt;p&gt;Also, ray tracing pipelines introduce a very dynamic, GPU-driven shader dispatch process: In traditional graphics and compute pipelines, once you bind a pipeline,
you know exactly which shaders are going to execute once you do a draw or dispatch. In ray tracing pipelines, this depends on something called the Shader Binding Table,
which is a piece of memory containing so-called “shader handles”. These shader handles identify the shader that is &lt;em&gt;actually&lt;/em&gt; launched when vkCmdTraceRaysKHR is called.&lt;/p&gt;

&lt;p&gt;In both graphics and compute pipelines, the concept of pipeline stages was quite simple: You have a bunch of shader stages (for graphics pipelines, it’s
usually vertex and fragment, for compute pipelines it’s just compute). Each stage has exactly one shader: You don’t have one graphics pipeline with
many vertex shaders. In ray tracing pipelines, there are no restrictions on how many shaders can exist for each stage.&lt;/p&gt;

&lt;p&gt;In RT pipelines, there is also the concept of shaders dispatching other shaders: Every time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;traceRayEXT&lt;/code&gt; is called, more shaders (any-hit, intersection, closest-hit or miss shaders)
are launched.&lt;/p&gt;

&lt;p&gt;That’s lots of changes just for some ray tracing!&lt;/p&gt;

&lt;h2 id=&quot;hardware-limitations&quot;&gt;Hardware limitations&lt;/h2&gt;

&lt;p&gt;RT pipelines aren’t really a fitting representation of AMD hardware. There is no such thing as reading a memory location to determine which shader to launch, and the hardware has
no concept of a callstack to implement recursion. RADV therefore has to do a bit of magic to transform RT pipelines in a way that will actually run.&lt;/p&gt;

&lt;h3 id=&quot;shader-stages-all-in-one&quot;&gt;Shader stages: All-in-one&lt;/h3&gt;

&lt;p&gt;The first approach RADV used to implement these ray tracing pipelines was essentially to pretend that the whole ray tracing pipeline is a normal compute shader:
All shaders from the pipeline are assigned a unique ID. Then, all shaders are inserted into a humongous chain of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (idx == shader_id) { (paste shader code here) }&lt;/code&gt; statements.&lt;/p&gt;

&lt;p&gt;If you wanted to call a shader, it was as simple as setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx&lt;/code&gt; to the ID of the shader you wanted to call. You could even implement recursion by storing the ID of the shader
to return to on a call stack.&lt;/p&gt;

&lt;p&gt;Launching shaders according to the shader binding table wasn’t a problem either: You just read the shader binding table at the start and set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx&lt;/code&gt; to whatever value is in there.&lt;/p&gt;
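
&lt;p&gt;As a rough illustration (a Python toy model, not RADV’s actual code), the megashader approach behaves roughly like the loop below. The shader IDs, the dictionary standing in for the shader binding table, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RAYGEN_RESUME&lt;/code&gt; entry used to continue the ray generation shader after its trace call are all assumptions made up for this sketch.&lt;/p&gt;

```python
# Hypothetical shader IDs. DONE means there is nothing left to run.
DONE, RAYGEN, RAYGEN_RESUME, CLOSEST_HIT, MISS = range(5)


def run_megashader(sbt, ray_hits):
    """Run one invocation of the toy megashader and list the shaders it ran."""
    ran = []
    stack = []           # software call stack of shader IDs to return to
    idx = sbt["raygen"]  # the first shader comes from the shader binding table
    while idx != DONE:
        if idx == RAYGEN:
            ran.append("raygen")
            # Stand-in for traceRayEXT: remember where to continue, then
            # "call" the hit or miss shader by changing idx.
            stack.append(RAYGEN_RESUME)
            idx = sbt["hit"] if ray_hits else sbt["miss"]
        elif idx == RAYGEN_RESUME:
            ran.append("raygen (resumed)")
            idx = DONE
        elif idx == CLOSEST_HIT:
            ran.append("closest-hit")
            idx = stack.pop()  # return to the caller
        elif idx == MISS:
            ran.append("miss")
            idx = stack.pop()
        else:
            raise ValueError(f"unknown shader ID {idx}")
    return ran


sbt = {"raygen": RAYGEN, "hit": CLOSEST_HIT, "miss": MISS}
print(run_megashader(sbt, ray_hits=True))
print(run_megashader(sbt, ray_hits=False))
```

&lt;p&gt;In this model, every shader in the pipeline becomes one more branch of the same function.&lt;/p&gt;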

&lt;p&gt;But there was a problem.&lt;/p&gt;

&lt;h4 id=&quot;oh-god-theres-so-many-of-them&quot;&gt;Oh God there’s so many of them&lt;/h4&gt;

&lt;p&gt;As it turns out, if you don’t put any restrictions on how many shaders can exist in a stage, there’s going to be apps that use LOTS of them. We’re talking almost a thousand shaders
in some cases. Ludicrously large code like that resulted in lots of ludicrous results (games spending over half an hour compiling shaders!). Clearly, the megashader solution wasn’t
sustainable.&lt;/p&gt;

&lt;h4 id=&quot;also-i-forgot-an-important-addition&quot;&gt;Also I forgot an important addition&lt;/h4&gt;

&lt;p&gt;Ray Tracing Pipelines also add pipeline libraries. You might have heard of them in the context of Graphics Pipeline Libraries, which were also really painful to implement in RADV.&lt;/p&gt;

&lt;p&gt;Pipeline libraries essentially allow you to create parts of your ray tracing pipeline beforehand, and then re-use these created parts all over other ray tracing pipelines. But
if we just paste all shaders into one chonker compute shader, we can’t compile it yet when creating a pipeline library, because other shaders will be added once a real pipeline
is created from it!&lt;/p&gt;

&lt;p&gt;This basically meant that we couldn’t do anything but copy the source code around, and start compiling only when the real pipeline is created. It also turned out that it’s valid
behaviour to query the stack size used for recursion from pipeline libraries, but because RADV didn’t compile any code yet, it didn’t even know what stack size the shaders from that
pipeline used.&lt;/p&gt;

&lt;h3 id=&quot;separate-shader-compilation&quot;&gt;Separate shader compilation&lt;/h3&gt;

&lt;p&gt;This is where separate shader compilation comes in. As the name suggests, most&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; shaders are compiled independently. Instead of using shader IDs to select what shader is called,
we store the VRAM addresses of the shaders and directly jump to whatever shaders we want to execute next.&lt;/p&gt;

&lt;p&gt;Launching the right shader directly is still impossible, because the shader binding table has to be read first to find out which shader that is. Instead, RADV creates a small piece of shader assembly that sets up necessary
parameters, reads the shader binding table, and then directly jumps to the selected shader (like it is done for shader calls).&lt;/p&gt;

&lt;p&gt;This allows us to compile shaders immediately when creating pipeline libraries. It also pretty much resolves the problem of chonker compute shaders taking ludicrously long to compile.
It also required basically reworking the entire ray tracing compilation infrastructure, but I think it forms a great basis for future work in the performance area.&lt;/p&gt;
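
&lt;p&gt;To make the difference concrete, here is a minimal Python sketch of the jump-based scheme. It is a toy model under stated assumptions, not RADV’s implementation: Python function objects stand in for VRAM addresses, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;launch&lt;/code&gt; function plays the role of the small piece of shader assembly, and all names are hypothetical.&lt;/p&gt;

```python
def raygen(state):
    state["ran"].append("raygen")
    # Stand-in for traceRayEXT: store the return "address", then hand back
    # the "address" of the hit or miss shader taken from the table.
    state["stack"].append(raygen_resume)
    return state["sbt"]["hit"] if state["ray_hits"] else state["sbt"]["miss"]


def raygen_resume(state):
    state["ran"].append("raygen (resumed)")
    return None  # nothing left to jump to


def closest_hit(state):
    state["ran"].append("closest-hit")
    return state["stack"].pop()  # jump back to the caller


def miss(state):
    state["ran"].append("miss")
    return state["stack"].pop()


def launch(sbt, ray_hits):
    """Toy prolog: read the table, then keep jumping until a shader returns None."""
    state = {"sbt": sbt, "ray_hits": ray_hits, "stack": [], "ran": []}
    target = sbt["raygen"]
    while target is not None:
        target = target(state)
    return state["ran"]


sbt = {"raygen": raygen, "hit": closest_hit, "miss": miss}
print(launch(sbt, ray_hits=True))
```

&lt;p&gt;Each function here is self-contained, so adding another shader means adding another table entry rather than growing one giant function.&lt;/p&gt;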

&lt;h2 id=&quot;faq&quot;&gt;FAQ&lt;/h2&gt;

&lt;h3 id=&quot;what-appsgames-does-radv-ray-tracing-run&quot;&gt;What apps/games does RADV ray tracing run?&lt;/h3&gt;

&lt;p&gt;Everything runs.&lt;/p&gt;

&lt;p&gt;In case you disagree, please &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues&quot;&gt;open an issue&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;how-well-do-ray-queries-run&quot;&gt;How well do ray queries run?&lt;/h3&gt;

&lt;p&gt;Pretty competitive with AMDVLK/the AMD Windows drivers! You’ll generally see similar, if not better, performance on RADV.&lt;/p&gt;

&lt;h3 id=&quot;how-well-do-pipelines-run&quot;&gt;How well do pipelines run?&lt;/h3&gt;

&lt;p&gt;Not well (expect significantly less performance compared to AMDVLK/Windows drivers). This is being worked on.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;RDNA3 introduces another instruction that helps with BVH traversal stack management, but RADV doesn’t use it yet. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is also what makes it so easy to support ray tracing even when there is no hardware acceleration (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RADV_PERFTEST=emulate_rt&lt;/code&gt;): Most of the traversal code can be reused, only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_bvh_intersect_ray&lt;/code&gt; needs to be replaced with a software equivalent. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren’t that ludicrous anymore. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Friedrich Vock</name></author><summary type="html">Yes, you heard that right. Ray Tracing Pipelines. On RADV. Enabled by default. Now merged in Mesa main.</summary></entry><entry><title type="html">GPU Hang Exploration: Splitgate</title><link href="https://pixelcluster.github.io/Hang-Exploration-Splitgate/" rel="alternate" type="text/html" title="GPU Hang Exploration: Splitgate" /><published>2023-05-11T00:00:00+00:00</published><updated>2023-05-11T00:00:00+00:00</updated><id>https://pixelcluster.github.io/Hang-Exploration-Splitgate</id><content type="html" xml:base="https://pixelcluster.github.io/Hang-Exploration-Splitgate/">&lt;p&gt;GPU hangs are one of the most common results of pretty much anything going
wrong GPU-side, and finding out why they occur isn’t always easy. In this blog
post, I’ll document my journey towards finding the cause of one specific
hang in the game “Splitgate”.&lt;/p&gt;

&lt;p&gt;Right off the bat, I noticed a few oddities with this particular hang.
Firstly, the actual game always ran completely fine. The only place where
it hung was on the first startup where the game automatically configures
graphics settings (I’ll call it autoconfiguration from here).&lt;/p&gt;

&lt;p&gt;Additionally, while I could reproduce the hang on my Steam Deck, I couldn’t
get the hang to appear on my desktop. I have an RDNA2 graphics card, which
is the same architecture as the Deck, so it seemed unlikely that specifics
about the hardware architecture were the problem here.&lt;/p&gt;

&lt;h2 id=&quot;api-validation&quot;&gt;API Validation&lt;/h2&gt;

&lt;p&gt;As a first step, I tried running the game with the Vulkan Validation Layers. If
the game is using the API in an invalid way and that is the cause of the hangs, there’s
a rather good chance the Validation Layers will catch it.&lt;/p&gt;

&lt;p&gt;Even though there were a few errors from the validation layers,
it seemed like none of the errors were actually relevant to the hang. 
Most importantly, the errors with autoconfiguration on were the same as the
errors with autoconfiguration off.&lt;/p&gt;

&lt;p&gt;Like any software, the Validation Layers aren’t perfect and can’t detect every
possible invalid behaviour. At this point I was still unsure whether I’d have to
search for the bug on the application side or on the driver side.&lt;/p&gt;

&lt;h2 id=&quot;api-dumping&quot;&gt;API dumping&lt;/h2&gt;

&lt;p&gt;With the validation layers being unable to detect any invalid behaviour by the app
during the autoconfiguration phase, another question comes to mind:
What &lt;em&gt;is&lt;/em&gt; the application doing, actually?&lt;/p&gt;

&lt;p&gt;To answer that, I utilized the API Dump Vulkan layer by LunarG. When this layer is
activated, it dumps all the calls made by the application, including every parameter
and return value, to standard output.&lt;/p&gt;

&lt;p&gt;While API dumps are good to have for debugging, large API dumps from large engines
are often difficult to navigate (not just because the dump is an 800MB file and
your text editor dies trying to scroll through it). Instead, it’s often best to
extract just the work that hangs for further debugging. But what frame is this?&lt;/p&gt;

&lt;h4 id=&quot;finding-the-hanging-submission&quot;&gt;Finding the hanging submission&lt;/h4&gt;

&lt;p&gt;The CPU and GPU do work asynchronously, which means that the CPU is free to
do more work while the GPU starts with its work. Somewhat unfortunately, this also
means the CPU can do more Vulkan calls which will show up in the API dump after
the app already submitted the hanging frame to the GPU. This means that I
couldn’t just look at the last command in the API dump and assume that command
caused the hang. Luckily, there were other hints towards what caused the hang.&lt;/p&gt;

&lt;p&gt;In Vulkan, when you want to know when a particular work submission finishes,
you give a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkFence&lt;/code&gt; to the submit function. Later, you can wait for
the submission to finish with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkWaitForFences&lt;/code&gt;, or you can query whether
the submission has already finished with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetFenceStatus&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I noticed that after work was submitted, the app seemed to call
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetFenceStatus&lt;/code&gt; from time to time, polling whether that submission was
finished. Usually, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetFenceStatus&lt;/code&gt; would return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_SUCCESS&lt;/code&gt; after a
few calls, indicating that the submission finished. However, there was one
submission where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetFenceStatus&lt;/code&gt; seemed to always return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_NOT_READY&lt;/code&gt;.
It seemed very likely that the GPU was hanging while executing that submission.&lt;/p&gt;
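
&lt;p&gt;The same check can be sketched as a small script. This is only an illustration under stated assumptions: the event tuples below are a made-up, simplified stand-in for the information in the API dump (not the layer’s real output format), and the sketch assumes fences are not reset and reused.&lt;/p&gt;

```python
def find_stuck_fences(events):
    """Return fences that were submitted but never reported VK_SUCCESS."""
    pending = []  # kept in submission order
    for event in events:
        kind, fence = event[0], event[1]
        if kind == "submit":
            pending.append(fence)
        elif kind == "status" and event[2] == "VK_SUCCESS" and fence in pending:
            pending.remove(fence)
    return pending


# Hypothetical event list: fence_a signals, fence_b and fence_c never do.
events = [
    ("submit", "fence_a"),
    ("status", "fence_a", "VK_NOT_READY"),
    ("status", "fence_a", "VK_SUCCESS"),
    ("submit", "fence_b"),
    ("status", "fence_b", "VK_NOT_READY"),
    ("submit", "fence_c"),
    ("status", "fence_b", "VK_NOT_READY"),
]
print(find_stuck_fences(events))
```

&lt;p&gt;The first fence in the returned list is the prime suspect. Anything submitted after it may simply be queued behind the hang.&lt;/p&gt;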

&lt;p&gt;To test my theory, I modified the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt;, which
you call for submitting work, to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkDeviceWaitIdle&lt;/code&gt; immediately after
submitting the work. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkDeviceWaitIdle&lt;/code&gt; waits for &lt;em&gt;all&lt;/em&gt; outstanding GPU work to
finish. When the GPU hangs, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; which caused the hang should
be the last line in the API dump&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This time, the API dump cut off at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; for which
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetFenceStatus&lt;/code&gt; always returned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_NOT_READY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Bingo.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-0.png&quot; alt=&quot;mr incredible&quot; width=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;going-lower-level&quot;&gt;Going lower-level&lt;/h2&gt;

&lt;p&gt;Now we know which submission hangs, but that submission still contains a lot of
commands. Even though the text editor now survives scrolling through the
commands, finding what is wrong by inspection is highly unlikely.&lt;/p&gt;

&lt;p&gt;Instead, I tried to answer the question: “What specific command is making the
GPU hang?”&lt;/p&gt;

&lt;p&gt;In order to find the answer, I needed to find out as much as possible about
what state the GPU is in when it hangs. There are a few useful tools which
helped me gather info:&lt;/p&gt;

&lt;h4 id=&quot;umr&quot;&gt;umr&lt;/h4&gt;

&lt;p&gt;The first thing I did was use &lt;a href=&quot;https://gitlab.freedesktop.org/tomstdenis/umr&quot;&gt;umr&lt;/a&gt;
to query if any waves were active at the time of the hang. Waves (or Wavefronts)
are groups of 32 or 64 shader invocations (or threads in DX terms) that the GPU
executes at the same time&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. There were indeed quite a few waves currently
executing. For each wave, umr can show a disassembly of the GPU code that is
currently executing, as well as the values of all registers, and more.&lt;/p&gt;

&lt;p&gt;In this case, I was especially interested in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;halt&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fatal_halt&lt;/code&gt; status
bits for each wave. These bits are set when the wave encounters a fatal
exception (for example dereferencing invalid pointers) and won’t continue
execution. These bits were not set for any waves I inspected, so it was
unlikely that exceptions in a shader were causing the hang.&lt;/p&gt;

&lt;p&gt;Aside from exceptions, the other common way for shaders to trigger GPU hangs is
by accidentally executing infinite loops. But the shader code currently executing
was very simple and didn’t even have a jump instruction anywhere, so the hang
couldn’t be caused by infinite loops either.&lt;/p&gt;

&lt;h4 id=&quot;radv_debughang&quot;&gt;RADV_DEBUG=hang&lt;/h4&gt;

&lt;p&gt;Shaders aren’t the only thing that the GPU executes, and as such shaders aren’t
the only thing that can cause GPU hangs.&lt;/p&gt;

&lt;p&gt;In RADV, command buffers recorded in Vulkan are translated to a
hardware-specific command buffer format called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PKT3&lt;/code&gt;&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Commands encoded
in this format are written to GPU-accessible memory, and executed by the
GPU’s command processor (CP for short) when the command buffer is submitted.&lt;/p&gt;

&lt;p&gt;These commands might also be involved in the hang, so I tried finding out which
commands the CP was executing when the hang happened. RADV has integrated debug
functionality that can help with exactly this, which can be enabled by setting
an environment variable named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RADV_DEBUG&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;hang&quot;&lt;/code&gt;. But when I tried
triggering the hang with this environment variable in place, it started up just
fine!&lt;/p&gt;

&lt;p&gt;This isn’t the first time I’ve seen this. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RADV_DEBUG=hang&lt;/code&gt; has a funny side
effect: It also inserts commands to wait for draws or dispatches to complete
immediately after the dispatch is triggered. This immensely helps with
figuring out which shader is faulty if there are multiple shaders executing
concurrently. But it also prevents certain hangs from happening: namely those where things
executing concurrently &lt;em&gt;cause&lt;/em&gt; the hang in the first place.&lt;/p&gt;

&lt;p&gt;In other words, we seem to be looking at a synchronization issue.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-1.png&quot; alt=&quot;uncanny mr incredible&quot; height=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;synchronization-boogaloo&quot;&gt;Synchronization boogaloo&lt;/h2&gt;

&lt;p&gt;Even though we know we’re dealing with a synchronization issue, the original
question remains unsolved: What command causes the hang?&lt;/p&gt;

&lt;p&gt;The “sync after every draw/dispatch” method of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RADV_DEBUG=hang&lt;/code&gt; fixes the
issue, but it has a very broad effect. Since the issue seems to reproduce
very reliably (which in itself is a rarity for synchronization bugs), we
can apply that sync selectively to only some draws or dispatches to narrow
down what commands exactly cause the hangs.&lt;/p&gt;

&lt;p&gt;First, I tried restricting the synchronization to only apply to dispatches
(so no draws were synchronized). This made the hang appear again. Testing
the other way around (restricting the synchronization to only draws) confirmed:
All compute dispatches were fine, the issue was about draw synchronization
only.&lt;/p&gt;

&lt;p&gt;Next, I tried only synchronizing at the end of renderpasses. This also fixed
the hang. However, synchronizing at the start of renderpasses fixed nothing.
Therefore it was impossible that missing synchronization across renderpasses
was the cause of the hang.&lt;/p&gt;

&lt;p&gt;The last likely option was that there was missing synchronization in between
the draws and something in between renderpasses.&lt;/p&gt;

&lt;p&gt;At this point, the API dump of the hanging submission proved very helpful.
Upon taking a closer look, it became clear that the commands in the submitted
command buffer had a very simple pattern (some irrelevant commands omitted for brevity):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBeginRenderPass&lt;/code&gt; to begin a new renderpass&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDraw&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdEndRenderPass&lt;/code&gt;, ending the renderpass&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdWriteTimestamp&lt;/code&gt;, writing the current elapsed time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What stuck out to me was that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdWriteTimestamp&lt;/code&gt; was called with a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pipelineStage&lt;/code&gt; of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT&lt;/code&gt;. In simple terms, this means
that the timestamp can be written before the preceding draw finished.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Further testing confirmed: If I insert synchronization before writing the
timestamp, the hang is fixed. Inserting synchronization immediately
after writing the timestamp makes the hang re-appear.&lt;/p&gt;

&lt;h2 id=&quot;how-hard-can-writing-a-timestamp-be&quot;&gt;How hard can writing a timestamp be?&lt;/h2&gt;

&lt;p&gt;By now, it has become pretty clear that timestamp queries are the problem here.
But it just didn’t really make sense that the timestamp write itself would
hang.&lt;/p&gt;

&lt;p&gt;Timestamp writes on AMD hardware don’t require launching any shaders.
They can be implemented using one PKT3 command called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COPY_DATA&lt;/code&gt;&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, which
accepts many data sources other than memory. One of these data sources is the
current timestamp. RADV uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COPY_DATA&lt;/code&gt; to write the timestamp to memory.
The memory for these timestamps is managed by the driver, so it’s exceedingly
unlikely the memory write would fail.&lt;/p&gt;

&lt;p&gt;From the wave analysis with umr earlier I also knew that the in-flight shaders
didn’t actually write or read any memory that might interfere with the
timestamp write (somehow). The timestamp write itself being the cause of the
hang seemed impossible.&lt;/p&gt;

&lt;h2 id=&quot;taking-a-step-back&quot;&gt;Taking a step back&lt;/h2&gt;

&lt;p&gt;If timestamp writes can’t be the problem, what else can there be that might
hang the GPU?&lt;/p&gt;

&lt;p&gt;There is one other part to timestamp queries aside from writing the timestamp
itself: In Vulkan, timestamps are always written to opaque “query pool”
objects. In order to actually view the timestamp value, an app has to copy the
results stored in the query pool to a buffer in CPU or GPU memory. Splitgate
uses Unreal Engine 4, which has a known bug related to query pool copies that
RADV has to &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/amd/vulkan/radv_query.c#L1555-1559&quot;&gt;work around&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It isn’t too far-fetched to think there might be other bugs in UE’s Vulkan
RHI regarding query copies. Synchronizing the query copy didn’t do anything,
but just commenting out the query copy fixed the hang as well.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-2.png&quot; alt=&quot;uncannier mr incredible&quot; height=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2&gt;????&lt;/h2&gt;

&lt;p&gt;Up until this point, I was pretty sure that something about the timestamp write
must be the cause of the problems. Now it seemed like query copies might also
influence the problem somehow? I was pretty unsure how to reconcile these two
observations, so I tried finding out more about how exactly the query copy
affected things.&lt;/p&gt;

&lt;p&gt;Query copies on RADV are implemented using small compute shaders written
directly in &lt;a href=&quot;https://docs.mesa3d.org/nir/index.html&quot;&gt;NIR&lt;/a&gt;. Having the simple
driver-internal shaders in NIR is a nice and simple way of storing them inside
the driver, but they’re a bit hard to read for people not used to the syntax.
For demonstration purposes I’ll use a GLSL translation of the shader&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;.
The copy shader for timestamp queries looks like this:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;layout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;binding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;layout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;binding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_QUERY_RESULT_64_BIT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_QUERY_RESULT_WITH_AVAILABILITY_BIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dst_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;available&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst_offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src_offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIMESTAMP_NOT_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;available&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_QUERY_RESULT_PARTIAL_BIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;available&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_QUERY_RESULT_64_BIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dst_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dst_offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dst_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dst_offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_QUERY_RESULT_WITH_AVAILABILITY_BIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dst_buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dst_offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;available&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At first, I tried commenting out the stores to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dst_buf&lt;/code&gt;, which resulted in the
hangs disappearing again. This can indicate that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dst_buf&lt;/code&gt; is the problem, but
it’s not the only possibility. Without the stores, the loaded value is no longer
used anywhere in the shader, so the compiler is free to optimize out the load
too. Removing the stores could therefore just as well be masking an invalid
read. When I commented out the read and always stored a constant instead, it
also didn’t hang!&lt;/p&gt;

&lt;p&gt;But could it be that the shader was reading from an invalid address? Splitgate
is far from the only app out there using timestamp queries, and those apps
all work fine - so it can’t just be fundamentally broken, right?&lt;/p&gt;

&lt;p&gt;To test this out, I modified the timestamp write command once again. Remember
how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PKT3_COPY_DATA&lt;/code&gt; is really versatile? Aside from copying memory and
timestamps, it can also copy a 32/64-bit constant supplied as a parameter.
I undid all the modifications to the copy shader and forced a constant to be
written instead of timestamps. No hangs to be seen.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-3.png&quot; alt=&quot;even uncannier mr incredible&quot; height=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;-1&quot;&gt;?????????&lt;/h2&gt;

&lt;p&gt;It seems like aside from the synchronization, the value that is written as the
timestamp influences whether a hang happens or not. But that also means neither
of the two things already investigated can actually be the source of the hang,
can they?&lt;/p&gt;

&lt;p&gt;It’s essentially the same question as in the beginning, still unanswered:&lt;br /&gt;
“What the heck is hanging here???”&lt;/p&gt;

&lt;h3 id=&quot;radv_debughang-but-useful-this-time&quot;&gt;RADV_DEBUG=hang (but useful this time)&lt;/h3&gt;

&lt;p&gt;Stabbing in the dark with more guesses won’t help here. The only thing that can
is more info. I already had a small GPU buffer that I used for some other
debugging I skipped over. To get definitive info on whether it hangs because of
the timestamp write, the timestamp copy, or something else entirely, I modified
the command buffer recording to write some magic numbers into that debug buffer
whenever these operations happened. It went something along the lines of:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;write &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xAAAAAAAA&lt;/code&gt; if timestamp write is complete&lt;/li&gt;
  &lt;li&gt;write &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xBBBBBBBB&lt;/code&gt; if timestamp copy is complete&lt;/li&gt;
&lt;/ul&gt;
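
&lt;p&gt;As a rough CPU-side sketch of that marker scheme (not the actual RADV change): a single debug slot is overwritten after each stage and decoded afterwards. The function names, the single-slot layout and the return codes are assumptions made for illustration; only the two magic values come from the list above.&lt;/p&gt;

```c
/* Marker values from the list above; MAGIC_NONE is the untouched slot. */
#define MAGIC_NONE       0x00000000u
#define MAGIC_WRITE_DONE 0xAAAAAAAAu
#define MAGIC_COPY_DONE  0xBBBBBBBBu

/* Stand-in for the GPU-visible debug buffer (hypothetical single slot). */
static unsigned debug_slot = MAGIC_NONE;

/* Called right after the corresponding operation in this simulation. */
static void mark_timestamp_write_done(void) { debug_slot = MAGIC_WRITE_DONE; }
static void mark_timestamp_copy_done(void)  { debug_slot = MAGIC_COPY_DONE; }

/* Decode the slot after a hang:
 *   0  = neither operation completed
 *   1  = the timestamp write completed, the copy did not
 *   2  = the timestamp copy completed
 *  -1  = unexpected value (the slot was corrupted or never initialized) */
static int last_completed_stage(unsigned magic)
{
    switch (magic) {
    case MAGIC_NONE:       return 0;
    case MAGIC_WRITE_DONE: return 1;
    case MAGIC_COPY_DONE:  return 2;
    default:               return -1;
    }
}
```

&lt;p&gt;Reading back &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xAAAAAAAA&lt;/code&gt; would then point at the copy, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xBBBBBBBB&lt;/code&gt; at something after the copy, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x0&lt;/code&gt; at something before the write.&lt;/p&gt;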

&lt;p&gt;However, I still needed to ensure I only read the magic numbers after the
GPU had time to execute the commands writing them (without waiting forever
during GPU hangs). This required a different intricate and elaborate
synchronization algorithm.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// VERY COMPLICATED SYNCHRONIZATION&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With that out of the way, let’s take a look at the magic number of the hanging
submission.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Magic: 0x0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;what???&lt;/em&gt; This means &lt;em&gt;neither the write nor the copy&lt;/em&gt; has executed? Alright, what if
I add another command writing a magic number right at the beginning of the
command buffer?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Magic: 0x0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So… the hang happens before the command buffer starts executing? Something
can’t be right here.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-4.png&quot; alt=&quot;even more uncannier mr incredible&quot; height=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At this point I started logging all submits that contained either timestamp
writes or timestamp copies, and I noticed that there was another submission
with the same pattern of commands right before the hanging one.&lt;/p&gt;

&lt;h2 id=&quot;multi-submit-madness&quot;&gt;Multi-submit madness&lt;/h2&gt;

&lt;p&gt;This previous submission had executed just fine - all timestamps were written,
all shaders finished without hangs. This meant that neither the way timestamps
were written nor the way they were copied could be direct causes of hangs,
because they worked just one submission prior.&lt;/p&gt;

&lt;p&gt;I verified this theory by forcing full shader synchronization to happen before
the timestamp write, but only for the submission that actually hangs. To my
surprise, this did nothing to fix the hangs.&lt;/p&gt;

&lt;p&gt;When I applied the synchronization trick to the previous submit (that always
worked fine!), the hangs stopped appearing.&lt;/p&gt;

&lt;p&gt;It seems like the cause of the hang is not in the hanging submission, but in a
completely separate one that completed successfully.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/uncanny-mr-incredible-5.png&quot; alt=&quot;most uncanny mr incredible&quot; height=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-is-the-app-doing&quot;&gt;What is the app doing?&lt;/h2&gt;

&lt;p&gt;Let’s rewind to the question that started this whole mess. “What is the app
doing?”&lt;/p&gt;

&lt;p&gt;Splitgate (as of today) uses Unreal Engine 4.27.2. Luckily, Epic Games make the
source code of UE available to anyone registering for it with their Epic Games
account. There was hope that the benchmark code they were using was built into
Unreal, where I could examine what exactly it does.&lt;/p&gt;

&lt;p&gt;Searching in the game logs from a run with the workaround enabled, I found this:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;LogSynthBenchmark: Display: Graphics:
LogSynthBenchmark: Display:   Adapter Name: 'AMD Custom GPU 0405 (RADV VANGOGH)'
LogSynthBenchmark: Display:   (On Optimus the name might be wrong, memory should be ok)
LogSynthBenchmark: Display:   Vendor Id: 0x1002
LogSynthBenchmark: Display:   Device Id: 0x163F
LogSynthBenchmark: Display:   Device Revision: 0x0
LogSynthBenchmark: Display:   GPU first test: 0.06s
LogSynthBenchmark: Display:          ... 3.519 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 2.804 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 2.487 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 8.917 s/GigaPix, Confidence=100% 'FillOnly' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 0.330 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 0.951 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be very inaccurate)
LogSynthBenchmark: Display:          ... 6.053 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be very inaccurate)
LogSynthBenchmark: Display:   GPU second test: 0.54s
LogSynthBenchmark: Display:          ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 9.127 s/GigaPix, Confidence=100% 'FillOnly' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be inaccurate)
LogSynthBenchmark: Display:          ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be inaccurate)
LogSynthBenchmark: Display:   GPU Final Results:
LogSynthBenchmark: Display:          ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise'
LogSynthBenchmark: Display:          ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy'
LogSynthBenchmark: Display:          ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy'
LogSynthBenchmark: Display:          ... 9.127 s/GigaPix, Confidence=100% 'FillOnly'
LogSynthBenchmark: Display:          ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth'
LogSynthBenchmark: Display:          ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1'
LogSynthBenchmark: Display:          ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FSynthBenchmark&lt;/code&gt; indeed appears in the UE codebase as a
benchmark tool to auto-calibrate settings. From reading its code, it seemed
like it does 3 separate benchmark ru…&lt;/p&gt;

&lt;p&gt;wait. 3??&lt;/p&gt;

&lt;p&gt;We can clearly see from the logs there are only two benchmark runs. Maybe the
third run hangs the GPU somehow?&lt;/p&gt;

&lt;h2 id=&quot;hang-well-yes-but-actually-no&quot;&gt;Hang? Well yes, but actually no&lt;/h2&gt;

&lt;p&gt;While thinking about this, another possibility came to my mind. The GPU driver
can’t actually tell whether the GPU is hung because of some fatal error or
whether some work is just taking an obscenely long time. No matter which
it is, if the work isn’t finished in 10 seconds, the GPU will be reset.&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;So what if the hang I’ve been chasing all this time isn’t actually a hang? How
do I even find out?&lt;/p&gt;

&lt;p&gt;The amdgpu kernel driver has a parameter named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lockup_timeout&lt;/code&gt; for this exact
purpose: You can modify this parameter to change the amount of time after which
the GPU is reset if a job doesn’t finish, or disable this GPU reset entirely.
To test this theory, I went with disabling the GPU reset.&lt;/p&gt;
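
&lt;p&gt;A minimal model of that decision, assuming a watchdog that only looks at elapsed time. This is not the amdgpu implementation, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TIMEOUT_DISABLED&lt;/code&gt; is a sentinel invented for this sketch; it does not reproduce how the real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lockup_timeout&lt;/code&gt; parameter encodes its values.&lt;/p&gt;

```c
/* Timeout named in the text: unfinished work is reset after 10 seconds. */
#define DEFAULT_TIMEOUT_MS 10000ull
/* Hypothetical sentinel for this sketch only, meaning "never reset". */
#define TIMEOUT_DISABLED 0ull

struct gpu_job {
    unsigned long long elapsed_ms; /* how long the job has been running */
    int making_progress;           /* nonzero if it would eventually finish */
};

/* Returns 1 if the watchdog would reset the GPU for this job, else 0. */
static int watchdog_should_reset(struct gpu_job job,
                                 unsigned long long timeout_ms)
{
    if (timeout_ms == TIMEOUT_DISABLED)
        return 0;
    /* job.making_progress is deliberately not consulted: the watchdog in
     * this model has no way to observe it, so a slow job and a truly hung
     * job are treated the same. */
    return job.elapsed_ms > timeout_ms;
}
```

&lt;p&gt;With the timeout in place, a job that has been running for a minute is reset whether or not it would ever have finished. With the timeout disabled, the slow job gets the chance to complete.&lt;/p&gt;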

&lt;p&gt;After setting all the parameters up and rebooting, I started the game another
time.&lt;/p&gt;

&lt;p&gt;And it worked! It took a really long time, but eventually, the game started
up fully. It was indeed just hammering the poor Deck’s GPU with work that
took way too long.&lt;/p&gt;

&lt;h2 id=&quot;why-does-my-workaround-work&quot;&gt;Why does my workaround work?&lt;/h2&gt;

&lt;p&gt;Finally, things start clearing up a bit. There is still an open question,
though: What does the workaround do to prevent this?&lt;/p&gt;

&lt;p&gt;The code that runs the 3 benchmark passes doesn’t run them
unconditionally. Instead, the 3 benchmarks have an increasingly large
workload (each roughly 10x as much as the previous one). Comments nearby
explain that this choice was made because the larger benchmark runs cause
driver resets on low-end APUs (hey, that’s exactly the problem we’re
having!). The code measures the time it takes for the benchmark workloads to
complete using the timestamp queries, and if the total benchmark time is beyond
a certain point, it skips the remaining benchmark runs.&lt;/p&gt;
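
&lt;p&gt;The gating described above might look roughly like the following sketch. It is not UE4’s code: the function name, the time budget and the assumption that measured time grows linearly with the workload are all made up for illustration. Only the 3 passes and the roughly 10x scaling come from the text.&lt;/p&gt;

```c
#define NUM_PASSES 3
#define WORK_SCALE 10.0 /* each pass has roughly 10x the previous workload */

/* Returns how many benchmark passes get launched.
 * measured_first_pass_s: time the first pass appears to take, in seconds,
 *                        as derived from the timestamp queries.
 * budget_s:              hypothetical cutoff for the total benchmark time. */
static int passes_launched(double measured_first_pass_s, double budget_s)
{
    double total_s = 0.0;
    double pass_s = measured_first_pass_s;
    int launched = 0;

    while (launched != NUM_PASSES) {
        launched++;
        total_s += pass_s;
        if (total_s > budget_s)
            break;            /* too slow already: skip the remaining passes */
        pass_s *= WORK_SCALE; /* assume time scales with the workload */
    }
    return launched;
}
```

&lt;p&gt;With a budget of 0.5 seconds, a first pass measured at 0.06 seconds stops after two passes, while one measured at 0.001 seconds runs all three. The decision is only as good as the measured time fed into it.&lt;/p&gt;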

&lt;p&gt;If you’ve been paying extremely close attention all the way until here, you
might notice a small problem. UE4 interprets the timestamp values as the
time until the benchmark workload completes. But as I pointed out all the way
near the beginning, the timestamp can be written before the benchmark workload
is even finished!&lt;/p&gt;

&lt;p&gt;If the timestamp is written before the benchmark workload finishes, the
measured benchmark time is much less than the workload actually took.
In practice, this makes the benchmark results indicate a much faster GPU
than is actually present. I assume this led to the third benchmark (which was too
heavy for the Deck GPU) being launched. My desktop GPU seems to be powerful
enough to get through the benchmark before the lockup timeout, which is why
I couldn’t reproduce the issue there.&lt;/p&gt;
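
&lt;p&gt;To make the underestimation concrete, here is a small model of the two timestamp behaviours. It is a sketch with invented names and tick values, not driver code: one variant records the end timestamp as soon as the command processor reaches the command, the other only after the workload has finished.&lt;/p&gt;

```c
/* All values are in arbitrary GPU ticks. */

/* End timestamp as it would be recorded in this model.
 * cp_reach_time:  when the command processor reaches the timestamp command.
 * work_done_time: when the preceding benchmark workload actually finishes.
 * wait_for_work:  nonzero if the write is delayed until the work is done. */
static unsigned long long end_timestamp(unsigned long long cp_reach_time,
                                        unsigned long long work_done_time,
                                        int wait_for_work)
{
    if (wait_for_work) {
        /* Written once both the command is reached and the work is done. */
        return work_done_time > cp_reach_time ? work_done_time
                                              : cp_reach_time;
    }
    /* Written immediately, regardless of outstanding work. */
    return cp_reach_time;
}

/* What the application computes from its two timestamp queries. */
static unsigned long long measured_duration(unsigned long long start_ts,
                                            unsigned long long end_ts)
{
    return end_ts - start_ts;
}
```

&lt;p&gt;For a workload that starts at tick 100 and finishes at tick 5100, with the command processor reaching the end timestamp at tick 110, the early write yields a measured duration of 10 ticks instead of 5000.&lt;/p&gt;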

&lt;p&gt;In the end, the hack I originally found to work around the issue turned out
to be a
&lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22823&quot;&gt;fitting workaround&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And I even got to make my first bug report for Unreal Engine.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/memes/canny-mr-incredible.png&quot; alt=&quot;canny mr incredible&quot; width=&quot;250&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If an app is using Vulkan on multiple threads, this might not always be the case. This is a rare case where I’m grateful for Unreal Engine to have a single RHI thread. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Nvidia also calls them “warps”. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Short for “Packet 3”. Packet 2, 1 and 0 also exist, although they aren’t widely used on newer AMD hardware. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If you insert certain pipeline barriers, writing the timestamp early would be disallowed, but these barriers weren’t there in this case. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This command writes the timestamp immediately when the CP executes it. There is another command which waits for previous commands to finish before writing the timestamp. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;You can also view the original NIR and the GLSL translation &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/0b251d43/src/amd/vulkan/radv_query.c#L531&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As it turned out later, the debugging method was flawed. In actuality, both timestamp writes and copies completed successfully, but the writes indicating this seemed to be still in the write cache. Forcing the memory containing the magic number to be uncached solved this. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Usually, the kernel driver can also command the GPU to kill whatever job it is doing right now. For some reason, it didn’t work here though. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Friedrich Vock</name></author><summary type="html">GPU hangs are one of the most common results of pretty much anything going wrong GPU-side, and finding out why they occur isn’t always easy. In this blog post, I’ll document my journey towards finding the cause of one specific hang in the game “Splitgate”.</summary></entry></feed>