<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.3">Jekyll</generator><link href="https://www.supergoodcode.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.supergoodcode.com/" rel="alternate" type="text/html" /><updated>2024-01-08T19:50:05+00:00</updated><id>https://www.supergoodcode.com/feed.xml</id><title type="html">Mike Blumenkrantz</title><subtitle>Super. Good. Chair.</subtitle><entry><title type="html">First Bug Down</title><link href="https://www.supergoodcode.com/first-bug-down/" rel="alternate" type="text/html" title="First Bug Down" /><published>2024-01-08T00:00:00+00:00</published><updated>2024-01-08T00:00:00+00:00</updated><id>https://www.supergoodcode.com/first-bug-down</id><content type="html" xml:base="https://www.supergoodcode.com/first-bug-down/">&lt;h1 id=&quot;slow-start&quot;&gt;Slow Start&lt;/h1&gt;

&lt;p&gt;It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games.&lt;/p&gt;

&lt;p&gt;Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is &lt;del&gt;my old code&lt;/del&gt; the mesa bug tracker.&lt;/p&gt;

&lt;p&gt;Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins.&lt;/p&gt;

&lt;h1 id=&quot;another-game-ive-never-heard-of&quot;&gt;Another Game I’ve Never Heard Of&lt;/h1&gt;
&lt;p&gt;Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain.&lt;/p&gt;

&lt;p&gt;Next bug up is from this game called &lt;a href=&quot;https://store.steampowered.com/app/892970/Valheim/&quot;&gt;Valheim&lt;/a&gt;. I think it’s a LARPing chess game? Something like that? Don’t @ me.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues/10386&quot;&gt;This report&lt;/a&gt; came in hot over the break with some rad new shading techniques:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/uploads/549fc90c96a105272133823b090a4ba2/valheim-glitch-4.png&quot;&gt;&lt;img src=&quot;https://gitlab.freedesktop.org/mesa/mesa/uploads/549fc90c96a105272133823b090a4ba2/valheim-glitch-4.png&quot; alt=&quot;hm&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks way cooler if you play the trace, but you get the idea.&lt;/p&gt;

&lt;h1 id=&quot;pinpoint-accuracy&quot;&gt;Pinpoint Accuracy&lt;/h1&gt;
&lt;p&gt;First question: what in the Sam Hill is going on here?&lt;/p&gt;

&lt;p&gt;Apparently &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RADV_DEBUG=hang&lt;/code&gt; fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “&lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26930&quot;&gt;PLZ SEND REVIEWS!!&lt;/a&gt;” Pitoiset) this flag synchronizes the queue after every submit.&lt;/p&gt;

&lt;p&gt;It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost.&lt;/p&gt;

&lt;p&gt;My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present.&lt;/p&gt;

&lt;p&gt;Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zink_flush()&lt;/code&gt;, which is where API flush calls filter through. There were the usual end-of-frame hits, but there were a fair number of calls originating from &lt;a href=&quot;https://registry.khronos.org/OpenGL-Refpages/gl4/html/glFenceSync.xhtml&quot;&gt;glFenceSync&lt;/a&gt;, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing.&lt;/p&gt;

&lt;p&gt;So I saw these calls coming in, and I stepped through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zink_flush()&lt;/code&gt;, and I reached &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/b06f6e00fba6e33c28a198a1bb14b89e9dfbb4ae/src/gallium/drivers/zink/zink_context.c#L3866&quot;&gt;this&lt;/a&gt; spot:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;has_work&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;lt;-----&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HERE&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pfence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;cm&quot;&gt;/* reuse last fence */&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;fence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_fence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;deferred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zink_batch_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zink_batch_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_fence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;sync_flush&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_device_lost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;check_device_lost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
         &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;track_renderpasses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;tc_driver_internal_flush_notify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;fence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;submit_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;usage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;submit_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;deferred&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PIPE_FLUSH_FENCE_FD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pfence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;deferred_fence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;flush_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who &lt;em&gt;don’t&lt;/em&gt; know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush.&lt;/p&gt;

&lt;p&gt;OR DO YOU?&lt;/p&gt;

&lt;p&gt;For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.&lt;/p&gt;

&lt;h1 id=&quot;synchronization-you-cannot-escape&quot;&gt;Synchronization: You Cannot Escape&lt;/h1&gt;
&lt;p&gt;The reason this was especially puzzling is the call sequence was:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;end-of-frame flush&lt;/li&gt;
  &lt;li&gt;present&lt;/li&gt;
  &lt;li&gt;glFenceSync flush&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these &lt;em&gt;should&lt;/em&gt; be identical in terms of operations the app would want to wait on.&lt;/p&gt;

&lt;p&gt;Except that there’s a present in there, and technically that’s a queue submission, and &lt;em&gt;technically&lt;/em&gt; something might want to know if the submit for that has completed?&lt;/p&gt;

&lt;p&gt;Why yes, that &lt;em&gt;is&lt;/em&gt; stupid, but here at SGC, stupidity is our sustenance.&lt;/p&gt;
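
&lt;p&gt;If that’s hard to picture, here’s a toy model of the sequence. To be clear, this is a sketch and not zink’s actual code or the actual fix: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_ctx&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_ctx_present()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_ctx_flush()&lt;/code&gt; are names invented for illustration, and a plain counter stands in for real queue submissions.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdbool.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

/* Toy model only: none of these names exist in zink. */
struct toy_ctx {
   uint32_t submit_count;  /* queue submissions made so far */
   uint32_t last_fence;    /* submit id the most recent fence waits on */
   bool has_work;          /* commands recorded since the last flush */
};

/* A present counts as a queue submission, but it hands out no fence. */
void toy_ctx_present(struct toy_ctx *ctx)
{
   ctx-&amp;gt;submit_count++;
}

/* Returns the submit id that the caller's fence will wait on. */
uint32_t toy_ctx_flush(struct toy_ctx *ctx, bool reuse_last_fence)
{
   /* the penalty box: nothing to flush, so hand back the old fence */
   if (!ctx-&amp;gt;has_work &amp;amp;&amp;amp; reuse_last_fence)
      return ctx-&amp;gt;last_fence;
   /* otherwise make a real submission and fence that instead */
   ctx-&amp;gt;has_work = false;
   ctx-&amp;gt;last_fence = ++ctx-&amp;gt;submit_count;
   return ctx-&amp;gt;last_fence;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;has_work&lt;/code&gt; set and run flush, present, flush: the end-of-frame flush fences submit 1, the present bumps the counter to 2, and the reused fence still points at submit 1. Pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reuse_last_fence&lt;/code&gt; and the second flush fences submit 3, which is after the present.&lt;/p&gt;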

&lt;p&gt;Anyway, I &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26935&quot;&gt;blasted out&lt;/a&gt; a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.&lt;/p&gt;</content><author><name></name></author><summary type="html">Slow Start</summary></entry><entry><title type="html">Manifesto</title><link href="https://www.supergoodcode.com/manifesto/" rel="alternate" type="text/html" title="Manifesto" /><published>2024-01-02T00:00:00+00:00</published><updated>2024-01-02T00:00:00+00:00</updated><id>https://www.supergoodcode.com/manifesto</id><content type="html" xml:base="https://www.supergoodcode.com/manifesto/">&lt;h1 id=&quot;this-is-it&quot;&gt;This Is It.&lt;/h1&gt;

&lt;p&gt;It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS.&lt;/p&gt;

&lt;p&gt;—is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year.&lt;/p&gt;

&lt;p&gt;It’s the year a decades-old plan has finally yielded its dividends.&lt;/p&gt;

&lt;h1 id=&quot;truth&quot;&gt;Truth.&lt;/h1&gt;
&lt;p&gt;You’ve all heard certain improbable claims before. &lt;em&gt;Big Triangle&lt;/em&gt; this. &lt;em&gt;Big Triangle&lt;/em&gt; that. Everyone knows who they are. Some have even &lt;a href=&quot;https://github.com/zmike/vkoverhead/pull/24#issuecomment-1734067828&quot;&gt;accused me&lt;/a&gt; of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world.&lt;/p&gt;

&lt;p&gt;I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks.&lt;/p&gt;

&lt;p&gt;In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal.&lt;/p&gt;

&lt;p&gt;The goal of infiltrating Big Triangle.&lt;/p&gt;

&lt;p&gt;More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose.&lt;/p&gt;

&lt;p&gt;Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources.&lt;/p&gt;

&lt;p&gt;I have become an officer.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/itsreal.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/itsreal.png&quot; alt=&quot;itsreal.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am the chair.&lt;/p&gt;

&lt;h1 id=&quot;revolution&quot;&gt;Revolution.&lt;/h1&gt;
&lt;p&gt;Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebears!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;OpenGL 10.0 by 2025!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Compatibility Profile shall be renamed ‘SLOW MODE’&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Depth values shall be uniformly scaled across all hardware and platforms!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;XFB shall be outlawed!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Linux game ports shall no longer link to LLVM!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Coherent API error messages shall be printed!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vendors which cannot ship functional Windows GL drivers shall ship Zink!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Native GL drivers on mobile platforms shall be outlawed!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Mesh and ray-tracing extensions from NVIDIA shall become core functionality!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;GLX shall be deleted and forgotten!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rise up and join me, your new GL/ES chair, in the glorious revolution!&lt;/p&gt;

&lt;h1 id=&quot;disclaimer&quot;&gt;DISCLAIMER&lt;/h1&gt;
&lt;p&gt;Obviously this is all a joke (except the part where I’m the 🪑, that’s &lt;a href=&quot;https://www.khronos.org/about/working-group-officers/&quot;&gt;100% real af&lt;/a&gt;), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously.&lt;/p&gt;

&lt;p&gt;Happy New Year. I missed you.&lt;/p&gt;</content><author><name></name></author><summary type="html">This Is It.</summary></entry><entry><title type="html">2024</title><link href="https://www.supergoodcode.com/2024/" rel="alternate" type="text/html" title="2024" /><published>2023-10-27T00:00:00+00:00</published><updated>2023-10-27T00:00:00+00:00</updated><id>https://www.supergoodcode.com/2024</id><content type="html" xml:base="https://www.supergoodcode.com/2024/">&lt;p&gt;🪑?&lt;/p&gt;</content><author><name></name></author><summary type="html">🪑?</summary></entry><entry><title type="html">Readback</title><link href="https://www.supergoodcode.com/readback/" rel="alternate" type="text/html" title="Readback" /><published>2023-10-26T00:00:00+00:00</published><updated>2023-10-26T00:00:00+00:00</updated><id>https://www.supergoodcode.com/readback</id><content type="html" xml:base="https://www.supergoodcode.com/readback/">&lt;h1 id=&quot;and-now-for-something-slightly-more-technical&quot;&gt;And Now For Something Slightly More Technical&lt;/h1&gt;

&lt;p&gt;It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here’s one last big technical post about something I hate.&lt;/p&gt;

&lt;p&gt;Swapchain readback.&lt;/p&gt;

&lt;h1 id=&quot;so-easy-even-you-could-accidentally-do-it&quot;&gt;So Easy Even You Could Accidentally Do It&lt;/h1&gt;
&lt;p&gt;I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.&lt;/p&gt;

&lt;p&gt;I recently encountered a scenario in &lt;strong&gt;REDACTED&lt;/strong&gt; where this behavior was commonplace. The command stream looked roughly like this:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;draw some stuff&lt;/li&gt;
  &lt;li&gt;swapbuffers&lt;/li&gt;
  &lt;li&gt;blitframebuffer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And this happened on every single frame (???).&lt;/p&gt;

&lt;h1 id=&quot;in-zink-terms&quot;&gt;In Zink Terms…&lt;/h1&gt;
&lt;p&gt;This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.&lt;/p&gt;

&lt;p&gt;Vulkan doesn’t allow readback from swapchains. By this, I mean:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;swapchain images must be acquired before they can be accessed for any purpose&lt;/li&gt;
  &lt;li&gt;there is no method to explicitly reacquire a specific swapchain image&lt;/li&gt;
  &lt;li&gt;there is no guarantee that swapchain images are unchanged after present&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Taken together, these mean that once you have presented a swapchain image, you’re screwed.&lt;/p&gt;

&lt;p&gt;…According to the spec, that is. In the real world, things work differently.&lt;/p&gt;

&lt;p&gt;Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.&lt;/p&gt;
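
&lt;p&gt;The present/acquire spam can be sketched with a toy swapchain. Big assumptions here: the toy hands out images in strict round-robin order, which real presentation engines do not promise, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_chain&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_chain_acquire()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_chain_reacquire()&lt;/code&gt; are hypothetical names, not Vulkan or zink API.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;

/* Toy round-robin swapchain; real ones guarantee no such ordering. */
struct toy_chain {
   uint32_t num_images;
   uint32_t next;   /* index the next acquire will hand out */
};

/* Hand out the next image index and advance. */
uint32_t toy_chain_acquire(struct toy_chain *sc)
{
   uint32_t idx = sc-&amp;gt;next;
   sc-&amp;gt;next = (sc-&amp;gt;next + 1) % sc-&amp;gt;num_images;
   return idx;
}

/* Keep acquiring until the wanted image comes back around.
 * Assumes wanted is a valid index, otherwise this never terminates.
 * Returns how many unwanted images were burned along the way. */
uint32_t toy_chain_reacquire(struct toy_chain *sc, uint32_t wanted)
{
   uint32_t wasted = 0;
   while (toy_chain_acquire(sc) != wanted)
      wasted++;   /* stand-in for presenting the unwanted image again */
   return wasted;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With three images, acquiring image 0, presenting it, and then asking for it back burns through images 1 and 2 first. In the real thing each of those wasted iterations is a present plus an acquire.&lt;/p&gt;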

&lt;h1 id=&quot;p-e-r-f&quot;&gt;&lt;del&gt;P E R F&lt;/del&gt;&lt;/h1&gt;
&lt;p&gt;This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which it does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…&lt;/p&gt;

&lt;p&gt;Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from &lt;strong&gt;REDACTED&lt;/strong&gt; in which this was happening every frame, something had to be done.&lt;/p&gt;

&lt;p&gt;This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.&lt;/p&gt;
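
&lt;p&gt;Roughly, the idea looks like the sketch below. It is a toy model under stated assumptions, not the merged implementation: the threshold of three readbacks is a made-up number, a 16-byte array stands in for a frame, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_rb&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_rb_present()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_rb_read()&lt;/code&gt; are hypothetical names.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdbool.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define TOY_RB_BYTES 16      /* pretend a frame is 16 bytes */
#define TOY_RB_THRESHOLD 3   /* made-up number, not zink's heuristic */

struct toy_rb {
   uint8_t staging[TOY_RB_BYTES];  /* copy of the last presented frame */
   uint32_t readbacks;             /* slow readbacks seen so far */
   uint32_t full_syncs;            /* how often the slow path was paid for */
   bool readback_mode;
};

/* At present time: in readback mode, pay for one extra full-frame copy. */
void toy_rb_present(struct toy_rb *rb, const uint8_t *frame)
{
   if (rb-&amp;gt;readback_mode)
      memcpy(rb-&amp;gt;staging, frame, TOY_RB_BYTES);
}

/* When the app reads back a frame it has already presented. */
void toy_rb_read(struct toy_rb *rb, const uint8_t *frame, uint8_t *out)
{
   if (rb-&amp;gt;readback_mode) {
      memcpy(out, rb-&amp;gt;staging, TOY_RB_BYTES);   /* no sync needed */
      return;
   }
   rb-&amp;gt;full_syncs++;   /* stand-in for the present/acquire spam */
   memcpy(out, frame, TOY_RB_BYTES);
   /* repeat offenders get switched over to the staging copy */
   if (++rb-&amp;gt;readbacks &amp;gt;= TOY_RB_THRESHOLD)
      rb-&amp;gt;readback_mode = true;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Run a present followed by a readback on every frame and the first three readbacks each cost a full sync. After that the toy flips into readback mode, and every later readback is served from the staging copy with no sync at all.&lt;/p&gt;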

&lt;p&gt;The functionality is &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25754&quot;&gt;already merged&lt;/a&gt; for the upcoming 23.3 release.&lt;/p&gt;

&lt;p&gt;Footgun as hard as you want.&lt;/p&gt;</content><author><name></name></author><summary type="html">And Now For Something Slightly More Technical</summary></entry><entry><title type="html">Crabformance</title><link href="https://www.supergoodcode.com/crabformance/" rel="alternate" type="text/html" title="Crabformance" /><published>2023-10-25T00:00:00+00:00</published><updated>2023-10-25T00:00:00+00:00</updated><id>https://www.supergoodcode.com/crabformance</id><content type="html" xml:base="https://www.supergoodcode.com/crabformance/">&lt;h1 id=&quot;more-milestones&quot;&gt;More Milestones&lt;/h1&gt;

&lt;p&gt;As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25837&quot;&gt;bear fruit&lt;/a&gt;, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.&lt;/p&gt;

&lt;p&gt;This will make zink &lt;strong&gt;THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on &lt;a href=&quot;https://chaos.social/@karolherbst&quot;&gt;social media&lt;/a&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html">More Milestones</summary></entry><entry><title type="html">Preemptive</title><link href="https://www.supergoodcode.com/preemptive/" rel="alternate" type="text/html" title="Preemptive" /><published>2023-10-24T00:00:00+00:00</published><updated>2023-10-24T00:00:00+00:00</updated><id>https://www.supergoodcode.com/preemptive</id><content type="html" xml:base="https://www.supergoodcode.com/preemptive/">&lt;h1 id=&quot;your-bug-has-already-been-solved&quot;&gt;Your Bug Has Already Been Solved&lt;/h1&gt;

&lt;p&gt;After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.&lt;/p&gt;

&lt;p&gt;Some of those crashes, however, are not my bugs. They’re system bugs.&lt;/p&gt;

&lt;p&gt;In particular, any of you still using Xorg instead of Wayland will want to create this file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section &quot;ServerFlags&quot;
	Option &quot;Debug&quot; &quot;dmabuf_capable&quot;
EndSection
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This makes your xserver dmabuf-capable, which gives you a much better chance of success when running things with zink.&lt;/p&gt;

&lt;p&gt;Another problem you’re likely to have is this console error:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;DRI3 not available
failed to load driver: zink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xf86-video-amdgpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Delete this package.&lt;/p&gt;

&lt;p&gt;Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.&lt;/p&gt;

&lt;p&gt;If you’re still having problems after checking for both of these issues, try turning your computer on.&lt;/p&gt;</content><author><name></name></author><summary type="html">Your Bug Has Already Been Solved</summary></entry><entry><title type="html">Hibernation</title><link href="https://www.supergoodcode.com/hibernation/" rel="alternate" type="text/html" title="Hibernation" /><published>2023-10-23T00:00:00+00:00</published><updated>2023-10-23T00:00:00+00:00</updated><id>https://www.supergoodcode.com/hibernation</id><content type="html" xml:base="https://www.supergoodcode.com/hibernation/">&lt;h1 id=&quot;almost-that-time-again&quot;&gt;Almost That Time Again&lt;/h1&gt;

&lt;p&gt;As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of &lt;del&gt;shitposting&lt;/del&gt; highly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Day 1&lt;/strong&gt;.&lt;/p&gt;

&lt;h1 id=&quot;zink-no-longer-a-hacky-workaround-driver&quot;&gt;Zink: No Longer A Hacky Workaround Driver&lt;/h1&gt;
&lt;p&gt;2023 has seen great strides in the zink ecosystem:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Some games, most notably my favorite game of all time &lt;a href=&quot;https://developer.x-plane.com/2023/02/addressing-plugin-flickering/&quot;&gt;X-Plane&lt;/a&gt;, are now shipping zink in order to have a consistent GL experience across platforms&lt;/li&gt;
  &lt;li&gt;Zink has reached &lt;a href=&quot;https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_332&quot;&gt;official GL 4.6 conformance&lt;/a&gt; on &lt;a href=&quot;https://blog.imaginationtech.com/imagination-gpus-now-support-opengl-4.6&quot;&gt;Imagination&lt;/a&gt; GPUs and will be shipping as their GL implementation&lt;/li&gt;
  &lt;li&gt;Zink can now run display servers for both X and Wayland, enabling full systems to exist without a native GL implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MESA_LOADER_DRIVER_OVERRIDE=zink&lt;/code&gt; has to be specified in order to use zink, even if no other GL drivers exist on the system.&lt;/p&gt;

&lt;h1 id=&quot;or-does-it&quot;&gt;Or Does It?&lt;/h1&gt;
&lt;p&gt;Over a year ago I &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16168&quot;&gt;attempted&lt;/a&gt; to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.&lt;/p&gt;

&lt;p&gt;Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.&lt;/p&gt;

&lt;p&gt;The result is that on zink-enabled systems, loader environment variables will &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25640&quot;&gt;no longer be necessary&lt;/a&gt; as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.&lt;/p&gt;
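
&lt;p&gt;In toy form, the selection order described above might look like this. It assumes the loader simply tries each option in turn, which is a simplification and not the actual GLX/EGL loader logic; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_driver&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_pick_driver()&lt;/code&gt; are hypothetical names.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdbool.h&amp;gt;

enum toy_driver { TOY_NATIVE, TOY_ZINK, TOY_SWRAST };

/* Simplified fallback order; the real loader has many more cases. */
enum toy_driver toy_pick_driver(bool native_loads, bool zink_loads)
{
   if (native_loads)
      return TOY_NATIVE;
   if (zink_loads)
      return TOY_ZINK;    /* used to need MESA_LOADER_DRIVER_OVERRIDE=zink */
   return TOY_SWRAST;     /* software rendering as the last resort */
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;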

&lt;p&gt;I can’t imagine anyone will need it, but remember that issues can be reported &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html">Almost That Time Again</summary></entry><entry><title type="html">All The Updates</title><link href="https://www.supergoodcode.com/all-the-updates/" rel="alternate" type="text/html" title="All The Updates" /><published>2023-10-12T00:00:00+00:00</published><updated>2023-10-12T00:00:00+00:00</updated><id>https://www.supergoodcode.com/all-the-updates</id><content type="html" xml:base="https://www.supergoodcode.com/all-the-updates/">&lt;h1 id=&quot;ebusy&quot;&gt;EBUSY&lt;/h1&gt;

&lt;p&gt;As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.&lt;/p&gt;

&lt;p&gt;But there have been updates, and I’m gonna round ‘em all up.&lt;/p&gt;

&lt;h1 id=&quot;r-a-y-t-r-a-c-w-t-f&quot;&gt;R A Y T R A C W T F&lt;/h1&gt;
&lt;p&gt;Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_EXT_descriptor_indexing&lt;/code&gt; for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.&lt;/p&gt;

&lt;p&gt;He’s now &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25616&quot;&gt;implemented raytracing&lt;/a&gt; for Lavapipe.&lt;/p&gt;

&lt;p&gt;It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.&lt;/p&gt;

&lt;h1 id=&quot;closure&quot;&gt;CLosure&lt;/h1&gt;
&lt;p&gt;I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24839&quot;&gt;merged&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.&lt;/p&gt;

&lt;h1 id=&quot;fixups&quot;&gt;Fixups&lt;/h1&gt;
&lt;p&gt;A number of longstanding bugs have recently been fixed.&lt;/p&gt;

&lt;h2 id=&quot;wolfenstein-face&quot;&gt;Wolfenstein Face&lt;/h2&gt;
&lt;p&gt;Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/wolf-face.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/wolf-face.png&quot; alt=&quot;wolf-face.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues/8988&quot;&gt;ticket&lt;/a&gt; open about it for a while, and it turns out that this is a &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues/5753&quot;&gt;known issue&lt;/a&gt; in D3D games which has its own workaround. The workaround is now going to be &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25642&quot;&gt;applied for zink&lt;/a&gt; as well, which should resolve the issue while hopefully not &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/issues/7879&quot;&gt;causing others&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;apitrace-the-final-frontier&quot;&gt;Apitrace: The Final Frontier&lt;/h2&gt;

&lt;p&gt;Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A number of games could be traced, but then replaying those traces would crash at a certain point. This is now &lt;a href=&quot;https://github.com/apitrace/apitrace/pull/899&quot;&gt;fixed&lt;/a&gt;, enabling better bug reporting for a large number of AAA games from the last decade.&lt;/li&gt;
  &lt;li&gt;Another set of games using id Tech engines could record traces, but then replaying them would fail to render correctly:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/wolf-trace.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/wolf-trace.png&quot; alt=&quot;wolf-trace.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This affects (at least) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Wolfenstein: The Old Blood&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOOM2016&lt;/code&gt;, but the problem has been identified, and a fix is on the way.&lt;/p&gt;

&lt;h2 id=&quot;zink-exploring-new-display-systems&quot;&gt;Zink: Exploring New Display Systems&lt;/h2&gt;
&lt;p&gt;After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.&lt;/p&gt;

&lt;h1 id=&quot;the-real-post&quot;&gt;The Real Post&lt;/h1&gt;
&lt;p&gt;Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.&lt;/p&gt;

&lt;p&gt;During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.&lt;/p&gt;

&lt;p&gt;While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.&lt;/p&gt;

&lt;p&gt;So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;drawoverhead&lt;/code&gt; cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.&lt;/p&gt;

&lt;p&gt;The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;zink receives a (transfer) command sequence which is impossible to reorder and must split the renderpass to execute copies/blits&lt;/li&gt;
  &lt;li&gt;the app randomly flushes during rendering&lt;/li&gt;
  &lt;li&gt;the GL frontend hits a TC synchronization point and halts the recording thread to wait for the driver thread to finish execution&lt;/li&gt;
&lt;/ul&gt;
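
&lt;p&gt;For anyone who wants the accumulation step in code form, here’s a minimal sketch of how recorded attachment access could map onto load/store ops. Everything in it is an assumption made for illustration: the enums, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;attachment_access&lt;/code&gt; struct, and both helper functions are hypothetical names, not TC or zink code, and the real tracking has many more cases.&lt;/p&gt;

```c
#include <stdbool.h>

/* Hypothetical names for illustration; these are not Mesa or Vulkan types. */
enum load_op { LOAD_OP_DONT_CARE, LOAD_OP_LOAD, LOAD_OP_CLEAR };
enum store_op { STORE_OP_DONT_CARE, STORE_OP_STORE };

/* What was learned about one attachment while recording a renderpass. */
struct attachment_access {
   bool cleared_first;  /* a full clear came before any other access */
   bool needs_previous; /* old contents must survive (blending, partial draws, reads) */
   bool invalidated;    /* contents were discarded before the renderpass ended */
};

/* A clear wins; otherwise only load when the old contents actually matter. */
enum load_op
pick_load_op(const struct attachment_access *a)
{
   if (a->cleared_first)
      return LOAD_OP_CLEAR;
   return a->needs_previous ? LOAD_OP_LOAD : LOAD_OP_DONT_CARE;
}

/* Skip the store when the app has said it no longer cares about the result. */
enum store_op
pick_store_op(const struct attachment_access *a)
{
   return a->invalidated ? STORE_OP_DONT_CARE : STORE_OP_STORE;
}
```

&lt;p&gt;The point of the sketch is that these answers only exist once the whole renderpass has been seen, which is why anything that splits the renderpass early forces the conservative (and expensive, on tilers) load and store.&lt;/p&gt;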

&lt;p&gt;The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.&lt;/p&gt;

&lt;p&gt;Texture uploading.&lt;/p&gt;

&lt;h1 id=&quot;texture-uploads-how-do-they-work&quot;&gt;Texture Uploads: How Do They Work?&lt;/h1&gt;
&lt;p&gt;There are (currently) three methods by which TC can perform texture uploads:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;for small uploads, the data is enqueued and passed asynchronously to the driver thread&lt;/li&gt;
  &lt;li&gt;for larger uploads:
    &lt;ul&gt;
      &lt;li&gt;if renderpass tracking is enabled and a renderpass is active, the upload will be sequenced into N strided uploads and passed asynchronously to the driver thread to avoid splitting renderpasses&lt;/li&gt;
      &lt;li&gt;otherwise TC syncs the driver thread and performs the upload directly&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$height&lt;/code&gt; depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not &lt;em&gt;optimal&lt;/em&gt; in the absolute sense.&lt;/p&gt;
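
&lt;p&gt;As a rough sketch of the decision described above, the path selection and the value of N could look something like this. The names are hypothetical and so is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;small_limit&lt;/code&gt; parameter, since the actual cutoff for a “small” upload isn’t stated here; this is not the TC implementation.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three upload methods listed in the post. */
enum upload_path {
   UPLOAD_ASYNC_SMALL,  /* data is enqueued and handed to the driver thread */
   UPLOAD_ASYNC_COPIES, /* rewritten into N buffer2image copies */
   UPLOAD_SYNC,         /* sync the driver thread, upload directly */
};

/* small_limit is an assumed parameter; the real threshold is not given. */
enum upload_path
choose_upload_path(size_t size, size_t small_limit,
                   bool rp_tracking, bool in_renderpass)
{
   if (size <= small_limit)
      return UPLOAD_ASYNC_SMALL;
   if (rp_tracking && in_renderpass)
      return UPLOAD_ASYNC_COPIES;
   return UPLOAD_SYNC;
}

/* N is 1 when the source stride matches the image stride, else one copy per row. */
uint32_t
upload_copy_count(uint32_t src_stride, uint32_t image_stride, uint32_t height)
{
   return src_stride == image_stride ? 1 : height;
}
```

&lt;p&gt;With mismatched strides and a height of 4096, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upload_copy_count&lt;/code&gt; returns 4096, which is the texture atlas case mentioned above.&lt;/p&gt;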

&lt;p&gt;You can see where this is going.&lt;/p&gt;

&lt;h1 id=&quot;tc-execution-define-optimal&quot;&gt;TC Execution: Define Optimal&lt;/h1&gt;
&lt;p&gt;Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/async_texture/ideal.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/async_texture/ideal.png&quot; alt=&quot;ideal.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the sequenced upload case will have more work, so the driver thread will run a bit longer than it otherwise would, resulting in the GL frontend waiting a bit longer than it otherwise would for completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/async_texture/copies.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/async_texture/copies.png&quot; alt=&quot;copies.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the sync upload case creates a bubble in TC execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/async_texture/sync.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/async_texture/sync.png&quot; alt=&quot;sync.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;solve-for-p&quot;&gt;Solve For P&lt;/h1&gt;
&lt;p&gt;To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.&lt;/p&gt;

&lt;p&gt;Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;in-use&lt;/code&gt; image, which would cause unpredictable (but definitely wrong) output. In this context, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;in-use&lt;/code&gt; can be defined as an image which is either:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;enqueued in a TC batch for execution&lt;/li&gt;
  &lt;li&gt;enqueued/active in a GPU submission&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the plus side, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pipe_context::is_resource_busy&lt;/code&gt; exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.&lt;/p&gt;

&lt;p&gt;To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;create staging image&lt;/li&gt;
  &lt;li&gt;upload texture data to staging image&lt;/li&gt;
  &lt;li&gt;draw to scene while sampling staging image&lt;/li&gt;
  &lt;li&gt;delete staging image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For such a case, it’d be trivial to add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seen&lt;/code&gt; flag to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct threaded_resource&lt;/code&gt; and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;create staging image&lt;/li&gt;
  &lt;li&gt;upload texture data to staging image&lt;/li&gt;
  &lt;li&gt;draw to scene while sampling staging image&lt;/li&gt;
  &lt;li&gt;cache staging image for reuse&lt;/li&gt;
  &lt;li&gt;render frame&lt;/li&gt;
  &lt;li&gt;upload texture data to staging image&lt;/li&gt;
  &lt;li&gt;draw to scene while sampling staging image&lt;/li&gt;
  &lt;li&gt;cache staging image for reuse&lt;/li&gt;
  &lt;li&gt;render frame&lt;/li&gt;
  &lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.&lt;/p&gt;

&lt;p&gt;The solution I’ve &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25624&quot;&gt;settled on&lt;/a&gt; is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.&lt;/p&gt;
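
&lt;p&gt;To make the idea concrete, here’s a minimal sketch of last-used-batch tracking. It is not the code from the MR: the struct and function names are hypothetical, batch IDs are assumed to increase monotonically with 0 meaning “never used”, wraparound is ignored, and the GPU-side answer (the thing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pipe_context::is_resource_busy&lt;/code&gt; exists for) is passed in as a plain bool.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-resource tracking data. */
struct tracked_image {
   uint32_t last_used_batch; /* 0 means no batch has ever referenced it */
};

/* Hypothetical stand-in for the recording/execution state. */
struct batch_tracker {
   uint32_t recording_batch;      /* batch currently being recorded */
   uint32_t last_completed_batch; /* newest batch the driver thread has finished */
};

/* Called whenever a recorded command references the image. */
void
mark_image_used(const struct batch_tracker *t, struct tracked_image *img)
{
   img->last_used_batch = t->recording_batch;
}

/* The image is idle once every batch that referenced it has been executed. */
bool
image_is_idle(const struct batch_tracker *t, const struct tracked_image *img)
{
   return img->last_used_batch == 0 ||
          img->last_used_batch <= t->last_completed_batch;
}

/* Both halves of "in-use" must be false before skipping the sync. */
bool
can_upload_unsynchronized(const struct batch_tracker *t,
                          const struct tracked_image *img, bool gpu_busy)
{
   return image_is_idle(t, img) && !gpu_busy;
}
```

&lt;p&gt;This covers the cached staging image pattern: the first upload passes because the image has never been seen, the upload in the next frame is refused while the batch that sampled it is still pending, and it passes again once that batch has executed and the GPU is done with the image.&lt;/p&gt;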

&lt;p&gt;Unfortunately, for this to really work to its fullest potential in zink, it requires &lt;a href=&quot;https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_host_image_copy.html&quot;&gt;VK_EXT_host_image_copy&lt;/a&gt; to avoid further staging copies, and nobody implements this yet in mesa main (except Lavapipe, though there’s also this &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24276&quot;&gt;ANV MR&lt;/a&gt;). But someday more drivers will support this, and then it’ll be great.&lt;/p&gt;

&lt;p&gt;As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.&lt;/p&gt;

&lt;p&gt;The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.&lt;/p&gt;</content><author><name></name></author><summary type="html">EBUSY</summary></entry><entry><title type="html">Tis The Season</title><link href="https://www.supergoodcode.com/tis-the-season/" rel="alternate" type="text/html" title="Tis The Season" /><published>2023-09-22T00:00:00+00:00</published><updated>2023-09-22T00:00:00+00:00</updated><id>https://www.supergoodcode.com/tis-the-season</id><content type="html" xml:base="https://www.supergoodcode.com/tis-the-season/">&lt;h1 id=&quot;remember-way-back-when&quot;&gt;Remember Way Back When…&lt;/h1&gt;

&lt;p&gt;This blog was about pointlessly optimizing things? I’m talking like taking &lt;a href=&quot;https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/vkGetDescriptorSetLayoutSupport.html&quot;&gt;vkGetDescriptorSetLayoutSupport&lt;/a&gt; and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.&lt;/p&gt;

&lt;p&gt;Well good news: this isn’t a post about those types of optimizations.&lt;/p&gt;

&lt;p&gt;This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.&lt;/p&gt;

&lt;h1 id=&quot;the-vulkan-queue-what-is-it&quot;&gt;The Vulkan Queue: What Is It?&lt;/h1&gt;
&lt;p&gt;Lots of people are asking, but surely nobody reading this blog since you’re all experts. But if you have a friend who wants to know, here’s &lt;a href=&quot;https://registry.khronos.org/vulkan/site/guide/latest/queues.html&quot;&gt;the official resource&lt;/a&gt; for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.&lt;/p&gt;

&lt;p&gt;The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. &lt;em&gt;How large is it?&lt;/em&gt; you might be asking.&lt;/p&gt;

&lt;p&gt;I didn’t want to do this, but you’ve forced my hand.&lt;/p&gt;

&lt;h1 id=&quot;the-vulkan-queue-timing&quot;&gt;The Vulkan Queue: Timing&lt;/h1&gt;

&lt;p&gt;What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For &lt;em&gt;benchmarking&lt;/em&gt;, one might say.&lt;/p&gt;

&lt;p&gt;That’s right, it’s time to yet again plug &lt;a href=&quot;https://github.com/zmike/vkoverhead&quot;&gt;vkoverhead&lt;/a&gt;, the best and only tool for doing whatever I’m about to do.&lt;/p&gt;

&lt;p&gt;Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;types_of_headaches.meme -&amp;gt; vulkan_headaches.meme&lt;/code&gt;. That’s why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkoverhead&lt;/code&gt; already has the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-submit-only&lt;/code&gt; option in order to run a series of benchmark cases which have numbers that are totally not about to go up.&lt;/p&gt;

&lt;p&gt;Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_noop&lt;/code&gt; submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_50noop&lt;/code&gt; submits nothing 50 times, which is to say it passes 50x &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo&lt;/code&gt; structs to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; (or the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo2&lt;/code&gt; / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit2&lt;/code&gt; versions if sync2 is supported)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_1cmdbuf&lt;/code&gt; submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_50cmdbuf&lt;/code&gt; submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_50cmdbuf_50submit&lt;/code&gt; submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.&lt;/p&gt;

&lt;p&gt;Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.&lt;/p&gt;

&lt;p&gt;But why?&lt;/p&gt;

&lt;h1 id=&quot;why-so-slow&quot;&gt;Why So Slow&lt;/h1&gt;
&lt;p&gt;Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.supergoodcode.com/assets/queue-think.png&quot;&gt;&lt;img src=&quot;https://www.supergoodcode.com/assets/queue-think.png&quot; alt=&quot;queue-think.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?&lt;/p&gt;

&lt;p&gt;This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why &lt;a href=&quot;https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/vkQueueSubmit.html&quot;&gt;vkQueueSubmit&lt;/a&gt; has the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submitCount&lt;/code&gt; param and takes an array of submits.&lt;/p&gt;

&lt;p&gt;But what does Mesa do here? Well, in &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/3c4c263dc734ec75f72d36b1d0d1a9cd41310112/src/vulkan/runtime/vk_queue.c#L1164&quot;&gt;the current code&lt;/a&gt; there’s this gem:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vulkan_submit_info&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pNext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pNext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;command_buffer_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;commandBufferInfoCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;command_buffers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pCommandBufferInfos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;waitSemaphoreInfoCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;waits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pWaitSemaphoreInfos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;signal_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;signalSemaphoreInfoCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;signals&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pSubmits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pSignalSemaphoreInfos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;VkResult&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vk_queue_submit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlikely&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VK_SUCCESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tremendous. It’s worth mentioning that not only is this splitting the batched submits into individual ones, each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.&lt;/p&gt;
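
&lt;p&gt;To illustrate what batching could buy, here’s a small sketch that counts how many driver-level submissions an array of submits would need if adjacent ones were merged. The merge rule is a deliberately simplified assumption (merge only when the earlier submit signals nothing and the later one waits on nothing), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;submit_desc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_merged_submits&lt;/code&gt; are hypothetical names. It is not the logic in Mesa’s Vulkan runtime.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down view of one entry in the pSubmits array. */
struct submit_desc {
   uint32_t wait_count;   /* semaphores waited on before execution */
   uint32_t signal_count; /* semaphores signaled after execution */
};

/* Count the submissions left after merging neighbours that have no
 * semaphore activity between them (an assumed, simplified rule). */
uint32_t
count_merged_submits(const struct submit_desc *submits, uint32_t count)
{
   uint32_t merged = 0;

   for (uint32_t i = 0; i < count; i++) {
      bool can_append = i > 0 &&
                        submits[i - 1].signal_count == 0 &&
                        submits[i].wait_count == 0;
      if (!can_append)
         merged++; /* this entry has to start a new submission */
   }
   return merged;
}
```

&lt;p&gt;Under that rule, 50 submits with no semaphores collapse into a single submission, while the loop shown above always produces 50.&lt;/p&gt;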

&lt;h1 id=&quot;fast-forward&quot;&gt;Fast Forward&lt;/h1&gt;
&lt;p&gt;We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:&lt;/p&gt;

&lt;p&gt;RADV GFX11:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%
↓
  40, submit_noop,                                        21008648,     100.0%
  41, submit_50noop,                                      4866415,      23.2%
  42, submit_1cmdbuf,                                     51294,        0.2%
  43, submit_50cmdbuf,                                     1823,         0.0%
  44, submit_50cmdbuf_50submit,                            1828,         0.0%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s like 1000% faster for case #41 and 50% faster for case #44.&lt;/p&gt;
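
&lt;p&gt;The percentages in this post are ballpark figures. For anyone who wants to redo the arithmetic from the tables, here’s a tiny helper, assuming the numbers are throughput-style results where higher is better (which is how they’re being read here). The function name is made up for this example.&lt;/p&gt;

```c
/* Percent improvement of `after` over `before`, both higher-is-better.
 * Doubling the result is "100% faster". */
double
percent_faster(double before, double after)
{
   return (after / before - 1.0) * 100.0;
}
```

&lt;p&gt;Plugging in the RADV rows, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percent_faster(402324, 4866415)&lt;/code&gt; lands a bit above 1100 for case #41, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percent_faster(1031, 1828)&lt;/code&gt; comes out around 77 for case #44.&lt;/p&gt;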

&lt;p&gt;&lt;em&gt;But how does this affect other drivers?&lt;/em&gt; I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.&lt;/p&gt;

&lt;p&gt;Lavapipe:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  40, submit_noop,                                        1972672,      100.0%
  41, submit_50noop,                                      40334,        2.0%
  42, submit_1cmdbuf,                                     5994597,      303.9%
  43, submit_50cmdbuf,                                    2623720,      133.0%
  44, submit_50cmdbuf_50submit,                           133453,       6.8%
↓
  40, submit_noop,                                        1980681,      100.0%
  41, submit_50noop,                                      1202374,      60.7%
  42, submit_1cmdbuf,                                     6340872,      320.1%
  43, submit_50cmdbuf,                                    2482127,      125.3%
  44, submit_50cmdbuf_50submit,                           1165495,      58.8%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;3000% faster for #41 and 1000% faster for #44.&lt;/p&gt;

&lt;p&gt;Intel DG2:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  40, submit_noop,                                        101336,       100.0%
  41, submit_50noop,                                       2123,         2.1%
  42, submit_1cmdbuf,                                     35372,        34.9%
  43, submit_50cmdbuf,                                      713,          0.7%
  44, submit_50cmdbuf_50submit,                             707,          0.7%
↓
  40, submit_noop,                                        106065,       100.0%
  41, submit_50noop,                                      105992,       99.9%
  42, submit_1cmdbuf,                                     35110,        33.1%
  43, submit_50cmdbuf,                                      709,          0.7%
  44, submit_50cmdbuf_50submit,                             702,          0.7%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;5000% faster for #41 and a big 🤕 for #44 because Intel.&lt;/p&gt;

&lt;p&gt;Turnip A740:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  40, submit_noop,                                        1227546,      100.0%
  41, submit_50noop,                                      26194,        2.1%
  42, submit_1cmdbuf,                                     1186327,      96.6%
  43, submit_50cmdbuf,                                    545341,       44.4%
  44, submit_50cmdbuf_50submit,                           16531,        1.3%
↓
  40, submit_noop,                                        1313550,      100.0%
  41, submit_50noop,                                      1078383,      82.1%
  42, submit_1cmdbuf,                                     1129515,      86.0%
  43, submit_50cmdbuf,                                    329247,       25.1%
  44, submit_50cmdbuf_50submit,                           484241,       36.9%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;4000% faster for #41, 3000% faster for #44.&lt;/p&gt;
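
&lt;p&gt;For anyone who wants to check the rounding, here is a minimal sketch that recomputes the gains for #41 and #44 from the tables above. It assumes the first numeric column is a higher-is-better count; the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;speedup&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percent_faster&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
# Before/after counts for submit_50noop (#41) and submit_50cmdbuf_50submit (#44),
# copied from the tables above. Treating them as "higher is better" is an assumption.
RESULTS = {
    "Lavapipe": {41: (40334, 1202374), 44: (133453, 1165495)},
    "Intel DG2": {41: (2123, 105992), 44: (707, 702)},
    "Turnip A740": {41: (26194, 1078383), 44: (16531, 484241)},
}


def speedup(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


def percent_faster(before, after):
    """Return the relative gain in percent: 100 means twice as fast."""
    return (after - before) / before * 100.0


for driver, cases in RESULTS.items():
    for case, (before, after) in cases.items():
        print(f"{driver} #{case}: {speedup(before, after):.1f}x, "
              f"{percent_faster(before, after):+.0f}%")
```

&lt;p&gt;The headline percentages in this post are rounded, so expect the printed values to land near them rather than on them.&lt;/p&gt;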

&lt;p&gt;Pretty good, and it somehow manages to still be conformant.&lt;/p&gt;

&lt;p&gt;Code &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25352&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content><author><name></name></author><summary type="html">Remember Way Back When…</summary></entry><entry><title type="html">Happy Birthday</title><link href="https://www.supergoodcode.com/happy-birthday/" rel="alternate" type="text/html" title="Happy Birthday" /><published>2023-09-15T00:00:00+00:00</published><updated>2023-09-15T00:00:00+00:00</updated><id>https://www.supergoodcode.com/happy-birthday</id><content type="html" xml:base="https://www.supergoodcode.com/happy-birthday/">&lt;h1 id=&quot;but-not-mine&quot;&gt;But Not Mine&lt;/h1&gt;

&lt;p&gt;If you’re reading, thanks for everything.&lt;/p&gt;

&lt;h1 id=&quot;glamorous&quot;&gt;Glamorous&lt;/h1&gt;

&lt;p&gt;I planned to blog about it a while ago, but then I didn’t and news sites have since broken the news: Zink from Mesa main can finally run xservers.&lt;/p&gt;

&lt;p&gt;Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24700&quot;&gt;on Intel&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But what was so challenging about getting this to work? The answer won’t surprise you.&lt;/p&gt;

&lt;h1 id=&quot;wsi&quot;&gt;WSI&lt;/h1&gt;
&lt;p&gt;Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.&lt;/p&gt;

&lt;p&gt;The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;implicit sync - “just fucking do it”&lt;/li&gt;
  &lt;li&gt;explicit sync - “I’ll tell you exactly when to do it”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.&lt;/p&gt;

&lt;p&gt;Another way of looking at it is:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;implicit sync - OpenGL&lt;/li&gt;
  &lt;li&gt;explicit sync - Vulkan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And, since xservers run on GL, you can see where this is going.&lt;/p&gt;

&lt;h1 id=&quot;implicitly-terrible&quot;&gt;Implicitly Terrible&lt;/h1&gt;
&lt;p&gt;Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.&lt;/p&gt;

&lt;p&gt;In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glFlush&lt;/code&gt;, and magically it gets displayed.&lt;/p&gt;

&lt;p&gt;Sound nuts? It is, and that’s why Vulkan doesn’t support it.&lt;/p&gt;

&lt;p&gt;But zink uses Vulkan, so…&lt;/p&gt;

&lt;h1 id=&quot;send-eyebleach&quot;&gt;Send Eyebleach&lt;/h1&gt;
&lt;p&gt;Explicit sync is based on two concepts:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;import&lt;/li&gt;
  &lt;li&gt;export&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.&lt;/p&gt;

&lt;p&gt;In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;when doing a queue import (from FOREIGN) for a dmabuf, create and queue an export (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DMA_BUF_IOCTL_EXPORT_SYNC_FILE&lt;/code&gt;) semaphore to be waited on before the current cmdbuf&lt;/li&gt;
  &lt;li&gt;when triggering a barrier on any exported dmabuf, queue an import (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DMA_BUF_IOCTL_IMPORT_SYNC_FILE&lt;/code&gt;) semaphore to be signaled after the current cmdbuf&lt;/li&gt;
  &lt;li&gt;at submit time, serialize all the wait semaphores onto a separate queue submission before the main cmdbuf&lt;/li&gt;
  &lt;li&gt;at submit time, serialize all the signal semaphores onto a separate queue submission after the main cmdbuf&lt;/li&gt;
  &lt;li&gt;pray for modifiers to match up&lt;/li&gt;
&lt;/ul&gt;
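
&lt;p&gt;The ordering in that list can be sketched as a toy model. This is not zink’s code and it makes no Vulkan or ioctl calls: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;import_dmabuf&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;barrier_exported&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_submissions&lt;/code&gt; are hypothetical names, and semaphores are plain strings. It only shows where the waits and signals land relative to the main cmdbuf.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical stand-in for the work queued against one cmdbuf."""
    name: str
    wait_semaphores: list = field(default_factory=list)
    signal_semaphores: list = field(default_factory=list)

    def import_dmabuf(self, dmabuf):
        # Queue import (from FOREIGN): wait on a semaphore exported from the
        # dmabuf (the DMA_BUF_IOCTL_EXPORT_SYNC_FILE step) before this cmdbuf.
        self.wait_semaphores.append(f"export:{dmabuf}")

    def barrier_exported(self, dmabuf):
        # Barrier on an exported dmabuf: signal a semaphore that is imported
        # back into it (the DMA_BUF_IOCTL_IMPORT_SYNC_FILE step) afterwards.
        self.signal_semaphores.append(f"import:{dmabuf}")


def build_submissions(batch):
    """Return queue submissions in order: waits, main cmdbuf, signals."""
    submissions = []
    if batch.wait_semaphores:
        submissions.append(("wait", list(batch.wait_semaphores)))
    submissions.append(("cmdbuf", [batch.name]))
    if batch.signal_semaphores:
        submissions.append(("signal", list(batch.signal_semaphores)))
    return submissions


# Usage: one dmabuf that is imported, rendered to, and handed back.
batch = Batch("frame")
batch.import_dmabuf("pixmap")
batch.barrier_exported("pixmap")
for kind, items in build_submissions(batch):
    print(kind, items)
```

&lt;p&gt;In this model, splitting the waits and signals into their own submissions keeps the main cmdbuf submission unchanged whether or not any dmabufs are involved.&lt;/p&gt;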

&lt;p&gt;Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.&lt;/p&gt;

&lt;p&gt;Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.&lt;/p&gt;</content><author><name></name></author><summary type="html">But Not Mine</summary></entry></feed>