Mike Blumenkrantz

First Bug Down

2024-01-08T00:00:00+00:00

Slow Start

It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games.

Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is ~~my old code~~ the mesa bug tracker.

Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins.

Another Game I’ve Never Heard Of

Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain.

Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me.

This report came in hot over the break with some rad new shading techniques:

It looks way cooler if you play the trace, but you get the idea.

Pinpoint Accuracy

First question: what in the Sam Hill is going on here?

Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit.

It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost.

My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present.

Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing.

So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:

if (!batch->has_work) {
<-----HERE
      if (pfence) {
         /* reuse last fence */
         fence = ctx->last_fence;
      }
      if (!deferred) {
         struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
         if (last) {
            sync_flush(ctx, last);
            if (last->is_device_lost)
               check_device_lost(ctx);
         }
      }
      if (ctx->tc && !ctx->track_renderpasses)
      tc_driver_internal_flush_notify(ctx->tc);
} else {
   fence = &batch->state->fence;
   submit_count = batch->state->usage.submit_count;
   if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
      deferred_fence = true;
   else
      flush_batch(ctx, true);
}

Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush.

OR DO YOU?

For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.

Synchronization: You Cannot Escape

The reason this was especially puzzling is the call sequence was:

end-of-frame flush
present
glFenceSync flush

which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on.

Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed?

Why yes, that is stupid, but here at SGC, stupidity is our sustenance.

Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.

Manifesto

2024-01-02T00:00:00+00:00

This Is It.

It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS.

—is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year.

It’s the year a decades-old plan has finally yielded its dividends.

Truth.

You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world.

I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks.

In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal.

The goal of infiltrating Big Triangle.

More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose.

Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources.

I have become an officer.

I am the chair.

Revolution.

Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebearers!

OpenGL 10.0 by 2025!
Compatibility Profile shall be renamed ‘SLOW MODE’
OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!
Depth values shall be uniformly scaled across all hardware and platforms!
XFB shall be outlawed!
Linux game ports shall no longer link to LLVM!
Coherent API error messages shall be printed!
Vendors which cannot ship functional Windows GL drivers shall ship Zink!
Native GL drivers on mobile platforms shall be outlawed!
gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!
Mesh and ray-tracing extensions from NVIDIA shall become core functionality!
GLX shall be deleted and forgotten!
All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!

Rise up and join me, your new GL/ES chair, in the glorious revolution!

DISCLAIMER

Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously.

Happy New Year. I missed you.

2024

2023-10-27T00:00:00+00:00

🪑?

Readback

2023-10-26T00:00:00+00:00

And Now For Something Slightly More Technical

It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here one last big technical post about something I hate.

Swapchain readback.

So Easy Even You Could Accidentally Do It

I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.

I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:

draw some stuff
swapbuffers
blitframebuffer

And this happened on every single frame (???).

In Zink Terms…

This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.

Vulkan doesn’t allow readback from swapchains. By this, I mean:

swapchain images must be acquired before they can be accessed for any purpose
there is no method to explicitly reacquire a specific swapchain image
there is no guarantee that swapchain images are unchanged after present

Combined, once you have presented a swapchain image you’re screwed.

…According to the spec, that is. In the real world, things work differently.

Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.

P E R F

This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which is does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…

Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.

This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.

The functionality is already merged for the upcoming 23.3 release.

Footgun as hard as you want.

Crabformance

2023-10-25T00:00:00+00:00

More Milestones

As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.

This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.

Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.

Preemptive

2023-10-24T00:00:00+00:00

Your Bug Has Already Been Solved

After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.

Some of those crashes, however, are not my bugs. They’re system bugs.

In particular, any of you still using Xorg instead of Wayland will want to create this file:

$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
	Option "Debug" "dmabuf_capable"
EndSection

This makes your xserver dmabuf-capable, which will be more successful when running things with zink.

Another problem you’re likely to have is this console error:

DRI3 not available
failed to load driver: zink

Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.

Delete this package.

Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.

If you’re still having problems after checking for both of these issues, try turning your computer on.

Hibernation

2023-10-23T00:00:00+00:00

Almost That Time Again

As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of ~~shitposting~~highly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.

This is Day 1.

Zink: No Longer A Hacky Workaround Driver

2023 has seen great strides in the zink ecosystem:

Some games, most notably my favorite game of all time X-Plane, are now shipping zink in order to have a consistent GL experience across platforms
Zink has reached official GL 4.6 conformance on Imagination GPUs and will be shipping as their GL implementation
Zink can now run display servers for both X and Wayland, enabling full systems to exist without a native GL implementation

And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.

MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.

Or Does It?

Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.

Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.

The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.

I can’t imagine anyone will need it, but remember that issues can be reported here.

All The Updates

2023-10-12T00:00:00+00:00

EBUSY

As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.

But there have been updates, and I’m gonna round ‘em all up.

R A Y T R A C W T F

Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.

He’s now implemented raytracing for Lavapipe.

It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.

CLosure

I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.

While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.

Fixups

A number of longstanding bugs have recently been fixed.

Wolfenstein Face

Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:

Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.

Apitrace: The Final Frontier

Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.

A number of games could be traced, but then replaying those traces would crash at a certain point. This is now fixed, enabling better bug reporting for a large number of AAA games from the the last decade.
Another set of games using the id Engine could record traces, but then replaying them would fail to render correctly:

This affects (at least) Wolfenstein: The Old Blood and DOOM2016, but the problem has been identified, and a fix is on the way.

Zink: Exploring New Display Systems

After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.

The Real Post

Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.

During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.

While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.

So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.

The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:

zink receives a (transfer) command sequence which is impossible to reorder and must split the renderpass to execute copies/blits
the app randomly flushes during rendering
the GL frontend hits a TC synchronization point and halts the recording thread to wait for the driver thread to finish execution

The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.

Texture uploading.

Texture Uploads: How Do They Work?

There are (currently) three methods by which TC can perform texture uploads:

for small uploads, the data is enqueued and passed asynchronously to the driver thread
for larger uploads:
- if renderpass tracking is enabled and a renderpass is active, the upload will be sequenced into N strided uploads and passed asynchronously to the driver thread to avoid splitting renderpasses
- otherwise TC syncs the driver thread and performs the upload directly

Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.

You can see where this is going.

TC Execution: Define Optimal

Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:

Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:

the sequenced upload case will have more work, so the driver thread will run a bit longer than it otherwise would, resulting in the GL frontend waiting a bit longer than it otherwise would for completion

the sync upload case creates a bubble in TC execution

Solve For P

To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.

Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:

enqueued in a TC batch for execution
enqueued/active in a GPU submission

On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.

To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:

create staging image
upload texture data to staging image
draw to scene while sampling staging image
delete staging image

For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:

create staging image
upload texture data to staging image
draw to scene while sampling staging image
cache staging image for reuse
render frame
upload texture data to staging image
draw to scene while sampling staging image
cache staging image for reuse
render frame
…

For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.

The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.

For it to really work to its fullest potential in zink, unfortunately, requires VK_EXT_host_image_copy to avoid further staging copies, and nobody implements this yet in mesa main (except Lavapipe, though also there’s this ANV MR). But someday more drivers will support this, and then it’ll be great.

As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.

The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.

Tis The Season

2023-09-22T00:00:00+00:00

Remember Way Back When…

This blog was about pointlessly optimizing things? I’m talking like taking vkGetDescriptorSetLayoutSupport and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.

Well good news: this isn’t a post about those types of optimizations.

This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.

The Vulkan Queue: What Is It?

Lots of people are asking, but surely nobody reading this blog since you’re all experts. But if you have a friend who wants to know, here’s the official resource for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.

The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. How large is it? you might be asking.

I didn’t want to do this, but you’ve forced my hand.

The Vulkan Queue: Timing

What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For benchmarking, one might say.

That’s right, it’s time to yet again plug vkoverhead, the best and only tool for doing whatever I’m about to do.

Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of types_of_headaches.meme -> vulkan_headaches.meme. That’s why vkoverhead already has the -submit-only option in order to run a series of benchmark cases which have numbers that are totally not about to go up.

Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:

submit_noop submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline
submit_50noop submits nothing 50 times, which is to say it passes 50x VkSubmitInfo structs to vkQueueSubmit (or the 2 versions if sync2 is supported)
submit_1cmdbuf submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all
submit_50cmdbuf submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations
submit_50cmdbuf_50submit submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per vkQueueSubmit call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™

It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.

Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%

Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.

But why?

Why So Slow

Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.

Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?

This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why vkQueueSubmit has the submitCount param and takes an array of submits.

But what does Mesa do here? Well, in the current code there’s this gem:

for (uint32_t i = 0; i < submitCount; i++) {
   struct vulkan_submit_info info = {
      .pNext = pSubmits[i].pNext,
      .command_buffer_count = pSubmits[i].commandBufferInfoCount,
      .command_buffers = pSubmits[i].pCommandBufferInfos,
      .wait_count = pSubmits[i].waitSemaphoreInfoCount,
      .waits = pSubmits[i].pWaitSemaphoreInfos,
      .signal_count = pSubmits[i].signalSemaphoreInfoCount,
      .signals = pSubmits[i].pSignalSemaphoreInfos,
      .fence = i == submitCount - 1 ? fence : NULL
   };
   VkResult result = vk_queue_submit(queue, &info);
   if (unlikely(result != VK_SUCCESS))
      return result;
}

Tremendous. It’s worth mentioning that not only is this splitting the batched submits into individual ones, each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.

Fast Forward

We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:

RADV GFX11:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%
↓
  40, submit_noop,                                        21008648,     100.0%
  41, submit_50noop,                                      4866415,      23.2%
  42, submit_1cmdbuf,                                     51294,        0.2%
  43, submit_50cmdbuf,                                     1823,         0.0%
  44, submit_50cmdbuf_50submit,                            1828,         0.0%

That’s like 1000% faster for case #41 and 50% faster for case #44.

But how does this affect other drivers? I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.

Lavapipe:

  40, submit_noop,                                        1972672,      100.0%
  41, submit_50noop,                                      40334,        2.0%
  42, submit_1cmdbuf,                                     5994597,      303.9%
  43, submit_50cmdbuf,                                    2623720,      133.0%
  44, submit_50cmdbuf_50submit,                           133453,       6.8%
↓
  40, submit_noop,                                        1980681,      100.0%
  41, submit_50noop,                                      1202374,      60.7%
  42, submit_1cmdbuf,                                     6340872,      320.1%
  43, submit_50cmdbuf,                                    2482127,      125.3%
  44, submit_50cmdbuf_50submit,                           1165495,      58.8%

3000% faster for #41 and 1000% faster for #44.

Intel DG2:

  40, submit_noop,                                        101336,       100.0%
  41, submit_50noop,                                       2123,         2.1%
  42, submit_1cmdbuf,                                     35372,        34.9%
  43, submit_50cmdbuf,                                      713,          0.7%
  44, submit_50cmdbuf_50submit,                             707,          0.7%
↓
  40, submit_noop,                                        106065,       100.0%
  41, submit_50noop,                                      105992,       99.9%
  42, submit_1cmdbuf,                                     35110,        33.1%
  43, submit_50cmdbuf,                                      709,          0.7%
  44, submit_50cmdbuf_50submit,                             702,          0.7%

5000% faster for #41 and a big 🤕 for #44 because Intel.

Turnip A740:

  40, submit_noop,                                        1227546,      100.0%
  41, submit_50noop,                                      26194,        2.1%
  42, submit_1cmdbuf,                                     1186327,      96.6%
  43, submit_50cmdbuf,                                    545341,       44.4%
  44, submit_50cmdbuf_50submit,                           16531,        1.3%
↓
  40, submit_noop,                                        1313550,      100.0%
  41, submit_50noop,                                      1078383,      82.1%
  42, submit_1cmdbuf,                                     1129515,      86.0%
  43, submit_50cmdbuf,                                    329247,       25.1%
  44, submit_50cmdbuf_50submit,                           484241,       36.9%

4000% faster for #41, 3000% faster for #44.

Pretty good, and it somehow manages to still be conformant.

Code here.

Happy Birthday

2023-09-15T00:00:00+00:00

But Not Mine

If you’re reading, thanks for everything.

Glamorous

I planned to blog about it a while ago, but then I didn’t and news sites have since broken the news: Zink from Mesa main can run finally xservers.

Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re on Intel).

But what was so challenging about getting this to work? The answer won’t surprise you.

WSI

Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.

The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI:

implicit sync - “just fucking do it”
explicit sync - “I’ll tell you exactly when to do it”

From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.

Another way of looking at it is:

implicit sync - OpenGL
explicit sync - Vulkan

And, since xservers run on GL, you can see where this is going.

Implicitly Terrible

Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.

In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you glFlush, and magically it gets displayed.

Sound nuts? It is, and that’s why Vulkan doesn’t support it.

But zink uses Vulkan, so…

Send Eyebleach

Explicit sync is based on two concepts:

import
export

A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.

In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:

when doing a queue import (from FOREIGN) for a dmabuf, create and queue an export (DMA_BUF_IOCTL_EXPORT_SYNC_FILE) semaphore to be waited on before the current cmdbuf
when triggering a barrier on any exported dmabuf, queue an import (DMA_BUF_IOCTL_IMPORT_SYNC_FILE) semaphore to be signaled after the current cmdbuf
at submit time, serialize all the wait semaphores onto a separate queue submission before the main cmdbuf
at submit time, serialize all the signal semaphores onto a separate queue submission after the main cmdbuf
pray for modifiers to match up

Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.

Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.