planet.freedesktop.org
January 24, 2024

Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU.

For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.


This will have the following advantages for the project:

  1. The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
  2. Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
  3. The project has great technical infrastructure, maintained by awesome sysadmins:
  4. More importantly, the Mesa codebase has also infrastructure that will be very useful to NPU drivers:
    • The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
    • The Gallium internal API that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as Android NNAPI.
  5. And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa. Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs.

The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was going to ever get close to the performance of the proprietary driver by using the programmable part fo the NPU.

I dug a bit deeper in how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).

Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.

With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv projects and others that I wrote, and by staring for long hours to all the produced data in spreadsheets.

During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.

By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU was able to. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.

Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary achieves, but still pretty darn useful in many situations!

The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.

Next steps

Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect myself to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!

I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.

There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.

There is quite some low hanging fruit for improving performance, so I expect myself to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.

And at some point I should start looking at other NPU IP to add support to. The ones I'm currently leading the most towards are RockChip's own IP, Mediatek's, Cadence's and Amlogic's.

Thanks

One doesn't just start writing an NPU driver by itself, and even more without any documentation, so I need to thank the following people who have helped me greatly in this effort:

Collabora for allowing me to start playing with this while I still worked with them.

Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.

Igalia for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv.

Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.

Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected of something in the Etnaviv kernel driver.

Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.

And a collective thanks to the DRI/Mesa community for being so awesome!

January 22, 2024

Time flies! Back in October, Igalia organized X.Org Developers Conference 2023 in A Coruña, Spain.

In case you don’t know it, X.Org Developers Conference, despite the X.Org in the name, is a conference for all developers working in the open-source graphics stack: anything related to DRM/KMS, Mesa, X11 and Wayland compositors, etc.

A Coruña's Orzán beach

This year, I participated in the organization of XDC in A Coruña, Spain (again!) by taking care of different aspects: from logistics in the venue (Palexco) to running it in person. It was a very tiring but fulfilling experience.

Sponsors

First of all, I would like to thank all the sponsors for their support, as without them, this conference wouldn’t happen:

XDC 2023 sponsors

They didn’t only give economic support to the conference: Igalia sponsored the welcome event and lunches; X.Org Foundation sponsored coffee breaks; Tourism Office of A Coruña sponsored the guided tour in the city center; and Raspberry Pi sent Raspberry Pi 5 boards to all speakers!

XDC 2023 Stats

XDC 2023 was a success on attendance and talks submissions. Here you have some stats:

  • 📈 160 registered attendees.
  • 👬 120 attendees picked their badge in person.
  • 💻 25 attendees registered as virtual.
  • 📺 More than 6,000 views on live stream.
  • 📝 55 talks/workshops/demos distributed in three days of conference..
  • 🧗‍♀️ There were 3 social events: welcome event, city center guide tour, and one unofficial climbing activity!

XDC 2023 welcome event

Was XDC 2023 perfect organization-wise? Of course… no! Like in any event, we had some issues here and there: one with the Wi-Fi network that was quickly detected and fixed; some issues with the meals and coffee breaks (food allergies mainly), we lost some seconds of audio of a talk in the on-live streaming, and other minor things. Not bad for a community-run event!

Nevertheless, I would like to thank all the staff at Palexco for their quick response and their understanding.

Talk recordings & slides

XDC 2023 talk by André Almeida

Want to see again some talks? All conference recordings were uploaded to X.Org Foundation Youtube channel.

Slides are available to download in each talk description.

Enjoy!

XDC 2024

XDC 2024 will be in North America

We cannot tell yet where is going to happen XDC 2024, other than it will be in North America… but I can tell you that this will be announced soon. Stay tuned!

Want to organize XDC 2025 or XDC 2026?

If we continue with the current cadence: 2025 would be again in Europe, and 2026 event would be in North America.

There is a list of requirements here. Nevertheless, feel free to contact me, or to the X.Org Board of Directors, in order to get first-hand experience and knowledge about what organizing XDC entails.

XDC 2023 audience

Thanks

Thanks to all volunteers, collaborators, Palexco staff, GPUL, X.Org Foundation and many other people for their hard work. Special thanks to my Igalia colleague Chema, who did an outstanding job organizing the event together with me.

Thanks for the sponsors for their extraordinary support to this conference.

Thanks to Igalia not only for sponsoring the event, but also for all the support I got during the past year. I am glad to be part of this company, and I am always surprised by how great my colleagues are.

And last, but not least, thanks to all speakers and attendees. Without you, the conference won’t exist.

See you at XDC 2024!

January 21, 2024

In the past 7 months, I’ve lost 55 pounds and went from completely sedentary to becoming much more fit, while putting in a minimum of effort. I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. A year ago, my blood pressure was already at pre-hypertension levels, despite being at a relatively young age. 

I wasn’t willing to let this last any longer, and I wasn’t willing to accept that future.

Research shows that 5 factors are key to a long life — correlated with extending your life by 12–14 years:

  • Never smoking
  • BMI (body mass index) of 18.5–24.9
  • 30+ min a day of moderate/vigorous exercise
  • Moderate alcohol intake (vs none, occasional, or heavy)
  • Diet quality in the upper 40% (Alternate Healthy Eating Index)

In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending.

Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. On the bright side for these purposes, I drink moderately (almost entirely beer).

In this post, I’ll walk through my own experience going from obese to a healthy weight, with plenty of research-driven references and data along the way.

Why is this the lazy technologist’s guide, though? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.”

What’s the lowest-effort, most research-driven way to lose weight as quickly as possible without losing health? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path.

My weight-loss journey begins

My initial goal was to get down from 240 pounds (obese, BMI of 31.7) into the healthy range, reaching 185 pounds (BMI of 24.4). 

My aim was to lose at the high end of a healthy rate, 2 pounds per week. Credible sources like the Mayo Clinic and the CDC suggested aiming for 1–2 pounds a week, because anything beyond that can cause issues with muscle loss as well as malnutrition.

But how could I accomplish that?

One weird trick — Eat less

I’ve lost weight once previously (about 15 years ago), although it was a smaller amount. Back then, I learned that there’s no silver bullet — the trick is to create a calorie deficit, so that your body consumes more energy than the calories in what you eat. 

Every pound is about 3500 calories, which helps to set a weekly and daily goal for your calorie deficit. For me to lose 2 pounds a week, that’s 2*3500 = 7000 calories/week, or 1000 calories/day of deficit (eating that much less than my body uses).

Exercise barely makes a dent

It’s far more effective and efficient to create this deficit primarily through eating less rather than expecting exercise to make a huge difference. If you were previously gaining weight, you might’ve been eating 3000 calories/day or more! You can easily reduce what you eat by 1500 calories/day from that starting point, but it’s almost impossible to exercise enough to burn that many calories. An hour of intense exercise might burn 500 calories, and it’s very hard to keep up that level of effort for even one full hour — especially if you’ve been sitting in a chair all day for years on end.

Not to mention, that much exercise would defeat the whole idea of this being the lazy person’s way of making progress.

So how exactly can you reduce calories? You’ve got a lot of options, but they basically boil down to two things — eat less (portion control), and eat better (food choice).

The plan

At this point, I knew I needed to eat 1000 calories/day less than I burned. I used this calculator to identify that, as a sedentary person, I burned about 2450 calories/day. So to create that deficit, I needed to eat about 1450 calories/day. At that point, I was probably eating 2800–3000 calories/day, so that would require massive changes in my diet.

I don’t like the idea of fad diets that completely remove one or many types of foods entirely (Atkins, keto, paleo, etc), although they can work for other people. One of those big lessons about dieting is that as long as you’re removing something from what you eat, you’ll probably lose weight. 

I decided to make two big changes: how often I ate healthy vs unhealthy food, and when I ate over the course of the day. At the time, I was eating a huge amount of high-fat, high-sugar, and low-health foods like burgers and fries multiple times per week, fried food, lots of chips/crisps, white bread (very high sugar in the US) & white rice, cheese, chocolate and candy. 

I decided to shift that toward white meat (chicken/pork/turkey), seafood, salads & veggies, and whole grains (whole-wheat bread, brown rice, quinoa, etc). One pro-tip: American salad dressings are super unhealthy, often even the “vinaigrettes” that sound better. Do like Italians do, and dress salads yourself with olive oil, salt, and vinegar. However, I didn’t want to remove my favorite foods entirely, because that would destroy my long-term motivation and enjoyment of my progress. For example, once a week, I still allow myself to get a cheeseburger. But I’ll typically get a single patty, no fatty sauces, and with a side like salad (w/ healthy dressing) or cole slaw. I’ll also ensure my other meal of the day is very light. Many days, I’ll enjoy a small treat like 1–2 chocolates, as well (50–100 calories).

What if you like beer?

I wanted to reach my calorie target without eliminating beer, so I could both preserve my quality of life and also maintain the moderate drinking that research shows is correlated with increased lifespan. 

I was also drinking very high-calorie beer (like double IPAs and bourbon-barrel–aged imperial stouts). I shifted that toward low-alcohol, low-calorie beer (alcohol levels and calories are correlated). Bell’s Light-Hearted IPA and Lagunitas DayTime IPA are two pretty good ones in my area. Of the non-alcoholic (NA) beers, Athletic Free Wave Hazy IPA is the best I’ve found in my area, but Untappd has reasonably good ratings for Sam Adams Just the Haze and Sierra Nevada Trail Pass IPA, which should be broadly available. As a rough estimate on calories in beer, you can use this formula:

Beer calories = ABV (alcohol percentage) * 2.5 * fluid ounces

As an exception, many Belgian beers are quite “efficient” to drink, in that roughly 75% of the calories are alcohol rather than other carbs that just add calories. As a result, they violate the above formula and tend to be lower-calorie than you’d expect. This could be the result of carefully crafted recipes that consume most of the carbs, and fermentation that uses up all of the sugar. 

Here’s a more specific formula that you can use, if you’re curious about how “efficient” a given beer is, and you know how many total calories it has (find this online):

Beer calories from ethanol = (ABV * 0.8 / 100) * (29.6 * fluid ounces) * 7

(Simplified form): Beer calories from ethanol = ABV * 1.7 * fluid ounces

This uses the density and calories of ethanol (0.8 g/ml and 7 cal/g, respectively) and converts from milliliters to ounces (29.6 ml/oz). If you then calculate that number as a fraction of the total calories in a beer, you can find its “efficiency.” For example, a 12-ounce bottle of 8.5% beer might have 198 calories total. Using the equation, we can calculate that it’s got 169 calories from ethanol, so 169/198 = 85% “efficient.”

If you’re really trying to optimize for this, however, beer is the wrong drink. Have a low-calorie mixed drink instead, like a vodka soda or ranch water.

The plan (part 2)

Therefore, instead of giving up beer entirely, I decided to skip breakfast. I’d eaten light breakfasts for years (a small bowl of cereal, or a banana and a granola bar), so this wasn’t a big deal to me. 

Later, I discovered this qualified my diet as time-restricted intermittent fasting as well, since I was only eating/drinking between ~12pm–6pm. This approach of 18 hours off / 6 hours on (18:6 fasting) may have aided in my weight loss, but studies are mixed with some suggesting no effect.

Here’s what a day might look like on 1450 calories:

  • Light lunch (450 calories). A tuna-salad sandwich (made with Greek yogurt instead of mayo) on whole-wheat bread, and a side of cottage cheese.
  • Afternoon snack of veggies (~25 calories). Sliced bell peppers, no dip.  
  • A treat (50–100 calories). A truffle or a couple of small chocolates as an afternoon treat.
  • Dinner (700 calories). Fried chicken/fish sandwich (or kids-size burger) and a small order of fries, from a fast-casual restaurant.
  • One or two low-alcohol or NA beers (150–200 calories).

When I get hungry, I often drink some water instead, because my body’s easily confused about hunger vs thirst. It’s a mental game too — I remind myself that hunger means my body is burning fat, and that’s a good thing.

For a long time, I kept track of my estimated calorie consumption mentally. More recently, I decided to make my life a little easier by switching to an app. I chose MyFitnessPal because it’s got a big database including almost everything I eat.

On this plan, I had a great deal of success in losing my first 40 pounds, getting down from 240 to 200. However, it started to feel like a bit of a struggle to maintain my weight loss as I reached 200 pounds and wanted to continue losing at the same rate of 2 pounds/week.

Adaptation, plateaus and persistence

I fell behind by about two weeks on my weight-loss goal, which was massively frustrating because I’d done so well all along. I convinced myself to keep persisting because it had worked all along for months, and this was a temporary setback.

Finally I re-used the same weight-loss calculator and realized what seemed obvious in hindsight: Since I now weighed less, I also burned fewer calories per day! Those 40 pounds that were now gone didn’t use any energy anymore, but I was still eating as if I had them. I needed to change something to restore the 1000-calorie daily deficit. 

At this point, I aimed to decrease my intake to about 1200 calories per day. This quickly became frustrating because it started to affect my quality of life by forcing choices I didn’t want to make, such as choosing between a decent dinner or a beer, or forcing me to eat a salad with no protein for dinner if I had a little bit bigger lunch.

That low calorie limit also carried the risk of causing metabolic adaptation — meaning my body could burn hundreds fewer calories per day as a result of being in a “starvation mode” of sorts. That ends up being a vicious cycle that continually forces you to eat less, and it makes weight loss even more challenging.

Consequently, I began to introduce moderate exercise (walking), so I could bring my intake back up to 1400 calories on days when I burned 200 extra calories. I’ll discuss the details in a follow-up guide for fitness.

Over the course of my learning, I discovered that it’s ideal (according to actuarial tables) to sit in the middle of the healthy range rather than be at the top of it. I maintained my initial weight-loss goal to keep myself motivated on progress, but set a second goal of reaching 170 pounds.

Eat lots of protein

I also discovered that high-protein diets are better at preserving muscle, so more of the weight loss is fat. This is especially true when coupled with resistance or strength training, which also sends your body a signal that it needs to keep its muscle instead of losing it. The minimum recommended daily allowance (RDA) of protein (0.36 grams per pound of body weight, or 67 g/day for me) could be your absolute lower limit, while as much as 0.6 g/lb (111 g/day for me) could help in improving your muscle mass. 

Another study suggested multiplying the RDA by 1.25–1.5 (or more if you exercise) to maintain muscle during weight loss, which would put my recommended protein at 84–100 grams per day. The same study also said exercise helps to maintain muscle during weight loss, so it could be an either/or situation rather than needing both. Additionally, high-protein diets can help with hunger and weight loss, in part because they keep you fuller for longer. Getting 25%–30% of daily calories from protein will get you to this level, which is a whole lot of protein. Starting from your overall daily calories, you can apply this percentage and then divide your desired protein calories by 4 to get the number of grams per day:

Protein grams per day = Total daily calories * {25%, 30%} / 4

For my calorie limit, that’s about 88–105 grams per day. 

I’ve found that eating near the absolute minimum recommended protein level (67 grams per day, for my weight) tends to happen fairly naturally with my originally planned diet, while getting much higher protein takes real effort. I needed to identify low-calorie, high-protein foods and incorporate them more intentionally into meals, so that I can get enough protein without compromising my daily calorie limit. 

Here’s a good list of low-calorie, high-protein foods that are pretty affordable:

  • Breakfast/Lunch: eggs or low-fat/nonfat Greek yogurt (with honey/berries), 
  • Entree: grilled/roasted chicken (or pork/turkey) or seafood (especially shrimp, canned salmon, canned tuna), and
  • Sides: cottage cheese or lentils/beans (including soups, to make it an entree).

If you’re vegetarian, you’d want to go heavier on lentils and beans, and add plenty of nuts, including hummus and peanut butter. You probably also want to bring in tempeh, and you likely already eat tofu.

I’d never tried canned salmon before, and I was impressed with how easily I could make it into a salad or an open-faced sandwich (like Danish smørrebrød). The salmon came in large pieces and retained the original texture, as you’d want. Canned tuna comes in more of a huge chunk and doesn’t look very appetizing until it’s broken up.

Avoid the most common brands of canned fish though, like Chicken of the Sea, StarKist, or Bumble Bee. They are often farmed or net-caught instead of pole/line-caught, and they may be higher in parasites (for farmed fish like salmon). I also buy lower-mercury types of salmon and tuna — this means I can eat each kind of fish as often as I want, instead of once a week. I buy canned Wild Planet skipjack tuna (not albacore, but yellowfin is pretty good too) and canned Deming’s sockeye salmon (not pink salmon) at my local grocery store, and I pick up large trays of refrigerated cocktail shrimp at Costco. The Genova brand also garners good reviews for canned fish and may be easier to find. All of those are pre-cooked and ready to eat, so they’re easy to use for a quick lunch. 

Go ahead and get fresh seafood if you want, but be aware that you’ll be going through a lot of it so it could get expensive. Fish only stays good for a couple of days unless frozen, so you’ll also be making a lot of trips to the store or regularly thawing/cooking frozen fish.

Summary

Over the past 7 months, I’ve managed to lose 55 pounds (and counting!) through a low-effort approach that has minimized the overall impact on my quality of life. I’ve continued to eat the foods I want — but less of them.

The biggest challenge has been persistence through the tough times. However, not cutting out any foods completely, but rather just decreasing the frequency of unhealthy foods in my life, has been a massive help with that. That meant I didn’t feel like I was breaking my whole diet whenever I had something I really wanted, as long as it fit within my calorie limit.

What’s next? A few months after beginning my weight loss, I also started working out to get into better shape, which was another one of those original 5 factors to a long life. Right now, I’m aiming to get down to about 10% body fat, which is likely to be around 170 pounds. Then I’ll flip my eating habits into muscle-building mode, which will require a slight caloric excess rather than a deficit. 

Stay tuned to see what happens!

January 20, 2024

Hi! This month has been pretty hectic due to the SourceHut network outage. We’ve all in the staff team invested a lot of time and energy to minimize the downtime as much as possible. Thankfully things have settled down now, there are still a lot of follow-up tasks to complete but with less urgency. I’m really grateful for the community’s reaction, everybody had been very understanding and supportive. Thank you!

In other SourceHut news, I’ve been working on yojo, a bridge which provides CI for Codeberg projects via builds.sr.ht. I’ve added support for pull requests, taught yojo to handle multiple manifests, added logic to automatically refresh access tokens before they expire, and fixed a bunch of bugs.

The NPotM is sr.ht-container-compose, a docker-compose configuration for SourceHut. It provides an easy way to spin up a SourceHut development environment without having to set up each service and its dependencies individually. I hope this project can reduce friction for new SourceHut contributors. There are many services missing, patches welcome!

This month, we’ve finally merged the Sway pull request to use the wlroots scene-graph API! This is exciting because it fixes a whole class of bugs, it removes a lot of manual hand-rolled logic in Sway (e.g. rendering, damage tracking, input event routing, direct scan-out, some of the protocol support…), it provides nice performance optimizations via culling (e.g. the background image is no longer painted if a web browser is covering it), and it unlocks upcoming performance optimizations (e.g. KMS plane offloading). Many thanks to Alexander for writing the patches and maintaining them for over a year, and to Kirill for pushing it over the finish line!

On the wlroots side, my work on wlr_surface_synced has been merged, allowing us to latch surface commits until an arbitrary condition is met. This work is necessary for the upcoming explicit synchronization protocol, as well as the work-in-progress transactions protocol and avoiding compositor freezes when a client is very slow to render. We’ve released wlroots 0.17.1, with a collection of bugfixes backported by Simon Zeni. Last, we’ve dropped support for the legacy wl_drm protocol by default, and this caused a bit of breakage here and there (xserver, libva, amdvlk). We do really want to phase out wl_drm though, so we’ve decided to stick with that removal.

This month’s collection of miscellaneous project updates includes go-imap v2 alpha 8 with separate types for sequence numbers and UIDs, which was a lot of work to get right but I think was worth it. I’ve also released go-maildir v0.4.0 with a new Walk function (to iterate over messages without allocating a list) and numerous fixes. I’ve sent a GitLab cli patch to fix invalid release asset links for third-party GitLab instances, and a Meson patch to add C23 support.

See you next month!

January 11, 2024

This post is in part a response to an aspect of Nate’s post “Does Wayland really break everything?“, but also my reflection on discussing Wayland protocol additions, a unique pleasure that I have been involved with for the past months1.

Some facts

Before I start I want to make a few things clear: The Linux desktop will be moving to Wayland2 – this is a fact at this point (and has been for a while), sticking to X11 makes no sense for future projects. From reading Wayland protocols and working with it at a much lower level than I ever wanted to, it is also very clear to me that Wayland is an exceptionally well-designed core protocol, and so are the additional extension protocols (xdg-shell & Co.). The modularity of Wayland is great, it gives it incredible flexibility and will for sure turn out to be good for the long-term viability of this project (and also provides a path to correct protocol issues in future, if one is found). In other words: Wayland is an amazing foundation to build on, and a lot of its design decisions make a lot of sense!

The shift towards people seeing “Linux” more as an application developer platform, and taking PipeWire and XDG Portals into account when designing for Wayland is also an amazing development and I love to see this – this holistic approach is something I always wanted!

Furthermore, I think Wayland removes a lot of functionality that shouldn’t exist in a modern compositor – and that’s a good thing too! Some of X11’s features and design decisions had clear drawbacks that we shouldn’t replicate. I highly recommend to read Nate’s blog post, it’s very good and goes into more detail. And due to all of this, I firmly believe that any advancement in the Wayland space must come from within the project.

But!

But! Of course there was a “but” coming 😉 – I think while developing Wayland-as-an-ecosystem we are now entrenched into narrow concepts of how a desktop should work. While discussing Wayland protocol additions, a lot of concepts clash, people from different desktops with different design philosophies debate the merits of those over and over again never reaching any conclusion (just as you will never get an answer out of humans whether sushi or pizza is the clearly superior food, or whether CSD or SSD is better). Some people want to use Wayland as a vehicle to force applications to submit to their desktop’s design philosophies, others prefer the smallest and leanest protocol possible, other developers want the most elegant behavior possible. To be clear, I think those are all very valid approaches.

But this also creates problems: By switching to Wayland compositors, we are already forcing a lot of porting work onto toolkit developers and application developers. This is annoying, but just work that has to be done. It becomes frustrating though if Wayland provides toolkits with absolutely no way to reach their goal in any reasonable way. For Nate’s Photoshop analogy: Of course Linux does not break Photoshop, it is Adobe’s responsibility to port it. But what if Linux was missing a crucial syscall that Photoshop needed for proper functionality and Adobe couldn’t port it without that? In that case it becomes much less clear on who is to blame for Photoshop not being available.

A lot of Wayland protocol work is focused on the environment and design, while applications and work to port them often is considered less. I think this happens because the overlap between application developers and developers of the desktop environments is not necessarily large, and the overlap with people willing to engage with Wayland upstream is even smaller. The combination of Windows developers porting apps to Linux and having involvement with toolkits or Wayland is pretty much nonexistent. So they have less of a voice.

A quick detour through the neuroscience research lab

I have been involved with Freedesktop, GNOME and KDE for an incredibly long time now (more than a decade), but my actual job (besides consulting for Purism) is that of a PhD candidate in a neuroscience research lab (working on the morphology of biological neurons and its relation to behavior). I am mostly involved with three research groups in our institute, which is about 35 people. Most of us do all our data analysis on powerful servers which we connect to using RDP (with KDE Plasma as desktop). Since I joined, I have been pushing the envelope a bit to extend Linux usage to data acquisition and regular clients, and to have our data acquisition hardware interface well with it. Linux brings some unique advantages for use in research, besides the obvious one of having every step of your data management platform introspectable with no black boxes left, a goal I value very highly in research (but this would be its own blogpost).

In terms of operating system usage though, most systems are still Windows-based. Windows is what companies develop for, and what people use by default and are familiar with. The choice of operating system is very strongly driven by application availability, and WSL being really good makes this somewhat worse, as it removes the need for people to switch to a real Linux system entirely if there is the occasional software requiring it. Yet, we have a lot more Linux users than before, and use it in many places where it makes sense. I also developed a novel data acquisition software that even runs on Linux-only and uses the abilities of the platform to its fullest extent. All of this resulted in me asking existing software and hardware vendors for Linux support a lot more often. Vendor-customer relationship in science is usually pretty good, and vendors do usually want to help out. Same for open source projects, especially if you offer to do Linux porting work for them… But overall, the ease of use and availability of required applications and their usability rules supreme. Most people are not technically knowledgeable and just want to get their research done in the best way possible, getting the best results with the least amount of friction.

KDE/Linux usage at a control station for a particle accelerator at Adlershof Technology Park, Germany, for reference (by 25years of KDE)3

Back to the point

The point of that story is this: GNOME, KDE, RHEL, Debian or Ubuntu: They all do not matter if the necessary applications are not available for them. And as soon as they are, the easiest-to-use solution wins. There are many facets of “easiest”: In many cases this is RHEL due to Red Hat support contracts being available, in many other cases it is Ubuntu due to its mindshare and ease of use. KDE Plasma is also frequently seen, as it is perceived a bit easier to onboard Windows users with it (among other benefits). Ultimately, it comes down to applications and 3rd-party support though.

Here’s a dirty secret: In many cases, porting an application to Linux is not that difficult. The thing that companies (and FLOSS projects too!) struggle with and will calculate the merits of carefully in advance is whether it is worth the support cost as well as continuous QA/testing. Their staff will have to do all of that work, and they could spend that time on other tasks after all.

So if they learn that “porting to Linux” not only means added testing and support, but also means to choose between the legacy X11 display server that allows for 1:1 porting from Windows or the “new” Wayland compositors that do not support the same features they need, they will quickly consider it not worth the effort at all. I have seen this happen.

Of course many apps use a cross-platform toolkit like Qt, which greatly simplifies porting. But this just moves the issue one layer down, as now the toolkit needs to abstract Windows, macOS and Wayland. And Wayland does not contain features to do certain things or does them very differently from e.g. Windows, so toolkits have no way to actually implement the existing functionality in a way that works on all platforms. So in Qt’s documentation you will often find texts like “works everywhere except for on Wayland compositors or mobile”4.

Many missing bits or altered behavior are just papercuts, but those add up. And if users will have a worse experience, this will translate to more support work, or people not wanting to use the software on the respective platform.

What’s missing?

Window positioning

SDI applications with multiple windows are very popular in the scientific world. For data acquisition (for example with microscopes) we often have one monitor with control elements and one larger one with the recorded image. There is also other configurations where multiple signal modalities are acquired, and the experimenter aligns windows exactly in the way they want and expects the layout to be stored and to be loaded upon reopening the application. Even in the image from Adlershof Technology Park above you can see this style of UI design, at mega-scale. Being able to pop-out elements as windows from a single-window application to move them around freely is another frequently used paradigm, and immensely useful with these complex apps.

It is important to note that this is not a legacy design, but in many cases an intentional choice – these kinds of apps work incredibly well on larger screens or many screens and are very flexible (you can have any window configuration you want, and switch between them using the (usually) great window management abilities of your desktop).

Of course, these apps will work terribly on tablets and small form factors, but that is not the purpose they were designed for and nobody would use them that way.

I assumed for sure these features would be implemented at some point, but when it became clear that that would not happen, I created the ext-placement protocol which had some good discussion but was ultimately rejected from the xdg namespace. I then tried another solution based on feedback, which turned out not to work for most apps, and now proposed xdg-placement (v2) in an attempt to maybe still get some protocol done that we can agree on, exploring more options before pushing the existing protocol for inclusion into the ext Wayland protocol namespace. Meanwhile though, we can not port any application that needs this feature, while at the same time we are switching desktops and distributions to Wayland by default.

Window position restoration

Similarly, a protocol to save & restore window positions was already proposed in 2018, 6 years ago now, but it has still not been agreed upon, and may not even help multiwindow apps in its current form. The absence of this protocol means that applications can not restore their former window positions, and the user has to move them to their previous place again and again.

Meanwhile, toolkits can not adopt these protocols and applications can not use them and can not be ported to Wayland without introducing papercuts.

Window icons

Similarly, individual windows can not set their own icons, and not-installed applications can not have an icon at all because there is no desktop-entry file to load the icon from and no icon in the theme for them. You would think this is a niche issue, but for applications that create many windows, providing icons for them so the user can find them is fairly important. Of course it’s not the end of the world if every window has the same icon, but it’s one of those papercuts that make the software slightly less user-friendly. Even applications with fewer windows like LibrePCB are affected, so much so that they rather run their app through Xwayland for now.

I decided to address this after I was working on data analysis of image data in a Python virtualenv, where my code and the Python libraries used created lots of windows all with the default yellow “W” icon, making it impossible to distinguish them at a glance. This is xdg-toplevel-icon now, but of course it is an uphill battle where the very premise of needing this is questioned. So applications can not use it yet.

Limited window abilities requiring specialized protocols

Firefox has a picture-in-picture feature, allowing it to pop out media from a mediaplayer as separate floating window so the user can watch the media while doing other things. On X11 this is easily realized, but on Wayland the restrictions posed on windows necessitate a different solution. The xdg-pip protocol was proposed for this specialized usecase, but it is also not merged yet. So this feature does not work as well on Wayland.

Automated GUI testing / accessibility / automation

Automation of GUI tasks is a powerful feature, so is the ability to auto-test GUIs. This is being worked on, with libei and wlheadless-run (and stuff like ydotool exists too), but we’re not fully there yet.

Wayland is frustrating for (some) application authors

As you see, there is valid applications and valid usecases that can not be ported yet to Wayland with the same feature range they enjoyed on X11, Windows or macOS. So, from an application author’s perspective, Wayland does break things quite significantly, because things that worked before can no longer work and Wayland (the whole stack) does not provide any avenue to achieve the same result.

Wayland does “break” screen sharing, global hotkeys, gaming latency (via “no tearing”) etc, however for all of these there are solutions available that application authors can port to. And most developers will gladly do that work, especially since the newer APIs are usually a lot better and more robust. But if you give application authors no path forward except “use Xwayland and be on emulation as second-class citizen forever”, it just results in very frustrated application developers.

For some application developers, switching to a Wayland compositor is like buying a canvas from the Linux shop that forces your brush to only draw triangles. But maybe for your avant-garde art, you need to draw a circle. You can approximate one with triangles, but it will never be as good as the artwork of your friends who got their canvases from the Windows or macOS art supply shop and have more freedom to create their art.

Triangles are proven to be the best shape! If you are drawing circles you are creating bad art!

Wayland, via its protocol limitations, forces a certain way to build application UX – often for the better, but also sometimes to the detriment of users and applications. The protocols are often fairly opinionated, a result of the lessons learned from X11. In any case though, it is the odd one out – Windows and macOS do not pose the same limitations (for better or worse!), and the effort to port to Wayland is orders of magnitude bigger, or sometimes in case of the multiwindow UI paradigm impossible to achieve to the same level of polish. Desktop environments of course have a design philosophy that they want to push, and want applications to integrate as much as possible (same as macOS and Windows!). However, there are many applications out there, and pushing a design via protocol limitations will likely just result in fewer apps.

The porting dilemma

I spent probably way too much time looking into how to get applications cross-platform and running on Linux, often talking to vendors (FLOSS and proprietary) as well. Wayland limitations aren’t the biggest issue by far, but they do start to come come up now, especially in the scientific space with Ubuntu having switched to Wayland by default. For application authors there is often no way to address these issues. Many scientists do not even understand why their Python script that creates some GUIs suddenly behaves weirdly because Qt is now using the Wayland backend on Ubuntu instead of X11. They do not know the difference and also do not want to deal with these details – even though they may be programmers as well, the real goal is not to fiddle with the display server, but to get to a scientific result somehow.

Another issue is portability layers like Wine which need to run Windows applications as-is on Wayland. Apparently Wine’s Wayland driver has some heuristics to make window positioning work (and I am amazed by the work done on this!), but that can only go so far.

A way out?

So, how would we actually solve this? Fundamentally, this excessively long blog post boils down to just one essential question:

Do we want to force applications to submit to a UX paradigm unconditionally, potentially loosing out on application ports or keeping apps on X11 eternally, or do we want to throw them some rope to get as many applications ported over to Wayland, even through we might sacrifice some protocol purity?

I think we really have to answer that to make the discussions on wayland-protocols a lot less grueling. This question can be answered at the wayland-protocols level, but even more so it must be answered by the individual desktops and compositors.

If the answer for your environment turns out to be “Yes, we want the Wayland protocol to be more opinionated and will not make any compromises for application portability”, then your desktop/compositor should just immediately NACK protocols that add something like this and you simply shouldn’t engage in the discussion, as you reject the very premise of the new protocol: That it has any merit to exist and is needed in the first place. In this case contributors to Wayland and application authors also know where you stand, and a lot of debate is skipped. Of course, if application authors want to support your environment, you are basically asking them now to rewrite their UI, which they may or may not do. But at least they know what to expect and how to target your environment.

If the answer turns out to be “We do want some portability”, the next question obviously becomes where the line should be drawn and which changes are acceptable and which aren’t. We can’t blindly copy all X11 behavior, some porting work to Wayland is simply inevitable. Some written rules for that might be nice, but probably more importantly, if you agree fundamentally that there is an issue to be fixed, please engage in the discussions for the respective MRs! We for sure do not want to repeat X11 mistakes, and I am certain that we can implement protocols which provide the required functionality in a way that is a nice compromise in allowing applications a path forward into the Wayland future, while also being as good as possible and improving upon X11. For example, the toplevel-icon proposal is already a lot better than anything X11 ever had. Relaxing ACK requirements for the ext namespace is also a good proposed administrative change, as it allows some compositors to add features they want to support to the shared repository easier, while also not mandating them for others. In my opinion, it would allow for a lot less friction between the two different ideas of how Wayland protocol development should work. Some compositors could move forward and support more protocol extensions, while more restrictive compositors could support less things. Applications can detect supported protocols at launch and change their behavior accordingly (ideally even abstracted by toolkits).

You may now say that a lot of apps are ported, so surely this issue can not be that bad. And yes, what Wayland provides today may be enough for 80-90% of all apps. But what I hope the detour into the research lab has done is convince you that this smaller percentage of apps matters. A lot. And that it may be worthwhile to support them.

To end on a positive note: When it came to porting concrete apps over to Wayland, the only real showstoppers so far5 were the missing window-positioning and window-position-restore features. I encountered them when porting my own software, and I got the issue as feedback from colleagues and fellow engineers. In second place was UI testing and automation support, the window-icon issue was mentioned twice, but being a cosmetic issue it likely simply hurts people less and they can ignore it easier.

What this means is that the majority of apps are already fine, and many others are very, very close! A Wayland future for everyone is within our grasp! 😄

I will also bring my two protocol MRs to their conclusion for sure, because as application developers we need clarity on what the platform (either all desktops or even just a few) supports and will or will not support in future. And the only way to get something good done is by contribution and friendly discussion.

Footnotes

  1. Apologies for the clickbait-y title – it comes with the subject 😉 ↩
  2. When I talk about “Wayland” I mean the combined set of display server protocols and accepted protocol extensions, unless otherwise clarified. ↩
  3. I would have picked a picture from our lab, but that would have needed permission first ↩
  4. Qt has awesome “platform issues” pages, like for macOS and Linux/X11 which help with porting efforts, but Qt doesn’t even list Linux/Wayland as supported platform. There is some information though, like window geometry peculiarities, which aren’t particularly helpful when porting (but still essential to know). ↩
  5. Besides issues with Nvidia hardware – CUDA for simulations and machine-learning is pretty much everywhere, so Nvidia cards are common, which causes trouble on Wayland still. It is improving though. ↩

Igalia is always working hard to improve 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi devices. One of our most recent efforts in this sense was the implementation of CPU jobs from the Vulkan driver to the V3D kernel driver.

What are CPU jobs and why do we need them?

In the V3DV driver, there are some Vulkan commands that cannot be performed by the GPU alone, so we implement those as CPU jobs on Mesa. A CPU job is a job that requires CPU intervention to be performed. For example, in the Broadcom VideoCore GPUs, we don’t have a way to calculate the timestamp. But we need the timestamp for Vulkan timestamp queries. Therefore, we need to calculate the timestamp on the CPU.

A CPU job in userspace also implies CPU stalling. Sometimes, we need to hold part of the command submission flow in order to correctly synchronize their execution. This waiting period caused the CPU to stall, thereby preventing the continuous submission of jobs to the GPU. To mitigate this issue, we decided to move CPU job mechanisms from the V3DV driver to the V3D kernel driver.

In the V3D kernel driver, we have different kinds of jobs: RENDER jobs, BIN jobs, CSD jobs, TFU jobs, and CLEAN CACHE jobs. For each of those jobs, we have a DRM scheduler instance that helps us to synchronize the jobs.

If you want to know more about the different kinds of V3D jobs, check out this November Update: Exploring V3D blogpost, where I explain more about all the V3D IOCTLs and jobs.

Jobs of the same kind are submitted, dispatched, and processed in the same order they are executed, using a standard first-in-first-out (FIFO) queue system. We can synchronize different jobs across different queues using DRM syncobjs. More about the V3D synchronization framework and user extensions can be learned in this two-part blog post from Melissa Wen.

From the kernel documentation, a DRM syncobj (synchronisation objects) are containers for stuff that helps sync up GPU commands. They’re super handy because you can use them in your own programs, share them with other programs, and even use them across different DRM drivers. Mostly, they’re used for making Vulkan fences and semaphores work.

By moving the CPU job from userspace to the kernel, we can make use of the DRM schedule queues and all the advantages it brings with it. For this, we created a new type of job in the V3D kernel driver, a CPU job, which also means creating a new DRM scheduler instance and a CPU job queue. Now, instead of stalling the submission thread waiting for the GPU to idle, we can use DRM syncobjs to synchronize both CPU and GPU jobs in a submission, providing more efficient usage of the GPU.

How did we implement the CPU jobs in the kernel driver?

After we decided to have a CPU job implementation in the kernel space, we could think about two possible implementations for this job: creating an IOCTL for each type of CPU job or using a user extension to provide a polymorphic behavior to a single CPU job IOCTL.

We have different types of CPU jobs (indirect CSD jobs, timestamp query jobs, copy query results jobs…) and each of them has a common infrastructure of allocation and synchronization but performs different operations. Therefore, we decided to go with the option to use user extensions.

On Melissa’s blogpost, she digs deep into the implementation of generic IOCTL extensions in the V3D kernel driver. But, to put it simply, instead of expanding the data struct for each IOCTL every time we need to add a new feature, we define a user extension chain instead. As we add new optional interfaces to control the IOCTL, we define a new extension struct that can be linked to the IOCTL data only when required by the user.

Therefore, we created a new IOCTL, drm_v3d_submit_cpu, which is used to submit any type of CPU job. This single IOCTL can be extended by a user extension, which allows us to reuse the common infrastructure - avoiding code repetition - and yet use the user extension ID to identify the type of job and depending on the type of job, perform a certain operation.

struct drm_v3d_submit_cpu {
        /* Pointer to a u32 array of the BOs that are referenced by the job.
         *
         * For DRM_V3D_EXT_ID_CPU_INDIRECT_CSD, it must contain only one BO,
         * that contains the workgroup counts.
         *
         * For DRM_V3D_EXT_ID_TIMESTAMP_QUERY, it must contain only one BO,
         * that will contain the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY, it must contain only
         * one BO, that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, it must contain two
         * BOs. The first is the BO where the timestamp queries will be written
         * to. The second is the BO that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY, it must contain no
         * BOs.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY, it must contain one
         * BO, where the performance queries will be written.
         */
        __u64 bo_handles;

        /* Number of BO handles passed in (size is that times 4). */
        __u32 bo_handle_count;

        __u32 flags;

        /* Pointer to an array of ioctl extensions*/
        __u64 extensions;
};

Now, we can create a CPU job and submit it with a CPU job user extension.

And which extensions are available?

  1. DRM_V3D_EXT_ID_CPU_INDIRECT_CSD: this CPU job allows us to submit an indirect CSD job. An indirect CSD job is a job that, when executed in the queue, will map an indirect buffer, read the dispatch parameters, and submit a regular dispatch. This CPU job is used in Vulkan calls like vkCmdDispatchIndirect().
  2. DRM_V3D_EXT_ID_CPU_TIMESTAMP_QUERY: this CPU job calculates the query timestamp and updates the query availability by signaling a syncobj. This CPU job is used in Vulkan calls like vkCmdWriteTimestamp().
  3. DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY: this CPU job resets the timestamp queries based on the value offset of the first query. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for timestamp queries.
  4. DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY: this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for timestamp queries.
  5. DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY: this CPU job resets the performance queries by resetting the values of the perfmons. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for performance queries.
  6. DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY: similar to DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for performance queries.

The CPU job IOCTL structure is similar to any other V3D job. We allocate the job struct, parse all the extensions, init the job, look up the BOs and lock its reservations, add the proper dependencies, and push the job to the DRM scheduler entity.

When running a CPU job, we execute the following code:

static const v3d_cpu_job_fn cpu_job_function[] = {
        [V3D_CPU_JOB_TYPE_INDIRECT_CSD] = v3d_rewrite_csd_job_wg_counts_from_indirect,
        [V3D_CPU_JOB_TYPE_TIMESTAMP_QUERY] = v3d_timestamp_query,
        [V3D_CPU_JOB_TYPE_RESET_TIMESTAMP_QUERY] = v3d_reset_timestamp_queries,
        [V3D_CPU_JOB_TYPE_COPY_TIMESTAMP_QUERY] = v3d_copy_query_results,
        [V3D_CPU_JOB_TYPE_RESET_PERFORMANCE_QUERY] = v3d_reset_performance_queries,
        [V3D_CPU_JOB_TYPE_COPY_PERFORMANCE_QUERY] = v3d_copy_performance_query,
};

static struct dma_fence *
v3d_cpu_job_run(struct drm_sched_job *sched_job)
{
        struct v3d_cpu_job *job = to_cpu_job(sched_job);
        struct v3d_dev *v3d = job->base.v3d;

        v3d->cpu_job = job;

        if (job->job_type >= ARRAY_SIZE(cpu_job_function)) {
                DRM_DEBUG_DRIVER("Unknown CPU job: %d\n", job->job_type);
                return NULL;
        }

        trace_v3d_cpu_job_begin(&v3d->drm, job->job_type);

        cpu_job_function[job->job_type](job);

        trace_v3d_cpu_job_end(&v3d->drm, job->job_type);

        return NULL;
}

The interesting thing is that each CPU job type executes a completely different operation.

The complete kernel implementation has already landed in drm-misc-next and can be seen right here.

What did we change in Mesa-V3DV to use the new kernel-V3D CPU job?

After landing the kernel implementation, I needed to accommodate the new CPU job approach in the userspace.

A fundamental rule is not to cause regressions, i.e., to keep backwards userspace compatibility with old versions of the Linux kernel. This means we cannot break new versions of Mesa running in old kernels. Therefore, we needed to create two paths: one preserving the old way to perform CPU jobs and the other using the kernel to perform CPU jobs.

So, for example, the indirect CSD job used to add two different jobs to the queue: a CPU job and a CSD job. Now, if we have the CPU job capability in the kernel, we only add a CPU job and the CSD job is dispatched from within the kernel.

-   list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
+
+   /* If we have a CPU queue we submit the CPU job directly to the
+    * queue and the CSD job will be dispatched from within the kernel
+    * queue, otherwise we will have to dispatch the CSD job manually
+    * right after the CPU job by adding it to the list of jobs in the
+    * command buffer.
+    */
+   if (!cmd_buffer->device->pdevice->caps.cpu_queue)
+      list_addtail(&csd_job->list_link, &cmd_buffer->jobs);

Furthermore, now we can use syncobjs to sync the CPU jobs. For example, in the timestamp query CPU job, we used to stall the submission thread and wait for completion of all work queued before the timestamp query. Now, we can just add a barrier to the CPU job and it will be properly synchronized by the syncobjs without stalling the submission thread.

   /* The CPU job should be serialized so it only executes after all previously
    * submitted work has completed
    */
   job->serialize = V3DV_BARRIER_ALL;

We were able to test the implementation using multiple CTS tests, such as dEQP-VK.compute.pipeline.indirect_dispatch.*, dEQP-VK.pipeline.monolithic.timestamp.*, dEQP-VK.synchronization.*, dEQP-VK.query_pool.* and dEQP-VK.multiview.*.

The userspace implementation has already landed in Mesa and the full implementation can be checked in this MR.


More about the on-going challenges in the Raspberry Pi driver stack can be checked during this XDC 2023 talk presented by Iago Toral, Juan Suárez and myself. During this talk, Iago mentioned the CPU job work that we have been doing.

Also I cannot finish this post without thanking Melissa Wen and Iago Toral for all the help while developing the CPU jobs for the V3D kernel driver.

January 10, 2024

When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.

Partial because MobileNetV1 is a quite old model by now and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?

Since then, I have been spending some time looking at the state of the art for object detection models. Getting a sense of the gap between the features supported by my driver and the operations that the newer models use.

SSDLite MobileDet is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency.

The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the moment. There are other operations that I didn't support, but those were at the end and could be performed in the CPU without much penalty.

So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:

Performance wasn't that bad at that moment, at 129ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.

I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.

After a lot of time spent staring at a spreadsheet I came up with a reasonable guess at what are the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149ms, so almost 18 inferences can be performed per second.

If we look at a practical use case such that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. With our current performance level, we could run 3-4 inferences on each frame if there may be several objects being tracked at the same time, or 3-4 cameras simultaneously if not.

Given the price level of the single board computers that contain the VIPNano, this is quite a good bang for your bucks. And all open source and heading to mainline!

Next steps

I have started cleaning up the latest changes so they can be reviewed upstream. And need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.

January 09, 2024

This is a guest post written by Daan De Meyer, systemd and mkosi maintainer

Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi.

mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes.

If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

What is mkosi?

mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios.

Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started.

At its core, the workflow of mkosi can be divided into 3 steps:

  1. Generate an OS tree for some distribution by installing a set of packages.
  2. Package up that OS tree in a variety of output formats.
  3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.

Images can be built for any of the following distributions:

  • Fedora Linux
  • Ubuntu
  • OpenSUSE
  • Debian
  • Arch Linux
  • CentOS Stream
  • RHEL
  • Rocky Linux
  • Alma Linux

And the following output formats are supported:

  • GPT disk images built with systemd-repart
  • Tar archives
  • CPIO archives (for building initramfs images)
  • USIs (Unified System Images which are full OS images packed in a UKI)
  • Sysext, confext and portable images
  • Directory trees

For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

To instead boot the image in systemd-nspawn, replace qemu with boot:

$ mkosi -d arch -p systemd -p udev -p linux -t disk boot

The actual image can be found in the current working directory named image.raw. However, using a separate output directory is recommended which is as simple as running mkdir mkosi.output.

To rebuild the image after it's already been built once, add -f to the command line before the verb to rebuild the image. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu or don't pass a verb at all.

By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partition as you see fit:

  • The root partition can be encrypted.
  • Partition sizes can be customized.
  • Partitions can be protected with signed dm-verity.
  • You can opt out of having a root partition and only have a /usr partition instead.
  • You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
  • ...

As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly.

Configuring mkosi image builds

Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down.

For example, the command we used above can be written down in a configuration file mkosi.conf:

[Distribution]
Distribution=arch

[Output]
Format=disk

[Content]
Packages=
        systemd
        udev
        linux

Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

[Match]
Distribution=arch

[Content]
Packages=pacman

Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image.

To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

Bootable images

If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI.

If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

Using mkosi for development

The main requirements to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named mkosi.build (or mkosi.build.chroot) is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script so the build packages are available when running the build script but don't end up in the final image.

An example mkosi.build.chroot script for a project using meson could look as follows:

#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
if ((WITH_TESTS)); then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"

Now, every time the image is built, the build script will be executed and the results will be installed into the image.

The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it.

Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up.

The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache.

The second mode of caching caches the image after all packages have been installed but before running the build script. On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f.

Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible.

With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system.

Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

Building images without root privileges and loop devices

By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root.

Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in an future systemd release.

Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build.

Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.

Supporting older distributions

mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd.

In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

Configuring variants of the same image using profiles

Profiles can be defined in the mkosi.profiles/ directory. The profile to use can be selected using the Profile= setting (or --profile=) on the command line. A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles.

For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the kernel command line.

Building system extension images

System extension images may – dynamically at runtime — extend the base system with an overlay containing additional files.

To build system extensions with mkosi, we need a base image on top of which we can build our extension.

To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go.

We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache

Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

[Output]
Format=directory

[Content]
CleanPackageMetadata=no
Packages=systemd
         udev

We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges.

Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

[Config]
Dependencies=base

[Output]
Format=sysext
Overlay=yes

[Content]
BaseTrees=%O/base
Packages=btrfs-progs

BaseTrees= point to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree.

We can't sign the extension image without a key. We can generate one by running mkosi genkey which will generate files that are automatically picked up when building the image.

Finally, you can build the base image and the extensions by running mkosi -f. You'll find btrfs.raw in mkosi.output which is the extension image.

Various other interesting features

  • To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself.
  • The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits.
  • ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed.
  • mkosi can boot directory trees in a virtual using virtiofsd. This is very useful for quickly rebuilding an image and booting it as the image does not have to be packed up as a disk image.
  • ...

There's many more features that we won't go over in detail here in this blog post. Learn more about those by reading the documentation.

Conclusion

I'll finish with a bunch of links to more information about mkosi and related tooling:

January 08, 2024

Slow Start

It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games.

Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is my old code the mesa bug tracker.

Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins.

Another Game I’ve Never Heard Of

Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain.

Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me.

This report came in hot over the break with some rad new shading techniques:

hm

It looks way cooler if you play the trace, but you get the idea.

Pinpoint Accuracy

First question: what in the Sam Hill is going on here?

Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit.

It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost.

My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present.

Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing.

So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:

if (!batch->has_work) {
<-----HERE
      if (pfence) {
         /* reuse last fence */
         fence = ctx->last_fence;
      }
      if (!deferred) {
         struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
         if (last) {
            sync_flush(ctx, last);
            if (last->is_device_lost)
               check_device_lost(ctx);
         }
      }
      if (ctx->tc && !ctx->track_renderpasses)
      tc_driver_internal_flush_notify(ctx->tc);
} else {
   fence = &batch->state->fence;
   submit_count = batch->state->usage.submit_count;
   if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
      deferred_fence = true;
   else
      flush_batch(ctx, true);
}

Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush.

OR DO YOU?

For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.

Synchronization: You Cannot Escape

The reason this was especially puzzling is the call sequence was:

  • end-of-frame flush
  • present
  • glFenceSync flush

which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on.

Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed?

Why yes, that is stupid, but here at SGC, stupidity is our sustenance.

Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.

January 02, 2024

This Is It.

It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS.

—is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year.

It’s the year a decades-old plan has finally yielded its dividends.

Truth.

You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world.

I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks.

In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal.

The goal of infiltrating Big Triangle.

More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose.

Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources.

I have become an officer.

itsreal.png

I am the chair.

Revolution.

Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebearers!

  • OpenGL 10.0 by 2025!

  • Compatibility Profile shall be renamed ‘SLOW MODE’

  • OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!

  • Depth values shall be uniformly scaled across all hardware and platforms!

  • XFB shall be outlawed!

  • Linux game ports shall no longer link to LLVM!

  • Coherent API error messages shall be printed!

  • Vendors which cannot ship functional Windows GL drivers shall ship Zink!

  • Native GL drivers on mobile platforms shall be outlawed!

  • gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!

  • Mesh and ray-tracing extensions from NVIDIA shall become core functionality!

  • GLX shall be deleted and forgotten!

  • All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!

Rise up and join me, your new GL/ES chair, in the glorious revolution!

DISCLAIMER

Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously.

Happy New Year. I missed you.

December 26, 2023

Holidays are here and I have time to look back at 2023. For six months I have been working for Igalia and what should I say?

I ❤️ it!

This was the best decision to leave my comfort zone of a normal 9-5 job. I am so proud to work on open source GPU drivers and I am able to spend much of my work time on etnaviv.

Driver maintenance

Before adding any new feature I thought it would be great idea to improve the current state of etnaviv’s gallium driver. Therefor I reworked some general driver code to be more consistent and to have a more modern feeling, and made it possible to drop some hand-rolled conversion helpers by switching to already existing solutions (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)).

I worked through the low hanging fruits of crashes seen in CI runs and fixed many of them.

Feature wise, I also looked at some easy to implement extensions like GL_NV_conditional_render and GL_OES_texture_half_float_linear.

Besides the gallium driver I also worked on some NIR and isaspec features that are beneficial for etnaviv.

XDC2023

A personal highlight was to give a talk about etnaviv at XDC2023 in person.

You might wonder what happened since mid October in etnaviv land.

GLES3

I worked on some features that are needed to expose GLES3 and it turned out that an easy to maintain, extend and test compiler backend is needed. Sadly etnaviv’s current backend compiler does not check any of these boxes. It is so fragile that I only added some needed lowerings to pass some of the dEQP-GLES3.functional.shaders.texture_functions.* tests.

Some more fun work regarding some feature emulation is on the horizon and it’s blocked again by the current compiler.

Backend Compiler

etnaviv includes an isaspec powered disassembler now - a small step towards a new backend compiler. Next on the road to success is the etnaviv backend IR with an assembler.

The new backend compiler is able to run OpenCL kernels with the help of rusticl but I want to land the new backend compiler in smaller chunks that are easier to review.

Multiple Render Targets

During my XDC presentation I talked about a feature I got working on GC7000L - Multiple Render Targets (MRT). At this point it was more or less a proof-of-concept regarding the gallium drivers. There were some missing bits and register for full support on more GPU models and therefore more reverse engineering work was needed. Also the gallium driver needed lots of work to add support for MRT.

Some weeks later I had MRT working on a wider range of Vivante GPUs that are supporting this feature. This includes GC2000, GC3000 and GC7000 models among others. As etnaviv makes heavy use of GPU features it should work on even more models.

Looking forward to 2024

I am really confident that we will see GLES3 and OpenCL for etnaviv. As driver testing is quite important for my work I will expand my current board farm and will look into the new star in CI world - ci-tron.

With that, have a happy holiday season and we’ll be back with more improvements in 2024!

December 22, 2023

Last year I wrote a recap of the Vulkan extensions Igalia helped ship in 2022, and in this post I’ll do the exact same for 2023.

Igalia Logo next to the Vulkan Logo

For context and quoting the previous recap:

The ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

So, without further ado, this is the list of extensions we helped ship in 2023.

VK_EXT_attachment_feedback_loop_dynamic_state

This extension builds on last year’s VK_EXT_attachment_feedback_loop_layout, which is used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. The new extension shipped this year adds support for setting attachment feedback loops dynamically on command buffers. As all extensions that add more dynamic state, the goal here is to reduce the number of pipeline objects applications need to create, which makes using the API more flexible. It was created by our beloved super good coder and Valve contractor Mike Blumenkrantz. We reviewed the spec and are listed as contributors, and we wrote dynamic variants of the existing CTS tests.

VK_EXT_depth_bias_control

A new extension proposed by Joshua Ashton that also helps with layering D3D9 on top of Vulkan. The original problem is quite specific. In D3D9 and other APIs, applications can specify what is called a “depth bias” for geometry using an offset that is to be added directly as an exact value to the original depth of each fragment. In Vulkan, however, the depth bias is expressed as a factor of “r”, where “r” is a number that depends on the depth buffer format and, furthermore, may not have a specific fixed value. Implementations can use different values of “r” in an acceptable range. The mechanism provided by Vulkan without this extension is useful to apply small offsets and solve some problems, but it’s not useful to apply large offsets and/or emulate D3D9 by applying a fixed-value bias. The new extension solves these problems by giving apps the chance to control depth bias in a precise way. We reviewed the spec and are listed as contributors, and wrote CTS tests for this extension to help ship it.

VK_EXT_dynamic_rendering_unused_attachments

This extension was proposed by Piers Daniell from NVIDIA to lift some restrictions in the original VK_KHR_dynamic_rendering extension, which is used in Vulkan to avoid having to create render passes and framebuffer objects. Dynamic rendering is very interesting because it makes the API much easier to use and, in many cases and specially in desktop platforms, it can be shipped without any associated performance loss. The new extension relaxes some restrictions that made pipelines more tightly coupled with render pass instances. Again, the goal here is to be able to reuse the same pipeline object with multiple render pass instances and remove some combinatorial explosions that may occur in some apps. We reviewed the spec and are listed as contributors, and wrote CTS tests for the new extension.

VK_EXT_image_sliced_view_of_3d

Shipped at the beginning of the year by Mike Blumenkrantz, the extension again helps emulating other APIs on top of Vulkan. Specifically, the extension allows creating 3D views of 3D images such that the views contain a subset of the slices in the image, using a Z offset and range, in the same way D3D12 allows. We reviewed the spec, we’re listed as contributors, and we wrote CTS tests for it.

VK_EXT_pipeline_library_group_handles

This one comes from Valve contractor Hans-Kristian Arntzen, who is mostly known for working on Proton projects like VKD3D-Proton. The extension is related to ray tracing and adds more flexibility when creating ray tracing pipelines. Ray tracing pipelines can hold thousands of different shaders and are sometimes built incrementally by combining so-called pipeline libraries that contain subsets of those shaders. However, to properly use those pipelines we need to create a structure called a shader binding table, which is full of shader group handles that have to be retrieved from pipelines. Prior to this extension, shader group handles from pipeline libraries had to be requeried once the final pipeline is linked, as they were not guaranteed to be constant throughout the whole process. With this extension, an implementation can tell apps they will not modify shader group handles in subsequent link steps, which makes it easier for apps to build shader binding tables. More importantly, this also more closely matches functionality in DXR 1.1, making it easier to emulate DirectX Raytracing on top of Vulkan raytracing. We reviewed the spec, we’re listed as contributors and we wrote CTS tests for it.

VK_EXT_shader_object

Shader objects is probably the most notorious extension shipped this year, and we contributed small bits to it. This extension makes every piece of state dynamic and removes the need to use pipelines. It’s always used in combination with dynamic rendering, which also removes render passes and framebuffers as explained above. This results in great flexibility from the application point of view. The extension was created by Daniel Story from Nintendo, and its vast set of CTS tests was created by Žiga Markuš but we added our grain of sand by reviewing the spec and proposing some changes (which is why we’re listed as contributors), as well as fixing some shader object tests and providing some improvements here and there once they had been merged. A good part of this work was done in coordination with Mesa developers which were working on implementing this extension for different drivers.

VK_KHR_video_encode_h264 and VK_KHR_video_encode_h265

Fresh out of the oven, these Vulkan Video extensions allow leveraging the hardware to efficiently encode H.264 and H.265 streams. This year we’ve been doing a ton of work related to Vulkan Video in drivers, libraries like GStreamer and CTS/spec, including the two extensions mentioned above. Although not listed as contributors to the spec in those two Vulkan extensions, our work played a major role in advancing the state of Vulkan Video and getting them shipped.

Epilogue

That’s it for this year! I’m looking forward to help ship more extension work the next one and trying to add my part in making Vulkan drivers on Linux (and other platforms!) more stable and feature rich. My Vulkan Video colleagues at Igalia have already started work on future Vulkan Video extensions for AV1 and VP9. Hopefully some of that work is ratified next year. Fingers crossed!

December 21, 2023

"Don't cross the streams. It would be bad."

IR refactorings

A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configurations is generated.

What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations and those are linked between each other non-sequentially.

The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:

A small subsection of SSDLite MobileDet

The adds will be "lowered" or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run in the TP (tensor processing) cores in the NPU.

As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.

For now I have settled into having a separate data structure for the tensors, and having the operations refer to its input and output tensors from the indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.

I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of a IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will be thinking of doing something similar just for it.

For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high level operations to GPGPU instructions, probably starting from a standard such as MLIR.

Tensor addition

Also put some time in putting together all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of what parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.

Lowering the strides

Similarly to the above, there was a lot of staring to a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with stride different than one.

Status and next steps

Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.

The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.

The whole of SSDLite MobileDet

 

December 20, 2023

Last week marked a major milestone for me: the AMD driver-specific color management properties reached the upstream linux-next!

And to celebrate, I’m happy to share the slides notes from my 2023 XDC talk, “The Rainbow Treasure Map” along with the individual recording that just dropped last week on youtube – talk about happy coincidences!

Steam Deck Rainbow: Treasure Map & Magic Frogs

While I may be bubbly and chatty in everyday life, the stage isn’t exactly my comfort zone (hallway talks are more my speed). But the journey of developing the AMD color management properties was so full of discoveries that I simply had to share the experience. Witnessing the fantastic work of Jeremy and Joshua bring it all to life on the Steam Deck OLED was like uncovering magical ingredients and whipping up something truly enchanting.

For XDC 2023, we split our Rainbow journey into two talks. My focus, “The Rainbow Treasure Map,” explored the new color features we added to the Linux kernel driver, diving deep into the hardware capabilities of AMD/Steam Deck. Joshua then followed with “The Rainbow Frogs” and showed the breathtaking color magic released on Gamescope thanks to the power unlocked by the kernel driver’s Steam Deck color properties.

Packing a Rainbow into 15 Minutes

I had so much to tell, but a half-slot talk meant crafting a concise presentation. To squeeze everything into 15 minutes (and calm my pre-talk jitters a bit!), I drafted and practiced those slides and notes countless times.

So grab your map, and let’s embark on the Rainbow journey together!

Slide 1: The Rainbow Treasure Map - Advanced Color Management on Linux with AMD/SteamDeck

Intro: Hi, I’m Melissa from Igalia and welcome to the Rainbow Treasure Map, a talk about advanced color management on Linux with AMD/SteamDeck.

Slide 2: List useful links for this technical talk

Useful links: First of all, if you are not used to the topic, you may find these links useful.

  1. XDC 2022 - I’m not an AMD expert, but… - Melissa Wen
  2. XDC 2022 - Is HDR Harder? - Harry Wentland
  3. XDC 2022 Lightning - HDR Workshop Summary - Harry Wentland
  4. Color management and HDR documentation for FOSS graphics - Pekka Paalanen et al.
  5. Cinematic Color - 2012 SIGGRAPH course notes - Jeremy Selan
  6. AMD Driver-specific Properties for Color Management on Linux (Part 1) - Melissa Wen

Slide 3: Why do we need advanced color management on Linux?

Context: When we talk about colors in the graphics chain, we should keep in mind that we have a wide variety of source content colorimetry, a variety of output display devices and also the internal processing. Users expect consistent color reproduction across all these devices.

The userspace can use GPU-accelerated color management to get it. But this also requires an interface with display kernel drivers that is currently missing from the DRM/KMS framework.

Slide 4: Describe our work on AMD driver-specific color properties

Since April, I’ve been bothering the DRM community by sending patchsets from the work of me and Joshua to add driver-specific color properties to the AMD display driver. In parallel, discussions on defining a generic color management interface are still ongoing in the community. Moreover, we are still not clear about the diversity of color capabilities among hardware vendors.

To bridge this gap, we defined a color pipeline for Gamescope that fits the latest versions of AMD hardware. It delivers advanced color management features for gamut mapping, HDR rendering, SDR on HDR, and HDR on SDR.

Slide 5: Describe the AMD/SteamDeck - our hardware

AMD/Steam Deck hardware: AMD frequently releases new GPU and APU generations. Each generation comes with a DCN version with display hardware improvements. Therefore, keep in mind that this work uses the AMD Steam Deck hardware and its kernel driver. The Steam Deck is an APU with a DCN3.01 display driver, a DCN3 family.

It’s important to have this information since newer AMD DCN drivers inherit implementations from previous families but aldo each generation of AMD hardware may introduce new color capabilities. Therefore I recommend you to familiarize yourself with the hardware you are working on.

Slide 6: Diagram with the three layers of the AMD display driver on Linux

The AMD display driver in the kernel space: It consists of three layers, (1) the DRM/KMS framework, (2) the AMD Display Manager, and (3) the AMD Display Core. We extended the color interface exposed to userspace by leveraging existing DRM resources and connecting them using driver-specific functions for color property management.

Slide 7: Three-layers diagram highlighting AMD Display Manager, DM - the layer that connects DC and DRM

Bridging DC color capabilities and the DRM API required significant changes in the color management of AMD Display Manager - the Linux-dependent part that connects the AMD DC interface to the DRM/KMS framework.

Slide 8: Three-layers diagram highlighting AMD Display Core, DC - the shared code

The AMD DC is the OS-agnostic layer. Its code is shared between platforms and DCN versions. Examining this part helps us understand the AMD color pipeline and hardware capabilities, since the machinery for hardware settings and resource management are already there.

Slide 9: Diagram of the AMD Display Core Next architecture with main elements and data flow

The newest architecture for AMD display hardware is the AMD Display Core Next.

Slide 10: Diagram of the AMD Display Core Next where only DPP and MPC blocks are highlighted

In this architecture, two blocks have the capability to manage colors:

  • Display Pipe and Plane (DPP) - for pre-blending adjustments;
  • Multiple Pipe/Plane Combined (MPC) - for post-blending color transformations.

Let’s see what we have in the DRM API for pre-blending color management.

Slide 11: Blank slide with no content only a title 'Pre-blending: DRM plane'

DRM plane color properties:

This is the DRM color management API before blending.

Nothing!

Except two basic DRM plane properties: color_encoding and color_range for the input colorspace conversion, that is not covered by this work.

Slide 12: Diagram with color capabilities and structures in AMD DC layer without any DRM plane color interface (before blending), only the DRM CRTC color interface for post blending

In case you’re not familiar with AMD shared code, what we need to do is basically draw a map and navigate there!

We have some DRM color properties after blending, but nothing before blending yet. But much of the hardware programming was already implemented in the AMD DC layer, thanks to the shared code.

Slide 13: Previous Diagram with a rectangle to highlight the empty space in the DRM plane interface that will be filled by AMD plane properties

Still both the DRM interface and its connection to the shared code were missing. That’s when the search begins!

Slide 14: Color Pipeline Diagram with the plane color interface filled by AMD plane properties but without connections to AMD DC resources

AMD driver-specific color pipeline:

Looking at the color capabilities of the hardware, we arrive at this initial set of properties. The path wasn’t exactly like that. We had many iterations and discoveries until reached to this pipeline.

Slide 15: Color Pipeline Diagram connecting AMD plane degamma properties, LUT and TF, to AMD DC resources

The Plane Degamma is our first driver-specific property before blending. It’s used to linearize the color space from encoded values to light linear values.

Slide 16: Describe plane degamma properties and hardware capabilities

We can use a pre-defined transfer function or a user lookup table (in short, LUT) to linearize the color space.

Pre-defined transfer functions for plane degamma are hardcoded curves that go to a specific hardware block called DPP Degamma ROM. It supports the following transfer functions: sRGB EOTF, BT.709 inverse OETF, PQ EOTF, and pure power curves Gamma 2.2, Gamma 2.4 and Gamma 2.6.

We also have a one-dimensional LUT. This 1D LUT has four thousand ninety six (4096) entries, the usual 1D LUT size in the DRM/KMS. It’s an array of drm_color_lut that goes to the DPP Gamma Correction block.

Slide 17: Color Pipeline Diagram connecting AMD plane CTM property to AMD DC resources

We also have now a color transformation matrix (CTM) for color space conversion.

Slide 18: Describe plane CTM property and hardware capabilities

It’s a 3x4 matrix of fixed points that goes to the DPP Gamut Remap Block.

Both pre- and post-blending matrices were previously gone to the same color block. We worked on detaching them to clear both paths.

Now each CTM goes on its own way.

Slide 19: Color Pipeline Diagram connecting AMD plane HDR multiplier property to AMD DC resources

Next, the HDR Multiplier. HDR Multiplier is a factor applied to the color values of an image to increase their overall brightness.

Slide 20: Describe plane HDR mult property and hardware capabilities

This is useful for converting images from a standard dynamic range (SDR) to a high dynamic range (HDR). As it can range beyond [0.0, 1.0] subsequent transforms need to use the PQ(HDR) transfer functions.

Slide 21: Color Pipeline Diagram connecting AMD plane shaper properties, LUT and TF, to AMD DC resources

And we need a 3D LUT. But 3D LUT has a limited number of entries in each dimension, so we want to use it in a colorspace that is optimized for human vision. It means in a non-linear space. To deliver it, userspace may need one 1D LUT before 3D LUT to delinearize content and another one after to linearize content again for blending.

Slide 22: Describe plane shaper properties and hardware capabilities

The pre-3D-LUT curve is called Shaper curve. Unlike Degamma TF, there are no hardcoded curves for shaper TF, but we can use the AMD color module in the driver to build the following shaper curves from pre-defined coefficients. The color module combines the TF and the user LUT values into the LUT that goes to the DPP Shaper RAM block.

Slide 23: Color Pipeline Diagram connecting AMD plane 3D LUT property to AMD DC resources

Finally, our rockstar, the 3D LUT. 3D LUT is perfect for complex color transformations and adjustments between color channels.

Slide 24: Describe plane 3D LUT property and hardware capabilities

3D LUT is also more complex to manage and requires more computational resources, as a consequence, its number of entries is usually limited. To overcome this restriction, the array contains samples from the approximated function and values between samples are estimated by tetrahedral interpolation. AMD supports 17 and 9 as the size of a single-dimension. Blue is the outermost dimension, red the innermost.

Slide 25: Color Pipeline Diagram connecting AMD plane blend properties, LUT and TF, to AMD DC resources

As mentioned, we need a post-3D-LUT curve to linearize the color space before blending. This is done by Blend TF and LUT.

Slide 26: Describe plane blend properties and hardware capabilities

Similar to shaper TF, there are no hardcoded curves for Blend TF. The pre-defined curves are the same as the Degamma block, but calculated by the color module. The resulting LUT goes to the DPP Blend RAM block.

Slide 27: Color Pipeline Diagram  with all AMD plane color properties connect to AMD DC resources and links showing the conflict between plane and CRTC degamma

Now we have everything connected before blending. As a conflict between plane and CRTC Degamma was inevitable, our approach doesn’t accept that both are set at the same time.

Slide 28: Color Pipeline Diagram connecting AMD CRTC gamma TF property to AMD DC resources

We also optimized the conversion of the framebuffer to wire encoding by adding support to pre-defined CRTC Gamma TF.

Slide 29: Describe CRTC gamma TF property and hardware capabilities

Again, there are no hardcoded curves and TF and LUT are combined by the AMD color module. The same types of shaper curves are supported. The resulting LUT goes to the MPC Gamma RAM block.

Slide 30: Color Pipeline Diagram with all AMD driver-specific color properties connect to AMD DC resources

Finally, we arrived in the final version of DRM/AMD driver-specific color management pipeline. With this knowledge, you’re ready to better enjoy the rainbow treasure of AMD display hardware and the world of graphics computing.

Slide 31: SteamDeck/Gamescope Color Pipeline Diagram with rectangles labeling each block of the pipeline with the related AMD color property

With this work, Gamescope/Steam Deck embraces the color capabilities of the AMD GPU. We highlight here how we map the Gamescope color pipeline to each AMD color block.

Slide 32: Final slide. Thank you!

Future works: The search for the rainbow treasure is not over! The Linux DRM subsystem contains many hidden treasures from different vendors. We want more complex color transformations and adjustments available on Linux. We also want to expose all GPU color capabilities from all hardware vendors to the Linux userspace.

Thanks Joshua and Harry for this joint work and the Linux DRI community for all feedback and reviews.

The amazing part of this work comes in the next talk with Joshua and The Rainbow Frogs!

Any questions?


References:

  1. Slides of the talk The Rainbow Treasure Map.
  2. Youtube video of the talk The Rainbow Treasure Map.
  3. Patch series for AMD driver-specific color management properties (upstream Linux 6.8v).
  4. SteamDeck/Gamescope color management pipeline
  5. XDC 2023 website.
  6. Igalia website.
December 19, 2023

Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while.

The latest branch is at [1]. This branch has been updated for the new final headers.

Updated: It passes all of h265 CTS now, but it is failing one h264 test.

Initial ffmpeg support is [2].

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads

[2] https://github.com/cyanreg/FFmpeg/commits/vulkan/

December 17, 2023

Hi all!

This month we’ve finally released wlroots 0.17.0! It’s been a long time since the previous release (1 year), we’ll try to ship future releases a bit more frequently. We’re preparing 0.17.1 with a collection of bugfixes, it should be ready soon.

I’ve been working on wlr_surface_synced, a new wlroots abstraction to allow surface commits coming from clients to be delayed. This is required to avoid stalling the whole compositor if a client GPU work is slow and to implement explicit synchronization. I’ve also been working on a commit-queue-v1 implementation for wlroots and gamescope, which will allow us to get rid of a CPU wait in Mesa. And I’ve put some finishing touches on Rose’s frame scheduler patches. Last, I’ve merged André Almeida’s kernel patches for atomic async page-flips, making it so modern compositors can enable tearing page-flips without having to go through the legacy KMS uAPI.

I’ve added OAuth refresh tokens to meta.sr.ht. Having to renew OAuth tokens every year on my clients is annoying, with refresh tokens that’s a thing of the past! I’ve already updated hottub (CI bridge for GitHub) to leverage this, and I’d like to also implement this in hut (CLI tool) and yojo (CI bridge for Codeberg). Note that since meta.sr.ht has only now started returning refresh tokens on login, users will need to re-login one last time so that the OAuth clients can grab the refresh token.

The NPotM is a bit peculiar: I haven’t actually started working on it this month, and it’s not in a usable state yet. It’s go-sqlgen, a Go code generator which takes SQL as input. The goal is to store SQL queries in a separate file, to make them safer (type checking for the arguments) and faster (prepared statements). It’s somewhat similar to sqlc except it aims at being simpler and database-agnostic. There’s still much to do: I’d like to add support for named parameters, check that the number of parameters in the query matches the number of procedure arguments, and make it easy to write migrations. I’m not yet sure go-sqlgen is worth the trouble: being database-agnostic limits its abilities, perhaps too much.

Then comes the usual mix of random smaller updates. I’ve released soju 0.7.0 and goguma 0.6.0 with a few new features and bugfixes. pyonji now understands the b4 config file, so it’s possible to add this file to your project to preconfigure pyonji with a mailing list (example). delthas has implemented account data import in hut, so it’s now easy to migrate accounts between sr.ht instances, or projects between accounts. go-scfg now supports decoding a configuration file directly into a Go struct, making it unnecessary to hand-roll parsing code (example).

I’ll be giving a FOSDEM talk about quirks and gotchas of the IMAP protocol this year. I’ll be happy to say hi if any of you are coming as well. That’s all I have for this month, see you in January!

December 14, 2023

You may have seen the news that Red Hat Enterprise Linux 10 plans to remove Xorg. But Xwayland will stay around, and given the name overloading and them sharing a git repository there's some confusion over what is Xorg. So here's a very simple "picture". This is the xserver git repository:

$ tree -d -L 2 xserver
xserver
├── composite
├── config
├── damageext
├── dbe
├── dix
├── doc
│   └── dtrace
├── dri3
├── exa
├── fb
├── glamor
├── glx
├── hw
│   ├── kdrive
│   ├── vfb
│   ├── xfree86              <- this one is Xorg
│   ├── xnest
│   ├── xquartz
│   ├── xwayland
│   └── xwin
├── include
├── m4
├── man
├── mi
├── miext
│   ├── damage
│   ├── rootless
│   ├── shadow
│   └── sync
├── os
├── present
├── pseudoramiX
├── randr
├── record
├── render
├── test
│   ├── bigreq
│   ├── bugs
│   ├── damage
│   ├── scripts
│   ├── sync
│   ├── xi1
│   └── xi2
├── Xext
├── xfixes
├── Xi
└── xkb
The git repo produces several X servers, including the one designed to run on bare metal: Xorg (in hw/xfree86 for historical reasons). The other hw directories are the other X servers including Xwayland. All the other directories are core X server functionality that's shared between all X servers [1]. Removing Xorg from a distro but keeping Xwayland means building with --disable-xfree86 -enable-xwayland [1]. That's simply it (plus the resulting distro packaging work of course).

Removing Xorg means you need something else that runs on bare metal and that is your favourite Wayland compositor. Xwayland then talks to that while presenting an X11-compatible socket to existing X11 applications.

Of course all this means that the X server repo will continue to see patches and many of those will also affect Xorg. For those who are running git master anyway. Don't get your hopes up for more Xorg releases beyond the security update background noise [2].

Xwayland on the other hand is actively maintained and will continue to see releases. But those releases are a sequence [1] of

$ git new-branch xwayland-23.x.y
$ git rm hw/{kdrive/vfb/xfree86/xnest,xquartz,xwin}
$ git tag xwayland-23.x.y
In other words, an Xwayland release is the xserver git master branch with all X servers but Xwayland removed. That's how Xwayland can see new updates and releases without Xorg ever seeing those (except on git master of course). And that's how your installed Xwayland has code from 2023 while your installed Xorg is still stuck on the branch created and barely updated after 2021.

I hope this helps a bit with the confusion of the seemingly mixed messages sent when you see headlines like "Xorg is unmaintained", "X server patches to fix blah", "Xorg is abandoned", "new Xwayland release.

[1] not 100% accurate but close enough
[2] historically an Xorg release included all other X servers (Xquartz, Xwin, Xvfb, ...) too so this applies to those servers too unless they adopt the Xwayland release model

December 13, 2023

A self-help guide for examining and debugging the AMD display driver within the Linux kernel/DRM subsystem.

It’s based on my experience as an external developer working on the driver, and are shared with the goal of helping others navigate the driver code.

Acknowledgments: These tips were gathered thanks to the countless help received from AMD developers during the driver development process. The list below was obtained by examining open source code, reviewing public documentation, playing with tools, asking in public forums and also with the help of my former GSoC mentor, Rodrigo Siqueira.

Pre-Debugging Steps:

Before diving into an issue, it’s crucial to perform two essential steps:

1) Check the latest changes: Ensure you’re working with the latest AMD driver modifications located in the amd-staging-drm-next branch maintained by Alex Deucher. You may also find bug fixes for newer kernel versions on branches that have the name pattern drm-fixes-<date>.

2) Examine the issue tracker: Confirm that your issue isn’t already documented and addressed in the AMD display driver issue tracker. If you find a similar issue, you can team up with others and speed up the debugging process.

Understanding the issue:

Do you really need to change this? Where should you start looking for changes?

3) Is the issue in the AMD kernel driver or in the userspace?: Identifying the source of the issue is essential regardless of the GPU vendor. Sometimes this can be challenging so here are some helpful tips:

  • Record the screen: Capture the screen using a recording app while experiencing the issue. If the bug appears in the capture, it’s likely a userspace issue, not the kernel display driver.
  • Analyze the dmesg log: Look for error messages related to the display driver in the dmesg log. If the error message appears before the message “[drm] Display Core v...”, it’s not likely a display driver issue. If this message doesn’t appear in your log, the display driver wasn’t fully loaded and you will see a notification that something went wrong here.

4) AMD Display Manager vs. AMD Display Core: The AMD display driver consists of two components:

  • Display Manager (DM): This component interacts directly with the Linux DRM infrastructure. Occasionally, issues can arise from misinterpretations of DRM properties or features. If the issue doesn’t occur on other platforms with the same AMD hardware - for example, only happens on Linux but not on Windows - it’s more likely related to the AMD DM code.
  • Display Core (DC): This is the platform-agnostic part responsible for setting and programming hardware features. Modifications to the DC usually require validation on other platforms, like Windows, to avoid regressions.

5) Identify the DC HW family: Each AMD GPU has variations in its hardware architecture. Features and helpers differ between families, so determining the relevant code for your specific hardware is crucial.

  • Find GPU product information in Linux/AMD GPU documentation
  • Check the dmesg log for the Display Core version (since this commit in Linux kernel 6.3v). For example:
    • [drm] Display Core v3.2.241 initialized on DCN 2.1
    • [drm] Display Core v3.2.237 initialized on DCN 3.0.1

Investigating the relevant driver code:

Keep from letting unrelated driver code to affect your investigation.

6) Narrow the code inspection down to one DC HW family: the relevant code resides in a directory named after the DC number. For example, the DCN 3.0.1 driver code is located at drivers/gpu/drm/amd/display/dc/dcn301. We all know that the AMD’s shared code is huge and you can use these boundaries to rule out codes unrelated to your issue.

7) Newer families may inherit code from older ones: you can find dcn301 using code from dcn30, dcn20, dcn10 files. It’s crucial to verify which hooks and helpers your driver utilizes to investigate the right portion. You can leverage ftrace for supplemental validation. To give an example, it was useful when I was updating DCN3 color mapping to correctly use their new post-blending color capabilities, such as:

Additionally, you can use two different HW families to compare behaviours. If you see the issue in one but not in the other, you can compare the code and understand what has changed and if the implementation from a previous family doesn’t fit well the new HW resources or design. You can also count on the help of the community on the Linux AMD issue tracker to validate your code on other hardware and/or systems.

This approach helped me debug a 2-year-old issue where the cursor gamma adjustment was incorrect in DCN3 hardware, but working correctly for DCN2 family. I solved the issue in two steps, thanks for community feedback and validation:

8) Check the hardware capability screening in the driver: You can currently find a list of display hardware capabilities in the drivers/gpu/drm/amd/display/dc/dcn*/dcn*_resource.c file. More precisely in the dcn*_resource_construct() function. Using DCN301 for illustration, here is the list of its hardware caps:

	/*************************************************
	 *  Resource + asic cap harcoding                *
	 *************************************************/
	pool->base.underlay_pipe_index = NO_UNDERLAY_PIPE;
	pool->base.pipe_count = pool->base.res_cap->num_timing_generator;
	pool->base.mpcc_count = pool->base.res_cap->num_timing_generator;
	dc->caps.max_downscale_ratio = 600;
	dc->caps.i2c_speed_in_khz = 100;
	dc->caps.i2c_speed_in_khz_hdcp = 5; /*1.4 w/a enabled by default*/
	dc->caps.max_cursor_size = 256;
	dc->caps.min_horizontal_blanking_period = 80;
	dc->caps.dmdata_alloc_size = 2048;
	dc->caps.max_slave_planes = 2;
	dc->caps.max_slave_yuv_planes = 2;
	dc->caps.max_slave_rgb_planes = 2;
	dc->caps.is_apu = true;
	dc->caps.post_blend_color_processing = true;
	dc->caps.force_dp_tps4_for_cp2520 = true;
	dc->caps.extended_aux_timeout_support = true;
	dc->caps.dmcub_support = true;

	/* Color pipeline capabilities */
	dc->caps.color.dpp.dcn_arch = 1;
	dc->caps.color.dpp.input_lut_shared = 0;
	dc->caps.color.dpp.icsc = 1;
	dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
	dc->caps.color.dpp.dgam_rom_caps.srgb = 1;
	dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1;
	dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1;
	dc->caps.color.dpp.dgam_rom_caps.pq = 1;
	dc->caps.color.dpp.dgam_rom_caps.hlg = 1;
	dc->caps.color.dpp.post_csc = 1;
	dc->caps.color.dpp.gamma_corr = 1;
	dc->caps.color.dpp.dgam_rom_for_yuv = 0;

	dc->caps.color.dpp.hw_3d_lut = 1;
	dc->caps.color.dpp.ogam_ram = 1;
	// no OGAM ROM on DCN301
	dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
	dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.dpp.ogam_rom_caps.pq = 0;
	dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
	dc->caps.color.dpp.ocsc = 0;

	dc->caps.color.mpc.gamut_remap = 1;
	dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; //2
	dc->caps.color.mpc.ogam_ram = 1;
	dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
	dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.mpc.ogam_rom_caps.pq = 0;
	dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
	dc->caps.color.mpc.ocsc = 1;

	dc->caps.dp_hdmi21_pcon_support = true;

	/* read VBIOS LTTPR caps */
	if (ctx->dc_bios->funcs->get_lttpr_caps) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_lttpr_enable = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, &is_vbios_lttpr_enable);
		dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) && !!is_vbios_lttpr_enable;
	}

	if (ctx->dc_bios->funcs->get_lttpr_interop) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_interop_enabled = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, &is_vbios_interop_enabled);
		dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) && !!is_vbios_interop_enabled;
	}

Keep in mind that the documentation of color capabilities are available at the Linux kernel Documentation.

Understanding the development history:

What has brought us to the current state?

9) Pinpoint relevant commits: Use git log and git blame to identify commits targeting the code section you’re interested in.

10) Track regressions: If you’re examining the amd-staging-drm-next branch, check for regressions between DC release versions. These are defined by DC_VER in the drivers/gpu/drm/amd/display/dc/dc.h file. Alternatively, find a commit with this format drm/amd/display: 3.2.221 that determines a display release. It’s useful for bisecting. This information helps you understand how outdated your branch is and identify potential regressions. You can consider each DC_VER takes around one week to be bumped. Finally, check testing log of each release in the report provided on the amd-gfx mailing list, such as this one Tested-by: Daniel Wheeler:

Reducing the inspection area:

Focus on what really matters.

11) Identify involved HW blocks: This helps isolate the issue. You can find more information about DCN HW blocks in the DCN Overview documentation. In summary:

  • Plane issues are closer to HUBP and DPP.
  • Blending/Stream issues are closer to MPC, OPP and OPTC. They are related to DRM CRTC subjects.

This information was useful when debugging a hardware rotation issue where the cursor plane got clipped off in the middle of the screen.

Finally, the issue was addressed by two patches:

12) Issues around bandwidth (glitches) and clocks: May be affected by calculations done in these HW blocks and HW specific values. The recalculation equations are found in the DML folder. DML stands for Display Mode Library. It’s in charge of all required configuration parameters supported by the hardware for multiple scenarios. See more in the AMD DC Overview kernel docs. It’s a math library that optimally configures hardware to find the best balance between power efficiency and performance in a given scenario.

Finding some clk variables that affect device behavior may be a sign of it. It’s hard for a external developer to debug this part, since it involves information from HW specs and firmware programming that we don’t have access. The best option is to provide all relevant debugging information you have and ask AMD developers to check the values from your suspicions.

  • Do a trick: If you suspect the power setup is degrading performance, try setting the amount of power supplied to the GPU to the maximum and see if it affects the system behavior with this command: sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

I learned it when debugging glitches with hardware cursor rotation on Steam Deck. My first attempt was changing the clock calculation. In the end, Rodrigo Siqueira proposed the right solution targeting bandwidth in two steps:

Checking implicit programming and hardware limitations:

Bring implicit programming to the level of consciousness and recognize hardware limitations.

13) Implicit update types: Check if the selected type for atomic update may affect your issue. The update type depends on the mode settings, since programming some modes demands more time for hardware processing. More details in the source code:

/* Surface update type is used by dc_update_surfaces_and_stream
 * The update type is determined at the very beginning of the function based
 * on parameters passed in and decides how much programming (or updating) is
 * going to be done during the call.
 *
 * UPDATE_TYPE_FAST is used for really fast updates that do not require much
 * logical calculations or hardware register programming. This update MUST be
 * ISR safe on windows. Currently fast update will only be used to flip surface
 * address.
 *
 * UPDATE_TYPE_MED is used for slower updates which require significant hw
 * re-programming however do not affect bandwidth consumption or clock
 * requirements. At present, this is the level at which front end updates
 * that do not require us to run bw_calcs happen. These are in/out transfer func
 * updates, viewport offset changes, recout size changes and pixel
depth changes.
 * This update can be done at ISR, but we want to minimize how often
this happens.
 *
 * UPDATE_TYPE_FULL is slow. Really slow. This requires us to recalculate our
 * bandwidth and clocks, possibly rearrange some pipes and reprogram
anything front
 * end related. Any time viewport dimensions, recout dimensions,
scaling ratios or
 * gamma need to be adjusted or pipe needs to be turned on (or
disconnected) we do
 * a full update. This cannot be done at ISR level and should be a rare event.
 * Unless someone is stress testing mpo enter/exit, playing with
colour or adjusting
 * underscan we don't expect to see this call at all.
 */

enum surface_update_type {
UPDATE_TYPE_FAST, /* super fast, safe to execute in isr */
UPDATE_TYPE_MED,  /* ISR safe, most of programming needed, no bw/clk change*/
UPDATE_TYPE_FULL, /* may need to shuffle resources */
};

Using tools:

Observe the current state, validate your findings, continue improvements.

14) Use AMD tools to check hardware state and driver programming: help on understanding your driver settings and checking the behavior when changing those settings.

  • DC Visual confirmation: Check multiple planes and pipe split policy.

  • DTN logs: Check display hardware state, including rotation, size, format, underflow, blocks in use, color block values, etc.

  • UMR: Check ASIC info, register values, KMS state - links and elements (framebuffers, planes, CRTCs, connectors). Source: UMR project documentation

15) Use generic DRM/KMS tools:

  • IGT test tools: Use generic KMS tests or develop your own to isolate the issue in the kernel space. Compare results across different GPU vendors to understand their implementations and find potential solutions. Here AMD also has specific IGT tests for its GPUs that is expect to work without failures on any AMD GPU. You can check results of HW-specific tests using different display hardware families or you can compare expected differences between the generic workflow and AMD workflow.

  • drm_info: This tool summarizes the current state of a display driver (capabilities, properties and formats) per element of the DRM/KMS workflow. Output can be helpful when reporting bugs.

Don’t give up!

Debugging issues in the AMD display driver can be challenging, but by following these tips and leveraging available resources, you can significantly improve your chances of success.

Worth mentioning: This blog post builds upon my talk, “I’m not an AMD expert, but…” presented at the 2022 XDC. It shares guidelines that helped me debug AMD display issues as an external developer of the driver.

Open Source Display Driver: The Linux kernel/AMD display driver is open source, allowing you to actively contribute by addressing issues listed in the official tracker. Tackling existing issues or resolving your own can be a rewarding learning experience and a valuable contribution to the community. Additionally, the tracker serves as a valuable resource for finding similar bugs, troubleshooting tips, and suggestions from AMD developers. Finally, it’s a platform for seeking help when needed.

Remember, contributing to the open source community through issue resolution and collaboration is mutually beneficial for everyone involved.

December 08, 2023

2023-12-10 UPDATE: From Mastodon, arcepi suggested the instability problems that I described below and served as a motivation to try Far Cry 6 on Linux could be coming from having switched from NVIDIA to AMD without reinstalling Windows, because of leftover files from the NVIDIA drivers. Today morning I reinstalled Windows to test this and, indeed, the Deathloop and Far Cry 6 crashes seem to be gone (yay!). That would have removed my original motivation to try to run the game on Linux, but it doesn’t take away the main points of the post. Do take into account that the instability doesn’t seem to exist anymore (and I hope this applies to more future titles I play) but it’s still the background story to explain why I decided to install Far Cry 6 on my Fedora 39 system, so the original post follows below.

If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

steam running on fedora 39.tn
Figure 1. Steam running on Fedora Linux 39

As a background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, specially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off. The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system.

In the last years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons:

  1. For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need.

  2. Also, I was running Slackware and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about.

  3. Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that.

So, what changed?

sapphire pulse amd rx 6700 box
Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

Earlier this year I upgraded my PC and replaced an old Intel Haswell i7-4770k with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, specially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it.

However, given my recent professional background I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or other fronts whatsoever. Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because:

  1. AMD cards were, in general, performing better for the same price than NVIDIA cards, except for ray tracing.

  2. The RX 6700 non-XT was on sale.

  3. It had the same performance as a PS5 or so.

  4. It didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile up on those, but it’s true my own personal experience is having generally more crashes in games and having to face more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.

A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up with some more graphically demanding games that I didn’t feel comfortable with playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and I went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds in a loading screen, the game crashed I was back to the desktop. “Oh, what a bad way to start!”, I thought, without knowing what lied ahead. I launched the game again, same thing.

On the course of the 2 hours that followed, I tried everything:

  1. Launching the main game instead of the benchmark, just in case the bug only happened in the benchmark. Nope.

  2. Lowering quality and resolution.

  3. Disabling any advanced setting.

  4. Trying windowed mode, or borderless full screen.

  5. Vsync off or on.

  6. Disabling the overlays for Ubisoft, Steam, AMD.

  7. Rebooting multiple times.

  8. Uninstalling the drivers normally as well as using DDU and installing them again.

Same result every time. I also searched on the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people both using AMD and NVIDIA had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though, I wanted to play it.

The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem).

And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app. For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

gnome software steam.tn
Figure 3. Gnome Software showing Steam as an installed application

Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password).

My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it.

Context menu shown after right-clicking on Far Cry 6 on the Steam application, with the Properties option highlighted
Far Cry 6 Compatibility tab displaying the option to force the use of a specific Steam Play compatibility tool

Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

far cry 6 benchmark screenshot.tn
Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

The benchmark runs perfectly, no graphical glitches, no stuttering, frame rates above 100FPS normally, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing.

As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

2023-12-10 UPDATE: From Mastodon, Jaco G and Berto Garcia tell me the game not stopping problem is present in all Ubisoft games and is directly related to the Ubisoft launcher. It keeps running after closing the game, which makes Steam think the game is still running. You can try to close it from the tray if you see the Ubisoft icon there and, if that fails, you can stop the game like I described above.

Then I remember something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up. The amount and timing of the crashes didn’t look constant, and lowering the graphics settings sometimes would allow me to play the game a bit longer, but in any case I wasn’t able to finish the game intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users, but also NVIDIA ones, so I had mentally classified that as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it hoping to be able to play it in the future.

Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high quality open source Mesa drivers like RADV, I’m amazed the experience can be actually better than gaming natively on Windows. Think about that: the Windows game binary running natively on a DX12 or Vulkan official driver crashes more and doesn’t work as well as the game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me and it speaks wonders of the work Valve has been doing on Linux. Or it could also speak badly of AMD Windows drivers, or both.

Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.

December 06, 2023

During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming to MobileDet, which though a bit old by now (it was introduced in 2020) is still the state of the art in object detection in mobile, used in for example Frigate NVR.

I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.

Upstreaming

Igalia's Christian Gmeiner has been giving me great feedback at the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded. 

This means that upstreaming to Mesa loses some urgency as we are anyway going to have to wait for the merge window for 6.8 opens, after 6.7 final is out.

Convolutions with 5x5 weights

Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.

I implemented this for all kind of supported convolutions: depthwise, strided, with padding, etc.

Tensor addition

I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.

This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.

Softmax pooling

One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.

The blob implements these operations on the programmable core, with CL-like kernels.

So I undusted the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.

Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.

With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.

A new test suite

With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.

To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.

That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.

With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.

Only that I started finding some tests that were randomly failing, specially when the cache of test results had been already brought into the filesystem cache in the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.

There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock solid test results, which is an absolute need for keeping the driver author's sanity.

Next steps

I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.

So in the next days I will be investing some time in cleaning things up, and afterwards will move to more operations in MobileDet.


November 29, 2023

I have not been so active for a while with writing these Fedora Workstation updates and part of the reason was that I felt I was beginning to repeat myself a lot, which I partly felt was a side effect of writing them so often, but with some time now since my last update I felt that time was ripe again. So what are some of the things we have been working on and what are our main targets going forward? This is not a exhaustive list, but hopefully items you find interesting. Apologize for weird sentences and potential spelling mistakes, but it ended up a a long post and when you read your own words over for the Nth time you start going blind to issues :)

PipeWire

PipeWire 1.0 is available! PipeWire keeps the Linux Multimedia revolution rolling[/caption]So lets start with one of your favorite topics, PipeWire. As you probably know PipeWire 1.0 is now out and I feel it is a project we definitely succeeded with, so big kudos to Wim Taymans for leading this effort. I think the fact that we got both the creator of JACK, Paul Davis and the creator of PulseAudio Lennart Poettering to endorse it means our goal of unifying the Linux audio landscape is being met. I include their endorsement comments from the PipeWire 1.0 release announcement here :

“PipeWire represents the next evolution of audio handling for Linux, taking
the best of both pro-audio (JACK) and desktop audio servers (PulseAudio) and
linking them into a single, seamless, powerful new system.”
– Paul Davis, JACK and Ardour author

“PipeWire is a worthy successor to PulseAudio, providing a feature set
closer to how modern audio hardware works, and with a security model
with today’s application concepts in mind. Version 1.0 marks a
major milestone in completing the adoption of PipeWire in the standard
set of Linux subsystems. Congratulations to the team!”
– Lennart Poettering, Pulseaudio and systemd author

So for new readers, PipeWire is a audio and video server we created for Fedora Workstation to replace PulseAudio for consumer audio, JACK for pro-audio and add similar functionality for video to your linux operating system. So instead of having to deal with two different sound server architectures users now just have to deal with one and at the same time they get the same advantages for video handling. Since PipeWire implemented both the PulseAudio API and the JACK API it is a drop in replacement for both of them without needing any changes to the audio applications built for those two sound servers. Wim Taymans alongside the amazing community that has grown around the project has been hard at work maturing PipeWire and adding any missing feature they could find that blocked anyone from moving to it from either PulseAudio and JACK. Wims personal focus recently has been on an IRQ based ALSA driver for PipeWire to be able to provide 100% performance parity with the old JACK server. So while a lot of Pro-audio users felt that PipeWire’s latency was already good enough, this work by Wim shaves of the last few milliseconds to reach the same level of latency as JACK itself had.

In parallel with the work on PipeWire the community and especially Collabora has been hard at work on the new 0.5 release of WirePlumber, the session manager which handles all policy issues for PipeWire. I know people often get a little confused about PipeWire vs WirePlumber, but think of it like this: PipeWire provides you the ability to output audio through a connected speaker, through a bluetooth headset, through an HDMI connection and so on, but it doesn’t provide any ‘smarts’ for how that happens. The smarts are instead provided by WirePlumber which then contains policies to decide where to route your audio or video, either based on user choice or through preset policies making the right choices automatically, like if you disconnect your USB speaker it will move the audio to your internal speaker instead. Anyway, WirePlumber 0.5 will be a major step forward for WirePlumber moving from using lua scripts for configuration to instead using JSON for configuration while retaining lua for scripting. This has many advantages, but I point you to this excellent blog post by Collabora’s Ashok Sidipotu for the details. Ashok got further details about WirePlumber 0.5 that you can find here.

With PipeWire 1.0 out the door I feel we are very close to reaching one of our initial goals with PipeWire, to remove the need for custom pro-audio distributions like Fedora JAM or Ubuntu Studio, and instead just let audio folks be able to use the same great Fedora Workstation as the rest of the world. With 1.0 done Wim plans next to look a bit at things like configuration tools and similar used by pro-audio folks and also dive into the Flatpak portal needs of pro-audio applications more, to ensure that Flatpaks + PipeWire is the future of pro-audio.

On the video handling side its been a little slow going since there applications need to be ported from relying directly on v4l. Jan Grulich has been working with our friends at Mozilla and Google to get PipeWire camera handling support into Firefox and Google Chrome. At the moment it looks like the Firefox support will land first, in fact Jan has set up a COPR that lets you try it out here. For tracking the upstream work in WebRTC to add PipeWire support Jan set up this tracker bug. Getting the web browsers to use PipeWire is important both to enable the advanced video routing capabilities of PipeWire, but it will also provide applications the ability to use libcamera which is a needed for new modern MIPI cameras to work properly under Linux.

Another important application to get PipeWire camera support into is OBS Studio and the great thing is that community member Georges Stavracas is working on getting the PipeWire patches merged into OBS Studio, hopefully in time for their planned release early next year. You can track Georges work in this pull request.

For more information about PipeWire 1.0 I recommend our interview with Wim Taymans in Fedora Magazine and also the interview with Wim on Linux Unplugged podcast.

HDR
HDRHDR, or High Dynamic Range, is another major effort for us. HDR is a technology I think many of you have become familiar with due to it becoming quite common in TVs these days. It basically provides for greatly increased color depth and luminescence on your screen. This is a change that entails a lot of changes through the stack, because when you introduce into an existing ecosystem like the Linux desktop you have to figure out how to combine both new HDR capable applications and content and old non-HDR applications and content. Sebastian Wick, Jonas Ådahl, Oliver Fourdan, Michel Daenzer and more on the team has been working with other members of the ecosystem from Intel, AMD, NVIDIA, Collabora and more to pick and define the standards and protocols needed in this space. A lot of design work was done early in the year so we been quite focused on implementation work across the drivers, Wayland, Mesa, GStreamer, Mutter, GTK+ and more. Some of the more basic scenarios, like running a fullscreen HDR application is close to be ready, while we are still working hard on getting all the needed pieces together for the more complex scenarios like running SDR and HDR windows composited together on your desktop. So getting for instance full screen games to run in HDR mode with Steam should happen shortly, but the windowed support will probably land closer to summer next year.

Wayland remoting
One feature we been also spending a lot of time on is enabling remote logins to a Wayland desktop. You have been able to share your screen under Wayland more or less from day one, but it required your desktop session to be already active. But lets say you wanted to access your Wayland desktop running on a headless system you been out of luck so far and had to rely on the old X session instead. So putting in place all the pieces for this has been quite an undertaking with work having been done on PipeWire, on Wayland portals, gnome remote desktop daemon, libei; the new input emulation library, gdm and more. The pieces needed are finally falling into place and we expect to have everything needed landed in time for GNOME 46. This support is currently done using a private GNOME API, but a vendor less API is being worked on to replace it.

As a sidenote here not directly related to desktop remoting, but libei has also enabled us to bring xtest support to XWayland which was important for various applications including Valves gamescope.

NVIDIA drivers
One area we keep investing in is improving the state of NVIDIA support on Linux. This comes both in the form of being the main company backing the continued development of the Nouveau graphics driver. So the challenge with Nouveau is that for the longest while it offered next to no hardware acceleration for 3D graphics. The reason for this was that the firmware that NVIDIA provided for Nouveau to use didn’t expose that functionality and since recent generations of NVIDIA cards only works with firmware signed by NVIDIA this left us stuck. So Nouveau was a good tool for doing an initial install of a system, but if you where doing any kind of serious 3D acceleration, including playing games, then you would need to install the NVIDIA binary driver. So in the last year that landscape around that has changed drastically, with the release of the new out-of-tree open source driver from NVIDIA. Alongside that driver a new firmware has also been made available , one that do provide full support for hardware acceleration.
Let me quickly inject a quick explanation of out-of-tree versus in-tree drivers here. An in-tree driver is basically a kernel driver for a piece of hardware that has been merged into the official Linux kernel from Linus Torvalds and is thus being maintained as part of the official Linux kernel releases. This ensures that the driver integrates well with the rest of the Linux kernel and that it gets updated in sync with the rest of the Linux kernel. So Nouveau is an in-tree kernel driver which also integrates with the rest of the open source graphics stack, like Mesa. The new NVIDIA open source driver is an out-of-tree driver which ships as a separate source code release on its own schedule, but of course NVIDIA works to keeps it working with the upstream kernel releases (which is a lot of work of course and thus considered a major downside to being an out of tree driver).

As of the time of writing this blog post NVIDIAs out-of-tree kernel driver and firmware is still a work in progress for display usercases, but that is changing with NVIDIA exposing more and more display features in the driver (and the firmware) with each new release they do. But if you saw the original announcement of the new open source driver from NVIDIA and have been wondering why no distribution relies on it yet, this is why. So what does this mean for Nouveau? Well our plan is to keep supporting Nouveau for the foreseeable future because it is an in-tree driver, which is a lot easier to ensure keeps working with each new upstream kernel release.

At the same time the new firmware updates allows Nouveau to eventually offer performance levels competitive with the official out-of-tree driver, kind of how the open source AMD driver with MESA offers comparable performance to AMD binary GPU driver userspace. So Nouvea maintainer Ben Skeggs spent the last year working hard on refactoring Nouveau to work with the new firmware and we now have a new release of Nouveau out showing the fruits of that labor, enabling support for NVIDIAs latest chipset. Over time we will have it cover more chipset and expand Vulkan and OpenGL (using Zink) support to be a full fledged accelerated graphics driver.
So some news here, Ben after having worked tirelessly on keeping Nouveau afloat for so many years decided he needed a change of pace and thus decided to leave software development behind for the time being. A big thank you to Ben from all us at Red Hat and Fedora ! The good news is that Danilo Krummrich will take over as the development lead, with Lyude Paul taking on working on the Display side specifically of the driver. We also expect to have other members of the team chipping in too. They will pick up Bens work and continue working with NVIDIA and the community on a bright future for Nouveau.

So as I mentioned though the new open source driver from NVIDIA is still being matured for the display usercase and until it works fully as a display driver neither will Nouveau be able to be a full alternative since they share the same firmware. So people will need to rely on the binary NVIDIA Driver for some time still. One thing we are looking at there and discussing is if there are ways for us to improve the experience of using that binary driver with Secure Boot enabled. Atm that requires quite a bit of manual fiddling with tools like mokutils, but we have some ideas on how to streamline that a bit, but it is a hard nut to solve due to a combination of policy issues, legal issues, security issues and hardware/UEFI bugs so I am making no promises at this point, just a promise that it is something we are looking at.

Accessibility
Accessibility is an important feature for us in Fedora Workstation and thus we hired Lukáš Tyrychtr to focus on the issue. Lukáš has been working through across the stack fixing issues blocking proper accessibility support in Fedora Workstation and also participated in various accessibility related events. There is still a lot to do there so I was very happy to hear recently that the GNOME Foundation got a million Euro sponsorship from the Sovereign Tech Fund to improve various things across the stack, especially improving accessibility. So the combination of Lukáš continued efforts and that new investment should make for a much improved accessibility experience in GNOME and in Fedora Workstation going forward.

GNOME Software
Another area that we keep investing in is improving GNOME Software, with Milan Crha working continuously on bugfixing and performance improvements. GNOME Software is actually a fairly complex piece of software as it has to be able to handle the installation and updating of RPMS, OSTree system images, Flatpaks, fonts and firmware for us in addition to the formats it handles for other distributions. For some time it felt was GNOME Software was struggling with the load of all those different formats and usercases and was becoming both slow and with a lot of error messages. Milan has been spending a lot of time dealing with those issues one by one and also recently landed some major performance improvements making the GNOME Software experience a lot better. One major change that Milan is working on that I think we will be able to land in Fedora Workstation 40/41 is porting GNOME Software to use DNF5. The main improvement end users will probably notice is that it unifies the caches used for GNOME Software and using dnf on the command line, saving you storage space and also ensuring the two are fully in sync on what RPMS is installed/updated at any given time.

Fedora and Flatpaks

Flatpaks is another key element of our strategy for moving the Linux desktop forward and as part of that we have now enabled all of Flathub to be available if you choose to enable 3rd party repositories when you install Fedora Workstation. This means that the huge universe of applications available on Flathub will be easy to install through GNOME Software alongside the content available in Fedora’s own repositories. That said we have also spent time improving the ease of making Fedora Flatpaks. Owen Taylor jumped in and removed the dependency on a technology called ‘modularity‘ which was initially introduced to Fedora to bring new features around having different types of content and ease keeping containers up to date. Unfortunately it did not work out as intended and instead it became something that everyone just felt made things a lot more complicated, including building Flatpaks from Fedora content. With Owens updates building Flatpaks in Fedora has become a lot simpler and should help energize the effort building Flatpaks in Fedora.

Toolbx
As we continue marching towards a vision for Fedora Workstation to be a highly robust operating we keep evolving Toolbx. Our tool for making running your development environment(s) inside a container and thus allows you to both keep your host OS pristine and up to date, while at the same time using specific toolchains and tools inside the development container. This is a hard requirement for immutable operating systems such as Fedora Silverblue or Universal blue, but it is also useful on operating systems like Fedora Workstation as a way to do development for other platforms, like for instance Red Hat Enterprise Linux.

A major focus for Toolbx since the inception is to get it a stage where it is robust and reliable. So for instance while we prototyped it as a shell script, today it is written in Go to be more maintainable and also to confirm with the rest of the container ecosystem. A recent major step forward for getting that stability there is that starting with Fedora 39, the toolbox image is now a release blocking deliverable. This means it is now built as part of the nightly compose and the whole Toolbx stack (ie. the fedora-toolbox image and the toolbox RPM) is part of the release-blocking test criteria. This shows the level of importance we put on Toolbx as the future of Linux software development and its criticality to Fedora Workstation. Earlier, we built the fedora-toobox image as a somewhat separate and standalone thing, and people interested in Toolbx would try to test and keep the whole thing working, as much as possible, on their own. This was becoming unmanageable because Toolbx integrates with many parts of the distribution from Mutter (ie, the Wayland and X sockets) to Kerberos to RPM (ie., %_netsharedpath in /usr/lib/rpm/macros.d/macros.toolbox) to glibc locale definitions and translations. The list of things that could change elsewhere in Fedora, and end up breaking Toolbx, was growing too large for a small group of Toolbx contributors to keep track of.

We the next release we now also have built-in support for Arch Linux and Ubuntu through the –distro flag in toolbox.git main, thanks again to the community contributors who worked with us on this allowing us to widen the amount of distros supported while keeping with our policy of reliability and dependability. And along the same theme of ensuring Toolbx is a tool developers can rely on we have added lots and lots of new tests. We now have more than 280 tests that run on CentOS Stream 9, all supported Fedoras and Rawhide, and Ubuntu 22.04.

Another feature that Toolbx maintainer Debarshi Ray put a lot of effort into is setting up full RHEL containers in Toolbx on top of Fedora. Today, thanks to Debarshi work you do subscription-manager register --username [email protected] on the Fedora or RHEL host, and the container is automatically entitled to RHEL content. We are still looking at how we can provide a graphical interface for that process or at least how to polish up the CLI for doing subscription-manager register. If you are interested in this feature, Debarshi provides a full breakdown here.

Other nice to haves added is support for enterprise FreeIPA set-ups, where the user logs into their machine through Kerberos and support for automatically generated shell completions for Bash, fish and Z shell.

Flatpak and Foreman & Katello
For those out there using Foreman to manage your fleet of Linux installs we have some good news. We are in the process of implementing support for Flatpaks in these tools so that you can manage and deploy applications in the Flatpak format using them. This is still a work in progress, but relevant Pulp and Katello commits are Pulp commit Support for Flatpak index endpoints and Katello commits Reporting results of docker v2 repo discovery” and Support Link header in docker v2 repo discovery“.

LVFS
Another effort that Fedora Workstation has brought to the world of Linux and that is very popular arethe LVFS and fwdup formware update repository and tools. Thanks to that effort we are soon going to be passing one hundred million firmware updates on Linux devices soon! These firmware updates has helped resolve countless bugs and much improved security for Linux users.

But we are not slowing down. Richard Hughes worked with industry partners this year to define a Bill of Materials defintion to firmware updates allowing usings to be better informed on what is included in their firmware updates.

We now support over 1400 different devices on the LVFS (covering 78 different protocols!), with over 8000 public firmware versions (image below) from over 150 OEMs and ODMs. We’ve now done over 100,000 static analysis tests on over 2,000,000 EFI binaries in the firmware capsules!

Some examples of recently added hardware:
* AMD dGPUs, Navi3x and above, AVer FONE540, Belkin Thunderbolt 4 Core Hub dock, CE-LINK TB4 Docks,CH347 SPI programmer, EPOS ADAPT 1×5, Fibocom FM101, Foxconn T99W373, SDX12, SDX55 and SDX6X devices, Genesys GL32XX SD readers, GL352350, GL3590, GL3525S and GL3525 USB hubs, Goodix Touch controllers, HP Rata/Remi BLE Mice, Intel USB-4 retimers, Jabra Evolve 65e/t and SE, Evolve2, Speak2 and Link devices, Logitech Huddle, Rally System and Tap devices, Luxshare Quad USB4 Dock, MediaTek DP AUX Scalers, Microsoft USB-C Travel Hub, More Logitech Unifying receivers, More PixartRF HPAC devices, More Synaptics Prometheus fingerprint readers, Nordic HID devices, nRF52 Desktop Keyboard, PixArt BLE HPAC OTA, Quectel EM160 and RM520, Some Western Digital eMMC devices, Star Labs StarBook Mk VIr2, Synaptics Triton devices, System76 Launch 3, Launch Heavy 3 and Thelio IO 2, TUXEDO InfinityBook Pro 13 v3, VIA VL122, VL817S, VL822T, VL830 and VL832, Wacom Cintiq Pro 27, DTH134 and DTC121, One 13 and One 12 Tablets

InputLeap on Wayland
One really interesting feature that landed for Fedora Workstation 39 was the support for InputLeap. It’s probably not on most peoples radar, but it’s an important feature for system administrators, developers and generally anyone with more than a single computer on their desk.

Historically, InputLeap is a fork of Barrier which itself was a fork of Synergy, it allows to share the same input devices (mouse, keyboard) across different computers (Linux, Windows, MacOS) and to move the pointer between the screens of these computers seamlessly as if they were one.

InputLeap has a client/server architecture with the server running on the main host (the one with the keyboard and mouse connected) and multiple clients, the other machines sitting next to the server machine. That implies two things, the InputLeap daemon on the server must be able to “capture” all the input events to forward them to the remote clients when the pointer reaches the edge of the screen, and the InputLeap client must be able to “replay” those input events on the client host to make it as if the keyboard and mouse were connected directly to the (other) computer. Historically, that relied on X11 mechanisms and neither InputLeap (nor Barrier or even Synergy as a matter of fact) would work on Wayland.

This is one of the use cases that Peter Hutterer had in mind when he started libEI, a low-level library aimed at providing a separate communication channel for input emulation in Wayland compositors and clients (even though libEI is not strictly tied to Wayland). But libEI alone is far from being sufficient to implement InputLeap features, with Wayland we had the opportunity to make things more secure than X11 and take benefit from the XDG portal mechanisms.

On the client side, for replaying input events, it’s similar to remote desktop but we needed to update the existing RemoteDesktop portal to pass the libEI socket. On the server side, it required a brand new portal for input capture . These also required their counterparts in the GNOME portal, for both RemoteDesktop and InputCapture [8], and of course, all that needs to be supported by the Wayland compositor, in the case of GNOME that’s mutter. That alone was a lot of work.

Yet, even with all that in place, that’s just the basic requirements to support a Synergy/Barrier/InputLeap-like feature, the tools in question need to have support for the portal and libEI implemented to benefit from the mechanisms we’ve put in place and for the all feature to work and be usable. So libportal was also updated to support the new portal features and a new “Wayland” backend alongside the X11, Windows and Mac OS backends was contributed to InputLeap.

The merge request in InputLeap was accepted very early, even before the libEI API was completely stabilized and before the rest of the stack was merged, which I believe was a courageous choice from Povilas (who maintains InputLeap) which helped reduce the time to have the feature actually working, considering the number of components and inter-dependencies involved. Of course, there are still features missing in the Wayland backend, like copy/pasting between hosts, but a clipboard interface was fairly recently added to the remote desktop portal and therefore could be used by InputLeap to implement that feature.

Fun fact, Xwayland also grew support for libEI also using the remote desktop portal and wires that to the XTEST extension on X11 that InputLeap’s X11 backend uses, so it might even be possible to use the X11 backend of InputLeap in the client side through Xwayland, but of course it’s better to use the Wayland backend on both the client and server sides.

InputLeap is a great example of collaboration between multiple parties upstream including key contributions from us at Red Hat to implement and contribute a feature that has been requested for years upstream..

Thank you to Olivier Fourdan, Debarshi Ray, Richard Hughes, Sebastian Wick and Jonas Ådahl for their contributions to this blog post.

November 17, 2023

Progress

 
This update's highlight is that last week I finally got the TP jobs working, which allows us to make the tensor manipulation in the HW, removing 18ms from the tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to that which the HW expects and the other way around, and for lowering strided convolutions to regular ones.
 
This makes our image classification benchmark twice as fast, as expected:

tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
Running the NN job took 13 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 15.650ms

60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same at around 8 ms, so there is still plenty of room for improvements.
 
Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.

Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that the hardware couldn't be designed for back then.

Upstreaming is going well, those interested can follow it here:
 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714.
 

Next steps

 

These will be my priorities during the next couple of weeks, in order:

  1. Upstreaming
  2. Get the Mobilenet SSD V1 model running on the HW, for object detection
  3. Performance
November 15, 2023

Hi! This month I’ve started a new PotM called pyonji. It’s an easy-to-use replacement for the venerable git-send-email command. The goal is to make it less painful for a new contributor not familiar with the e-mail based patch submission to submit patches.

Users are expected to use the same workflow as GitHub, GitLab and friends when contributing: create a new branch and add commits there. Instead of pushing to a fork though, users simply invoke pyonji.

When run for the first time, pyonji will ask for your e-mail account details: e-mail address, password… and nothing else. The SMTP server hostname, port and other details are automatically detected (via multiple means: SRV records, Mozilla auto-configuration database, common subdomains, etc). Once the password is verified pyonji will store everything in the Git configuration (in the same fashion that git-send-email expects it).

Then pyonji will present a UI with a list of commits to be submitted for review. The user can tweak details such as the base branch, the mailing list address, the version of the patch, however that’s rarely needed: pyonji will find good defaults for these. The user can add a cover letter if desired with a longer description for the set of patches. Then the big blue “submit” button can be pressed to send the patches.

Unlike git-send-email, pyonji will remember for you what the last submitted version number was (and automatically increment it). pyonji will save the cover letter so that it’s not lost if the network is flaky and you don’t need to re-type it for the next submission. pyonji will not waste your time with uninteresting questions such as “which encoding should I use?”. pyonji will automatically include the base tree information in the patches so that any conflicts are more easily resolved by the reviewer.

Please try it and let me know how it goes! In particular, I’m wondering if the logic to auto-detect the e-mail server settings are robust enough, or if there are e-mail providers I don’t handle correctly yet.

There is still a lot to be done to improve pyonji. Setup is painful for GMail and Fastmail users because app passwords are required. I wanted to use OAuth to fix this but both of these providers heavily restrict how SMTP OAuth apps can be registered. Setup doesn’t work for ProtonMail users because the bridge uses a self-signed certificate, that can be fixed but setup will remain painful. I’d like to add UI to change the base branch, improve the heuristics to pick a good default for the base branch, support for the MAINTAINERS file for easier contribution to big projects such as the kernel, add an easy way to mark a patch series as RFC, and probably a million of other things.

Apart from pyonji, I’ve been working on some graphics-related stuff as always. We’re getting closer to the wlroots 0.17 release, fixing the few remaining blocking issues. A new API to clip surfaces with the scene-graph has been merged, many thanks to Alexander Orzechowski and Isaac Freund! I’ve fixed a Mesa regression introduced by a previous patch I’ve reviewed related to EGL and split render/display SoCs (I hate these). And I’ve been discussing with other kernel developers about a way to stop (ab)using KMS dumb buffers for split render/display SoCs (I swear I really hate these). We’re trying to come up with a solution which could on the long run also help with the Buffer Allocation Constraints Problem (see the XDC 2020 talk for more info).

I’ve written a few patches to add support for OAuth 2.0 refresh tokens to meta.sr.ht. If you’ve ever used an OAuth sr.ht app (like hottub or yojo to integrate builds.sr.ht with GitHub or Forgejo), you probably know that tokens expire after one year, and that you need to redo the setup step when that happens. This is annoying, and adding support for refresh tokens to meta.sr.ht and the OAuth apps should fix this.

Last, I’m now part of the FreeDesktop Code of Conduct team. This is not a technical role, but it’s very important to have folks doing this work. I’ve attended a Code of Conduct workshop to learn how to do it, that’s been pretty interesting and helpful. The workshop focused a lot more on trying to change people’s behavior, instead of bringing down the ban hammer.

That’s all for now, see you next month!

November 11, 2023

Today, 12 years after the meeting where AppStream was first discussed and 11 years after I released a prototype implementation I am excited to announce AppStream 1.0! 🎉🎉🎊

Check it out on GitHub, or get the release tarball or read the documentation or release notes! 😁

Some nostalgic memories

I was not in the original AppStream meeting, since in 2011 I was extremely busy with finals preparations and ball organization in high school, but I still vividly remember sitting at school in the students’ lounge during a break and trying to catch the really choppy live stream from the meeting on my borrowed laptop (a futile exercise, I watched parts of the blurry recording later).

I was extremely passionate about getting software deployment to work better on Linux and to improve the overall user experience, and spent many hours on the PackageKit IRC channel discussing things with many amazing people like Richard Hughes, Daniel Nicoletti, Sebastian Heinlein and others.

At the time I was writing a software deployment tool called Listaller – this was before Linux containers were a thing, and building it was very tough due to technical and personal limitations (I had just learned C!). Then in university, when I intended to recreate this tool, but for real and better this time as a new project called Limba, I needed a way to provide metadata for it, and AppStream fit right in! Meanwhile, Richard Hughes was tackling the UI side of things while creating GNOME Software and needed a solution as well. So I implemented a prototype and together we pretty much reshaped the early specification from the original meeting into what would become modern AppStream.

Back then I saw AppStream as a necessary side-project for my actual project, and didn’t even consider me as the maintainer of it for quite a while (I hadn’t been at the meeting afterall). All those years ago I had no idea that ultimately I was developing AppStream not for Limba, but for a new thing that would show up later, with an even more modern design called Flatpak. I also had no idea how incredibly complex AppStream would become and how many features it would have and how much more maintenance work it would be – and also not how ubiquitous it would become.

The modern Linux desktop uses AppStream everywhere now, it is supported by all major distributions, used by Flatpak for metadata, used for firmware metadata via Richard’s fwupd/LVFS, runs on every Steam Deck, can be found in cars and possibly many places I do not know yet.

What is new in 1.0?

API breaks

The most important thing that’s new with the 1.0 release is a bunch of incompatible changes. For the shared libraries, all deprecated API elements have been removed and a bunch of other changes have been made to improve the overall API and especially make it more binding-friendly. That doesn’t mean that the API is completely new and nothing looks like before though, when possible the previous API design was kept and some changes that would have been too disruptive have not been made. Regardless of that, you will have to port your AppStream-using applications. For some larger ones I already submitted patches to build with both AppStream versions, the 0.16.x stable series as well as 1.0+.

For the XML specification, some older compatibility for XML that had no or very few users has been removed as well. This affects for example release elements that reference downloadable data without an artifact block, which has not been supported for a while. For all of these, I checked to remove only things that had close to no users and that were a significant maintenance burden. So as a rule of thumb: If your XML validated with no warnings with the 0.16.x branch of AppStream, it will still be 100% valid with the 1.0 release.

Another notable change is that the generated output of AppStream 1.0 will always be 1.0 compliant, you can not make it generate data for versions below that (this greatly reduced the maintenance cost of the project).

Developer element

For a long time, you could set the developer name using the top-level developer_name tag. With AppStream 1.0, this is changed a bit. There is now a developer tag with a name child (that can be translated unless the translate="no" attribute is set on it). This allows future extensibility, and also allows to set a machine-readable id attribute in the developer element. This permits software centers to group software by developer easier, without having to use heuristics. If we decide to extend the developer information per-app in future, this is also now possible. Do not worry though the developer_name tag is also still read, so there is no high pressure to update. The old 0.16.x stable series also has this feature backported, so it can be available everywhere. Check out the developer tag specification for more details.

Scale factor for screenshots

Screenshot images can now have a scale attribute, to indicate an (integer) scaling factor to apply. This feature was a breaking change and therefore we could not have it for the longest time, but it is now available. Please wait a bit for AppStream 1.0 to become deployed more widespread though, as using it with older AppStream versions may lead to issues in some cases. Check out the screenshots tag specification for more details.

Screenshot environments

It is now possible to indicate the environment a screenshot was recorded in (GNOME, GNOME Dark, KDE Plasma, Windows, etc.) via an environment attribute on the respective screenshot tag. This was also a breaking change, so use it carefully for now! If projects want to, they can use this feature to supply dedicated screenshots depending on the environment the application page is displayed in. Check out the screenshots tag specification for more details.

References tag

This is a feature more important for the scientific community and scientific applications. Using the references tag, you can associate the AppStream component with a DOI (Digital object identifier) or provide a link to a CFF file to provide citation information. It also allows to link to other scientific registries. Check out the references tag specification for more details.

Release tags

Releases can have tags now, just like components. This is generally not a feature that I expect to be used much, but in certain instances it can become useful with a cooperating software center, for example to tag certain releases as long-term supported versions.

Multi-platform support

Thanks to the interest and work of many volunteers, AppStream (mostly) runs on FreeBSD now, a NetBSD port exists, support for macOS was written and a Windows port is on its way! Thank you to everyone working on this 🙂

Better compatibility checks

For a long time I thought that the AppStream library should just be a thin layer above the XML and that software centers should just implement a lot of the actual logic. This has not been the case for a while, but there was still a lot of complex AppStream features that were hard for software centers to implement and where it makes sense to have one implementation that projects can just use.

The validation of component relations is one such thing. This was implemented in 0.16.x as well, but 1.0 vastly improves upon the compatibility checks, so you can now just run as_component_check_relations and retrieve a detailed list of whether the current component will run well on the system. Besides better API for software developers, the appstreamcli utility also has much improved support for relation checks, and I wrote about these changes in a previous post. Check it out!

With these changes, I hope this feature will be used much more, and beyond just drivers and firmware.

So much more!

The changelog for the 1.0 release is huge, and there are many papercuts resolved and changes made that I did not talk about here, like us using gi-docgen (instead of gtkdoc) now for nice API documentation, or the many improvements that went into better binding support, or better search, or just plain bugfixes.

Outlook

I expect the transition to 1.0 to take a bit of time. AppStream has not broken its API for many, many years (since 2016), so a bunch of places need to be touched even if the changes themselves are minor in many cases. In hindsight, I should have also released 1.0 much sooner and it should not have become such a mega-release, but that was mainly due to time constraints.

So, what’s in it for the future? Contrary to what I thought, AppStream does not really seem to be “done” and fetature complete at a point, there is always something to improve, and people come up with new usecases all the time. So, expect more of the same in future: Bugfixes, validator improvements, documentation improvements, better tools and the occasional new feature.

Onwards to 1.0.1! 😁

November 10, 2023

TLDR: see the title of this blog post, it's really that trivial.

Now that GodotWayland has been coming for ages and all new development focuses on a pile of software that steams significantly less, we're seeing cracks appear in the old Xorg support. Not intentionally, but there's only so much time that can be spent on testing and things that are more niche fall through. One of these was a bug I just had the pleasure of debugging and was triggered by GNOME on Xorg user using the xf86-input-libinput driver for tablet devices.

On the surface of it, this should be fine because libinput (and thus xf86-input-libinput) handles tablets just fine. But libinput is the new kid on the block. The old kid on said block is the xf86-input-wacom driver, older than libinput by slightly over a decade. And oh man, history has baked things into the driver that are worse than raisins in apple strudel [1].

The xf86-input-libinput driver was written as a wrapper around libinput and makes use of fancy things that (from libinput's POV) have always been around: things like input device hotplugging. Fancy, I know. For tablet devices the driver creates an X device for each new tool as it comes into proximity first. Future events from that tool will go through that device. A second tool, be it a new pen or the eraser on the original pen, will create a second X device and events from that tool will go through that X device. Configuration on any device will thus only affect that particular pen. Almost like the whole thing makes sense.

The wacom driver of course doesn't do this. It pre-creates X devices for some possible types of tools (pen, eraser, and cursor [2] but not airbrush or artpen). When a tool goes into proximity the events are sent through the respective device, i.e. all pens go through the pen tool, all erasers through the eraser tool. To actually track pens there is the "Wacom Serial IDs" property that contains the current tool's serial number. If you want to track multiple tools you need to query the property on proximity in [4]. At the time this was within a reasonable error margin of a good idea.

Of course and because MOAR CONFIGURATION! will save us all from the great filter you can specify the "ToolSerials" xorg.conf option as e.g. "airbrush;12345;artpen" and get some extra X devices pre-created, in this case a airbrush and artpen X device and an X device just for the tool with the serial number 12345. All other tools multiplex through the default devices. Again, at the time this was a great improvement. [5]

Anyway, where was I? Oh, right. The above should serve as a good approximation of a reason why the xf86-input-libinput driver does not try to be fullly compatible to the xf86-input-wacom driver. In everyday use these things barely matter [6] but for the desktop environment which needs to configure these devices all these differences mean multiple code paths. Those paths need to be tested but they aren't, so things fall through the cracks.

So quite a while ago, we made the decision that until Xorg goes dodo, the xf86-input-wacom driver is the tablet driver to use in GNOME. So if you're using a GNOME on Xorg session [7], do make sure the xf86-input-wacom driver is installed. It will make both of us happier and that's a good aim to strive for.

[1] It's just a joke. Put the pitchforks down already.
[2] The cursor is the mouse-like thing Wacom sells. Which is called cursor [3] because the English language has a limited vocabulary and we need to re-use words as much as possible lest we run out of them.
[3] It's also called puck. Because [2].
[4] And by "query" I mean "wait for the XI2 event notifying you of a property change". Because of lolz the driver cannot update the property on proximity in but needs to schedule that as idle func so the property update for the serial always arrives at some unspecified time after the proximity in but hopefully before more motion events happen. Or not, and that's how hope dies.
[5] Think about this next time someone says they long for some unspecified good old days.
[6] Except the strip axis which on the wacom driver is actually a bit happily moving left/right as your finger moves up/down on the touch strip and any X client needs to know this. libinput normalizes this to...well, a normal value but now the X client needs to know which driver is running so, oh deary deary.
[7] e.g because your'e stockholmed into it by your graphics hardware

November 08, 2023

I’ve recently worked on a patch for the vc4 display driver used on the Raspberry Pi 4. To test this patch, I needed to compile the kernel and install it, something I know how to do on x86 but not on Raspberry Pi. Because I’m pretty stubborn I’ve also insisted on making my life harder:

  • I installed Arch Linux ARM as the base system, instead of Raspberry Pi OS or Raspbian.
  • I based my patches on top of the mainline kernel, instead of using Raspberry Pi’s tree.
  • I wanted to install my built kernel alongside the one provided by the distribution, instead of overwriting it.

Raspberry Pi has an official guide to compile the kernel, however it assumes Raspberry Pi OS, Raspberry Pi’s kernel tree, and overwrites the current kernel. It was still very useful to get an idea of the process. Still, quite a few adaptations have been required. This blog post serves as my personal notepad to remember how to Do It.

First, the official guide instructs us to run make bcm2711_defconfig to generate the kernel config, however mainline complains with:

Can't find default configuration "arch/arm/configs/bcm2711_defconfig"

This can be fixed by grabbing this file from the Raspberry Pi tree:

curl -L -o arch/arm/configs/bcm2711_defconfig "https://github.com/raspberrypi/linux/raw/rpi-6.1.y/arch/arm/configs/bcm2711_defconfig"

Once that’s done, compiling the kernel as usual works fine. Then we need to install it to the /boot partition. We can ignore the overlays stuff from the official guide, we don’t use these. The source paths need to be slightly adjusted, and the destination paths need to be fixed up to use a subdirectory:

doas make modules_install
doas cp arch/arm/boot/dts/broadcom/*.dtb /boot/custom/
doas cp arch/arm/boot/zImage /boot/custom/kernel7.img

Then we need to generate an initramfs. At first I forgot to do that step and the kernel was hanging around USB bus discovery.

doas mkinitcpio --generate /boot/custom/initramfs-linux.img --kernel /boot/custom/kernel7.img

The last step is updating the boot firmware configuration located at /boot/config.txt. Comment out any dtoverlay directive, then add os_prefix=custom/ to point the firmware to our subdirectory (note, the final slash is important).

For some reason my memory card was showing up as /dev/mmcblk1 instead of /dev/mmcblk0, so I had to bang my head against the wall until I notice the difference adjust /boot/cmdline.txt and /etc/fstab accordingly.

That’s it! After a reboot I was ready to start kernel hacking. Thanks to Maíra Canal for replying to my distress signal on Mastodon and providing recommendations!

November 07, 2023

TL;DR:

This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.

Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.

Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.


Operating Systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to manage the diversity of color profiles, do color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content and color enhacements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.

In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you to the color features for the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.

We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks display features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional lookup tables (3D LUT). Here, we will understand how these color features are strategically placed into color blocks both before and after blending in Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.

That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!

AMD Display Driver in the Linux/DRM Subsystem: The Journey

In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].

The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.

On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.

The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.

Exploring Color Capabilities of the AMD display hardware

Now, let’s dive into the exciting realm of AMD color capabilities, where a abundance of techniques and tools await to make your colors look extraordinary across diverse devices.

First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.

Predefined Transfer Functions (Named Fixed Curves):

Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.

ITU-R 2100 introduces three main types of transfer functions:

  • OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
  • EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
  • OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.

AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):

  • Linear/Unity: linear/identity relationship between pixel value and luminance value;
  • Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
  • sRGB: 2.4: The piece-wise transfer function from IEC 61966-2-1:1999;
  • BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
  • PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.

These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.

1D LUTs (1-dimensional Lookup Table):

A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations

It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.

3D LUTs (3-dimensional Lookup Table):

These tables work in three dimensions – red, green, and blue. They’re perfect for complex color transformations and adjustments between color channels. It’s also more complex to manage and require more computational resources. Jeremy also explains 3D LUT at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations

CTM (Color Transformation Matrices):

Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.

HDR Multiplier:

HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.

AMD Color Capabilities in the Hardware Pipeline

First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next

In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.

Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!

AMD Color Blocks and Capabilities

We can finally talk about the color capabilities of each AMD color block. As it varies based on the generation of hardware, let’s take the DCN3+ family as reference. What’s possible to do before and after blending depends on hardware capabilities describe in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.

The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take SteamDeck/DCN301 driver as an example and look at the “Color pipeline capabilities” described in the file: driver/gpu/drm/amd/display/dcn301/dcn301_resources.c

/* Color pipeline capabilities */

dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Intput Color Space Conversion  (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;

dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve. 
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just after blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;

dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.

I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.

Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.

DPP Color Pipeline: Before Blending (Per Plane)

Let’s explore the capabilities of DPP blocks and what you can achieve with a color block. The very first thing to pay attention is the display architecture of the display hardware: previously AMD uses a display architecture called DCE

  • Display and Compositing Engine, but newer hardware follows DCN - Display Core Next.

The architectute is described by: dc->caps.color.dpp.dcn_arch

AMD Plane Degamma: TF and 1D LUT

Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps,dc->caps.color.dpp.gamma_corr

AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is utilized to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ separate hardcoded curves and 1D LUT into two block: Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and Gamma correction block (dc->caps.color.dpp.gamma_corr), respectively.

Pre-defined transfer functions:

  • they are hardcoded curves (read-only memory - ROM);
  • supported curves: sRGB EOTF, BT.709 inverse OETF, PQ EOTF and HLG OETF, Gamma 2.2, Gamma 2.4 and Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.

References:

AMD Plane 3x4 CTM (Color Transformation Matrix)

AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.

References:

AMD Plane Shaper: TF + 1D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps dc->caps.color.dpp.hw_3d_lut means support for both shaper 1D LUT and 3D LUT.

Pre-defined transfer function enables delinearizing content with or without shaper LUT, where AMD color module calculates the resulted shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don’t need to normalize values, we can set a Identity TF for shaper that works similar to bypass and is also the default TF value.

Pre-defined transfer functions:

  • there is no DPP Shaper ROM. Curves are calculated by AMD color modules. Check calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.

References:

AMD Plane 3D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The 3D LUT in the DPP block facilitates complex color transformations and adjustments. 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, the dc->caps.color.dpp.hw_3d_lut describe if DPP 3D LUT is supported.

The AMD driver-specific property advertise the size of a single dimension via LUT3D_SIZE property. Plane 3D LUT is a blog property where the data is interpreted as an array of struct drm_color_lut elements and the number of entries is LUT3D_SIZE cubic. The array contains samples from the approximated function. Values between samples are estimated by tetrahedral interpolation The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension, red the innermost. This distribution is better visualized when examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:

+	for (nib = 0; nib < 17; nib++) {
+		for (nig = 0; nig < 17; nig++) {
+			for (nir = 0; nir < 17; nir++) {
+				ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+				rgb_area[ind].red = rgb_lib[ind_lut + 0];
+				rgb_area[ind].green = rgb_lib[ind_lut + 1];
+				rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+				ind++;
+			}
+		}
+	}

In our driver-specific approach we opted to advertise it’s behavior to the userspace instead of implicitly dealing with it in the kernel driver. AMD’s hardware supports 3D LUTs with 17-size or 9-size (4913 and 729 entries respectively), and you can choose between 10-bit or 12-bit. In the current driver-specific work we focus on enabling only 17-size 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:

+		/* Stride and bit depth are not programmable by API yet.
+		 * Therefore, only supports 17x17x17 3D LUT (12-bit).
+		 */
+		lut->lut_3d.use_tetrahedral_9 = false;
+		lut->lut_3d.use_12bits = true;
+		lut->state.bits.initialized = 1;
+		__drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+					lut->lut_3d.use_tetrahedral_9,
+					MAX_COLOR_3DLUT_BITDEPTH);

A refined control of 3D LUT parameters should go through a follow-up version or generic API.

Setting 3D LUT to NULL means bypass.

References:

AMD Plane Blend/Out Gamma: TF + 1D LUT

Described by: dc->caps.color.dpp.ogam_ram

The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after 3D LUT and just before the blending. It supports both 1D LUT and pre-defined TF. We can see Shaper and Blend LUTs as 1D LUTs that are sandwich the 3D LUT. So, if we don’t need 3D LUT transformations, we may want to only use Degamma block to linearize and skip Shaper, 3D LUT and Blend.

Pre-defined transfer function:

  • there is no DPP Blend ROM. Curves are calculated by AMD color modules;
  • supported curves: Identity, sRGB EOTF, BT.709 inverse OETF, PQ EOTF, HLG inverse OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, AMD color module will combine the user LUT values with pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.

References:

MPC Color Pipeline: After Blending (Per CRTC)

DRM CRTC Degamma 1D LUT

The degamma lookup table (LUT) for converting framebuffer pixel data before apply the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.

Not really supported. The driver is currently reusing the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) for supporting DRM CRTC Degamma LUT, as explaning by [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.

DRM CRTC 3x3 CTM

Described by: dc->caps.color.mpc.gamut_remap

It sets the current transformation matrix (CTM) apply to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.

DRM CRTC Gamma 1D LUT + AMD CRTC Gamma TF

Described by: dc->caps.color.mpc.ogam_ram

After all that, you might still want to convert the content to wire encoding. No worries, in addition to DRM CRTC 1D LUT, we’ve got a AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.

Pre-defined transfer functions:

  • there is no MPC Gamma ROM. Curves are calculated by AMD color modules.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.

References:

Others

AMD CRTC Shaper and 3D LUT

We have previously worked on exposing CRTC shaper and CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack userspace case. CRTC shaper and 3D LUT works similar to plane shaper and 3D LUT but after blending (MPC block). The difference here is that setting (not bypass) Shaper and Gamma blocks together are not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.

Input and Output Color Space Conversion

There are two other color capabilities of AMD display hardware that were integrated to DRM by previous works and worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, we have de DC Output CSC (OCSC) sets pre-defined coefficients from DRM connector colorspace properties. It is uses for color space conversion of the composed image to the one supported by the sink.

References:

The search for rainbow treasures is not over yet

If you want to understand a little more about this work, be sure to watch Joshua and I presented two talks at XDC 2023 about AMD/Steam Deck colors on Gamescope:

In the time between the first and second part of this blog post, Uma Shashank and Chaitanya Kumar Borah published the plane color pipeline for Intel and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for Color on Linux during the Color Management workshop at XDC 2023 and I briefly shared workshop results in the 2023 XDC lightning talk session.

The search for rainbow treasures is not over yet! We plan to meet again next year in the 2024 Display Hackfest in Coruña-Spain (Igalia’s HQ) to keep up the pace and continue advancing today’s display needs on Linux.

Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.

November 06, 2023

 If you remember the last update two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and Mesa.

One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.

Regarding Mesa, I have made several cleanups and have started getting great review comments from Christian Gmeiner.

While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster  than the naive code I was running on the CPU for this.

Got some jobs producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. Have already made a pass at the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found yet the problem. I will next improve the tooling around this and get a better view of the differences.

I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something at the kernel driver.

During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.


November 05, 2023

Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.

To get this working you need to install the firmware which hasn't landed in linux-firmware yet.

For Fedora this copr has the firmware in the necessary places:

https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/ 

Hopefully we can upstream that in next week or so.

If you have an ADA based GPU then it should just try and work out of the box, if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.

Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7, then going forward we have to work out how to keep it up to date and support new hardware and how to add new features.


November 03, 2023

This is the second part of the Xwayland rootful post, the first part is there

Using Xwayland rootful to run a full X11 desktop

Xwayland rootful can run more than just a window manager, it can as well run an entire X11 desktop, for example with Xfce:

$ Xwayland -geometry 1024x768 -decorate :12 &
DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Xfce running on Xwayland rootful in GNOME Shell on Wayland


Unfortunately, not all the keyboard shortcuts within the nested X11 session actually work, because some of those (such a Alt-Tab for example) get processed by the Wayland compositor directly, instead of being forwarded to the nested environment.

This however isn't a problem specific to Wayland or Xwayland, an X11 window manager running in Xnest or Xephyr will have the same issues with keyboard shortcuts. To avoid that, Xephyr is able to „grab“ the keyboard and pointer so that all input events end up in the nested X11 session and do not get processed by the parent session.

Xwayland 23.1 has a similar functionality using the Wayland pointer locking & confinement protocol and the keyboard shortcuts inhibitor protocol.

So if your favorite Wayland compositor supports these protocols (in doubt, you can check that it is the case using „wayland-info“), you can use the „-host-grab“ option in Xwayland rootful:

$ Xwayland -geometry 1024x768 -decorate -host-grab :12 &
DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Pressing the Control and Shift keys simultaneously will release the keyboard and pointer (just like with Xephyr actually).

Using Xwayland rootful to run a single X11 application

In some cases, it might be desirable to run a single X11 application isolated from the rest of the X11 clients, on its own X11 server.

On such a setup, one could run a single X11 client either maximized or fullscreen within Xwayland rootful.

Since Xwayland 23.2 allows to interactively resize the root window, users could mode and resize that window at will.

But for that to work, we need a simple X11 window manager that could resize the X11 client window along with the root window, using XRANDR notifications, such as the matchbox window manager for example.

$ Xwayland -geometry 1024x768 -decorate :12 &
matchbox-window-manager -display :12 &
$ GDK_BACKEND=x11 midori --display=:12

When the Xwayland rootful window is resized, corresponding XRANDR events are emitted, notifying the X11 window manager which in turn resizes the client window.

Using Xwayland rootful fullscreen

For years now, Xwayland rootless had support for the viewport Wayland protocol, to emulate XRandR for legacy games thanks to the work from Hans De Goede.

So the idea is to add a fullscreen mode to Xwayland rootful and take advantage of the Wayland viewports support to emulate resolution changes.

This is exactly what the „-fullscreen“ command line options does, it starts Xwayland rootful in fullscreen mode using the xdg_toplevel Wayland protocol and uses the existing viewport support to scale the window and to match the actual display physical resolution.

The emulated resolution is not even limited by the physical resolution, it's possible to use XRANDR to select an emulated resolution much higher than the actual monitor's resolution, quite handy to test X11 applications on high resolution without having to purchase expensive monitors!

$ Xwayland -fullscreen :12 &
matchbox-window-manager -display :12 &
$ xterm -display :12 &
$ xrandr -s 5120x2880 -display :12

Are we done yet?

Well, there's still one thing Xwayland is not handling well, it's HiDPI and fractional scaling.

With rootless Xwayland (as on a typical Wayland desktop session), all X11 clients share the same Xwayland server, and can span across different Wayland outputs of different scales.

Even though theoretically each Wayland surface associated with each X11 window could have a different scale factor set by Xwayland, all X11 clients on the same Xserver share the same coordinate space, so in practice different X11 windows cannot have different scale factors applied.

That's the reason why all the existing merge requests to add support for HiDPI to Xwayland set the same scale to all X11 surfaces. But that means that the rendered surface could end up being way too small depending on the actual scale the window is placed on, on a mixed-DPI multi-monitor setup (I already shared my views of the problem in this issue upstream).

But such limitation does not apply to rootful Xwayland, considering that all the X11 clients running on a rootful Xwayland actually belong to and remain within the same visible root window. They are part of the same visual entity and move all together along with the Xwayland rootful window.

So we could possibly add support for HiDPI (and hence achieve fractional scaling without blurred fonts) to rootful Xwayland. The idea is that Xwayland would set the surface scale to match the scale of the output it's placed on, and automatically resize its root window according to the scale, whenever that changes or when the rootful Xwayland window is moved from one monitor to another.

So for example, when Xwayland rootful with a size of 640×480 is moved from an output with scale 1 to an output with scale 2, the size of the root window (hence the Xwayland rootful window) would be automatically changed to 1280×960, along with the corresponding XRANDR notifications so that an X11 window manager running nested can adjust the X11 clients size and positions.

And if we want a way to communicate that to the X11 clients running within Xwayland rootful, we can use an X11 property on the root window that reflects the actual scale factor being applied. An X11 client could either use that property directly, or more likely, a simple dedicated daemon could adjust the scaling factor of the various X11 toolkits depending on the value set for Wayland scaling.

That's what that proposed merge request upstream does.

gnome-calculator running on Xwayland rootful with 150% fractional scaling

Of course, at this time of writing, this is just a merge request I just posted upstream, and there is no promise that it will accepted eventually. We'll see how that goes, but if that could find its way to Xwayland upstream, it would be part of the next major release of Xwayland some time next year.

October 30, 2023

I was at XDC 2023 in A Coruña a few days ago where I had the opportunity to talk about some of the work we have been doing on the Raspberry Pi driver stack together with my colleagues Juan Suárez and Maíra Canal. We talked about Raspberry Pi 5, CPU job handling in the Vulkan driver, OpenGL 3.1 support and how we are exposing GPU stats to user space. If you missed it here is the link to Youtube.

Big thanks to Igalia for organizing it and to all the sponsors and specially to Samuel and Chema for all the work they put into making this happen.

October 27, 2023

🪑?

October 26, 2023

And Now For Something Slightly More Technical

It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here one last big technical post about something I hate.

Swapchain readback.

So Easy Even You Could Accidentally Do It

I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.

I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:

  • draw some stuff
  • swapbuffers
  • blitframebuffer

And this happened on every single frame (???).

In Zink Terms…

This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.

Vulkan doesn’t allow readback from swapchains. By this, I mean:

  • swapchain images must be acquired before they can be accessed for any purpose
  • there is no method to explicitly reacquire a specific swapchain image
  • there is no guarantee that swapchain images are unchanged after present

Combined, once you have presented a swapchain image you’re screwed.

…According to the spec, that is. In the real world, things work differently.

Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.

P E R F

This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which is does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…

Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.

This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.

The functionality is already merged for the upcoming 23.3 release.

Footgun as hard as you want.

October 25, 2023

More Milestones

As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.

This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.

Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.

October 24, 2023

Hi all, long time no see! It’s been more than two months since the last status update. My excuse for this silence is two-fold: I was on leave for 5 weeks, and then X.Org Developer’s Conference happened. During my time off, I’ve traveled in Korea and Japan. I will be blunt: these last two months have been fantastic! And to be honest, that’s a huge understatement.

Busan view from Jangsan

East gate

After my trip in Asia, I went to a 2-day Valve hackfest in Igalia’s headquarters. I met other Valve contractors there, we discussed about various topics such as color management, variable refresh rate, flicker-free startup, and more.

At XDC, there were lots of interesting talks and workshops: HDR by Joshua and Melissa, NVK by Faith, Asahi by Alyssa et al, wlroots frame scheduling by Rose (my GSoC student), CI by Martin, VKMS by Maíra, Wine Wayland by Alexandros, Wine X11 by Arek, and many more! Everything should be available online if you haven’t watched live. That said, as usual, the part I enjoyed the most is the so-called hallway track. It’s great to have free-form discussions with fellow graphics developers, it results in a pretty different train of thought than the usual focused discussions we have online.

Apart from these events, I’ve found some time to do a bit of actual work, too. I’ve re-spinned an old patch I wrote to introduce a new CLOSEFB IOCTL, to allow a DRM master to leave a framebuffer on-screen when quitting so that the next DRM master can take over without a black screen in-between. This time I also included a user-space patch and an IGT test (both requirements for new kernel uAPI). I sent (and merged) another kernel patch to fix black screens in some situations when unplugging USB-C docks.

On the Wayland side, I continued working on explicit synchronization, updating the protocol and submitting a gamescope patch. Joshua has been working on a Mesa patch, so all of the pieces are coming together now. On the SourceHut side, I’ve sent a patch to add HTTP/2 support to pages.sr.ht. It’s been merged and deployed, enjoy! The NPotTM is libicc, a small library to parse ICC profile files. Unlike LittleCMS, it provides lower-level access to the ICC structure and the exact color transformation operations.

That’s all for now, see you next month!

There is an issue with the rpmfusion packaged IPU6 camera stack for Fedora is not working on many Dell laptop models after upgrading the kernel to a 6.5.y kernel.

This is caused by a new mainline ov0a10 sensor driver which takes precedence over the akmod ov0a10 driver but lacks VSC integration.

This can be worked around by running the following command:
sudo rm /lib/modules/$(uname -r)/kernel/drivers/media/i2c/ov01a10.ko.xz; sudo depmod -a

After the rm + depmod run:
sudo rmmod ov01a10; sudo modprobe ov01a10

Or reboot. After this your camera will hopefully work again.

I have submitted a pull-request to disable the mainline kernel's non working ov01a10 driver, so after the next Fedora kernel update this workaround should no longer be necessary.

Your Bug Has Already Been Solved

After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.

Some of those crashes, however, are not my bugs. They’re system bugs.

In particular, any of you still using Xorg instead of Wayland will want to create this file:

$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
	Option "Debug" "dmabuf_capable"
EndSection

This makes your xserver dmabuf-capable, which will be more successful when running things with zink.

Another problem you’re likely to have is this console error:

DRI3 not available
failed to load driver: zink

Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.

Delete this package.

Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.

If you’re still having problems after checking for both of these issues, try turning your computer on.

October 23, 2023

Progress

Since the last update I finally got the whole of MobileNetv1 running at full-accuracy on the NPU with Mesa: 
tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Processing the input took 18 ms.
Running the NN job took 13 ms.
Processing the output took 1 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 33.094ms
That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.

Most of the time (18 ms.) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.

The 13 ms. that the convolutions take in the NPU is still sensibly higher than the 8 ms. that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.
 

Next steps

Now that we have something that people can use in their products, I will switch to upstreaming mode.

I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found here.

I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies with the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.


Almost That Time Again

As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of shitpostinghighly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.

This is Day 1.

Zink: No Longer A Hacky Workaround Driver

2023 has seen great strides in the zink ecosystem:

  • Some games, most notably my favorite game of all time X-Plane, are now shipping zink in order to have a consistent GL experience across platforms
  • Zink has reached official GL 4.6 conformance on Imagination GPUs and will be shipping as their GL implementation
  • Zink can now run display servers for both X and Wayland, enabling full systems to exist without a native GL implementation

And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.

MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.

Or Does It?

Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.

Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.

The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.

I can’t imagine anyone will need it, but remember that issues can be reported here.

October 20, 2023

A bit of background

Xwayland is intended as a compatibility layer, to allow legacy X11 applications to continue to work in a Wayland environment.

Most Wayland compositors run Xwayland „rootless“ (using the command line option „-rootless“ when spawning Xwayland) so that X11 clients can integrate seamlessly with the other Wayland native clients, the Wayland compositor taking care of stacking the various windows (or surfaces) regardless of the client being X11 or Wayland native.

That actually works very well, so well that in many cases users do not even realize that any particular client is still running on X11, using Xwayland.

For that to work, the Wayland compositor needs to integrate a fully functional X11 window manager.

Sometimes, however, it is useful to use a separate X11 server to run X11 applications with another X11 window manager or even a full X11 environment.

Nested X11 servers

With X11, it is possible to run a nested X11 server such as Xnest or Xephyr, and run a full X11 environment within those nested X servers.

That can be useful for a number of reasons, like connecting remotely to a remote legacy Unix server using XDMCP (not that I would recommend that anyway!), or for testing a particular X11 application with different window managers, or even because a particular X11 application is certified only with a specific window manager. The possibilities are endless.

$ Xephyr -retro -screen 1024x768 :12

Xephyr running the Motif window manager on a GNOME Shell Wayland session

But Xnest or Xephyr are X11 clients themselves, meaning that they run on top of Xwayland when running on a Wayland compositor. That's a bit of a waste, using two X11 servers on top of a Wayland compositor.

Besides, with X.org development winding down, downstream maintainers and packagers may want to reduce the number of X11 servers they ship and have to maintain in the future.

What's wrong with Xwayland rootful?

Right, so if Xwayland already runs rootful by default, why not just using that instead of Xnest or Xephyr?

Well, up until Xwayland 23.1, Xwayland rootful would take its screen configuration from the Wayland compositor itself (using the wl_output or xdg-output Wayland protocols), meaning that when running rootful, Xwayland would map a surface the size of all the monitors, and the user would have no way to easily move or resize it.

That's far from being practical, especially when using a multi-monitor setup!

Making Xwayland rootful (more) usable

So the first step to help making Xwayland rootful suitable as a nested X11 server is to provide a command line option to specify the desired size of the Xwayland window.

That's the „-geometry“ option introduced in Xwayland 23.1 so that one can specifies the desired size of the Xwayland rootful window:

$ Xwayland -geometry 1024x768 :12

That will grant you with a black window of the specified size. If you want more of a „retro“ look, you can get the classic stipple and the X cursor, you can use:

$ Xwayland -geometry 1024x768 -retro :12

Still, the Xwayland window is missing a title bar that would allow for moving the window around.

This is because Wayland does not decorate its surfaces, this is left to the Wayland client themselves to add window decorations (also known as client side decorations, or CSD for short).

This however would add a lot of complexity to Xwayland (which is primarily an Xserver, not a full fledged Wayland application). Thankfully, there is libdecor which can fence Xwayland from that complexity and provide window decorations for us.

So if libdecor is installed on the system and Xwayland is built with libdecor enabled (this is an optional dependency though), then we can request that Xwayland uses decorations with the „-decorate“ command line option:

$ Xwayland -geometry 1024x768 -retro -decorate :12


No we can have fun running some legacy X11 applications on that Xwayland rootful server:

$ xterm -display :12 &
$ twm -display :12 &
$ xsetroot -solid dodgerblue -display :12


We can even use „xrandr“ to query the size of the Xwayland window and resize it:


New with Xwayland 23.2, the Xwayland window is also resize-able interactively and the resulting display size is available in XRandR, creating an XRandR configuration to match the actual window size set interactively by the user:



October 18, 2023

This is my first blog post, ever!

I'm afraid there isn't much yet, but my intention is to post things related to Xwayland and various other projects I contribute to.

October 12, 2023

EBUSY

As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.

But there have been updates, and I’m gonna round ‘em all up.

R A Y T R A C W T F

Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.

He’s now implemented raytracing for Lavapipe.

It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.

CLosure

I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.

While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.

Fixups

A number of longstanding bugs have recently been fixed.

Wolfenstein Face

Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:

wolf-face.png

Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.

Apitrace: The Final Frontier

Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.

  • A number of games could be traced, but then replaying those traces would crash at a certain point. This is now fixed, enabling better bug reporting for a large number of AAA games from the the last decade.
  • Another set of games using the id Engine could record traces, but then replaying them would fail to render correctly:

wolf-trace.png

This affects (at least) Wolfenstein: The Old Blood and DOOM2016, but the problem has been identified, and a fix is on the way.

Zink: Exploring New Display Systems

After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.

The Real Post

Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.

During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.

While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.

So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.

The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:

  • zink receives a (transfer) command sequence which is impossible to reorder and must split the renderpass to execute copies/blits
  • the app randomly flushes during rendering
  • the GL frontend hits a TC synchronization point and halts the recording thread to wait for the driver thread to finish execution

The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.

Texture uploading.

Texture Uploads: How Do They Work?

There are (currently) three methods by which TC can perform texture uploads:

  • for small uploads, the data is enqueued and passed asynchronously to the driver thread
  • for larger uploads:
    • if renderpass tracking is enabled and a renderpass is active, the upload will be sequenced into N strided uploads and passed asynchronously to the driver thread to avoid splitting renderpasses
    • otherwise TC syncs the driver thread and performs the upload directly

Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.

You can see where this is going.

TC Execution: Define Optimal

Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:

ideal.png

Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:

  • the sequenced upload case will have more work, so the driver thread will run a bit longer than it otherwise would, resulting in the GL frontend waiting a bit longer than it otherwise would for completion

copies.png

  • the sync upload case creates a bubble in TC execution

sync.png

Solve For P

To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.

Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:

  • enqueued in a TC batch for execution
  • enqueued/active in a GPU submission

On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.

To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • delete staging image

For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame

For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.

The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.

For it to really work to its fullest potential in zink, unfortunately, requires VK_EXT_host_image_copy to avoid further staging copies, and nobody implements this yet in mesa main (except Lavapipe, though also there’s this ANV MR). But someday more drivers will support this, and then it’ll be great.

As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.

The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.

October 10, 2023

At the moment I am hard at work putting together the final bits for the AppStream 1.0 release (hopefully to be released this month). The new release comes with many new new features, an improved developer API and removal of most deprecated things (so it carefully breaks compatibility with very old data and the previous C API). One of the tasks for the upcoming 1.0 release was #481 asking about a formal way to distinguish Linux phone applications from desktop applications.

AppStream infamously does not support any “is-for-phone” label for software components, instead the decision whether something is compatible with a device is based the the device’s capabilities and the component’s requirements. This allows for truly adaptive applications to describe their requirements correctly, and does not lock us into “form factors” going into the future, as there are many and the feature range between a phone, a tablet and a tiny laptop is quite fluid.

Of course the “match to current device capabilities” check does not work if you are a website ranking phone compatibility. It also does not really work if you are a developer and want to know which devices your component / application will actually be considered compatible with. One goal for AppStream 1.0 is to have its library provide more complete building blocks to software centers. Instead of just a “here’s the data, interpret it according to the specification” API, libappstream now interprets the specification for the application and provides API to handle most common operations – like checking device compatibility. For developers, AppStream also now implements a few “virtual chassis configurations”, to roughly gauge which configurations a component may be compatible with.

To test the new code, I ran it against the large Debian and Flatpak repositories to check which applications are considered compatible with what chassis/device type already. The result was fairly disastrous, with many applications not specifying compatibility correctly (many do, but it’s by far not the norm!). Which brings me to the actual topic of this blog post: Very few seem to really know how to mark an application compatible with certain screen sizes and inputs! This is most certainly a matter of incomplete guides and good templates, so maybe this post can help with that a bit:

The ultimate cheat-sheet to mark your app “chassis-type” compatible

As a quick reminder, compatibility is indicated using AppStream’s relations system: A requires relation indicates that the system will not run at all or will run terribly if the requirement is not met. If the requirement is not met, it should not be installable on a system. A recommends relation means that it would be advantageous to have the recommended items, but it’s not essential to run the application (it may run with a degraded experience without the recommended things though). And a supports relation means a given interface/device/control/etc. is supported by this application, but the application may work completely fine without it.

I have a desktop-only application

A desktop-only application is characterized by needing a larger screen to fit the application, and requiring a physical keyboard and accurate mouse input. This type is assumed by default if no capabilities are set for an application, but it’s better to be explicit. This is the metadata you need:

<component type="desktop-application">
  <id>org.example.desktopapp</id>
  <name>DesktopApp</name>
  [...]
  <requires>
    <display_length>768</display_length>

    <control>keyboard</control>
    <control>pointing</control>
  </requires>
  [...]
</component>

With this requires relation, you require a small-desktop sized screen (at least 768 device-independent pixels (dp) on its smallest edge) and require a keyboard and mouse to be present / connectable. Of course, if your application needs more minimum space, adjust the requirement accordingly. Note that if the requirement is not met, your application may not be offered for installation.

Note: Device-independent / logical pixels

One logical pixel (= device independent pixel) roughly corresponds to the visual angle of one pixel on a device with a pixel density of 96 dpi (for historical X11 reasons) and a distance from the observer of about 52 cm, making the physical pixel about 0.26 mm in size. When using logical pixels as unit, they might not always map to exact physical lengths as their exact size is defined by the device providing the display. They do however accurately depict the maximum amount of pixels that can be drawn in the depicted direction on the device’s display space. AppStream always uses logical pixels when measuring lengths in pixels.

I have an application that works on mobile and on desktop / an adaptive app

Adaptive applications have fewer hard requirements, but a wide range of support for controls and screen sizes. For example, they support touch input, unlike desktop apps. An example MetaInfo snippet for these kind of apps may look like this:

<component type="desktop-application">
  <id>org.example.adaptive_app</id>
  <name>AdaptiveApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <supports>
    <control>keyboard</control>
    <control>pointing</control>
    <control>touch</control>
  </supports>
  [...]
</component>

Unlike the pure desktop application, this adaptive application requires a much smaller lowest display edge length, and also supports touch input, in addition to keyboard and mouse/touchpad precision input.

I have a pure phone/table app

Making an application a pure phone application is tricky: We need to mark it as compatible with phones only, while not completely preventing its installation on non-phone devices (even though its UI is horrible, you may want to test the app, and software centers may allow its installation when requested explicitly even if they don’t show it by default). This is how to achieve that result:

<component type="desktop-application">
  <id>org.example.phoneapp</id>
  <name>PhoneApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <recommends>
    <display_length compare="lt">1280</display_length>
    <control>touch</control>
  </recommends>
  [...]
</component>

We require a phone-sized display minimum edge size (adjust to a value that is fit for your app!), but then also recommend the screen to have a smaller edge size than a larger tablet/laptop, while also recommending touch input and not listing any support for keyboard and mouse.

Please note that this blog post is of course not a comprehensive guide, so if you want to dive deeper into what you can do with requires/recommends/suggests/supports, you may want to have a look at the relations tags described in the AppStream specification.

Validation

It is still easy to make mistakes with the system requirements metadata, which is why AppStream 1.0 will provide more commands to check MetaInfo files for system compatibility. Current pre-1.0 AppStream versions already have an is-satisfied command to check if the application is compatible with the currently running operating system:

:~$ appstreamcli is-satisfied ./org.example.adaptive_app.metainfo.xml
Relation check for: */*/*/org.example.adaptive_app/*

Requirements:
 • Unable to check display size: Can not read information without GUI toolkit access.
Recommendations:
 • No recommended items are set for this software.
Supported:
  Physical keyboard found.
  Pointing device (e.g. a mouse or touchpad) found.
 • This software supports touch input.

In addition to this command, AppStream 1.0 will introduce a new one as well: check-syscompat. This command will check the component against libappstream’s mock system configurations that define a “most common” (whatever that is at the time) configuration for a respective chassis type.

If you pass the --details flag, you can even get an explanation why the component was considered or not considered for a specific chassis type:

:~$ appstreamcli check-syscompat --details ./org.example.phoneapp.metainfo.xml
Chassis compatibility check for: */*/*/org.example.phoneapp/*

Desktop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Laptop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge 
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Server:
  Incompatible
 • requires: This software needs a display for graphical content.
 • recommends: This software needs a display for graphical content.
 • recommends: This software recommends a touch input device.

Tablet:
  Compatible (100%)

Handset:
  Compatible (100%)

I hope this is helpful for people. Happy metadata writing! 😀

October 06, 2023

Progress

Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but later I have found some time for hacking and got some interesting results out of it.

Refactored the Gallium front-end

As commented in the previous update, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.

I got to basically rewrite it (and removed any C++ remnants, on the way) and moved to a model in which the drivers would compile the operation blocks that they support to a format that can be quickly sent to the hardware.

As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.

Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.

Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.

Got MobileNet V1 to run

As part of the refactoring  from above, I got multiple operations in the same model to work, which got us to correctly running some inferences, even if at low accuracy rates:

by Julien Langlois CC BY-SA 3.0

tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
tflite_plugin_create_delegate
Teflon delegate: loaded etnaviv driver
INFO: Initialized TensorFlow Lite runtime.
PrepareDelegate
VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
0.960784: hen
0.015686: cock
0.007843: goose
0.003922: Pembroke
0.003922: Ibizan hound
time: 22.802ms
tflite_plugin_destroy_delegate

This matched bit by bit the output from the blob, even if I was doing some tensor operations by hand, on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5ms once we learn how to drive the TP units for tensor manipulation.

Presented this work at Embedded Recipes 2023

Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.


You can find the slides here, and listen to the talk at:



Next steps

The previous update got more in deep into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:

  1. Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network
  2. Learn to use the TP units to remove those costly transpositions and reshuffles in the CPU (at this point, we would have something useful to people on the field)
  3. Upstream changes to the Linux kernel
  4. Propose Teflon to the Mesa folks

September 26, 2023

Progress

With the kids back in school I have been able to work on the Vivante VIP NPU driver full-time during the two weeks after the last update, with quite some work coming out of the pipeline:

Found the problem with enabling the 8th NN core

Though I don't know exactly yet what the problem is, I found that by going back to a previous brute-force approach to powering up the NPU, the 8th core works just fine.

For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.

I bet there's either a register that is being written in the wrong order, or a delay between register writes that is too short. Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.

Added support for depthwise convolutions

MobileNetV1 introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a depthwise convolution to process each depth level separately, plus a pointwise convolution to rejoin them again. This offers the same result with 23x less multiplications, so it's very attractive for mobile use-cases.

This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.

Added support for pointwise convolutions

For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.

Added support for unsigned weights

TensorFlow Lite has moved towards implementing a new quantization specification which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights so we would need to convert them if we were to use TFLite's new quantization.

But the models that Google themselves publish make use of the ancient tooling that still support the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.

That also implied moving to per-tensor quantization, instead of per-axis.

Added support for higher IFMs and OFMs (up to 256 each)

In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.

With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 use (32, 64, 128, 256, 512 and 1024).

Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.

I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.


Next steps

Model partition compilation and resource management

I'm facing problems with testing coverage as we support so many different parameters that need to be tested in combination, with a explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker) I'm not able to run all the tests because I don't have proper resource management implemented and so we reach OOM before the end.

So my next task after I get back from Embedded Recipes will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.

This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.

Move development to Cottonwood A311D board

Da Xue of LibreComputer has got Etnaviv and Teflon working on the new boards that his company is releasing soon. One of them contain a A311D SoC, the same as the VIM3 I'm currently using for development. I will be initially targeting that one, and later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.

Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meet him in Paris this week and exchange notes.

Bigger coefficient tensors

The last known features missing before being able to run MobileNetV1 are IFMs and OFMs of 512 and 1024, each.

Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.

Medium term goals

I don't expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. I expect for the features below making the most impact in improving performance:
  1. Avoid copies in and out of the model partition, by mapping user buffers to the NPU
  2. Use the TP units for tensor manipulation (transposing, mostly)
  3. Properly configuring the automatic caching of kernels and images in the internal on-chip SRAM
  4. Use the external SRAM for intermediate tensor data
  5. Chain all TP and NN jobs in a model partition in the same command stream
  6. Enable zero-run-length compression in the coefficient buffer
  7. Tune the tiling parameters for reduced memory bandwidth usage
September 25, 2023

Recently I’ve been working on a project where I needed to convert an application written in OpenGL to a software renderer. The matrix transformation code in OpenGL made use of the GLM library for matrix math, and I needed to convert the 4x4 matrices to be 3x3 matrices to work with the software renderer. There was some existing code to do this that was broken, and looked something like this:

glm::mat3 mat3x3 = glm::mat3(mat4x4);

Don’t worry if you don’t see the problem already, I’m going to illustrate in more detail with the example of a translation matrix. In 3D a standard translation matrix to translate by a vector (x, y, z) looks something like this:

[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]

Then when we multiply this matrix by a vector like (a, b, c, 1) the result is (a + x, b + y, c + z, 1). If you don’t understand why the matrix is 4x4 or why we have that extra 1 at the end don’t worry, I’ll explain that in more detail later.

Now using the existing conversion code to get a 3x3 matrix will simply take the first 3 columns and first 3 rows of the matrix and produce a 3x3 matrix from those. Converting the translation matrix above using this code produces the following matrix:

[1 0 0]
[0 1 0]
[0 0 1]

See the problem now? The (x, y, z) values disappeared! In the conversion process we lost these critical values from the translation matrix, and now if we multiply by this matrix nothing will happen since we are just left with the identity matrix. So if we can’t use this simple “cast” function in GLM, what can we use?

Well one thing we can do is preserve the last column and last row of the matrix. So assume we have a 4x4 matrix like this:

[a b c d]
[e f g h]
[i j k l]
[m n o p]

Then preserving the last row and column we should get a matrix like this:

[a b d]
[e f h]
[m n p]

And if we use this conversion process for the same translation matrix we will get:

[1 0 x]
[0 1 y]
[0 0 1]

Now we see that the (x, y) part of the translation is preserved, and if we try to multiply this matrix by the vector (a, b, 1) the result will be (a + x, b + y, 1). The translation is preserved in the conversion!

Why do we have to use this conversion?

The reason the conversion is more complicated is hidden in how we defined the translation matrix and vector we wanted to translate. The vector was actually a 4D vector with the final component set to 1. The reason we do this is that we actually want to represent an affine space instead of just a vector space. An affine space being a type of space where you can have both points and vectors. A point is exactly what you would expect it to be just a point in space from some origin, and vector is a direction with magnitude but no origin. This is important because strictly speaking translation isn’t actually defined for vectors in a normal vector space. Additionally if you try to construct a matrix to represent translation for a vector space you’ll find that its impossible to derive a matrix to do this and that operation is not a linear function. On the other hand operations like translation are well defined in an affine space and do what you would expect.

To get around the problem of vector spaces, mathematicians more clever than I figured out you can implement an affine space in a normal vector space by increasing the dimension of the vector space by one, and by adding an extra row and column to the transformation matrices used. They called this a homogeneous coordinate system. This lets you say that a vector is actually just a point if the 4th component is 1, but if its 0 its just a vector. Using this abstraction one can implement all the well defined operations for an affine space (like translation!).

So using the “homogeneous coordinate system” abstraction, translation is an operation that defined by taking a point and moving it by a vector. Lets look at how that works with the translation matrix I used as an example above. If you multiply that matrix by a 4D vector where the 4th component is 0, it will just return the same vector. Now if we multiply by a 4D vector where the 4th component is 1, it will return the point translated by the vector we used to construct that translation matrix. This implements the translation operation as its defined in an affine space!

If you’re interested in understanding more about homogeneous coordinate spaces, (like how the translation matrix is derived in the first place) I would encourage you to look at resources like “Mathematics for Computer Graphics Applications”. They provide a much more detailed explanation than I am providing here. (The homogeneous coordinate system also has some benefits for representing projections which I won’t get into here, but are explained in that text book.)

Now to finally answer the question about why we needed to preserve those final columns and vectors. Based on what we now know, we weren’t actually just converting from a “3D space” to a “2D space” we were converting from a “3D homogeneous space” to a “2D homogeneous space”. The process of converting from a higher dimension matrix to a lower dimensional matrix is lossy and some transformation details are going to be lost in process (like for example the translation along the z-axis). There is no way to tell what kind of space a given matrix is supposed to transform just by looking at the matrix itself. The matrix does not carry any information about about what space its operating in and any conversion function would need to know that information to properly convert that matrix. Therefore we need develop our own conversion function that preserves the transformations that are important to our application when moving from a “3D homogeneous space” to a “2D homogeneous space”.

Hopefully this explanation helps if you are every working on converting 3D transformation code to 2D.

September 22, 2023

Remember Way Back When…

This blog was about pointlessly optimizing things? I’m talking like taking vkGetDescriptorSetLayoutSupport and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.

Well good news: this isn’t a post about those types of optimizations.

This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.

The Vulkan Queue: What Is It?

Lots of people are asking, but surely nobody reading this blog since you’re all experts. But if you have a friend who wants to know, here’s the official resource for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.

The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. How large is it? you might be asking.

I didn’t want to do this, but you’ve forced my hand.

The Vulkan Queue: Timing

What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For benchmarking, one might say.

That’s right, it’s time to yet again plug vkoverhead, the best and only tool for doing whatever I’m about to do.

Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of types_of_headaches.meme -> vulkan_headaches.meme. That’s why vkoverhead already has the -submit-only option in order to run a series of benchmark cases which have numbers that are totally not about to go up.

Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:

  • submit_noop submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline
  • submit_50noop submits nothing 50 times, which is to say it passes 50x VkSubmitInfo structs to vkQueueSubmit (or the 2 versions if sync2 is supported)
  • submit_1cmdbuf submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all
  • submit_50cmdbuf submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations
  • submit_50cmdbuf_50submit submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per vkQueueSubmit call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™

It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.

Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%

Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.

But why?

Why So Slow

Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.

queue-think.png

Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?

This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why vkQueueSubmit has the submitCount param and takes an array of submits.

But what does Mesa do here? Well, in the current code there’s this gem:

for (uint32_t i = 0; i < submitCount; i++) {
   struct vulkan_submit_info info = {
      .pNext = pSubmits[i].pNext,
      .command_buffer_count = pSubmits[i].commandBufferInfoCount,
      .command_buffers = pSubmits[i].pCommandBufferInfos,
      .wait_count = pSubmits[i].waitSemaphoreInfoCount,
      .waits = pSubmits[i].pWaitSemaphoreInfos,
      .signal_count = pSubmits[i].signalSemaphoreInfoCount,
      .signals = pSubmits[i].pSignalSemaphoreInfos,
      .fence = i == submitCount - 1 ? fence : NULL
   };
   VkResult result = vk_queue_submit(queue, &info);
   if (unlikely(result != VK_SUCCESS))
      return result;
}

Tremendous. It’s worth mentioning that not only is this splitting the batched submits into individual ones, each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.

Fast Forward

We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:

RADV GFX11:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%
↓
  40, submit_noop,                                        21008648,     100.0%
  41, submit_50noop,                                      4866415,      23.2%
  42, submit_1cmdbuf,                                     51294,        0.2%
  43, submit_50cmdbuf,                                     1823,         0.0%
  44, submit_50cmdbuf_50submit,                            1828,         0.0%

That’s like 1000% faster for case #41 and 50% faster for case #44.

But how does this affect other drivers? I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.

Lavapipe:

  40, submit_noop,                                        1972672,      100.0%
  41, submit_50noop,                                      40334,        2.0%
  42, submit_1cmdbuf,                                     5994597,      303.9%
  43, submit_50cmdbuf,                                    2623720,      133.0%
  44, submit_50cmdbuf_50submit,                           133453,       6.8%
↓
  40, submit_noop,                                        1980681,      100.0%
  41, submit_50noop,                                      1202374,      60.7%
  42, submit_1cmdbuf,                                     6340872,      320.1%
  43, submit_50cmdbuf,                                    2482127,      125.3%
  44, submit_50cmdbuf_50submit,                           1165495,      58.8%

3000% faster for #41 and 1000% faster for #44.

Intel DG2:

  40, submit_noop,                                        101336,       100.0%
  41, submit_50noop,                                       2123,         2.1%
  42, submit_1cmdbuf,                                     35372,        34.9%
  43, submit_50cmdbuf,                                      713,          0.7%
  44, submit_50cmdbuf_50submit,                             707,          0.7%
↓
  40, submit_noop,                                        106065,       100.0%
  41, submit_50noop,                                      105992,       99.9%
  42, submit_1cmdbuf,                                     35110,        33.1%
  43, submit_50cmdbuf,                                      709,          0.7%
  44, submit_50cmdbuf_50submit,                             702,          0.7%

5000% faster for #41 and a big 🤕 for #44 because Intel.

Turnip A740:

  40, submit_noop,                                        1227546,      100.0%
  41, submit_50noop,                                      26194,        2.1%
  42, submit_1cmdbuf,                                     1186327,      96.6%
  43, submit_50cmdbuf,                                    545341,       44.4%
  44, submit_50cmdbuf_50submit,                           16531,        1.3%
↓
  40, submit_noop,                                        1313550,      100.0%
  41, submit_50noop,                                      1078383,      82.1%
  42, submit_1cmdbuf,                                     1129515,      86.0%
  43, submit_50cmdbuf,                                    329247,       25.1%
  44, submit_50cmdbuf_50submit,                           484241,       36.9%

4000% faster for #41, 3000% faster for #44.

Pretty good, and it somehow manages to still be conformant.

Code here.

September 15, 2023

But Not Mine

If you’re reading, thanks for everything.

Glamorous

I planned to blog about it a while ago, but then I didn’t and news sites have since broken the news: Zink from Mesa main can run finally xservers.

Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re on Intel).

But what was so challenging about getting this to work? The answer won’t surprise you.

WSI

Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.

The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI:

  • implicit sync - “just fucking do it”
  • explicit sync - “I’ll tell you exactly when to do it”

From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.

Another way of looking at it is:

  • implicit sync - OpenGL
  • explicit sync - Vulkan

And, since xservers run on GL, you can see where this is going.

Implicitly Terrible

Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.

In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you glFlush, and magically it gets displayed.

Sound nuts? It is, and that’s why Vulkan doesn’t support it.

But zink uses Vulkan, so…

Send Eyebleach

Explicit sync is based on two concepts:

  • import
  • export

A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.

In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:

  • when doing a queue import (from FOREIGN) for a dmabuf, create and queue an export (DMA_BUF_IOCTL_EXPORT_SYNC_FILE) semaphore to be waited on before the current cmdbuf
  • when triggering a barrier on any exported dmabuf, queue an import (DMA_BUF_IOCTL_IMPORT_SYNC_FILE) semaphore to be signaled after the current cmdbuf
  • at submit time, serialize all the wait semaphores onto a separate queue submission before the main cmdbuf
  • at submit time, serialize all the signal semaphores onto a separate queue submission after the main cmdbuf
  • pray for modifiers to match up

Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.

Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.

September 08, 2023

Overdue

It’s been a busy week, and I’ve got posts I’ve been meaning to write. The problem is they’re long and I’m busy. That’s why I’m doing a shorter post today, since even just getting this one out somehow took 4+ hours while I was continually sidetracked by “real work”.

But despite this being a shorter post, don’t worry: the memes won’t be shorter.

Fighting!

I got a ticket very recently that I immediately jumped on and didn’t at all procrastinate or forget about. The ticket concerned a little game called THE KING OF FIGHTERS XIV.

Now for those of you unfamiliar, The King of Fighters is a long-running fighting game franchise which debuted in the 90s. At the arcade. Pretty sure I played it once. But at like a retro arcade or something because I’m not that old, fellow kids.

The bug in question was that when a match is won using a special move, half the frame would misrender:

kof.jpg

Heroically, the reporter posted a number of apitrace captures. Unfortunately that effort ended up being ineffectual since it did nothing but reveal yet another apitrace bug related to VAO uploads which caused replays of the traces to crash.

It was the worst kind of bug.

I was going to have to play a gametest the defect myself.

It would prove to be the greatest test of my skill yet. I would have to:

  • start the correct game
  • win a match
  • win a match using a special move
  • renderdoc capture at the exact right time during the finishing animation to capture the bug

Was I successful?

I’m not saying there’s someone out there who’s worse at the gamethe test app than a guy playingperforming exploratory tests on his keyboard under renderdoc. That’s not what I’m saying.

kof.png

Eddie Gordo Got Mad At This Combo

The debug process for this issue was, in contrast to the capture process, much simpler. I attribute this to the fact that, while I don’t own a gamepad for use with whatever test apps need to be run, I do have a code controller that I use for all my debugging:

code-controller.png

I’ve been hesitant to share such pro strats on the blog before, but SGC has been around for long enough now that even when the copycats start vlogging about my tech and showing off the frame data, everyone will recognize where it came from. All I ask is that you post clips of tournament popoffs.

Using my code controller, I was able to perform a debug -> light code -> light code -> debug -> heavy code -> compile -> block -> post meme -> reboot -> heavy code -> heavy code combo for an easy W.

To break down this advanced sequence, a small debug reveals that the issue is a render area clamped to 1024x1024 on a 1920x1080 frame. Since I have every line of the codebase memorized (zink main don’t @ me) it was instantly obvious that some poking was in order.

Vulkan has this pair of (awful) VUs:

VUID-VkRenderingInfo-pNext-06079
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the width of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.x + renderArea.extent.width

VUID-VkRenderingInfo-pNext-06080
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the height of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.y + renderArea.extent.height

which don’t match up at all to GL’s ability to throw whatever size framebuffer attachments at the GPU and have things come out fine. A long time ago I wrote this MR to clamp framebuffer size to the smallest attachment. But in this particular case, there are three framebuffer attachments:

  • 1920x1080
  • UNUSED
  • 1920x1080

The unused attachment ends up clamping the framebuffer to a smaller region to avoid violating spec, and this breaks rendering. Some light code pokes to skip clamping for NULL attachments open up the combo. Another quick debug doesn’t show the issue as being resolved, which means it’s time for some heavy code: checking for unused attachments in the fragment shader during renderpass start.

Naturally this triggers a full tree compile, which is a blocking operation that gives me enough time to execute a post meme for style points. The downside is that I’m using an AMD system, so as soon as I try to run the new code it hangs–it’s at this point that I nail a reboot to launch it into orbit.

I’m not looking for a record-setting juggle, so I finish off my combo with a heavy code -> heavy code finisher to hack in attachment write tracking for TC renderpass optimization and then plumb it through the rest of my stack so unused attachments will skip all renderpass-related operations.

Problem solved, and all without having to personally play any games.

Next Week

I’ll finally post that one post I’ve been planning to post for weeks but it’s hard and I just blew my entire meme budget for the month today so what is even going to happen who knows.

September 07, 2023

Progress

 This week started quite fruitfully, these features were added:

  • Convolutions with multiple input and output channels (input and output feature maps)
  • "Same" padding in convolutions

And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.

One more roadblock

Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convoluting the input to 32 feature maps.

Have checked and we are sending to the kernel bit-identical command streams and input buffers, so I suspect the problem will be somewhere in the kernel.

So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.

Want to try it out?

I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.

I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst

You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.


September 03, 2023
This is my final blog post for GSoC ‘23. For me, the past few months have been both challenging and rewarding. Although I didn’t achieve all the goals I set out to do, I learned a lot in the process. What Was Done As mentioned in the last post, we created a new project, boot2ipxe, to build netboot disk image for different platforms. With that as the base image, ipxe-boot-server could generate disk image that boots boot2container from a HTTPS server.
Long time no see! It’s time to share my progress on the GSoC project. Last time we manually built a netboot SD card image for RPi 3B+ and left some questions to be answered. Now let’s dig into those questions and show you what I’ve got. Boot2ipxe First, a new project, Boot2ipxe, is created for building and testing iPXE netboot disk image for both SBCs and x86 PC, which used to be part of ipxe-boot-server’s job.
September 01, 2023

 Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!

https://www.youtube.com/watch?v=HzzLY5TdnZo




if you disagree you’re wrong

August 28, 2023
Introduction Hello! This is the final report of the work I did as a Google Summer of Code 2023 contributor to NVK. My work revolved around the implementation of YCbCr format support, which came in form of enabling three Vulkan extensions. Mesa is the open-source, default graphics driver stack on Linux, with implementations for graphics hardware from most vendors. One such implementation is NV...

gitlab is down, post low-effort blogs and touch grass until it returns

August 27, 2023

The GSoC journey is coming to a close. In just over 100 days, I gained more experience in open-source development than I could ever imagine in this period.

Prior to GSoC, I was not used to regularly submit patches to the mailing lists. Now, I’ve sent many patches and revisions. I believe my interaction with the community will only grow. I learned so much about the tools and workflow of kernel development.

After this experience, I’m more than certain that I want to make this a job, contributing to open-source is fun, so why not make this a living :)

Goals

The main goal of the project was to increase the code coverage on the DRM core helper functions by creating unit tests.

As the coverage of all helpers is a big task for the time period, I decided to create tests for the drm_format_helper.c functions.

Throughout the project, other side tasks appeared. I will list the contributions made below.

GSoC contributions

Linux Kernel - VKMS

VKMS is a software-only model of a KMS driver that is useful for testing and running X (or similar) on headless machines.

This was, unexpectedly, a big part of my GSoC. I learned a lot about color formats and how a graphics driver works. Currently, only one piece of my work was upstreamed, the rest needs more work and was postponed in favor of the primary project goal.

Patch Status
drm/vkms: Add support to 1D gamma LUT Accepted

For more information go check my blogpost about the whole process.

IGT

IGT GPU Tools is a collection of tools for the development and testing of DRM drivers. While working on VKMS I used heavily the IGT framework for testing, in one occasion a bug made a test to stop working on the VKMS, so a submitted a patch to fix that.

Patch Status
lib/igt_fb: Add check for intel device on use_enginecopy Accepted

Linux Kernel - DRM

In the DRM subsystem, I’ve done the main project goal, contributed by adding unit tests, and also helped to fix some bugs that appeared while working on the tests. With the sent patches I got 71.5% of line coverage and 85.7% of function coverage on the drm_format_helper.c.

Patch Status
drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig Rejected
drm/tests: Alloc drm_device on drm_exec tests Accepted
drm/tests: Test default pitch fallback Accepted
drm/tests: Add KUnit tests for drm_fb_swab() Accepted
drm/tests: Add KUnit tests for drm_fb_clip_offset() Accepted
drm/tests: Add KUnit tests for drm_fb_build_fourcc_list() Accepted
drm/tests: Add multi-plane support to conversion_buf_size() Accepted
drm/tests: Add KUnit tests for drm_fb_memcpy() Accepted

Challenges Faced

I think the most difficult task was describing my work. Either on blog posts or in the commit messages, it takes a lot of work to write what you’ve done concisely and clearly. With time you get the way of things, but I think I can improve on this subject.

Moreover, many times I had to debug some problems. I already knew how to use GDB, but using it in the kernel is a little more cumbersome. After searching, reading the documentation, and getting tips from my mentors, I got it working.

On the VKMS, I had to create new features, this requires a lot of thought. I made a lot of diagrams in my head to understand how the color formats would be displayed in memory, and yet most of my work hasn’t seen the light of day XD.

What is left to do

I was able to do most of the proposed tasks. But the drm_xfrm_toio was left out due to the difficulty of testing it, as it uses IO memory. I tested the drm_fb_blit(), but I’m waiting for the acceptance of the patchset to send it, with that patch the line coverage will go to 89.2% and the function coverage will go to 94.3%.

Community Interaction

Besides patch submission, I reviewed some patches too. Going to the other side, I enjoyed thinking about how a piece of code could be improved.

Also, on one occasion I started a discussion about the best way to solve an issue by sending a patch. This got me a Reported-by tag on the patch that fixed the bug.

Patch
Re: [PATCH] drm/format_helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH] drm/format_helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH v2] drm/format-helper: Make conversion_buf_size() support sub-byte pixel fmts
Re: [PATCH v3 1/2] drm/format-helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH 1/5] drm/tests: Test drm_rect_intersect()
Re: [PATCH v2 1/5] drm/tests: Test drm_rect_intersect()
Re: [PATCH v2 1/7] drm/vkms: isolate pixel conversion functionality
Re: [PATCH v3 1/6] drm/vkms: isolate pixel conversion functionality
Re: [PATCH v4 3/5] drm/tests: Add test cases for drm_rect_calc_vscale()
Re: [PATCH v2 1/2] drm/vkms: allow full alpha blending on all planes
Re: [PATCH v2 1/2] drm: Add fixed-point helper to get rounded integer values
Re: [PATCH v2 2/2] drm/vkms: Fix RGB565 pixel conversion
Re: [PATCH v3 1/2] drm: Add fixed-point helper to get rounded integer values
Re: [PATCH 0/3] drm/vkms: Minor Improvements
Re: [PATCH v2] drm/vkms: Fix race-condition between the hrtimer and the atomic commit
Re: [PATCH] drm/tests: Add test case for drm_rect_clip_scaled()
Re: [PATCH v4] drm/vkms: Add support to 1D gamma LUT
Re: [PATCH] drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig
Re: [PATCH] drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig
Re: [PATCH] drm/tests: Alloc drm_device on drm_exec tests
Re: [PATCH] drm: Drop select FRAMEBUFFER_CONSOLE for DRM_FBDEV_EMULATION
Re: [PATCH -next 6/7] drm/format-helper: Remove unnecessary NULL values
Re: [PATCH 6/6] drm/format-helper: Add KUnit tests for drm_fb_memcpy()
Re: [PATCH v2 6/6] drm/tests: Add KUnit tests for drm_fb_memcpy()

Moreover, I use a Thunderbird addon to make the diff properly highlyted. When I was tinkering with the configuration, I noticed that the CSS of the configuration menu was wrong, so it made the user experience pretty bad.

I sent a patch fixing that to the maintainer of the addon, this patch generated a discussion that made a whole change in the CSS file due to Thunderbird updates.

Acknowledgments

I’d like to thank my mentors, André “Tony” Almeida, Maíra Canal, and Tales L. Aparecida. Their support and expertise were invaluable throughout this journey.

Moreover, I thank the X.Org Foundation for allowing me to participate in this program, and also for accepting my talk proposal on the XDC 2023.

Lastly, I thank my colleague Carlos Eduardo Gallo for exchanging knowledge during the weekly meetings.

August 26, 2023
Introduction: Not That Type of Advertisement Hello again! Last time, we talked about implementing multi-plane format support, and now after having everything fully wired up, we confirmed that it compiles and hence it’s time to ship it and move on to the next stage: YCbCr advertisement. Hm? I forgot something? It probably wasn’t important then. Probably. Maybe. :^) Put simply, advertisement is...
August 25, 2023

It's 5am and I have a headache. The perfect time for some reflection!

Not only that, but I've just had to play the part of Static Site Ungenerator, because I found out that I deleted the source of the last post and I didn't want to lose it in the upcoming publish. If your Atom feed went funky, sorry.

This document is my Final Work Submission, but is fun for all the family, including the ones who don't work at Google. Hi everyone!

&num What we wanted to happen

Going into the summer, the plan was to add functionality to wlroots so that its users (generally Wayland compositors) could more easily switch to a smarter frame schedule. I've had many goes at explaining the problem and they all sucked, so here we go again: if a compositor puts some thought into when it starts its render, desktop latency as perceived by the user can decrease. The computer will feel snappier.

wlroots started the summer with no accommodations for compositors that wanted to put thought into when they start to render. It assumed exactly no thought was to be put in, and left you on your own if you were to decide otherwise. But that has all changed!

The aim of my work could have comprised three things, but I added a fourth and then didn't have time for the third:

  1. measurement - a way to determine how long a render job took, from start (on the CPU) to finish (on the GPU).
  2. scheduling - the system that chooses when to tell a compositor that it should render. this wants a way for wlroots users to dictate when they want to start rendering a new frame, relative to when this frame is due to be displayed
  3. prediction - some clever maths that learns from the time took by previous renders and guesses how long the next one will take. this allows for moving the render start time closer to the frame deadline (good), carefully enough to avoid missing the deadline (very bad)
  4. bonus! tracing - measurement, but for humans. we're getting 60 new numbers every second, and it's going to be hard to make sense of them if they're just being printed out to the console

&num What happened

After some flailing around trying to add a delay to the existing scheduling, I started writing patches worth landing.

First came the render timer API. Now we can measure the duration of our render passes. This MR brought an abstraction for timers, and an implementation for wlroots' gles2 renderer.

Next, the scene timer API. wlr_scene does some of its own work before setting off the render pass itself, so it needed to become aware of timers and expose a way to use them.

Meanwhile, I was having another stab at configuring a frame delay. It wasn't very good, and the design of wlroots' scheduling and the complexity of the logic underneath it turned out to take a long time to get through. With this MR, though, I had a better idea of where I was trying to go. A long thought process followed, much of which lives in this issue, and further down we'll see what came of that.

Before working on a prediction algorithm, I wanted to be able to see live feedback on how render timings behaved and which frames were missed so that I could do a good (informed) job of predicting them. I took a detour into the world of tracing. libuserevents was spawned and so was the work to make use of it in wlroots. Linux's user_events tracing interface was appealing because it meant that GPUVis, an existing tool that can display a timeline of CPU and GPU events, would be able to show wlroots' events. Unfortunately Linux and I have so far struggled to get along and this work is still in progress - no submission yet because it's broken. Even more unfortunately, this meant that I wasn't able to get around to prediction.

Then I got tired of fighting that, and despite the words of discouragement...

“it's kind of impossible and I don't think we can do much better than !4214”

- a fool, a moron, a silly goose (me)

a refactor of wlroots' frame scheduling that allows us to do much better than !4214: !4307! This hasn't quite made it past the finish line, but it's close; I can feel it in my frames. It (in my opinion) neatly extracts the hairy logic that lived in wlr_output into a helper interface, allowing users to swap out which frame scheduler they use, or to forgo the helpers and roll their own without there being bits and pieces left over in the parts of wlroots that they do care about. This is the most exciting piece of the puzzle IMO; wlr_output has grown to have its fingers in many pies, and this MR reduces that and leaves wlr_output a little bit more friendly in a way that took a lot of brain cycles but turned out clean.

This new interface doesn't come with a frame delay option for free, but an implementation of the interface that has this feature is underway: !4334. It fits nicely! We hashed it out a little on IRC because the frame delay option is a surprisingly tricky constraint on the interface, but I think the conclusion is good. It was definitely a lot easier to write this with confidence after the scheduling redesign :)

To make this scheduling design possible and clean, a couple of little changes were needed in other areas, and thankfully the case for these changes was easy to make. They're helpful to me, but also make those parts of wlroots less surprising and/or broken. There was also a discussion about the fate of wlr_output.events.needs_frame, which is an extra complexity in wlroots' frame scheduling. It turned out that while removing it is possible, it wasn't necessary for the new scheduling system, so it continues in the background.

&num Loose ends

While libuserevents is usable, the wlroots integration is not ready.

There is sadly no "stock" plug-and-play prediction algorithm in wlroots.

The new scheduling infrastructure has not landed but I'm sure it will Soon™. The implementation with the frame delay option will hopefully follow shortly after. When (touch wood) it does, compositors will have to bring their own prediction algorithm, but a "good enough" algorithm can be very simple and given the current interface design can easily be swapped out for a stock one if one materialises.

And finally, the funniest one. I wrote an implementation of the timer API for wlroots' Vulkan renderer, and then put off submitting it for two months because everything else was more important. gles2 is the default renderer and supports roughly every GPU in existence. Writing the Vulkan timer was fun but landing it was less of a priority than every other task I had and nothing really depended on it, so it remains stuck on my laptop to this day. Perhaps I should get round to that.

&num Closure

The project didn't go how I expected it to - not even close. I even wrote up a schedule as part of my application that almost immediately turned out completely wrong. I'm not bothered, though, because it was fun, I made myself useful, and I met some cool people.

If you're considering doing something like I did, I can happily recommend Simon as a mentor, X.Org, and GSoC, in that order. Much love to Simon for making me feel comfortable when I really didn't know what I was doing, and for participating in my wildly off-topic free software rambles. I've only interacted with a small part of the X.Org community so far but it struck me from the start how welcoming everyone is; I have no doubts that the other X.Org project mentors are as lovely in their own ways. And of course, as a strong proponent of software that doesn't suck that's free, I have to appreciate that GSoC gave me a welcoming place to do my part in that and relieve my worldly pressures (did you know you have to pay for internet??).

Thanks everyone for putting up with me. If you would like to put up with me some more, click the links on the left - I'm not going anywhere, there's still work to do!

August 24, 2023

Progress

Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.

Since the last update I have:

  • implemented support for strided convolutions with more than one input channel, and
  • Implemented support for more than one output channel, but for now only for a single input channel.

Next steps are  to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.

As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.

IRC channel

A bunch of us have started to gather in the #ml-mainline IRC channel in OFTC to disucss matters about doing accelerated ML with mainline, on embedded.

For those of you that may not have a IRC bouncer setup yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:

https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/

Embedded recipes

I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, Paris 28-29 September. Slides and a recording will be published after the conference ends.

Sponsor

Last but not least, if I am able to invest so much effort on this is because the folks at LibreComputer have been supporting me financially this last couple of months.

Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.