Kernel Planet
January 15, 2024
Back in my blog post about Securing the Google SIP Stack, I did say I’d look at re-enabling SIP in Android-12, so with a view to doing that I tried building and booting LineageOS 19.1, but it crashed really early in the boot sequence (after the boot splash but before the boot animation started). It turns out that information on debugging the android early boot sequence is a bit scarce, so I thought I should write a post about how I did it just in case it helps someone else who’s struggling with a similar early boot problem.
How I usually Build and Boot Android
My builds are standard LineageOS with my patches to fix SIP and not much else. However, I do replace the debug keys with my signing keys and I also have an AVB key installed in the phone’s third party keyslot with which I sign the vbmeta for boot. This actually means that my phone is effectively locked but with a user supplied key (Yellow as google puts it).
My phone is now a pixel 3 (I had to say goodbye to the old Nexus One thanks to the US 3G turn off) and I do have a slightly broken Pixel 3 I play with for experimental patches, which is where I was trying to install Android-12.
Signing Seems to be the Problem
Just to verify my phone could actually boot a stock LineageOS (it could) I had to unlock it and this lead to the discovery that once unlocked, it would also boot my custom rom as well, so whatever was failing in early boot seemed to be connected with the device being locked.
I also discovered an interesting bug in the recovery rom fastboot: If you’re booting locked with your own keys, it will still let you perform all the usually forbidden fastboot commands (the one I was using was set_active). It turns out to be because of a bug in AOSP which treats yellow devices as unlocked in fastboot. Somewhat handy for debugging, but not so hot for security …
And so to Debugging Early Boot
The big problem with Android is there’s no way to get the console messages for early boot. Even if you enable adb early, it doesn’t get started until quite far in to the boot animation (which was way after the crash I was tripping over). However, android does have a pstore (previously ramoops) driver that can give you access to the previously crashed boot’s kernel messages (early init, fortunately, mostly logs to the kernel message log).
Forcing init to crash on failure
Ordinarily an init failure prints a message and reboots (to the bootloader), which doesn’t excite pstore into saving the kernel message log. fortunately there is a boot option (androidboot.init_fatal_panic) which can be set in the boot options (or kernel command line for a pixel-3 which can only boot the 4.9 kernel). If you build your own android, it’s fairly easy to add things to the android commandline (which is in boot.img) because all you need to do is extract BOOT/cmdline from the intermediate zip file you sign add any boot options you need and place it back in the zip file (before you sign it).
Unfortunately, this expedient didn’t work (no console logs appear in pstore). I did check that init was correctly panic’ing on failure by inducing an init failure in recovery mode and observing the panic (recovery mode allows you to run adb). But this induced panic also didn’t show up in pstore, meaning there’s actually some problem with pstore and early panics.
Security is the problem (as usual)
The actual problem turned out to be security (as usual): The pixel-3 does encrypted boot panic logs. The way this seems to work (at least in my reading of the google additional pstore patches) is that the bootloader itself encrypts the pstore ram area with a key on the /data partition, which means it only becomes visible after the device is unlocked. Unfortunately, if you trigger a panic before the device is unlocked (by echoing ‘c’ to /proc/sysrq-trigger) the panic message is lost, so pstore itself is useless for debugging early boot. There seems to be some communication of the keys by the vendor proprietary ramoops binary making it very difficult to figure out how it’s being done.
Why the early panic message is lost is a bit mysterious, but unfortunately pstore on the pixel-3 has several proprietary components around the encrypted message handling that make it hard to debug. I suspect if you don’t set up the pstore encryption keys, the bootloader erases the pstore ram area instead of encrypting it, but I can’t prove that.
Although it might be possible to fix the pstore drivers to preserve the ramoops from before device unlock, the participation of the proprietary bootloader in preserving the memory doesn’t make that look like a promising avenue to explore.
Anatomy of the Pixel-3 Boot Sequence
The Pixel-3 device boots through recovery. What this means is that the initial ramdisk (from boot.img) init is what boots both the recovery and normal boot paths. The only difference is that for recovery (and fastboot), the device stays in the ramdisk and for normal boot it mounts the /system partition and pivots to it. What makes this happen or not is the boot flag androidboot.force_normal_boot=1 which is added by the bootloader. Pretty much all the binary content and init rc files in the ramdisk are for recovery and its allied menus.
Since the boot paths are pretty radically different, because the normal boot first pivots to a first stage before going on to a second, but in the manner of containers, it might be possible to boot recovery first, start a dmesg logger and then re-exec init through the normal path
Forcing Re-Exec
The idea is to signal init to re-exec itself for the normal path. Of course, there have to be a few changes to do this: An item has to be added to the recovery menu to signal init and init itself has to be modified to do the re-exec on the signal (note you can’t just kick off an init with a new command line because init must be pid 1 for booting). Once this is done, there are problems with selinux (it won’t actually allow init to re-exec) and some mount moves. The selinux problem is fixable by switching it from enforcing to permissive (boot option androidboot.selinux=permissive) and the mount moves (which are forbidden if you’re running binaries from the mount points being moved) can instead become bind mounts. The whole patch becomes 31 insertions across 7 files in android_system_core.
The signal I chose was SIGUSR1, which isn’t usually used by anything in the bootloader and the addition of a menu item to recovery to send this signal to init was also another trivial patch. So finally we have a system from which I can start adb to trace the kernel log (adb shell dmesg -w) and then signal to init to re-exec. Surprisingly this worked and produced as the last message fragment:
[ 190.966881] init: [libfs_mgr]Created logical partition system_a on device /dev/block/dm-0
[ 190.967697] init: [libfs_mgr]Created logical partition vendor_a on device /dev/block/dm-1
[ 190.968367] init: [libfs_mgr]Created logical partition product_a on device /dev/block/dm-2
[ 190.969024] init: [libfs_mgr]Created logical partition system_ext_a on device /dev/block/dm-3
[ 190.969067] init: DSU not detected, proceeding with normal boot
[ 190.982957] init: [libfs_avb]Invalid hash size:
[ 190.982967] init: [libfs_avb]Failed to verify vbmeta digest
[ 190.982972] init: [libfs_avb]vbmeta digest error isn't allowed
[ 190.982980] init: Failed to open AvbHandle: No such file or directory
[ 190.982987] init: Failed to setup verity for '/system': No such file or directory
[ 190.982993] init: Failed to mount /system: No such file or directory
[ 190.983030] init: Failed to mount required partitions early …
[ 190.983483] init: InitFatalReboot: signal 6
[ 190.984849] init: #00 pc 0000000000123b38 /system/bin/init
[ 190.984857] init: #01 pc 00000000000bc9a8 /system/bin/init
[ 190.984864] init: #02 pc 000000000001595c /system/lib64/libbase.so
[ 190.984869] init: #03 pc 0000000000014f8c /system/lib64/libbase.so
[ 190.984874] init: #04 pc 00000000000e6984 /system/bin/init
[ 190.984878] init: #05 pc 00000000000aa144 /system/bin/init
[ 190.984883] init: #06 pc 00000000000487dc /system/lib64/libc.so
[ 190.984889] init: Reboot ending, jumping to kernel
Which indicates exactly where the problem is.
Fixing the problem
Once the messages are identified, the problem turns out to be in system/core ec10d3cf6 “libfs_avb: verifying vbmeta digest early”, which is inherited from AOSP and which even says in in it’s commit message “the device will not boot if: 1. The image is signed with FLAGS_VERIFICATION_DISABLED is set 2. The device state is locked” which is basically my boot state, so thanks for that one google. Reverting this commit can be done cleanly and now the signed image boots without a problem.
I note that I could also simply add hashtree verification to my boot, but LineageOS is based on the eng target, which has FLAGS_VERIFICATION_DISABLED built into the main build Makefile. It might be possible to change it, but not easily I’m guessing … although I might try fixing it this way at some point, since it would make my phones much more secure.
Conclusion
Debugging android early boot is still a terribly hard problem. Probably someone with more patience for disassembling proprietary binaries could take apart pixel-3 vendor ramoops and figure out if it’s possible to get a pstore oops log out of early boot (which would be the easiest way to debug problems). But failing that the simple hack to re-exec init worked enough to show me where the problem was (of course, if init had continued longer it would likely have run into other issues caused by the way I hacked it).
January 15, 2024 10:05 PM
January 02, 2024
Libraries are collections of code that are intended to be usable by multiple consumers (if you're interested in the etymology, watch this video). In the old days we had what we now refer to as "static" libraries, collections of code that existed on disk but which would be copied into newly compiled binaries. We've moved beyond that, thankfully, and now make use of what we call "dynamic" or "shared" libraries - instead of the code being copied into the binary, a reference to the library function is incorporated, and at runtime the code is mapped from the on-disk copy of the shared object[1]. This allows libraries to be upgraded without needing to modify the binaries using them, and if multiple applications are using the same library at once it only requires that one copy of the code be kept in RAM.
But for this to work, two things are necessary: when we build a binary, there has to be a way to reference the relevant library functions in the binary; and when we run a binary, the library code needs to be mapped into the process.
(I'm going to somewhat simplify the explanations from here on - things like symbol versioning make this a bit more complicated but aren't strictly relevant to what I was working on here)
For the first of these, the goal is to replace a call to a function (eg, printf()) with a reference to the actual implementation. This is the job of the linker rather than the compiler (eg, if you use the -c argument to tell gcc to simply compile to an object rather than linking an executable, it's not going to care about whether or not every function called in your code actually exists or not - that'll be figured out when you link all the objects together), and the linker needs to know which symbols (which aren't just functions - libraries can export variables or structures and so on) are available in which libraries. You give the linker a list of libraries, it extracts the symbols available, and resolves the references in your code with references to the library.
But how is that information extracted? Each ELF object has a fixed-size header that contains references to various things, including a reference to a list of "section headers". Each section has a name and a type, but the ones we're interested in are .dynstr and .dynsym. .dynstr contains a list of strings, representing the name of each exported symbol. .dynsym is where things get more interesting - it's a list of structs that contain information about each symbol. This includes a bunch of fairly complicated stuff that you need to care about if you're actually writing a linker, but the relevant entries for this discussion are an index into .dynstr (which means the .dynsym entry isn't sufficient to know the name of a symbol, you need to extract that from .dynstr), along with the location of that symbol within the library. The linker can parse this information and obtain a list of symbol names and addresses, and can now replace the call to printf() with a reference to libc instead.
(Note that it's not possible to simply encode this as "Call this address in this library" - if the library is rebuilt or is a different version, the function could move to a different location)
Experimentally, .dynstr and .dynsym appear to be sufficient for linking a dynamic library at build time - there are other sections related to dynamic linking, but you can link against a library that's missing them. Runtime is where things get more complicated.
When you run a binary that makes use of dynamic libraries, the code from those libraries needs to be mapped into the resulting process. This is the job of the runtime dynamic linker, or RTLD[2]. The RTLD needs to open every library the process requires, map the relevant code into the process's address space, and then rewrite the references in the binary into calls to the library code. This requires more information than is present in .dynstr and .dynsym - at the very least, it needs to know the list of required libraries.
There's a separate section called .dynamic that contains another list of structures, and it's the data here that's used for this purpose. For example, .dynamic contains a bunch of entries of type DT_NEEDED - this is the list of libraries that an executable requires. There's also a bunch of other stuff that's required to actually make all of this work, but the only thing I'm going to touch on is DT_HASH. Doing all this re-linking at runtime involves resolving the locations of a large number of symbols, and if the only way you can do that is by reading a list from .dynsym and then looking up every name in .dynstr that's going to take some time. The DT_HASH entry points to a hash table - the RTLD hashes the symbol name it's trying to resolve, looks it up in that hash table, and gets the symbol entry directly (it still needs to resolve that against .dynstr to make sure it hasn't hit a hash collision - if it has it needs to look up the next hash entry, but this is still generally faster than walking the entire .dynsym list to find the relevant symbol). There's also DT_GNU_HASH which fulfills the same purpose as DT_HASH but uses a more complicated algorithm that performs even better. .dynamic also contains entries pointing at .dynstr and .dynsym, which seems redundant but will become relevant shortly.
So, .dynsym and .dynstr are required at build time, and both are required along with .dynamic at runtime. This seems simple enough, but obviously there's a twist and I'm sorry it's taken so long to get to this point.
I bought a Synology NAS for home backup purposes (my previous solution was a single external USB drive plugged into a small server, which had uncomfortable single point of failure properties). Obviously I decided to poke around at it, and I found something odd - all the libraries Synology ships were entirely lacking any ELF section headers. This meant no .dynstr, .dynsym or .dynamic sections, so how was any of this working? nm asserted that the libraries exported no symbols, and readelf agreed. If I wrote a small app that called a function in one of the libraries and built it, gcc complained that the function was undefined. But executables on the device were clearly resolving the symbols at runtime, and if I loaded them into ghidra the exported functions were visible. If I dlopen()ed them, dlsym() couldn't resolve the symbols - but if I hardcoded the offset into my code, I could call them directly.
Things finally made sense when I discovered that if I passed the --use-dynamic argument to readelf, I did get a list of exported symbols. It turns out that ELF is weirder than I realised. As well as the aforementioned section headers, ELF objects also include a set of program headers. One of the program header types is PT_DYNAMIC. This typically points to the same data that's present in the .dynamic section. Remember when I mentioned that .dynamic contained references to .dynsym and .dynstr? This means that simply pointing at .dynamic is sufficient, there's no need to have separate entries for them.
The same information can be reached from two different locations. The information in the section headers is used at build time, and the information in the program headers at run time[3]. I do not have an explanation for this. But if the information is present in two places, it seems obvious that it should be able to reconstruct the missing section headers in my weird libraries? So that's what this does. It extracts information from the DYNAMIC entry in the program headers and creates equivalent section headers.
There's one thing that makes this more difficult than it might seem. The section header for .dynsym has to contain the number of symbols present in the section. And that information doesn't directly exist in DYNAMIC - to figure out how many symbols exist, you're expected to walk the hash tables and keep track of the largest number you've seen. Since every symbol has to be referenced in the hash table, once you've hit every entry the largest number is the number of exported symbols. This seemed annoying to implement, so instead I cheated, added code to simply pass in the number of symbols on the command line, and then just parsed the output of readelf against the original binaries to extract that information and pass it to my tool.
Somehow, this worked. I now have a bunch of library files that I can link into my own binaries to make it easier to figure out how various things on the Synology work. Now, could someone explain (a) why this information is present in two locations, and (b) why the build-time linker and run-time linker disagree on the canonical source of truth?
[1] "Shared object" is the source of the .so filename extension used in various Unix-style operating systems
[2] You'll note that "RTLD" is not an acryonym for "runtime dynamic linker", because reasons
[3] For environments using the GNU RTLD, at least - I have no idea whether this is the case in all ELF environments
comments
January 02, 2024 07:24 PM
December 30, 2023
A while ago I mentioned I use Android-10 with the built in SIP stack and that the Google stack was pretty buggy and I had to fix it simply to get it to function without disconnecting all the time. Since then I’ve upported my fixes to Android-11 (the jejb-11 branch in the repositories) by using LineageOS-19.1. However, another major deficiency in the Google SIP stack is its complete lack of security: both the SIP signalling and the media streams are all unencrypted meaning they can be intercepted and tapped by pretty much anyone in the network path running tcpdump. Why this is so, particularly for a company that keeps touting its security credentials is anyone’s guess. I personally suspect they added SIP in Android-4 with a view to basing Google Voice on it, decided later that proprietary VoIP protocols was the way to go but got stuck with people actually using the SIP stack for other calling services so they couldn’t rip it out and instead simply neglected it hoping it would die quietly due to lack of features and updates.
This blog post is a guide to how I took the fully unsecured Google SIP stack and added security to it. It also gives a brief overview of some of the security protocols you need to understand to get secure VoIP working.
What is SIP
What I’m calling SIP (but really a VoIP system using SIP) is a protocol consisting of several pieces. SIP (Session Initiation Protocol), RFC 3261, is really only one piece: it is the “signalling” layer meaning that call initiation, response and parameters are all communicated this way. However, simple SIP isn’t enough for a complete VoIP stack; once a call moves to in progress, there must be an agreement on where the media streams are and how they’re encoded. This piece is called a SDP (Session Description Protocol) agreement and is usually negotiated in the body of the SIP INVITE and response messages and finally once agreement is reached, the actual media stream for call audio goes over a different protocol called RTP (Real-time Transport Protocol).
How Google did SIP
The trick to adding protocols fast is to take them from someone else (if you’re open source, this is encouraged) so google actually chose the NIST-SIP java stack (which later became the JAIN-SIP stack) as the basis for SIP in android. However, that only covered signalling and they had to plumb it in to the android Phone model. One essential glue piece is frameworks/opt/net/voip which supplies the SDP negotiating layer and interfaces the network codec to the phone audio. This isn’t quite enough because the telephony service and the Dialer also need to be involved to do the account setup and call routing. It always interested me that SIP was essentially special cased inside these services and apps instead of being a plug in, but that’s due to the fact that some of the classes that need extending to add phone protocols are internal only; presumably so only manufacturers can add phone features.
Securing SIP
This is pretty easy following the time honoured path of sending messages over TLS instead of in the clear simply by using a TLS wrappering technique of secure sockets and, indeed, this is how RFC 3261 says to do it. However, even this minor re-engineering proved unnecessary because the nist-sip stack was already TLS capable, it simply wasn’t allowed to be activated that way by the configuration options Google presented. A simple 10 line patch in a couple of repositories (external/nist_sip, packages/services/Telephony and frameworks/opt/net/voip) fixed this and the SIP stack messaging was secured leaving only the voice stream insecure.
SDP
As I said above, the google frameworks/opt/net/voip does all the SDP negotiation. This isn’t actually part of SIP. The SDP negotiation is conducted over SIP messages (which means it’s secured thanks to the above) but how this should be done isn’t part of the SIP RFC. Instead SDP has its own RFC 4566 which is what the class follows (mainly for codec and port negotiation). You’d think that if it’s already secured by SIP, there’s no additional problem, but, unfortunately, using SRTP as the audio stream requires the exchange of additional security parameters which added to SDP by RFC 4568. To incorporate this into the Google SIP stack, it has to be integrated into the voip class. The essential additions in this RFC are a separate media description protocol (RTP/SAVP) for the secure stream and the addition of a set of tagged a=crypto: lines for key negotiation.
As will be a common theme: not all of RFC 4568 has to be implemented to get a secure RTP stream. It goes into great detail about key lifetime and master key indexes, neither of which are used by the asterisk SIP stack (which is the one my phone communicates with) so they’re not implemented. Briefly, it is best practice in TLS to rekey the transport periodically, so part of key negotiation should be key lifetime (actually, this isn’t as important to SRTP as it is to TLS, see below, which is why asterisk ignores it) and the people writing the spec thought it would be great to have a set of keys to choose from instead of just a single one (The Master Key Identifier) but realistically that simply adds a load of complexity for not much security benefit and, again, is ignored by asterisk which uses a single key.
In the end, it was a case of adding a new class for parsing the a=crypto: lines of SDP and doing a loop in the audio protocol for RTP/SAVP if TLS were set as the transport. This ended up being a ~400 line patch.
Secure RTP
RTP itself is governed by RFC 3550 which actually contains two separate stream descriptions: the actual media over RTP and a control protocol over RTCP. RTCP is mostly used for multi-party and video calls (where you want reports on reception quality to up/downshift the video resolution) and really serves no purpose for audio, so it isn’t implemented in the Google SIP stack (and isn’t really used by asterisk for audio only either).
When it comes to securing RTP (and RTCP) you’d think the time honoured mechanism (using secure sockets) would have applied although, since RTP is transmitted over UDP, one would have to use DTLS instead of TLS. Apparently the IETF did consider this, but elected to define a new protocol instead (or actually two: SRTP and SRTCP) in RFC 3711. One of the problems with this new protocol is that it also defines a new ciphersuite (AES_CM_…) which isn’t found in any of the standard SSL implementations. Although the AES_CM ciphers are very similar in operation to the AES_GCM ciphers of TLS (Indeed AES_GCM was adopted for SRTP in a later RFC 7714) they were never incorporated into the TLS ciphersuite definition.
So now there are two problems: adding code for the new protocol and performing the new encyrption/decryption scheme. Fortunately, there already exists a library (libsrtp) that can do this and even more fortunately it’s shipped in android (external/libsrtp2) although it looks to be one of those throwaway additions where the library hasn’t really been updated since it was added (for cuttlefish gcastv2) in 2019 and so is still at a pre 2.3.0 version (I did check and there doesn’t look to be any essential bug fixes missing vs upstream, so it seems usable as is).
One of the really great things about libsrtp is that it has srtp_protect and srtp_unprotect functions which transform SRTP to RTP and vice versa, so it’s easily possible to layer this library directly into an existing RTP implementation. When doing this you have to remember that the encryption also includes authentication, so the size of the packet expands which is why the initial allocation size of the buffers has to be increased. One of the not so great things is that it implements all its own crypto primitives including AES and SHA1 (which most cryptographers think is always a bad idea) but on the plus side, it’s the same library asterisk uses so how much of a real problem could this be …
Following the simple layering approach, I constructed a patch to do the RTP<->SRTP transform in the JNI code if a key is passed in, so now everything just works and setting asterisk to SRTP only confirms the phone is able to send and receive encrypted audio streams. This ends up being a ~140 line patch.
So where does DTLS come in?
Anyone spending any time at all looking at protocols which use RTP, like webRTC, sees RTP and DTLS always mentioned in the same breath. Even asterisk has support for DTLS, so why is this? The answer is that if you use RTP outside the SIP framework, you still need a way of agreeing on the keys using SDP. That key agreement must be protected (and can’t go over RTCP because that would cause a chicken and egg problem) so implementations like webRTC use DTLS to exchange the initial SDP offer and answer negotiation. This is actually referred to as DTLS-SRTP even though it’s an initial DTLS handshake followed by SRTP (with no further DTLS in sight). However, this DTLS handshake is completely unnecessary for SIP, since the SDP handshake can be done over TLS protected SIP messaging instead (although I’ve yet to find anyone who can convincingly explain why this initial handshake has to go over DTLS and not TLS like SIP … I suspect it has something to do with wanting the protocol to be all UDP and go over the same initial port).
Conclusion
This whole exercise ended up producing less than 1000 lines in patches and taking a couple of days over Christmas to complete. That’s actually much simpler and way less time than I expected (given the complexity in the RFCs involved), which is why I didn’t look at doing this until now. I suppose the next thing I need to look at is reinserting the SIP stack into Android-12, but I’ll save that for when Android-11 falls out of support.
December 30, 2023 03:58 PM
December 19, 2023
Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while.
The latest branch is at [1]. This branch has been updated for the new final headers.
Updated: It passes all of h265 CTS now, but it is failing one h264 test.
Initial ffmpeg support is [2].
[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads
[2] https://github.com/cyanreg/FFmpeg/commits/vulkan/
December 19, 2023 08:29 PM
Earlier this year, after Github accidentally committed their private RSA SSH host key to a public repository, I wrote about how better support for SSH host certificates would allow this sort of situation to be handled in a user-transparent way without any negative impact on security. I was hoping that someone would read this and be inspired to fix the problem but sadly that didn't happen so I've actually written some code myself.
The core part of this is straightforward - if a server presents you with a certificate associated with a host key, then make the trust in that host be whoever signed the certificate rather than just trusting the host key. This means that if someone needs to replace the host key for any reason (such as, for example, them having published the private half), you can replace the host key with a new key and a new certificate, and as long as the new certificate is signed by the same key that the previous certificate was, you'll trust the new key and key rotation can be carried out without any user errors. Hurrah!
So obviously I wrote that bit and then thought about the failure modes and it turns out there's an obvious one - if an attacker obtained both the private key and the certificate, what stops them from continuing to use it? The certificate isn't a secret, so we basically have to assume that anyone who possesses the private key has access to it. We may have silently transitioned to a new host key on the legitimate servers, but a hostile actor able to MITM a user can keep on presenting the old key and the old certificate until it expires.
There's two ways to deal with this - either have short-lived certificates (ie, issue a new certificate every 24 hours or so even if you haven't changed the key, and specify that the certificate is invalid after those 24 hours), or have a mechanism to revoke the certificates. The former is viable if you have a very well-engineered certificate issuing operation, but still leaves a window for an attacker to make use of the certificate before it expires. The latter is something SSH has support for, but the spec doesn't define any mechanism for distributing revocation data.
So, I've implemented a new SSH protocol extension that allows a host to send a key revocation list to a client. The idea is that the client authenticates to the server, receives a key revocation list, and will no longer trust any certificates that are contained within that list. This seems simple enough, but a naive implementation opens the client to various DoS attacks. For instance, if you simply revoke any key contained within the received KRL, a hostile server could revoke any certificates that were otherwise trusted by the client. The easy way around this is for the client to ensure that any revoked keys are associated with the same CA that signed the host certificate - that way a compromised host can only revoke certificates associated with that CA, and can't interfere with anyone else.
Unfortunately that still means that a single compromised host can still trigger revocation of certificates inside that trust domain (ie, a compromised host a.test.com could push a KRL that invalidated the certificate for b.test.com), because there's no way in the KRL format to indicate that a given revocation is associated with a specific hostname. This means we need a mechanism to verify that the KRL update is legitimate, and the easiest way to handle that is to sign it. The KRL format specifies an in-band signature but this was deprecated earlier this year - instead KRLs are supposed to be signed with the sshsig format. But we control both the server and the client, which means it's easy enough to send a detached signature as part of the extension data.
Putting this all together: you ssh to a server you've never contacted before, and it presents you with a host certificate. Instead of the host key being added to known_hosts, the CA key associated with the certificate is added. From now on, if you ssh to that host and it presents a certificate signed by that CA, it'll be trusted. Optionally, the host can also send you a KRL and a signature. If the signature is generated by the CA key that you already trust, any certificates in that KRL associated with that CA key will be incorporated into local storage. The expected flow if a key is compromised is that the owner of the host generates a new keypair, obtains a new certificate for the new key, and adds the old certificate to a KRL that is signed with the CA key. The next time the user connects to that host, they receive the new key and new certificate, trust it because it's signed by the same CA key, and also receive a KRL signed with the same CA that revokes trust in the old certificate.
Obviously this breaks down if a user is MITMed with a compromised key and certificate immediately after the host is compromised - they'll see a legitimate certificate and won't receive any revocation list, so will trust the host. But this is the same failure mode that would occur in the absence of keys, where the attacker simply presents the compromised key to the client before trust in the new key has been created. This seems no worse than the status quo, but means that most users will seamlessly transition to a new key and revoke trust in the old key with no effort on their part.
The work in progress tree for this is here - at the point of writing I've merely implemented this and made sure it builds, not verified that it actually works or anything. Cleanup should happen over the next few days, and I'll propose this to upstream if it doesn't look like there's any showstopper design issues.
comments
December 19, 2023 07:48 PM
December 11, 2023
Even if you’re a developer with legal leanings like me, you probably haven’t given much thought to the warranty disclaimer and the liability disclaimer that appears in almost every Open Source licence (see sections 14 and 15 of GPLv3). This post is designed to help you understand what they are, why they’re there and why we might need stronger defences in future thanks to a changing legal landscape.
History: Why no Warranty or Liability
It seems obvious that when considered in terms of what downstream gets from Open Source that an open ended obligation on behalf of upstream to fix your problems isn’t one of them because it wouldn’t be sustainable. Effectively the no warranty clause is notice that since you’re getting the code for free it comes with absolutely no obligations on developers: if it breaks, you get to fix it. This is why no warranty clauses have been present since the history of Open Source (and Free Software: GPLv1 included this). There’s also a historical commercial reason for this as well. Before the explosion of Open Source business models in the last decade, the Free Software Foundation (FSF) considered paid support for otherwise unsupported no warranty Open Source software to be the standard business model for making money on Open Source. Based on this, Cygnus Support (later Cygnus Solutions – Earliest web archive capture 1997) was started in 1989 with a business model of providing paid support and bespoke development for the compiler and toolchain.
Before 2000 most public opinion (when it thought about Open Source at all) was happy with this, because Open Source was seen by and large as the uncommercialized offerings of random groups of hackers. Even the largest Open Source project, the Linux kernel, was seen as the scrappy volunteer upstart challenging both Microsoft and the proprietary UNIXs for control of the Data Centre. On the back of this, distributions (Red Hat, SUSE, etc.) arose to commericallize support offerings around Linux to further its competition with UNIX and Windows and push it to win the war for the Data Centre (and later the Cloud).
The Rise of The Foundations: Public Perception Changes
The heyday explosion of volunteer Open Source happened in the first decade of the new Millennium. But volunteer Open Source also became a victim of this success: the more it penetrated industry, the greater control of the end product industry wanted. And, whenever there’s a Business Need, something always arises to fulfill it: the Foundation Model for exerting influence in exchange for cash. The model is fairly simple: interested parties form a foundation (or more likely go to a Foundation forming entity like the Linux Foundation). They get seats on the governing board, usually in proportion to their annual expenditure on the foundation and the foundation sets up a notionally independent Technical Oversight Body staffed by developers which is still somewhat beholden to the board and its financial interests. The net result is rising commercial franchise in Open Source.
The point of the above isn’t to say whether this commercial influence is good or bad, it’s to say that the rise of the Foundations have changed the public perception of Open Source. No longer is Open Source seen as the home of scrappy volunteers battling for technological innovation against entrenched commercial interests, now Open Source is seen as one more development tool of the tech industry. This change in attitude is pretty profound because now when a problem is found in Open Source, the public has no real hesitation in assuming the tech industry in general should be responsible; the perception that the no warranty clause protects innocent individual developers is supplanted by the perception that it’s simply one more tool big tech deploys to evade liability for the problems it creates. Some Open Source developers have inadvertently supported this notion by publicly demanding to be paid for working on their projects, often in the name of sustainability. Again, none of this is necessarily wrong but it furthers the public perception that Open Source developers are participating in a commercial not a volunteer enterprise.
Liability via Fiduciary Duty: The Bitcoin Case
An ongoing case in the UK courts (BL-2021-000313) between Tulip Trading and various bitcoin developers centers around the disputed ownership of about US$4bn in bitcoin. Essentially Tulip contends that it lost access to the bitcoins due to a computer hack but says that the bitcoin developers have a fiduciary duty to it to alter the blockchain code to recover its lost bitcoins. The unusual feature of this case is that Tulip sued the developers of the bitcoin code not the operators of the bitcoin network. (it’s rather like the Bank losing your money and then you trying to sue the Mint for recovery). The reason for this is that all the operators (the miners) use the same code base for the same blockchain and thus could rightly claim that it’s technologically impossible for them to recover the lost bitcoin, because that would necessitate a change to the fundamental blockchain code which only the developers control. The suit was initially lost by Tulip on the grounds of the no liability disclaimer, but reinstated by the UK appeal court which showed considerable interest in the idea that developers could pick up fiduciary liability in some cases, even though the suit may eventually get dismissed on the grounds that Tulip can’t prove it ever owned the US$4bn in bitcoins in the first place.
Why does all this matter? Well, even if this case resolves successfully, thanks to the appeal court ruling, the door is still open to others with less shady claims that they’ve suffered an injury due to some coding issue that gives developers fiduciary liability to them. The no warranty disclaimer is already judged not to be sufficient to prevent this, so the cracks are starting to appear in it as a defence against all liability claims.
The EU Cyber Resilience Act: Legally Piercing No Warranty Clauses
The EU Cyber Resilience Act (CRA) at its heart provides a fiduciary duty of care on all “digital components” incorporated into products or software offered on the EU market to adhere to prescribed cybersecurity requirements and an obligation to provide duty of care for these requirements over the whole lifecycle of such products or software. Essentially this is developer liability, notwithstanding any no warranty clauses, writ large. To be fair, there is currently a carve out for “noncommercial” Open Source but, as I pointed out above, most Open Source today is commercial and wouldn’t actually benefit from this. I’m not proposing to give a detailed analysis (many people have already done this and your favourite search engine will turn up dozens without even trying) I just want to note that this is a legislative act designed to pierce the no warranty clauses Open Source has relied on for so long.
EU CRA Politics: Why is this Popular?
Politicians don’t set out to effectively override licensing terms and contract law unless there’s a significant popularity upside and, if you actually canvas the general public, there is: People are tired of endless cybersecurity breaches compromising their private information, or even their bank accounts, and want someone to be held responsible. Making corporations pay for breaches that damage individuals is enormously popular (and not just in the EU). After all big Tech profits enormously from this, so big Tech should pay for the clean up when things go wrong.
Unfortunately, self serving arguments that this will place undue burdens on Foundations funded by starving corporations rather undermine the same arguments on behalf of individual developers. To the public at large such arguments merely serve to reinforce the idea that big Tech has been getting away with too much for too long. Trying to separate individual developer Open Source from corporate Open Source is too subtle a concept to introduce now, particularly when we, and the general public, have bought into the idea that they’re the same thing for so long.
So what should we do about this?
It’s clear that even if a massive (and expensive) lobbying effort succeeds in blunting the effect of the CRA on Open Source this time around, there will always be a next time because of the public desire for accountability for and their safety guarantees in cybersecurity practices. It is also clear that individual developer Open Source has to make common cause with commercial Open Source to solve this issue. Even though individuals hate being seen as synonymous with corporations, one of the true distinctions between Open Source and Free Software has always been the ability to make common cause over smaller goals rather than bigger philosophies and aspirations; so this is definitely a goal we can make a common cause over. This common cause means the eventual solution must apply to individual and commercial Open Source equally. And, since we’ve already lost the perception war, it will have to be something more legally based.
Indemnification: the Legal solution to Developer Liability
Indemnification means one party, in particular circumstances, agreeing to be on the hook for the legal responsibilities of another party. This is actually a well known way not of avoiding liability but transferring it to where it belongs. As such, it’s easily sellable in the court of public opinion: we’re not looking to avoid liability, merely trying to make sure it lands on those who are making all the money from the code.
The best mechanism for transmitting this is obviously the Licence and, ironically, a licence already exists with developer indemnity clauses: Apache-2 (clause 9). Unfortunately, the Apache-2 clause only attaches to an entity offering support for a fee, which doesn’t quite cover the intention of the CRA, which is for anyone offering a product in the EU market (whether free or for sale) should be responsible for its cybersecurity lifecycle, whether they offer support or not. However, it does provide a roadmap for what such a clause would look like:
If you choose to offer this work in whole or part as a component or product in a jurisdiction requiring lifecycle duty of care you agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your actions in such a jurisdiction.
Probably the wording would need some tweaking by an actual lawyer, but you get the idea.
Applying Indemnity to existing Licences
Obviously for a new project, the above clause can simply be added to the licence but for any existing project, since the clause is compatible with the standard no-warranty statements, it can be added after the fact without interfering with the existing operation of the licence or needing buy in from current copyright holders (there is an argument that this would represent an additional restriction within the meaning of GPL, but I addressed that here). This makes it very easy to add by anyone offering, for instance, a download over Github or Gitlab that could be incorporated by someone into a product in the EU.
Conclusion
Thanks to public perception, the issue of developer liability isn’t going to go away and lobbying will not forestall the issue forever, so a robust indemnity defence needs to be incorporated into Open Source licences so that Liability is seen to be accepted where it can best be served (by the people or corporation utilizing the code).
December 11, 2023 05:27 PM
December 10, 2023
Problem: I want to copy a file in git and preserve its history.
Solution: Holy guacamole!:
git checkout -b dup
git mv orig copy
git commit --author="Greg " -m "cp orig copy"
git checkout HEAD~ orig
git commit --author="Greg " -m "restore orig"
git checkout -
git merge --no-ff dup
December 10, 2023 02:21 AM
December 08, 2023
Problem: you run a Pleroma instance, and you want to restrict the composite timelines to logged-in users only, but without going to the full "public: false". This is a common ask, and the option restrict_unauthenticated is documented, but the syntax is far from obvious!
Solution:
config :pleroma, :restrict_unauthenticated,
timelines: %{local: true, federated: true}
December 08, 2023 08:32 PM
December 07, 2023
Problem: you create and VM like you always did, but RHEL 9 bombs with:
Fatal glibc error: CPU does not support x86-64-v2
Solution: as Dan Berrange explains in bug #2060839, a traditional default CPU model qemu64 is no longer sufficient. Unfortunately, there's no "qemu64-v2". Instead, you must select one of the real CPUs.
<cpu mode='host-model' match='exact' check='none'>
<model fallback='forbid'>Broadwell-v4</model>
</cpu>
December 07, 2023 05:54 AM
December 05, 2023
There's a decent number of laptops with fingerprint readers that are supported by Linux, and Gnome has some nice integration to make use of that for authentication purposes. But if you log in with a fingerprint, the moment you start any app that wants to access stored passwords you'll get a prompt asking you to type in your password, which feels like it somewhat defeats the point. Mac users don't have this problem - authenticate with TouchID and all your passwords are available after login. Why the difference?
Fingerprint detection can be done in two primary ways. The first is that a fingerprint reader is effectively just a scanner - it passes a graphical representation of the fingerprint back to the OS and the OS decides whether or not it matches an enrolled finger. The second is for the fingerprint reader to make that determination itself, either storing a set of trusted fingerprints in its own storage or supporting being passed a set of encrypted images to compare against. Fprint supports both of these, but note that in both cases all that we get at the end of the day is a statement of "The fingerprint matched" or "The fingerprint didn't match" - we can't associate anything else with that.
Apple's solution involves wiring the fingerprint reader to a secure enclave, an independently running security chip that can store encrypted secrets or keys and only release them under pre-defined circumstances. Rather than the fingerprint reader providing information directly to the OS, it provides it to the secure enclave. If the fingerprint matches, the secure enclave can then provide some otherwise secret material to the OS. Critically, if the fingerprint doesn't match, the enclave will never release this material.
And that's the difference. When you perform TouchID authentication, the secure enclave can decide to release a secret that can be used to decrypt your keyring. We can't easily do this under Linux because we don't have an interface to store those secrets. The secret material can't just be stored on disk - that would allow anyone who had access to the disk to use that material to decrypt the keyring and get access to the passwords, defeating the object. We can't use the TPM because there's no secure communications channel between the fingerprint reader and the TPM, so we can't configure the TPM to release secrets only if an associated fingerprint is provided.
So the simple answer is that fingerprint unlock doesn't unlock the keyring because there's currently no secure way to do that. It's not intransigence on the part of the developers or a conspiracy to make life more annoying. It'd be great to fix it, but I don't see an easy way to do so at the moment.
comments
December 05, 2023 06:32 AM
December 01, 2023
The Compute Express Link (CXL) microconference was held, for a second straight time, at this year's Linux Plumbers Conference. The goals for the track were to openly discuss current on-going development efforts around the core driver, as well as experimental memory management topics which lead to accommodating kernel infrastructure for new technology and use cases.
 |
| CXL session at LPC23 |
(i)
CXL Emulation in QEMU - Progress, status and most importantly what next? The cxl qemu maintainers presented the current state of the emulation, for which significant progress has been made, extending support beyond basic enablement. During this year, features such as volatile devices, CDAT, poison and injection infrastructure have been added upstream qemu, while several others are in the process, such as CCI/mailbox, Scan Media and dynamic capacity. There was also further highlighting of the latter, for which DCD support was presented along with extent management issues found in the 3.0 spec. Similarly, Fabric Management was another important topic, continuing the debate about qemu's role in FM development, which is still quite early. Concerns about the production (beyond testing) use cases for CCI kernel support were discussed, as well as semantics and interfaces that constrain qemu, such as host and switch coupling and differences with BMC behavior.
(ii)
CXL Type-2 core support. The state and purpose of existing experimental support for type 2 (accelerators) devices was presented, for both the kernel and qemu sides. The kernel support led to preliminary abstraction improvement work being upstreamed, facilitating actual accelerator integration with the cxl core driver. However, the rest is merely guess work and the floor is open for an actual hardware backed proposal. In addition, HDM-DB support would also be welcomed as a step forward. The qemu side is very basic and designed to just exercise core checks, for which it's emulation should be limited, specially in light of cxl_test.
(iii)
Plumbing challenges in Dynamic capacity device. An in-depth coverage and discussion, from a kernel side, of the state of DCD support and considerations around corner cases. Semantics of releasing DC for full partial extents (ranges) are two different beasts altogether. Releasing all the already given memory can simply require memory being offline and be done, avoiding unnecessary complexity in the kernel. Therefore the kernel can perfectly well reject the request, and FM design should keep that into consideration. Partial extents, on the other hand, are unsupported for the sake of simplicity, at least until a solid industry use case comes along. Forced DC removal of online memory semantics were also discussed, emphasizing that such DC memory is not guaranteed to ever be given back by the kernel, mapped or not. Forcing the event, the hardware does not care and the kernel has most likely crashed anyway. Support for extent tagging was another topic, establishing the need for supporting it, coupling a device to a tag domain, being a sensible use case. For now at least, the implementation can be kept to to simply enumerate tags and the necessary attributes to leave the memory matching to userspace, instead of more complex surgeries to create DAX devices on specific extents, dealing with sparse regions.
(iv)
Adding RAS Support for CXL Port Devices. Starting with a general overview of RAS, this touched on the current state for support in CXL 1.1 and 2.0. Special handling is required for RCH: due to the RCRB implementation, the RCH downstream port does not have a BDF, needed for AER error handling; this work was merged in v6.7. As for CXL Virtual Hierarchy implementation, it is left still open, potentially things could move away from the PCIe port service driver model, which is not entirely liked. There are however, clear requirements: not-CXL specific (AER is a PCIe protocol, used by CXL.io); implement driver callback logic specific to that technology or device, giving flexibility to handle that specific need; and allow enable/disable on a per-device granularity. There were discussions around the order for which a registration handler is added in the PCI port driver, noting that it made sense to go top-down from the port and searching children, instead of written from a lower level.
(v)
Shared CXL 3 memory: what will be required? Overview of the state, semantics and requirements for supporting shared fabric attached memory (FAM). A strong enablement use case is leveraging applications that already handle data sets in files. In addition appropriate workload candidates will fit the "master writer, multiple readers" read-only model for which this sort of machinery would make sense. Early results show that the benefits can out-weigh costly remote CXL memory access such as fitting larger data sets in FAM that would otherwise be possible in a single host. Similarly this avoids cache-coherency costs by simply never modifying the memory. A number of concrete data science and AI usecases were presented. Shared FAM is meant to be mmap-able, file-backed, special purpose memory, for which a FAMFS prototype is described, overcoming limitations of just using DAX device/FSDAX, such as distributing metadata in a shareable way.
(vi)
CXL Memory Tiering for heterogenous computing. Discusses the pros and cons of interleaving heterogeneous (ie: DRAM and CXL) memory through hardware and/or software for bandwidth optimization. Hardware interleaving is simple to configure through the BIOS, but limited by not allowing the OS to manage allocations, otherwise hiding the NUMA topology (single node) as well as being a static configuration. The software interleaving solves these limitations with hardware and relies on weighted nodes for allocation distribution when doing the initial mapping (vma). Several interfaces have been posted, which incrementally are converging into a NUMA node based interface. The caveat is to have a single (configurable) system-wide set of weights, or to allow more flexibility, such as hierarchically through cgroups - something which has not been particularly sold yet. Combining both hardware and software models relies on within a socket, splitting channels among respective DDR and CXL NUMA nodes for which software can explicitly (numactl) set the interleaving - it is still restrained however by being static as the BIOS is in charge of setting the number of NUMA nodes.
(vii)
A move_pages() equivalent for physical memory. Through an experimental interface, this focused on the semantics of tiering and device driven page movement. There are currently various mechanisms for access detection, such as PMU-based, fault hinting for page promotion and idle bit page monitoring; each with its set of limitations, while runtime overhead is a universal concern. Hardware mechanisms could help with the burden but the problem is that devices only know physical memory and must therefore do expensive reverse mapping lookups; nor are there any interfaces for this, and it is difficult to with out hardware standardization. A good starting point would be to keep the suggested move_phys_pages as an interface, but not have it be an actual syscall.
December 01, 2023 07:14 PM
November 26, 2023
Bro, u mad?
Zubrin lies like he breathes, not bothering even calculate in Excel, not to mention take integrals!
But they continue to believe him — because it's Zubrin!
When they discussed his "engine on salt of Uranium" [NSWR — zaitcev] I wrote that the engine will not work in principle, because a reactor can only work in case of effective deceleration of neutrons, but already at the temperature of the moderator of 3000 degrees (like in KIWI), the cross-section of fission decreases 10 times, and the critical mass increases proportionally. But nobody paid attention — who am I, who is Zubrin!
The core has to be hot, and the moderator has to be cool, this is essential.
But they continued to fantasize, is it going to be 100,000 degrees in there, or only 10,000?
No matter how much I pointed out the principal contradiction — here is the cold sub-critical solution, and here is super-critical plasma, only in a meter or two away, and therefore neutrons from this plasma fly into the solution — which will inevitably capture them, decelerate, and react, and therefore the whole concept goes down the toilet.
But they disucss this salt engine over decades, without trying to check Zubrin's claims.
All "normal" nuclear reactors work only in the region between "first" (including the delayed neutrons) and "second" (with fast neutrons) criticalities. Only in this region, control of the reactor is possible. By the way, the difference in breeding ratios is only 1.000 and 1.007 for slow neutrons and 1.002 for the fast ones (in case of Plutonium, this much is the case even for slow neutrons).
And by the way, average delay for delayed neutrons is 0.1 seconds! The solution has to remain in the active zone for 100 milliseconds, in order to capture the delayed neutrons! Not even the solid phase RD-0410 reached that much.
Therefore, Zubrin's engine must be critical at the prompt neutrons. And because the moderator underperforms because it's hot, prompt neutrons become indistinguishable from fission neutrons, and therefore the density of plasma has to be the same as density of metal in order to achieve criticality — that is to say, almost 20 g/sm^3 for Uranium.
But this persuades nobody, because Zubrin is Zubrin, and who are you?
It all began as a discussion of Mars Direct among geeks, but escalated quickly.
November 26, 2023 03:27 AM
November 25, 2023
Videos and slides from the 2023 Linux Security summits may be found here:
Linux Security Summit North America (LSS-NA), May 10-12 2023, Vancouver, Canada.
Linux Security Summit Europe (LSS-EU), September 20-21 2023, Bilbao, Spain.
Note: if you wish to follow Linux Security Summit announcements and event updates via Mastodon, see https://social.kernel.org/LinuxSecSummit. You can follow this via the Fediverse or the RSS reader of your choice.
November 25, 2023 08:32 PM
November 17, 2023
The current plan is to be co-located in Vienna with OSS-EU. We don’t have exact dates to give (still finding conference space) but it will be three days on the week of 16 September.
November 17, 2023 01:31 PM
November 12, 2023
As a reminder, The live stream of each main track of Linux Plumbers Conference will be available in real time on Youtube. The Links are now live in the timetable. To view, go to the Schedule Overview and click on the paperclip on the upper right of the track you want to watch to bring up the Live Stream URL.
Live Stream viewers may interact over chat by joining the Matrix Room of that event. To see all our Matrix rooms for Plumbers, go to the space #lpc2023:lpc.events in matrix. The room names should be pretty intuitive.
November 12, 2023 11:24 PM
November 09, 2023
The URL for the training session we did on Thursday morning is:
https://bbb2.lpc.events/playback/presentation/2.3/62e3456da3c0598910e28d204ee24b669d714c04-1699539923588?t=37m55s
Note that the URL skips to time index 37:55 which is where the training actually begins (the hackroom got started early).
November 09, 2023 08:46 PM
November 08, 2023
We’ll be holding a BBB Training session on Thursday (8 November) at:
7am PST, 10am EST, 3pm UTC, 4pm CET, 8:30pm IST, 12am Friday JST
This will be recorded so that you can watch it later.
What is BBB? It’s an open source video software, similar to Zoom and Google Meets, but is much better for interactions between remote attendees and a live audience.
There are several features that BBB provides, and this training session will go over the common ones that you will likely use during your presentation.
This session is highly recommend for those that are presenting remotely, and may also be useful for those that are only attending remotely, to get a feel for the platform. In person attendees are welcome too, but we’ll have shepherds in the conference rooms on the day to help you out.
To join, you will need to log in to: https://meet.lpc.events
After logging in, to join the meeting, click the Hackroom entry in the leftnav then select the join button of Hackroom 1.
November 08, 2023 02:06 PM
November 05, 2023
Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.
To get this working you need to install the firmware which hasn't landed in linux-firmware yet.
For Fedora this copr has the firmware in the necessary places:
https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/
Hopefully we can upstream that in next week or so.
If you have an ADA based GPU then it should just try and work out of the box, if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.
Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7, then going forward we have to work out how to keep it up to date and support new hardware and how to add new features.
November 05, 2023 08:23 PM
November 01, 2023
"Why does ACPI exist" - - the greatest thread in the history of forums, locked by a moderator after 12,239 pages of heated debate, wait no let me start again.
Why does ACPI exist? In the beforetimes power management on x86 was done by jumping to an opaque BIOS entry point and hoping it would do the right thing. It frequently didn't. We called this Advanced Power Management (Advanced because before this power management involved custom drivers for every machine and everyone agreed that this was a bad idea), and it involved the firmware having to save and restore the state of every piece of hardware in the system. This meant that assumptions about hardware configuration were baked into the firmware - failed to program your graphics card exactly the way the BIOS expected? Hurrah! It's only saved and restored a subset of the state that you configured and now potential data corruption for you. The developers of ACPI made the reasonable decision that, well, maybe since the OS was the one setting state in the first place, the OS should restore it.
So far so good. But some state is fundamentally device specific, at a level that the OS generally ignores. How should this state be managed? One way to do that would be to have the OS know about the device specific details. Unfortunately that means you can't ship the computer without having OS support for it, which means having OS support for every device (exactly what we'd got away from with APM). This, uh, was not an option the PC industry seriously considered. The alternative is that you ship something that abstracts the details of the specific hardware and makes that abstraction available to the OS. This is what ACPI does, and it's also what things like Device Tree do. Both provide static information about how the platform is configured, which can then be consumed by the OS and avoid needing device-specific drivers or configuration to be built-in.
The main distinction between Device Tree and ACPI is that Device Tree is purely a description of the hardware that exists, and so still requires the OS to know what's possible - if you add a new type of power controller, for instance, you need to add a driver for that to the OS before you can express that via Device Tree. ACPI decided to include an interpreted language to allow vendors to expose functionality to the OS without the OS needing to know about the underlying hardware. So, for instance, ACPI allows you to associate a device with a function to power down that device. That function may, when executed, trigger a bunch of register accesses to a piece of hardware otherwise not exposed to the OS, and that hardware may then cut the power rail to the device to power it down entirely. And that can be done without the OS having to know anything about the control hardware.
How is this better than just calling into the firmware to do it? Because the fact that ACPI declares that it's going to access these registers means the OS can figure out that it shouldn't, because it might otherwise collide with what the firmware is doing. With APM we had no visibility into that - if the OS tried to touch the hardware at the same time APM did, boom, almost impossible to debug failures (This is why various hardware monitoring drivers refuse to load by default on Linux - the firmware declares that it's going to touch those registers itself, so Linux decides not to in order to avoid race conditions and potential hardware damage. In many cases the firmware offers a collaborative interface to obtain the same data, and a driver can be written to get that. this bug comment discusses this for a specific board)
Unfortunately ACPI doesn't entirely remove opaque firmware from the equation - ACPI methods can still trigger System Management Mode, which is basically a fancy way to say "Your computer stops running your OS, does something else for a while, and you have no idea what". This has all the same issues that APM did, in that if the hardware isn't in exactly the state the firmware expects, bad things can happen. While historically there were a bunch of ACPI-related issues because the spec didn't define every single possible scenario and also there was no conformance suite (eg, should the interpreter be multi-threaded? Not defined by spec, but influences whether a specific implementation will work or not!), these days overall compatibility is pretty solid and the vast majority of systems work just fine - but we do still have some issues that are largely associated with System Management Mode.
One example is a recent Lenovo one, where the firmware appears to try to poke the NVME drive on resume. There's some indication that this is intended to deal with transparently unlocking self-encrypting drives on resume, but it seems to do so without taking IOMMU configuration into account and so things explode. It's kind of understandable why a vendor would implement something like this, but it's also kind of understandable that doing so without OS cooperation may end badly.
This isn't something that ACPI enabled - in the absence of ACPI firmware vendors would just be doing this unilaterally with even less OS involvement and we'd probably have even more of these issues. Ideally we'd "simply" have hardware that didn't support transitioning back to opaque code, but we don't (ARM has basically the same issue with TrustZone). In the absence of the ideal world, by and large ACPI has been a net improvement in Linux compatibility on x86 systems. It certainly didn't remove the "Everything is Windows" mentality that many vendors have, but it meant we largely only needed to ensure that Linux behaved the same way as Windows in a finite number of ways (ie, the behaviour of the ACPI interpreter) rather than in every single hardware driver, and so the chances that a new machine will work out of the box are much greater than they were in the pre-ACPI period.
There's an alternative universe where we decided to teach the kernel about every piece of hardware it should run on. Fortunately (or, well, unfortunately) we've seen that in the ARM world. Most device-specific simply never reaches mainline, and most users are stuck running ancient kernels as a result. Imagine every x86 device vendor shipping their own kernel optimised for their hardware, and now imagine how well that works out given the quality of their firmware. Does that really seem better to you?
It's understandable why ACPI has a poor reputation. But it's also hard to figure out what would work better in the real world. We could have built something similar on top of Open Firmware instead but the distinction wouldn't be terribly meaningful - we'd just have Forth instead of the ACPI bytecode language. Longing for a non-ACPI world without presenting something that's better and actually stands a reasonable chance of adoption doesn't make the world a better place.
comments
November 01, 2023 06:30 AM
October 29, 2023
Suppose you want to create a pipeline with the subprocess and you want to capture the stderr. A colleague of mine upstream wrote this:
p1 = subprocess.Popen(cmd1,
stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
p2 = subprocess.Popen(cmd2, stdin=p1.stdout,
stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
p1.stdout.close()
p1_stderr = p1.communicate()
p2_stderr = p2.communicate()
return p1.returncode or p2.returncode, p1_stderr, p2_stderr
Unfortunately, the above saves the p1.stdout in memory, which may come back to bite the user once the amount piped becomes large enough.
I think the right answer is this:
with tempfile.TemporaryFile() as errfile:
p1 = subprocess.Popen(cmd1,
stdout=subprocess.PIPE, stderr=errfile, close_fds=True)
p2 = subprocess.Popen(cmd2, stdin=p1.stdout,
stdout=subprocess.PIPE, stderr=errfile, close_fds=True)
p1.stdout.close()
p2.communicate()
p1.wait()
errfile.seek(0)
px_stderr = errfile.read()
return p1.returncode or p2.returncode, px_stderr
Stackoverflow is overflowing with noise on this topic. Just ignore it.
October 29, 2023 02:02 AM
October 26, 2023
The Pixel 8 hardware (Tensor G3) supports the ARM Memory Tagging Extension (MTE), and software support is available both in Android userspace and the Linux kernel. This feature is a powerful defense against linear buffer overflows and many types of use-after-free flaws. I’m extremely happy to see this hardware finally available in the real world.
Turning it on for userspace is already wired up the Android UI: Settings / System / Developer options / Memory Tagging Extension / Enable MTE until you turn if off. Once enabled it will internally change an Android “system property” named “arm64.memtag.bootctl” by adding the option “memtag“.
Turning it on for the kernel is slightly more involved, but not difficult at all. This requires manually setting the “arm64.memtag.bootctl” property mentioned above to include “memtag-kernel” as well:
- Plug your phone into a system that can run the
adb tool
- If not already installed, install
adb. For example on Debian/Ubuntu: sudo apt install adb
- Turn on “USB Debugging” in the phone’s “Developer options” menu, and accept the debugging session confirmation that will pop up when you first run
adb
- Verify the current setting:
adb shell getprop | grep memtag.bootctl
[arm64.memtag.bootctl]: [memtag]
- Enable kernel MTE:
adb shell setprop arm64.memtag.bootctl memtag,memtag-kernel
- Check the results:
adb shell getprop | grep memtag.bootctl
[arm64.memtag.bootctl]: [memtag,memtag-kernel]
- Reboot your phone
To check that MTE is enabled for the kernel (which is implemented using Kernel Address Sanitizer’s Hardware Tagging mode), you can check the kernel command line after rebooting:
$ mkdir foo && cd foo
$ adb bugreport
...
$ mkdir unpacked && cd unpacked
$ unzip ../bugreport*.zip
...
$ grep kasan= bugreport*.txt
...: Command line: ... kasan=off ... kasan=on ...
The latter “kasan=on” overrides the earlier “kasan=off“.
Enjoy!
© 2023, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
October 26, 2023 07:19 PM
October 13, 2023
The schedule for when the miniconferences and tracks are going to occur is now posted at: https://lpc.events/event/17/timetable/#all
The Linux Plumbers Refereed track schedule is now available at: https://lpc.events/event/17/timetable/#all.detailed
The runners for the miniconferences and kernel summit will be adding more details to each of their schedules over the coming weeks, as will the leads for the networking and toolchain tracks.
For those that are registered as in person, you are free to continue to submit Birds of a Feather(BOF) sessions. They will be allocated space in the BOF rooms on a first come, first serve basis. Please note that these BOFs will not be recorded.
We’re looking forward to a great 3 days of presentations and discussions. We hope you can join us either in-person or virtually!
October 13, 2023 03:29 AM
October 12, 2023
The Free Software Foundation Europe and the Software Freedom Conservancy recently released a statement that they would no longer work with Eben Moglen, chairman of the Software Freedom Law Center. Eben was the general counsel for the Free Software Foundation for over 20 years, and was centrally involved in the development of version 3 of the GNU General Public License. He's devoted a great deal of his life to furthering free software.
But, as described in the joint statement, he's also acted abusively towards other members of the free software community. He's acted poorly towards his own staff. In a professional context, he's used graphically violent rhetoric to describe people he dislikes. He's screamed abuse at people attempting to do their job.
And, sadly, none of this comes as a surprise to me. As I wrote in 2017, after it became clear that Eben's opinions diverged sufficiently from the FSF's that he could no longer act as general counsel, he responded by threatening an FSF board member at an FSF-run event (various members of the board were willing to tolerate this, which is what led to me quitting the board). There's over a decade's evidence of Eben engaging in abusive behaviour towards members of the free software community, be they staff, colleagues, or just volunteers trying to make the world a better place.
When we build communities that tolerate abuse, we exclude anyone unwilling to tolerate being abused[1]. Nobody in the free software community should be expected to deal with being screamed at or threatened. Nobody should be afraid that they're about to have their sexuality outed by a former boss.
But of course there are some that will defend Eben based on his past contributions. There were people who were willing to defend Hans Reiser on that basis. We need to be clear that what these people are defending is not free software - it's the right for abusers to abuse. And in the long term, that's bad for free software.
[1] "Why don't people just get better at tolerating abuse?" is a terrible response to this. Why don't abusers stop abusing? There's fewer of them, and it should be easier.
comments
October 12, 2023 04:32 PM
September 23, 2023
Now that the MC selection process is finished, we’ve recovered enough passes to reopen general registration. If you still wish to register, please go to our Attend page.
Hopefully we recovered enough passes to keep registration open for a couple of weeks, if not longer, but please don’t wait …
September 23, 2023 06:57 PM
September 13, 2023
TPMs contain a set of registers ("Platform Configuration Registers", or PCRs) that are used to track what a system boots. Each time a new event is measured, a cryptographic hash representing that event is passed to the TPM. The TPM appends that hash to the existing value in the PCR, hashes that, and stores the final result in the PCR. This means that while the PCR's value depends on the precise sequence and value of the hashes presented to it, the PCR value alone doesn't tell you what those individual events were. Different PCRs are used to store different event types, but there are still more events than there are PCRs so we can't avoid this problem by simply storing each event separately.
This is solved using the event log. The event log is simply a record of each event, stored in RAM. The algorithm the TPM uses to calculate the PCR values is known, so we can reproduce that by simply taking the events from the event log and replaying the series of events that were passed to the TPM. If the final calculated value is the same as the value in the PCR, we know that the event log is accurate, which means we now know the value of each individual event and can make an appropriate judgement regarding its security.
If any value in the event log is invalid, we'll calculate a different PCR value and it won't match. This isn't terribly helpful - we know that at least one entry in the event log doesn't match what was passed to the TPM, but we don't know which entry. That means we can't trust any of the events associated with that PCR. If you're trying to make a security determination based on this, that's going to be a problem.
PCR 7 is used to track information about the secure boot policy on the system. It contains measurements of whether or not secure boot is enabled, and which keys are trusted and untrusted on the system in question. This is extremely helpful if you want to verify that a system booted with secure boot enabled before allowing it to do something security or safety critical. Unfortunately, if the device gives you an event log that doesn't replay correctly for PCR 7, you now have no idea what the security state of the system is.
We ran into that this week. Examination of the event log revealed an additional event other than the expected ones - a measurement accompanied by the string "Boot Guard Measured S-CRTM". Boot Guard is an Intel feature where the CPU verifies the firmware is signed with a trusted key before executing it, and measures information about the firmware in the process. Previously I'd only encountered this as a measurement into PCR 0, which is the PCR used to track information about the firmware itself. But it turns out that at least some versions of Boot Guard also measure information about the Boot Guard policy into PCR 7. The argument for this is that this is effectively part of the secure boot policy - having a measurement of the Boot Guard state tells you whether Boot Guard was enabled, which tells you whether or not the CPU verified a signature on your firmware before running it (as I wrote before, I think Boot Guard has user-hostile default behaviour, and that enforcing this on consumer devices is a bad idea).
But there's a problem here. The event log is created by the firmware, and the Boot Guard measurements occur before the firmware is executed. So how do we get a log that represents them? That one's fairly simple - the firmware simply re-calculates the same measurements that Boot Guard did and creates a log entry after the fact[1]. All good.
Except. What if the firmware screws up the calculation and comes up with a different answer? The entry in the event log will now not match what was sent to the TPM, and replaying will fail. And without knowing what the actual value should be, there's no way to fix this, which means there's no way to verify the contents of PCR 7 and determine whether or not secure boot was enabled.
But there's still a fundamental source of truth - the measurement that was sent to the TPM in the first place. Inspired by Henri Nurmi's work on sniffing Bitlocker encryption keys, I asked a coworker if we could sniff the TPM traffic during boot. The TPM on the board in question uses SPI, a simple bus that can have multiple devices connected to it. In this case the system flash and the TPM are on the same SPI bus, which made things easier. The board had a flash header for external reprogramming of the firmware in the event of failure, and all SPI traffic was visible through that header. Attaching a logic analyser to this header made it simple to generate a record of that. The only problem was that the chip select line on the header was attached to the firmware flash chip, not the TPM. This was worked around by simply telling the analysis software that it should invert the sense of the chip select line, ignoring all traffic that was bound for the flash and paying attention to all other traffic. This worked in this case since the only other device on the bus was the TPM, but would cause problems in the event of multiple devices on the bus all communicating.
With the aid of this analyser plugin, I was able to dump all the TPM traffic and could then search for writes that included the "0182" sequence that corresponds to the command code for a measurement event. This gave me a couple of accesses to the locality 3 registers, which was a strong indication that they were coming from the CPU rather than from the firmware. One was for PCR 0, and one was for PCR 7. This corresponded to the two Boot Guard events that we expected from the event log. The hash in the PCR 0 measurement was the same as the hash in the event log, but the hash in the PCR 7 measurement differed from the hash in the event log. Replacing the event log value with the value actually sent to the TPM resulted in the event log now replaying correctly, supporting the hypothesis that the firmware was failing to correctly reconstruct the event.
What now? The simple thing to do is for us to simply hard code this fixup, but longer term we'd like to figure out how to reconstruct the event so we can calculate the expected value ourselves. Unfortunately there doesn't seem to be any public documentation on this. Sigh.
[1] What stops firmware on a system with no Boot Guard faking those measurements? TPMs have a concept of "localities", effectively different privilege levels. When Boot Guard performs its initial measurement into PCR 0, it does so at locality 3, a locality that's only available to the CPU. This causes PCR 0 to be initialised to a different initial value, affecting the final PCR value. The firmware can't access locality 3, so can't perform an equivalent measurement, so can't fake the value.
comments
September 13, 2023 09:02 PM
September 03, 2023
Linux Plumbers is now sold out and in-person registration is closed.
This year it happened not as fast as in 2022, but the registration is still sold out long before the event.
We are setting up a waitlist for in-person registration (virtual attendee places are still available). Please fill in this form and try to be clear about your reasons for wanting to attend. This year we’re giving waitlist priority to new attendees and people expected to contribute content.
September 03, 2023 10:00 AM
The Containers and Checkpoint/Restore micro-conference focuses on both userspace and kernel related work. The micro-conference targets the wider container ecosystem ideally with participants from all major container runtimes as well as init system developers.
The microconference will be discussing recent advancements in container technologies with some of the usual candidates being:
- CGroupV2 feature parity with CGroupV1
- Emulation of various files and system calls through FUSE and/or Seccomp
- Dealing with the eBPF-ification of the world
- Making user namespaces more accessible
- VFS idmap improvements
On the checkpoint/restore front, some of the potential topics include:
- Restoring FUSE services
- Handling GPUs
- Dealing with restartable sequences
And quite likely a variety of other container and checkpoint/restore topics as things evolve between now and the event.
Past editions of this micro-conference have been the source of many developments in the Linux kernel, including:
- PIDfds
- VFS idmap
- FUSE in user namespace
- Unprivileged overlayfs
- Time namespace
- A variety of CRIU features and checkpoint/restore kernel interfaces with the latest among them being
Use LPC abstract submission page to submit your proposals and select “Containers and Checkpoint/Restart” track.
September 03, 2023 09:26 AM
September 01, 2023
Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!
https://www.youtube.com/watch?v=HzzLY5TdnZo
September 01, 2023 07:12 PM
August 31, 2023
The Power Management and Thermal Control microconference focuses on power management and thermal control infrastructure, CPU and device power-management mechanisms, and thermal control methods.
In particular, we are interested in improving the thermal control infrastructure in the kernel to cover more use cases and utilizing energy-saving opportunities offered by modern hardware in new ways.
The goal is to facilitate cross-framework and cross-platform discussions that can help improve energy-awareness and thermal control in Linux.
The current list of topics proposed so far includes the following:
August 31, 2023 11:37 AM
August 29, 2023
Work involves supporting Windows (there's a lot of specialised hardware design software that's only supported under Windows, so this isn't really avoidable), but also involves git, so I've been working on extending our support for hardware-backed SSH certificates to Windows and trying to glue that into git. In theory this doesn't sound like a hard problem, but in practice oh good heavens.
Git for Windows is built on top of msys2, which in turn is built on top of Cygwin. This is an astonishing artifact that allows you to build roughly unmodified POSIXish code on top of Windows, despite the terrible impedance mismatches inherent in this. One is that until 2017, Windows had no native support for Unix sockets. That's kind of a big deal for compatibility purposes, so Cygwin worked around it. It's, uh, kind of awful. If you're not a Cygwin/msys app but you want to implement a socket they can communicate with, you need to implement this undocumented protocol yourself. This isn't impossible, but ugh.
But going to all this trouble helps you avoid another problem! The Microsoft version of OpenSSH ships an SSH agent that doesn't use Unix sockets, but uses a named pipe instead. So if you want to communicate between Cygwinish OpenSSH (as is shipped with git for Windows) and the SSH agent shipped with Windows, you need something that bridges between those. The state of the art seems to be to use npiperelay with socat, but if you're already writing something that implements the Cygwin socket protocol you can just use npipe to talk to the shipped ssh-agent and then export your own socket interface.
And, amazingly, this all works? I've managed to hack together an SSH agent (using Go's SSH agent implementation) that can satisfy hardware backed queries itself, but forward things on to the Windows agent for compatibility with other tooling. Now I just need to figure out how to plumb it through to WSL. Sigh.
comments
August 29, 2023 06:57 AM
Confidential Computing is continuing to remain a popular topic in computing industry. From memory encryption to trusted I/O, hardware has been constantly improving and broadening. In the past years, confidential computing microconferences have brought together developers working on various features in hypervisors, firmware, Linux kernel, low level userspace up to container runtimes. We have discussed a broad range of topics, ranging from, hardware enablement to generic attestation workflows.
Just in the last year, we have seen support for Intel TDX and AMD SEV-SNP guests merged into Linux. Support for unaccepted memory has also landed in mainline. We have also had support for running as a CVM under Hyper-V partially merged into the kernel. However, there is still a long way to go before a complete Confidential Computing stack with open source software and Linux as the hypervisor becomes a reality. We invite contributions to this microconference to help make progress to that goal.
Topics of interest include
Please use the LPC CfP process to submit your proposals. Submissions can be made via the LPC abstract submission page. Make sure to select “Confidential Computing MC” as the track.
August 29, 2023 04:48 AM
August 23, 2023
The IoT Microconference is a forum for developers to discuss all things IoT. Topics include tools, telemetry, device drivers, and protocols in not only the Linux kernel but also Real-Time Operating Systems such as Zephyr.
Since last year, there have been a number of new technical topics with significant updates.
- Opportunities in IoT and Edge computing with the Linux /dev/accel API
- Using the Thrift RPC framework between Linux and Zephyr
- Zephyr’s new HTTP Server (a GSoC project)
- Rust in the Zephyr RTOS: Benefits, Challenges and Missing Pieces
- BeagleConnect Freedom Updates, Greybus, and the Linux Interface
- Linux-wpan updates on 6lowpan, 802.15.4 PAN coordinators and UWB
Current Problems that require attention (stakeholders):
- IEEE 802.15.4 SubGHz improvement areas in Zephyr, Linux (Florian Grandel, Stefan Schmidt, BeagleBoard.org)
- WpanUSB upstreaming in the Linux kernel, potentially dropping Zephyr support (Andrei Emeltchenko, BeagleBoard.org)
- IEEE 802.15.4 Linux subsystem device association handling (Miquel Raynal, Alexander Aring, Stefan Schmidt)
- Zephyr potentially dropping Bluetooth IPSP?
On a slightly less technical topic.
- Reflections after Two Years of Zephyr LTSv2
We are pleased to announce that the IoT Microconference is now accepting proposals!
If you are interested in presenting an IoT-related topic involving the Linux kernel, userspace tools, firmware, Zephyr, or frameworks, please upload your submission before September 15th.
Submissions can be made via the LPC Call for Proposals, by selecting Internet of Things MC for your track.
August 23, 2023 02:53 PM
August 21, 2023
On behalf of the PCI sub-system maintainers, we would like to invite everyone to join the VFIO/IOMMU/PCI micro-conference (MC) this year.
We are hoping to bring together, both in person and online, everyone interested in the VFIO, IOMMU, and PCI space to talk about the latest developments and challenges in these areas.
The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:
These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualisation, and ubiquitous IoT devices.
The kernel code that enables these new system features focuses on coordination between the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough) with related kernel interfaces and userspace APIs to be designed in-sync and in a clean way for all three sub-systems.
The VFIO/IOMMU/PCI MC focuses on the kernel code that enables these new system features, often requiring coordination between the VFIO, IOMMU and PCI sub-systems.
Following the success of LPC 2017, 2019, 2020, 2021, and 2022 VFIO/IOMMU/PCI MC, the Linux Plumbers Conference 2023 VFIO/IOMMU/PCI track will focus on promoting discussions on the PCI core but also current kernel patches aimed at VFIO/IOMMU/PCI sub-systems with specific sessions targeting discussions requiring the three sub-systems coordination.
See the following video recordings from 2022: LPC 2022 – VFIO/IOMMU/PCI MC
Older recordings can be accessed through our official YouTube channel at @linux-pci and the archived LPC 2017 VFIO/IOMMU/PCI MC web page at Linux Plumbers Conference 2017, where the audio recordings from the MC track and links to presentation materials are available.
The tentative schedule will provide an update on the current state of VFIO/IOMMU/PCI kernel sub-systems, followed by a discussion of current issues in the proposed topics.
The following was a result of last year’s successful Linux Plumbers MC:
Tentative topics that are under consideration for this year include (but are not limited to):
- PCI
- VFIO
- Write-combine on non-x86 architectures
- I/O Page Fault (IOPF) for passthrough devices
- Shared Virtual Addressing (SVA) interface
- Single-root I/O Virtualization(SRIOV)/Process Address Space ID (PASID) integration
- PASID in SRIOV virtual functions
- Device assignment/sub-assignment
- IOMMU
- /dev/iommufd development
- IOMMU virtualisation
- IOMMU drivers SVA interface
- DMA-API layer interactions and the move towards generic dma-ops for IOMMU drivers
- Possible IOMMU core changes (e.g., better integration with the device-driver core, etc.)
If you are interested in participating in this MC and have topics to propose, please use the Call for Proposals (CfP) process.
Otherwise, join us to discuss helping Linux keep up with the new features added to the PCI interconnect specification. We hope to see you there!
Proposals can be submitted here here by selecting Track “VFIO/IOMMU/PCI MC“
August 21, 2023 08:41 AM
August 17, 2023
The Linux kernel has grown in complexity over the years. Complete understanding of how it works via code inspection has become virtually impossible. Today, tracing is used to follow the kernel as it performs its complex tasks. Tracing is used today for much more than simply debugging. Its framework has become the way for other parts of the Linux kernel to enhance and even make possible new features. Live kernel patching is based on the infrastructure of function tracing, as well as BPF function hooks. It is now even possible to model the behavior and correctness of the system via runtime verification which attaches to trace points. There is still much more that is happening in this space, and this microconference will be the forum to explore current and new ideas.
Results and accomplishments from the last Tracing microconference (2021):
- User events were introduced, and have finally made it into the kernel.
- The discussion around trace events to handle user faults initiated the event probe work around to the problem. That was to add probes on existing trace events to change their types. This works on synthetic events that can pass the user space file name of the entry of a system call to the exit of the system call which would have faulted in the file and make it available to the trace event.
- Dynamically creating the events directory with the eventfs patch set is queued to be accepted. This will save memory as the dentries and inodes will only be allocated when accessed.
- The discussion about function tracing with arguments has helped inspire both fprobes and function graph return value tracing.
- There’s still ongoing effort in unifying the return path tracers of function graph and kretprobes and fprobes.
Possible ideas for topics for this year’s conference:
- Use of sframes. How to get user space stack traces without requiring frame pointers.
- Updating perf and ftrace to extract user space stack frames from a schedulable context (as requested by NMI).
- Extending user events. Now that they are in the kernel, how to make them more accessible to users and applications.
- Getting more use cases with the runtime verifier. Now that the runtime verifier is in the kernel (uses tracepoints to model against), what else can it be used for.
- Wider use of ftrace_regs in fprobes and rethook from fprobes because rethook may not fill all registers in pt_regs too. How BPF handles this will also be discussed.
- Removing kretprobes from kprobes so that kprobe can focus on handling software breakpoint.
- Object tracing (following a variable throughout each function call). This has had several patches out, but has stopped due to hard issues to overcome. A live discussion could possibly come up with a proper solution.
- Hardware breakpoints and tracing memory changes. Object tracing follows a variable when it changes between function calls. But if the hardware supports it, tracing a variable when it actually changes would be more useful albeit more complex. Discussion around this may come up with a easier answer.
- MMIO tracer being used in SMP. Currently the MMIO tracer does not handle race conditions. Instead, it offlines all but one CPU when it is enabled. It would be great if this could be used in normal SMP environments. There’s nothing technically preventing that from happening. It only needs some clever thinking to come up with a design to do so.
- Getting perf counters onto the ftrace ring buffer. Ftrace is designed for fast tracing, and perf is a great profiler. Over the years it has been asked to have perf counters along side ftrace trace events. Perhaps its time to finally accomplish that. It could be that each function can show the perf cache misses of that function.
For more information, feel free to contact the MC Leads:
Please follow the suggestions from
this BLOG post when submitting a CFP for this track.
August 17, 2023 01:09 PM
August 09, 2023
Once again The Kernel Testing & Dependability Micro-conference will be taking place at LPC 2023, to discuss testing and dependability related topics.
Please submit proposals for discussion via LPC submission system.
The Linux Plumbers 2023 Kernel Testing & Dependability track focuses on advancing the current state of testing of the Linux Kernel and its related infrastructure. The main purpose is to improve software quality and dependability for applications that require predictability and trust.
The goal of this micro-conference is making connections between folks working on similar projects, and help individual projects make progress.
This track is intended to promote collaboration between all the communities and people interested in the Kernel testing & dependability. This will help move the conversation forward from where we left off at the LPC 2022 Kernel Testing & Dependability MC.
We ask that any topic discussions focus on issues/problems they are facing and possible alternatives to resolving them. The Micro-conference is open to all topics related to testing on Linux, not necessarily in the kernel space.
Suggested topics:
- KernelCI: Topics on improvements and enhancements for test coverage
- Growing KCIDB, integrating more sources
- Sanitizers
- Using Clang for better testing coverage
- How to spread KUnit throughout the kernel?
- Building and testing in-kernel Rust code
- Explore ways to improve testing framework and tests in the kernelwith a specific goal to increase traceability and code coverage
- Explore how do SBOMs figure into dependability?
List of accomplishments this past year after LPC 2022:
- Developed a new, modern API for KernelCI with Pub/Sub interface
- Added Rust coverage in KernelCI
- KCIDB is continuing to gather results from many test systems: KernelCI, Red Hat’s CKI, syzbot, ARM, Gentoo, Linaro’s TuxSuite etc. The current focus is on generating common email reports based on this data and dealing with known issues.
- KFENCE is continuing to aid in detecting Out-of-bound OOB accesses, use-after-free errors (UAF), Double free and Invalid free and so on.
- Clang: CFI, weeding out issues upstream, etc.
- Kselftest continues to add coverage for new and existing features and subsystems.
- KUnit is continuing to act as the standard for some drivers and a de facto unit testing framework in the kernel
- The Runtime Verification (RV) interface from Daniel Bristot de Oliveira was merged.
Proposals can be submitted here, by August 20th:
MC leads can be reached for question and further information::
Shuah Khan ([email protected])
Sasha Levin <[email protected]>
Guillaume Tucker <[email protected]>
August 09, 2023 03:40 PM
After a three-year hiatus, the Live Patching Microconference is back for 2023.
Accomplishments post 2019 Microconference:
- API enhancements: Livepatch pre/post (un)patch callback system state change tracking was added in v5.5. The new API enhances the safety of cumulative livepatch upgrades [v5.5]
- KLP-relocations: To facilitate module_disable_ro() removal, arch-specific livepatch .klp.arg sections were deprecated. Special arch section KLP-relocations (like x86 jump labels) are still supported for vmlinux cases, and are now applied at the same time as normal relocations. [v5.8]
- Documentation: Practical information on how to implement reliable stacktraces needed by the livepatching consistency model was added [v5.12]
- Architecture: Implemented Power32 support [v5.18]
- KLP-relocations: To support target module reloading, clear KLP-relocations in livepatch modules when their target module is unloaded. This satisfies a module loader sanity check when resolving relocations on the next target module load (x86_64 only) [v6.3]
Discussion Topics
The following topics have been proposed:
- Shadow variables are considered a livepatching power-feature that can require careful management, especially across livepatch up and downgrades. Is garbage collection or a refactoring of callbacks a better approach to manage these resources?
- klp-relocations were originally introduced to resolve livepatch / kernel and module symbol scoping issues. Recent security features like CET and IBT suggest another use case and renewed interest in having an in-tree klp-relocation build support. Is a simple conversion utility sufficient, or does said tool require greater features?
- The livepatching kselftests consist of test scripts under tools/testing/selftests and associated livepatch module code in lib/. Consolidating these under the former offers better flexibility in templating the livepatch modules as well as the benefits of building them out-of-tree. Are there any outstanding blockers to implement these changes?
- arm64 support is moving forward on several fronts: toolchain, reliable stack unwinding, user space, etc. The Toolchains MC plans to address topics like CFG in ELF and handling of noinstr functions. What issues remain in livepatching and the kernel at large to fully support arm64?
- Rust looks to be a hot topic at this year’s LPC. Its impact on kernel livepatching is relatively open ended as Rust code has only recently been merged in small parts. That said, which features, problems, patchsets should we be paying attention to as we all learn more about this newly supported kernel language?
These potential discussion topics were selected from on-going livepatching mailing list threads, but additional livepatching related topics are welcome for consideration as well. For ideas on what makes for an ideal Microconference topic, checkout this post.
August 09, 2023 03:38 PM
August 08, 2023
I dug out a computer running Fedora 28, which was released 2018-04-01 - over 5 years ago. Backing up the data and re-installing seemed tedious, but the current version of Fedora is 38, and while Fedora supports updates from N to N+2 that was still going to be 5 separate upgrades. That seemed tedious, so I figured I'd just try to do an update from 28 directly to 38. This is, obviously, extremely unsupported, but what could possibly go wrong?
Running sudo dnf system-upgrade download --releasever=38 didn't successfully resolve dependencies, but sudo dnf system-upgrade download --releasever=38 --allowerasing passed and dnf started downloading 6GB of packages. And then promptly failed, since I didn't have any of the relevant signing keys. So I downloaded the fedora-gpg-keys package from F38 by hand and tried to install it, and got a signature hdr data: BAD, no. of bytes(88084) out of range error. It turns out that rpm doesn't handle cases where the signature header is larger than a few K, and RPMs from modern versions of Fedora. The obvious fix would be to install a newer version of rpm, but that wouldn't be easy without upgrading the rest of the system as well - or, alternatively, downloading a bunch of build depends and building it. Given that I'm already doing all of this in the worst way possible, let's do something different.
The relevant code in the hdrblobRead function of rpm's lib/header.c is:
int32_t il_max = HEADER_TAGS_MAX;
int32_t dl_max = HEADER_DATA_MAX;
if (regionTag == RPMTAG_HEADERSIGNATURES) {
il_max = 32;
dl_max = 8192;
}
which indicates that if the header in question is RPMTAG_HEADERSIGNATURES, it sets more restrictive limits on the size (no, I don't know why). So I installed rpm-libs-debuginfo, ran gdb against librpm.so.8, loaded the symbol file, and then did disassemble hdrblobRead. The relevant chunk ends up being:
0x000000000001bc81 <+81>: cmp $0x3e,%ebx
0x000000000001bc84 <+84>: mov $0xfffffff,%ecx
0x000000000001bc89 <+89>: mov $0x2000,%eax
0x000000000001bc8e <+94>: mov %r12,%rdi
0x000000000001bc91 <+97>: cmovne %ecx,%eax
which is basically "If ebx is not 0x3e, set eax to 0xffffffff - otherwise, set it to 0x2000". RPMTAG_HEADERSIGNATURES is 62, which is 0x3e, so I just opened librpm.so.8 in hexedit, went to byte 0x1bc81, and replaced 0x3e with 0xfe (an arbitrary invalid value). This has the effect of skipping the if (regionTag == RPMTAG_HEADERSIGNATURES) code and so using the default limits even if the header section in question is the signatures. And with that one byte modification, rpm from F28 would suddenly install the fedora-gpg-keys package from F38. Success!
But short-lived. dnf now believed packages had valid signatures, but sadly there were still issues. A bunch of packages in F38 had files that conflicted with packages in F28. These were largely Python 3 packages that conflicted with Python 2 packages from F28 - jumping this many releases meant that a bunch of explicit replaces and the like no longer existed. The easiest way to solve this was simply to uninstall python 2 before upgrading, and avoiding the entire transition. Another issue was that some data files had moved from libxcrypt-common to libxcrypt, and removing libxcrypt-common would remove libxcrypt and a bunch of important things that depended on it (like, for instance, systemd). So I built a fake empty package that provided libxcrypt-common and removed the actual package. Surely everything would work now?
Ha no. The final obstacle was that several packages depended on rpmlib(CaretInVersions), and building another fake package that provided that didn't work. I shouted into the void and Bill Nottingham answered - rpmlib dependencies are synthesised by rpm itself, indicating that it has the ability to handle extensions that specific packages are making use of. This made things harder, since the list is hard-coded in the binary. But since I'm already committing crimes against humanity with a hex editor, why not go further? Back to editing librpm.so.8 and finding the list of rpmlib() dependencies it provides. There were a bunch, but I couldn't really extend the list. What I could do is overwrite existing entries. I tried this a few times but (unsurprisingly) broke other things since packages depended on the feature I'd overwritten. Finally, I rewrote rpmlib(ExplicitPackageProvide) to rpmlib(CaretInVersions) (adding an extra '\0' at the end of it to deal with it being shorter than the original string) and apparently nothing I wanted to install depended on rpmlib(ExplicitPackageProvide) because dnf finished its transaction checks and prompted me to reboot to perform the update. So, I did.
And about an hour later, it rebooted and gave me a whole bunch of errors due to the fact that dbus never got started. A bit of digging revealed that I had no /etc/systemd/system/dbus.service, a symlink that was presumably introduced at some point between F28 and F38 but which didn't get automatically added in my case because well who knows. That was literally the only thing I needed to fix up after the upgrade, and on the next reboot I was presented with a gdm prompt and had a fully functional F38 machine.
You should not do this. I should not do this. This was a terrible idea. Any situation where you're binary patching your package manager to get it to let you do something is obviously a bad situation. And with hindsight performing 5 independent upgrades might have been faster. But that would have just involved me typing the same thing 5 times, while this way I learned something. And what I learned is "Terrible ideas sometimes work and so you should definitely act upon them rather than doing the sensible thing", so like I said, you should not do this in case you learn the same lesson.
comments
August 08, 2023 05:54 AM
August 05, 2023
In the Linux ecosystems, there are many ways to build all the software used to put together a running system. Whether it’s building all the binary packages for a binary Linux distribution, using a source-based distribution, or building an embedded system from scratch, there are a lot of shared challenges which each system solves in its own way.
This microconference is a way to get people who work on disparate build systems to discuss common problems and possible shared solutions across the entire problem space. The kinds of topics we want to discuss are the following:
- Bootstrapping the build system
- Cross building software
- Make, autoconf, and other similar software build tools
- Package build systems, bitbake, emerge/portage, pacman, etc
- Packaging formats
- Managing software with language-specific package managers
- Patch sharing
- Building within a container
- Build systems for building containers
- License gathering and verification
- Security updates
- SBOMS
- Software chain-of-trust
- Repeatable builds
- Documentation and education
- Finding the next generation of maintainers
- Build-system visibility within the wider Plumbers attendeesThis is not a definitive list, and you are free to post abstracts for other related topics.
Build Systems micorconference would like to gather representatives (developers and maintainers) from all the various build systems and related technologies. This is not a definitive list of possible attendees.
- Android
- Arch Linux
- Buildroot
- ChromeOS
- Gentoo
- OpenEmbedded
- OpenWRT/LEDE
- Yocto Project
- Other traditional Binary Packaged distributions
For more information, feel free to contact the MC Leads:
Please follow the suggestions from
this BLOG post when submitting a CFP for this track.
August 05, 2023 09:48 AM
August 04, 2023
The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application ready, the main reason it was merged was because the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking, that is a parallel development, needed to make nvk useable). Faith at Collabora will have a blog post about the Mesa side, this is more about the kernel journey.
What was needed in the kernel?
The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.
- buffer objects (physical memory allocations) were allocated 1:1 with virtual memory allocations for a file descriptor. This meant the kernel managed the virtual address space. For proper Vulkan support, the bo allocation and vm allocation have to be separate, and userspace should control the virtual address space.
- Command submission didn't use sync objects. The nouveau command submission wasn't wired up to the modern sync objects. These are pretty much a requirement for Vulkan fencing and semaphores to work properly.
How to implement these?
When we kicked off the nvk idea I made a first pass at implementing a new user API, to allow the above features. I took at look at how the GPU VMA management was done in current drivers and realized that there was a scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.
What is the GPU VA manager?
The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments.
For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and memory allocations required to add new nodes to the maple tree. The dma-fence critical section is a hard requirement to make others deal with. In the end Danilo used an rbtree to track things. We will revisit if we can deal with maple tree again in the future.
We had a long discussion and a couple of implement it both ways and see, on whether we needed to track empty sparse VMA ranges in the manager or not, nouveau wanted these but generically we weren't sure they were helpful, but that also affected the uAPI as it needed explicit operations to create/drop these. In the end we started tracking these in the driver and left the core VA manager cleaner.
Now the code is in tree we will start to push future drivers to use it instead of spinning their own.
What changes are needed for nouveau?
Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM, a new API is needed to bind BO allocations with VM addresses. This is called the VM_BIND API. It has two variants
- a synchronous version that immediately maps a BO to a VM and is used for the common allocation paths.
- an asynchronous version that is modeled after the Vulkan sparse API, and takes in/out sync objects, which use the drm scheduler to schedule the vm/bo binding.
The VM BIND backend then does all the page table manipulation required.
The second API added was an EXEC call. This takes in/out sync objects and a set of addresses that point to command buffers to execute. This uses the drm scheduler to deal with the synchronization and hands the firmware the command buffer address to execute.
Internally for nouveau this meant having to add support for the drm scheduler, adding new internal page table manipulation APIs, and wiring up the GPU VA.
Shoutouts:
My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.
The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.
Faith at Collabora who has done the bulk of the work on nvk did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.
Karol at Red Hat on the main nvk driver and Ben at Red Hat for nouveau advice on how things worked, while he smashed away at the GSP rock.
(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-)
[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326
[2] https://cgit.freedesktop.org/drm-misc/log/
August 04, 2023 10:26 PM
August 02, 2023
August is now upon us, and the deadline for refereed track submissions is August 6, which is right around the corner. We have already received some excellent submissions, for which we gratefully thank our submitters!
For those thinking about submitting, please polish off your ideas, and point your browsers at the call-for-proposals page. Looking forward to your submissions.
Reminder: we’ve got a tight deadline to prepare the submissions for the LPC program committee to review, so, as communicated last year, we will not be extending the deadline this year, please submit by August 6th, anywhere on earth.
August 02, 2023 04:03 PM
July 30, 2023
LPC 2023 will host the second edition of the Rust MC. This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics. Proposals can be submitted via LPC submission system, selecting the Rust MC track.
Rust is a systems programming language that is making great strides in becoming the next big one in the domain. Rust for Linux is the project adding support for the Rust language to the Linux kernel.
Rust has a key property that makes it very interesting as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc. It also provides other important benefits, such as improved error handling, stricter typing, sum types, pattern matching, privacy, closures, generics, etc.
Possible Rust for Linux topics:
- Rust in the kernel (e.g. status update, next steps…).
- Use cases for Rust around the kernel (e.g. subsystems, drivers,
other modules…).
- Discussions on how to abstract existing subsystems safely, on API design, on coding guidelines…
- Integration with kernel systems and other infrastructure (e.g. build system, documentation, testing and CIs, maintenance, unstable features, architecture support, stable/LTS releases, Rust versioning, third-party crates…).Updates on its subprojects (e.g. klint, pinned-init)
Possible Rust topics:
- Language and standard library (e.g. upcoming features, stabilization of the remaining features the kernel needs, memory model…).
- Compilers and codegen (e.g. rustc improvements, LLVM and Rust, rustc_codegen_gcc, Rust GCC…).
- Other tooling and new ideas (bindgen, Cargo, Miri, Clippy, Compiler Explorer, Coccinelle for Rust…).
- Educational material.
- Any other Rust topic within the Linux ecosystem.
Last year was the first edition of the Rust MC and the focus was on showing the ongoing efforts by different parties (compilers, Rust for Linux, CI, eBPF…). Shortly after the Rust MC, Rust got merged into the Linux kernel. Abstractions are getting upstreamed, with the first major drivers looking to be merged soon: Android Binder, the Asahi GPU driver and the NVMe driver (presented in that MC).
July 30, 2023 07:05 AM
July 28, 2023
EOSS in Prague was great, lots of hallway track, good talks, good food,
excellent tea at meetea - first time I had proper tea
in my life, quite an experience. And also my first talk since covid, pack room
with standing audience, apparently one of the top ten most attended talks per
LF’s conference
report.
The video recording is now
uploaded, I’ve uploaded the fixed
slides, including the missing slide that I
accidentally cut in a last-minute edit. It’s the same content as my blog posts
from last year, first talking about locking engineering
principles and then the hierarchy of
locking engineering patterns.
July 28, 2023 12:00 AM
July 27, 2023
The Android Microconference brings the upstream community and Android systems developers together to discuss issues and changes to the Android platform and their dependencies and interactions with the Linux kernel, allowing for collaboration on solutions for upstream.
Since last year’s conference, there has been quite a bit of progress, specifically around:
Currently planned discussion topics for this year include:
- 16k Pages
- RISC-V
- android-mainline on Pixel6
- Updates on Binder
- BPF usage w/ Android
- Kernel and platform integration testing
- Vendor Hook Usage
- Building Modules for Android GKI Kernels
- Resolving Priority Inversion w/ Proxy Execution
- AOSP Devboards
- And likely more…
People are encouraged to submit topics related to new Android functionality as well as issues in getting that functionality upstream.
Please consider that the goal is to discuss open problems, preferably with patch set submissions already in discussion on LKML. The slots are very short (10-15 mins), and the main portion of the time should be given to the debate – thus, the importance of having an open and relevant problem, with people in the community engaged in the solution.
The CFP for the Android Micro-conference closes on Aug 15th, so get your topics in early!
Additionally, we already have a busy tentative schedule, but please submit your topics, and should it not fit, we hope to have additional discussion space in a follow-on BoF.
July 27, 2023 05:12 AM
July 23, 2023
Here are the list of microconferences at the 2023 Linux Plumbers Conference:
Some of the above already have a blog describing them in detail, and blogs for the rest will be coming shortly. If you plan on submitting a topic to one of these microconferences, please read the blog on what an ideal microconference topic submission is. After that, submit your topic and make sure that you select the appropriate track that you are submitting for (they are all listed under LPC Microconference and end with MC).
July 23, 2023 08:15 PM
July 21, 2023
We are pleased to announce that we will have a CXL MC this year at Plumbers, and hereby invite the community in our call for participation.
Compute Express Link is a cache coherent fabric that in recent years has been gaining momentum in the industry. CXL 3.0 launched just before Plumbers 2022 (where very early discussions took place), bringing new challenges such as dynamic capacity devices and large scale fabrics, two features that bring significant challenges to Linux. There also has been controversy and confusion in the Linux kernel community about the state and future of CXL, regarding its usage and integration into, for example, the core memory management subsystem. Many concerns have been put to rest through proper clarification and setting of expectations.
The Compute Express Link microconference focuses on how to evolve the Linux CXL kernel driver and userspace components for support of the CXL 2.0 spec (and beyond). The microconference provides a pace to open the discussion, incorporate more perspectives, and grow the CXL community with a goal that the CXL Linux plumbing serves the needs of the CXL ecosystem while balancing the needs of the Linux project. Specifically, this microconference welcomes submissions detailing industry and academia use cases in order to develop usage model scenarios. Finally, it will be a good opportunity to have existing upstream CXL developers available in a forum to discuss current CXL support and to communicate areas that need additional involvement.
Suggested topics:
- Ecosystem & Architectural review
- Dynamic Capacity Devices
- Fabric Management
- QEMU support
- Security (ie: IDE/SPDM)
- Managing vendor specificity
- Type 2 accelerator support (bias flip management)
- Coherence management of type2/3 memory (back-invalidation)
- Peer2Peer (ie: Unordered IO)
- Reliability, availability and serviceability (ie: Advanced Error Reporting, Isolation, Maintenance).
- Hotplug (QoS throttling, policies, daxctl)
- Hot remove
- Documentation
- Memory tiering topics that can relate to cxl (out of scope of MM/performance MCs)
- Industry and academia use cases
Proposals can be submitted here, by September 1st:
https://lpc.events/event/17/abstracts/
For more information, feel free to contact the Compute Express Link MC Leads:
Davidlohr Bueso <[email protected]>
Jonathan Cameron <[email protected]>
Adam Manzanares <[email protected]>
Dan Williams <[email protected]>
July 21, 2023 07:33 AM
July 14, 2023
I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run LLaMA model.
I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!
I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet which seems to be where tinygrad is more directed, but I may get to that next week.
While I was there though I decided to give the Mesa ACO compiler backend a go, it's been tied into radeonsi recently, and I done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.
tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.
The bench does 10 runs, the first is where lots of compilation happens, the last is well primed cache wise. There are the figures from the first and last runs with a release build of llvm and mesa. (and the ACO hacks).
LLVM:
215.78 ms cpy, 12245.04 ms run, 120.33 ms build, 12019.45 ms realize, 105.26 ms CL, -0.12 loss, 421 tensors, 0.04 GB used, 0.94 GFLOPS
10.25 ms cpy, 221.02 ms run, 83.50 ms build, 36.25 ms realize, 101.27 ms CL, -0.01 loss, 421 tensors, 0.04 GB used, 52.11 GFLOPS
ACO:
71.10 ms cpy, 3443.04 ms run, 112.58 ms build, 3214.13 ms realize, 116.34 ms CL, -0.04 loss, 421 tensors, 0.04 GB used, 3.35 GFLOPS
10.36 ms cpy, 234.90 ms run, 84.84 ms build, 36.51 ms realize, 113.54 ms CL, 0.05 loss, 421 tensors, 0.04 GB used, 49.03 GFLOPS
So ACO is about 4 times faster to compile but produces binaries that are less optimised.
The benchmark produces 148 shaders:
LLVM:
126 Max Waves: 16
6 Max Waves: 10
5 Max Waves: 9
6 Max Waves: 8
5 Max Waves: 4
ACO:
96 Max Waves: 16
36 Max Waves: 12
2 Max Waves: 10
10 Max Waves: 8
4 Max Waves: 4
So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]
I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip
July 14, 2023 04:30 AM
July 11, 2023
The phrase "Root of Trust" turns up at various points in discussions about verified boot and measured boot, and to a first approximation nobody is able to give you a coherent explanation of what it means[1]. The Trusted Computing Group has a fairly wordy definition, but (a) it's a lot of words and (b) I don't like it, so instead I'm going to start by defining a root of trust as "A thing that has to be trustworthy for anything else on your computer to be trustworthy".
(An aside: when I say "trustworthy", it is very easy to interpret this in a cynical manner and assume that "trust" means "trusted by someone I do not necessarily trust to act in my best interest". I want to be absolutely clear that when I say "trustworthy" I mean "trusted by the owner of the computer", and that as far as I'm concerned selling devices that do not allow the owner to define what's trusted is an extremely bad thing in the general case)
Let's take an example. In verified boot, a cryptographic signature of a component is verified before it's allowed to boot. A straightforward implementation of a verified boot implementation has the firmware verify the signature on the bootloader or kernel before executing it. In this scenario, the firmware is the root of trust - it's the first thing that makes a determination about whether something should be allowed to run or not[2]. As long as the firmware behaves correctly, and as long as there aren't any vulnerabilities in our boot chain, we know that we booted an OS that was signed with a key we trust.
But what guarantees that the firmware behaves correctly? What if someone replaces our firmware with firmware that trusts different keys, or hot-patches the OS as it's booting it? We can't just ask the firmware whether it's trustworthy - trustworthy firmware will say yes, but the thing about malicious firmware is that it can just lie to us (either directly, or by modifying the OS components it boots to lie instead). This is probably not sufficiently trustworthy!
Ok, so let's have the firmware be verified before it's executed. On Intel this is "Boot Guard", on AMD this is "Platform Secure Boot", everywhere else it's just "Secure Boot". Code on the CPU (either in ROM or signed with a key controlled by the CPU vendor) verifies the firmware[3] before executing it. Now the CPU itself is the root of trust, and, well, that seems reasonable - we have to place trust in the CPU, otherwise we can't actually do computing. We can now say with a reasonable degree of confidence (again, in the absence of vulnerabilities) that we booted an OS that we trusted. Hurrah!
Except. How do we know that the CPU actually did that verification? CPUs are generally manufactured without verification being enabled - different system vendors use different signing keys, so those keys can't be installed in the CPU at CPU manufacture time, and vendors need to do code development without signing everything so you can't require that keys be installed before a CPU will work. So, out of the box, a new CPU will boot anything without doing verification[4], and development units will frequently have no verification.
As a device owner, how do you tell whether or not your CPU has this verification enabled? Well, you could ask the CPU, but if you're doing that on a device that booted a compromised OS then maybe it's just hotpatching your OS so when you do that you just get RET_TRUST_ME_BRO even if the CPU is desperately waving its arms around trying to warn you it's a trap. This is, unfortunately, a problem that's basically impossible to solve using verified boot alone - if any component in the chain fails to enforce verification, the trust you're placing in the chain is misplaced and you are going to have a bad day.
So how do we solve it? The answer is that we can't simply ask the OS, we need a mechanism to query the root of trust itself. There's a few ways to do that, but fundamentally they depend on the ability of the root of trust to provide proof of what happened. This requires that the root of trust be able to sign (or cause to be signed) an "attestation" of the system state, a cryptographically verifiable representation of the security-critical configuration and code. The most common form of this is called "measured boot" or "trusted boot", and involves generating a "measurement" of each boot component or configuration (generally a cryptographic hash of it), and storing that measurement somewhere. The important thing is that it must not be possible for the running OS (or any pre-OS component) to arbitrarily modify these measurements, since otherwise a compromised environment could simply go back and rewrite history. One frequently used solution to this is to segregate the storage of the measurements (and the attestation of them) into a separate hardware component that can't be directly manipulated by the OS, such as a Trusted Platform Module. Each part of the boot chain measures relevant security configuration and the next component before executing it and sends that measurement to the TPM, and later the TPM can provide a signed attestation of the measurements it was given. So, an SoC that implements verified boot should create a measurement telling us whether verification is enabled - and, critically, should also create a measurement if it isn't. This is important because failing to measure the disabled state leaves us with the same problem as before; someone can replace the mutable firmware code with code that creates a fake measurement asserting that verified boot was enabled, and if we trust that we're going to have a bad time.
(Of course, simply measuring the fact that verified boot was enabled isn't enough - what if someone replaces the CPU with one that has verified boot enabled, but trusts keys under their control? We also need to measure the keys that were used in order to ensure that the device trusted only the keys we expected, otherwise again we're going to have a bad time)
So, an effective root of trust needs to:
1) Create a measurement of its verified boot policy before running any mutable code
2) Include the trusted signing key in that measurement
3) Actually perform that verification before executing any mutable code
and from then on we're in the hands of the verified code actually being trustworthy, and it's probably written in C so that's almost certainly false, but let's not try to solve every problem today.
Does anything do this today? As far as I can tell, Intel's Boot Guard implementation does. Based on publicly available documentation I can't find any evidence that AMD's Platform Secure Boot does (it does the verification, but it doesn't measure the policy beforehand, so it seems spoofable), but I could be wrong there. I haven't found any general purpose non-x86 parts that do, but this is in the realm of things that SoC vendors seem to believe is some sort of value-add that can only be documented under NDAs, so please do prove me wrong. And then there are add-on solutions like Titan, where we delegate the initial measurement and validation to a separate piece of hardware that measures the firmware as the CPU reads it, rather than requiring that the CPU do it.
But, overall, the situation isn't great. On many platforms there's simply no way to prove that you booted the code you expected to boot. People have designed elaborate security implementations that can be bypassed in a number of ways.
[1] In this respect it is extremely similar to "Zero Trust"
[2] This is a bit of an oversimplification - once we get into dynamic roots of trust like Intel's TXT this story gets more complicated, but let's stick to the simple case today
[3] I'm kind of using "firmware" in an x86ish manner here, so for embedded devices just think of "firmware" as "the first code executed out of flash and signed by someone other than the SoC vendor"
[4] In the Intel case this isn't strictly true, since the keys are stored in the motherboard chipset rather than the CPU, and so taking a board with Boot Guard enabled and swapping out the CPU won't disable Boot Guard because the CPU reads the configuration from the chipset. But many mobile Intel parts have the chipset in the same package as the CPU, so in theory swapping out that entire package would disable Boot Guard. I am not good enough at soldering to demonstrate that.
comments
July 11, 2023 07:58 AM
July 01, 2023
We are pleased to announce the first ever Linux Kernel Debugging Microconference, and we are now accepting proposals and problem statements.
Kernel debugging can be done in many ways with many purpose-built tools, from printk to Crash, Drgn, KDB/KGDB, and more. These tools are built on layers of standards, formats, implicit standards, and undocumented assumptions that make everything tick. When things work well, the tools stay out of your way and help you resolve your bug. But when things don’t work so well, you’re left debugging your debugger.
The Linux Kernel Debugging Microconference aims to bring together the developers and users of these tools to discuss the shared problems we face. We hope to discuss ongoing work that will improve the state of kernel debuggers, as well as new ideas that will require coordinated development across projects. Some possible topics might include:
- Alternative sources of debuginfo beyond DWARF (kallsyms, BTF, etc)
- Problems related to core debugging tools & utilities (`/proc/vmcore`, `/proc/kcore`, kexec, kdump, makedumpfile, libkdumpfile, and many more).
- Strategies to handle the interpretation of core kernel subsystems across versions (e.g. slab & vfs).
- Core dump formats and ways they can break & be repaired
Topics outside this narrow list are welcomed: we welcome any topic that would improve the debugging experience, or merits the attention of the developers of these tools & kernel subsystems. The best submissions will describe active work or open problems, and they will welcome debate, discussion, and community consensus.
Submissions can be made via the LPC Call for Proposals, by selecting Linux Kernel Debugging MC for your track.
July 01, 2023 04:22 PM
June 26, 2023
The Linux Plumbers’ microconference is a three and a half hour session focused on one general focus area. It can be on Android, power management, tracing, real-time or any of the other many subsystems in the Linux ecosystem. These sessions are broken up into smaller topics that are highly focused work meetings with the goal of accomplishing something during the brief discussions that happen during that time. A topic session ranges from 15 to 30 minutes in length, where no more than half the time is a presentation to bring everyone in the room (or online) up to speed about the issues that need to be discussed, and the rest of the time is spent on brainstorming ideas with the audience on how to accomplish solving the problems at hand. The problem does not need to be solved in this short time, but when time is up, the audience should understand what is at stake well enough to be productive offline in mailing lists and chat rooms.
Submitting a microconference topic
A microconference topic submission should be considered a problem statement and not an abstract. The submission should explain what the issue is that the submitter is struggling with, what has currently been done to try to solve it, and sometimes that means showing multiple solutions where there are pros and cons to each solution and the submitter wants to discuss which is better with the audience. There is the possible chance that the audience may even come up with a new solution that is better than what is being presented. The topic should be focused on what is currently being worked on and not about what was already done, unless the submitter wants to talk about what new can be done with what was already done.
Presenting the topic
The topic should start off with a presentation. The goal of the session is to come up with answers to the problem at hand. If the audience does not know the details of the issue, they are highly unlikely to come up with any productive input. The more the audience understands the problem, the likelier they will be able to help out. Due to the short time of the microconference topic session, it is imperative that the presentation is extremely focused on a need to know basis. That is, only present what is critical knowledge to understand the problem at hand. The quicker the audience can come up to speed, the more time there will be to have a productive discussion with them. There is no limit to the number of slides, but the focus should be on the time spent on the presentation.
Another difference between a microconference topic session and a normal presentation, is that there is no Q and A, but only discussions. A Q and A in presentations is where the audience asks the presenter questions and the presenter answers them. In a microconference topic session, the presenter starts with asking the audience questions and then there should be a back and forth between the audience and the presenter as well as between different members of the audience.
General information topics
One exception to the above is if the general focus area requires an understanding of a specific topic that all the other topics depend on. Some examples of this include RISC-V coming out with a new specification. The first topic in the microconference may be a 30 minute presentation about what details the new specification has that will impact further development. This is required information for the rest of the microconference to know in order to have proper decision making. The Android microconference had a similar case where the presentations were required for the other topics to be discussed. The general rule of thumb is that if a presentation is needed to have productive discussions then it is allowed. Due to the short time of a microconference, it is encouraged to have few of these types of presentations and better yet to have people do their homework before attending the microconference.
Attendee preparation
The focus of a microconference is to solve problems that exist today and come up with further innovations of tomorrow. The time constraint requires that everyone involved should be well prepared for the discussions that are to take place. The topics descriptions should include links to patch discussions on mailing lists, to wiki pages that describe the general focus area, or to anything that is not common knowledge to those not directly involved in the work. Linux Plumbers is about getting other experts outside the field to give input with a different perspective. Attendees should make an effort to read through the topics of all the microconferences and if there’s a topic of interest, they should read the links and familiarize themselves with the discussions that will take place. This will allow the attendees to be more productive than if they just come in without the understanding of the general focus area.
By following these general guidelines, Linux Plumbers will remain the most productive technical conference that one can attend.
June 26, 2023 12:02 PM
June 23, 2023
We’re holding another edition of the RISC-V microconference for Plumbers at 2023. Broadly speaking anything related to both Linux and RISC-V is on topic, but discussions tend to involve the following categories:
- How to support new RISC-V ISA features in Linux, both for the standards and for vendor-specific extensions.
- Discussions related to RISC-V based SOCs, which frequently include interactions with other Linux subsystems as well as core arch/riscv code.
- Coordination with distributions and toolchains on userspace-visible behavior.
Accomplishments post 2022 Microconference
All the talks at the 2022 Plumbers microconference have made at least some progress, with many of them resulting in big chunks of merged code.
Specifically:
- The riscv_hwprobe() syscall has been merged.
- Support for ACPI has been merged.
- Kconfig.socs is in the process of being refactored.
- Preliminary patches for the RISC-V TEE have been posted.
- Some optimized routines have been merged, but there’s still a long way to go.
- Text patching is still up in the air, but we’ve been working through many of the issues pointed out during the discussions.
Likely Topics for Discussion Sections
The actual list of topics tends to be hard to pin down this early, but here’s a few topics that have been floating around the mailing lists and may be easier to resolve in real-time:
- Do we even bother with generic optimized lib routines, or just go vendor-specific?
- When can we start deprecating stuff? Likely-unused bits include: rv32, nommu, xip, old toolchains.
- Is it time to give up on profiles and just set a base ourselves?
- CI: Hosting PW-NIPA (currently hosted by Conor/Microchip), hosting “upstream kernel ci” on Github w/ sponsored runners?
- Hardware assisted control-flow integrity on RISC-V CPUs.
- Handling text patching on RISC-V systems.
- How do we deal with vendor-specific memory management?
Submissions are made via LPC submission systems, selecting Track RISC-V MC
June 23, 2023 09:34 PM
June 20, 2023
The real-time and scheduling micro-conference joins these two intrinsically connected communities to discuss the next steps together.
Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include real-time mutexes, high-resolution timers, lockdep, ftrace, RCU_PREEMPT, threaded interrupt handlers, and more. The number of patches that need integration has been significantly reduced, and the rest is mature enough to make their way into mainline Linux.
The scheduler is at the core of Linux performance. With different topologies and workloads, giving the user the best experience possible is challenging, from low latency to high throughput and from small power-constrained devices to HPC, where CPU isolation is critical.
The following accomplishments have been made as a result of last year’s micro-conference:
Ideas of topics to be discussed include (but are not limited to):
- Improve responsiveness for CFS tasks – e.g., latency-nice patch
- The new EEVDF scheduler proposal
- Impact of new topology on CFS including hybrid or heterogeneous system
- Taking into account task profile with IPCC or uclamp
- Improvements in CPU Isolation
- The status of PREEMPT_RT
- Locking improvements – e.g., proxy execution
- Improvements on SCHED_DEADLINE
- Tooling for debugging scheduling and real-time
It is fine if you have a new topic that is not on the list. People are encouraged to submit any topic related to real-time and scheduling.
Please consider that the goal is to discuss open problems, preferably with patch set submissions already in discussion on LKML. The presentations are very short, and the main portion of the time should be given to the debate – thus, the importance of having an open and relevant problem, with people in the community engaged in the solution.
Submissions are made via LPC submission systems, selecting Track Real-time and Scheduling MC
June 20, 2023 08:59 PM
June 16, 2023
We’re happy to announce that registration for LPC 2023 is now open. To register please go to our attend page.
To try to prevent the instant sellout we had last year we’ve updated our cancellation policy to no refunds only transfers of registrations. You will find more details during the registration process. LPC 2023 follows the Linux Foundation’s health & safety policy.
As usual we expect to sell our rather quickly so don’t delay your registration for too long!
June 16, 2023 03:00 PM
June 14, 2023
Registration for LPC 2023 will be opened soon. Past experience told us that in-person registration would be sold out very fast. If you plan to join us in Richmond, please follow our blog and social media for the announcements about the registration!
June 14, 2023 09:04 AM
June 12, 2023
The v2023.06.11a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.
This release contains a new section on thermal throttling (along with a new cartoon), improvements to the memory-ordering chapter (including intuitive subsets of the Linux-kernel memory model), fixes to the deferred-processing chapter, additional clocksource-deviation material to the "What Time Is It?" section, and numerous fixes inspired by questions and comments from readers. Discussions with Yariv Aridor were especially fruitful. Akira Yokosawa contributed some quick quizzes and other upgrades of the technical discussions, along with a great many improvements to grammar, glossaries, epigraphs, and the build system. Leonardo Bras also provided some much-appreciated build-system improvements, and also started up continuous integration for some of the code samples.
Elad Lahav, Alan Huang, Zhouyi Zhou, and especially SeongJae Park contributed numerous excellent fixes for grammatical and typographical errors. SeongJae's fixes were from his Korean translation of this book.
Elad Lahav, Alan Huang, and Patrick Pan carried out some much-needed review of the code samples and contributed greatly appreciated fixes and improvements. In some cases, they drug the code kicking and screaming into the 2020s. :-)
June 12, 2023 04:40 PM
May 22, 2023
Thanks for all the suggestions, on here, and on twitter and on mastodon, anyway who noted I could use a single fd and avoid all the pain was correct!
I hacked up an ever growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse I have to use it for all device memory allocations in lavapipe which means if I push forward I probably have to prove it works and scales a bit better to myself. I suspect layering some of the pb bufmgr code on top of an ever growing fd might work, or maybe just having multiple 2GB buffers might be enough.
Not sure how best to do shaderResourceResidency, userfaultfd might be somewhat useful, mapping with PROT_NONE and then using write(2) to get a -EFAULT is also promising, but I'm not sure how best to avoid segfaults for read/writes to PROT_NONE regions.
Once I got that going, though I ran headfirst into something that should have been obvious to me, but I hadn't thought through.
llvmpipe allocates all it's textures linearly, there is no tiling (even for vulkan optimal). Sparse textures are incompatible with linear implementations. For sparseImage2D you have to be able to give the sparse tile sizes from just the image format. This typically means you have to work out how large the tile that fits into a hw page is in w/h. Of course for a linear image, this would be dependent on the image stride not just the format, and you just don't have that information.
I guess it means texture tiling in llvmpipe might have to become a thing, we've thought about it over the years but I don't think there's ever been a solid positive for implementing it.
Might have to put sparse support on the back burner for a little while longer.
May 22, 2023 03:12 AM
May 17, 2023
Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.
The answer is unfortunately extremely.
Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.
This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.
Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.
However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.
You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).
However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.
But there are limits on file descriptors, you get 1024 soft, or 4096 hard limits by default, which is woefully low for this. Also *all* device memory allocations would need to be fd backed, not just ones going to be used in sparse allocations.
Vulkan has a limit maxMemoryAllocationCount that could be used for this, but setting it to the fd limit is a problem because some fd's are being used by the application and just in general by normal operations, so reporting 4096 for it, is probably going to explode if you only have 3900 of them left.
Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)
I shall think on this a bit more, please let me know if anyone has any good ideas!
May 17, 2023 07:28 AM
May 11, 2023
(Edit 2023-05-10: This has now launched for a subset of Twitter users. The code that existed to notify users that device identities had changed does not appear to have been enabled - as a result, in its current form, Twitter can absolutely MITM conversations and read your messages)
Elon Musk appeared on an interview with Tucker Carlson last month, with one of the topics being the fact that Twitter could be legally compelled to hand over users' direct messages to government agencies since they're held on Twitter's servers and aren't encrypted. Elon talked about how they were in the process of implementing proper encryption for DMs that would prevent this - "You could put a gun to my head and I couldn't tell you. That's how it should be."
tl;dr - in the current implementation, while Twitter could subvert the end-to-end nature of the encryption, it could not do so without users being notified. If any user involved in a conversation were to ignore that notification, all messages in that conversation (including ones sent in the past) could then be decrypted. This isn't ideal, but it still seems like an improvement over having no encryption at all. More technical discussion follows.
For context: all information about Twitter's implementation here has been derived from reverse engineering version 9.86.0 of the Android client and 9.56.1 of the iOS client (the current versions at time of writing), and the feature hasn't yet launched. While it's certainly possible that there could be major changes in the protocol between now launch, Elon has asserted that they plan to launch the feature this week so it's plausible that this reflects what'll ship.
For it to be impossible for Twitter to read DMs, they need to not only be encrypted, they need to be encrypted with a key that's not available to Twitter. This is what's referred to as "end-to-end encryption", or e2ee - it means that the only components in the communication chain that have access to the unencrypted data are the endpoints. Even if the message passes through other systems (and even if it's stored on other systems), those systems do not have access to the keys that would be needed to decrypt the data.
End-to-end encrypted messengers were initially popularised by Signal, but the Signal protocol has since been incorporated into WhatsApp and is probably much more widely used there. Millions of people per day are sending messages to each other that pass through servers controlled by third parties, but those third parties are completely unable to read the contents of those messages. This is the scenario that Elon described, where there's no degree of compulsion that could cause the people relaying messages to and from people to decrypt those messages afterwards.
But for this to be possible, both ends of the communication need to be able to encrypt messages in a way the other end can decrypt. This is usually performed using AES, a well-studied encryption algorithm with no known significant weaknesses. AES is a form of what's referred to as a symmetric encryption, one where encryption and decryption are performed with the same key. This means that both ends need access to that key, which presents us with a bootstrapping problem. Until a shared secret is obtained, there's no way to communicate securely, so how do we generate that shared secret? A common mechanism for this is something called Diffie Hellman key exchange, which makes use of asymmetric encryption. In asymmetric encryption, an encryption key can be split into two components - a public key and a private key. Both devices involved in the communication combine their private key and the other party's public key to generate a secret that can only be decoded with access to the private key. As long as you know the other party's public key, you can now securely generate a shared secret with them. Even a third party with access to all the public keys won't be able to identify this secret. Signal makes use of a variation of Diffie-Hellman called Extended Triple Diffie-Hellman that has some desirable properties, but it's not strictly necessary for the implementation of something that's end-to-end encrypted.
Although it was rumoured that Twitter would make use of the Signal protocol, and in fact there are vestiges of code in the Twitter client that still reference Signal, recent versions of the app have shipped with an entirely different approach that appears to have been written from scratch. It seems simple enough. Each device generates an asymmetric keypair using the NIST P-256 elliptic curve, along with a device identifier. The device identifier and the public half of the key are uploaded to Twitter using a new API endpoint called /1.1/keyregistry/register. When you want to send an encrypted DM to someone, the app calls /1.1/keyregistry/extract_public_keys with the IDs of the users you want to communicate with, and gets back a list of their public keys. It then looks up the conversation ID (a numeric identifier that corresponds to a given DM exchange - for a 1:1 conversation between two people it doesn't appear that this ever changes, so if you DMed an account 5 years ago and then DM them again now from the same account, the conversation ID will be the same) in a local database to retrieve a conversation key. If that key doesn't exist yet, the sender generates a random one. The message is then encrypted with the conversation key using AES in GCM mode, and the conversation key is then put through Diffie-Hellman with each of the recipients' public device keys. The encrypted message is then sent to Twitter along with the list of encrypted conversation keys. When each of the recipients' devices receives the message it checks whether it already has a copy of the conversation key, and if not performs its half of the Diffie-Hellman negotiation to decrypt the encrypted conversation key. One it has the conversation key it decrypts it and shows it to the user.
What would happen if Twitter changed the registered public key associated with a device to one where they held the private key, or added an entirely new device to a user's account? If the app were to just happily send a message with the conversation key encrypted with that new key, Twitter would be able to decrypt that and obtain the conversation key. Since the conversation key is tied to the conversation, not any given pair of devices, obtaining the conversation key means you can then decrypt every message in that conversation, including ones sent before the key was obtained.
(An aside: Signal and WhatsApp make use of a protocol called Sesame which involves additional secret material that's shared between every device a user owns, hence why you have to do that QR code dance whenever you add a new device to your account. I'm grossly over-simplifying how clever the Signal approach is here, largely because I don't understand the details of it myself. The Signal protocol uses something called the Double Ratchet Algorithm to implement the actual message encryption keys in such a way that even if someone were able to successfully impersonate a device they'd only be able to decrypt messages sent after that point even if they had encrypted copies of every previous message in the conversation)
How's this avoided? Based on the UI that exists in the iOS version of the app, in a fairly straightforward way - each user can only have a single device that supports encrypted messages. If the user (or, in our hypothetical, a malicious Twitter) replaces the device key, the client will generate a notification. If the user pays attention to that notification and verifies with the recipient through some out of band mechanism that the device has actually been replaced, then everything is fine. But, if any participant in the conversation ignores this warning, the holder of the subverted key can obtain the conversation key and decrypt the entire history of the conversation. That's strictly worse than anything based on Signal, where such impersonation would simply not work, but even in the Twitter case it's not possible for someone to silently subvert the security.
So when Elon says Twitter wouldn't be able to decrypt these messages even if someone held a gun to his head, there's a condition applied to that - it's true as long as nobody fucks up. This is clearly better than the messages just not being encrypted at all in the first place, but overall it's a weaker solution than Signal. If you're currently using Twitter DMs, should you turn on encryption? As long as the limitations aren't too limiting, definitely! Should you use this in preference to Signal or WhatsApp? Almost certainly not. This seems like a genuine incremental improvement, but it'd be easy to interpret what Elon says as providing stronger guarantees than actually exist.
comments
May 11, 2023 12:40 AM
May 08, 2023
After some hiccups with Indico we’ve finally set up a page that lists submitted microconference proposals. Along with seasoned veterans like Containers and Checkpoint/Restore and RISC-V we are glad to see Live Patching microconference returning after a long break and a brand new Linux Kernel Debugging microconference.
The Proposed microconfences page will be updated from time until the CFP for microconference proposals will be closed on June, 1.
Be sure not to miss the deadline and submit your microconference!
May 08, 2023 05:20 PM
April 28, 2023
Much angst (and discussion ink) is wasted in open source over whether pulling in code from one project with a different licence into another is allowable based on the compatibility of the two licences. I call this problem self defeating because it creates sequestered islands of incompatibly licensed but otherwise fully open source code that can never ever meet in combination. Everyone from the most permissive open source person to the most ardent free software one would agree this is a problem that should be solved, but most of the islands would only agree to it being solved on their terms. Practically, we have got around this problem by judicious use of dual licensing but that requires permission from the copyright holders, which can sometimes be hard to achieve; so dual licensing is more a band aid than a solution.
In this blog post, I’m going to walk you through the reasons behind cone the most intractable compatibility disputes in open source: Apache-2 vs GPLv2. However, before we get there, I’m first going to walk through several legal issues in general contract and licensing law and then get on to the law and politics of open source licensing.
The Law of Contracts and Licences
Contracts and Licences come from very similar branches of the law and concepts that apply to one often apply to the other. For this legal tour we’ll begin with materiality in contracts followed by licences then look at repairable and irreparable legal harms and finally the conditions necessary to take court action.
Materiality in Contracts
This is actually a well studied and taught bit of the law. The essence is that every contract has a “heart” or core set of clauses which really represent what the parties want from each other and often has a set of peripheral clauses which don’t really affect the “heart” of the contract if they’re not fulfilled. Not fulfilling the latter are said to cause non-material breaches of the contract (i.e. breaches which don’t terminate the contract if they happen, although a party may still have an additional legal claim for the breach if it caused some sort of harm). A classic illustration, often used in law schools, is a contract for electrical the electrical wiring of a house that specifies yellow insulation. The contractor can’t find yellow, so wires the house with blue insulation. The contract doesn’t suffer a material breach because the wires are in the wall (where no-one can see) and there’s no safety issue with the colour and the heart of the contract was about wiring the house not about wire colour.
Materiality in Licensing
This is actually much less often discussed, but it’s still believed that licences are subject to the same materiality constraints as contracts and for this reason, licences often contain “materiality clauses” to describe what the licensor considers to be material to it. So for the licensing example, consider a publisher wishing to publish a book written by a famous author known as the “Red Writer”. A licence to publish for per copy royalties of 25% of the purchase price of the book is agreed but the author inserts a clause specifying by exact pantone number the red that must be the predominant colour of the binding (it’s why they’re known as the “Red Writer”) and also throws in a termination of copyright licence for breaches clause. The publisher does the first batch of 10,000 copies, but only after they’ve been produced discovers that the red is actually one pantone shade lighter than that specified in the licence. Since the cost of destroying the batch and reprinting is huge, the publisher offers the copies for sale knowing they’re out of spec. Some time later the “Red Writer” comes to know of the problem, decides the licence is breached and therefore terminated, so the publisher owes statutory damages (yes, they’ve registered their copyright) per copy on 10,000 books (about $300 million maximum), would the author win?
The answer of course is that no court is going to award the author $300 million. Most courts would take the view that the heart of the contract was about money and if the author got their royalties per book, there was no material breach and the licence continues in force for the publisher. The “Red Writer” may have a separate tort claim for reputational damage if any was caused by the mis-colouring of the book, but that’s it.
Open Source Enforcement and Harm
Looking at the examples above, you can see that most commercial applications of the law eventually boil down to money: you go to court alleging a harm, the court must agree and then assess the monetary compensation for the harm which becomes damages. Long ago in community open source, we agreed that money could never compensate for a continuing licence violation because if it could we’d have set a price for buying yourself out of the terms of the licence (and some Silicon Valley Rich Companies would actually be willing to pay it, since it became the dual licence business model of companies like MySQL). The principle that mostly applies in open source enforcement actions is that the harm is to the open source ecosystem and is caused by non-compliance with the licence. Since such harm can only be repaired by compliance that’s the essence of the demand. Most enforcement cases have been about egregious breaches: lack of any source code rather than deficiencies in the offer to provide source code, so there’s actually very little in court records with regard to materiality of licence breaches.
One final thing to note about enforcement cases is there must always be an allegation of material harm to someone or something because you can’t go into court and argue on abstract legal principles (as we seem to like to do in various community mailing lists), you must show actual consequences as well. In addition to consequences, you must propose a viable remedy for the harm that a court could impose. As I said above in open source cases it’s often about harms to the open source ecosystem caused by licence breaches, which is often accepted unchallenged by the defence because the case is about something obviously harmful to open source, like failure to provide source code (and the remedy is correspondingly give us the source code). However, when considering about the examples below it’s instructive to think about how an allegation of harm around a combination of incompatible open source licences would play out. Since the source code is available, there would be much more argument over what the actual harm to the ecosystem, if any, was and even if some theoretical harm could be demonstrated, what would the remedy be?
Applying this to Apache-2 vs GPLv2
The divide between the Apache Software Foundation (ASF) and the Free Software Foundation (FSF) is old and partly rooted in politics. For proof of this notice the FSF says that the two licences (GPLv2 and Apache-2) are legally incompatible and in response the ASF says no-one should use any GPL licences anyway. The purpose of this section is to guide you through the technicalities of the incompatibility and then apply the materiality lessons from above to see if they actually matter.
Why GPLv2 is Incompatible with Apache-2
The argument is that Apache-2 contains two incompatible clauses: the patent termination clause (section 3) which says that if you launch an action against anyone alleging the licensed code infringes your patent then all your rights to patents in the code under the Apache-2 licence terminate; and the Indemnity clause (Section 9) which says that if you want to offer an a warranty you must indemnify every contributor against any liability that warranty might incur. By contrast, GPLv2 contains an implied patent licence (Section 7) and a No Warranty clause (Section 11). Licence scholars mostly agree that the patent and indemnity terms in GPLv2 are weaker than those in Apache-2.
The incompatibility now occurs because GPLv2 says in Section 2 that the entire work after the combination must be shipped under GPLv2, which is possible: Apache is mostly permissive except for the stronger patent and indemnity clauses. However, it is arguable that without keeping those stronger clauses on the Apache-2 code, you’ve violated the Apache-2 licence and the GPLv2 no additional restrictions clause (Section 6) prevents you from keeping the stronger licensing and indemnity clauses even on the Apache-2 portions of the code. Thus Apache-2 and GPLv2 are incompatible.
Materiality and Incompatibility
It should be obvious from the above that it’s hard to make a materiality argument for dropping the stronger apache2 provisions because someone, somewhere might one day get into a situation where they would have helped. However, we can look at the materiality of the no additional restrictions clause in GPLv2. The FSF has always taken the absolutist position on this, which is why they think practically every other licence is GPLv2 incompatible: when you dig at least one clause in every other open source licence can be regarded as an additional restriction. We also can’t take the view that the whole clause is not material: there are obviously some restrictions (like you must pay me for every additional distribution of the code) that would destroy the open source nature of the licence. This is the whole point of the no additional restrictions clause: to prevent the downstream addition of clauses incompatible with the free software goal of the licence.
I mentioned in the section on Materiality in Licences that some licences have materiality clauses that try to describe what’s important to the licensor. It turns out that GPLv2 actually does have a materiality clause: the preamble. We all tend to skip the preamble when analysing the licence, but there’s no denying it’s 7 paragraphs of justification for why the licence looks like it does and what its goals are.
So, to take the easiest analysis first, does the additional indemnity Apache-2 requires represent a material additional restriction. The preamble actually says “for each author’s protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors’ reputations.” Even on a plain reading an additional strengthening of that by providing an indemnity to the original authors has to be consistent with the purpose as described, so the indemnity clause can’t be regarded as a material additional restriction (a restriction which would harm the aims of the licence) when read in combination with the preamble.
Now the patent termination clause. The preamble has this to say about patents “Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone’s free use or not licensed at all.” So giving licensees the ability to terminate the patent rights for patent aggressors would appear to be an additional method of fulfilling the last sentence. And, again, the patent termination clause seems to be consistent with the licence purpose and thus must also not be a material additional restriction.
Thus the final conclusion is that while the patent and indemnity clauses of Apache-2 do represent additional restrictions, they’re not material additional restrictions according to the purpose of the licence as outlined by its materiality clause and thus the combination is permitted. This doesn’t mean the combination is free of consequences: the added code still carries the additional restrictions and you must call that out to the downstream via some mechanism like licensing tags, but it can be done.
Proving It
The only way to prove the above argument is to win in court on it. However, here lies the another good reason why combining Apache-2 and GPLv2 is allowed: there’s no real way to demonstrate harm to anything (either the copyright holder who agreed to GPLv2 or the Community) and without a theory of actual Harm, no-one would have standing to get to court to test the argument. This may look like a catch-22, but it’s another solid reason why, even in the absence of the materiality arguments, this would ultimately be allowed (if you can’t prevent it, it must be allowable, right …).
Community Problems with the Materiality Approach
The biggest worry about the loosening of the “no additional restrictions” clause of the GPL is opening the door to further abuse of the licence by unscrupulous actors. While I agree that this should be a concern, I think it is adequately addressed by rooting the materiality of the licence in the preamble or in provable harm to the open source community. There is also the flip side of this: licences are first and foremost meant to serve the needs of their development community rather than become inflexible implements for a group of enforcers, so even if there were some putative additional abuse in this approach, I suspect it would be outweighed by the licence compatibility benefit to the development communities in general.
Conclusion
The first thing to note is that Open Source incompatible licence combination isn’t as easy as simply combining the code under a single licence: You have to preserve the essential elements of both licences in the code which is combined (although not necessarily the whole project), so for an Apache-2/GPLv2 combination, you’ll need a note on the files saying they follow the stronger Apache patent termination and indemnity even if they’re otherwise GPLv2. However, as long as you’re careful the combination works for either of two reasons: because the Apache-2 restrictions aren’t material additional restrictions under the GPLv2 preamble or because no-one was actually harmed in the making of the combination (or both).
One can see from the above that similar arguments can be applied to various other supposedly incompatible licence combinations (exercise for the reader: try it with BSD-4-Clause and GPLv2). One final point that should be made is that licences and contracts are also all about what was in the minds of the parties, so for open source licences on community code, the norms and practices of the community matter in addition to what the licence actually says and what courts have made of it. In the final analysis, if the community norm of, say, a GPLv2 project is to accept Apache-2 code allowing for the stronger patent and indemnity clauses, then that will become the understood basis for interpreting the GPLv2 licence in that community.
For completeness, I should point out I’ve used the no harm no foul reasoning before when arguing that CDDL and GPLv2 are compatible.
April 28, 2023 01:27 PM
eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float.
Observability tools are designed have the lowest overhead possible so that they are safe to run in production while analyzing an active performance issue. Keeping overhead low can require tradeoffs in other areas: tcpdump(8), for example, will drop packets if the system is overloaded, resulting in incomplete visibility. This creates an obvious security risk for tcpdump(8)-based security monitoring: An attacker could overwhelm the system with mostly innocent packets, hoping that a few malicious packets get dropped and are left undetected. Long ago I encountered systems which met strict security auditing requirements with the following behavior: If the kernel could not log an event, it would immediately **halt**! While this was vulnerable to DoS attacks, it met the system's security auditing non-repudiation requirements, and logs were 100% complete.
There are ways to evade detection in other tools as well, like top(1) (since it samples processes and relies on its comm field) and even ls(1) (putting escape characters in files). Rootkits do this. These techniques have been known in the industry for decades and haven't been "fixed" because they aren't "broken." They are cars, not boats. Similar methods can be used to evade detection in the eBPF bcc and bpftrace observability tools as well: overwhelming them with events, doing time-of-check-time-of-use attacks (TOCTOU), escape characters, etc.
When will the eBPF community "fix" these tools? Well, when will Tesla fix my Model 3 so I can drive it under the Oakland bridge instead of over it? (I joke, and I don't drive a Tesla.) What you actually want is a security monitoring tool that meets a different set of requirements. Trying to adapt observability tools into security tools generally increases overhead (e.g., adding extra probes) which negates the main reason I developed these using eBPF in the first place. That would be like taking the wheels off a car to help make it float. There are other issues as well, like decreasing maintainability when moving probes from stable tracepoints to unstable inner workings for TOU tracing. Had I written these as security tools to start with, I would have done them differently: I'd start with LSM hooks, use a plugin model instead of standalone CLI tools, support configurable policies for event drop behavior, optimize event logging (which we still haven't [done](https://github.com/iovisor/bcc/issues/1033)), and lots more.
None of this should be news to experienced security engineers. I'm writing this post because others see the tools and examples I've shared and believe that, with a bit of shell scripting, they could have a good security monitoring product. I get that it looks that way, but in reality there's a bunch of work to do. Ideally I'd link to an example in bcc for security monitoring (we could create a subdirectory for them) but that currently doesn't exist. In the meantime my best advice is: If you are making a security monitoring product, hire a good security engineer (e.g., someone with solid pen-testing experience).
BPF for security monitoring was first explored by myself and a Netflix security engineer, Alex Maestretti, in a [2017 BSides talk] \(some slides below). Since then I've worked with other security engineers on the topic (hi Michael, Nabil, Sargun, KP). (I also did security work many years ago, so I'm not completely new to the topic.)


BSidesSF2017 BPF security monitoring: Alex Maesretti, Brendan Gregg
There is potential for an awesome eBPF security product, and it's not just the visibility that's valuable (all those arrows) it's also the low overhead. These slides included our [overhead evaluation] showing bcc/eBPF was far more efficient than auditd or go-audit. (It was pioneering work, but unfortunately the slides are all we have: Alex, I, and others left Netflix before open sourcing it.) There are now other eBPF security products, including open source projects (e.g., [tetragon]), but I don't know enough about them all to have a recommendation.
Note that I'm talking about the observability tools here and not the eBPF kernel runtime itself, which has been designed as a secure sandbox. Nor am I talking about privilege escalation, since to run the tools you already need root access (that car has sailed!).
[2017 BSides talk]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring
[overhead evaluation]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring/#17
[tetragon]: https://github.com/cilium/tetragon
April 28, 2023 12:00 AM
Content copyright by their respective authors.