Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Working around Hardware Bugs
Some Linux kernel discussions can get pretty jaw-dropping. Andy Lutomirski proposed some code that would eliminate nested non-maskable interrupts (NMIs), which would ultimately allow a lot of kernel code to be removed. NMIs are signals that typically indicate some kind of hardware failure. They also typically can't be prevented by the operating system.
So far so good. However, one type of hardware error is the Machine-Check Exception (MCE), which indicates a problem with the CPU, and on Windows systems results in the blue screen of death. Andy's code had to account for various possibilities, such as an NMI followed immediately by an MCE. So, various kernel folks discussed the right way for Andy's code to handle these situations. Could the MCE interrupt the uninterruptible NMI? And so on.
At a certain point, Linus Torvalds joined the discussion, saying:
Don't worry too much about the MCEs. The hardware is f*cking broken, and nobody sane ever thought that synchronous MCEs were a good idea.
Proof: look at Itanium.
The truly nonmaskable synchronous MCEs are a fatal error. It's that simple. Anybody who thinks anything else is simply wrong, and has probably talked to too many hardware engineers that don't actually understand the bigger picture.
Sane hardware handles anything that *can* be handled in hardware, and then reports (later) to software about the errors with a regular non-critical MCE that doesn't punch through NMI or even regular interrupt disabling.
So the true "MCE punches through even NMI protection" case is relegated purely to the "hardware is broken and needs to be replaced" situation, and our only worry as kernel people is to try to be as graceful as possible about it – but that "as graceful as possible" does *not* include bending over and worrying about random possible deadlocks or other crazy situations. It's purely a "best effort" kind of thing where we try to do whatever logging etc that is easy to do.
Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.
Borislav Petkov remarked, "I certainly like this way of handling it. We can even issue a nice banner saying something like 'You're f*cked – go change hw.' :-)"
Meanwhile, the technical discussion continued, with folks attempting to figure out whether the kernel should deadlock or panic under various circumstances – for example, if the MCE arrived during user code execution versus kernel code execution.
Linus then elaborated, saying:
MCE hardware designers are f*cking morons that should be given extensive education about birth control and how not to procreate.
MCE is frankly misdesigned. It's a piece of shit, and any of the hardware designers that claim that what they do is for system stability are out to lunch. This is a prime example of what *NOT* to do, and how you can actually spread what was potentially a localized and recoverable error, and make it global and unrecoverable.
Can we please get these designers either fired, or re-educated? Because this shit has been going on too long. I complained about this to Tony many years ago, and nothing was ever fixed.
Synchronous MCEs are fine for synchronous errors, but then trying to turn them "synchronous" for other CPU's (where they *weren't* synchronous errors) is a major mistake. External errors punching through irq context is wrong, punching through NMI is just inexcusable.
If the OS then decides to take down the whole machine, the OS – not the hardware – can choose to do something that will punch through other CPU's NMI blocking (notably, init/reset), but the hardware doing this on its own is just broken if true.
Anyway, I repeat: I refuse to fix hardware bugs. As far as we are concerned, this is "best effort," and the hardware designers should take a long deep look at their idiotic schemes. If something punches through NMI, it's deadly. It's that simple.
Tony Luck replied, "Latest SDM (version 050 from late February this year) describes how this is going to be fixed. Recoverable machine checks are going to be thread local. But current silicon still has the broadcast behavior … silicon development pipeline is very long :-("
A few posts down the line Linus acknowledged, "It appears Intel is fixing their brain damage." Borislav Petkov replied, "Yep, we'd still need to deal with the existing systems but we don't have a choice anyway."
For me, the jaw-dropping part of this whole discussion is that Linus considers certain hardware – as designed – to be so broken that the kernel can't work around the breakage. Usually, there's at least some sort of workaround available, but maybe because this particular issue has to do with a fairly immediate system crash, the options are more limited.
Security Versus Feature-Completeness in Containers
Linux containers are subtle beasts. The goal is to cordon off a portion of a given system and make it appear to be an entirely separate system. Ideally, it would be indistinguishable from a completely independent running system, and users could do anything on it that they could have done if the system had been truly installed on its own hardware.
In practice, containers must not support various features because they would allow a hostile user to tunnel out to the host system, and this array of security requirements is not necessarily obvious.
Seth Forshee recently posted some patches to allow containers to see devices that were hot-added to the host system at run time. To do this, Seth wanted containers to have their own device namespace that could be mapped onto the available devices on the host.
Greg Kroah-Hartman put a stick into Seth's spokes, saying that he had expressed a very clear policy the previous year at the Linux Plumbers Conference – "adding namespaces to devices isn't going to happen, sorry. Please don't continue down this path." Greg added, "Why would you even want a container to see a 'new' device? That's the whole point, your container should see a 'clean' system, not the 'this USB device was just plugged in' system. Otherwise, how are you going to even tell that container a new device showed up? Are you now going to add udev support in containers? Hah, no."
Michael H. Warfield pointed out that there was in fact a valid use case for this feature. He said, "I use a USB sharing device that controls a multiport USB serial device controlling serial consoles to 16 servers and shared between 4 controlling servers. The sharing control port (a USB HID device) should be shared between designated containers so that any designated container owner can 'request' a console to one of the other servers (yeah, I know there can be contention but that's the way the cookie crumbles – most of the time it's on the master host). Once they get the sharing device's attention, they 'lose' that HID control device (it disappears from /dev entirely) and they gain only their designated USBtty{n} device for their console. Dynamic devices at their finest."
Seth also pointed out that even more than hot-plugged physical devices, he wanted his containers to support loop devices. A loop device is a pseudo-device that makes a plain file accessible as a block device, so that a filesystem image stored in the file can be mounted.
Seth said, "As things stand today, to support loop devices, lxc would need to do something like this: grab some unused loop devices, remove them from /dev, and make device nodes with appropriate ownership/permissions in the container's /dev. Otherwise, there's potential for accidental duplicate use of the devices, which besides having unexpected results could result in information leak into the container. At that point you have some loop devices that the container can use, but privileged operations such as re-reading partitions and encrypted loop aren't possible. Even if you can re-read partitions, device nodes will appear in the main /dev and not in the container."
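The sequence Seth describes can be sketched as a dry-run planner. This is only an illustration, not lxc's actual code: the function names, the container path, and the uid/gid values are hypothetical. It assumes the default layout in which /dev/loopN is a block device with major number 7 and minor number N, and it leaves open how the unused loop devices are found.

```python
import os
import stat

LOOP_MAJOR = 7  # block major number used by loop devices on Linux


def plan_loop_handoff(free_loops, container_root, uid, gid, mode=0o660):
    """Return the ordered steps to hand loop devices to one container.

    free_loops: minor numbers of unused host loop devices (hypothetical
    input; discovering them is out of scope for this sketch).
    Nothing is executed; each step is a tuple describing one operation.
    """
    steps = []
    for minor in free_loops:
        host_node = f"/dev/loop{minor}"
        guest_node = os.path.join(container_root, "dev", f"loop{minor}")
        # 1. Remove the host node so nothing else picks the same device.
        steps.append(("unlink", host_node))
        # 2. Recreate it as a block device node in the container's /dev.
        steps.append(("mknod", guest_node, stat.S_IFBLK | mode, (LOOP_MAJOR, minor)))
        # 3. Give the container's owner access to the new node.
        steps.append(("chown", guest_node, uid, gid))
    return steps


def apply_steps(steps):
    """Carry out a plan. Requires root privileges on the host."""
    for step in steps:
        if step[0] == "unlink":
            os.unlink(step[1])
        elif step[0] == "mknod":
            _, path, node_mode, (major, minor) = step
            os.mknod(path, node_mode, os.makedev(major, minor))
        elif step[0] == "chown":
            os.chown(step[1], step[2], step[3])


# Hypothetical container root and ownership, for illustration only.
plan = plan_loop_handoff([4, 5], "/var/lib/lxc/build/rootfs", 100000, 100000)
```

Even with nodes handed over this way, the limits Seth lists still apply: the container can use the devices, but privileged operations such as re-reading partitions remain out of reach.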
Greg replied, "Giving the ability for a container to create a loop device at all is a horrid idea, as you have pointed out, lots of information leakage could easily happen."
Greg asked why this would ever be needed, and Michael explained, "I need raw access to loop devices and loop-control because I'm using containers to build NST (Network Security Toolkit) distribution iso images (one container is x86_64 while the other is i686). Each requires 2 loop devices. You can't set up the loop devices in advance since the containers will be creating the images and building them. NST tinkers with the base build engine configuration, so I really DON'T want it running on a hard iron host."
But Greg said this was not a normal use case for Linux containers, and that Michael should definitely not use containers to do any such thing.
Serge Hallyn argued that Greg's whole concept of a "normal" container use case was flawed. Normal, he said, was just whatever worked successfully. Serge said, "Not too long ago much of what we can now do with network namespaces was not a normal container use case." It seemed to him that if one could get a feature to work with containers, then that should constitute a "normal" use case.
Richard Weinberger agreed with Greg. He thought Seth and Serge were trying to shoe-horn features into containers that belonged in a more complete system emulation, such as the KVM hypervisor. Richard said, "Please don't put more complexity into containers. They are already horribly complex and error prone."
Serge, however, thought that complexity and error-proneness were things that could and should be fixed on a bug-by-bug basis. He took the position that "the only use case which is inherently not valid for containers is running a kernel."
Richard replied, "There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a user is allowed to mount filesystems. Ask Andy, he found already lots of nasty things … . I agree that user namespaces are the way to go; all the papering with LSM over security issues is much worse. But we have to make sure that we don't add too [many] features too fast."
Serge said he was fine with that. It was perfectly valid to wait on adding certain features. However, he cautioned, "not exercising the new code may only mean that existing flaws stick around longer, undetected (by most)." Richard agreed with that as well.
Greg then said that Seth's patch was not the right way to address the many remaining issues and that Seth's code addressed only the specific case Seth cared about, whereas a real solution would need to address the full swath of issues surrounding namespaces, including how to handle notifications to userspace.
Greg did also say, at a certain point, that loop devices would be good to support in containers. He just didn't like Seth's approach of making big changes to the driver core just to accommodate them. Greg said a better solution would be to "create a new type of thing that acts like a loop device in a container."
Seth didn't connect with that idea though. He thought that core driver changes were inevitable no matter what approach was used to support loop devices. The only way around it, he said, was to force the whole thing to be handled in userspace.
At this point, James Bottomley came into the discussion, saying, "Running hotplug inside a container is a security problem and, since containers are easily entered by the host, it's very easy to listen for the hotplug in the host and inject it into the container using nsenter."
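One way to picture James's suggestion is a host-side helper that reads a kernel hotplug event and decides what to replay inside the container. The sketch below is an assumption-laden illustration, not code from the thread: the function names, the sample event, and the container PID are made up, and the command is only built, never run. It relies on the kernel's uevent message layout (an `action@devpath` header followed by NUL-separated KEY=VALUE pairs) and on the `-t` and `-m` options of nsenter.

```python
def parse_uevent(raw: bytes) -> dict:
    """Parse a kernel uevent: 'action@devpath', then NUL-separated KEY=VALUE."""
    fields = raw.split(b"\0")
    event = {}
    for field in fields[1:]:  # fields[0] is the 'action@devpath' header
        if b"=" in field:
            key, _, value = field.partition(b"=")
            event[key.decode()] = value.decode()
    return event


def injection_command(event: dict, container_pid: int):
    """Build the nsenter command that mirrors one 'add' event into a container.

    Returns None for events that do not describe a new device node.
    """
    if event.get("ACTION") != "add":
        return None
    if not all(key in event for key in ("DEVNAME", "MAJOR", "MINOR")):
        return None
    # Block devices announce SUBSYSTEM=block; treat everything else as char.
    kind = "b" if event.get("SUBSYSTEM") == "block" else "c"
    return [
        "nsenter", "-t", str(container_pid), "-m",  # enter the target's mount namespace
        "mknod", "/dev/" + event["DEVNAME"], kind, event["MAJOR"], event["MINOR"],
    ]


# A made-up event for a USB serial port, shaped like a kernel uevent.
raw = b"\0".join([
    b"add@/devices/example/ttyUSB0",
    b"ACTION=add",
    b"SUBSYSTEM=tty",
    b"MAJOR=188",
    b"MINOR=0",
    b"DEVNAME=ttyUSB0",
])
event = parse_uevent(raw)
command = injection_command(event, container_pid=4242)
```

The point of the design is that the decision stays on the host: the container never listens for hotplug events itself, and the host can simply decline to forward any event it does not want the guest to see.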
Regarding loop devices, James added in a different post, "Giving a container the ability to do a mount is too dangerous. What we want to do is intercept the mount in the host and perform it on behalf of the guest as host root in the guest's mount namespace. If you do it that way, it doesn't really matter what device actually shows up in the guest, as long as the host knows what to do when the mount request comes along."
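James's mount idea amounts to a broker: the guest asks, the host checks the request against its own policy, and the host performs the mount as root inside the guest's mount namespace. The following is a rough sketch under stated assumptions. The `MountRequest` type, the `ALLOWED_FSTYPES` policy, and the sample values are all hypothetical, and how the request is intercepted in the first place is not modeled. The command is built but not executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MountRequest:
    container_pid: int  # PID of any process running inside the guest
    source: str         # what the guest asked to mount
    target: str         # mount point, as seen from inside the guest
    fstype: str         # requested filesystem type


# Hypothetical host policy: filesystem types the host agrees to mount.
ALLOWED_FSTYPES = {"ext4", "iso9660"}


def broker_mount(req: MountRequest):
    """Return the command the host would run as root, or None if refused."""
    if req.fstype not in ALLOWED_FSTYPES:
        return None
    # Refuse relative targets and any attempt to climb out with '..'.
    if not req.target.startswith("/") or ".." in req.target.split("/"):
        return None
    return [
        "nsenter", "-t", str(req.container_pid), "-m",  # guest's mount namespace
        "mount", "-t", req.fstype, req.source, req.target,
    ]


# Made-up requests, for illustration only.
accepted = broker_mount(MountRequest(4242, "/dev/loop4", "/mnt/image", "iso9660"))
refused = broker_mount(MountRequest(4242, "/dev/loop4", "/mnt/image", "btrfs"))
```

Because the guest never calls mount itself, it needs no mount privilege at all, which is the property James is after.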
The discussion continued – less a debate and more an exploration of what might or might not actually work. It seems as though both sides have fairly irrefutable arguments. On the one hand, it seems inevitable that any feature available to the host system will be strongly desired within a container, and implementations of such features will abound. On the other hand, security is an absolute cornerstone, and any features built upon an insecure foundation would almost certainly not get into the kernel.