About this video
What You'll Learn
- Why Flatcar's immutable A/B partitions make operating system updates atomic and reversible
- How Ignition and Butane provision encrypted disks without embedding long-lived secret material
- How Nebraska, kured, and Flatcar's update operator coordinate rolling Kubernetes node reboots
Flatcar maintainers Thilo, Mathieu, and Chewy explain the immutable A/B partition layout, Ignition provisioning, systemd-sysext for Kubernetes, LUKS/TPM disk encryption, and how Nebraska coordinates monthly fleet updates without SSH.
Jump to a chapter
- 0:00 Introduction
- 1:59 Guest Introductions
- 3:10 What is Flatcar?
- 12:30 Kernel modules and Flatcar
- 17:41 Getting started with Flatcar
- 21:27 Encryption with Flatcar
- 29:40 Kubernetes Upgrades with Flatcar
- 35:14 Flatcar k8s node upgrades
- 38:08 Flatcar Metrics
- 40:42 Operating at scale
- 44:25 Recap
- 45:07 Quick-fire questions
Full transcript
Generated from the English captions.
0:00 Introduction
0:01 Hey, everyone. Unfortunately, Laura couldn't join us today. Again, again, again. Seriously, what does she even do anymore? Why not just send an AI robot instead? Actually, David wrote an excuse for me using Claude apparently. So apparently Claude thinks I was busy trying to figure out why my perfectly normal Linux distro needs a package manager. Silly me. Today, we are talking to the Flatcar team about their immutable Linux distribution because apparently the solution to "works on my machine" is to make sure nobody can install anything on any machine ever again. David gets to chat with three maintainers who
0:42 clearly have never heard the phrase "if it ain't broke, don't fix it." Instead, they built an operating system that updates itself monthly whether you like it or not, has no Python, oh my heart, and treats SSH access like a crime. We discussed how Flatcar uses something called the sysext bakery. Sadly, no pains au chocolat here. Why you need a PhD in systemd to add a kernel module, and how they've managed to make full disk encryption even more complicated than it already was. I mean, the real highlight: David discovers you can reboot without rebooting, which broke his brain so hard, he almost
1:21 forgot to mention Rust. Almost. And my brain just broke again reading that back and remembering that conversation. Also, one of the maintainers uses 20 xterms as a window manager, which tells you everything that you need to know about the target audience for this distro and this episode. Will David adopt Flatcar for everything? And how many times will they mention that SSHing into nodes is bad? Enjoy this rusty episode. PS. Yes. They forked from CoreOS, which was already minimal. It's like they looked at a diet and said, but what if we removed more food?
1:59 Guest Introductions
1:59 Alright. Thank you all so much for joining me today. We have an almost full house. So could we please take a minute just to say hello, introduce ourselves, and share anything that we wish to share? Hey. I'm James Le Cuirot, better known as Chewy around the open source community in particular. I'm a software engineer at Microsoft. In practice, that means I maintain the Flatcar Linux distribution. But in my spare time, I also maintain the Gentoo Linux distribution. Hey. I'm Thilo. I'm an engineering manager at Microsoft. I was an engineer for roughly fifteen years before I became a manager. I
2:34 have a background in embedded Linux, in Linux on infrastructure, Linux on clouds. And I've worked for a variety of large cloud providers. And now I have the honor to help maintain Flatcar with my team and maintainers outside of the group. Yeah. Hello. My name is Mathieu. I've been working as a Flatcar maintainer for a few years now. I'm working on various topics, for example, the provisioning with Ignition. I'm sure we are going to talk about this later in this podcast. Also, in my spare time, I like doing some electrical engineering, embedded software, and stuff like that. Yeah. Thank you
3:10 What is Flatcar?
3:10 so much for those introductions. So let's cover the low-hanging fruit first before we get into the meatier conversation. I will assume that the majority of people listening to this podcast know what Flatcar is. But for anyone that doesn't, can we just take a minute or two to cover the bases and let them know what the project is, what its mission is, and how it helps people in this space? So I'll start, but there are multiple angles to that. So you know how you provision applications on Kubernetes? You basically have your config.yml that specifies
3:43 a few customizations that you need for your app. You don't configure the whole app. You kind of expect it to be self-contained from the container image, and then you apply that YAML to your cluster and suddenly you have an instance of your application there that you wanna run. Flatcar basically does the same for virtual machines in data centers or bare metal machines in data centers. You have a configuration for how you want the node to look, have some initial services you wanna run, and then you apply that to, for instance, GCP, AWS, Azure, to your TFTP
4:18 data center for bare metal, and then the node is being created from that. Just as you never directly interact with pods in regular maintenance, you don't interact with Flatcar nodes in regular maintenance. Basically, it's a self-contained thing. It is self-updating. It is, to some degree, self-healing. You apply your config and then the thing is being created and it just does its job. It doesn't get in the way. That's kind of my view of things. I don't know if Chewy has a different view. To follow on from it not getting in your way, it's also very minimal. So it
4:53 has a very small attack surface, good from a security standpoint. While it looks like a regular Linux distribution, you can SSH into it and so on, it just has what you need to run Docker or Kubernetes or whatever it is and the rest we expect you to provide in your containers. Right. On you go. And yes, maybe to conclude, I would like to bring a bit of historical perspective on this. Flatcar is not brand new. It has been around for a few years now, and even more because it has some legacy with container operating systems. CoreOS has been around for maybe ten years
5:30 now, so we initially forked CoreOS to become Flatcar four years ago. So, yeah, Flatcar is not brand new software. It has been around for a few years now through its legacy. Building on what Chewy said, the operating system isn't just minimal, it's immutable. And it can't be regularly extended at runtime. Like, there's no package manager. You get what you get. You get the set of binaries included with the operating system image and that's it. You can't apt-get install anything and that has advantages in terms of configuration drift and version drift. The updates are full operating system images as
6:08 well. And that has another benefit in that it's not just that you can't fudge with the binaries because the operating system is immutable. It's also that you always get full version sets of all of the packages included in the base image. Like, you are on Flatcar version eight and you know which bash, which git, which kernel, which everything is included, exactly down to the patch level. And then with the next update, you also know what's coming and you always have a fixed version set that's running, and in terms of bill of materials and inventory, that's very beneficial. Alright. Awesome.
6:42 There's a lot to cover there. I can already see four different forks we can go down. But I wanna tackle the one that I think right now is quite topical, and then we'll come back and maybe talk about the decomposition of Flatcar and the components that make it up. Right? But you mentioned immutability, and this is the thing right now, especially to Linux users. Right? There are so many different distributions that are trying to do this on the desktop and on the server side for containers. Now for anyone who understands how Linux works, everything is pretty much a file. So when
7:10 you say that this is immutable, how far is this pushed? And I'm also curious about the update path. Right? Is this using A/B-style file systems? Is it built on top of systemd system extensions? Maybe we can go into that in a bit more technical detail for all the Linux nerds out there. Yeah. So in general, let's wind back. The first question you had was about immutability and how that works with the Linux file concept. Most of the services need some space where they can have some temporary stuff or have their configuration files, but
7:41 they never ever actually touch user supplied configuration as they shouldn't. Right? I mean, you know, having an email service running that then fudges its own config, that would be scary. So what we do is on very first provisioning, we take that YAML config that I mentioned, then we take the defaults that are coming with the operating system and we render out the whole root partition based on these very few inputs. And that kind of compares to the Kubernetes approach of applying applications as well, because very few inputs, same defaults, and then you basically render out
8:13 the whole configuration of the container instance. And the files in /etc, we actually keep writable. So if you absolutely have to, you can go there and mess around with stuff. We don't expect people to, because that would kind of violate the no-config-drift principle that we have, but you could if you absolutely had to. And that allows for other runtime management tools like Ansible to interface with Flatcar nodes, which some users use to manage their fleets. The operating system itself, all of the OS bits are in /usr. And you see that if you look at
8:46 the Flatcar root file system. So anything that might contain binaries is symlinked into /usr: /usr/bin, /usr/sbin. And this is how we lay out that system. And none of the services that we ship, as Chewy mentioned, it's a very limited set of services that we ship, ever tries to write anything to /usr, which in turn is on a separate partition. That partition is cryptographically protected against messing around with. It's dm-verity that we use. If you mess around with bits there, then the reads from this partition will simply fail. And to
9:18 answer your other question, how we handle changes in the operating system itself and updates, that is done via a separate partition. So we use a very basic A/B partitioning scheme that has its drawbacks. You always need two partitions where you could have one, but it also has benefits because it is incredibly robust. Like, literally, it doesn't get any simpler than that. You just have a place to write the other stuff that doesn't interfere with your currently running operating system. You can switch atomically to the new operating system simply by means of a reboot. And the big benefit here is it's not
9:53 just that you don't have any intermediate state between the two versions of the operating system. You always have a pretty seamless rollback. So there's a mechanism, a configuration that you can deploy if you have, for instance, critical services or foundational services that you absolutely need to run. Then you can hook yourself into the update process. And when you first boot into a new partition, you can detect: is everything fine? Yes, continue, declare the new operating system as stable. And then you boot into that on every subsequent reboot. Or you detect that some services didn't like the update, so you
10:31 reboot again before that is declared stable and you're back in a known good state and you can investigate what's going on. If possible, I would like to add something regarding the updates. So maybe it's obvious for us, but sometimes it's not for everyone: the update comes from an update server. So you don't have to manually update your Flatcar instance. Basically every month there is a new Flatcar release and automatically, by default (you can control this, of course), every month you will get this new version of Flatcar being written to the
11:03 B partition and reboot. So by default you don't have to do anything by yourself. Everything is done automatically. And of course you can control this from the initial configuration, but by default you get this update from a public release update server, and if you want to, you can run your own version of the public update server if you want to be fully air-gapped and independent from the public server. So, yeah, I think it was worth mentioning. You mentioned systemd-sysext. That is what we use as well. So Thilo mentioned earlier that you get what you get, but obviously that would
11:40 be quite restrictive if Flatcar didn't even ship with Kubernetes out of the box. So we do extend the system, not with a package manager, but using these, again, immutable disk images, effectively, that get merged over the top. That's how sysexts work, again protected by this dm-verity feature. And out of the box, Flatcar has Docker and containerd, but you can actually remove these if you don't want them, if you want something else like Podman. And we maintain these sysexts in the thing we call the sysext bakery, where we have a whole collection of these.
12:20 They're not tied to any specific Flatcar release. So you can actually mix and match different Kubernetes versions with Flatcar versions. It gives you a lot of freedom that way. Alright. Again, we've covered a lot and I've got forks coming out of my ears. But what I love so far, we've given a pretty, I'd say, pretty comprehensive understanding of the parts that are making up this Linux distribution. Right? We know systemd's on there. There's dm-verity. System extensions to deliver Kubernetes. It's something I'm just gonna leave here, but we're coming back to that.
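The A/B update and rollback flow described in this chapter can be sketched as a small state model. This is a toy illustration only, not Flatcar's update code: the class name `ABNode`, its fields, and the `healthy` flag standing in for a post-boot health check are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ABNode:
    """Toy model of a node with two OS partitions, A and B (hypothetical, not Flatcar code)."""
    versions: dict = field(default_factory=lambda: {"A": "v1", "B": None})
    active: str = "A"   # partition the node is currently booted from
    stable: str = "A"   # last partition that was declared known-good

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage_update(self, version: str) -> None:
        # The new image is written to the inactive partition only,
        # so the running operating system is never modified.
        self.versions[self.inactive()] = version

    def reboot_into_update(self) -> None:
        # Switching partitions happens at reboot; there is no intermediate state.
        self.active = self.inactive()

    def confirm_or_rollback(self, healthy: bool) -> str:
        # A health check decides whether the new partition is declared stable.
        if healthy:
            self.stable = self.active
        else:
            # Reboot again before declaring stable: back to the known-good partition.
            self.active = self.stable
        return self.versions[self.active]


node = ABNode()
node.stage_update("v2")
node.reboot_into_update()
running = node.confirm_or_rollback(healthy=False)  # a critical service failed
```

In this model a failed health check leaves the node running the previous version, while the staged image stays on the other partition for investigation.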
12:30 Kernel modules and Flatcar
12:49 That's really cool. And we have got an update server. So that raised a couple of questions in my mind. Right? So it's nice there's an update server and I'm pulling things. But I'm also thinking, and I think you said it, was that Thilo? When we deploy Flatcar, we can add things to it ourselves. I'm assuming we don't expect the general user to be writing and shipping their own system extensions, or maybe we do. So how does that interfere and work with the updates that are coming every single month and the changes that I'm
13:16 making? Maybe we can just, as an example, let's just assume I wanna stick on, like, the Falco eBPF kernel module. How do I do that? How does it work on updates? I'd love to go into that deeper. The kernel modules are an interesting topic because they, by definition, need to correspond to the actual kernel that you're running. But to take a step back, we don't wipe the whole system when we do an update. We just replace the /usr partition. So whatever you put in /opt or basically /var or any place in root where you dump your binaries, we're good with that.
13:49 We won't touch that. You can provision your own binaries and applications in a number of ways. You can just download a tarball. Like this YAML config that I mentioned, there are dedicated sections where you can list sources for files, HTTP sources, S3 sources, whatever you feel like. And Ignition at provisioning time would just download them and then put them wherever you want them and then chmod them to whatever mode you need them, and maybe you add a systemd unit to start it. And that's fine. So you can, if you like, you can work on a tarball
14:23 basis with your applications and then care about updating them with your own mechanism. I believe sysexts are the lowest common denominator of a mechanism that can ship your binaries safely, is very easy to generate and to maintain on the server or your own update server side, and offers a built-in mechanism for actually versioning and for updating. This still doesn't cover the kmod question, right? For kmods, you have multiple options. You could, on each Flatcar release, have your own build pipeline that would ingest new releases, would produce a system extension with your kernel modules in it, and would
15:07 basically make it ready for consumption on an update server, which is basically an HTTPS endpoint that systemd-sysupdate can use, and then your instances could consume that. Or there's a mechanism to build kernel modules on the operating system at first boot when a new kernel is being detected. It's a little more involved than just provisioning Flatcar, obviously, but we're talking about kernel modules. Experience is kind of expected here. We did that for a long time, for instance, with NVIDIA driver support before we actually went the whole nine yards and built prebuilt sysexts, system-dependent sysexts, that are
15:50 not in the bakery, but that come over our official release servers and that are bound to a Flatcar release. So you would ingest them automatically when you do the operating system update and you would have them ready when you basically start into the new operating system version. That last one is an option that's not really open for people out there, unless they maintain their own update server. Because these system-dependent sysexts, as we call them, need to be built in lockstep with the operating system releases. So we do that before we release the new operating system versions.
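The tarball-style provisioning path mentioned a moment ago (list a file source, let Ignition download it at provisioning time, set its mode, and add a systemd unit to start it) might look roughly like the sketch below. It builds an Ignition-style JSON document in Python. The URL, destination path, and unit name are hypothetical, and the spec version string is an assumption; check the Ignition and Flatcar documentation for the versions your release supports.

```python
import json


def ignition_config(file_url: str, dest_path: str, unit_name: str) -> dict:
    """Build a minimal Ignition-style config: one downloaded file plus one unit."""
    unit_contents = (
        "[Unit]\n"
        "Description=Example application (hypothetical)\n"
        "\n"
        "[Service]\n"
        f"ExecStart={dest_path}\n"
        "\n"
        "[Install]\n"
        "WantedBy=multi-user.target\n"
    )
    return {
        "ignition": {"version": "3.3.0"},  # assumed spec version
        "storage": {
            "files": [
                {
                    "path": dest_path,
                    # Ignition JSON expects the mode as a decimal integer (0o755 == 493).
                    "mode": 0o755,
                    "contents": {"source": file_url},
                }
            ]
        },
        "systemd": {
            "units": [
                {"name": unit_name, "enabled": True, "contents": unit_contents}
            ]
        },
    }


config = ignition_config(
    "https://example.com/releases/myapp",  # hypothetical source URL
    "/opt/bin/myapp",                      # /opt is not replaced by OS updates
    "myapp.service",
)
rendered = json.dumps(config, indent=2)
```

The file lands under `/opt` because, as described above, an update only replaces the `/usr` partition; keeping the application current is then up to your own mechanism.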
16:25 And they need to be provided via a reachable update server that the main update server that we distribute update information about knows about. And it's actually, it's not really depending on the size of your operation. It's not really that hard to rebuild all of that yourself. We actually pride ourselves on having all of this very much automated and documented so everybody could build Flatcar from sources and run their own distro with their own updates if they so desire. So that would be an option too. From my personal point of view, the best option would be, if that would be a desirable
16:59 kmod for users out there to have, work with us on the distro repo and make it an optional system-dependent sysext. So we build it for you on every release and we provide it for you and then you can determine whether you wanna consume that and you would get it directly from the upstream releases. Nice. Yeah. I felt bad when I said eBPF module or Falco because I knew that was gonna go down a hard path, but then you answered it anyway. So now it's fine. I don't feel guilty anymore. Yeah. Actually, we cover most of our Falco
17:29 support in the bakery, if I'm not mistaken. We have a Falco bakery sysext that contains all of the parts that are not strictly dependent on the kernel version. Alright. So, again, we're covering quite a lot here of what's happening with Flatcar as a kind of distribution. So, you know, we've spoken about or at least mentioned that there's some sort of YAML payload that users can provide to tweak this to some degree. So, you know, what is the story there? How do people, you know, I'm assuming someone listens to this and goes, right. Okay. I get it. I understand what
17:41 Getting started with Flatcar
18:03 Flatcar does and the value that I'm getting from it. I wanna get started. What does that kind of first hour look like for them? Well, I guess reading the getting started documentation would be a great entry point. We have getting started documentation to show folks how to deploy Flatcar locally using QEMU virtualization. So that's pretty nice to get started with Flatcar locally. But of course, we also do have a lot of documentation to cover all the supported cloud providers. Because Flatcar is not bound to one cloud provider, we try to support many architectures and many cloud providers.
18:38 So we have the main ones and we also have some stuff like OpenStack or VMware. But, yeah, we do cover the principal cloud providers. So for each of them, we have some documentation, and basically each piece of documentation shows you how to provision the Flatcar instances on those cloud providers, how to interact with the cloud provider's metadata, for example getting the IP, getting the hostname from the instance metadata service from the cloud provider. So everything here is documented. So you can easily build your foundation to build your own workload with a cloud provider. I guess some people may be familiar with
19:14 cloud provider metadata APIs, right, on magic IPs or whatever, and they may really kinda correlate this to using cloud-config on cloud-init. Is that what Flatcar is using under the hood? Yeah. So for Flatcar there is support for cloud-init, but we only support a subset of cloud-init configurations. So we do recommend using Ignition to provision the instances. Ignition was initially developed by the CoreOS folks. So we use it on Flatcar, but it is also used on Fedora CoreOS, on MicroOS from openSUSE. So yeah, it's not only Flatcar specific. And it's a bit like cloud-init in a
19:50 way, except that it runs only on first boot. So if your instance is fully provisioned and it works, well, Ignition won't run on the next boot. And you can do really low-level operations because it's triggered from the initramfs. So you can basically, I don't know, format your root file system before deploying stuff on it. You can use LUKS encryption, for example, if you want to encrypt your disk, or inject some kernel arguments if you need to tweak some systemd parameters or tweak some kernel parameters. So yeah, that's pretty interesting, and basically where you can use cloud-init, you can use Ignition
20:25 because, for example, OpenStack has its metadata service, VMware has a way to provide guestinfo variables which you can use to inject Ignition configuration. So yeah, each cloud provider has its own way to use the user data. Ignition is a JSON file, so you can easily inject it using Terraform or the CLI using the cloud provider API. I'm sure some people are going to ask, why don't you use cloud-init? Everyone's familiar with it. I think the main reason is it's written in Python and Flatcar doesn't have Python out of the box. So we need to
21:01 avoid that. Yeah. Ignition has really small dependencies and a pretty small binary. So, yeah, it tries to be really specific, to do one thing and to just read that configuration. cloud-init can do a lot of stuff. So as usual, it depends on your workflow in the end. But if you can cover all your needs with Ignition, then that's perfect because it's really lightweight and, yeah, easy to configure. Does it happen to be written in Go by any chance? Alright. So it gets the cloud native tick as well, whereas Python doesn't. So, you know, it's alright. Now you mentioned encrypting disks, and I always
21:27 Encryption with Flatcar
21:37 think this is a fun thing to talk about because it's easy to do and hard to do at the same time. Because when you encrypt a disk, you have to be able to unlock a disk. We're talking about secret material and secure enclaves. And with cloud-init, and I've seen this in the past, I used to work for a company called Packet or Equinix Metal, bare metal, and cloud-init was the only way you could get anything onto these boxes. People used to stuff all sorts of secret material in there, and it always used to
22:04 give me the fear. So what is the, you know, I wanna run Flatcar. I wanna do full disk encryption. I wanna pass in some sort of secret material. What is the best way to do this? The metadata APIs, is there something else built into Flatcar and Ignition? Maybe we can talk about that process a little bit. No. I'm actually glad I dabbled with encrypted disks last week and that that experience is still fresh. Yeah. I think one thing to mention, and that's one of the biggest kind of hidden differences between Ignition and cloud-init, is
22:38 the fact that it only renders the configuration once and that's at first provisioning. With cloud-init, if the metadata changes or if any other aspect changes, it would just rerender the config on boot and then you would have configuration drift. With Ignition and Flatcar, what you can do is you can have your Butane config, the butane.yaml or the ignition.json config. You can have it committed to a repository. You can deploy your node from there using that config. You can deploy another node ten years from now, it'll render the exact same config and both nodes will be the same,
23:13 will be equal, because they use the same configuration. That's a huge benefit. Now, encrypting disks is an Ignition feature. It has been in there for a very long time. There's basically multiple ways you can go. If you just tell Ignition to create an encrypted device, it'll just go ahead and do it for you. If you don't supply a secret, it'll make one up. So you better go to your instance and fetch your secret and store it somewhere secret so that you have a backup to reach your disk should the instance fail eventually. But since Ignition generated that secret, it is
23:48 safely stored in the instance's configuration where LUKS needs it. So as long as it's not the root partition, it's basically just integrated into regular startup and you never have to interact with that. As I said, it falls back to a sane default for the things you tell it to do and then just goes ahead and does it. You will have encryption at rest, you will have your keys stored in /etc/luks, where LUKS needs them anyways. It's just the root partition that's a little bit different. You can tell it the same for the root partition.
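The non-root case described here, where no secret is supplied and Ignition generates one itself, could be expressed roughly as follows. This is a hedged sketch that builds an Ignition-style JSON document in Python: the volume name, partition label, and mount path are hypothetical, the spec version string is an assumption, and a real deployment would also need a mount unit for the filesystem. The point of the example is simply that the config carries no key material.

```python
import json


def luks_data_volume(name: str, device: str, mount_path: str) -> dict:
    """Ignition-style config for an encrypted data volume with no embedded secret."""
    return {
        "ignition": {"version": "3.3.0"},  # assumed spec version
        "storage": {
            "luks": [
                # No keyFile and no passphrase here: per the discussion above,
                # Ignition makes up a key when none is supplied.
                {"name": name, "device": device, "wipeVolume": True}
            ],
            "filesystems": [
                {
                    # The opened LUKS device is addressed by its mapper name.
                    "device": f"/dev/mapper/{name}",
                    "format": "ext4",
                    "path": mount_path,
                    "wipeFilesystem": True,
                }
            ],
        },
    }


config = luks_data_volume(
    "data",                          # hypothetical volume name
    "/dev/disk/by-partlabel/data",   # hypothetical partition label
    "/var/lib/data",                 # hypothetical mount point
)
rendered = json.dumps(config, indent=2)
```

Because the rendered document holds no secret, it can be committed to a repository or passed as user data; fetching a backup of the generated key from the instance remains a separate step.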
24:19 And then the provisioning will actually succeed because it just generated the secret. Right? But then you better get to the instance and set a passphrase because next time it boots, it will not be able to read the LUKS configuration, which is on the root partition, because the root partition is encrypted. So the boot will actually stop very early in the process and ask you for the passphrase for your root system. You can then interact, if you like, with the serial console and hammer it in. And I know a disturbingly large number of workflows that actually use that and it's fine.
24:50 If it works for them, then we support that. But you can also use a number of tools in a new Ignition version. I think it's a beta version that you would need to opt into in your configuration. You just say use that beta version instead of the released Ignition version. And then you can use the TPM and systemd-cryptenroll to generate your keys, store your keys safely on the instance and then have your instance forget everything about it. Like no backups, no nothing, everything is just stored in the TPM. You could of course also pull
25:25 a backup of the LUKS configuration beforehand, but yeah. So all of these options are there. I think the key message is you don't need to stuff your secrets into the configuration because when they're not there, Ignition will make them up. If you like to provide your secrets via a safe source, HTTPS or S3 or anything, Ignition will just pull it at provisioning time and then use those secrets. Like, no secrets in user data, please. There's also another option, which is Clevis, and you can have a Tang server. I haven't actually done that, so maybe one of the others
26:00 could say more, but yeah, that's an option too. All right. Nice. Yeah. So the Clevis Tang thing is interesting. Obviously, a bit more work, but it's there. Assuming people are deploying this stuff onto any of the major cloud providers, then they do have access to other cloud provider services, whether it be, you know, SSM or KMS or whatever. Right? So some way to handle secret material. I'm assuming the workload identity features of any of these major clouds would just work out of the box as well. So I think anyone listening who is thinking about this
26:34 idea, definitely just go down that path. And, yeah, no secret material included in Ignition. That's definitely the way to go. You know, this just got me thinking of a new idea, which is great to have on the podcast. Right? Because I know that Clevis added PKCS #11 support fairly recently, which we haven't enabled yet due to space constraints, but we're dealing with that. But if we were to have that, and I'm thinking if it was running in, say, Azure, you could maybe plug it into Azure Key Vault and the VM would have a managed
27:06 identity. So it could just fetch the key straight from Azure Key Vault and you wouldn't have to do a thing. Maybe that idea doesn't work for some reason, but I thought it was worth mentioning. Alright. I'll drop you an email in two weeks to see how you get on with that proof of concept and we'll have the quote linked in the description below. Alright. Nice. I mean, I think so far we've given a pretty good overview of the project. You know? There's immutability, but it sounds like that is only on the /usr partition of the disk. So people are free
27:34 to store or, I guess, even run stateful workloads. Right? If you're gonna be deploying a Kubernetes cluster on this, it's gonna have some sort of local host-mounted volume. That's okay as long as you put it somewhere that you accept and know is writable and you handle your own backups properly. So yeah. Yeah. In fact, the combination of in-place updates and shipping Kubernetes as a system extension, which can also be updated in place, really works in favor of stateful workloads because then, well, I mean, you can still keep your operating system and your Kubernetes versions
28:08 up to date on these nodes without much impact and still, you know, use the full power of local disk. Yeah. It's such a strong pattern. Again, because I used to work at Equinix Metal, I was always trying to get people to do things on the metal as much as possible. And stateful workloads are really hard, especially when Equinix Metal's overarching theme was use Cluster API, not realizing that you can't do in-place upgrades with Cluster API. Like, you just lose the whole machine. I'm not gonna go down that path. We're not gonna rant today. But, yeah, doing this
28:39 on hypervisors or even running Flatcar on bare metal and getting all these in-place things is such a powerful model for when you need to do things like this. And I imagine even more so with the amount of AI, ML, compute needs, GPUs that are running around right now as well. But again, I'm gonna try not to say that two-letter word again for the rest of this podcast. It is actually a good point, though, because that's the second benefit of in-place updates. Right? When you have large Kubernetes fleets, updating
29:08 your fleet by deleting and recreating instances puts a massive churn on the cloud provider, on your cluster because you're recycling a lot of instances, right? And depending on how the budgeting works, maybe even on your budget. So having the opportunity to just stage something and then, you know, have a twenty, thirty second reboot and have it activated right away, that actually removes a lot of risk from the upgrade operation. And it helps with a lot of the churn that deleting and recreating would impose. Yeah. Definitely. For sure. Alright. I know that we're already over where I
29:40 Kubernetes Upgrades with Flatcar
29:46 said I would be, but I did say that was gonna happen anyway, so we're gonna keep going. Now with regards to the Kubernetes story here, I imagine for a lot of people that are looking to adopt Flatcar, Kubernetes is gonna be a core part of this these days because let's face it, everyone's now de facto running Kubernetes at some degree of scale. Now how do they handle these upgrades? What does that look like for them? What is the, okay, I've got Kubernetes 1.28. I mean, I hope not. But I've got Kubernetes 1.28, and
30:13 I wanna get myself up to Kubernetes 1.32. How do they go through this process? What is the cycle like? I know it's not a Flatcar specific question, but Thilo has a whole demo on that. I have, like, a few demos. Mathieu as well actually helped me a lot in shaping those out, and I'll hand over to him because I've been talking too much already. But the key answer to your question is it's really hard to do without manual intervention because what you're talking about is major version upgrades. So Kubernetes upstream doesn't really support upgrading from
30:48 one twenty nine to one thirty unattended. They want you to be aware that this is happening, that you want this, so it doesn't happen involuntary. And I think there's a bar that you need to meet some, I don't know if it's a file interaction or something like that, that you have to supply in order to make Kubernetes aware, yeah, this is 1.29 config. I really like to upgrade to 1.3. Minor version upgrades though, that's easy. And Matthew knows all about it. Yeah, for the minor versions it's quite easy. So as Thiel mentioned, we have to remove
31:22 the binding between the OS version and the Kubernetes version, so both of them can be upgraded independently of each other. So let's take the case where we run the systemd-sysext Kubernetes image on the system and there is a new version of Kubernetes, so a new version of the sysext image available. So what's going on: there is a mechanism developed by systemd, called systemd-sysupdate, that will detect that there is a new version of the sysext image available on the server, for example the GitHub sysext-bakery release server. And it will download it and once it's
31:57 done it will raise a flag asking for a reboot. This flag will be read by, I don't know, for example, kured to coordinate the reboot across the nodes, and the instance will reboot. So now you have your new version of Kubernetes. And it's exactly the same thing for the OS update itself. So when there is a new Flatcar update, once the update is applied on the B partition, the software that takes care of running the update will raise a flag asking for a reboot, and the same kured will coordinate the reboot of the node. Something interesting to know,
32:31 though we don't do this on Flatcar yet, is that for the sysext update you can really reduce the reboot time of the instance by just doing a soft reboot. So you keep the kernel running, you keep the modules, everything running, but you just restart PID 1. So you just reboot the user space stuff, systemd, and, yeah, in one second or less, you have your system rebooted, and you can get your nodes integrated again into the Kubernetes cluster. So that's the
33:02 You're just talking about doing a kexec. Is that what you're doing there? No? No. It's something very specific to systemd. I don't know what's going on under the hood. But, yeah, this is basically a soft reboot. So you just restart the... I haven't heard of this before. This is purely user space and it's not as scary as kexec. So kexec scares me, and I have a little bit of history: I dabbled with kernel development in my engineering career. And if you consider all the driver state and all the in-flight I/O that you have and
33:31 all the DMA requests that are ongoing, it's almost impossible to ensure from a Linux kernel perspective that all the drivers and all the mechanisms and all the subsystems actually listen to you when you, as a kernel, say, "Now you freeze and you save your state because I will, you know, execute myself and then come up fresh." So on a kernel level, this is very scary. On the user space level, you have everything well abstracted by the isolation of user space that the kernel gives you. So what systemd does is basically just reap all
34:03 the processes, run all of the shutdown commands for the processes, but then it doesn't reboot, it just restarts itself. And that is an interesting option. We don't support that in production yet, but it's also very interesting to look into. And I mean, lastly, if you're only talking about systemd-sysext images, for the adventurous folks, you could just re-merge the new image as soon as it's been deployed. There's no need to reboot. And that will basically replace all of the binaries under the hood with the new binaries. So you'd probably better shut down the Kubernetes services on that instance before you do that.
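Editor's sketch of the re-merge sequence just described: stop the Kubernetes services, re-merge the extension images, then start the services again. This is a minimal illustration under stated assumptions, not the Flatcar team's tooling. The unit name `kubelet.service` and the helpers `plan_sysext_refresh` and `run_plan` are hypothetical; `systemctl stop/start` and `systemd-sysext refresh` are the only real commands used. The plan is only printed unless `dry_run=False` is passed.

```python
import shlex
import subprocess
from typing import List, Sequence

# Assumption: the Kubernetes services on the node are systemd units with
# these names. Adjust to whatever your sysext image actually ships.
DEFAULT_SERVICES = ("kubelet.service",)


def plan_sysext_refresh(services: Sequence[str] = DEFAULT_SERVICES) -> List[List[str]]:
    """Return the commands for an in-place sysext re-merge, in order."""
    # Stop the services first so no binary is replaced while it is in use.
    plan = [["systemctl", "stop", unit] for unit in services]
    # `systemd-sysext refresh` unmerges and then re-merges all extension images.
    plan.append(["systemd-sysext", "refresh"])
    # Start the services again, in reverse order of how they were stopped.
    plan += [["systemctl", "start", unit] for unit in reversed(services)]
    return plan


def run_plan(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Render the plan as shell-quoted strings; execute it only if dry_run is False."""
    rendered = [shlex.join(cmd) for cmd in plan]
    if not dry_run:
        for cmd in plan:
            subprocess.run(cmd, check=True)  # abort on the first failing step
    return rendered


if __name__ == "__main__":
    for line in run_plan(plan_sysext_refresh()):
        print(line)
```

On a real node the same ordering is more naturally expressed as systemd unit dependencies, which is what the speaker goes on to suggest.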
34:37 But it's actually not that hard to craft the respective systemd dependencies to have that done as soon as a new Kubernetes sysext is merged. Alright. Awesome. Yeah. I mean, I now want to reboot my computer, the system, without rebooting, but maybe I'll wait till we're finished recording. I'll give it another five minutes. But I had no idea that was a thing and that's really cool. And you're right. It shouldn't be a problem, because we're not modifying or changing the kernel; you're completely in user space and you're just trying to do that thing. So that's very cool.
35:07 And there was one of the mechanics of the Kubernetes upgrade that I don't think we got into. And I'm thinking again from the end user perspective, which is, say, I've got a cluster with 100 nodes. Are these minor updates part of a maintenance schedule with Flatcar and I don't have to worry about it? I just go away and go to sleep and come back and my 100 nodes have gone from 1.28.1 to 1.28.2? Or am I SSHing into machines and doing it one by one? Like, what is that? What's the operational burden of handling these
35:14 Flatcar k8s node upgrades
35:35 updates? You never SSH into a Flatcar machine. You know, we have this saying, like, SSHing into Flatcar machines and trying to fudge something is roughly the equivalent of kubectl exec-ing into your pod and then fudging something there in order to make your service... It's not a good pattern. I don't know. I mean, it depends. Right? But, yeah, it doesn't really give you peace of mind when you just wanna go to bed. So no. This is entirely automated. In the case of, to be honest, in both cases, both for operating
36:09 system updates and for Kubernetes sysext updates, there are reboot operators for Kubernetes. Mathieu mentioned kured, which we rely on in the demos. There's also a reboot operator that we maintain ourselves, the Flatcar Linux Update Operator. And that's basically just an operator YAML you apply to your cluster. It will watch for these flag files that both the staged Kubernetes sysext as well as a staged operating system will generate. And after generating that flag file, the update logic is basically done. Now the operator takes over and then it evacuates the node. It will uncordon the node when it rejoins the cluster so
36:48 the pods can repopulate. But it's configurable, like how many nodes you wanna have updated at the same time and what's your tolerance level and all of these things. And it will slowly churn through your cluster very, very safely. And at some point, all of your cluster will be updated. So you take a look at your cluster the next morning and it runs the new version. Nice. And something I would like to add on top of this: running Kubernetes on Flatcar is really a perfect use case for Flatcar. But you can use Flatcar without using Kubernetes,
37:20 and let's suppose you have a 100-instance cluster running Flatcar but not running Kubernetes: you can still have some mechanism to coordinate the reboot of the instances. It's shipped by default on Flatcar instances. It's called locksmithd, and with this you can coordinate the number of nodes you want to keep up and the number of nodes you allow to reboot. You can even add some logic using an etcd cluster or etcd instance to coordinate the reboot. So it's not really bound to Kubernetes; you can also use this coordinated reboot manager without using Kubernetes. Alright. Awesome. Alright. Lastly then, let's talk about the day
37:58 two side of operating Flatcar at scale. You know, we've kinda talked about it a little bit so far because, you know, we've dived into updates and how we do this. Right? But I'm assuming, you know, operators, SREs, DevOps people, whatever their titles are, they're gonna want some dashboards and some web interfaces and all of those things. They're gonna want metrics off of the machine. Now I imagine node-level metrics are just the same as on any other Linux distribution. But is there Flatcar-specific stuff, like, are there log files in place for system
38:08 Flatcar Metrics
38:26 extension updates or base operating system updates? Like, how do we get this information out from each of the individual nodes into a Grafana dashboard or something else? Or maybe there's a Flatcar, you know, ops suite. Like, what does that day two look like for people rolling this out to their infrastructure? Maybe I can start really quick? So I mentioned earlier the update server, which is called Nebraska. And this update server will expose some Prometheus metrics on the number of instances being updated, the number of instances running a specific channel, a specific version, a specific
38:59 architecture. So if you host your own public update server you can get a view of what's going on in your whole fleet of instances by requesting metrics from the public update server. This is just what I wanted to say. That's actually a very good point. And it is a lot easier than it sounds because that update server doesn't serve the real files. It just serves metadata. And it has an operation mode where you can connect to an upstream update server. So what you can do is you can set it up in a way, and that's very lightweight,
39:30 where it serves your fleet and your fleet is configured to check in with your update server URL, but it receives all of its information, like new releases, all of the release channels and all of this information, from the upstream Nebraska that we provide. And you don't need to host any files yourself either, because the information about which files to pull is part of the metadata, and your instances will basically just pull the files that are on the upstream origin server as well. It will give you full insight though on how your updates are doing, like the version spread in your
40:05 fleet and basically the operating system details like that. So this is for the update path. For general node information, I would refer to node... Node exporter. Node exporter. Thank you. A Prometheus subproject. So you could basically pull this information and accumulate it in a way that you're used to. I use it a lot for these kinds of insights. And I mean, taking a step back, the update process for Flatcar and particularly for system extensions is so basic and so simple. It's very, very easy to extract these from the system. Alright. Awesome. Is there anything that you just want to
40:42 Operating at scale
40:46 go into that we haven't covered yet that you think would be interesting for the listener? There's one more thing. You mentioned operating at scale and day two ops. And particularly if you run larger workloads, what is interesting to know is, as we mentioned at the beginning, that we really ship a very, very small distro. We ship basically the things that you need to run containers. We don't even ship Kubernetes. That enables us to actually have a very thorough test suite. I think we have way more than 100 scenario tests, like complex tests that are like: create three nodes, install
41:22 Kubernetes, install NGINX, install a CNI, see if everything works. And comparable things: NFS tests are in there, networking tests, all of what you basically would expect. Because of this very limited feature set that we support in the distro, it's really easy to thoroughly and fully test everything we do. So our idea is that you can do minor upgrades or major version upgrades of the operating system on these nodes and your applications wouldn't even notice, because it's very, very likely that we have already covered the use case that you're running in our tests.
41:59 We are very fundamentalist with our tests. They run on all of the nightlies. Like all of the tests run on all of the nightlies. They run on releases, even on the alpha releases. Like we do not release an alpha that hasn't been fully tested with every test green. But since our test suite is fully automated and since it also supports a QEMU target, which is a very lightweight virtualization software, we run those tests on every PR that goes into the operating system repository. So we know very well the changes that are coming in, what they could cause, and
42:32 we see that very, very early in the process. So we're trying to make it very hard for issues to sneak in. And we have the stabilization process. Right? I already mentioned the alpha channel, which you can actually fully use because as far as we are concerned, alpha has been fully tested. There might be like, you know, half done features that are in alpha that aren't fully released yet. If you develop and you use an alpha image, it will always be fully functional. Like you can work on it, it won't explode because we want people that want to contribute to have a
43:04 solid foundation, right? Now, when something goes to beta, when a major release goes to beta, this is as stable as we can make it. We have at this point literally tested it hundreds of times. It has passed the whole test suite hundreds of times successfully. It's been through alpha, people have used it for development, and beta is a very good candidate for running on canaries in larger deployments, right? So assuming you have this big cluster and you have a few nodes that are well monitored, you just put them on beta. And what that gives you as the user
43:38 is you will always see what's coming down the release pipeline, right? So your canary nodes will upgrade, they will get the beta version. And if there is any issue with fringe workloads out there, edge cases, you'll see it on your beta. You check back with us, the upstream project, and we will never propagate a new release to stable that has known issues, right? We will always fix the issue in beta and then make sure that your workloads continue to work. And that's a very nice way to hook yourself into the stabilization process and say, Yeah, I have these beta nodes
44:12 so I know what's coming. And should there be anything wrong with these beta nodes, I can tell the upstream project so my stable nodes will always be safe and there is nothing coming to those stable nodes that will endanger my workloads. Alright. I am gonna take ten seconds to recap everything that we've covered in less than ten seconds. But, you know, I think the vibe that we're getting and the knowledge that you've given to everyone today is that Flatcar is a cloud native operating system that's lightweight, immutable, and makes all the right bets to give you safety and security
44:25 Recap
44:44 with automatic updates coming down the pipe to give them confidence that their production infrastructure is exactly how they need it to be. Their own applications, they layer on top, nothing changes. They should all be getting more sleep. They should all be happy. That's my ten second overview. Are we happy with that? Alright. Sounds good. We can definitely update the README on the getting started and use your quote. Alright. Now before we finish, because, you know, we're a group of Linux nerds to some degree, I'm just gonna ask a couple of quick-fire questions. If you don't wanna answer, you could
45:07 Quick-fire questions
45:17 say "skip". If you wanna answer, please feel free. Again, we'll go clockwise starting with you, Chewy. So what is your window manager or desktop environment? I use KDE these days. Alright. A window manager for me is basically a background image and 20 xterms. But for what it's worth, I'm using GNOME, but I'm not using it much because I'm mainly working in the terminal. I'm using Sway, the i3 for Wayland. That's good. Everyone's on Wayland at least, and one tiling window manager. So that's a good roundup. Alright. IDE of choice. We'll start again with you, Chewy.
45:52 VS Code, I'm afraid. I did use Emacs back in the day, but I had to give it up. Sorry. Vim. I still have it on my to-do list to check out Neovim, but for now I'm tempted to skip Neovim and go straight to Helix, but I won't say any more. And I'm using Neovim, but without all the sugar stuff, so it's basically Vim. Alright. Lastly, what AI tools, if any, are you using now? I'm using Copilot quite regularly. Not so much with code, but just to chat through ideas. I found it very useful there and just
46:29 confirming assumptions, that kind of thing. It's very useful. Yeah. Saves a lot of time just scraping the web trying to find answers. Sure. Yeah. Same, mostly for logos and images, sometimes for strange database query languages that I don't want to dive into too much myself, and yeah, Copilot. But it's obvious. Right? Because we get the subscription for free because we work. Yeah. I'm using Copilot or Gemini, but just for one specific use case: to generate regexes. Because every six months, I just forget how to build the previous regex. So, yeah, I just use it for this. Yeah. I think
47:11 it's safe to say that most people are bringing AI into the workflow to some degree. I must say I am trusting it more and more now to write code, whether that be Gemini, Claude, Copilot, any of them. I use them all because I think they all have tasks they're better for. But I think that allows me just to focus on the things that I enjoy, which is sometimes writing code and sometimes just prototyping stupid ideas. So, you know, each to their own. Thank you so much for taking time out of your day to sit down with me
47:38 to talk about Flatcar and just to share random things. Just because we're at 53, I don't really wanna keep you any longer, but if you want one more quick-fire question, I'm happy to throw one at you before we say goodbye. Yeah? Alright. And this one, I'll give you just a minute to think. One unknown or at least lesser-known Linux tool that you use that you think the audience may not be familiar with. And I'll give bonus points if it's a Rust rewrite just because I love Rust. So feel free
48:08 to take a minute. That's a tricky one. Does anyone have one they wanna start with? My mind's gone blank. Sorry. It's all good. I think the challenge is that you just, you know, kind of get used to your own toolset and you don't even know what's fringe or what's strange because you're using it every day. And then, more often than I would like, actually, I have people giving me feedback like, wow, you use this and that tool in this and that scenario. I've never seen that. I learned something. And then, oh, apparently that was
48:40 a, like, a fringe use case. But yeah. I thought you were gonna say, oh my god, you're still using that thing. Why aren't you using this? That does happen too. Yeah. Alright. If anyone has one, feel free to share. Otherwise, I'll just say that we're all done for the day. I think... Yeah. I think I found mine. So with Wayland, I had many issues with Sway doing screen sharing. So I discovered this project a few years ago now. It's called wl-mirror, and it's just a simple way to mirror your screen. So it's pretty
49:15 handy when you're just sharing your screen for a meeting or during a demo or stuff like that. You can easily mirror what you want to show. I'm not familiar with that. I'm going to check that one out. So it's wl-mirror. I just thought of one. Maybe not that obscure, but I use the guestfs tools quite a lot, like guestmount for mounting disk images or guestfish for poking about in them with a shell-like interface. I find it really... Yeah. I wasn't familiar with that either. So there we go. You're two for two. The pressure's on now.
49:46 I'm gonna have to just sit here and wait for the next hour till Thilo gets one. Actually, Mathieu reminded me of something that is a good conversation starter when you do presentations. So I've often used demos at presentations and that's basically just staring at the terminal. One of the presentations I had at Rejekts basically consisted entirely of demos. I didn't have any slides. And for that, I love to use cool-retro-term. And apparently, people don't know it. If you do not know it, check it out. It's awesome. Like, you really have CRT
50:19 kind of presets and, like, all of the jiggling and RGB noise, and you can play with stuff. So it makes for a really great terminal for presentations. But it makes your video recordings about 10 times bigger probably. True. Because you can have static noise in the screen background and then that makes the codecs kind of go wild. Yeah. Awesome. Three wonderful suggestions there. And that's us. Again, thank you so much for your time. I'm gonna push the big red button now, and then I'll set you free. Thanks for joining us. If you wanna keep up with us,
50:53 consider subscribing to the podcast on your favorite podcasting app or even go to cloudnativecompass.fm. And if you want us to talk with someone specific or cover a specific topic, reach out to us on any social media platform. Until next time when exploring the cloud native landscape on 3. On 3. One, two, three. Don't forget your compass. Forget your compass.