About this video
What You'll Learn
- Perform Kubernetes control-plane triage by checking manifests, logs, and service health to diagnose startup failures.
- Trace scheduler behavior when apps are stuck Pending by inspecting node resources, manifests, and kubelet logs.
- Resolve etcd encryption and Postgres connectivity breakages by validating encryption providers, key types, DNS, and startup commands.
Matt Turner and Borko debug two broken Kubernetes clusters live. Issues include a missing API server, etcd resource limits, a decoy DaemonSet, DNS policy, AppArmor blocking kubectl, etcd encryption misconfig, and a malicious Postgres startup command.
Jump to a chapter
- 0:00 Holding screen
- 0:54 Introduction & Episode Overview
- 1:12 Channel Housekeeping & Discord
- 1:58 Sponsor Mention (Teleport)
- 2:42 Guest Introductions
- 4:03 Debugging Matt's Cluster: Initial Checks
- 6:11 API Server Not Running
- 6:55 Checking API Server Manifest & Logs
- 10:40 Fixing Etcd Resource Limits
- 11:40 Control Plane Stable, Checking Application Pods
- 14:15 Identifying and Deleting Decoy DaemonSet
- 21:10 Investigating Clustered & Postgres Pods
- 24:51 Ephemeral Storage Warning on Node
- 33:56 Checking Worker Node & Debugging Disk Issue (Matt's Reveal)
- 36:12 Postgres Pod Pending (Scheduler Issue)
- 36:35 Bypassing Scheduler with NodeName
- 37:51 Clustered App Cannot Connect to Postgres
- 39:05 Investigating DNS Issue
- 42:39 Fixing Kubelet DNS Policy
- 44:30 Matt's Cluster Fixed & Explanation of Breaks
- 48:50 Debugging Borko's Cluster: Initial Access Issue (kubectl)
- 51:10 Investigating kubectl Execution Error (strace, permissions)
- 56:15 Identifying AppArmor Restriction
- 58:45 Control Plane Components Flapping
- 1:01:00 Investigating API Server Flapping Logs
- 1:03:00 Identifying Etcd Encryption Configuration
- 1:05:00 Debugging Etcd Access & "Unable to transform key"
- 1:16:30 Identifying Unencrypted Data in Etcd
- 1:22:30 Fixing Etcd Encryption Config (Adding Identity)
- 1:25:04 API Server Becomes Stable
- 1:25:25 Debugging Postgres Authentication Error
- 1:28:50 Identifying Malicious Postgres Startup Command
- 1:29:15 Fixing Postgres Startup Command
- 1:32:28 Borko's Cluster Fixed
- 1:32:40 Wrap Up & Explanations of Breaks
- 1:34:34 Final Words & Thanks
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:54 Introduction & Episode Overview
0:54 Hello, and welcome to Rawkode Live on the Rawkode Academy. Today is Clustered. We have a couple of broken clusters and a couple of great guests joining us today as we do some live debugging to get these clusters back online and working. We have a Clustered application written in Rust, with a small database, that we'll do our best to fix. Now, a little bit of housekeeping before we get started. Please subscribe to the channel, like, comment, and share the videos as well. This helps other people find them. It keeps the YouTube algorithm happy
1:12 Channel Housekeeping & Discord
1:29 and just means that more people get to enjoy and learn together. We also have some membership packages available on the academy that allow you to support the channel, so please feel free to check them out. And if you have any questions, then you can jump into the Discord and ask them. I'm more than happy to tell you what we are doing. The Discord is available at Rawkode.chat. There's nearly 600 of us in there now talking all things cloud native, Kubernetes, eBPF, and everything in between. So come and say hello, and I look forward to meeting you.
1:58 Sponsor Mention (Teleport)
1:58 All right, last thing, our sponsor. Teleport have been sponsoring us now for the last two months, and I've got to thank them a whole lot. It's the easiest decision I ever made when it came to Clustered because we use Teleport every single week. We've been using Teleport since the first episode. It allows us to secure access to the clusters as well as share and pair in the same session, which you're going to see us doing a whole lot today. It also allows you to expose those applications securely through the proxy, which you'll also see in action. So if you want to support
2:30 Clustered and Rawkode Live, check out Rawkode.live/teleport. It's a UTM link, but I would really appreciate it. And, of course, go install Teleport on all your clusters. You'll thank me. Alright. Let's jump over and say hello to today's guests. I'm joined by Borko and Matt. Hey, guys. How are you? Good. Thanks. Suitably nervous yet? No. Of course not. These are pros. These are pros. Alright. No. No. It's on both sides, I think. Alright. Could you both do us a favor and introduce yourselves? We'll start with Borko there and then move over to Matt. Hi. My name is Borko.
2:42 Guest Introductions
3:09 I'm a big fan of the Clustered series here. I think it's a great resource for people learning Kubernetes or preparing for CKA. So I'm really humbled to participate. I know I've learned a lot through this series. I hope others find value here as well. Thank you very much. Matt? Yeah. Hey. I'm Matt Turner. Some kind of thing. Done a fair amount of Kubernetes, I guess. I got my CKA, which was quite a stressful thing. I think this is gonna be worse. I haven't watched any of these because I didn't want to
3:46 cheat. I mean, I didn't wanna end up copying somebody else's. It's a difficult time for me. I didn't wanna end up just, you know, copying anybody else's breakages by accident. It's hard to get them out of your head. So, yeah, I'm completely new to this, but we'll see. Alright. Don't worry about it. It's all in good fun. So I see Russell has a beer. Enjoy, mate. Pop has joined us. Rawkode, he's putting a lot on your shoulders. He wants you to make me cry. He promised Tim Hortons coffees, so I'm
4:03 Debugging Matt's Cluster: Initial Checks
4:18 gonna try not to disappoint. Alright. Perfect. Well, I have Matt's cluster in front of me. So we are gonna jump over to our screen share. I am going to open a session onto the control plane node, and, Borko, I'm gonna ask you to join and give us an echo hello to let us know that you're here. We're in the same terminal, and then we will do our best to get this application running as quickly as possible. Let me start our timer. Feel free to set up the kubeconfig, and we will check if we have a control
4:53 plane, which is usually a pretty good first step. So I don't know how much I should troll you, although I should just shut up. My worry here, to be honest, is I haven't broken it hard enough for two pros like you. I wasn't sure how close it would go. This might be very quick, but then we get to move on to Borko's super hard, super broken cluster, and it's all good. The quicker it is, I mean, the sooner I get my dinner. Usually, I eat dinner at 8PM on a Thursday. So,
5:17 you know, if I can get up there, I'm not gonna complain. Alright. Because I put a list of what I'll go through at the end of, like, four or five ways I tried to break it that didn't work because I sort of, you know, came up with them from first principles, but Kubernetes is too resilient these days, and they actually didn't have the effect I expected. So that was quite interesting. Alright. Well, it looks like we do not have a control plane. That's the correct IP, I guess. It definitely looks correct. I mean, I can't
5:50 say for sure, but I would argue that, yes, that is probably the correct IP. And the port seems okay. Would you like me to confirm somehow? Okay. Well, we don't have a server running. Yeah. Or okay. So well, I just Nice reminder in the chat from Kevin that there are only three rules. Don't play Teleport, don't play with editing, no eBPF. Although, the latter two are unofficial. I was gonna say. Through rage induced by me on previous episodes. Okay. So I guess cube controller seems to have been played around with, But So we have the API server running, and
6:55 Checking API Server Manifest & Logs
6:57 you've jumped straight into the static manifest. Right? Yeah. I'm just going to see if anything seems off here. Same thing, Jamfi, too here. No. That looks like a perfectly normal static manifest for the API server. I'm pretty happy with that. Okay. I guess we can check systemd. Yeah. Let's confirm if we have a kubelet. Russell's also saying there's a rule. Yeah. That is a real rule. We have not invited that guy back after episode two, nor will they come back. I see an API server there. Maybe it's just flapping? Yeah. Okay. Might wanna check out the
8:17 logs perhaps. Interesting. Okay. So it seems to be restarting. Okay. Pause. Having known Matt for a wee while now, I would not be surprised if there's just something doing a kill -9 every three seconds. Yeah. That will you you know, no no UTF eight, non blue cap. No no product, and, yeah, I mean, the heavy metal approach is is just something containers Big cap. Would be your best bet. I know you. It's more back here. Simple but effective. Wait. We don't have /var/log? I have Varlog single. Yes. Typing it wrong. Okay. HCT. Okay. There's always a panic when I don't see
9:53 etcd running. Yeah. It definitely shouldn't have lost its data. Like, I haven't done anything like that to you. When you get it back, you should be able to upgrade in place. Yeah. I'm gonna have to find a way to disable that, but there's always the remembered Vim cursor every time we open a file and it starts halfway through. I killed my bash history, but I'd forgotten about that. Damn. Every week, we see it. You're like, oh. Hacks. Yeah. I didn't think of that. Obviously I'm not a hacker, am I? I
10:35 cleaned up my bash history. All good. What do we have? I mean, this doesn't look right. Yeah. That 1m, actually, it is never gonna be able to start with that. That's for sure. I can see a smirk on your face, Matt. It wasn't actually crashing when I left it. I wanted to just squeeze it. I should have squeezed it to just the point where it was timing I mean, I would just remove the limits, to be honest, but we don't need them. Never follow that advice in production if
10:40 Fixing Etcd Resource Limits
11:21 you're watching. But, you know, for the sake of Clustered, I'm happy for us to delete a little bit. Alright. We may just wanna give it thirty seconds, make sure it's still running, make sure API server is up. It's always fun to see which flags people use with ps. Mhmm. Everybody's got their own little unique take on it. Oh, come on, Russell. Six minutes thirty seconds in and there's already a Rawkode smash of just deleting stuff from manifest. Yeah. They like to delete stuff. But etcd dropped again. Right? Well, API server is not starting up.
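A note on why that limit is fatal, assuming the "1m" on screen is a CPU limit: Kubernetes reads the `m` suffix as millicores, so `1m` is one thousandth of a core. The sketch below is a simplified, hypothetical parser (`cpu_to_cores` is not a Kubernetes API, and it handles only plain and milli-suffixed values) that makes the scale concrete.

```python
from fractions import Fraction


def cpu_to_cores(quantity: str) -> Fraction:
    """Convert a Kubernetes-style CPU quantity to cores (simplified sketch).

    Only two forms are handled: plain numbers ("2", "0.5") and
    millicores ("100m"). Real quantities support more suffixes.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # "m" means milli: 1000m equals one full core.
        return Fraction(quantity[:-1]) / 1000
    return Fraction(quantity)


for q in ("1m", "100m", "0.5", "2"):
    print(f"{q:>5} -> {float(cpu_to_cores(q))} cores")
```

At a thousandth of a core, a process as busy as etcd is throttled almost continuously, which is consistent with the restarts seen here.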
11:40 Control Plane Stable, Checking Application Pods
12:33 But that's oh, did you Was etcd still running? Well, I assumed it was. Oh, it's okay. Seems to be running. Okay. So let's check the logs again. Yeah? Yeah. Yeah. Let's do it. Let's see. There is multiple ones. That's from five minutes ago, but I think that's the other log. K. And kubelet's running. etcd is running. Oh, there we go. Server. There we go. That was there we go. Okay. So we have one worker that's now working. That's good. But we have control plane that seems to be working and one worker that is okay.
14:14 Let's check a few things. Already checking for malicious policies. I mean, I would maybe start with the get pods. I'm just throwing that out there. Maybe it's running. Maybe Matt forgot to break it. I guess I'm just You're just assuming the worst. Right? Well, just kinda check what's next. It's kinda difficult to know what to do because I could've just rm -rf'd the disk. Right? But it didn't seem to be in the spirit of the rules. So It's against the rules. What's against the rules? Sorry. Deleting the disk because that would kill Teleport.
14:15 Identifying and Deleting Decoy DaemonSet
15:22 Oh, yeah. Actually, you know, I could have done something fairly dramatic, I guess. Man, I'm not doing very well here. He's trying to delete all the pods. I don't need to send I know. That was just a complete mistake. I'm, like, thinking I have to smash everything. Okay. So we could ignore ambassador mostly. Okay. We can see that our clustered pod has completed, which is not a good sign. And our postgres are stuck in ContainerCreating. And we got some silly and flaps too. So Oh, our controller managers aren't running
16:07 either. Okay. Okay. Let's controller manager. Yeah. I think the controller manager is quite important. We need that running for everything else to reconcile. Ah, you spotted it? I'd hoped you'd get to the point of running the update, changing the deployment, and then being like, where are my pods? Where are my pods? Yes. Okay. Bear with me. I'm 54 clusters in at this point. You think I'm gonna miss that? No. I'm sure I'm not the most creative person. So that looks okay. And when we ran the get pods, it was actually in a
17:00 status of Pending, and Pending really only means one thing, especially for a static pod which doesn't have access to a lot of the Kubernetes primitives. Did we have a scheduler running? Yeah. Yeah. We may want to describe that controller manager then and see. Oh, wait. It's running now. What's Eleven seconds. That wasn't me. What are you doing? There's, like, a lot of things restarting. So I wonder if there is something that's causing them all to restart. Yeah. Old forgotten LimitRange. Only known by OpenShift operators. Mhmm. Yes. Can you run get pods all again? I'm
18:35 just curious if we have more things restarting. Okay. That controller manager does appear to be alright. Was it just maybe etcd? Like, I don't know. The fact that it was down. So, yeah, it could be etcd related. So let's assume it's okay just now, which means we probably wanna focus on our clustered and Postgres applications. So, I mean, our clustered is complete, and we have one pending. What's this web six x? I think that's a decoy. What do you mean by decoy? Like I think Matt has just deployed them to confuse us.
19:40 Russell asks, how can the clustered pod be complete? So it generally means that the process exited with zero. So it's either not our image. It's been swapped out. Yeah. Hard to tell right now. Was that the web one you described there? Yeah. Yeah. Okay. So that's just an nginx pod, I think. I would suggest passing it. Just doing the big delete button on it. Nothing good ever came from random workloads in my cluster. I mean, it's okay. I'm gonna delete it, but I'm wondering what responded. Well, the pod name was interesting. Right? It wasn't a pod tuple.
20:45 It's more of a DaemonSet tuple. I think he's just deployed a DaemonSet called web. Yeah. There we go. Okay. Let's delete that. Yeah. Okay. Matt, you look sad about that. That was my Tumblr. Okay. I mean, we don't see any more restarts, I don't think. So that's yeah. You're right. So let's look at let's look maybe the image has been changed. I think we're thinking that too. Okay. So I guess I did sts. What was it? Okay. Borko's flying through this. I could just go have my dinner and come back, and you'll have
21:10 Investigating Clustered & Postgres Pods
22:17 it fixed. I don't know. I'm scared of that that you see it's coming before. Alright. Well, we're just about halfway through our allotted time. So you've got just over twenty minutes. You're doing well. You're doing very well. So I don't am I missing something? I am not seeing here requests or limits. Did they just pass through them? You are correct. There's nothing in that manifest that's setting the limits. So what else could be modifying them? Another common thing, sensor to admission webhooks or something. Yeah. Check the mutating webhook configuration. Me? They're a breaker favorite on Clustered.
23:18 Okay. They are. Got the. Oh, yeah. eBPF for the control plane one. We've had one eBPF break in the bar that was brutal, absolutely brutal. Yeah. Know. Okay. So there's not a mutating webhook. So I'm not a terrible person. So what are you looking for now, Borko? Can't you specify static webhooks like here? Oh, yes. You can. So but there's only the enabled admission plugins. So it's only NodeRestriction, which is okay. Yeah. I don't see anything. So I'm just thinking now what are let's take a look at that here again. What was the error?
24:30 And you checked for quotas and limit ranges, didn't you? And there was none. Yeah. Okay. So can we see the request and let me scroll up. I actually the node was low on resource: ephemeral-storage. This may be something on one of the worker nodes instead. Yeah. Okay. So let's get let's see something here. Want me to open a new session, or do you think you can do it from the control plane? One sec. So Okay. So it's scheduled on worker one. Worker one is ready. What if we just, like, delete it and see if it gets rescheduled to worker
24:51 Ephemeral Storage Warning on Node
25:43 two? Is that gonna work? Yeah. Go for it. I mean, you could always edit oh, well, we can come to that later if we need to. Let's do it this way first. Yeah. Kevin says one of the workers was not ready. Maybe it got scheduled there. It appears Matt's worker has fixed itself. I'm not sure how. No. I'm not sure how. They're surprisingly resilient things. Oh, is it gonna what state was it in? Oh, okay. Okay. Naveen's pointing out the and are under resource issue, which we're looking into now. I was really tempted to break one.
26:31 I was really tempted to break the finalizers just to really annoy you when you're trying to delete things, but I didn't get around to it. Sorry. I've got a nice alias for that now called nuke from orbit. So that patches out all the finalizers. Let's edit the deployment, Borko, and just add nodeName to the spec. And Well, it's running now. Oh, is it? Yeah. I'm not sure why, but it seems to be okay. So Alright. Let's just go to cluster. The other ones and see what happens. Yeah. Go for it. Why not?
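The "add nodeName to the spec" suggestion amounts to setting `spec.template.spec.nodeName` on the Deployment, which places the pod on a named node without involving the scheduler. A minimal sketch of that edit on a manifest held as a Python dict follows; the helper `pin_to_node`, the image tag, and the node name `worker-1` are illustrative assumptions, not values taken from the cluster in the video.

```python
import copy
import json


def pin_to_node(deployment: dict, node_name: str) -> dict:
    """Return a copy of a Deployment manifest with the pod template pinned to a node."""
    patched = copy.deepcopy(deployment)
    # nodeName lives on the pod spec inside the template, not on the Deployment spec.
    patched["spec"]["template"]["spec"]["nodeName"] = node_name
    return patched


# Hypothetical manifest, trimmed to the fields that matter here.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "clustered"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "clustered", "image": "example/clustered:v2"}]
            }
        }
    },
}

patched = pin_to_node(deployment, "worker-1")
print(json.dumps(patched["spec"]["template"]["spec"], indent=2))
```

Hard-coding a node is a debugging shortcut: it proves whether scheduling is the blocker, but it also removes the scheduler's resource and affinity checks.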
27:14 See, delete and fix these everything. I know my chat are laughing at this, but it does work. I mean, we're gonna have to find out what the breaks were because I feel like we're not really fixing some of these issues. Like, I don't understand what fixed the Yeah. We'll we'll see what happens. Running. Okay. Right. So when I upgrade it to version two. Yep. Okay. I think I'm gonna have to describe that. Well, let's just I'm guessing the one that's running is the old one. Unscheduled with no events. I'm a little confused. How's the bag? I'm just I'm I'm trying
29:25 to understand what's happened here. Okay. So we have one running. Is our scheduler flapping again? Yeah. We'll we'll take a look. Just so this is region one. We do have a schedule there. Something's happening with the resources, I think, and they're I wonder if I just do this. It's Dean, the only one that works. That's brave. It came back as well. It did. It did. I assume this should be v two, I hope. Oh, no. No. Because it's the replica set for the old one is still active. So we have an unhealthy new replica set with v two,
31:02 which can't be scheduled. So the old replica set is continuing to schedule pods. Now our new replica set and the pods created from it are currently pending. Now, that usually means that the scheduler hasn't assigned a node or a volume can't be mounted or there are resource constraints. Normally, we would see something in the events for that, and we are not. My intuition is telling me interesting. Maybe look at the scheduler manifest. I don't know if we did that yet. It started at the top, so I'm gonna trust that he's not modified it. Maybe that first one was a red herring.
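One cause that fits this symptom (Pending, with no scheduling events) is the one Matt describes later in the episode: the pod's `schedulerName` points at a scheduler that does not exist. The stock scheduler is named `default-scheduler` and only handles pods addressed to it, so a pod naming anything else is never examined and nothing is recorded. The toy model below sketches that selection logic; the `Pod` class, `pods_for_scheduler`, and the pod names are hypothetical and this is not scheduler source code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    name: str
    scheduler_name: str = "default-scheduler"  # value used when none is set
    node_name: Optional[str] = None            # None means not yet placed


def pods_for_scheduler(pods: List[Pod], scheduler_name: str) -> List[Pod]:
    """Pods a scheduler will consider: unplaced and addressed to it by name."""
    return [
        p for p in pods
        if p.node_name is None and p.scheduler_name == scheduler_name
    ]


pods = [
    Pod("clustered-old"),
    Pod("clustered-new", scheduler_name="default"),  # the overridden name
]

claimed = pods_for_scheduler(pods, "default-scheduler")
orphaned = [p.name for p in pods if p.node_name is None and p not in claimed]

print("claimed by default-scheduler:", [p.name for p in claimed])
print("never considered by any scheduler:", orphaned)
```

In this model the second pod is simply invisible to the only running scheduler, which is why there would be no failure events to read.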
32:02 Maybe I am an elite hacker. No. Our mission is, I mean, we don't really need to fix anything that Matt's done. Like, we can bypass the scheduler, which I'm tempted for us to do just to see if it works, and then we can always try and work out what the issue is. So it's your call, Borko. I'm just trying to think. It seems like something's happening with the resources. I'm trying to think what are the ways you can manipulate resources, either available to Kubernetes from a node or but, like, we checked limit.
32:46 Did you describe worker one? Like, describe the node for the API server? I don't know if we did that. Let's check this and let's look at some of the logs. And then if we don't find something quickly there, we'll try, I guess, your way of just bypassing the scheduler. Yeah. Look at that. We may have a disk being filled somewhere. And if I remember correctly at the start, worker one was not ready. So sorry. And and, like, class, what did you where are you looking at here? So we've got a warning about the disk space,
33:56 Checking Worker Node & Debugging Disk Issue (Matt's Reveal)
33:59 although it is a couple of hours old. Oh, okay. But it does seem to be okay as of nineteen minutes ago when we like, because that was a not ready node, and now it is ready, which is particularly strange. But that was the worker, not the control plane. Maybe we should jump on to worker one and Oh, right. Walk around? Yeah. Let's let's do that. Okay. I've opened the session. Feel free just to jump in. We've got ten minutes. But I see we have we we snoop around for five more, and then we bypass the scheduler
34:40 Just because it'll kill me if I see Matt wins. Should've brought a wee dram to just it's almost 6¢. So if you run a df, does the disk look okay? It does indeed. Do you wanna add a -i just to check the inodes? Yes. Signed. Okay. I'm assuming you had something running on that node at some point, Matt, that was doing something nefarious with the disk. Right? Yeah. Yeah. I'll, yeah, I'll give it away. I was that's why I think access to node two is lost. So I tried to break both nodes. I logged
35:34 out of two and couldn't get back in. I realized I'd broken it a bit too hard. I still had a shell on node one, so I undid the thing that was in the. Okay. Yeah. So that actually is a red herring. I actually can't explain why the thing's still having issues. Well, I mean, I did another couple of things on the nodes, but it's not I mean, it's not that. Okay. So I'm confused. Are you saying that you're expecting this replica set issues that we're having that that's a different problem?
36:09 I can't remember exactly why. What would we add up Yeah. I think you are expecting yeah. I think you are expecting what you're seeing. Yeah. Alright. Will we bypass the scheduler and then work it out? Or you really wanna work it out? I can tell. Okay. No. I think can you take it over? I'm kind of, like, stuck. So I'll, like Well, let's edit the clustered deployment. Right? And we'll jump down to the pod spec. Straight out of the CKA, manual scheduling. We're just gonna do a nodeName. One two one six five. Because it's a
36:35 Bypassing Scheduler with NodeName
36:48 So you wanna schedule it to a specific, like, different node? Yeah. I'm just gonna hard code it so we don't need the scheduler. Like, I'm assuming that's why it's pending. One, though. Sorry? But wasn't it already on worker one? No. It's not scheduled to a node yet. The new one isn't Okay. I think No. See, now it's ContainerCreating. Is completely wrong. There was something the scheduler wasn't assigning a node to it, and I don't know why. But, like, we can just bypass it and see if we get it working. And then,
37:20 like, yeah, now we I I yes. Now I I for some reason, I was thinking it wasn't like, there was something wrong with that node, but I I yeah. I I missed missed it wasn't even being assigned. Yeah. Of course. I mean, I have no idea why the scheduler wasn't doing that. But it was for the old rep, like I said. It's really, like yeah. I'm I'm not entirely sure what happened. Shall we test our clustered application? Yep. Can you hear me say? Aye. Alright. So we cannot speak to Postgres. So we're not done yet, and we have
37:51 Clustered App Cannot Connect to Postgres
37:59 seven minutes to go. Okay. I wonder which one of my hacks did that because this was all done in, I'm not gonna lie, quite a compressed time frame. I I I threw a couple of things at the wall. Like, I'm glad one of them worked. Sorry. The the connection string is hard coded here. Yeah. It just tries to speak to the service Postgres. So you wanna make sure that we have that service that has endpoints, etcetera, just, you know, basic and and see. So we have postgres We do have an endpoint. Yeah. And then So that's
38:52 that looks okay. Right? Yep. You've just been paged. It's 3AM. We're losing a million dollars a minute. Go. Go. So okay. So so the error message is it can't resolve the host name. Okay. It may be DNS. In fact, this chat, it's everybody's saying DNS. Russell, it's always DNS. Naveen, it's almost DNS. Kevin, even if it isn't DNS, it's always DNS. Why is it always DNS, though? Because it sucks, and you should all use Istio. I've tried it. I can't I can't get it installed. Yeah. I had to I had to get it in there. Yeah. You can be driving
39:05 Investigating DNS Issue
39:51 up RAM. Is running? Now if you've given me a cluster of this tier, and we could have broken some things. I'm, fatty fatty pride at this one. That's correct. Right? Yes. So we see ready. We see Kubernetes cluster.local. I believe that looks alright. Yeah. Good call. So we do have two Kubernetes pods running. Suggestions? I don't know. Could be something on the node. So my suggestion would be, let's get inside clustered, get some debugging tools and see what the hell we can work out. Bash. Yeah. And then I have to every week, I need to try and remember the name of
41:15 the package. I think it's dnsutils, no space. Oh, for dig. Yeah. I always forget. Yeah. I wanna I dig. Yeah. It's different on every distro. I always forget. There's commandnotfound.com, which is really useful. You just put in the name of the CLI tool you want. And for every distro, it gives you the package name and, like, BSD and Darwin as well. It's super useful. Okay. So what dig is telling us here oh, we're not using cluster DNS. This is the Equinix Metal DNS resolver. What's the IP address? We have it up here. Right?
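The same conclusion dig gives here can be reached by reading the pod's /etc/resolv.conf and checking whether the cluster DNS service IP is listed as a nameserver. A minimal sketch follows, assuming a cluster DNS IP of 10.96.0.10 (a common kubeadm default, and apparently the address read out next) and made-up upstream resolver addresses from the documentation range; `nameservers` and `uses_cluster_dns` are hypothetical helper names.

```python
from typing import List

CLUSTER_DNS_IP = "10.96.0.10"  # assumption: the cluster's DNS service IP


def nameservers(resolv_conf_text: str) -> List[str]:
    """Extract nameserver addresses from resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comment lines
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def uses_cluster_dns(resolv_conf_text: str, cluster_dns_ip: str = CLUSTER_DNS_IP) -> bool:
    """True if the cluster DNS service is among the configured nameservers."""
    return cluster_dns_ip in nameservers(resolv_conf_text)


# What a pod typically sees when it resolves through cluster DNS.
cluster_first = """\
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
"""

# What a pod sees when it inherits the node's resolvers instead (made-up IPs).
node_inherited = """\
nameserver 192.0.2.53
nameserver 192.0.2.54
"""

print("cluster-first pod uses cluster DNS:", uses_cluster_dns(cluster_first))
print("node-inherited pod uses cluster DNS:", uses_cluster_dns(node_inherited))
```

A pod in the second state can resolve public names but not service names such as `postgres`, which matches the error the application reports.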
42:02 I'm gonna as we see, 10.96.0.10. Yeah. Can remember. But we can try and force the Okay. So it did answer us and that does work, but I don't think resolv.conf is using it. So he's I'm super impressed you found that. Buried in the fourth line from the bottom. So Very, very good. Oh, it's the kubelet configuration. Right? You want me to keep driving this, Borko? Do you wanna No. I'm a little not following actually what happened here. Yes. So I'm gonna go on to the worker node. Right? And we're gonna
42:39 Fixing Kubelet DNS Policy
43:15 do a cat of kubelet, and there's a bunch of configuration files we can use. And one of the things you can do with the kubelet is change the DNS configuration. If I can remember how here. I don't know if that's right or not. And cluster DNS is here. What I thought he changed here hadn't changed, but I feel like I'm close. Ever since then So I guess, I mean, we can check /var/lib/kubelet config. Maybe there? I mean, he can't have just done that really shitly. Excuse my language. Yeah. There we go. There's DNS policy of Default. I don't know
44:01 if that's correct, but I feel like it should be ClusterFirst. And this is Have I made that up? Thirty seconds. We must still restart it, don't we? Super impressed. This configuration won't take effect. Super, super impressed. Yeah. The pod had rescheduled. The old one was still terminating, but it was enough for the service to run. So you changed the DNS policy from ClusterFirst to Default. Right? Yeah. Because the default because this bit me once, and I lost, like, a day to it. The default is not Default. I thought that would pass the eyeball test. I
44:30 Matt's Cluster Fixed & Explanation of Breaks
44:50 thought you'd read that. I think the default was what it was meant to be set to. Did you think Default would be the default? But it's not. It's ClusterFirst. Yeah. Super impressive. One of the reasons they weren't scheduling is because the scheduler is called default-scheduler, and I overrode the scheduler to default. So it would have been trying to use a nonexistent scheduler. That's not the only reason either. But There we go. Did you do anything with the resources, or that was, like, completely off there then? Yeah. That was maybe a bit of a
45:21 red herring. So I tried to fill the well, I did fill the disk on both workers to get that disk pressure. This cluster doesn't have auto tank turned on, so I thought it would fill the disk. It would get the disk pressure condition, which it did, and then take the nodes and evict everything. That didn't work because it must not do that by default. But I did it by dumping a load of stuff into a file. And on the worker nodes, if you run tmux attach, you'll find that hidden in the background, there's just like a
45:51 a Python interpreter holding the file open. So I made a massive file, opened it, and then deleted the file. So if you use du, you will not find the file that's filling the disk. The problem was because I filled it right up to the brim, you couldn't fork a new shell, so I broke SSH access. And like I said, I managed to undo it on I thought that was unfair, so I managed to undo it on one of them but not the other because I'd already logged out. Okay. Well, it's great. Yeah. If you there's also an anti affinity
46:20 between the clustered pods, which I think is another reason it wasn't scheduling based on a like, a global one. So based on it, you can use any label as the topology key, and you're meant to use region or zone. I just used OS. So because they're all Linux, right, it tries to bring the new one up before it turns the old one off, but it can't do it because there's no applicable nodes to put it on because they're all running the same OS. Alright. Well played, Matt. But we tried to get it fixed. Solid team effort from
46:48 Rawkode and So we're gonna swap places now. So I have let me jump back over to our screen share. I have Borko's cluster here. Matt, please join the session I'm about to open. There we go. Just an echo hello. Let me know that you're there. Borko, feel free to sit back and relax and enjoy. Awesome. Thanks. It's fun. My invite code's expired. David, just want to get another one. You're supposed to register before the episode, Matt. Come on. Wait. This is fine for four hours. I only got the link two hours ago. Oh, yeah. It should be good then. Yeah.
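Matt's disk trick from the explanation of breaks (write a big file, keep it open in a Python interpreter, then delete it) can be reproduced harmlessly at small scale. The sketch below assumes a POSIX system such as Linux, writes 1 MiB into a temporary directory, and uses a hypothetical helper `hold_deleted_file`; it is an illustration of the mechanism, not the script Matt ran.

```python
import os
import tempfile


def hold_deleted_file(directory: str, size_bytes: int):
    """Create a file, fill it, unlink it, and return the path and the still-open handle."""
    path = os.path.join(directory, "filler.bin")
    handle = open(path, "wb")
    handle.write(b"\0" * size_bytes)
    handle.flush()
    # Remove the directory entry. The data blocks stay allocated
    # for as long as some process keeps the file open.
    os.unlink(path)
    return path, handle


with tempfile.TemporaryDirectory() as tmp:
    path, handle = hold_deleted_file(tmp, 1024 * 1024)
    visible = os.path.exists(path)      # what a directory walk (du, ls) can see
    info = os.fstat(handle.fileno())    # what the kernel still tracks for the handle
    print("visible in directory:", visible)
    print("link count:", info.st_nlink)
    print("bytes still held by the open handle:", info.st_size)
    handle.close()                      # only now is the space released
```

Because du walks directory entries, it cannot account for a file in this state, while df still counts the blocks against the filesystem. On Linux, `lsof +L1` is one way to list open files whose link count has dropped to zero.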
47:34 The email I oh, never mind. Doesn't matter. Oh, did I have to register for bucket list as well? Okay. I thought I'd registered once, and I could just connect, so I didn't yeah. I didn't know anything. Well, I guess this is a nice opportunity to show everyone how easy it is to add new users to Teleport. Alright? Users add, roles admin, logins root, empty. Let's hope you get to this before anyone in the audience does. Otherwise, they're pairing with me and you get to go home. I stuck the link in our private
48:04 chat just to pop that open. I'll just clear that. There we go. The chat on An e cam. Yeah. Oh, that's the one. Yeah. I should have started the timer when you told me the link was broken, shouldn't I? Yeah. You can see it. I think it's good. Let's see what we've got in the chat. Russell with mission impossible level timing. Nice work. Yeah. We did come far too close. I've been saying the same. Super impressed. Kevin, have a celebratory beer. I I wish I don't actually have any. I will not make that mistake again.
48:43 No. I should have brought one. It would have taken the edge off. Clustered Rawkode. Not quite consider myself lucky. Alright. Managed to register yet? Yeah. I thought I was in. Did you join my active session, or did you open your own? Correct. I'm not gonna be very good at this, am I? You'll be fine. Let's just get in the same session. Check out activity, then active sessions. There we are. Cool. I thought I could show the audience that. So you join sessions on Teleport by clicking activity, active. And now I can see that Matt opened
48:50 Debugging Borko's Cluster: Initial Access Issue (kubectl)
49:26 one. Okay. Yes. Feel free to join me. Hi, Drew. Awesome. Alright. Thank you. So you're gonna have to export a kubeconfig, check for the control plane, and I am gonna start our timer. Okay. Use whatever command you wish. I do appreciate the k alias, though. Uh-huh. Okay. Interesting. We're root. It's allegedly readable by its owner, and there's no dot saying there's an SELinux context or any extended attributes on it. Well, that's interesting, isn't it? I mean, I would try lsattr just in case. But I thought you would get a little thing on
50:29 yeah. But okay. It's a fair point. I haven't Well, yeah, I'm gonna look it up. I don't remember what the e is. Yeah. Do you mind? I haven't heard of them. I haven't used the subsystem for a while. Oh. But we we can count it in QC. Okay. Interesting. Can we install packages onto this thing? Oh, tell me what the letters mean. You can always rely on the Arch Linux wiki. That's exactly what we need. Best thing. E is extent format. So I think that's actually safe. So I don't think we need to worry about that.
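For reference, the attribute letters being looked up here can be decoded mechanically. This is a minimal sketch, assuming `lsattr` prints a flag column followed by the path, and using only a small subset of the letter meanings from the chattr manual page. The path in the sample line is an assumption (the kubeconfig the hosts refer to later as admin.conf), not something shown on screen.

```python
# Subset of ext2/3/4 attribute letters as documented in chattr(1).
FLAG_MEANINGS = {
    "a": "append only",
    "d": "no dump",
    "e": "extent format",
    "i": "immutable",
}

def decode_lsattr(line: str) -> tuple[str, list[str]]:
    """Split one lsattr output line into (path, [meaning of each set flag])."""
    flags, path = line.split(None, 1)
    meanings = [FLAG_MEANINGS.get(ch, f"unknown ({ch})") for ch in flags if ch != "-"]
    return path.strip(), meanings

# Hypothetical output line; only "e" is set, matching what the hosts found.
print(decode_lsattr("--------------e------- /etc/kubernetes/admin.conf"))
```

A lone `e` only says the file uses extents for block mapping, which is why it was safe to rule out; an `i` or `a` would have been worth chasing.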
51:10 Investigating kubectl Execution Error (strace, permissions)
51:30 Are you running an strace? Yeah. Why don't we see what he's doing? He's obviously doing some low level Linux nonsense. Resource temporarily unavailable. Yeah. What is You wanna check the mount points? Oh, maybe. Yeah. Could have given us, like, a weird mount. Read only file system over the top. That's odd. Can we see the call to read the environment? No. I can't see the access to the environment variable or any stat or any open call on the file. It doesn't So let's do the basics first before we go into the deep dive stuff.
52:31 Like, is our kubectl really kubectl? Yeah. I think it's I no. I think it's forking. Think it's forking. ELF binary. It's a Go, but ELF, binary. It does look alright. Okay. It was put there at a plausible time. It's a plausible size. Yep. What have we how could you have broken it using this? Like, put a read only file system over the top? That was one of my thoughts too. Yeah. Yeah. It's mounted remount-ro. But that's on errors. Right? So that's if there's a file system error, it shouldn't Alright. It shouldn't freeze it. It should remount it read
53:24 only, so I think that's okay. I can see Borko laughing at us already. I'm just laughing at the comments. This is full Google interview stuff. Alright. I've got confidence in this. So It's not a symlink. It's only got one hard link. Let's try and change the permissions on it to 744 and see if it complains. Right? But the thing is we can cat it. So something's stopping. kubectl has, like, an SELinux. Okay. I mean, if there's I'm gonna be angry. Like, it's like kubectl has got an SELinux context or
54:07 something. I don't know too much about it. You're on the right track. It's nothing with the files. It's something with the kubectl command. Correct. Are you using eBPF? I am not I don't think Oh, okay. I cannot do anything with eBPF now. You're looking for LD preloads. Yeah. Yeah. I'm seeing you there. Okay. So if there's something to do with the kubectl command, we already validated this. Right? Let's check, like so let's check if it's an alias. We can type alias. Just our alias. Okay. So could be a function. I don't know because that would show up when we
55:08 did this. Right? And I did that to find the file. The file's okay. I'm gonna check anyway. I mean, there's many places it could be, but still Yeah. I mean, that we could always just do. That's true. Okay. So it's something getting in the way of kubectl, which looks okay. Mildly preloaded. I'm thinking it's gonna be, like, a, yeah, Systrace, like, one of the security mechanisms. It's gonna be gVisor or something, something that blocks syscalls. Although when I straced it, I never saw the access to admin.conf. So I wonder what I mean, does
56:02 it fork naturally? I can make this thing follow forks, -f. There we go. Okay. Forked. I assume that's natural. Yeah. Access permission denied. Well, they got assume. Borko's done some of the maybe Borko's just opening the file at that path. So why don't we copy it to somewhere else and see if it works? That's a good point. Yeah. There's a new one there, but I can't see there's several PIDs, but I can't see the fork. Oh, because it's called clone these days. I just wonder what on earth is so where's it exec'ing?
56:15 Identifying AppArmor Restriction
56:59 Because something we've seen on a previous episode from Noel was the BPF bypass that would restrict access to open certain files on the disk. So I would suggest we just try and copy it somewhere else. And if that doesn't work, we need to start exploring the processes on this machine for something malicious. Yeah. I'll tell you what we'll do. We'll do that so it gets a new inode number as well. Poo.conf. That was the best name you came up with. I hold on. What do you wanna call it? Right. But remember, it it snapped with that
57:41 file. Now I'm curious if we copied kubectl to a different location, would that work? Okay. So we can't read anything. That's interesting. Yeah. Right. So, yeah, kubectl. I wonder whether there's an SELinux policy that's matching the loaded file. Well, Kevin said in the chat that you can't get SELinux on Ubuntu. But someone told me last week, you can now get SELinux on Ubuntu, and that put the fear into me. Hello. Yeah. Oh, that works. Interesting. That was a good suggestion on my part. AppArmor. There is an AppArmor profile for
58:26 kubectl that doesn't allow you there's nothing in it, really. It wasn't generated. It's just preventing you from running the command. Sneaky. Nice. Okay. That's AppArmor made the ban list. Alright. Let's continue. Yeah. AppArmor and SELinux are those things that I really should learn better, and I've never invested any time whatsoever. No. Normally I mean, when I say SELinux and people said it literally doesn't exist anymore, that shows how much I know. But, you know, one of those Linux security modules. Oh, look. Okay then. Back to where we were. It seems to be kind of running, but
58:45 Control Plane Components Flapping
59:08 they both my cluster was older than that, so he's been in here since you did something. This all looks good. Oh, I don't have any of my aliases. This all looks suspiciously quiet except the network's up and down all over the place. And, wow, so it's all of the control plane. Okay. So we don't have an API server. The API server's flapping because, obviously, it's been up at some point because we can do this. Yeah. What's the restart? Should we just try it and see what happens? Yeah. So what's the fix for the AppArmor thing
59:51 then, Borko? How would we fix it? You can basically, there's a command, aa-logprof. Sorry. aa-genprof or something like that, or aa-logprof, that will generate the AppArmor profile for kubectl based on, like, the last time it was executed, like, all the syscalls it needs, and then it will allow you to do it. Or you can just disable the profile itself, and then it should work. Nice. Boot into sync mode. We haven't seen our before. Let's say, clustered first. Yeah. I've tried to kind of because I've seen a few episodes.
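Both fixes Borko describes start from knowing which loaded profile is attached to the binary. The sketch below is one hedged way to find it, assuming the kernel's loaded-profile listing at `/sys/kernel/security/apparmor/profiles` has one `name (mode)` entry per line. The profile name `/usr/local/bin/kubectl` in the sample is hypothetical; the episode does not show the actual name or path.

```python
import re

# Each line of /sys/kernel/security/apparmor/profiles looks like "name (mode)".
PROFILE_LINE = re.compile(r"^(?P<name>.+) \((?P<mode>[a-z]+)\)$")

def find_profiles(listing: str, needle: str) -> list[tuple[str, str]]:
    """Return (profile, mode) pairs whose profile name contains `needle`."""
    matches = []
    for line in listing.splitlines():
        m = PROFILE_LINE.match(line.strip())
        if m and needle in m.group("name"):
            matches.append((m.group("name"), m.group("mode")))
    return matches

# Hypothetical listing; on a real host, read the file above as root instead.
SAMPLE = """\
/usr/sbin/tcpdump (enforce)
/usr/local/bin/kubectl (enforce)
docker-default (enforce)
"""

print(find_profiles(SAMPLE, "kubectl"))
```

A profile in enforce mode that contains no rules denies nearly everything the binary tries to do, which matches the symptom here. Copying the binary to another path worked because the profile was attached by path.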
1:00:32 I've tried to put in things that I haven't seen before just to help people, I guess, audience, like, get exposure to some things. I don't know. Because I I found this, like, really useful to learn from, like, really enlightening to see how people think and different things they can think of breaking. So I try to put things that they haven't seen before. Alright. Awesome. Matt, we've got Angry Man, and our authentication for Postgres failed. Yeah. But you'll notice it's it's v one Angry Man. There really has no. There only has one Angry Man. This is the v two image,
1:01:00 Investigating API Server Flapping Logs
1:01:11 but we're getting an error connecting to the database. Oh, is it? I'm sure it's a v one. The title of our web page is v one. Oh, yes. Good. I think that's cached. Oh, okay. Great. There is a caching problem, and sometimes I have to reload it 54 times before I get the new video. I don't tend to open v one anymore. No. Fair enough. Okay. So we've upgraded with the and it can't auth to Postgres. That's interesting, isn't it? And then the whole control plane's flapping. That's all very odd. Okay. How do you
1:01:44 wrote this app. How does the auth work? Does it get its password from a config map? No. It's hard coded. Okay. postgres and postgres123, if I remember correctly. But you should be able to take a look at the stateful set, and it's just environment variables passed into the Postgres image. Well, now our API server. There we go. We need to fix that flapping. Yeah. Degraded. It's a static manifest, so you won't find out those? We will still see it as a child of containerd.service. Right? Except it's not there at the moment. But
1:02:26 okay. Yeah. Let's do a quick visual inspection of the Why do people never look at where the cursor is when they open the file? Oh, no. I forgot. I was just just spot where it was. I see some sneaky stuff. Okay. And there's all kinds of things you could subtly break in here. Yeah. It looks like Borko has added an encryption provider to our API server. So You might wanna take a look in temp, ec, ec. Yeah. Embarrassing question. What does that even do? That's the encryption at rest in etcd, right, for, like, the raw encryption.
1:03:00 Identifying Etcd Encryption Configuration
1:03:23 I believe so. Yes. Yeah. Now what we have to be careful with here is if we remove this, we can't read out of etcd anymore, or he could have added it and not actually enabled the encryption on etcd, in which case this is the reason it's flapping. So what we're gonna have to do is check the API server logs, I think, and see if we can get some information before we go all Hulk smash on this. What if we Hulk smash it and we just lose all of etcd? We can just put it
1:03:59 back. Right? Because we don't need API server access to do it. Yeah. Just don't delete the ec.yaml. Oh, no. No. Mustard. You know, 123456. Nice. I wonder if there's gonna be something spicy in there. I tried to put it the thing is you have to limit to exactly, like, 16 or a certain number of characters so you don't have much room for creativity. Oh, is it okay. I was gonna get the logs through kubectl, through Containerboard. You'll need to go to /var/log. Yeah. I'll tell you the restarting is not caused by the encryption.
1:04:42 Okay. That's good to know. Okay. Containers. Yes. Where it is. Then, yeah, kube dash API. So where do we go? No match. Failed to list config maps. Unable to transform. This looks like some etcd. Registry config maps default. Keywords in. Matching. No matching prefix found. This does look like it's looking for a hard coded config map in etcd, and the etcd tree has been mutilated. Mm-mm. Go back up. Is there any precursors? That's just a reboot, isn't it? Shutting down. Stop. Shutting down. Shutting down. Shutting down. Okay. We wanna check. I mean, etcd is running because that's
1:05:00 Debugging Etcd Access & "Unable to transform key"
1:05:46 why we'd see an error that it couldn't actually. etcd is actually not flapping, is it? Oh, I don't know. The only thing is PS. API server's going by. Yeah. Good. Yeah. One time it's restarted, I guess. We should be able to see the time the process started. Seventy two minutes ago. Right? Okay. Yeah. Right. So etcd is okay. Does look like it's been okay. So it looks like an important config map is gone. We can't use the API server to inspect it. We could use etcdctl. Well, we do get an API server for
1:06:24 a while, so we could always wait for it to restart, and then we're on a clock. But maybe there's a nicer way to do it. Oh, it's back. Yeah. Get config maps all. Right. Where did it does stuff for config maps and not pods? Why is it trying to get it to CA for one and not the other? I don't understand that. Little confused that k get config maps all did not work. I would have expected that to work. Okay. Yeah. I love it when have to ask them. We Wait. What was the error when you ran that?
1:07:33 Something about I'll show you. Something about I love how you're joining to fix this as well. Okay. None of those. Try to get CMs again. Unable to transform key. Right? Okay. So, yeah, that does make sense, I think, actually. Why? Yeah. Because there's a few special config maps. Right? Out of the box, there's a few of them that are, like, completely intrinsic. And I think, you know, because there's always a service called Kubernetes, and there's always a config map, I think, with the cluster CA in, there's always just something else. And it looks
1:08:08 like he's maybe managed to delete it. And, like, the error handling, it's just always expected to be there. So the error handling just doesn't really wrap it nicely, and it's saying, well, I'm trying to get this because this is the raw key in etcd. Right? It keeps everything under slash registry in etcd for some reason. So this is the raw etcd path, and it's just saying it can't find this or can't find some, you know, some prefix of it. Yeah. So I always use this to configure etcdctl. Let's copy this.
1:08:37 Yes. Cool. I can never remember how to do that. Okay. What was it? Registry. Yeah. I don't know if I should be seeing something or not. I'm not an etcd expert. I think we have to go registry slash the namespace slash config maps or config maps slash namespace one or the other? I forgot. It's config maps slash namespace by the looks of that error. There is a flag, something like list resources or something like that to show you the tree, if I remember correctly. Yeah. There was some kind of, like, list commands in there.
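The key layout the hosts settle on can be written down as a small helper. This is a sketch under the assumption stated in the conversation, namely that namespaced core objects live at `/registry/<resource>/<namespace>/<name>`. The helper name `etcd_key` is made up for illustration, and the ConfigMap name `kube-root-ca.crt` is the one the later discussion points to rather than something read off the screen.

```python
def etcd_key(resource: str, name: str, namespace: str = "") -> str:
    """Build the raw etcd key for an object under the /registry prefix.

    Assumes the layout /registry/<resource>/<namespace>/<name>, with the
    namespace segment omitted for cluster-scoped objects.
    """
    parts = ["registry", resource]
    if namespace:
        parts.append(namespace)
    parts.append(name)
    return "/" + "/".join(parts)

print(etcd_key("configmaps", "kube-root-ca.crt", "default"))
print(etcd_key("namespaces", "default"))
```

Not every resource follows this shape. Services, for example, carry an extra `specs` segment, which is what prompts the "where did specs come from?" question shortly afterwards, and custom resources include their API group. The helper does not handle either case.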
1:09:22 This is my least favorite part of the CKA. Yeah. This is a bit intractable, this etcd. I don't remember etcd. No. You might have to go do you want me to Google some stuff? Well, we ran get registry services, and that worked. Right? We actually got this value back. I have an etcd cheat sheet. Right. So because this one doesn't work. Right? So why can't we do config maps? Where did specs come from? It was just part of that thing. I don't know how to show the tree, so let's try that. etcdctl,
1:10:12 print tree. So should I start copying and pasting from this cheat sheet? Oh, yeah. Go for it. Fuck it. Yeah. I don't know what it is. Yeah. There we go. Alright. So Oh, yeah. There we go. What's under config maps? Slash registry slash config maps. So we do have default kube-root here. Yeah. But what's the error message again? I'm gonna copy this. But It's basically saying that doesn't exist. Is that the same? Did you copy a registry? Nice. He wouldn't do that to us, would he? Is it a UTF eight
1:11:05 hack? They're not allowed? No. That actually looks alright. Okay. So that's interesting. Unable to transform key. I assume we're reading this right. I mean, to me, that just says it doesn't exist in the database, but maybe there's more nuance to that error. You can use etcdctl to check what the value is in there. Right? Okay. Yeah. I was assuming that meant the key didn't exist. But, yeah, we could try to get its value again. You tried that, though, didn't you, David? Yeah. But we weren't doing it right. So
1:11:45 I mean, I wasn't doing it right. Looks like you said to me. I mean, it might be garbage, but, like, I'm not reading the error message. The error message to me says, yeah, no prefix found. It means that that path in etcd doesn't exist. But Yeah. It is. Gonna be something subtle, isn't it? So the path exists. Yeah. That error message is really shit. I'm gonna Google it. Yeah. Do it. Because etcd is honestly my least favorite part in this. Can you try something else? Can you get
1:12:33 config maps from a different namespace? Like Interesting. Specific namespace? Just kube c No. No. No. I mean kubectl. Dash dash keys only. Oh, okay. Yeah. Oh, we can. It's just the default namespace. Okay. It's just black. Alright. Well, I had fun, but I'll see you later. I'll go home. Wow. Okay. What sits between the API server and etcd? Like, nothing that I'm aware of. And this isn't our like, if RBAC wasn't allowing us, then it would be a much more, you know, friendly Kubernetes level error. Right? The fact this is saying internal error, the
1:13:35 fact this is making it into the API server's logs means this isn't, you know, this is an exception. This isn't, like, an RBAC or anything. I'm gonna take a look at this. Which namespace is that meant to be in? Well, that is the kube-system one. Right? But that what I'm worried Is it meant to be in both? I think it's in every namespace. Oh, is it? Okay. Oh, right. So that you can always get it no matter what your RBAC is. Okay. Are those the same? Yes. Maybe you should be I mean, I
1:14:28 did a very simple test of the first five and last five characters, but I'm gonna assume that they are Oh, it's hash yeah. It's hash algorithms go. That's like NSA proof. So Yeah. And it's I mean, they yeah. There's a few because they're on the serialized protos. Right? So you can see the text. I think Borko's winning a gift card at this rate. Yeah. This is awful. Whatever it is. Cannot transform key. Did you find anything on the error message? Because what's the transform part about? It is encryption related. You told us it
1:15:15 wasn't, Borko. Yeah. Yeah. Restarts aren't encryption related. No? Alright. Okay. The restarts are due to this. Issues with encryption. Yeah. The thing spams this into its logs and then quits. Like, the restarts are due to this. Surely. Surely. Okay. I'm assuming you've enabled encryption, but not on this key, or you've put the old version of this key back in. I'm gonna assume He's encrypted one namespace. Right? He's encrypted one or the other because it works for one, and it doesn't work. It's gonna let my turn on. Oh, wait. It's done the scrolly
1:15:50 bug, hasn't it? How on earth are we gonna undo that in fifteen minutes? So I'm assuming he's maybe been nice to us and left us Yeah. I guess a script or I wonder what he's done to Cilium with Cilium's deployment being in temp. I wonder if we can just do a get from the kube-system namespace. Right? And store it into the default namespace. Yeah. That's encrypted. The other one is plain text. So we're gonna have to in fact, let's take a look at the default one one more time. Right? So I don't think there's any namespace
1:16:30 Identifying Unencrypted Data in Etcd
1:16:38 in it. Right? It's just a certificate. It's just metadata. So we can just write the kube-system one to the default key. So Yeah. Right. Yeah. Okay. So encryption's on, so it's trying to decipher everything, and the one in default is not enciphered. So it's deciphering the garbage or it's not checks something or something. Okay. So this is the o This is our encrypted secret. But has he only encrypted the one everything else works. I know you're right. Encryption's on, and we can read the pods, and we can read the deployments and stuff. Right?
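The difference the hosts are seeing between the two etcd values can be checked by prefix rather than by eye. A minimal sketch, assuming the stored-value convention described in the Kubernetes encryption-at-rest documentation, where encrypted values begin with `k8s:enc:<provider>:v1:<key name>:`. The provider `aescbc` and key name `key1` in the sample are assumptions; the episode does not show which provider Borko configured.

```python
ENC_PREFIX = b"k8s:enc:"

def describe_value(raw: bytes) -> str:
    """Classify a raw etcd value by its leading bytes."""
    if raw.startswith(ENC_PREFIX):
        # Encrypted values look like k8s:enc:<provider>:v1:<key name>:<ciphertext>
        fields = raw[len(ENC_PREFIX):].split(b":", 3)
        return f"encrypted ({fields[0].decode()})"
    # Anything else was written without an encrypting provider.
    return "plaintext"

# Hypothetical sample values standing in for `etcdctl get` output.
print(describe_value(b"k8s:enc:aescbc:v1:key1:\x8f\x01\x02"))
print(describe_value(b"k8s\x00\n\x0f\n\x02v1\x12\tConfigMap"))
```

A value that reads fine in etcdctl is the suspicious one in this scenario, since the API server was only configured to understand the prefixed form.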
1:17:10 So we did yeah. Okay. I think everything but one thing is encrypted. So let's put And that's a smart way to do it. Right? Because we get the error on the one that's unencrypted, and we inspect it with etcdctl, and it looks okay. That's a very smart way to send us down the wrong path. So how do we we can't interesting workaround. I mean, the solution is actually much easier than this. But, like, oh, but now I'm curious if this will work. Let me get my cheat sheet up. Well, let's try to read the default one
1:17:48 again. Yeah. There should just be a put. You should be able to put. I've put it. Yeah. Okay. Alright. This is great. Awesome. Scroll bug. Oh, we've currently got it. Okay. So we still need to fix the flapping, but I think we might be able to get all the config maps so that API server comes back. So let's deal with this flapping because that's getting too fucking annoying. Yeah. Okay. So that encryption error was filling up the logs. So, yeah, let's try to find the flapping. Grep dash v is your friend, I think.
1:18:25 Okay. We'll just need to wait for that to come back. Maybe that is why it's flapping. Because there'll be watchers on that config map, and it's just shutting down the API server. There will, but he said it wasn't, which means he's done something else to the API server. Alright. Okay. I missed that. Okay. So grep dash v less than watch. We might be getting a new log. Maybe. Not yet. It takes a while. Does it make useful comment also? You can take a look at in the chat. The one from Bala? Yeah.
1:19:17 So Bala says, please check the identity tag and the encryption config. Okay. And this? Identities. Or Maybe Google, like, identity and etcd encryption config or something. So you're telling me we've not fixed it. Is that what you're saying? Well, you did fix it for that one config map. Alright. So I thought you were just being curling the way to one thing. Right? This will be you're right. This will be the key ID. I'm actually making it pick up the key or something. Where's our API server? Hold on. Yeah. Does it do crash loop back off on
1:20:16 static pods? Because this is taking a long time. No. The kubelet can just be a bit temperamental at times. It can take up to, like, four minutes. You could always restart the kubelet. Yeah. There we go. So k c m. No. Alright. We're gonna have to fix this properly then. Our hack doesn't work. Okay. So, yeah, let's Google guys that can fix it. Interesting. I mean, it's encrypted. It's very encrypted. Okay. So we need to do that. Can you recreate that secret by, like, using kubectl? Yes. So what your suggestion is we
1:21:21 could do Yeah. Let's try and do that first. System. Yep. System. Root CA CRT, -o yaml. And we can change this. Yeah. So I never changed the matte as which probably break it. And apply. What? Alright. We need to learn how this encryption Oh, apply. Yeah. Just do a delete and a create because apply is trying to read it to do a three way merge, and it can't read it. So smash it. Hulk smash it. It didn't work unless, I guess, we do an etcd delete.
1:22:25 Well, you were trying to delete the file you were trying to apply. Right? So that will not work. Right. Hey. Okay. Great. There we go. Oh, now the post. Okay. So we need to encrypt everything or do this identity hack to make it not really Yeah. Let's let's do that identity thing that we've talked about. Alright. So we can set encryption field type to identity. Oh, that and, like, that kind of identity. Okay. Do you know what that means? Do you wanna go for it? Oh, just as in it means, like like, the identity function. Right? As in it means
1:22:30 Fixing Etcd Encryption Config (Adding Identity)
1:23:00 just pass through and don't like, encryption type identity just means don't do anything, I guess. But that doesn't make sense because a lot of what's in etcd is enciphered. Right. Except some of it wasn't. Except some of it. But the default namespace config maps weren't encrypted. Right? Okay. You're saying we can get default working again if we do this. We might break some other stuff. Right. So, like, if you Google, like, etcd encryption config or something. Oh, we could do it per namespace. Okay. I love if you Google for etcd. The first hit is Kubernetes even if you don't
1:23:43 specify it. Okay. So there is an identity right there. Do I have that there? No. Right. So, basically, I encrypted all namespaces except default. But because there was no identity there, it wasn't able to read those because they were not encrypted. Oh, right. We needed an a configuration and and resistant. The reason I did that is because if you would decide to just turn off encryption, that would break the rest of the cluster. Because then it would be able to read that namespace, but then all of the control plane config maps will not work because they are now
1:24:28 encrypted and would break the entire cluster. Okay. Found that half encrypted half wasn't. That's clever. So there's a much easier way than me trying to retrofit the encryption then. Like, so Yeah. Which is clever because you basically just so one thing is if you recreate the secret or the config map or whatever resource, then it will encrypt it using the EC config and replace it. So that's why when you used the kube-root one from the other namespace, it worked. Or you can just add that one line identity in the config, and then now it's
1:25:02 able to read unencrypted stuff by default. Got it. Nice. I had no idea that was a thing. So that's It refuses to do no encryption. That's great, isn't it? It refuses to do Well, I think it's optional. Right? So stuff can be encrypted or unencrypted. I mean, maybe they're not back. Yeah. So if you have identity in there, then it's both. Okay. So we have four minutes, and we still need to fix this Postgres. So Yeah. So I'm gonna cheat a little bit. He was definitely messing with Cilium because there's some Cilium
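Borko's explanation of the identity provider in this chapter can be modelled in a few lines. The sketch below builds a dict shaped like an `EncryptionConfiguration` and simulates, in simplified form, how the provider list decides which stored values are readable. The provider `aescbc`, the key name `key1`, the elided secret, and the choice of `configmaps` as the covered resource are all assumptions for illustration; the contents of the real ec.yaml are not shown in the episode. The read rule is reduced to prefix matching, so this is a model of the behaviour discussed, not the API server's implementation.

```python
ENC = b"k8s:enc:"

def make_config(with_identity: bool) -> dict:
    """Build an EncryptionConfiguration-shaped dict (key material elided)."""
    providers = [{"aescbc": {"keys": [{"name": "key1", "secret": "<base64 key>"}]}}]
    if with_identity:
        providers.append({"identity": {}})  # the one-line fix described above
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [{"resources": ["configmaps"], "providers": providers}],
    }

def provider_names(config: dict, resource: str) -> list[str]:
    """Provider names, in order, that apply to `resource`."""
    for rule in config["resources"]:
        if resource in rule["resources"]:
            return [next(iter(p)) for p in rule["providers"]]
    return ["identity"]  # no rule: stored and read as plain data

def can_read(config: dict, resource: str, raw: bytes) -> bool:
    """True if some configured provider claims the stored value's prefix."""
    for name in provider_names(config, resource):
        if name == "identity":
            if not raw.startswith(ENC):
                return True
        elif raw.startswith(ENC + f"{name}:v1:".encode()):
            return True
    return False

plain = b"k8s\x00plain-configmap-bytes"
sealed = b"k8s:enc:aescbc:v1:key1:ciphertext"

for label, cfg in (("without identity", make_config(False)),
                   ("with identity", make_config(True))):
    print(label, can_read(cfg, "configmaps", plain), can_read(cfg, "configmaps", sealed))
```

In this model the unencrypted value only becomes readable once `identity` is in the list, while the encrypted value is readable either way. Keeping the encrypting provider first matters as well, since the first provider in the list is the one used for new writes. That is also why recreating the object through the API re-encrypted it.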
1:25:25 Debugging Postgres Authentication Error
1:25:35 config in temp. So Oh, that's me. That's part of the cluster bootstrap process, I believe. Okay. Yeah. I did not mess with Cilium. Because all of that Cilium stuff flapped. Like, the fact that the password is wrong. Okay. So the password is correct here. I think postgres123 is correct. I believe the user is postgres, and the database is clustered. In fact, our health check shows the user postgres. So And it's hard coded into the Rust app, is it? It's not read from the environment or anything you could fit into.
1:26:18 I should know, but I will admit that I wrote this a while back, and I've abandoned that ever since. So let's check. Because my immediate thought again was eBPF. Right? I mean, what could make the password be wrong, like an eBPF filter that's changing it on the network? No. The hardest thing was etcd encryption. Everything else should be relatively easier to fix. I'm sure it's telling us some facts about that. Yeah. And figure out. I mean, it's weird because this is hard coded. Assuming it's our image, which appears to be password authentication failed for user postgres,
1:27:08 which means he probably modified the stateful set. What's the generation on this? Generation nine. Yeah. I would have applied this once. So Yeah. I hate computers. Yeah. Well, I hate people too, so it's tricky, isn't it? I mean, that looks okay. postgres123. It's not anything silly. Like, it's not POSTGRES_PASSWD or anything silly, is it? I don't think so. I mean so this is a relatively stateless thing. Right? So he could have modified it in the pod. I'm just gonna delete it and have the database recreate. Yeah. I might be on out-of-date
1:28:13 docs, but it looks like the environment variable is called PGPASSWORD, all one word. Are you sure? That's what I thought. I found some docs that said that for the latest version. But have you recreated it? Have you got the original YAML sitting around? Can you just recreate it? It didn't work. Oh, yeah. I do have the original YAML. Because a lot of what I did was with kubectl edit. If you'd just thrown the original YAML over the top of all the you'd have fixed, like, six of my hacks all at once. Oh, don't do that.
1:28:50 Identifying Malicious Postgres Startup Command
1:28:51 Can we take a look at the YAML again in the cluster? Oh, it is. POSTGRES_PASSWORD. Okay. That's what those docs are all about. Okay. Let's work this out. Okay. Stateful set. No choice default. Keep scrolling down. Oh, you son of a Oh, what? Oh, man. In the start up That's great. Nice. Very nice. You made me laugh out loud. Like it. Our API server's flapped. Oh, yeah. We still haven't got to the bottom of that. So what causes pods to restart? Health checks? Interesting. Yeah. So, like, kube-apiserver and kube-scheduler and those, they have health checks. Right?
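The break found here was hidden in the container's startup configuration rather than in the environment variables everyone had been checking. As a hedged sketch of how to surface that kind of change quickly, the function below lists the pod-spec fields that alter what a container runs at startup: `command`, `args`, and a `lifecycle.postStart` hook. The sample spec and its override are hypothetical; the transcript does not say which field Borko edited or what he put in it.

```python
def startup_overrides(pod_spec: dict) -> dict[str, dict]:
    """Map container name -> the fields that change what runs at startup."""
    found = {}
    for c in pod_spec.get("containers", []):
        fields = {k: c[k] for k in ("command", "args") if k in c}
        post_start = c.get("lifecycle", {}).get("postStart")
        if post_start:
            fields["lifecycle.postStart"] = post_start
        if fields:
            found[c["name"]] = fields
    return found

# Hypothetical pod template, e.g. spec.template.spec of a StatefulSet.
SAMPLE = {
    "containers": [
        {
            "name": "postgres",
            "image": "postgres",
            # Hypothetical override; the real one is not shown in the transcript.
            "command": ["sh", "-c", "echo hypothetical-override; exec docker-entrypoint.sh postgres"],
        },
        {"name": "sidecar", "image": "busybox"},
    ]
}

print(startup_overrides(SAMPLE))
```

Comparing this output against the original manifest is a quick way to catch edits made with `kubectl edit`, which is how Borko says most of the breaks were introduced.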
1:29:15 Fixing Postgres Startup Command
1:29:55 Right. I mean, lots of things. Right? I mean, you could've been you could've been making a note already. You could've caused the thing to exit one because it doesn't like its config file. But, yeah, health checks also. Is it possible that the health check is not healthy for some reason? It's doing HTTPS. Is this I could have believed it was because it needed the CA bundle to be able to verify the service certificate of the API server, but that's back now. Yeah. I don't think livez uses the real endpoint, does it? Is that what you changed?
1:30:31 I actually think it No? I think it is. Yeah. I don't look enough at this. Every time I see it, I'm like, that can't be right. I mean, like, yeah, it's probably gonna be, like, hard to there's just a small little change. I mean, it go up. So there is where is it? Anonymous auth. False. Right. Yeah. Okay. So yeah. So it's making the health check fail. That's nice actually because that's honestly, like, an error a user could make. Because if you're installing your own cluster, you see something like that, and, of course, you turn
1:31:08 it off. Like, it's actually a very Yeah. You're like, oh, I don't possible thing to, yeah, to get wrong. Awesome. I think that should be it. Should that be all of it? We never actually got to edit our Postgres. Oh, you gotta okay. And you gotta restart failed. Yeah. Oh, I don't think you took enough. The API server died before I could save the alter. I don't think you're taking enough lines out of that. Have a look at the indentation again. Let's try again. Oh, yeah. You're right. My whole idea was that you would fix this first
1:31:46 and then try to recreate the pod, but because this pod uses a config map in this namespace, it would fail. And then you'll have to go figure out what's going on with etcd. That was a hard one, I gotta say. We still haven't fixed it. Oh, Postgres. HTTP. Yeah. Why does it not when I changed the spec? Oh, you need to, yeah, you need to kill it. Right? Yeah. I shouldn't have. Not when I changed the stateful set. Oh, yeah. I did the stateful set. I should have rolled it. Well, unless someone changed the semantics. There we
1:32:28 Borko's Cluster Fixed
1:32:28 go. There we go. We're two minutes late, but we did get the dense. You were. Not bad. Alright. Those were two really tough clusters, I gotta say. etcd is always I've had the AppArmor, etcd, never things I like to see. So that was a tough one, Rawkode. And that was very good. I thought I'd squeeze the etcd. That's why I squeezed the CPU. Right? Because I'm like, etcd is just when it goes wrong, you never know how to fix it. So I thought if I could just make it sort of misbehave, sort of be really slow and annoying,
1:32:40 Wrap Up & Explanations of Breaks
1:33:05 everybody's just gonna be like, oh, etcd. I don't understand it. Because it's always the thing that gives you cold sweats when you see the spam. Yeah. Because how often do you have to use etcdctl? Like, never. Right? Really? Yeah. I did need the backup of it. There's a hints folder. Was there a hints folder? Damn it. Yeah. There was a backup of etcd completely unencrypted, and there was the YAML file for the encryption configuration with the identity in there. So just in case we got lost, we could fix those. In case some
1:33:42 idiot starts writing random bytes to etcd, yeah, you really better keep a backup. Yeah. Well, I did it for me because I was doing stupid things with etcd. Alright. Well, thank you both for joining us today. Those were great. Those are really good. You know what? The best episodes are the ones where I walk away confused but learning so much, and I got that from both of your clusters. So I really appreciate the effort you've put in there and for joining me during the debugging as well. Thank you to the audience for watching us,
1:34:14 and all your comments. They're always very helpful. You're always one step ahead of us, and thank you to Teleport for sponsoring Clustered. We will be back next week with more broken clusters and more pain for me, Pop. If you're still watching, you definitely owe Borko that gift card there. You definitely destroyed that. Alright. Any final words before I say goodbye to you both for today? It's fine. Thanks a lot. I'm honored to be here. Yeah. No. Same. And thanks for the breakage. I learned lots of stuff as well. Alright. Have a wonderful day. Enjoy your weekend. Thanks
1:34:34 Final Words & Thanks
1:34:49 a lot. Bye all.