About this video
What You'll Learn
- Trace Kubernetes control-plane failures from fake kubelet binaries to node readiness recovery.
- Diagnose kubelet startup failures from bad systemd flags, then patch unit files and restart dependencies.
- Recover restricted shell and PATH corruption on remote hosts to restore admin access and worker rejoin.
Rawkode teams up with Marek Houns (The Null Channel) to debug two community-submitted broken clusters featuring a fake kubelet binary, a restricted-bash shell, a custom Wordle game, blocked PATH tricks, etcd cert trust failures, and a kernel namespace limit.
Jump to a chapter
- 0:00 Holding screen
- 2:03 Introduction
- 2:23 Housekeeping & Sponsors
- 4:00 Guest Introduction: Marek Houns (The Null Channel)
- 6:04 Tackling Kevin's Cluster (The First Break)
- 6:38 Initial Cluster Check & Node Status
- 12:02 Debugging the Control Plane Node (NotReady)
- 13:03 Investigating the Kubelet Issue
- 17:18 Identifying the "Mooplit" (Fake Kubelet Binary)
- 18:46 Systemd/Systemctl Issues
- 19:06 Reinstalling Systemd
- 21:01 Kubelet Logging & Configuration Errors
- 22:04 Finding the "Mid Bloop" & RUN_ONCE
- 24:55 Kubelet Crash Loop: Invalid Flag
- 25:46 Fixing Kubelet Unit File (RUN_ONCE, Bad Flag)
- 26:24 Kevin's Cluster: Control Plane Ready
- 27:30 Tackling Russell's Cluster (The Second Challenge)
- 27:58 Restricted Shell & Speak-and-Spell Game
- 31:48 Attempting to Fix User Shell via /etc/passwd
- 32:20 Reconnection Issues & File Persistence Problem
- 34:11 Identifying Restricted Bash (Rbash)
- 36:05 Escaping Rbash via Sudo
- 36:43 Replacing the Custom Shell Binary
- 38:17 Stage 2: Auto-Logout & .profile
- 41:03 Fixing Auto-Logout (.profile Cleanup)
- 42:10 Gaining Full Root Shell Access
- 42:55 Russell's Cluster: Control Plane Missing
- 43:11 Missing Binaries (curl, apt-get) & PATH Issues
- 47:48 Identifying the '9cat' Output Interference
- 52:36 Fixing PATH (Resolving 9cat)
- 53:16 Kubelet Certificate Errors
- 1:00:21 Containerd & OCI Runtime Issues
- 1:08:02 Debugging OCI Runtime Error ("Unknown Container ID")
- 1:12:17 Discovering Containers in Kubernetes Namespace
- 1:26:20 Identifying the Kernel Namespace Limit (sysctl)
- 1:27:39 Fixing Kernel Namespace Limit
- 1:30:05 Russell's Cluster: Control Plane Ready
- 1:30:32 Worker Node Not Ready & API Server/Etcd Cert Issue
- 1:32:12 Regenerating Kubernetes Certificates (kubeadm)
- 1:35:08 Continued Certificate Issues & Etcd Logs
- 1:40:19 Debugging Etcd Trust Error ("Unknown Authority")
- 1:46:08 Fixing API Server Manifest (Adding etcd-cafile)
- 1:50:58 Control Plane Ready, Worker Still NotReady
- 1:51:10 Regenerating admin.conf
- 1:54:00 Tackling the Worker Node
- 1:54:15 The Worker Node Wordle Game
- 1:55:25 Playing the Custom Wordle
- 1:58:23 Escaping Wordle
- 2:04:01 Worker Debugging: Connection Refused
- 2:04:19 Re-joining Worker to Cluster
- 2:07:01 Fixing Worker Route Issue (Blackhole)
- 2:07:47 Debugging Worker Join Token/Config Map
- 2:17:17 Manual Pod Scheduling (Temporary Workaround)
- 2:22:25 Worker Node Becomes Ready (Root Issue Resolved)
- 2:23:33 Redeploying Application Pod
- 2:24:10 Wrap-up & Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:03 Introduction
2:03 Hello, and welcome back to the Rawkode Academy. This is Clustered. Today, we have another Rawkode versus the community. Only today, I'm not fighting alone. I will introduce my guest in just a moment, but before I do that, a little bit of housekeeping. If you wanna win a T-shirt, which of course you do, you can go to Rawkode.live/win. Go there, enter the competition. We will draw the T-shirts at the end of the episode. Next up, I wanna thank our sponsors. We have been using Teleport since the second... I would say first, but it's actually the
2:23 Housekeeping & Sponsors
2:48 second episode of Clustered, way back in January 2020. So I wanna thank Teleport for their support. You should go to Rawkode.live/Teleport to learn more. We use it every single week. It gives us access via GitHub authentication to SSH, to databases, to applications. It's truly a wonderful tool that should be in everyone's production and maybe even non-production environments. So go check it out and watch us use it today. It's a whole lot of fun. Alright. I see the chat's having fun too. I also want to thank Equinix Metal. You can find out more about Equinix Metal at
3:23 Rawkode.live/metal. Going to these links helps the sponsors and helps me. So go learn more about bare metal cloud. If you don't wanna pay for it, because why should you, to check it out and kick the tires, use the code Rawkode. This will get you $200. Now we use Equinix Metal every single week as well, since the first episode. Bare metal clusters are the best. We have tons of cores, tons of RAM. It is a whole lot of fun. Now I could use small instances for a cluster, but it's a lot more fun to use big chunky ones, and maybe we'll get to see a
3:53 look at the cores and RAM later. But that is awesome. Alright. Thank you, sponsors. Now I said I'm not fighting today's episode alone. I'm gonna introduce our wonderful... I don't know if I wanna call you a guest or a victim, but either way, welcome, Marek. Hey, man. How's it going? It's going well. How are you doing? I believe victim is the correct terminology, by the way. Is it? Yeah. Alright. Well, I mean, I'm sure people have seen you before on this, but why don't you give us the introduction to Marek? Yeah. My name is Marek Houns. I'm super
4:00 Guest Introduction: Marek Houns (The Null Channel)
4:30 humbled to get to do this with you. I learn something new every single time I watch one of these. I think it's fantastic and I appreciate you doing this. I don't know if any of you know much about me. I'm a fellow YouTuber. I also try to help people, but I don't ever run a live show quite like this one. This one is very special. Yeah. And I just entered to win a free T-shirt. So that's what I was doing during that introduction. Awesome. Well, depending on how badly Russell and Kevin scar us today, I may have to send you
5:04 one as a sorry anyway. So we'll see. I do love that they're, you know, they're all talking in the chat about Russell's sort of build-up psycho tweets over the last three days. Scheduled tweets are the worst thing ever. I have come to determine that. That's horrifying. I got to wake up, since I'm, like, behind in time zones, and my morning tweet feed was Russell's scheduled tweets. Well, here's the latest one. Not a tweet, but a comment in the chat. There are 23 hints for today's cluster from Russell. I hope Russell's better at giving hints than
5:42 I am. Last time I got flamed for my tweets. Or not my tweets, but my hints. So hopefully he's better than I am. Yeah. Awesome. Russell also stated Marek's content is also awesome. Yeah. It really is. We'll get a link to that in the comments. Feel free to drop that in there, Marek. Share it with everybody. Alright. Well, there's no point in delaying the inevitable. We should take a look at our first cluster. So let me pull my screen share up. There we go. We have... Are we hitting Kevin's first? We
6:04 Tackling Kevin's Cluster (The First Break)
6:15 are gonna have Kevin's first. Yes. So I've done a tsh ls. I'm listing the nodes that belong to Kevin. I am gonna join Kevin control plane one. That session will be started in just a second. Please join that when you are ready and give me some sort of echo or ls or whatever to let me know you're there. And already we have Kelsey. Nice touch, Kevin. Thank you very much. Really quick question. How do I list sessions in the command line? I'm gonna try the command line instead of the web. Oh, you
6:38 Initial Cluster Check & Node Status
6:47 can't. You have to get those session ID from the web. There's a feature request to do the session listing, but it hasn't been done yet. And there's a good reason that I wanna dive you know, divert off course here. But what you'll see is when you go to the active sessions, it will list sessions for all clusters, even ones you don't really have access to. So you'll actually see the there's some stuff I'm doing with e b p f d for at KubeCon. Uh-uh. And they're fixing the the scoping of this API before they release it to the CLI. So
7:16 but you can grab the session ID from here. I know that's a a long answer to what you want to do here, but there we go. And I'll just join it for both in case we open them. So did you use the the web or did you use the CLI? I used the web. I'm gonna flip over to the CLI and I was gonna try to do that in the background while yeah. Alright. I may oh, well, let's let's start talking around this first cluster. So this is broken by Kevin, very active member in the Discord community
7:46 and also in the YouTube comments as the fire flash. So let's see. I will export our kubeconfig. Alias k to kubectl. Just out of curiosity, looks like that's maybe alright. Should we cat it to see if it's a binary? That's how I determine if something's a binary. Well, I mean, I don't trust these two. Like, these two have seen a lot of Clustered episodes. They know a lot of tricks, and I'm gonna be checking all the basics today on everything. But at least, like, kubectl is a binary, potentially in the right location.
8:36 We have a version that looks like what I would expect. We got a server version, which means that our control plane may be alright. I'll give you the honors if you wanna pick your favorite command and hit our control plane. Alright. Not too bad. Not too bad. I'm actually kind of worried that that worked. Like, this is a bad thing on my part, but I'm kind of worried, like, this is some default nginx reply or something. No. No. No. No. I don't think so. Let's try get pods, services, and endpoints.
9:21 The holy trinity. Right? Pods look good. Service looks good. Endpoints look good. Are we being too confident here? That's what I'm saying. I'm feeling worried. Okay. So maybe we should, I'm gonna open... Control plane is not ready. So I would actually check out the kubelet. See why the kubelet is not... Well, we do have the v1 image. So that's a good start. Let's go back here. So shall we just try updating to v2 and see what happens? Okay. Yeah. But the control plane isn't healthy. Well, the node isn't healthy, but the control
10:21 plane is responding to requests, I guess. K. That's it. Edit play. Clustered. There's that weird thing. Told you that would come up. Do we have a working cluster? No. We have a. Kevin says, loosely, I forgot to check one thing. Damn. Alright. Tell you what, Kev, feel free to jump back in. Check what you need to check. I will edit my deployment back. Come here to right. We've rolled that back. You go make whatever change you need to make, and we're gonna move on. You want me to check the kubelet? Alright. Let's do it. Let's our mission is
12:02 Debugging the Control Plane Node (NotReady)
12:02 not to update v2 on this one. Let's get that control plane happy. So we have a NotReady on the control plane node. Marek, what do you wanna do? Alright. So if we have NotReady on here, then what we really probably need... I would check the kubelet logs, or maybe just systemctl, the status there. Will we describe the node first? Sure. Yeah. I guess since it's responding, there's no reason not to describe it. Let's see. So there's a bunch of Unknown. So this would say that the kubelet isn't talking here, the node status.
12:58 Yeah. Which I think you already kind of thought anyway. So why don't you start digging into the kubelet? Alright. So... Oh, that's not right. That's fantastic, by the way, but not right. Does our kubelet just curse at... Yes. So we have a deprecated warning, but that's not really my biggest concern. My biggest concern here is I don't think that this is a real kubelet. But you ran a systemctl command, right? Oh, yeah. But this is the kubelet logs. Right? Right. This is just going to be its logs out here. We could actually just
13:03 Investigating the Kubelet Issue
13:48 check the logs. Just do like a journalctl. Yeah. And this actually prints it out a little bit better. Yeah. That's a problem. So there's two possibilities that my mind jumps to. First, that this is not the right kubelet. And we could look at... one second. I'm looking through this; it printed out a bunch of garbage. There is the... we try to figure this out and he's injecting log messages. So it's actually a type of attack, injecting log messages where you don't sanitize user input into log messages. I don't know that he found one of those in the kubelet. If he did, you
14:45 should actually be reporting that. So my hunch is that he actually just compiled his own. Compiled his own kubelet. Okay. So how can we... Cool. Confirm that. I mean, we could search on a kubelet version, right? Yeah. Yeah. Kubelet version. That was like a kubelet. Don't think it is a kubelet. No. I don't either. Where's that binary? I don't know. We could check when it was last messed with. It should be the same age as the machine. Uh-huh. Let's do that. X-ray. Yeah. Alright. So everything... oh. This is interesting. Things have been messed with,
16:15 I think. I shouldn't have printed out the entire directory. That was kind of dumb. Well, looks like Kev was working on this just forty minutes ago. Nothing like a last-minute break. So 07/2006 or seventeen o six, April twentieth. I'm curious of the strings command. Well, I mean, I expected to put it as cozy, but I wanna see it. Yeah. It's not a binary. I should have cat it. Like, I'm telling you, this is what I did last time. I was like, it's not a binary. I'm just gonna cat it. And then I cat a binary.
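The two checks discussed here (was a file changed recently, and is it really a compiled binary?) can be sketched in Python without cat-ing anything to the terminal. This is a minimal sketch, not what was run on stream: the function names are made up for illustration, and it only looks at the file's modification time and the four-byte ELF magic number at the start of the file.

```python
import os
import time

# Every ELF executable starts with these four bytes.
ELF_MAGIC = b"\x7fELF"


def is_elf(path):
    """Return True if the file starts with the ELF magic number."""
    try:
        with open(path, "rb") as fh:
            return fh.read(4) == ELF_MAGIC
    except OSError:
        # Unreadable files are reported as "not ELF" rather than crashing.
        return False


def recently_modified(directory, max_age_minutes, now=None):
    """List (path, is_elf) for regular files changed within the window.

    Hypothetical helper: results are sorted by file name so the output
    is stable between runs.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    results = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
            results.append((entry.path, is_elf(entry.path)))
    return results
```

Calling `recently_modified("/usr/bin", 60)` would list anything in that directory touched in the last hour; a supposed binary that comes back with `False` in the second position is worth a closer look with strings.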
17:18 Identifying the "Mooplit" (Fake Kubelet Binary)
17:18 It's a mooplit. He's given us a mooplit. Very nice. So can we just go pull? Let's just do an apt reinstall. Well. Does that not work? No. Don't cat it. Do your strings. Do strings. Oh, yeah. Strings is always gonna do. Yeah. That's better. So why did the restart... Maybe it's just a hangover. Nope. It's not a hangover. Wait. Do we need to do like a daemon-reload on systemctl? Only when you modify the unit file. Right. I would have thought at least. But all our systemctl commands are... He's
18:31 not just modified the kubelet. That's the problem here. Did he actually modify systemd? Or did he do the systemd files? Oh, look. systemctl was compiled today at five past five. I've never reinstalled systemd on a live system before. Surely that's gonna blow up. Right? No? There we go. Do you wanna run get nodes? You think it's happy? Or are we gonna have to list every binary modified in the last ever? We might have to list every binary that's been modified. It still says NotReady, but it might take it a second
19:06 Reinstalling Systemd
19:31 sometimes. Let's do find /usr/bin, dash m, three sixty, type file. I just did a modified time. I thought it was just seconds. I don't know, actually. Let me go Google it. I'll Google it. It's dash mtime, space. mtime. And that's just type f. Okay. Oh, and then that's a forward predicate, so maybe. Yeah. Like negative. Yeah. I don't think he's compared to all those. Alright. He's in the chat. We found all the binaries. Alright. So our kubelet should be posting a status at least there. Should be. Let's see. It still says kubelet stopped posting
20:49 node status. We have a kubelet running. No. We don't. So now we should probably just get its logs. Do you mind if I type real quick? No. Go for it. Alright. Let's see. Yeah. It looks like it's starting. I think the daemon-reload is only needed when you modify the unit files. Oh, actually I had a misspelling of kubelet when I did the logs anyways. So I do believe that this isn't your typical thing that you get exported out in logs here. I think possibly the system
21:01 Kubelet Logging & Configuration Errors
21:58 file has been updated with how it runs. Let's see. Let's see. Its ExecStart here is the kubelet, kubeconfig args. That almost seems right. Wait. Wait. Insecure values. Maybe we should check its configuration file. Well, I mean, the environment file is pointing to the mid blitz though. Oh. Oh, sorry. I hadn't scrolled back down to where you were. Let's take a look at the mid bloop. Run once. Oh, I knew that was gonna make another appearance after the last day. Wonderful. I have hurt myself. Past Marek hurt current Marek. Well, we still have the /etc kubelet files.
22:04 Finding the "Mid Bloop" & RUN_ONCE
23:10 So I guess if we go to systemd system kubelet, where was the file? Here. No. This is configured here. Yeah. That is the right one. And no more mooblets in here. Right? We should have just started a find on the file system for mooblets. Alright. Still no kubelet. That was unexpected. Okay. So it's on this rolling restart thing. It does have a restart of ten seconds in the config. So it's not getting enough... Oh, sorry? It's not getting enough config to start, by the looks of it. Or it's getting a dodgy flag, which we're probably skipping
24:41 because of all the scroll. Yeah. I was looking through this. I actually absolutely love the "kubelet service failed with result exit-code". Perfect. That was so helpful. Yeah. I can't think of the start and all this mess. I was scrolling through it. Let's create some blank space, because blank space is good. There's a new restart. Where's my blank space? Oh, there we go. Okay. So restart counter is... it's restarting 8,000 times. There's some deprecated flags. I don't... wait. Failed to parse a kubelet flag. Invalid argument. File check frequency. Okay. Yeah. You're looking at the same thing. Yeah.
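Rather than scrolling through restart noise to find the one bad flag, the flags can be pulled straight out of the environment file that the unit reads. The following is a minimal sketch under stated assumptions: the helper names are invented for illustration, the file is assumed to use simple `NAME="value"` lines, and the sample values are made up (the stream only shows that a run-once setting and a file-check-frequency flag were involved, not their exact values).

```python
import shlex


def extract_flags(env_text, variable):
    """Return the command-line flags assigned to `variable`.

    `env_text` is the content of a systemd EnvironmentFile-style file.
    Comment lines and unrelated variables are ignored.
    """
    flags = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == variable:
            # shlex strips the surrounding quotes; the inner split
            # separates the individual flags inside a quoted value.
            for token in shlex.split(value):
                flags.extend(token.split())
    return flags


def split_flag(flag):
    """Split '--name=value' into ('name', 'value')."""
    name, _, value = flag.partition("=")
    return name.lstrip("-"), value


# Hypothetical file content, for illustration only.
SAMPLE = """\
# injected by whoever broke the cluster
KUBELET_EXTRA_ARGS="--runonce=true --file-check-frequency=banana"
"""

for flag in extract_flags(SAMPLE, "KUBELET_EXTRA_ARGS"):
    print(split_flag(flag))
```

Printing one flag per line makes an out-of-place value easy to spot, which is the same conclusion the hosts reach from the "failed to parse kubelet flag" log line.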
24:55 Kubelet Crash Loop: Invalid Flag
25:37 Alright. Let's find that file check frequency. Assuming it's not in the kubelet. It is. Oh, in fact, even our kubelet file has a run once. You sneaky devil. The mid blot was just to throw us off the scent. So he definitely watched the last one. I don't need to do a daemon-reload, but I'm gonna do one because now I've got this here. And that restart took a bit longer. So hey. Alright. Let's give that thirty seconds. See if we get a Ready status. Alright. Here we go. Thank you, Kevin. That was fun. That was good.
26:24 Kevin's Cluster: Control Plane Ready
26:40 Alright. I feel vindicated that if I had been typing, I would have cat-ed the kubelet. So thank you, Kevin. I feel better about myself. Alright. Well, let me close that session. And let's see. Kevin's saying, kinda sad that one break didn't stick. It should have been a systemd-resolved break. Well, that just means you've got a break for next time. Right? So it's all good. And Kevin has already posted a link to his GitLab repository with the breaks in the Discord server. So if you wanna know more, go check it out.
27:27 Alright. Let's jump over. I I don't know if Russell's taunting us or not. These guys are obviously pros, but we're gonna tackle Russell's cluster now. He's been hyping this up. So I'm expecting great, great things. I've woken up each morning to a series of tweets taunting us. So I have his tweets up just so that we can go through them as reference material for, you know Yeah. Alright. We've landed into a speak and spell. Wait. Do we have root? I'm concerned we don't have root. Well, right now, I think our shell has been replaced by
27:58 Restricted Shell & Speak-and-Spell Game
28:12 something else. We're probably gonna have to try and get out of it if we wanna get access. Alright. I'm joining. I'm in. Oh, I just killed it. I hit Control-C to see if it would just give me a proper shell. Assuming maybe he just modified the bash profile to run another command immediately. Can we just type exit? Oh. What noise did Hoot. Oh. Us. Us. I don't trust that we're still in the... I was gonna say, I don't actually trust that we're in Bash right now. Yeah. We're not... I think we're still in the
29:26 game. I think we're still in the game too. Can we run Bash? Okay. So Russell said you'll both love it. It's written in Rust. So we all know that Rust is the... I got it. Superior language. You're just running bash inside of whatever this is? It would appear so. I'm not contained. We have access to the host. Quick. What's happening? Oh, yeah. Let me reset the clock. Good call. So whenever I do shift a key, something breaks. Export. Oh, that's interesting. rbash. So I don't know that you're in bash. Russell bash. Right. I think that you're in a
30:42 special. You're in a special. Okay. Where? So I love that there's a user called victim, which I'm assuming we may have ended up... oh, there we go. There's error. So let's sed the crap out of this. Right? So end line And I see it's just had we have ah, open. Oh, I need to go to the terminal here because that's annoying. I can't open Vim. Right? So can you escape Vim? He's made it so that... oh, I'm assuming he's taken over readline, and when I do a single exclamation mark in Vim, it was done too, but that could have
31:40 been a side effect. So we can open Vim, which means we can edit /etc/passwd. And this is probably just /bin/bash. Right? I would have thought so. It's an interesting place for Bash. Well, aren't we still in? So we've modified the passwd file for now. I'm assuming we can get out of his game with a new session. And reconnect with a new session. Alright. So that didn't get me where I wanted. I could your code, Russell. Yeah. Oh, okay. I will gaggle. No. Rip it. Weak. Alright. Now we're back in this dodgy rbash.
32:20 Reconnection Issues & File Persistence Problem
33:06 If we look at passwd... It's back. It's back. Right. Okay. So did we validate that we actually saved it? Just as a, like, is what we did correct? Or did he revert it? I think that's two different problems. I feel, knowing Russell, that he reverted it. But... Yeah. It's not saved. Okay. Let's see what our file system looks like. What permissions do we have? Do we have... I mean, it says that we're root, but I actually don't believe that we're root. I'm kind of wondering if, like, our bash is root. No. I don't know. I mean, we can
34:11 Identifying Restricted Bash (Rbash)
34:20 do cat /proc/self/cgroup. And what else do we want? I mean, the fact that I'm even in a proc file system, I think I do believe we are root. But is this the system's proc file system or is this some container that you're in? This is not a container. This is his shell on the root of the host. And rbash is apparently a real thing. He didn't make it. So but I'm not gonna look that up right now. I'm gonna focus on destroying it with fire. So... Oh, it's the restricted bash shell.
35:13 Okay. So rbash is a real thing. I didn't realize this was it. Okay. So one second. If we're in the restricted... Well, can we just open a normal bash? No. Okay. So "cannot specify slash in command names", but can I move to the directory? No. Okay. No. Can we set +r maybe? I think it's +r. If you do... here, I'll type really quickly. Do we have access to sudo? Wait. Wait. Okay. We're done. That's it. Okay. I can see the... and I can open binaries with a slash. So we currently have real root.
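Since the login shell for each account is the last field of its /etc/passwd entry, a quick way to spot a swapped-in shell is to compare that field against a list of shells you expect. The sketch below is illustrative only: the function name, the allowed list, and the sample entries (including the path of the custom shell) are assumptions, not values taken from Russell's cluster.

```python
def unusual_shells(passwd_text, allowed_shells):
    """Return (user, shell) pairs whose login shell is not in the allowed set.

    Each passwd entry has seven colon-separated fields; the seventh is
    the login shell. Malformed lines are skipped.
    """
    findings = []
    for line in passwd_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue
        user, shell = fields[0], fields[6]
        if shell not in allowed_shells:
            findings.append((user, shell))
    return findings


# Hypothetical data for illustration.
ALLOWED = {"/bin/bash", "/bin/sh", "/usr/sbin/nologin"}
SAMPLE = """\
root:x:0:0:root:/root:/usr/bin/speak-shell
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
victim:x:1000:1000::/home/victim:/bin/rbash
"""

print(unusual_shells(SAMPLE, ALLOWED))
```

A check like this only reads the file. As the hosts find out, an edit to the file may also be silently reverted or refused, so confirming that a change actually persisted is a separate step.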
36:05 Escaping Rbash via Sudo
36:21 Okay. Because we had access because sudo is a setuid binary. So well, at least I'm assuming so right now. I don't really know what Russell's got in store for us, but I really wanna focus on this passwd file because I want my normal shell back. So what I'm gonna do is grep for that speak shit. I'm gonna stick bash over the top of it. That makes sense? Let's stick this to /tmp. It should just be /bin/bash. Right? Yeah. We just want /bin/bash. We don't want his dodgy bash. Oh. There's some weird input things going on
36:43 Replacing the Custom Shell Binary
37:07 right now. Yes. It really doesn't like me entering a key. So copy /bin/bash. I'm gonna try and copy and paste this then. So annoying. Okay. We need to find a way to do it. That's the key. Although, let me move it out of the way, didn't it? I moved it to... oh, so if I autocomplete it. So /tmp BeaconShell and then come back to /usr/bin and copy /bin/bash to that location. Alright. So let's restart our shell and see if we get presented with that annoying fucking quiz. Oh, we're back to them now.
38:17 Stage 2: Auto-Logout & .profile
38:32 Stage two. Rawkode success. Oh, auto logout. That's interesting. I bet you were back in the Speakeasy. Alright. So I think we just have to take commands every fourteen seconds or whatever stupid release got for us. Oh, no. It it just there we go. Russell, that's that's annoying. You need to log in and hit the hints really quickly. Well, no. We don't want the hints. We're we're just heading level two. Come on. Alright. So one second. You're in home victim. Okay. So my guess is that there's something in the like bash sh for the profile. Yeah. I kinda wish I had to speak
39:36 easy. The speak and spell back for a minute. Let's install a new shell. Oh, good. I'll just keep typing. Alright. We don't have apt. Yeah. Russell, up the time. Come on. Okay. So you think it's in the bashrc? Don't know that it is. I think you might be right. I mean, with that aggressive of a timeline, it tells me that it's something he thinks we can get to quickly. Well, that's the infective's bashrc, so that's useless. Okay. Cat profile. It's either in the bashrc or bash profile where I've seen the timeout. Yeah.
41:03 Fixing Auto-Logout (.profile Cleanup)
41:08 There's here we go. So we've got he's used the profile to fuck with us. Okay. So in theory, we can just remove that profile. Or delete that really quickly. You got twenty seconds. Go. Okay. So now, hopefully. No. We're still a victim. Sure. We didn't but hopefully we just don't get logged out quickly. I think we will. The fact because that temp that temp profile is what I thought was making us the victim. Wait. Can we just switch users to root? Yeah. But I'm pretty sure our session is still gonna get cut off. Let's see.
42:10 Gaining Full Root Shell Access
42:25 But if you switch to root and you don't keep the current session Yeah. That seems alright. Good call. Okay. And I can type k now. Would you look at that? Oh, well. Alright. Whatever. Level three, maybe? We don't have a control plane. We don't have a control plane. This this I can deal with better than the previous. Well, I'm gonna install my my my tool chain now. No. I think I've actually got a real shell. I prepared my Rawkode. Live. Oh, fuck. Script. No curl. Okay. But but check this out. Look at where that was. That was user local sbin
43:11 Missing Binaries (curl, apt-get) & PATH Issues
43:28 bash. Did he uninstall apt? Alright. One sec. Could you check what we're in? Are we actually in bash? That shell at the top makes me hurt. So /bin/bash looks legit. Is it? Let's try it. Is Bash not just another Speakeasy shell? I don't think so because the time... oh, I mean, he could've changed the timestamp, but it depends on his sneaky thing. Let's see if that saves this time. Yeah. That's not saving. It almost feels like we're in an immutable file system. Oh, yeah. Russell says we copied bash to his beacon shell. So he's right. We did
44:47 do that. Which means the only way that he could be modifying our environment now is through the profile stuff here, which we removed. Or by a rogue process. And he may just have fucked with the path. Oh, yeah. There's no apt. We've been to. We do have dpkg, and he didn't remove apt-get. Alright. I'm gonna install curl this way. Can you install apt that way? I don't know. I've never tried. Well, I think it's... well, actually, curl is already installed. So he's hidden them. Interestingly though, where apt-get is looking, it knows that it's already installed.
46:06 Or he... Yeah. So there's a /var/lib cache. /var/cache. And then we could go into apt and then we have the archives. We have a whole bunch of debs here, but the curl one is not there, unfortunately, because that was part of the base image. No aliases. I mean, we could try unset curl. Yeah. It's not installed. I'm curious, Russell. Did you just delete the binaries? Because it feels like you've just deleted the binaries, which means... But then how did... Okay. He's moved it. That's what he... he's moved it. Okay.
47:07 We don't need curl. We'll make do with that. I can get my tools. So... But apt-get... What I want is my... these. osquery. Need to add the repository for that. That may be a pain in the ass. May as well show people my script, since he's made it so that I can't use it. And I can still run these ones? Was it still doing something? Although you told me that we don't think Russell's used eBPF though. Right? So maybe I don't even need all these tools. Well, I don't know. Russell's also the type
47:48 Identifying the '9cat' Output Interference
48:09 to make you think that he didn't use something, so you don't check it. Although I haven't seen anything that makes me think that. Although I guess a lot of times I think eBPF is used for networking, but that's not entirely the case. You could do something else with it. Sure. I did the osquery, right? Can you ping Google? Yeah. Well, I mean, I'm hitting Ubuntu's repositories and stuff. Okay. Yeah. Yeah. Never mind. Alright. At least I've got some of my tools. You have been... What just happened? Why have I got 9cat? It actually shows up better on my screen,
49:23 by the way. It actually shows up properly. Yeah. That's annoying. Alright. Let's just get rid of that. Whatever the 9cat was, it was ignoring the Control-C signal, but I managed to Control-Z to suspend it. We seem to be okay. Which I'm assuming I can bring back whenever I want. So just for the laughs. Okay. Did I get my osquery? No. I did. Oh, you did? Oh, never mind. Sorry. It's really funky on this screen, but we do have it if we need it. Okay. osquery is a great tool. You can query
50:20 everything with SQL syntax. Pretty sweet. Alright. Where were we? No control plane. Okay. Let's fix that. Is that the IP address? It's the right port number. Well, one second. It's the default port number. I don't know if it's the right port number. It looks alright. We're pointing it to admin.conf. What is it configured to bind to? So like, we could check the API server config. Yeah, good idea. Let's use less or shared picture. Alright. So he has it bound to 1072147 here as the advertise. And then okay. I don't... wait. Wait. Wait.
51:34 Wait. Wait. The etcd server. Yeah. That's okay. Yeah. Can we actually just look at the container runtime, see what's running? Just because we can't access it doesn't mean it's not running. Well, let's see if the kubelet's running first. Thanks. Thanks, Crystal. Kubelet is loaded, but not active and running. Can we just enable it? Well, it has an exit code from six seconds ago, so I'm assuming it's crashing. Oh, but it doesn't have any logs. Perfect. I'm gonna change the path ordering to get rid of that nonsense. So let's re-export our path
52:36 Fixing PATH (Resolving 9cat)
52:48 with equals /bin, /usr/bin, /usr/local/bin. Now anybody that's just watching, I want to assure you these are not your normal breaks that you see. So that should allow us to do... yeah. Okay. So, failed to construct a kubelet, unable to load the CA. Alright. So we're missing some certs. Oh, you messed with the certs. I thought that... that's worse than messing with our shell. Don't mess with our certs. Can you check the permissions on it maybe? I don't know. Sorry. Kubelet is saying it cannot open /etc/kubernetes/pki/ca.crt.
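The PATH reordering at the start of this segment works because the shell runs the first match it finds, so a rogue copy in an earlier directory shadows the real tool. A rough Python sketch of checking for that kind of shadowing follows; the `resolve_all` helper is hypothetical and simply asks the standard library's `shutil.which` about one directory at a time.

```python
import os
import shutil


def resolve_all(command, path):
    """Return every executable named `command` along `path`, in lookup order.

    The first entry is what a shell would actually run; any later
    entries are being shadowed by it.
    """
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        # Restrict the search to a single directory at a time.
        found = shutil.which(command, path=directory)
        if found:
            matches.append(found)
    return matches
```

For example, `resolve_all("cat", os.environ["PATH"])` returning more than one entry means an earlier directory is overriding a later one. Whether that is intentional is a judgment call, but it is a cheap first check before trusting any tool's output.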
53:16 Kubelet Certificate Errors
53:55 Appears that that file is fine. He set a temporary file system to read-only on the /etc directory. Sneaky. I do believe somebody said like twenty minutes ago, it felt like we were on a read-only file system. That looks to be the only change. It's nice to be able to run commands without getting nine c metrics. Just I'm just throwing that out there. It feels nice. I don't know. I kind of wish my terminal just randomly every once in a while threw it out there while it was working. You know? Actually, okay. So that's a new project. We
54:50 need a Rawkode project is like every hour, your terminal the next command will give you a 9cat just to like get you up and moving. I think it's a feature. Didn't seem to fix it. Oh, it's still there. Oh, you get the tool. Shit. It came back. We're gonna go back to it. It feels like we're on a read-only file system. One second. I have a command. Yeah. So it's not making that modification. This is what we had with the passwd file. Can you do a d h dash capital t f? T f t, sorry. T.
56:12 Oh. Oh. D h isn't found. Sorry. Never mind. That won't that won't change anything. Our file system looks alright. And in fact, I can write files. So either Vim is compromised. Did did he just recompile Vim? I have something running out there. Oh, my OS query seems to have landed a second state. Alright. I need to admit the lock file. Good thing I know Damian alright. Although, maybe not well enough. No. I know it's telling me it wants the PATH to have that. I don't want the PATH to have that. I'm gonna assume he's he
57:32 says he's not recompiled Vim. Oh, okay. And this is recompile? Yeah. I think our is okay. Alright. So then how would he keep us from changing the file system? If the file system isn't a read-only file system, and Vim is okay. Do I still have? And Vim complains if you don't have the permissions to change it. It doesn't just silently fail, generally. Generally speaking. Well, let's try editing it with another command and see if that works. So sed and line and we want to remove what was the It was the run once. Tempest Oh, no.
58:27 It's the file system. Sorry. The read-only. Okay. So we wanna do s search. You have a capital y. Star or dot star slash remove is it I e I? Maybe we don't need to search. Used to be okay with this sed nonsense. Don't make me bust out Perl. Sed. Yeah. Let's just look it up. I don't need to. Sed replace. Sed delete line. There's not I'm sure there's a nice easy way to do that. So we can just tell it the line number. That's nice. I didn't know that. So kubelet, sed, and it's line 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
59:43 d, kubelet. Nice. And now we need it to be inline. Take that, Russell. Nice. Yeah. The file attributes. He could have put the immutable file attribute on it. That was I was trying to go through the ideas of Yeah. The attributes look alright. Okay. Alright. Do you think we have a kubelet yet? No. Oh, I I stand corrected. Maybe. Do we have an API server? No. Do we have anything, QB? The kubelet. Can you check if we have etcd? We don't seem to have any static manifests yet. So we can check the kubelet config to
1:00:21 Containerd & OCI Runtime Issues
1:00:56 see if he screwed with the path that it looks for the static manifests. Do you wanna check? Pardon? Do you wanna check? Sure. Let's see. We're going to run where is the config file for etcd? /var/lib/kubelet/config.yaml. Where did you say? Sorry. I was not listening. /var/lib/kubelet/config.yaml. Alright. You're gonna use Vim? Brave. True. I'm just gonna add it because it's probably a binary. If he does that to you, I'm I'm not gonna stop laughing. But I'm sure he wouldn't. And James said maybe Vim is just a read only quote. I don't know how you
1:01:40 would do that, but maybe. I have no idea. I don't know what's going on with Vim, to be honest. Do we not have a /var/lib/kubelet? Maybe there. Cool. So config.yaml here. This looks acceptable, but let's look through that. Wait. Wait. Wait. Wait. Where is the okay. Here's /etc/kubernetes/manifests. So this is the static pod path, the thing we were looking for, is correct. Yeah. This looks about right. I'm happy with that. Do we do we have do we have any kube stuff yet? The certs file here is fine. I think this looks fine to me.
1:02:45 Maybe we're just impatient. No. We weren't. Is the v1beta1 actually the is it still not in v1, the kubelet config? No. We do have a kubelet, although that's been running for a couple of days. So maybe we need to look at the maybe these manifests aren't real. The kubelet should did we check the logs? Like, a journalctl of the log? Because the kubelet should complain if there's a parsing error. Okay. So it says if not present. I don't know. Maybe we wanna image pull policy of Always. But the image looks correct here.
1:03:35 Russell says he made two changes, and we only caught one. So let's cat that unit file again. We've obviously missed something here. We have eight minutes. Let's check the flags. Oh, look at that. Oh. Oh. Okay. So we just need to delete that without Vim. How would you cat with line numbers again? Is it dash n? Yeah. And then this sed -i, delete 70 crap. Did you screw up Vim that way? If you didn't he said he didn't recompile Vim. I need to know. Well, let's check out the RC. So he's over he's changed.
1:04:37 Oh. Oh. Hey, that's gone. Let's restart the kubelet. We should have looked there. Yeah, we should have. But we assumed the worst. We assumed he had compiled his own Vim. So Okay. So anybody that doesn't know, I went against Russell one time and he basically recompiled everything. So whenever I think of Russell, I think that he's recompiled stuff, which he probably used against me. Alright. It's not starting. Our static manifests still. Error getting node generally is just that etcd hasn't connected yet. Right? Yeah. Well, it's failing to start the scheduler. Well, there's no API servers. Maybe that's all
1:05:55 maybe it kind of expected. Right. Because the API server is the only thing that's gonna touch etcd, which is where it's gonna find its node. And we've got an event here. Connect connection refused. Unable to register node with API server. I'm not seeing anything. Okay. So we we we can do this. It's not starting our API server. We don't even know if it's trying to start our API server. Passing glance on that manifest tells me that it's it's alright. Run to default. The yeah. It looks I I mean, yeah, I don't know the whole manifest file, but it looks fine to
1:07:12 me. I feel like we're missing something in the kubelet startup, so I'm gonna restart it. There must be an error message. Kubelet has to have an error message on why it's not starting it. Okay. So here's this. I'll stop scrolling. Should have passed through a pager. One second. One second. Failed to create containerd task. Failed to create shim. OCI runtime caused unknown container ID. So it's failing to talk to the OCI runtime. I can't I can't see it. I'm just gonna restart. It still starts being on me again. I hope that's alright. That's fine. Go for it. Alright. Control c.
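The kubelet output scrolled past too quickly to read here. As a rough sketch (not what was run on stream), a small filter can keep only error-level lines, assuming the usual klog header in which the severity letter comes first (`E` for error) followed by the month and day. The function name and the sample lines in the note below are hypothetical.

```python
import re

# klog header: severity letter, MMDD, then a timestamp, e.g. "E0421 10:15:03.000002".
# journalctl puts its own prefix in front, so allow leading text before the header.
KLOG_ERROR = re.compile(r"(?:^|\s)E\d{4} \d{2}:\d{2}:\d{2}\.\d+")


def error_lines(log_text: str) -> list[str]:
    """Return only the lines that carry an error-level klog header."""
    return [line for line in log_text.splitlines() if KLOG_ERROR.search(line)]
```

Feeding it the text of a saved kubelet journal would leave just the `E...` lines, which is where a message like "failed to create shim" would show up.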
1:08:02 Debugging OCI Runtime Error ("Unknown Container ID")
1:08:18 Let's see. So you see something about containerd in here? Yeah. Yeah. I've never actually seen this error before, but it's saying the let's see. The kubelet is complaining: start container from runtime service failed, failed to create containerd task, failed to create shim, OCI runtime create failed, starting container process caused getting the final child's PID from pipe caused end of file, unknown, container ID, UID. Do we have any containers running in the container runtime? Can we like I mean, is containerd running? Wait. Wait. Wait. Okay. There is a readOnly true here in volume mount. Oh, no.
1:09:11 This is fine. Never mind. Alright. Let's try. That that error almost seems to be the controller manager, though, not the I see that for as well. So so we do have some two days ago. ctr c ls. No containers whatsoever on containerd. We only have one image. Let's check the containerd log. Something. Failed to create shim. Why would it fail to create the shim? OCI runtime create failed. So getting the final child's PID from pipe caused end of file unknown. Well, containerd doesn't have any configuration. So this is a standard containerd.
1:11:03 Can you search the files in the manifest for the word unknown? Okay. Just a thought. I'm gonna restart containerd. Could we actually, we could check for a containerd configuration file. There shouldn't be one. Right? There isn't one. Okay. So EOF unknown typically happen The quote at the end of unknown is causing me pain because the the line there, it it's doesn't have a starting quote. It it it does, but it's not on the it's on the field. It's a structured log. So it's here on the error. Oh, okay. Okay. I get it. Okay. So if we run the status on
1:12:09 this well, we are at time, but we're gonna we're gonna fix this sucker. So we do have a bunch of shims with the Kubernetes namespace. I never actually provided the namespace when I did the CLI. Let's try that. Oh, we do have containers. Okay. Wait a second. We have a kube-controller-manager, scheduler. Is this all running in like is there two run two namespaces? Like is it running in a different one than what kubectl is hitting? Let's try and grab some logs. I I don't know. That's my bad, sir. There's no log
1:12:17 Discovering Containers in Kubernetes Namespace
1:13:08 back around. Let's try and inspect one of those. And I guess the API server is the one we care about, which okay. That one definitely isn't there. The controller manager is. The controller manager, the a p or the scheduler I saw, but the API server I didn't see. Oh, and full night inspect. Wait. Is it me or does it just But we don't have a controller manager running, do we? This is why I feel like it's a shadow namespace or something. But you're right that ps should see that. You could have broke runc. Which container
1:15:32 Let's put dash l on that. Have any of these been modified? No. Did I get my BPF tools earlier? Oh, we should, I guess, check kernel modules loaded. I want my tools now. He's annoyed me. Oh, wait. That's didn't work because of It's trying to run c metrics again. So the PATH has been modified. Why is the apt-get using c metrics system d? We removed the profile directory. I don't know of any other way to set the PATH at the global level besides that. Right? So we've already modified the PATH. Can we just
1:17:20 delete the things out of sbin that have been modified recently and put sbin back in the PATH? We'll find a few of them here. Environment. Is there a local I mean, I guess I can also just remove systemd and journal and stuff from that, but then I don't know how many Right. These modified. But I need I want apt-get to work. I was gonna say if we can install the BPF tool, we can scan for Well, I wanna modify I wanna see what files are being opened by processes. What what's your idea?
1:18:17 I was just gonna look for kernel modules loaded, like eBPF programs. Yeah. We can run bpftrace too. This is one of the most annoying breaks we've seen. Right? There is there are? There there are. There are hints in the home directory, but I'm not sure we need it quite yet, but just throwing that out there. I know that Not giving him the satisfaction. I I was gonna say, I feel like for Russell, we can't use his hints because Why is that calling that again? When I delete that file. It I've had enough of it now.
1:19:17 This one. The /usr/sbin? Yeah. Good. I'm not as good with I I am super glad you're dealing with that. I I did it. Okay. No. It doesn't. It doesn't install it. It keeps pulling up that really where was that again? It's been does it come back? No. So how when I run apt, is it getting that c metrics? Maybe from journalctl. So I just need to list everything in /usr/sbin, grep for April 20. April 21 maybe? journalctl seems alright. They all have the same date at this point. Right?
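Listing a directory and grepping for a date is one way to spot recently replaced binaries. A minimal sketch of the same idea, assuming modification times can be trusted (they can be reset with `touch`, so an empty result proves nothing). The function name and cutoff are hypothetical.

```python
import os


def modified_since(directory: str, cutoff: float) -> list[str]:
    """Return names of regular files in `directory` with an mtime newer than `cutoff`.

    `cutoff` is a Unix timestamp in seconds. Symlinks are not followed, so a link
    that was swapped in is judged by its own timestamp, not its target's.
    """
    recent = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime > cutoff:
            recent.append(entry.name)
    return sorted(recent)
```

For example, `modified_since("/usr/sbin", time.time() - 2 * 24 * 3600)` (with `import time`) would list anything changed in the last two days.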
1:21:02 Yeah. He's not He's reset the dates, which is also annoying. Just talking now. Like, he's sneaky. Okay. But they're not installed. Oh, because I've removed sbin from my PATH now. Correctly. This is the one I wanted. Alright. So special is any reader rate on the file system. Not sure about that. Log extra fields on the service. Really, because I don't trust the kubelet service yet. Now kubelet is reading the static manifests. That seems normal. Alright. What was your idea? Effectively, this. This looks fine. I I was going to run there's a BPF tool,
1:22:55 and then you could give it the prog, which will list all the eBPF programs. Yep. We can do that with bpftrace. That was my only I don't know. That was my thought as far as seeing if there's any Russell code in the kernel. Well, there's definitely a lot of eBPF going on, but we also have Cilium on this cluster. The new Cilium does use eBPF, so that should be like that should be fine. Yep. I mean, we could look for things that are suspicious. He he also told us that his hack
1:23:48 doesn't work. So right now he's I'm pretty convinced he's broken containerd. Alright. So how would you break containerd? There's runc, which is how containerd runs applications on the host. It kind of takes care of the the stuff. How would you break runc? There's namespaces and cgroups and the things that it does with the kernel calls. I've never actually touched configuration of runc. Can you configure runc? Let me let me Google that. I don't think so when you're using the containerd shim, but please check because we're into uncharted territory for me as well.
1:24:31 Yeah. I'm gonna apt-get install, reinstall containerd. And we checked that there was no containerd configuration file. Right? There Yeah. I did that containerd config dump and we got nothing. And Russell waited basically till I reinstalled containerd before telling me he hasn't touched containerd. So, you know, thanks. Let's read this error message properly. Run pod sandbox for kube-apiserver. This is the one we care about. Failed to create a containerd task, failed to create a shim, the OCI runtime create failed, getting the final child's PID from the pipe. Okay. So
1:25:51 I am gonna use a little s squid in there. Oh, it's gonna be not working because of the screen size. Alright. I'll just do a PSN. K. So there's actually a there's actually a GitHub issue for this, the exact error. It was a configuration okay. So sysctl had a max user namespaces. So as I was saying, possibly messing with namespaces, Like, restricting the number of namespaces? Russell says bingo. I don't know. So we're restricting the namespaces via systemd? Here. Can I type Yep? Please. Go for it. Parameter. You're right. It's a kernel parameter. It's like
1:26:20 Identifying the Kernel Namespace Limit (sysctl)
1:27:33 because namespaces are part of first before you go right into it. Okay. So this sysctl grep have type grep for namespaces. Because I'm curious of what he's tampered with. That's not on our PATH anymore, so you'll need to Okay. Okay. I got it. Sorry. Russell. Okay. I just I'm gonna go to the user as guess we're going to the dash Yeah. And grep. Oh. Oh, yeah. I'm a dumb dumb. Oh, yeah. Max code name SpaCy7. That's a very specific number. Alright. Let's pull it that way. So let's actually just update it to the rest of these. Oh, shoot. I did
1:27:39 Fixing Kernel Namespace Limit
1:28:44 I opened debugging tools. One second. Oh, okay. You got it. Alright. Probably have to restart containerd. Although you could we can run a ps, I guess, and see if we start to see things showing up. Like, grep for kube and see if we've got any control plane component. Controller manager. That's it. It's coming. Oh, Russell. I that was good. Messing with the the system settings. Actually Restart the kubelet to speed us up. Educational break. I I really like it because it's it's talking about namespaces, cgroups, how how containers are actually run. That's that's brilliant.
1:29:54 Good one. I appreciate that one. Do we have an API server yet? Just coming. Okay. It was there. Let's go check the log. No. I have containers. Oh, okay. Apparently, we're halfway through. Oh, no. But now we're hopefully talking Kubernetes land, so it should be easy breaks. Alright. /var/log/containers. Sorry. Oh, /var/log. That's that's in the wrong place. And then kube-api. Failed to find any PEM data in certificate input. So this looks like he messed with the certificate data. Alright. Let's go check the certs. So this needs pretty much everything annoyingly. Did it say which one it unlocks?
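The break here was a kernel limit on how many namespaces may be created, which starves runc when it sets up a container. A small sketch of scanning `sysctl -a` style output for suspiciously low `user.max_*_namespaces` values. The threshold of 1000 and the sample values in the note below are illustrative assumptions, not figures taken from Russell's host.

```python
def low_namespace_limits(sysctl_output: str, threshold: int = 1000) -> dict[str, int]:
    """Parse `key = value` lines and return user.max_*_namespaces keys below `threshold`."""
    flagged = {}
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            continue  # not a key = value line
        if not (key.startswith("user.max_") and key.endswith("_namespaces")):
            continue  # some other kernel parameter
        if value.isdigit() and int(value) < threshold:
            flagged[key] = int(value)
    return flagged
```

Given text such as `user.max_mnt_namespaces = 7` next to a healthy `user.max_pid_namespaces = 63000`, only the first key would be reported.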
1:30:32 Worker Node Not Ready & API Server/Etcd Cert Issue
1:32:04 So this says the TLS cert file, the configuration is the apiserver.crt in the PKI. Will we just refresh all the certs? With kubeadm? Yeah. I think that's the easiest way to do this is just ask kubeadm to refresh it. I don't remember the command off the top of my head. No. I never do. There. The configuration file looked fine. Certs. Renew. We could just do all, I guess. Two changes per broken cert. What does that even mean, Russell? What does two changes per broken cert? How many broken certs are there? How
1:32:12 Regenerating Kubernetes Certificates (kubeadm)
1:33:24 Alright. So we're gonna have to delete. Cool. Now we just check the logs again. Well, we need to restart everything. So temp Man, man b star. And move all back. Do you need to restart the kubelet first? I mean, we'll do that anyways just to speed it up. Where'd that .envrc come from? Nice catch. Nice catch. Now okay. I think we just sledgehammered his fine break. I think he actually base64 encoded data in them. So if we had actually base64 decoded his certs. It was taking us so long. Guess we still got better.
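The hosts only guess that the broken certificates had been base64 encoded, so treat the following as a sketch of how that guess could be checked, not as a description of what Russell did. It classifies file contents as plain PEM, PEM hidden under one extra layer of base64, or neither. The function name and return labels are hypothetical.

```python
import base64
import binascii

PEM_MARKER = "-----BEGIN "


def classify_cert_text(text: str) -> str:
    """Return 'pem', 'base64-wrapped-pem', or 'no-pem' for the given file contents."""
    if PEM_MARKER in text:
        return "pem"
    try:
        # Strip all whitespace first; validate=True rejects non-base64 characters.
        raw = base64.b64decode("".join(text.split()), validate=True)
    except (binascii.Error, ValueError):
        return "no-pem"
    decoded = raw.decode("utf-8", errors="replace")
    return "base64-wrapped-pem" if PEM_MARKER in decoded else "no-pem"
```

A file reported as `base64-wrapped-pem` would explain an API server error about finding no PEM data while the underlying certificate is still recoverable.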
1:35:08 Continued Certificate Issues & Etcd Logs
1:35:08 Oh, it's not regenerated to API server. To forget for a survey. Alright. We'll regenerate it. But that's about to have people with the RM. Let's see if we've got what we need here. Oh, good. You did that on that too. Maybe we can't renew. Can't do we need to generate the keys and certificate for signing request and then renew? Did you delete the signing stuff? Yeah. I think I deleted too much. Been there. Read your Certificate for serving the Kubernetes API. You're annoying me now. So I'm gonna have to do and it I remember like there's
1:37:25 An hour ago when I used to have a friend called Russell. It's alright. Let's just regenerate everything. And then we'll have every PKI. Yeah. It's better. But now, because we've modified even the Rawkode to connect, but I'm alright with that. Let's we need to move all the manifest back again. Right. We'll have to restart everything now. Alright. And we have an API server. Maybe. Can't be that easy. Can't be that you just do get nodes. No APIs ever flipped. Okay. Can't speak to k. Oh, did we restart etcd because we gave it new certs. Right? Yeah.
1:38:56 I need to be a bit more soft touched with these components I think and getting some bad habits now. Well, I do believe that he base64 encoded words into those. I don't know if we would have decoded them. Can we see the logs for etcd? Yeah. It doesn't like the certificate that it's getting. Old or new? Old or new? It's like the certificate signed by unknown authority which would cause me to believe that the root cert, which we changed, it either hasn't gotten the new one or the kube-apiserver hasn't gotten the new
1:40:19 Debugging Etcd Trust Error ("Unknown Authority")
1:40:20 Yeah. I'm gonna stop everything. This is gonna end up in trouble, isn't it? Nah. It's fine. It's fine. Why is that not working? So it doesn't think that you provided it a container. Oh. Oh, xargs. Yeah. xargs. That's what I was gonna say. You have to pass in there you go. Alright. How do we delete a stopped container? Or do I need to just stop it first? Or we could just run stop first. So I don't use ctr very often. Yeah. I I assumed rm would just have a Yes. Really delete me flag. Can you just use
1:41:57 Why don't we try the pods? I don't wanna use crictl for that. Pods out there. awk print one. At n one. No. Tail n one. No. How'd you skip the row again? That doesn't matter. Right. crictl. Runtime endpoint. Really should have just set the environment variable. Stop pods, rm force. And egg. Oh, no. That's weird. Gonna have to type all that again. Alright. Alright. Maybe we're overthinking this. Maybe we just screwed up the certs. I don't know. We could we I don't think we did. Why is that not deleting them? Pods only list pods. Yeah. So you have
1:44:14 to do rmp force. Alright. Now we got very little running. Start. Kubelet. You wanna reset you wanna reset the certs one more time or we are we happy that they're maybe okay? I think we're fine. I think I think we did like the full you ran the phase for kubeadm, which should regenerate everything. That's annoying. Alright. So it still says certificate signed by unknown authority. The only thing I can think of is we need to delete stop the kubelet, delete the pods, then redo the certs. Because maybe one was holding one open and didn't let it get re reformat
1:45:24 like, rewritten. I didn't or we didn't notice that it complained that it didn't rewrite the cert. That's like the only thing I can think of minus minus him recompiling the binaries to just complain about that, which I'm hoping he didn't do. Alright. Well, we have now regenerated all the certs. The kubelet is currently stopped. There's no pods running. Now we restart. Encryption isn't enabled on etcd to the best of my knowledge. That would be harsh. Start. Oh, Russell telling us to use the hints, but he said the certs are fine. There's a line missing from the API server YAML.
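When a single flag has been removed from a static pod manifest, comparing the file against a list of flags you expect can be quicker than reading a hundred lines. A minimal sketch, assuming a kubeadm-style manifest where each argument sits on its own `- --flag=value` line. The flag list is an illustrative assumption about what a kube-apiserver talking to a TLS-secured etcd normally carries, not a copy of Russell's file.

```python
import re

# Flags a kube-apiserver usually needs to reach etcd over TLS (assumed baseline).
ETCD_FLAGS = ("--etcd-servers", "--etcd-cafile", "--etcd-certfile", "--etcd-keyfile")

# Matches YAML list items such as "    - --etcd-cafile=/some/path".
ARG_LINE = re.compile(r"^\s*-\s+(--[a-z0-9-]+)")


def missing_flags(manifest_text: str, expected=ETCD_FLAGS) -> list[str]:
    """Return the expected flags that never appear as a command argument."""
    present = set()
    for line in manifest_text.splitlines():
        match = ARG_LINE.match(line)
        if match:
            present.add(match.group(1))
    return [flag for flag in expected if flag not in present]
```

Running it over the text of the API server manifest would print the name of any flag from the list that is absent, which narrows the search to one line.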
1:46:08 Fixing API Server Manifest (Adding etcd-cafile)
1:46:13 Oh. A line. You pick one line out of a hundred lines. One that's missing? Okay. Yeah. Need to, like, pull up. Oh. Nice call. I think. There should be a cert and key for each one. Change it to yeah. The ending. Alright. Yeah. ca.key. And then let's just confirm with we won't have a kube-apiserver, will we? Well, let's see what the API server says. That's not what he removed. Alright. So I just assumed whenever there's a CA cert, there's a CA key, but I guess not. So it's the cert file, it's the key file.
1:47:39 That looks okay. Right? Well, if he's going off of removed a line, it's not that he edited a line. And we have the kubelet client there. Let me kubelet client key. And we're just assuming that it's one of the commands. Let me see if I can't pull up I'm gonna log in to the other cluster and pull up this file. One second. Yeah. Not? I mean, we could always init phase manifests, but Yeah. I'm I'm gonna cheat. I'll be right back. Alright. So It's an etcd-cafile according to the docs. Do we not we have a cert file,
1:48:52 a key file, a server. Oh, okay. The file. No. It's looking for the certificate authority. And what's the path on that? On your Was that not just pki/ca.crt? I don't know. Is there an etcd CA at all in your Shouldn't it be the authority that signed all of them? I think it should be the same authority. That would be I guess, etcd could have its own authority. Oh, it does. Yeah. Okay. etcd CA's there. And then we wait. Nothing. But maybe I'm being impatient. Definitely old. Oh, one second. Can you go back to
1:50:31 that file? Did you go did you add the one that was Well, it's not trying to connect in four minutes. So I'm gonna restart the kubelet. What? Which one? Sorry. No. No. No. You I just you probably did. There's the you did the CA cert that's in the etcd folder. Yeah. Alright. There we go. We got a new log. It's all good? Okay. So that just means Alright. admin.conf is using the old. Right. Alright. Let me do remove admin.conf. kubeadm. There's renew admin. Okay. Can't renew if it doesn't exist. We have
1:51:10 Regenerating admin.conf
1:51:45 to do init phase. So I thought we did this. I thought that the init phase was going to rewrite the admin.conf. Yeah. It's it's own phase. I just don't remember. So there's the control-plane one which would have done the static manifests. Kubeconfig. And we'll probably need them all actually. Right? Yeah. I I think so because it's going to build ones for each of them. They're actually going to continue to have problems talking to each other without them. Alright. Look at that. Two well, not two hours. Almost two hours. And we've got a we've
1:52:49 only just run our first kubectl command. And we're assuming that the control plane is actually working. Like, we haven't checked on, like, the things that are actually running on it. Just kubelet's happy. Yeah. I'm just going straight for the I have to launch vi. Oh, because our PATH and everything is screwed up. Error equals which then edit v one. Feels too easy. Well, the worker is unhealthy, so it's probably not gonna be able to schedule it. Do we really need to go fix this worker? I'm tired. I was gonna say, I don't know what
1:54:00 Tackling the Worker Node
shell we're going to drop into on the worker because when we logged into this one, we spent the first forty five minutes getting into bash. Logging into the worker. Let's do it. Let's do it. Command. Wordle in the command line. This is fantastic. Enter your guess. James, we could untaint the master and that's usually my favorite trick or manually scheduling the pod, but we did say to Russell we would fix the worker. So and we go. This was a command. It's not gonna overwrite the shell set in the passwd file. I'm sure there's a way to tell it
1:54:15 The Worker Node Wordle Game
1:55:17 to disregard the login shell. I don't really want to play Wordle. It's five letters because I guess Russell. It wasn't Russell. But is it today's actual Wordle, or is it his own Wordle? Oh, I I don't I've actually never played Wordle before. Oh. So I also tried I hate you. Custom Wordle. Alright. Okay. Alright. Let's try. Words. Not cubes. Hello. Alright. We have an l and an o. Is that an l and an o in the word or in those positions? A green would be in the positions. Okay. Okay. How is not in the dictionary?
1:55:25 Playing the Custom Wordle
1:56:29 Why can I not think of any words with l and o in it? Limbo, I guess. I Oh, nice. Okay. It starts with an l. L o. Okay. There's all my l words. Logging. Wait wait. No. That's too long. Loom. That has an o and an m. Right? Loom. I freaking I I hate you, Russell. Wordle. Shelf. Must be an a. Lamps? How is lamps not in the dictionary? Well, you're gonna get a different word. That's what he's saying, I think. Oh. Just random pair and location. Oh, that's even okay. So I'm I'm not I'm
1:58:06 gonna stop guessing, and I'm just gonna look at yours. It has to have an m and an o. Loamy, that is not a word. Yeah. I would've never gotten that. One second. Let me look it up. I'm calling BS on that one. Okay. Denoting or relating to a fertile soil of clay and sand containing humus. Really? I was never good at Wordle. The end might be s e. That's a a common ending. Since we have an e, I think that's a common ending. I'm not really good at English being that it's like my third or fourth
1:58:23 Escaping Wordle
1:59:21 language. It's not definitely not my best language. What's your first language? English. It's my only language. I'm just not very good at it. Yes. So that's an all we have to par is s and e. I'm just trying to get more letters out of it, and it's it's gonna go I think it's just gonna give us a shit word. Alright. So can you try these? Obese. Oh, obese. Told you, s e. Oh, nice job. This is the last one. And then, Russell, you're gonna have to help us with this. Can you actually I'm going while you're doing that, I'm gonna
2:00:59 try to escape the shell. I wish I played Wordle. Okay. Nothing I'm doing is letting me escape the shell either. Okay. R and a. Pratt. That seems fitting. Alright. Maybe not p r. What goes before r? No. Because there is in the middle, Russell. Could be cramp. No. Pram. Prada. Dry. Drums. Grain. Oh, we're almost there. I rate. Hey, Russell. Do your magic. We won then. Alright. That has c d this thing? Yeah. So you've not broken that at least. Hey. I actually beat Wordle. Mine was early. Alrighty. Make sure we don't need to do that
2:03:42 again. Alright. That must have fucked with. That's good. Now, what is the problem that we have? We have an unhealthy yes. Well, all these certs are gone though. Right? Yeah. Let me join your session. Give me just a second. Oh, we're gonna have to regenerate the certs here. Yep. So if we do a jq temp custom data, this is all my setup stuff. Join token is here. If we go to our loop cloud instance, script, grep join. We can run these two commands to rejoin the cluster. DNS name isn't there. I think this is
2:04:19 Re-joining Worker to Cluster
2:05:39 yeah. Okay. kubeadm reset. Russell, I just wanna say nice job on this. This was really well done. What'd you say? We got from the join that give us header. Oh, check the check the namespaces. We haven't got our worker joined to the cluster yet. Oh, okay. Sorry. Sorry. For some reason, whenever I join your session, it's just not joining your session giving me Russell on the worker. Russell worker one. No. I won't. Yeah. So it's gonna 145. It should be going through our default gateway. Default through here. Not to overwrite it. Oh, we have an unreachable.
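An unreachable or blackhole entry in the routing table silently drops traffic to one address while everything else still works, which is why the join fails only against the control plane. A small sketch of picking such entries out of `ip route` style output. The sample addresses in the note below are documentation-range placeholders, not the worker's real routes.

```python
# Route types that reject or drop matching traffic instead of forwarding it.
REJECT_TYPES = ("blackhole", "unreachable", "prohibit")


def rejecting_routes(ip_route_output: str) -> list[tuple[str, str]]:
    """Return (type, destination) pairs for routes that discard traffic."""
    found = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in REJECT_TYPES:
            found.append((fields[0], fields[1]))
    return found
```

With input like `default via 192.0.2.1 dev eth0` followed by `unreachable 192.0.2.145`, the function would return only the second route.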
2:07:01 Fixing Worker Route Issue (Blackhole)
2:07:13 Yeah. Black hole does. Yep. Yep. Rawkode. For you, Russell. That looks better. Oh, did that just segfault? Oh, DNS lookup failed. And then x509 unknown certificate. Sorry. I wonder if the join token was reset somehow. So what we're gonna do here is regen that, so we can actually ask it for a new join token. Create oh, it's a bootstrap. How do we do the join token again? I don't remember. Maybe the generate join. Believe it's run on the control plane. Yeah. It's just a Pretty straight. Yeah. Okay. Cool. kubeadm token create. There we go.
2:07:47 Debugging Worker Join Token/Config Map
2:08:22 X four join token. That makes sense. I think it actually expires after twenty four hours anyway, so it probably was bad anyhow. Yeah. Yeah. Probably right. We have to recreate the config map. That's the one that expires after twenty four hours. Oh, okay. Okay. We have to tell it to I knew something expired after twenty four hours. I I didn't remember which one. And it is and this should be an upload. Yep. Config. Upload service and config. Let's try config and see if that works. Well, let's take off the v nine. I'm a bit more confident now.
2:09:26 Oh, does our black hole come back? Alright. Let's bring back my v nine. Okay. So we do need to upload this. Maybe this there. So what was the missing? Please see uploads there. We are not pending the certificate key. So, alright, this isn't working. The cluster-info config map does not yet contain a JWS signature for token ID. My join token wrong? I thought you just generated a token. Did you have to do it after you did the upload all? Don't think so. kubeadm init phase. You did the backs of the Bootstrap token
2:11:12 for the join node to the cluster. Okay. I don't I don't know. I haven't done this in a really long time. But there's the bat the bootstrap token that generates a bootstrap tokens to join the cluster join a node from a cluster. Yeah. We need to rerun that one. It doesn't hurt. Right? So and then we'll print the join command. Well, this is essentially what we were running. Well, that's obviously still complaining with the same message. Okay. What does this mean? It does not contain a judgment. Okay. So wait. Wait. It's so weird because the control plane is that's
2:12:29 the token ID of the control plane that it's printing out. So it looks right. Which one are we using? 70 yet. The 70 yetle? Yeah. This should be working unless the problem is an existing config map. Well, that's our although it should have been updated. I'll see MRM. What's wrong with that? Keep control. No. Keep control doesn't have a remove RM. It's a delete. Thank you, Zari. I mean, some people, I think you could, like, alias it if you like, I guess. Alright. So let's upload all the config maps again. Alright. Russell says we fixed the root issue.
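The join error refers to a signature stored in the `cluster-info` ConfigMap under a key derived from the token ID, which is the first six characters of a bootstrap token. A sketch of working out which key to look for, assuming the standard `<6 chars>.<16 chars>` token format. The example token in the note below is made up. That signature is normally written by the bootstrap signer controller inside kube-controller-manager, so a controller manager that is not running is one possible reason for it to be absent.

```python
import re

# Bootstrap tokens have the form "<token-id>.<token-secret>": 6 and 16 characters.
TOKEN_RE = re.compile(r"^([a-z0-9]{6})\.([a-z0-9]{16})$")


def jws_key_for(token: str) -> str:
    """Return the cluster-info data key that should hold this token's signature."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError("expected a bootstrap token of the form <6 chars>.<16 chars>")
    return f"jws-kubeconfig-{match.group(1)}"


def has_signature(configmap_data: dict, token: str) -> bool:
    """True if the ConfigMap's data mapping carries a signature for this token."""
    return jws_key_for(token) in configmap_data
```

For a token such as `abcdef.0123456789abcdef`, the key to look for in the ConfigMap's data would be `jws-kubeconfig-abcdef`.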
2:13:49 Regenerate the search may have made us a bit more difficult. I really shouldn't have though. That should be able to join just fine. My token is fine. Buying Google. Yeah. I don't know. Is there any webhooks on our cluster which is causing it to fail to join? Yeah. Actually, I haven't really looked at the cluster yet. No. Alright. I think I'm just gonna bypass this. I don't I'm not sure what we have wrong here. Seems Because it's speaking to the cluster, and it's saying the cluster info config map doesn't have something. I feel like this is just my lack
2:15:07 of running kubeadm to do this specific thing. Like Possibly. Possibly there is a command or something that we're I thought, I thought the upload was the one to do. Alright. What's in this ConfigMap? It's been a bit weird. Wait. Wait. Look at that. The JWS kubeconfig up at the top of that ConfigMap. It's not the same unless that's base64 encoded. Oh, no. I'm just gonna delete this. There should be an alias for delete on that. Right? Okay. Here we are. And if it's gonna do Just the upload cert
2:16:09 or upload config. Upload config. Sorry, I said certs. I meant or wait. Which one would it be? Upload config or upload certs? Yes. Both. Alright. So how do we regenerate the cluster-info ConfigMap? Alright. Russell is saying try removing the mount namespace limit. I don't know that the namespace limit's currently a problem. I thought that may have regenerated it. Do we need to upload the config or upload the I'm just bypassing Oh, never mind. Okay. I got it. Sorry. I have to do my kids' bedtime soon. So we'll go with the old hack, bypass
2:17:17 Manual Pod Scheduling (Temporary Workaround)
2:17:18 the scheduler. Schedule on the control plane node? Yep. Node name, Russell control plane one. I did edit the deployment. Right? Let's delete the pods because I probably put it in a very weird position. I'm just deleting the world now. Oh, we don't have a controller manager for this stuff, it looks like. Oh, that could be in the configuration manifest files, I guess. That looks alright. No. No. I meant, like, do we not have one at all? Like, he has the manifest up and running. I don't know these manifests that well. Shoot. Looks alright.
2:19:22 That's not been restarted in a long time, so it's probably using the wrong cert. What else is probably using the wrong cert? That should all be alright. I mean, the scheduler we're bypassing anyway. We have a ReplicaSet. We have a pod. It's in Pending. I've got a node. What does oh, at a node Russell control plane one. I shouldn't have to remove the taint because I manually scheduled it. No. Because you directly scheduled it. That shouldn't matter. No. That's just pending for another reason. Wait. Did you do describe? You did do describe.
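The "old hack" of bypassing the scheduler amounts to setting `nodeName` in the pod template so the pod is bound to a node directly. Here is a small sketch of that edit applied to a Deployment manifest held as a Python dict. The helper name `pin_to_node`, the `demo` Deployment, the `nginx` image, and the node name `russell-control-plane-1` are all assumptions for illustration; the stream only mentions the node as "Russell control plane one".

```python
import copy
import json


def pin_to_node(deployment: dict, node_name: str) -> dict:
    """Return a copy of the Deployment whose pods are bound to node_name.

    Setting spec.template.spec.nodeName skips the scheduler entirely,
    so NoSchedule taints on the node are not consulted for these pods.
    """
    pinned = copy.deepcopy(deployment)
    pinned["spec"]["template"]["spec"]["nodeName"] = node_name
    return pinned


# Hypothetical Deployment, not the one used on the stream.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "demo"}},
        "template": {
            "metadata": {"labels": {"app": "demo"}},
            "spec": {"containers": [{"name": "demo", "image": "nginx"}]},
        },
    },
}

pinned = pin_to_node(deployment, "russell-control-plane-1")
print(json.dumps(pinned, indent=2))
```

The printed JSON could be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML. A pod bound this way that still sits in Pending with no events usually points away from scheduling and toward the kubelet on that node not picking the pod up, which fits the direction the hosts take next.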
2:20:53 No event. No events. Why wouldn't it schedule I mean, we fixed the namespace issue. The only thing I could think of is like when we change the system is to restart but I've I find it very dangerous to restart one of these systems. I have broken them that way. We don't need to look at the controller manager because that's done its job now. Alright. Kubelet. Maybe it needs restarted. Yeah. It's saying that control plane's not found. That almost looks like it's not talking to the API server still. I see some unauthorized. Did he touch RBAC?
2:22:05 I really do need to go in for my daughter. So let's check out these hints. Oh, dear. Oh, dear. Oh, we fixed it all now. Okay. Alright. Whatever's broken, I've done it. I think the problem is kubelet isn't speaking to the API server. I think something happened with our certs when we updated them. No. Because the kubelet certs were updated too. Yeah. Everything in there is brand new. Let's stop the kubelet. There are no So yeah. We actually untainted the control plane node plus we directly scheduled it on it. I don't think that's a problem
2:23:33 Redeploying Application Pod
2:23:40 as far as the taints and tolerations go. Unauthorized. Yeah. I don't know. Right now it does feel like something with the certs that we did. Well, like Like, we We're two and a half hours in. I have no idea what our last thing is. We've fixed a lot of Russell's stuff, so that's good. My mind is a little mush. You even beat Wordle, so that was fun. But let's call it there. How was that? You have fun? Oh, it was fantastic. Thank you, Russell. That was, that was good. Yeah. I think there was, there was a lot of learning
2:24:10 Wrap-up & Conclusion
2:24:24 there, although we were a bit heavy handed with the cluster at times. Lots of lots of fun stuff, so thank you for that. But I'm gonna go put my kids to bed and have my dinner and go to sleep crying like a little baby, I think. So Alright. Thank you, Kevin. Thank you, Russell, for breaking two great clusters. Thank you, Marek, for spending two and a half hours with me working through this absolute car crash of a cluster. Thank you to the audience that stuck with us. I know that was an absolute long slog, but lots of great fun. We will catch
2:24:55 you all next time. Thank you to our sponsors, Equinix Metal and Teleport. See you soon. Thanks, Marek. Bye. Thank you.