About this video
What You'll Learn
- Trace kubectl access failures by checking PATH, binary trust, and kubelet and API server connectivity from logs.
- Find and repair Kubernetes control-plane faults by auditing manifest state, etcd read-only flags, and controller-manager settings.
- Detect node-side misconfigurations such as rogue iptables rules and stuck containerd tasks that block workloads.
Control Plane and Learnk8s go head-to-head debugging each other's broken Kubernetes clusters. Issues include a rogue iptables rule, an etcd read-only manifest, a missing PATH, a tampered kubectl, jsPolicy webhooks, and a stuck containerd task.
Jump to a chapter
- 0:00 Introduction
- 0:46 Introduction
- 1:04 Housekeeping & Sponsor Thanks (Teleport)
- 2:14 Introducing the Teams (Control Plane & Learnk8s)
- 2:22 Meet the Teams
- 4:09 Control Plane Team: Learnk8s Cluster Challenge Begins
- 5:01 Control Plane Team Debugging Starts
- 5:17 Preparing Debugging Tools (Nix)
- 5:59 Debugging Kubectl Access Issue
- 9:38 Chmodding kubectl
- 10:08 API Server Connection Refused
- 10:21 Checking Kubelet Logs
- 10:39 Kubelet
- 10:58 Examining Static Manifests (/etc/kubernetes/manifests)
- 11:49 Time Stamps
- 12:52 Klustered diff
- 13:15 Using Backups to Find File Changes
- 13:46 Controllers
- 15:00 Manifests
- 15:16 Identifying etcd ReadOnly Issue in Manifest
- 16:38 kube-vip
- 17:26 Identifying Disabled Controller Manager (ReplicaSet)
- 18:54 File Folder
- 19:11 Investigating Rogue IP Tables Rule
- 19:27 IP Tables
- 20:06 Dash Capital L
- 20:56 Dash Capital N
- 21:27 Cilium
- 21:51 Locating and Identifying the IP Table Rule
- 22:21 -t nat
- 23:29 Pods
- 23:31 Removing the IP Table Rule
- 24:10 Kubectl Access Restored
- 24:43 Verifying Service Connectivity
- 25:56 Updating Application Image Version
- 27:00 First Cluster Fixed (Control Plane Team Success)
- 27:01 Moment of Truth
- 28:02 Interlude: YouTube Latency Test
- 28:57 Control Plane Cluster
- 28:58 Learnk8s Team: Control Plane Cluster Challenge Begins
- 32:19 How does the shell find binaries
- 33:10 How would we remove the path
- 35:04 I wouldn't trust the binary
- 37:16 js policy
- 38:32 webhooks
- 42:11 fail restart
- 43:29 update endpoint slices
- 44:44 process list
- 47:04 angry geese
- 48:14 finding containers
- 51:49 removing a container
- 53:54 string commands
- 55:44 tls
- 1:30:55 Learnk8s Team Debugging Starts
- 1:31:25 Debugging Shell Environment (Missing PATH)
- 1:32:40 Fixing the PATH Environment Variable
- 1:34:44 Debugging Pod Status Issues ("many", "terminating")
- 1:35:07 Identifying Untrusted Kubectl Binary
- 1:35:49 Replacing Kubectl Binary
- 1:37:15 Encountering Admission Controller (JS Policy)
- 1:38:58 Investigating Webhook Configurations
- 1:40:07 Locating Validating & Mutating Webhooks
- 1:40:23 Deleting Webhook Configurations
- 1:41:19 Testing Pod Creation (Webhooks Removed)
- 1:41:57 Troubleshooting Persistent "honk" Pod Status
- 1:44:45 Searching Processes for Anomalies ("hunk")
- 1:45:19 Found Rogue Process/Container
- 1:46:14 Tracing Parent Process & Container Namespace ("angry geese")
- 1:49:01 Attempting to Restart Pods (Delete All)
- 1:50:03 Using ctr to List Containers
- 1:50:35 Found Rogue Container in "angry geese" Namespace
- 1:51:13 Attempting to Remove Rogue Container (ctr rm fails)
- 1:51:55 Learnings about Containerd Tasks and Containers
- 1:52:55 Struggle with Containerd Task/Container Removal
- 1:58:59 Recognizing "always restart" Feature (Nerdctl)
- 1:59:14 Killing the Task via Nerdctl
- 1:59:43 Rogue Container Removed & Pod Status Improves
- 2:01:10 Webhooks Recreated & Final Cleanup
- 2:02:21 Second Cluster Fixed (Learnk8s Team Success)
- 2:02:44 Post-Challenge Discussion: Issues Encountered
- 2:05:59 Reflection on Challenges & Learnings
- 2:07:08 Conclusion & Farewell
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:46 Introduction
0:46 Hello, and welcome to today's episode of Klustered at the Rawkode Academy. I am your host, Rawkode. Today, we have two great teams picking up the Klustered mantle and fighting each other to win. This one's a bit weird because all the teams know each other, and I think it's gonna be a whole lot of fun. Alright. Now before we get started, this is a little bit of housekeeping. If you're not already subscribed to the channel, I would encourage you to do so now. Also, if you'd be so kind, thumb up the video, like it, comment, share, all those other things. It's
1:04 Housekeeping & Sponsor Thanks (Teleport)
1:17 just gonna help YouTube share this content with more people. We also have membership options available at the academy, so you can check them out and join for as little as 99p per month or cents, and this just allows you to support it, and I would appreciate that very much. And if you want to continue the conversation or you're not watching live, we do have a Discord server available at rawkode.chat. There's only 600 cloud native technologists in there talking about all things cloud native, Kubernetes, eBPF, and everything in between. So come and say hello, and I look
1:47 forward to meeting you. I also want to thank Teleport for sponsoring Klustered. This was the easiest decision I've ever had to make. We've been using Teleport since the very beginning. It is a fantastic tool. You're going to see us using it today to pair and share terminal sessions on these rather broken clusters. So check it out, get it running in your cluster, and let us know how you get on. You just use rawkode.live/teleport. It's a UTM link, but it keeps the sponsor happy, so that keeps me happy. Alright. So we had a little bit of
2:14 Introducing the Teams (Control Plane & Learnk8s)
2:16 technical troubles, but we are back in action, and I am gonna introduce our two teams today. So let's pop over here. We have a slightly different layout. We've got two teams of two, which means we can have everybody on the screen at the same time. It's gonna be a whole lot of fun. Let's start at the top left. Jack, do you wanna say hello, introduce yourself, and then we'll just move around the windows, please. Sure. So I'm Jack from Control Plane. I'm an infrastructure engineer. Usually, I'm I'm doing more throwing stuff at clusters. But, yeah, looking forward to to fixing some as
2:22 Meet the Teams
2:51 well. Awesome. Thank you. Hi, everyone. I'm James. I'm also from Control Plane. I'm also used to breaking things and pointing out broken things a bit more, so looking forward to having to fix some stuff up today. Awesome. Thanks for sharing. Hello. I'm Chris. I, I guess, stumbled into some bits of DevOps. I spend most of my time kinda helping digital teams become kinda more efficient and more empowered and enabled and stuff. I'm a developer at heart. But, yeah, ended up kinda looking at DevOps and other buzzword nonsense, which made me stumble into Kubernetes. And I'm currently a teacher at
3:43 Learnk8s. Awesome. Thank you. Hey. My name is Salman Iqbal. And just like Chris, I am also an instructor for Kubernetes at Learnk8s, and I also work as an MLOps engineer for Appvia. I usually break stuff and don't bother fixing it. That's what I do. But today is gonna be a real test for me to try and fix the broken stuff. Alright. Thank you all for introducing yourselves. So we're gonna kick things off with our first cluster. I have the Learnk8s cluster in front of me. So that means team Control Plane, you are up first.
4:09 Control Plane Team: Learnk8s Cluster Challenge Begins
4:19 This is teleport. I am going to click here to open a session on the control plane node. Please do not use that button yourself. There is the activity panel here and active sessions. So make sure you join my session. I'm gonna ask you to type echo hello or echo your name, anything like that to let me know that you're there. And then I will start the timer and we'll get the session underway. Okay. Give you just a moment. So remember active sessions and then join. Got one of you. Is the newest one. Alright. I guess that's it. We've got both
4:57 of you in the chat. In the chat in the session. Alright. I'm gonna start the timer. Feel free to configure your kubeconfig and best of luck. Okey dokey. Yeah. I guess you could use Teleport and a session as a chat. I'm gonna get started by bringing some of my own tools just so that I can debug nice and quick. Doesn't like that. Okay. Never mind. Right. Let's have a look what we're working with. What kind of tools were you going to inject? It was going to be, like, bringing my own
5:17 Preparing Debugging Tools (Nix)
5:40 kubectl just in case I can't trust it. Oh, come on now. It's the first step. You still Oh, no. On. That's the first step. You've definitely watched us before if you don't trust kubectl. I'm just saying. Yeah. Okay. Let's get that going. I mean, let's just have a look at pods. Okey dokey. Oh, look at that. Maybe you shouldn't have trusted it. Oh. Oh, yeah. Damn. Fixed it. Has it already fixed If I Let's see what that error was. I haven't used it. You can't do it as root. You'd have to add a user. Okay.
5:59 Debugging Kubectl Access Issue
6:17 I have a solution for this already. Jack came prepared. This stream is not going to last too long now. Okay. Let's grab this instead. Copy. Okay. Cool. I have nix-portable. Just make sure it works. And then let me just grab my tools. So I'm using Nix to grab a bunch of tools. I could just write a script with wget or whatever. Oh, you're using Nix flakes as well, not just nix-shell? Yes. Nice. So I'll go over the tools that are in there in a bit. But hopefully, once this downloads a bunch of stuff,
7:19 we'll be able to see it a bit clearer. But in theory, if I didn't well, in this I've got kubectl. I've got, like, BusyBox, you know, Rarkess. I've got bat so that I can get some nice colors and a couple other bits and bobs. I do appreciate some colors. You come prepared. Yeah. Exactly. If you're typically debugging your own cluster, you probably don't need to treat it as such an adversary. But yeah, I have seen some pretty nasty stuff during Klustered, so I thought bring your own tools might help. I like it. And yeah, I mean, hopefully that will fix
8:02 this problem. We can also look into kind of what happened there. I think messing with kubectl might be a bit of a theme here. But we'll see. I could have also slimmed this down a bit. I did pull in stuff like Docker as well, which is a bit extra. I think you did the right thing. You'll need that. You'll need that later on. So if I do which kubectl, it's now in the Nix store. That doesn't actually exist on the host. It's because I'm using nix-portable. It's in a different location. But we can skip over that. So if
8:41 I do that flake.nix, we can see the tools. I'm pulling in kubectl, dig Nice. procps, nodectl, yeah, bunch of other stuff. And I thought BusyBox because why even trust ps when you can just bring your own? So if we look at /usr/bin/kubectl, I guess yeah. So it's not executable. I guess we could make it executable. The other thing we can do is We have a Control Plane employee going rogue in this location. We see Lewis popping his head up. Lewis, what are you doing, man? So we can see that they're roughly
9:31 the same size. There might have been some meddling going on. But they are going to be built slightly differently anyway. Are you going to try chmodding it? Yeah. Let's try it. Let's just see what it does. I mean, hopefully, it doesn't, like, exit me out of this session or something. Okay. It looks alright. Okay. No API server. I like that. Let me just use my one, just in case. That's fine. Right. What should we do now? Should we check kubelet quickly? Oh. In a true. K. Let's pop out of this.
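Editor's sketch: the checks discussed here (is the file executable, and does it look like a compiled binary or a script?) could be scripted roughly as below. The helper name `describe_binary` is hypothetical, and the check only inspects the first bytes of the file. It is a quick triage step under those assumptions, not a substitute for comparing a checksum against a known-good release.

```python
import os
import stat


def describe_binary(path):
    """Report basic facts about a file that claims to be a CLI binary."""
    info = os.stat(path)
    with open(path, "rb") as handle:
        head = handle.read(4)

    if head == b"\x7fELF":
        kind = "elf"      # compiled Linux executable
    elif head[:2] == b"#!":
        kind = "script"   # shebang: a shell or other interpreted script
    else:
        kind = "unknown"

    return {
        "size": info.st_size,
        "executable": bool(info.st_mode & stat.S_IXUSR),  # owner execute bit
        "kind": kind,
    }
```

Calling `describe_binary("/usr/bin/kubectl")` on a host would show at a glance whether the execute bit is missing and whether the file is a wrapper script rather than an ELF binary.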
10:21 Checking Kubelet Logs
10:38 Okay. Kubelet. What's it doing? Anything popping out here for you, James? So we've got some pods that are broken. We're dialing 6443, but I'm not sure that's particularly that's maybe expected since the API server isn't up. Yes. Okay. We could look at /etc/kubernetes/manifests. Let's see if we can spot anything amiss in the API server YAML, maybe. Yeah. That sounds like a good thing to try out. Let me just grab my Vim again. David, are we supposed to say hot or cold if they're getting closer, or not? No. If they want a hand, they can
10:58 Examining Static Manifests (/etc/kubernetes/manifests)
11:30 give you a note. Alright. Just feel free to taunt and laugh if you wish. Alright. Draw stuff to loose if you want. That's exactly where I'd That's exactly where I'd I'll maybe save the taunts in the last for the last five minutes if we get anywhere near that. Sounds good. Sounds good. Perhaps, Jack, we could have a little look at the time stamps on the files in that folder to see if I am. Oh, man. A lot of people Willy keeps Willy's saying hot. Let's have a look. Okay. Sixteen. So, yeah, manifest has changed. Let's see if it manifests.
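Editor's sketch: checking which files in a folder were touched recently, as suggested above, might look like this in Python. The function name and the time window are assumptions for illustration. Modification times can be set back deliberately, so an old timestamp does not prove a file is untouched.

```python
import os
import time


def recently_modified(directory, within_seconds, now=None):
    """Return names of files modified within the window, newest first."""
    now = time.time() if now is None else now
    hits = []
    for entry in os.scandir(directory):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if now - mtime <= within_seconds:
                hits.append((mtime, entry.name))
    # Sort by modification time, most recent first.
    return [name for _, name in sorted(hits, reverse=True)]
```

For example, `recently_modified("/etc/kubernetes/manifests", 3600)` would list manifests changed in the last hour.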
11:49 Time Stamps
12:09 We're looking at the kube-controller-manager, kube-scheduler. And as you do. As well. I'm not looking forward to that. Yeah. I love it when people have time stamps from, like, thirty minutes before the show. Yeah. Oh, shit. Okay. I mean There's some stuff modified a few minutes before the show as well. Full disclosure. Sure. Yeah. Start with etcd. Yeah. Can do. The other thing I was thinking is I have a little file here that I can just pop in here. Okay. Now you came prepared. So let's see what the diff is
12:52 Klustered diff
12:57 for the Kubernetes backups. Kubernetes manifests, and then /etc/kubernetes. Let's pop this into that so we can have a look. So let's see. Is there anything odd that just looks pretty normal stuff? Let's have a scroll through. That read only, I think, looks pretty sus. Yeah. The read only is interesting for sure. So which file is that? That's the etcd manifest. Okay. Very nice. Because yeah, I noticed in a lot of these Klustered exercises, it's just a case of what has changed and what has not. Okay. Controller has been changed.
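Editor's sketch: comparing a known-good backup folder against the live manifests, as done here, can be approximated with Python's standard `difflib`. The directory arguments are placeholders, since the backup location is not spelled out in the captions, and `diff_manifests` is a hypothetical helper name.

```python
import difflib
import os


def _read_lines(path):
    """Return the file's lines, or an empty list if the file is missing."""
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def diff_manifests(backup_dir, live_dir):
    """Map each differing file name to its unified diff (backup -> live)."""
    changes = {}
    names = sorted(set(os.listdir(backup_dir)) | set(os.listdir(live_dir)))
    for name in names:
        old = _read_lines(os.path.join(backup_dir, name))
        new = _read_lines(os.path.join(live_dir, name))
        if old != new:
            changes[name] = list(
                difflib.unified_diff(
                    old, new,
                    fromfile="backup/" + name,
                    tofile="live/" + name,
                    lineterm="",
                )
            )
    return changes
```

Files that are identical in both folders are left out of the result, so only the manifests that changed need reading.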
13:46 Controllers
13:59 Learnk8s, are you seeing this being deconstructed right in front of you? Good. It's a learning experience for everyone. That's what we want. Yeah. People to learn about Kubernetes. Now the breaks I'm not familiar with this format. So disable ReplicaSets. Yeah. So with the controllers flag, you can do -replicaset, -replicationcontroller. And it will disable that individual controller in the controller manager. Yeah. That's very cool. I guess if you know that you don't need something, you can just remove it Yeah. To reduce your attack surface. Anything else of interest?
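Editor's sketch: the kube-controller-manager `--controllers` flag takes a comma-separated list in which a leading `-` disables a controller. A small check for disabled entries in a manifest's argument list might look like this. The helper name is hypothetical, and exact controller names vary between Kubernetes versions, so the names in the usage note are examples only.

```python
def disabled_controllers(command_args):
    """Return controller names switched off via a --controllers=... argument."""
    disabled = []
    for arg in command_args:
        if arg.startswith("--controllers="):
            value = arg.split("=", 1)[1]
            for item in value.split(","):
                item = item.strip()
                # A leading "-" means "disable this controller".
                if item.startswith("-"):
                    disabled.append(item[1:])
    return disabled
```

Passing the container's `command` list from the static manifest, for instance `["kube-controller-manager", "--controllers=*,-replicaset"]`, would surface `replicaset` as disabled.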
14:40 I could see that the ReplicaSet controller might be quite useful, though. Yeah. It probably is useful. We might want that, I guess. Right. So let's have a go. So I think the main things there were the read only and the ReplicaSets. In fact, one of the files was touched, but I didn't notice any difference. I wasn't paying a massive amount of attention. It's quite hard to notice these differences. Manifests. So I think there was one in etcd, wasn't there? There was a read only. Yeah. That was the read only you spotted. Okay.
15:16 Identifying etcd ReadOnly Issue in Manifest
15:23 Also, did it open on read only? That's the other thing. Does Vim open where the change happened? It was controller. Yeah. So Vim has opened at where the issue is. It's always funny. Nice and helpful. Yeah. I was looking into that earlier, and the file of interest is, is it .viminfo? It's .viminfo. Yeah. So if we go in .viminfo, we can find some other interesting things that I have edited. We got it. We got it, Jack. Keys to the castle. There we go. Yeah. Gonna a very me, trying to think about
16:13 The big question is whether Jack and James have cleared their .viminfo before shutting down their cluster. I don't think I'd be showing it off. We'd like Okay. So I'm not sure why this file has been opened either. I guess we'll find out. A lot of laughter in the comments and people enjoying the forensic approach for this. Very good. Very good job. Right. I think there's also a kube-vip. It looked like it might have been edited, which we didn't spot earlier. So I would to you to touch that. Yeah. It turns out
16:38 kube-vip
16:48 we broke kube-vip without trying to break it. Caused some issues. So Great tip from Russell there. Start editing all files was said. Mhmm. Does the kubelet just restart itself? The kubelet won't restart itself. No. But it'll restart any static manifests that are modified when it notices the change. Got it. Okay. Or you can force it with the restart. So, I mean, it might be a minute or two for the changes. I guess the other thing we could look into is that other file that changed. So prep. There's a vet. Yeah. Or should we look at the VIP
17:26 Identifying Disabled Controller Manager (ReplicaSet)
17:43 first, actually? Kubernetes manifests. kube-vip. So okay. And that one started at the top, which is interesting, I guess. Backups, Kubernetes, manifests, kube-vip, and then /etc/kubernetes/manifests, kube-vip. And then I'll just pipe that into bat for some Oh, I don't have that. You're not in the shell anymore. Escaped the shell. What have we got here? It's just custom made So, like, sensible differences, yeah. I mean, in theory, it could have been changed, but we'll find out, I guess. Also, if kube-vip was screwed up, I guess we wouldn't be in Teleport right now.
18:43 So the other thing was this file. So this is a file that I'm not familiar with. And, also, I do not have a backup of it, so I can't just cheat. I don't know if you're familiar with this one either, James. But, yeah. I would never have touched this folder. Okay. That looks very fun. I'm not sure we even need this file. So I'm going to can you comment? Maybe you can comment in these files. I, from the name, would hazard a guess that the if-up.d would run on an interface being brought up. So potentially, we might need to undo that in IP
19:27 IP Tables
19:40 tables. Yes. I'm also not very familiar with iptables, so I'm not looking forward to this. Right. Okay. So My yeah. At some stuff. Don't know whether the state would have been saved in the save state file, whether we can restore from that or whether we'll just have to target it manually. So you can run iptables -L to list all the different rules. And would -t nat properly target the NAT table, since that's where it's added? Would that just be the tile? Let me just grep something with a t
20:06 Dash Capital L
20:38 a t. James is right with the -t nat. You don't need to do the grep. Oh. It's -L, space, -t nat. Can I do them in any order? Yeah. Yes. And you might want -n as well, which will give us the rule number. Sounds very low. Thanks, Anne. Oh, maybe not actually. Moment ends. Maybe you don't need the -L or the -t. No. You don't. Okay. This seems a bit noisy. But Yes. Let me just do a clear right. While we're with him. So I think about middle of the screen,
21:27 Cilium
21:29 there's the offending rule in the OUTPUT chain. Mhmm. Oh, yeah. My highlight isn't showing. Yeah. Below the second Cilium comment. Yeah. So I guess we have two options. We can either delete that rule or change the whole cluster so that it's not on that port anymore. Just a quick bit of Google-fu to try and remind myself exactly the iptables incantation. Sounds good. The comments are spot on if you need them. Yep. I think --line-numbers is the flag I was looking for earlier. All of it.
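Editor's sketch: the two iptables invocations under discussion, listing a chain in the NAT table with rule numbers and deleting one rule by number, can be written down as argument lists. The sketch only builds and prints the command lines and does not run them, since that needs root on the affected node. The helper names are hypothetical, and the rule number passed in is whatever the listing reports.

```python
import shlex


def list_rules_argv(table, chain):
    """Command line to list one chain numerically, with rule numbers."""
    return ["iptables", "-t", table, "-L", chain, "-n", "--line-numbers"]


def delete_rule_argv(table, chain, rule_number):
    """Command line to delete a single rule by its position in the chain."""
    if rule_number < 1:
        raise ValueError("iptables rule numbers start at 1")
    return ["iptables", "-t", table, "-D", chain, str(rule_number)]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    print(shlex.join(list_rules_argv("nat", "OUTPUT")))
    print(shlex.join(delete_rule_argv("nat", "OUTPUT", 2)))
```

Deleting by number is narrower than flushing the table with `-F`, which removes every rule in it. Rule numbers shift after each delete, so it is worth listing again before removing a second rule.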
21:51 Locating and Identifying the IP Table Rule
22:09 There we go. Yes. That's the Okay. Roger. We have a number, okay. I'm not familiar with how I would target that. I guess it's on OUTPUT. Right? I think you need the -t nat, and then is it -D for drop, and then the line number? If my very rusty memory serves me correctly. So -t nat. Or you could always take the sledgehammer approach, which is the Rawkode approved way. Flush all the rules. Can I just wipe everything and then wait for it to be rebuilt? You can. I will also point out this is
22:21 -t nat
22:48 a Cilium cluster with no kube-proxy, using eBPF and XDP for traffic routing. So, iptables. Well, I see there's a couple of Cilium things in here. So is that just for compatibility? So Yeah. Just not work together or I'm pretty confident you can make it and this machine will be fine. Valid suggesting this anyway. Flush it. Yes. Yes. Could do. Okay. That would be -F to flush everything. Okay. So should we just wing it? I think I have the command if you just wanted to delete the
23:31 Removing the IP Table Rule
23:32 single one. Yeah. Let's go with the slightly safer option. I don't want the cluster to implode. Then -D. Definitely run out of time. Do I need the NAT thing? Yes. Was it -t nat or -n nat? Yeah. -t nat. Capital D, OUTPUT in caps, and then the rule number, which was two, was it? Yeah. Yes. Okay. Moment of truth. Is it gone? It is gone. Nice. Okay. kubectl get pods. Tada. We are in more business than we were earlier. Guess, well, while we're here This is a nice tool for anyone
24:10 Kubectl Access Restored
24:29 doing their CKAD. You can look at all the short names if you want to do the shorthand for, like, pods and deployments and whatnot. Cool. So we've got the pods there. I guess I should look in Teleport to see if the service is running. In fact, let's just get can I just curl this? You are halfway through. Twenty minutes left. Okey dokey. Okay. So it seems to be running fine. Oh, another thing that's really good is remember the completion stuff. Cool. Do we want to preemptively check for any admission controllers potentially? Or should we just try and
24:43 Verifying Service Connectivity
25:38 and go ahead and edit the I think, yeah, going ahead with an edit sounds fine. I can't see it causing any issues that are non recoverable. kubectl edit deploy, and then cluster. Cluster. Did I misspell it? There you go. And then v one, v two, please. Let me just have a scroll through this, see if there's anything that What? You don't trust them? I can't remember if it was health or healthz. That's something to Yeah. I can never remember if it's health or healthz. It's ClusterFirst, which was the nasty one
25:56 Updating Application Image Version
26:36 from last time, rather than it was on, what was it, Default? I would never have caught that. Yeah. Matt tried to stick that one past me, but I wasn't letting him have that. Okay. Let's get pods. Alright. You want me to test it? Looks like it's restarting. I think so. Yeah. If you could. That would be handy. Moment of truth. Salman looks worried. My worry was over two minutes in. As soon as you downloaded the tools, I was like, this is done. This is done. You got advanced. Well Good job. Done. Good job, John. Done.
27:01 Moment of Truth
27:22 So I can change this. Good job. That's a master class in forensic debugging. That was very good. Thanks. Yeah. I would have been completely stumped there on iptables. I avoid that like the plague. No. No. You did well. I like the approach. You know, you took it slow. You just identified what you wanted to check for history. You wanted to check for .viminfo. You came prepared with backups. Like, you know, that's a great way to do it. Nicely done. Very well done. Alright. Certainly scratching the back of my memory on some of
27:58 those commands. Okay. So this is our first stream of ultra low latency on YouTube, and in the chat I want to run a test to see how low that latency is. So put your fingers on the one key and then I'm gonna say go and you're all gonna press 1. And I'm gonna get that cluster ready just now while I do that. All right, so control plan. Cool. All right, go. That's some more comments. I'm not seeing anyone yet, so it's still not low latency enough. I can hear you on the stream.
28:02 Interlude: YouTube Latency Test
28:44 They've got on the TV. It's not that far. Yeah. There's not much latency to be honest. Yeah. It's not too bad. Alright. Thank you all. Alright. Let's jump back over to our screen share. So I have the Control Plane cluster here now. So if the Learnk8s team, Salman and Chris, if you can jump in and join this session, remember to use active sessions and join. Type echo hello, echo your name, anything you wish. Let's make sure they're all on the same page and then we'll get this thing started. Thank you to everyone who typed a 1. I
28:58 Learnk8s Team: Control Plane Cluster Challenge Begins
29:19 think it is a lot better, definitely. Of course, Lewis comes in twelve minutes later. Nice one, mate. Am I in the right one? I'm not sure if I see nope. Just me and Chris. Sessions. I'm in active sessions and the only active session I see is Learnk8s. You're in the wrong cluster. Join the Control Plane cluster. Teleport control plane dot cluster dot Yeah. Oh, right. There you go. We're good. Give a second. I'm gonna start your timer now, Salman, for being so unprepared. That's what's gonna happen. It's not gonna make any difference. I'll just
30:02 throw it away. I was just like, Jack, you got this. I gotta do authenticator now real quick. We're now about to demo how to absolutely not prepare for one of these and fail miserably. No. You're just gonna do great. Like, I host the Kubernetes office hours and we share your Learnk8s debugging image a lot for people. So we got a lot of confidence in you and we're looking forward to you fixing this cluster. Uh-huh. Because they prepared to be disappointed. I'll put that warning out when that Yeah. For some reason, I can't log in
30:39 controlplanecluster.live. Different login. It's the other one. Did you register for both? No. I didn't register for both. Anyway. Chris is doing the typing. You can do the navigating. I'm gonna start your timer now. So set up your kubeconfig and let's check for our control plane and best of luck. Cool. Where is it? Yeah. Just no ls. Cool. Okay. So is that no ls or no Well, that's weird because ls should be a built in into the shell. So that's a nice Yeah. But I've got no built in things. You have echo? I have echo. I have a thing,
31:53 and I can see that I'm maybe in bash. No. Come on bash. Docker. Interesting. They know. You know, they're not gonna get kubectl. You're not gonna leave me that, are you? So it looks like you're missing a lot of things. So how does the shell find binaries? By PATH. Yes. There we go. Ah, fine. Okay. Fine. So, yeah, my PATH is messed up. Thanks. Cool. Thank you. Yeah. Interesting. Okay. So You could echo PATH and fix it. Yeah. What would it normally be? Maybe there's something like just in case there's anything in there.
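Editor's sketch: the PATH lookup being described works roughly like the function below, which walks each directory in order and returns the first executable match. It is a simplified model under stated assumptions, not bash's actual implementation: it skips empty PATH entries, which POSIX shells treat as the current directory. It also shows why `echo` kept working here: `echo` is a shell builtin, while `ls` is normally an external program that has to be found via PATH.

```python
import os


def find_in_path(command, path_value):
    """Mimic a shell's PATH search: the first executable match wins."""
    if not path_value:
        return None  # empty or unset PATH: no external command is found
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # simplification: ignore empty entries
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With an empty PATH value, `find_in_path("ls", "")` returns `None`, which matches the missing-commands symptom in this session. The standard library's `shutil.which` does the same job for real use.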
32:19 How does the shell find binaries
32:52 You trust what's in there? No. I think we're not going. Oh, have you done something? Yeah. I trust. There you go. Trust pretty much anything in the path. There's only one thing. No. The absence of a path. I can trust that. Cool. You're talking about. Yeah. There you go. Guess the other thing you could think about is, like, how would we have removed the path? Oh, okay. Fine. Of course, fine. Ben. It's bash, isn't it? Try. It's the easiest way to restart the session, where I can source this. Yeah. You can source It will work, because
33:10 How would we remove the path
33:54 we we put the original path at the top of the file as well. Oh, okay. Cool. You should add it back in. Great. Fine. Cool. Thank you. Okay. Fine. First stumbling block. Okay. So I can now do Have I still got my Kube config? Yep. Fine. So more, etcetera. So that looks reasonably unsuspect. There's there's fixing piece, Lewis. Come on. So pod's currently in a state where it's got many Many. Readiness. Well, it got elite readiness, I guess. Yes. And a terminating state. So this is the binary I wouldn't trust. Gonna give you a minute, Jack. CTL.
35:04 I wouldn't trust the binary
35:11 I'd say this is the binary I wouldn't trust. kubectl binary. Interesting. Okay. So we had all sorts of aspirations of a bad kubectl and never got around to doing it. So kudos for actually doing the thing that we planned to do. To be fair, we did want to do something like changing the code, but it was way too difficult. So it did sound very, very crude. Yeah. I had a look at changing the code. I was like, well, not for now. Later on. Should we just get a new should we just download
35:49 a new one and QTTR. Stick there. Then judging by what, like, just the clue that we just got, I'm wondering whether that's yes. Only smalls is probably a shell script that's aliasing to it being somewhere else, which is what I was gonna do for us, and then we didn't even get to do that. There you go. There you go. I'm not so I can Just get rid of that. Right. Now what does it look? So that's a real kubectl this time. Interesting. Cool. Okay. And it's got zero red. Yeah. Should we just delete this deployment and
36:51 create a run first? We could just do, like, a run. I don't know. It's a BusyBox. Yeah. Search Oh, it's --image. Uh-oh. Yeah. This isn't meant to happen. This is me not knowing how jsPolicy works very well. But it would be blocking. Yeah. I would have an admit there's an admission controller knocking about. It's called jsPolicy, so you do something. So Files look so if you've changed them, you've moved the dates back. So what was it? jsPolicy is somewhere doing something. There is no docker on these machines.
37:16 js policy
38:19 That's oh, yeah. Brain fart. Right. Okay. So jsPolicy was the thing that you tried to put in, but that's gonna crash back a little bit. Yeah. Sometimes it works. Sometimes it doesn't. We're not sure why. Haven't used it enough to judge. The fix should be the same, though. Yeah. Yeah. The fix in both scenarios is the same. Can also Brain file. I have to look at the mutating webhooks. You can use API resources to get a list of the resources on the cluster. I'll do the same source. The oh, what's that? kubectl
38:32 webhooks
39:34 completion bash, was that? That's what it was. It might be a caret instead of the dollar. Yeah. Less than symbol instead of dollar. Oh, okay. Yeah. Oh, yeah. No. Caret. You still got the Right? No space. There's not space at all. Yeah. No space. Okay. It's fine. kubectl get it's probably a validating one. Right? So validating webhook. Oh, you're gonna go for a fire hammer approach? Yeah. Delete it. Yeah. I mean, why not? What's the worst that can happen? Right? Go to the mall. Yeah. Give it a go. Okay. And it didn't get recreated by some
40:38 magic. I'm surprised that worked, actually. If it was I'll take that better luck then. That's absolutely fine. What was the other one? And it was a mutating webhook. Yeah. If I can type mutating, which there isn't any they're not namespaced, are they? Okay. So now what have we got? Status don't hold. We've now got a Postgres that's just come to life. What was my desktop? So what's the honk? Because I'm guessing it's a status we're getting honked. So is there something writing in etcd? The status is honk. We haven't bothered with it. We haven't
41:46 bothered with it. Okay. Good. I'm much too scared to go in there. API server. Perhaps they've modified the code in the API server to return honk. It's back fail restarting thing. You can have a look at the processes to see if there's anything suggestive. But it is but it is actually right? So Yeah. That's true. Alright. Let's just optimistically change the thing and then get rid of the readiness probe just to make it so that it's ready. Once you do this, then you wanna should we have a look at the API server config?
42:11 fail restart
42:56 Yeah. Sure. I'm just gonna push in that the thing does work. So I'm gonna get rid of the Yeah. Readiness and the liveness because I'll assume that we've done an excellent job and that that container's absolutely fine, and it never needs to be checked. Okay. So we've got another one now. But what's responsible for writing the status on a pod? Might have a bit better luck if you watch what's happening to the status. Updating endpoint slices. Node. Not. What's that? Yeah. James set this one up, and it doesn't show up in
43:29 update endpoint slices
44:26 events either, which is quite brutal. What's he looking for in the process list? Anything? Are there any keywords you can think of that might be worth grepping for? Grep for honk. Yeah. Lowercase might work. Hold on. It's very good. Okay. Very good. That's almost a hacker's reference. Right? What have you done? And where is that? Oh, it's in a container somewhere, presumably. What container is that gonna be in? Presumably the API server. Right? Yeah. Probably. Or the kubelet, maybe. The kubelet on the nodes might be sending that back.
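The "grep for honk, lowercase might work" step is a case-insensitive search of the process list. A minimal Python sketch of that idea, assuming the output of `ps -eo pid,ppid,args` has already been captured as text. The sample processes and the script path are hypothetical and were not shown on the stream.

```python
def grep_processes(ps_output, keyword):
    """Return process lines that contain keyword, ignoring case (like grep -i)."""
    needle = keyword.lower()
    lines = ps_output.strip().splitlines()
    # Skip the header row printed by ps, then match case-insensitively.
    return [line for line in lines[1:] if needle in line.lower()]


# Hypothetical sample of `ps -eo pid,ppid,args` output.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 /sbin/init
  812     1 /usr/bin/containerd
 4242   900 /bin/sh /opt/Honk.sh
"""

matches = grep_processes(SAMPLE, "honk")
print(matches)
```

Lowercasing both sides is why the search still finds a process whose name is capitalised differently from the keyword.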
44:44 process list
45:59 You can also see the parent of the process. How did you do that? I'm trying to It's a Try -ef. -ef. And then it's the what was the thing? It's that one. Oh, there's a funny namespace somewhere. So that one, and there was a container in an angry geese. containerd, but it's not running actually in there. So what are you thinking just now? I'll try to get rid of that jsPolicy thing that's not helping us anyway. But there's a container namespace of angry geese, which seems suspect. I did that ps
47:04 angry geese
47:23 -ef, but that's giving me, like, a load of things, not just the tree. I think I did ps -ef, didn't I? Yeah. I think you're doing it with a what was the number? That was the number of the process that we cared about. Right? That opts honk the planet. So I wanted to see the parent of that process. It was this one. Right? Can you just do ps -ef on its own? You should get a tree. I could be mistaken. Oh, you need a So I think you've got two avenues right
48:14 finding containers
48:16 now. You're suspecting that the file does not exist on the host. You're pretty much looking for a container that's running it. So I mean, you can run a kubectl get pods all and look for random container counts in a pod, or you're gonna have to use crictl to list containers that are running outside of Kubernetes. But if you're happy that that looks normal, then you're gonna wanna start looking for containers that are not in the Kubernetes namespace. Yeah. Thanks, Noah. It was ps faux for the tree. Yeah. faux is really good.
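Finding "the parent of that process" is what the tree view is for: walking up from a suspicious PID shows which shim, and therefore which containerd namespace, launched it. A rough Python sketch of that walk, assuming captured `ps -eo pid,ppid,args` output. The PIDs, the `example-ns` namespace, and the script path are placeholders, not values from the stream.

```python
def parse_ps(ps_output):
    """Map pid -> (ppid, command) from `ps -eo pid,ppid,args` text."""
    table = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, command = line.split(None, 2)
        table[int(pid)] = (int(ppid), command)
    return table


def ancestry(ps_output, pid):
    """Return the commands from pid up to the root process, child first."""
    table = parse_ps(ps_output)
    chain, seen = [], set()
    while pid in table and pid not in seen:  # 'seen' guards against cycles
        seen.add(pid)
        ppid, command = table[pid]
        chain.append(command)
        pid = ppid
    return chain


# Hypothetical sample; the shim line mimics how a containerd shim is started.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 /sbin/init
  900     1 /usr/bin/containerd-shim-runc-v2 -namespace example-ns -id abc123
 4242   900 /bin/sh /opt/example.sh
"""

chain = ancestry(SAMPLE, 4242)
for command in chain:
    print(command)
```

In this sample the second entry is the shim, and its `-namespace` argument is the hint that the process lives outside the namespace Kubernetes uses.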
48:56 So it might not help me at all, but I'm just gonna delete all the pods and let everything come back in case it wasn't set up as something that was persistent. That sounds like a scary thing to do. Why not? No. I don't know if you delete everything. Don't delete anything from kube-system. Why not? It's just a pod. It will come back. Yeah. Yeah. I mean, no one's telling me not to, that it'll be a really bad idea. What's the container runtime on this? containerd. So you could just crack Should
49:31 we check what containers are running? We can indeed. There's also crictl. I'm guessing they're just running. I'm guessing you started that container using the runtime. Yes. So let me add the flag for you just because I do know it off the Yeah. I've never used crictl. Oops. Sorry. Oops. Sorry. Sorry. Run. Right. Look at that. Ah, okay. Cool. Okay. So can we check what containers are running? You can also use ctr and do a ctr ps as well. So it depends whether you want to look at it from Kubernetes context or from
50:19 a containerd context. You could do. It's ls. Is it? c ls. So this container, c ls. Okay. Yeah. ls. Right. So But it's in that angry geese namespace. Right? So Yes. So you can add a -n to that. Yeah. What was it? Angry geese, which we saw. Yeah. You need to put the -n before Before the c, I think. Yeah. Okay. Yeah. It's a bit of a pain. There we go. Cool. Okay. That'll do it. So that's the thing I wanna get rid of. So what was the that's ls. So what's
51:11 the You can do a c rm. c rm. Okay. Start by that. Then It's a space. c space rm and then the ID. Okay. You must have been the task first. --f. Or -f. Sorry. Or --force. No? Maybe you might have to We're all showing our containerd ignorance here. You might have to stop the task first. So if you do t ls, t space ls. Okay. So say again. Say t space ls. ctr Yeah. t for task. Like that? Space ls. Yeah. Yeah. And then Okay.
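The ctr steps being worked out here (namespace flag before the subcommand, list containers and tasks, stop the task, then remove task and container) can be written down as a sequence. A minimal Python sketch that only builds the command lines and does not execute anything. `example-ns` and `example-id` are placeholders, since the exact namespace spelling and container ID are not recoverable from the captions.

```python
def ctr(namespace, *args):
    """Build a ctr command line; the -n flag goes before the subcommand."""
    return ["ctr", "-n", namespace, *args]


def removal_plan(namespace, container_id):
    """Commands to inspect, then remove, a task and its container."""
    return [
        ctr(namespace, "containers", "ls"),   # is the container defined?
        ctr(namespace, "tasks", "ls"),        # is its task (process) running?
        ctr(namespace, "tasks", "kill", "-s", "SIGKILL", container_id),
        ctr(namespace, "tasks", "rm", container_id),
        ctr(namespace, "containers", "rm", container_id),
    ]


# Placeholder values, not taken from the stream.
plan = removal_plan("example-ns", "example-id")
for command in plan:
    print(" ".join(command))
```

Each list could be handed to `subprocess.run` on a node where ctr is installed. If something outside ctr restarts the task, as happens later in this session, the sequence alone will not keep it stopped.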
51:49 removing a container
52:14 Now we can remove it. So you need to rm both the task and the container. Okay. Okay. Fine. And you might even need to stop the task first sometimes. I have to shut down the machine. I was just trying to Why nerdctl is much nicer if you have it, but it's Yeah. nerdctl is good. Alright. So what was Stop the task. Right. See, task syntax for that is, like, kill that task. And then rm that task. I think the kill should hopefully have done that. You'll now need to do a container
53:07 ls to see if the container still exists, and if so, you'll need to remove that too. Yeah. But my task is still running even though I've killed it. Yeah. So now you should be able to remove the container. Oh, wait. It is still running. I mean, the task is still running. You sometimes have to do kill -s SIGKILL. Then yeah. Alright. Okay. Stop. It is. It's back running again. Fucker. Yeah. Can you I would try a t rm -f. Must be a force flag on that task or task delete. Well, we saw it stop momentarily. So can
53:54 string commands
54:04 I just, like, string the commands along each other? Is there some delay between them? There isn't an extra thing that we made you do here, by the way. You should just be able to kill, rm, and then rm the container. It's a work Always. Restarting it. You've done some other accidental magic. We'll check the parameters on t rm. There must be a way to force remove this thing. Yeah. Fine. Would you like me to type for a moment? Because I'm really curious. Yeah. I was gonna do that or just I was gonna exec into it and just stop
54:53 it doing whatever it's doing. I just want to do an rm -f and that's gone now. What was doing? I'm sure that's what I did at some point. You need to check for the container. So it's c ls, and then I want to see our Stop on that one, I think. Because who needs consistent contacts? No? Yeah. And You can't do c rm, I think. Oh, yeah. We were doing c rm earlier. I did rm. That's how I'm thinking because it wasn't stopped. Is there any parameters to rm? You might need to do a dash dash
55:44 tls
55:54 help. Yeah. It doesn't look like that. No, let's just keep snapshots there. Yeah. You're gonna have to do a Why is the container still there? Is the task still there? Let's do a t ls because I'm curious if it's magically come back. Yes. But Right. Okay. If I just exec onto the running thing, and then I can do some other magic and just stop it doing the thing that it's doing, maybe. You want to use nerdctl. It's way more comfortable to use. It's more Docker style. Also not available on the machine right now,
56:40 though. Okay. Unless we downloaded it and put it somewhere sneaky on the machine. You think we're gonna trust your tools? Trust them on the mind. Yeah. You type dot n c t l? Let's just attach onto the thing and quickly if I can see if can do that quickly, and then what was I gonna say? MS is task exec, like, onto that. Presumably, I'm gonna guess what the syntax is. You need a parameter called --exec-pid. Sorry. --exec-id, for reasons I don't understand. And who says containerd is not easy to
57:26 work with. Alright. I'm gonna throw it. Why don't you try a task pause? Like, we don't actually need to get rid of it. We just want to stop it doing its naughty things. Right? Correct. Let's try a t pause. No. That's a good shot. Oh, that board. And then confirm with a yeah. I don't know. Still running. That's not fair. Is there a process that's starting to or That failing, yeah, you can see if you can identify a process that you might be able to kill. Yeah. There's a process that's probably running this. Let me
58:19 check. Any other processes? Grep for maybe angry or geese or honk. That's the one, isn't it? Container edition one c. Mainstream. That doesn't help. Yeah. The shim was always gonna be recreated. Just in the interest of time, how did you just create this task? Because I'm really surprised task rm didn't get rid of it. You got a systemd process running or a rogue script? nerdctl with always restart. Okay. So if you type dot dot n CTL, you'll get nerdctl. And it might be a few fix the problem. You should be able to kill
59:14 it from that a bit more easily. Yeah. It looks like the always kind of thing isn't an official containerd feature. So I think it does some magic to keep itself alive. Sneaky sneaky. Good to know. I don't think we even expected it to stay alive this long. Stay around for a bit. Alright. Is it gone? Is it actually gone? Chris is doing everything. Stop. Kill. rm. Just to make sure. Great. Bye. Bye. Okay. Not a terminate. Oh, you've been terminated for a while, aren't you? So I'm assuming your angry geese was constantly updating the status
1:00:06 of those pods. Mhmm. So run a while true loop. You have that's not good considering you deleted that namespace. jsPolicy. Yeah. You're gonna have to check for those still coming. Like, admission controllers. Yeah. The original intention with the policy engine was to have the admission controller get in the way, but also protect itself from removal. But that didn't quite work. Alright. Ten minutes that you'd have to remove the admission Yep. At all. I think delete them both webhooks. Maybe the --all didn't actually work in the back end. It did appear to work.
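Checking for, and deleting, "both webhooks" refers to the two admission webhook configuration kinds. A small Python sketch that builds the corresponding kubectl command lines without running them. Using `--all` removes every configuration of that kind, which was acceptable in this throwaway cluster but is an assumption that would not hold in a cluster with webhooks you rely on.

```python
# Both kinds are cluster-scoped, so no namespace flag is needed.
WEBHOOK_KINDS = (
    "validatingwebhookconfigurations",
    "mutatingwebhookconfigurations",
)


def webhook_commands(delete=False):
    """Build kubectl commands to list, or delete all, webhook configurations."""
    if delete:
        return [["kubectl", "delete", kind, "--all"] for kind in WEBHOOK_KINDS]
    return [["kubectl", "get", kind] for kind in WEBHOOK_KINDS]


# List first, then delete; re-running the list afterwards shows whether
# anything has recreated the configurations.
for command in webhook_commands() + webhook_commands(delete=True):
    print(" ".join(command))
```

Because a controller such as the policy engine's deployment can recreate its webhook configurations, repeating the list step after a short wait is the quickest way to see whether the deletion held.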
1:01:32 Yeah. Alright. I'll just do your busy bots again. And just make sure that jsPolicy is gone because it might be recreating this. Oh, yeah. Definitely. It will be. But that doesn't matter. Alright. So now I've got a thing that looks like it's gonna start. Yeah. I think the creation is because your you know, tag change. Yes. The Klustered is up. Postgres, is it coming up? Yeah. That's good. No. Let's check it. I think it's checked. It was up and running, both of them, and just to get bugs. K. Alright. Yeah. Cool.
1:02:35 I mean, I could optimistically ask you to check it, but I'm That's still version one. Right? That do you wanna No. We changed it. I think, don't we? We changed it already? Yeah. Well, let's check. Let's check. Let's check. Yeah. Go on then. Go for it. It's definitely not that. That's gonna be something else. Oh, failed to connect the database. Authentication failed. Okay. But it's got that far. So the thing works, so I just now can't connect to the database. Well, the database credentials are all hard coded and configured in the Postgres StatefulSet.
1:03:13 Okay. But it was an old failure. That's what you said. Right? Was an old failure. Yeah. Connectivity looked good. So just the password then? K. You've got all edits a stateful set. Yep. The password has been changed. Ah, okay. Where is the password? I believe it's postgres one two three, but I will confirm for you now. I really should know this. Is there anywhere that I would have been able to find it? It's in that text. On the workload. Yeah. So the password should be Postgres PostgresQL one two three. Yep. Okay. So I just had an extra few
1:04:09 characters thrown on the end. Oh, your admission controllers are back. Back. Fine. Admission controller. I'm not gonna care about trying to fix them and change the things that kicks. Obviously, it lets me get away with that quickly before it respawns. You're gonna have to delete jsPolicy pod, deployment, sorry, and the jsPolicy namespace. Oh, yeah. Yeah. I did delete that namespace, didn't I? Yeah. But there's probably a finalizer on the jsPolicy part. I just delete the deployment and hope for the best rather than namespace. No. I'm not gonna bother looking at what it
1:05:05 is. Alright. And then one more thing. Three, two, one. Nice. Okay. Give me a sec. Not healthy yet. Alright. There we go. Let's try refresh. Yeah. Yeah. You got the dash. Oh, it's right. Nice work. Good work. Five minutes to spare. Loads of time. Well done. Well done. Yeah. Nice clusters there. Well done, everyone. Yeah. Good job, guys. Good job. Yeah. Was really good. So that's a really I think the takeaway there for me, I had no idea that nerdctl offered that always restart. Like, that's not behavior I expected to
1:06:07 see from containerd there. So, like, does nerdctl have a long running process that was instigating that? I'm not sure. So we were Go on, Jack. Sorry. I was gonna say, yeah, when you add the dash dash always, I'm not sure how exactly it works. But we noticed, like, it injects itself into the container to, like, capture logs and stuff. So it must do some injection where it, like, adds itself in so it can re-add containers or something. Yeah. A sneaky one. Then Magic. Basically. Magic. That's what I was describing.
1:06:50 And I also love that you never really got the jsPolicy thing doing exactly what you wanted, but it was still causing absolute havoc through the entire thing as well. Yeah. I'm not sure if it can cause kinda more hassle that way or not. But yeah. Alright. Well, yeah. Those are great. Well done, both teams. Good job breaking. Good job fixing. I think we've seen a nice mixture of different stuff there, which is always fun to see. So thank you for taking time out of your week. You know, it's not easy
1:07:21 to break these clusters even if you leave it to the last minute, Solo. Just saying. Just saying. Always. Always. Best by the time that's high aspirations of things we do. But it was a pleasure. It was good watching you's work. We learned a lot today, so thank you for taking time out of your day to join us. Thank you to everyone that watched, and thank you to Teleport for sponsoring. Remember to check out Teleport at Rawkodelive/Teleport. Alright. Any last words from anyone before I let you get back to your evening? No. Thank you very much for having us.
1:07:51 Long time watcher, first time streamer, I guess. You've got you've got Lewis hovering above this. So thank you all. Thanks for having us. Alright. Thanks, everyone. Have a great day. I'll see you all soon. Adios. See you. Bye.