About this video
What You'll Learn
- Troubleshoot Kubernetes node scheduling by tracking taints, memory pressure, and pod readiness when services are failing.
- Pinpoint kubelet cgroup driver mismatch between systemd and cgroupfs and recover networking by restarting control-plane services.
- Investigate rogue self-replicating workloads, isolate the triggering process, and remove spawned namespaces plus replicas safely.
Sid Palas and Arian van Putten each break a cluster for the other to fix. Cluster 31 hides a kubelet cgroup driver mismatch starving Postgres and Cilium, while Cluster 32 spawns replicating "not-a-virus" pods across rogue namespaces.
Jump to a chapter
- 0:00 Holding screen
- 0:55 Introductions
- 1:12 Housekeeping & Call to Action
- 1:38 Guest Introductions
- 3:52 Challenge 1: Sid's Cluster (Cluster 31)
- 3:55 Cluster 31 by Arian van Putten (@ProgrammerDude)
- 5:33 Initial Cluster State: Pods Unhealthy
- 5:46 Diagnosing Postgres Scheduling Issue
- 7:04 Examining Worker Node Status
- 7:27 Worker Node: Memory and Kubelet Config
- 11:41 Attempting to Remove Node Taint
- 13:41 Taint Reappears
- 14:50 Discovering NGINX Pod Evictions
- 15:34 Deleting Rogue NGINX Deployment
- 16:40 Taint Persistence & Webhook Check
- 23:51 Kubelet Cgroup Driver Issue Identified
- 32:37 Attempting to Fix Cgroup Driver Issue
- 39:16 Checking Networking & Port Forward
- 40:20 Restarting Cilium Pods
- 42:02 Restarting Control Plane Kubelet
- 45:01 Verifying Kubelet Fixes & Status
- 46:50 Confirming Service and Networking
- 48:07 Upgrading Clustered App to V2
- 48:48 Verifying V2 Upgrade (Success!)
- 49:50 Explanation of Cluster 1 Issues
- 53:00 Cluster 32 by Sid Palas (@sidpalas)
- 53:05 Transition to Challenge 2: Arian's Cluster
- 53:25 Initial Cluster State: Clustered Pod Missing
- 53:52 Diagnosing and Fixing Replica Count
- 55:05 "Not a Virus" Pods Appear & Replicate
- 56:47 Investigating Source of Replication
- 1:00:03 Examining Rogue Pod Details & API Access
- 1:01:40 Analyzing Rogue Namespaces by Age
- 1:05:32 Re-examining Clustered Pod as Trigger
- 1:11:51 Searching for Rogue Process on Nodes
- 1:26:04 Accessing the Rogue Container
- 1:30:40 Examining Rogue Container Files (README)
- 1:32:01 Stopping the Rogue Container (Source of Replication)
- 1:32:24 Cleaning Up Rogue Namespaces
- 1:34:16 Accelerating Cleanup and Redeploying Workloads
- 1:38:50 Adding Tolerations & Redeploying Clustered/Postgres
- 1:41:03 Upgrading Clustered App to V2 (Success)
- 1:41:43 Explanation of Cluster 2 Brokenness & Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:55 Introductions
0:59 Hello and welcome to today's episode of Clustered. This is episode 14, and I have two fantastic guests and two very broken clusters for us to fix today. Before we get started, there's just a little bit of housekeeping. First and foremost, if you haven't pressed that subscribe button, you can do that now. Also remember to thumbs up slash like the video. This will help other people find it. It's always a good thing. And if you wish, you can now join the Rawkode Academy at a sandbox level for a very tiny nominal fee, which will get you some cool emojis and private live
1:12 Housekeeping & Call to Action
1:35 streams coming up soon. Today, we have Arian and Sid joining us. They have both prepared one broken cluster each and will join me as we attempt to pair and fix the other's. Hi there, Arian. Hi there, Sid. How are you both today? Doing well. Yeah. Doing just fine. It's a bit warm, but... That's not a problem I have in Scotland today. It's been raining and windy all day. One second. I have to take a phone call. Please don't put up. Sorry. Yeah. Alright. Well, while Arian takes the phone call, Sid, do you wanna introduce yourself and tell us
1:38 Guest Introductions
2:16 a little bit about you, please? Sure. My name is Sid Palas. I am a DevOps consultant. I split time between helping startups improve their DevOps and cloud infrastructure setups and also making content for my YouTube channel and my blog. So I have a YouTube channel called DevOps Directive where I cover a variety of topics from Docker to Kubernetes and all sorts of things. So go check that out if you have time. Most of my Kubernetes experience is with managed clusters. So depending where Arian breaks things, that could be useful or it could be completely worthless.
2:50 So we will find out once we start poking around. Nice. Yeah. I definitely encourage people to check out DevOps Directive. It's a fantastic YouTube channel. Lots of great content there. Hello, Arian. Would you like to give us a quick introduction, if that's alright? And then we'll get started. Yeah. Sure. So I'm Arian. In my spare time, I work on NixOS, which is a declarative operating system, which has a lot of cutting points with Kubernetes because you all like declarative and immutable things as well. And for work, I work at Wire.
3:28 We do end-to-end encrypted chat and video communication, and I maintain Kubernetes clusters here and do operations and break things very often as well. Occasionally, I also write some code, but lately, not so much. Yeah. That's a quick introduction of myself, I would say. Yeah. Awesome. Thank you very much, both of you, for sharing. So we are gonna start today with cluster 31. Sid, that means you and I are up first. I have enabled the head We can see Teleport, which is our pairing tool of choice. I will open a session here. Sid, if
3:55 Cluster 31 by Arian van Putten (@ProgrammerDude)
4:08 you can do me a favor. Once you're in, just type some sort of echo hello to let me know that you're there. We should also see a nice little to show up here. Arian, you can just sit and watch and prepare the laughter. If you got a beer, you actually have a beer. Yeah. I told you I would get a beer. I am bitterly unprepared and jealous, but enjoy it, mate. Okay. So I will do a little bit of setup first. These are kubeadm clusters. That means our... Okay. Configuration lives here. I will alias k to kubectl, otherwise
4:43 you will just be watching me type k and that failing many, many times. And Sid, I will give you the honors of running the first command to see if we have any sort of API server available to us. Hey. We did. Look at that. Thank you for help. Didn't break everything. So I'm proud of myself. Okay. The API server is good. So let me clarify the objective then, when we have an API server. Right? It's sometimes it can be a little bit fussy. Our objective now can be a little bit confusing. Our objective now
5:18 is to make sure that our clustered and PostgreSQL pods are running and happy, and also to be able to upgrade our clustered pod from v1 to v2. That is our objective. Alright. So you have run a get pods and we can already see Postgres is definitely not happy. And clustered has restarted a few times. It has. Yes. Alright. Let's check out what's going on with Postgres. Okay. Okay. So we have a taint on it that it won't tolerate. Okay. So yeah, it's unable to be scheduled right now. It looks like our worker node
5:46 Diagnosing Postgres Scheduling Issue
6:10 has memory pressure and obviously we don't tolerate running on the control plane node. Yep. Yep. What are you thinking? What do we need to look at here? Yeah. I was just gonna scroll through this config and see what we've got. Well, I'm curious about this memory pressure label on the node. Yeah. Should we describe the node and see what that looks like? Yeah. Good shout. I think that's a good idea. What I've got in my head here is that Arian can just be cruel and just add a label to mess with us or he's
6:55 maybe actually screwed with the memory available to the... Yeah. To the kubelet. I see that smirk on your face there. I have a very bad poker face. I wish I turned off the webcam. There we go. Look at that. Okay. Eviction threshold met. Attempting to reclaim memory. I think we're just gonna have to jump on to that worker node. Yep. In the memory here, nice big number, which is what I wanna see. So I will open a session on the worker. Just feel free to jump into it, let me know when you're there. Okay. And then we'll start to look around
7:27 Worker Node: Memory and Kubelet Config
7:43 and see what's happening. What kind of beer you got? We got, wait. No. A Rothaus, which is a pilsner. I like it a lot. From the Black Forest in Germany. Ah, nice. Nice German beer. Okay. I'm getting a... oh, hold on. I had a background window open with the YouTube video. I started getting a delayed loop, but now I'm good. I've done that so many times. Alright. I'm gonna run my favorite memory command. So this looks good to me, which means... so my thought process here is that Arian has
8:45 modified, it seems. I don't know. I think he's modified the systemd unit file for the kubelet and its memory constraint. Is that a lane where you... if you've got any other ideas, you wanna throw them in? That seems like a reasonable first place to look. Okay. Well, there's nothing there. Or should we cat the unit file? Get all the actual... what do you call them in systemd? Drop-ins or is it something else? Yeah, drop-ins. Drop-ins. Alright. So systemctl cat kubelet. This should, I believe, show us the entire config and then we
9:40 can look for files that could potentially have other stuff in them. Let me know if you get any ideas. Otherwise, I'm just gonna be looking aimlessly. Poking around. There is a way, right, where you can see the niceness level of a process on Linux? I think there is. Like, let's see. That's to run a command with a nice value. Could be... Alright. Let's just start looking at the files listed in this unit. So we've got this one. That's the one it's showing us. So it won't be the bootstrap. Could be var lib kubelet config and var
10:47 lib kubelet flags. Those are the two I think we're gonna wanna look into. Are they clean to you? Yeah. Config. And there we go, the kubelet configuration. Yeah. Don't think there'll be anything in there. Hey. Your tongue's dead. I'll sit back. I'll watch. Did we describe the node on the control plane? We did describe the worker node. Not on the control plane. We only did it for the worker node. So we could describe that one as well. Yeah. I wonder, am I over... I mean, he could just have added the
11:41 Attempting to Remove Node Taint
11:58 label. He might not have actually done anything to the memory on the machine. Right? True. That's worth checking. So should we hop back to that other one? Yeah. Do you wanna... Or is this one also set up? Oh, no. Just the... Okay. So why don't we just do a... Yep. Do you wanna just edit the node and remove the label, see what happens? Sure. It's a good start. Come on. Editing the service files is way too easy. Yeah. What was even easier would just be adding a label. I keep looking at him to see if
12:38 he's gonna give anything away but I've lost you there. I'm practicing my poker face. So... Alright. So we do not have a memory constraint label there. Yeah. Maybe on the worker. Oh, I'm hinting things again. Oh, yeah. But what was your edit command? Sorry. I missed it. Oh, no. We're stuck in Vim. Okay. What? What? Oh, let me out. Let me out. Oh, it's there. You'll need to do q exclamation mark exclamation mark. Yeah. Otherwise, you'll run it. Otherwise. Oh, we looked at the worker. Did we not? Oh, no. You edited the control plane. Yeah. Okay.
13:31 That'll be why we couldn't see the label. Not there? No. Here we go. Oh, there's a taint. There we go. Yeah. Let's... We just remove that entirely? Yeah. Let's do it. You can just remove that. Yeah. Okay. Just bin it. You might wanna just create Postgres again. Yeah. When did that... nine seconds. So let's see if that taint magically reappeared. So how is that getting added? So what if I... I would sort of stuff out all the time. If you got... I don't... yeah. That's a good shout. So let's also check all namespaces, I guess,
14:50 Discovering NGINX Pod Evictions
14:57 in case he hid it somewhere. Oh, okay. Have a namespace. Alright. We've got a bunch of NGINX pods that are getting evicted. Stop laughing. Okay. Sorry. No. I'm allowed to laugh. Right? That's the whole point. Yeah. That is the whole point. Let's see what these are doing. We have this service. Let me just try deleting that service and see what happens. Alright. The deployment, not the service. It was kinda nice that the load balancer service type just works on the... was cool. Yeah. These clusters are currently running MetalLB, and I'm in the process of migrating
15:34 Deleting Rogue NGINX Deployment
15:55 over to. But yeah. But I did give you a broken CSI implementation. So... There's also... I heard that Cilium now has part of the MetalLB code base integrated. Yeah. They did some work with BGP. I saw that announcement last week. So let's just... actually, why don't I just delete that whole namespace? We probably don't need it. Yeah. Yeah. I don't trust that namespace. I'm happy for you to blow it away. That's it. Arian. Okay. Takes a little while. Now let's go back and try to edit that taint off the node
16:40 Taint Persistence & Webhook Check
16:47 in case that pod was adding it back. It may still be there. It may still be there. That's true. It is still there. Okay. So we've we've cleaned up something, but we don't know what that something was actually doing. Do you mind if I type for a moment? I'm just Yeah. Go for it. Curious. Is that if okay. Yeah. That does look like Yeah. It it did indeed get get cleaned up. Is there something else that could intercept our node config before it gets used or after it gets used, I suppose? Some kind of Yeah. Well Mutating
17:46 webhook or... Yeah. Mutating webhooks are always... oh. I see. Oh, no. I thought we were getting a broken API server there. Damn. Lots of issues with Rook. Yeah. I'm gonna filter that out. Look how slow it is when I get pods across all namespaces. It's like... Right. There's probably a lot of stuff somewhere. Yep. I was kinda surprised, like, once it is evicted, that it doesn't clean up that pod reference. Like, it just keeps accumulating. Let's see. And what is Hubble UI? Is that just an interface for Cilium? Yeah. It's a really cool UI that shows you network
18:40 traffic across your nodes and services and such. It's pretty sweet. I just wanna see it running. Come on. Anything. Show me anything that's running in my cluster. I don't care about pending and evicted right now. Okay. So, we got speakers, we got some Rook stuff, we got our control plane. I just wanna know, yeah, there's nothing to... Nothing that looks nefarious here. Of course you could have swapped out any of these images, but that would tell me... I think I like your idea instead. Let's check for mutating webhooks. Let's go down that path first before we
19:16 start checking the nodes for rogue processes. Got a tip from Nolan in the chat as well that there could be an eviction-hard set on the kubelet. How do we actually view these? Mutating webhook configurations. No resources found. Only... yeah. Cool. Still no resources found. Okay. Should we jump back across to the worker? Sure. Start looking first. Is there anything else? No. No. I think that's all I can think of on this side of things. Okay. Because our edit to the node is going through successfully. So it's not
20:14 like a validating webhook could be the case because otherwise, we'd get an error back if it was just rejecting it. Nothing no static pods. So what happened there? And that weird reload debug. I'm just scrolling. Alright. I thought I was gonna drop out our config. What do you want to do? The view the entire config at once or I did want to view the entire config at once. Yes. Alright. I think let let me tell you where I'm at. Right now, I'm at the stumps, to be honest. I think we go through this config file for the kubelet and we
21:28 see if there's anything restricting it here. I'm also curious if he's just been really cheeky and scaled up a whole bunch of those deployments to a stupid number causing the evictions, because those have all got requests on them. Right? It could just be adding up. All those, like, system and Cilium and Rook deployments we... We could check the requests on the nodes, see if they are larger than the capacity, for example. Right? Yeah. Yeah. That looks okay. If you want hints, you can always... you can have two in total, probably. Don't know. Go back to your beer. Go away.
22:20 I mean, we could take the thing you just gave us and actually check those requests on the nodes and see if there's more requests than... Yeah. Available. I mean, the node isn't breaking us wet. Right? Yeah. So, it does feel like this is all happening on the... I didn't see anything in the kubelet config there. So, yeah. Let's jump back to the control plane. There has to be something on the API. I need to speak with someone at Teleport, that bug's annoying. So, where do you wanna check for requests? Of course. In fact, did
22:59 we describe them? Feeling so scarpered today. Don't know what that is. We did describe the node. Right? We did. And there was no artificial request or limits on that. I don't think so. Yes. I don't know if we described the control plane node, but we described the worker node. Yeah. I think the worker node is this one. So, k. So we got the eviction threshold met here. It has sufficient memory, and that's happening a lot. Trying to reclaim memory from the eviction here. We can see that we've actually only got 1% of requests and 1% of limits.
23:41 Yeah. Well, the eviction threshold has to be... that has to be kubelet configuration. Right? There is a hint in the kubelet config. There is. Maybe. Yeah. Yes. There is something. Because we're not memory constrained at all. Which means that it just feels like it's kubelet config. Yeah. And it's not getting evicted based on too much memory use. It's just never getting scheduled. So it's not something in the Postgres stateful set config. Could be something else. But there are also evicted pods. Right? There's a pending pod, but some pods are also getting actively evicted.
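The reasoning at this point (memory pressure reported while only about 1% of memory is requested) can be sketched as a small triage check. This is an illustrative sketch only: `NodeSnapshot`, `likely_cause`, and the 80% threshold are hypothetical names and values, and the numbers would be read by hand from `kubectl describe node`.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical summary of values read off `kubectl describe node`."""
    memory_pressure: bool   # state of the MemoryPressure condition
    requests_pct: float     # memory requests as a percentage of allocatable
    limits_pct: float       # memory limits as a percentage of allocatable


def likely_cause(node: NodeSnapshot, busy_threshold_pct: float = 80.0) -> str:
    """Rough triage: do scheduled workloads explain the memory pressure?"""
    if not node.memory_pressure:
        return "no memory pressure reported"
    if max(node.requests_pct, node.limits_pct) >= busy_threshold_pct:
        return "workloads plausibly exhaust the node"
    # Pressure without matching requests points below Kubernetes:
    # kubelet eviction settings or a limit applied at the cgroup level.
    return "pressure not explained by requests; inspect kubelet and cgroup limits"


# The worker as described in the session: pressure, yet ~1% requested.
worker = NodeSnapshot(memory_pressure=True, requests_pct=1.0, limits_pct=1.0)
print(likely_cause(worker))
```

A check like this only narrows the search; it does not say which kubelet or cgroup setting is responsible.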
23:51 Kubelet Cgroup Driver Issue Identified
24:33 True. So I guess we're going back to the worker node then. Okay. Do you wanna pop open the kubelet config again? Maybe. Didn't mean to do that. He told us there might be a hint in here. Right? There's only 38 lines. Would there be an easy way to figure out what kubeadm usually spits out? Yeah. Could you grab a healthy config and then we could diff them maybe, or at least take a look? I think we can just run to generate one. I'm looking at all those zero seconds, and I'm thinking, are those a thing or am I making
26:07 too much of that at the end? No. That's too much. Those are there by default. I only appended a line to the config. It's at the end. Let's... we'll give this a second. Cgroup driver line? Yes. Alright. So we just ask to generate one and we'll just do a diff in there. We'll see what happens. Sure. I don't know the command. I should not verify now. Let's have a look. But I'm pretty sure we can do a... That would be... there's some way to run the individual phases
27:04 or something. Yeah. There is. I think it's under init, and then you can pass in. kubeadm diff. Is that a thing? Somebody says that kubeadm diff is a thing. That would be cool. Is it? Everything else is alphabetical except for cgroup driver. This is... This is a... There is a thing. I just don't remember off the top of my head. So... But this is not a control plane node. Right? So don't we need to join? But... And there we go. kubelet-finalize. Yeah. I think kubelet-start puts in
27:53 the bootstrap config, and kubelet-finalize does the... Did I spell that wrong? Yeah. Alright. Don't know if that's gonna run it. Do you want me to tell you the diff instead? Yeah. I mean, for the sake of time, it seems like... Yep. And you've already hinted that it's that last line. We just... Yep. So yeah. There are two things I changed in this config. One is... I love systemd, so I wanted to try out the systemd cgroup driver. And I also changed the resolv.conf
28:53 option. But so those two I touched. The only two I touched. Apart from that, it's a completely vanilla config. Alright. Yeah. But I think the last one is probably the first one to look at given that that has more to do with the issues we're seeing, probably. Alright. I am going to look up kubelet driver and see what the options are here because I have no idea. Apparently, it defaults to systemd. Yep. Oh, there's our Kubernetes minute command. Yeah. So... Continue. Yeah. So if that is the default, then you haven't actually changed it. Right? Unless cluster API
29:57 doesn't use the systemd cgroup driver by default. Or maybe this is 1.21 docs. Is this just a new change? So the cgroup driver... there was already a cgroup driver line that I edited. It had something else before. So... Mhmm. Yeah. So, yeah, maybe check the older version. Yeah. Yep. Oh, thanks. Oh, no. And of course the search goes back to the latest version. Alright. Let's copy the bit after the thing and then come here. The page doesn't exist. Darn it. Kubernetes component. Okay, this is frustrating. Why can't I find that page on that
31:01 version of the docs? Okay. Tasks, administer cluster, kubeadm. So where's that menu bar now? Right. Tasks. Administer a cluster. Just step three. And configure cgroup. Oh, it's not in the docs. No. It's new docs. Sorry. Maybe there only used to be one option, so they didn't say anything about it. Well, that at least explains why we're unable to look this up. Yeah. Well, let's see if we could just take a look at this. This wonderful brand new 1.21 doc, which defaults to systemd; apparently prior to that was not the case.
32:03 So cgroupfs, is that? Yeah. Okay. So the cgroupfs driver. Yeah. It says here to continue using that. Good catch that. Well, we just... Yeah. Found some random Stack Overflow page that mentioned cgroupfs. So we've got two sources that state that as a potential candidate. Or should we migrate to the systemd driver following the correct instructions correctly, which Arian did not do. So... Let's revert to cgroupfs, I think. Yeah. I think that's a good call. Let's just get it working, and then you can
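The observation made a little earlier, that every key in the kubelet config was in alphabetical order except the cgroup driver line, can be turned into a quick heuristic for spotting hand-appended settings. The sketch below is an assumption-laden illustration: `out_of_order_keys` is a hypothetical helper, the sample config is made up for the example, and it only looks at top-level keys of a simple YAML file.

```python
from typing import List


def out_of_order_keys(config_text: str) -> List[str]:
    """Return top-level keys that sort before the key preceding them."""
    keys = []
    for line in config_text.splitlines():
        # Top-level keys start in column 0 and contain a colon; skip
        # blank lines, comments, list items, and indented (nested) lines.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        keys.append(line.split(":", 1)[0])
    return [cur for prev, cur in zip(keys, keys[1:]) if cur.lower() < prev.lower()]


# Made-up config in the spirit of the one on screen: sorted keys,
# plus one line appended by hand at the end.
sample = """\
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
volumeStatsAggPeriod: 0s
cgroupDriver: systemd
"""
print(out_of_order_keys(sample))
```

This only flags candidates worth a second look; a generated file is not guaranteed to be sorted, and an edit made in place would not show up at all.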
32:37 Attempting to Fix Cgroup Driver Issue
32:45 it in to share what exactly is happening here. So I'll do the control plane. Do you wanna update the worker node? Yep. I just updated the worker node. Alright. Nice. Yeah. I have a high level of confidence this might fix it, but I'm not sure. Yep. So we need to restart our kubelet. Alright. Nice. Alright. I will give you the honors of running a get pods on our control plane. Let's see if that has helped at all. I'll pop over to the chat and I'm sure everyone is yelling at us. So Noel confirmed the zeros are sort of default.
33:27 You know, I fall for these zeros in the Kubernetes config all the time, default versus something nefarious. You mentioned in dev, it doesn't exist. That's a great idea for a sub command though. Let's make that happen. Ben noticed that everything was in alphabetical order except the cgroup driver. Yeah. I never thought of checking that. Very clever. It looks like we have a new error, though, at least. Yeah. We're in unknown territory now, though. I don't know how to fix it from here on. So this is... Great. He did mention changing that one other
34:04 config. Okay. What's the resolv.conf? Yeah. But it's not related to the issues we're seeing now. Okay. So I think that's just because we've changed it. Right? And we have running processes probably using the systemd driver. Yeah. So now there's like two views of the world. Right? So I don't know. Maybe we should restart containerd or something on the... that are running. What we would have to do is probably... We can cordon and uncordon. Can we check systemctl status just to see what processes are running and stuff on
34:48 the worker? Well, the control plane is degraded anyway, and they are the same there. Gotcha. And if we scroll down a bit, just see what stuff is running. There's a lot of output here. It should show. So... So here is all the containers running. So it still seems to be using the systemd cgroup driver. That's why it shows up nicely here even though we changed it. So... Nice. That is a horrible way to break a cluster. Yeah. Okay. Let's think. I say we make the systemd cgroup driver work. How hard can that be? With the
35:42 upgrade path on the docs? Oh, it's actually telling us to drain the node and then restart the kubelet. So yeah. We can do that. Alright. Throw random systemd at the cluster and it breaks. Alright. I like this comment. Let's stick with the worker node right now. So let's update it. Let's set it back to systemd. Yep. D. Oop. That means... Oh, no. That one doesn't exist yet. So... And then we have to drain, or we need to... Do we need to restart, or do that after? Do we need to restart the kubelet?
36:30 Or wanna do that after we drain it? I guess it... yeah. I think with how fucked that is right now, it probably doesn't make... I think I would just restart it for now. It's fine. Yeah. But after... I get nodes, and I'm just gonna drain... Yeah. Our worker. Okay. It's now being cordoned, workloads will start to drain off of it. What we can do now is that we actually stop the kubelet. We've had the kubelet's the only thing running because there's nothing else here. The DaemonSets will be on it. We can restart those manually.
37:06 Did we restart the kubelet already or not? We have once. Yeah. But I think I did it before he cordoned it though, I think. Yeah. Yeah. She wanted to... Hey. Look at this. There is a container creating now for that. Oh. But I think now because we restarted it again. Oh, something is running. It's good. It's running. We have zero of one ready, though. Yeah. Do you wanna uncordon the node? Or did you do that already? I did not. But how is it scheduled? Interesting. Scheduled onto the control plane? Oh, wait. I think it didn't
37:45 actually... it complained that it couldn't drain. Look. Unable to drain. Oh. Oh, but then it should still be cordoned, right, even if it didn't drain it? Or I think if it's... We can't delete the pods with the storage. Right? So we could have just told it to delete the emptyDirs because they're not required. But I think we've done enough to get it back online. Right? That pod is coming. That pod is coming. Oh. Yeah. It's back. Alright. Let's uncordon and... Yeah. Let's get a port forward running, see if we can have our application and try
38:25 the update. Like, all we need to do is get our app working and then... Yeah. If we get the update, mission accomplished. Right? Exactly. Yeah. You can just do a kubectl uncordon and then the node name. I don't know the node name. Yeah. I don't know the node name. The worker. I still find it weird that those pods got scheduled and run on it. It was cordoned. We're done. Okay. I'll jump over to my local terminal for the port forward. Okay. I mean to set it up on Teleport so that we can just expose it
39:09 from the UI, but I haven't done it yet. So, get pods. Port forward. Uh-huh. No. Oh, no. There's... Oh, well, actually, the DaemonSets will need restarted with our Cilium pods. Right? And so if you wanna do a kubectl get pods on our Cilium namespace, we probably have no networking at all in our cluster. Oh, what? No. I will just throw out my idea here, which is the one I try not to recommend anyone try at home, but it's to do delete pods all on the Cilium namespace. I'm totally okay with that if you
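The order of operations being discussed (drain the node, change the driver, restart services, uncordon) can be written down as a small plan generator. This is a sketch under stated assumptions, not the documented procedure verbatim: `cgroup_driver_change_plan` and the node name `worker-1` are hypothetical, containerd is assumed as the runtime, and `/var/lib/kubelet/config.yaml` is assumed as the kubelet config path. The code only prints the steps and runs nothing.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (description, shell command or None for a manual step)


def cgroup_driver_change_plan(node: str) -> List[Step]:
    """Build an ordered checklist for switching one node's cgroup driver."""
    return [
        ("Move workloads off the node",
         f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"),
        ("Stop the kubelet", "systemctl stop kubelet"),
        ("Set the same driver in the runtime config and in "
         "/var/lib/kubelet/config.yaml (cgroupDriver)", None),
        ("Restart the container runtime (assumed: containerd)",
         "systemctl restart containerd"),
        ("Start the kubelet", "systemctl start kubelet"),
        ("Allow scheduling on the node again", f"kubectl uncordon {node}"),
    ]


plan = cgroup_driver_change_plan("worker-1")  # hypothetical node name
for number, (description, command) in enumerate(plan, start=1):
    print(f"{number}. {description}")
    if command:
        print(f"   $ {command}")
```

Keeping the drain first and the uncordon last is the point of the sketch: pods created under the old driver are moved away before the kubelet and runtime change their view of the cgroup tree.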
39:16 Checking Networking & Port Forward
40:14 wanna... But they're ready. Two current, two ready. Two up to date, two available. That's... Let's try it. Never hurts. Oh. Pods dash dash all. Oh, that's very interesting. Oh, no. No. It's... no. No. It was just in Cilium. No. It was just in Cilium. It was just a lot. Just... okay. It was just a lot of... I only bail if they got the fear. It's okay. Alright. What does a get pods look like by then? That hasn't returned yet. But... Yeah. We can... Yeah. Just control c. It just usually waits for
40:20 Restarting Cilium Pods
41:07 the thing to finish. We don't have to. Okay. The Cilium ones are coming up. Container creating. Okay. Yeah. I see, that's looking quite healthy. Yep. Give it a few more seconds. Alright. We got some more things. Coming. Alright. So can you do a dash o wide? What node are those first two on? Control plane. I don't think... did we restart the kubelet there after updating the config? On the control plane? Yeah. Not sure. Yeah. I don't think... I think I updated the config, but I don't think I restarted the kubelet on the
42:02 Restarting Control Plane Kubelet
42:26 control plane. What do I do? So I'll see system CPR. Noel in the chat has said that the docs say this field should not be updated without fully rebooting the node. Well, I don't read docs. Otherwise, the clusters don't break. So the sandbox for those first two pods will already be created under the, maybe, the wrong... yeah. I think we may have to delete those two pods and try and encourage it to restart again. I'm not sure. Are they running on the control plane? Yeah. They are. Yeah. They are. Let's give it a bit of encouragement.
43:16 Yeah. Let's see what a get pods shows us there. Same. I guess let's describe and see what... Let's describe another... Do you have a preference between the operator and the other one? Let's do the second one just because it's in a container stage. So it's actually beyond container create, and I think... don't know. Do you think we can get some... k. Oh, yeah. It's not happy. Expected cgroupsPath to be of format, blah. So, yeah, it's still looking for this. Can we just restart the kubelet one more time on that node? Why not?
44:35 I actually did follow the docs for updating to the systemd cgroup driver. So we might have gotten ourselves in a worse scenario than we started. Maybe. I don't know. It's broken. Oh, yeah. Should we just validate that we correctly updated those kubelet configs and didn't fool ourselves, like, have not saved correctly or something? Let's do it. Maybe we can... can we return this to systemd and restart it? Just... That's what we tried. I thought we did. I've obviously been a bit... We forgot to hit save. Yeah. Because the control plane node, it was
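One reading of the cgroups path error above is that the kubelet and the container runtime disagree about the cgroup driver. A minimal way to compare the two settings from their config files is sketched below. The helper names are hypothetical and the parsing is deliberately naive (regular expressions rather than real YAML or TOML parsers); it assumes containerd's `SystemdCgroup` runc option and the kubelet's `cgroupDriver` field.

```python
import re
from typing import Optional


def kubelet_driver(kubelet_yaml: str) -> Optional[str]:
    """Read the top-level cgroupDriver from a kubelet config, if it is set."""
    match = re.search(r"^cgroupDriver:\s*(\S+)", kubelet_yaml, re.MULTILINE)
    return match.group(1).strip("\"'") if match else None


def containerd_driver(containerd_toml: str) -> str:
    """Infer containerd's driver from the SystemdCgroup option (assumed layout)."""
    match = re.search(r"^\s*SystemdCgroup\s*=\s*(true|false)",
                      containerd_toml, re.MULTILINE)
    return "systemd" if match and match.group(1) == "true" else "cgroupfs"


def drivers_match(kubelet_yaml: str, containerd_toml: str) -> bool:
    """True only when both sides explicitly resolve to the same driver."""
    return kubelet_driver(kubelet_yaml) == containerd_driver(containerd_toml)


# Illustrative fragments, not the files from the session.
kubelet_cfg = "kind: KubeletConfiguration\ncgroupDriver: cgroupfs\n"
runtime_cfg = "[plugins.cri.containerd.runtimes.runc.options]\n  SystemdCgroup = true\n"
print(kubelet_driver(kubelet_cfg), containerd_driver(runtime_cfg),
      drivers_match(kubelet_cfg, runtime_cfg))
```

When the kubelet file has no `cgroupDriver` line the helper returns `None` and the comparison reports a mismatch, because the effective default depends on the Kubernetes version and how the node was provisioned.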
45:01 Verifying Kubelet Fixes & Status
45:23 not broken when I left it. This is something we have introduced ourselves accidentally, which is fine. It's more fun. You said give me one job and I broke it. We may have to encourage those pods again. I'm not sure. Yeah. If it's still screwing up, I would maybe try to reboot the control plane. No. That's not a horrible idea, actually. Bare metal does not reboot quickly. Oh, yeah. Well, we can maybe kexec into itself. That's a nice hack too. Sure. Yep. But I think this should probably get it back up and running. Let's see. Okay.
46:13 There we go. There we go. Good. Okay. So I'm gonna redo the port forward. Mhmm. Do we think the... Click. The test work will help? Hey. You fixed it. Good. So now do we need to try to upgrade it, David? Is that... I'm a bit worried that I hit reload, and it's now quite broken again. Yeah. It's finished reloading. Right? Okay. It's just timing out very slowly. I have to do the two second timeout to that, so that's annoying. Okay. Let's check our service for Postgres. And we just wanna make sure it's got
46:50 Confirming Service and Networking
47:02 some endpoints, I guess. It does. It does. What was the error again? Was that failed to look up address? Okay. Is CoreDNS running? We probably wanna run a get pods in the kube-system namespace. It is. Looks like it's running. Maybe we should try to port forward again. Oh, it also restarted once more. That restart number is climbing. Oh, it's working. And I fixed the video encoding. Alright. Sid, go for it. Let's update this application. Alright. This deployment is just named clustered? It is. Yeah. I'm very confused why it's working. That's funny. I added another little booby trap that
48:07 Upgrading Clustered App to V2
48:25 I was expecting to be triggered, but it didn't. It's fun. So we just want v two on the image tag? Yeah. Please go for it. Then maybe it will break now because we update this. Let's see. Very curious. Who knows? Alright. We'll spend a few more minutes on this, and then we'll switch. Container creating. That's good. Terminating. Alright. Let's run get pods. Reconfigure our port forward. This looks good. I heard that better tone in your voice there. It working? Why is it working? Why is it still v one? Still v one. Let's just edit again.
48:48 Verifying V2 Upgrade (Success!)
49:18 Did not take it says v two there. Yeah. I wonder if it Did it bring up the new version already? Yeah. In the pod? I mean, of course, when I was updating the Is it maybe caching in the browser? Wait. Is that v two? Yes. Yay. That's good. That's good. Yeah. Alright. Good job. I didn't like that one. Well, thanks for dragging me along there, David. So you wanna fill us in on that? On what I broke? Yeah. Basically? Yeah. So I am an avid systemd user, and I was like, why isn't this running
49:50 Explanation of Cluster 1 Issues
50:10 the systemd cgroup driver? So I changed it. Which just means that systemd is responsible for managing the cgroup tree. And every node in the cgroup tree is managed by a systemd unit, and you can set limits on each of the cgroups created using systemd unit files. So I limited kubepods.slice to only allow 100 megabytes of memory, and this doesn't get reflected by Kubernetes. And it's like, oh, I have all the memory I need, and it keeps scheduling things. And then systemd starts
50:48 throwing OOMs at the thing, and then the kubelet says, oh, this pod is in a very this node is in a very bad state. It's under memory pressure. We shouldn't schedule more things. So Well, that just makes sense. Yeah. Place, though. Right? Yeah. But I think we kind of because we moved back to cgroupfs, it kind of fixed itself somehow. Yeah. So I just made one change on the worker node, and that's this drop-in file for kubepods.slice somewhere in /etc/systemd. There's a Oh, yeah. kubepods.slice.
51:21 d. And this is the only thing I touched, and then everything broke. So yeah. Yeah. Yeah. And I broke one more thing, but it stopped breaking: I made CoreDNS point to the /etc/resolv.conf from the host. And because I'm a systemd advocate, I changed /etc/resolv.conf to point to systemd-resolved, which happens to have the same IP address as CoreDNS, which would cause CoreDNS to end up in a loop forever and crash. But that didn't happen, which was interesting. It did happen yesterday. So but yeah. So that breakage didn't
51:59 happen, but the other one was I was like, this much should have as much memory. The stuff in the Arian namespace, was that just a red herring? Just a bunch of Yeah. It was just me trying to deploy, and I didn't clean up after myself just to see if things failed to deploy. Yeah. I did make sure to delete the bash history on the nodes so that it wouldn't be obvious what I changed, but I forgot the Arian namespace. Yeah. Alright. So I mean, there's a whole bunch of stuff there. I had no idea. But, like,
52:30 I've never modified the cgroup driver because why would anyone in their right mind do that? Yeah. So it should default to the systemd one. It's just that I noticed, for some reason, Cluster API doesn't use it yet. There was this issue. I was like, maybe we can wreak havoc with this. This is interesting. So Alright. Well, thank you very much. I hated every second of that. The server lies about its amount of memory, and there's nothing you can do about it. Yeah. But now it's time for revenge. Right?
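The failure described above, where the scheduler trusts the node's advertised memory while a systemd drop-in quietly caps the pods' cgroup, can be sketched as a small check. This is a minimal illustration, not what was run on stream. It assumes the limit is read from a cgroup v2 `memory.max` file (which contains either `max` or a byte count) and that the allocatable figure comes from the node's status; the function names are hypothetical.

```python
from typing import Optional


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Parse the contents of a cgroup v2 memory.max file.

    The file holds either the literal string 'max' (no limit)
    or a limit in bytes.
    """
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def hidden_memory_cap(allocatable_bytes: int, cgroup_limit_raw: str) -> Optional[int]:
    """Return the cgroup limit if it is lower than what Kubernetes
    believes it can allocate on the node, otherwise None."""
    limit = parse_cgroup_limit(cgroup_limit_raw)
    if limit is not None and limit < allocatable_bytes:
        return limit
    return None
```

With a hypothetical node advertising 8 GiB and a drop-in capping the slice at 100 MiB, `hidden_memory_cap(8 * 1024**3, "104857600")` would return the cap, which is the mismatch that led to the OOM kills and the memory pressure taint.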
53:05 Transition to Challenge 2: Adrian's Cluster
53:05 Yes. Definitely. Now I have to suffer. Alright. You wanna join our session? We are now dealing with Sid's cluster. Yeah. I see an active session. I click options, join session. I will I see you typing. Configure the bits and pieces that we need. And I will give you the honors of checking for an API server. And the echo hello. Yeah. Fair play. It's alive, but we're missing a pod. This is right. That is great. Do we still even have a deployment? We do. Interesting. Do we want to describe it to see Mhmm. Let's do that. What's going on?
53:52 Diagnosing and Fixing Replica Count
54:04 Reload. It says let's see. I mean, the replicas Oh, yeah. Is set to zero. Oh, yeah. I'm also gonna open the chat at the same time so I can steal hints from other people. Just let me move you down. I think you need to scale this one up, and I always love waiting to see how people decide to scale that. Some people edit the deployment, some people do a proper scale up. Let me I'm gonna edit this, I think. Yeah. Because you said that word first and then it's like, oh, yeah. That makes sense. I could edit it.
54:45 Yeah. Somebody messed with this. Generation six. So there were a lot of edits to this thing. So I have a feeling there will be more more issues when I change this. But how many replicas do we want? Just one? Yeah. We only need one. Okay. That's scary. That was that was what we needed to happen. I wanna check other namespaces now. There's there's a trap. Were you worried that that wasn't gonna happen there for a second? That was just that was my I I had deployed something, and and there was waiting for that to happen. So Oh, look at that. We
55:05 "Not a Virus" Pods Appear & Replicate
55:26 have Oh, I have the the scroll bar issue, I think. Yeah. You can reload. It fixes it. Yep. Okay. Let me try to leave this page. Yeah. That's much better. Okay. No. No. No. For me. Why is this? Oh, I I already regret signing up for this. I'm just curious Alright. Okay. So let's maybe look oh, no. It happened again. Oh, that's because I reloaded. Okay. So and will that now happen to you when I reload or not? Yeah. I'm gonna try Chrome. I've never actually tried a different browser to see if that helps. So let's do that. But
56:10 I think it's just trying to sync the two sizes because you're zoomed in more than I am. I think this is what's causing the Oh, really? Okay. Oh, no. And they're all oh, they're all these namespaces were created, like, ten seconds ago. No. What did you do? Yeah. I think it's just syncing the size of our two windows. That's that's the problem. That's Actually, resizing my window fixed it. I'm gonna leave it like that. Wait. So is UUID is what you often use for your namespace nameset? It it makes sure I don't have a collision I guess.
56:47 Investigating Source of Replication
56:59 Oh, not a virus. I don't mind not a virus pods as long as there's not a crypto miner pod too. Oh, it's why is everything so oh, no. I have okay. David, how are we gonna get this out of our cluster? Get what out of our cluster? The everything. Oh, man. Do you want me to delete all the UUID namespaces? Yeah. We could, right? Maybe I don't even want to know what's in there. I just want to get rid of it. But I'm kind of scared because some of them are, like, two seconds old. So
57:37 I have a feeling there's something nefarious. Yeah. You're right. Yeah. Okay. Yeah. So I wonder if it's gonna help. But we can try it. That's Well, let's think about this. Right? So we actually did modify our deployment. It went through and then an event triggered it. So I don't think this was a mutating webhook configuration. Of course, we could double check. But the fact that the edit went through and we got a new event suggests that there's potentially a controller running on our cluster that's also watching the resource. But why wouldn't it be a mutating
58:11 like, this is the first thing I would think of like, maybe it would be something like Because if we had a mutating webhook configuration, we wouldn't have got the scale down event. I see. Because it would never have changed in the first place. I see. I see. So it's not something blocking it. It's just something else watching it. Yes. I think. I see. Let's check. Right? Yeah. There's no mutating webhook configuration. So this is another controller probably running in one of these namespaces. Yeah. Do we maybe want to Sorry. Need check.
58:40 Maybe want to check what service accounts there are or like Well, my thinking is, right, with the time on these namespaces is that the not a virus workload will not be running in one of those namespaces because they're too ephemeral. So I think we should just stick to looking at maybe the five namespaces that we know and trust and see what's running in here. Mhmm. Yep. Makes sense. So let's do a why don't we do a get pods, all namespaces, grep Running or something. Yeah. That sounds good. Okay. We do have a couple of We have
59:24 not a virus which is running. A lot of them, actually. Oh, no. Do we want to maybe describe one of those pods? Yeah. Go for it. Describe one. Oh, but I need to know what namespace it is. Right? Oh, no. Okay. Wait. I'm gonna open just describe all namespaces pods and then just that name. Yep. Yeah. There should be one of those pods in each UUID namespace. Paste. So if you I can do it. Like, will this work? No. I cannot do that. Kevin is looking for home describe all the pods in one namespace. There's only one
1:00:03 Examining Rogue Pod Details & API Access
1:00:14 pod in each of those namespaces. I see. I see. That makes things that. So let's do again this. Pick one that is running. We pick our favorite namespace. Let me paste it. K. So there is a container called slash clustered. Indeed. Do we want to know what's in there? It's using the downward API, though. What is the downward API? I have no clue. A way to, I think, push something from the cluster into the pods so you can consume it. Usually, like, the IP address of the control plane or something like that or the
1:01:09 pod name itself. You can push down a whole bunch of stuff. Interesting. So this seems like this way, it's hooking into the Kubernetes API somehow. Yeah. And we've got a config map called the Kubernetes root CA. So it's actually using our I would imagine Kubernetes admin cert. Now depending on whether we want to I mean, we could delete that config map and then those pods can't modify anything on the API server anymore. Yeah. So you want to delete the kube-root-ca.crt. Well, let's
1:01:40 Analyzing Rogue Namespaces by Age
1:01:47 take a look at it first. But, yeah, I think if we get rid of it, it should stop the the radius. Just want to know what namespace that config map is in. Right? Like, maybe it's duplicated across all the let's just look at the okay. So we have a volume mount for this service account. Blah blah blah. K. So here is the weird stuff. Yeah. So it's using the downward API to know what namespace it's in. Yeah. You'll have to help me here. I've never seen this downward API volume type. This is only Let's see how many
1:02:35 of these things we've got. Right? Look at Oh, no. Yes. Great. So I want to describe one of them first. Mhmm. Okay. So It's just It's it's there. Yeah. Right. Yeah. So And I suppose this is the the server certificate of the API server just for some One of these namespaces is gonna be longer lived than the other sort of ephemeral ones. There has to be something running outside of these UUIDs or one UUID that isn't as malicious. I'm looking for a timestamp that's let's sort it. Yeah. Let's sort sort dash n on the Three.
1:03:24 Three. Right? K three. K three fourth column. Right? And then dash n for yeah. This this works. Yeah. Good. Oh, let's how do we reverse it? Do you remember? No. I don't know. Dash r. Dash r? Yeah. Oh, but wait. Yep. I think you have to put it on the Oh, it's the key. Right. Exactly. But it's alphanumeric sorting. There we go. This helps. Okay. Let's just scroll up. We can do it. Oh, no. It's not sorted by yeah. You're right. But still, we can get the longest running thing from here. It's probably three seven. I see things that have been
1:04:21 running for seven minutes. Said we're no longer friends. Yeah. But it all triggers because we we scaled up this thing. Right? Before that That was that was what I was waiting on. Because there's nothing that's older than that. Like, can we look at things okay. How about we look at all the the different things and just look at things that are younger than when you created the cluster, but not from the past seven minutes. Right? Just to kind of know oh, there's nothing there. So this all hasn't been touched as far as I see. Right?
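The trouble with `sort -n` above is that the AGE column mixes units, so a value like 2d22h does not sort sensibly next to 7m. As a rough sketch of one way around that, the helper below converts such strings to seconds before sorting. It assumes ages look like the ones kubectl prints (digits followed by s, m, h, d or y, optionally chained), and the function names are made up for illustration.

```python
import re
from typing import List, Tuple

# Seconds per unit letter as used in kubectl-style ages (assumed format).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}
_AGE_PATTERN = re.compile(r"(?:\d+[smhdy])+")
_AGE_TOKEN = re.compile(r"(\d+)([smhdy])")


def age_to_seconds(age: str) -> int:
    """Convert an age such as '2d22h' or '7m' into seconds."""
    if not _AGE_PATTERN.fullmatch(age):
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _AGE_TOKEN.findall(age))


def oldest_first(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (name, age) pairs so the longest-lived entry comes first."""
    return sorted(rows, key=lambda row: age_to_seconds(row[1]), reverse=True)
```

Feeding it (name, age) pairs parsed from a namespace listing would put the longest-lived namespace at the top, which is what the sorting attempt here was after.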
1:05:00 All two days, twenty two So this looks fine. So maybe the thing that triggered it is actually not in the cluster. Yeah. I'm starting to think about myself. Yeah. Maybe we're getting put on a there are some red herrings. Lots of them. So something is missing from the outside, I think. Maybe it's the clustered pod that does this, somebody says in the comments. Yeah. We should describe that, actually, and take a look at the image. Yep. That would be a good way to hide something. I'll give you the honors. I'm gonna comment much. We can just
1:05:32 Re-examining Clustered Pod as Trigger
1:05:54 so what is that container that we're using? Okay. That's Postgres. That's fine. And the other one where are you? Oh, wait. Oh, yeah. There was never anything running. So it would be the Postgres one that's nefarious, but it looks like it is. Well, it's been updated to take our weird config map with the kubectl. This looks like it's just a standard Postgres image. Right? Unless what's the image pull policy? I haven't touched the Postgres deployment. Yep. K. So why has it got the kube root CA there? Yeah. Exactly. Why does it also have this volume?
1:06:34 That's interesting. That makes me suspicious. Is that just always there, and I'm just overthinking? I think it might be. Yeah. I don't know. No. There's never a kube API. This is A volume that contains injected data from multiple sources. Type projected. Maybe there's just some global config that just gives it yeah. Maybe we are on a Yeah. But I think maybe I've checked myself or checked those here. Okay. Let's focus. Something has API. Okay. Let's try something. Okay? Mhmm. The only way to access this cluster is from within the cluster or from admin.conf.
1:07:24 Oh, then we can't access it. Well, we can't. Protect it. Oh, .conf. Dot conf. Yeah. Yeah. Well, there's also the is that true, though, that you could also use the kubelet certificate or whatever or the There could be static manifests. Yeah. But those don't look like they've been modified. But, I mean, there's also all these certificates. Right? And I don't know where it is. /etc/kubernetes/pki, which also all give access to the cluster. So Alright. Let's do v coop. I see You could look at the logs of one of the not a virus pods to see if it's doing anything, maybe.
1:08:17 Yeah. We could try that. Do you want to try that? I have a feeling this is a hint. I don't know why. So you don't want me to run this? Oh, no. You can run that. You do. Yeah. Okay. Namespaces, print the name, grep, so that we keep kube root by default. Let's not delete the default namespace. Oh, you can't anyway, but still. But I don't think you can delete any of those namespaces except for the root. Yeah. Let's just nuke everything. I have a feeling as we're deleting the new ones, but it's like Hydra.
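The filter being typed here (list the namespace names, keep the built-in ones, delete the rest) can be turned around: instead of excluding known-good names, select only names that look like UUIDs. The sketch below assumes the rogue namespaces are named with plain lowercase hyphenated UUIDs, as they appear to be on screen; the function name and the sample UUID are hypothetical.

```python
import re
from typing import Iterable, List

# Lowercase hyphenated UUID, e.g. 3f2b8c1e-9a4d-4c7b-8e21-5d6f7a8b9c0d
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def rogue_namespaces(names: Iterable[str]) -> List[str]:
    """Keep only namespace names that are UUID-shaped, so default,
    kube-system and the application namespaces are never selected."""
    return [name for name in names if UUID_RE.match(name)]
```

Selecting by shape rather than by exclusion means a namespace nobody remembered to exclude cannot be deleted by accident.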
1:08:55 Okay. So let's follow your train of thought right now then, which was you want to what? Sorry? My train of thought. Well, I got a hint that I should have looked at the logs of one of these not a virus pods. Alright. Do you wanna So /var/log/containers and start poking around then. Can we do that? Can we have multiple windows hidden here or how does this work? Well, I've opened something on the worker if you wanna jump onto that session and poke around the logs there. And we'll just leave that delete. Okay.
1:09:29 I'm just assuming not a virus schedules on the worker, but I could be incorrect. Okay. So the nice thing about I think doesn't containerd also send things to the journal, or do we have to go here? You need to go to /var/log/containers. K. And then we no. Not the container you're looking for. It might be. So we can just do we want to all of them, I guess? We could. Right? Okay. It's running kubectl apply. It's doing stuff. Yeah. Looks nefarious. Oh, it's a job. Oh, there is
1:10:12 a okay. That's I think there might be some cron job that's messing around. I think this is gonna go on forever. Let's just do we cancel it? Get the yeah. Look at this. And there's probably also a cron job thing that's just spawning things all the time. No. Okay. Okay. Interesting. But let do we want to try to nuke I don't think that's gonna work either to to nuke all the jobs. It's probably do we want to look at what this container is doing? Maybe Well, we got some we did get some debug output. Right? So it's it's creating a
1:10:52 job. It's doing this it's scaling down our deployment. So this was reactive. Right? So this happened because we tried to scale up the deployment. Mhmm. It did. Yeah. So there's something that's watching the clustered deployment. Yeah. I don't know if that statement is true. It is not running on something other than the cluster or the two nodes. Okay. Doing Like, I'm not sitting here on my computer running something. Just hitting enter every time we do something. I'll get them this time. Alright. There must be a Rawkode player somewhere. So Do we want to check
1:11:51 Searching for Rogue Process on Nodes
1:11:52 if there's maybe a process outside of Kubernetes running on the control plane? We're just talking to the API server. Yeah. That's what I tried to do. I was trying to filter out the kernel processes, but, apparently, it just thought that was a regex. Why does it think that was a regex? Do you think There we go. You. Yeah. The kube-apiserver. Let's see. I don't see a lot of nefarious things here. Do we? It is not running on the control plane node either. K. But it might be running on the worker node. Who knows?
1:12:50 I don't know. Do we want to check just to be sure? There's a lot more running on the worker node for sure, but that makes sense. Oh, look at the sleeps. Yeah. I think this is k. I don't think there's anything running outside of Kubernetes that's of interest from this. Well, where are those sleeps coming from? What was the root cause of this? I think it's containerd. So Yeah. So it's probably part of the Oh, look. There's a setup and run dot s h. There is. Yeah. I think it might be worth looking. How can we
1:13:37 dive into that container? Can we maybe attach to one of those containers and kind of look what's inside? Is that a good idea? Like, attach to one of We can just cube CTL. What's it? A text to run. ID. So let's grab the I was gonna say you could launch an a new version and override the the entry point if you wanted. Oh, yeah. We could Process isn't running anymore. But Yeah. We I can try doing it from from using Well, we can just try control. I was thinking of just attaching to an existing one with kubectl,
1:14:26 and then but we can also do it through this. Yep. It's fine. /var/run. It should be correct. Oh, without the container. Just runtime endpoint, apparently. K. Is it info? Yeah. This looks all fine. Can we attach? Yeah. I was looking for the IDs of the namespaces. Unless you don't have to attach by another means. We can do it through kubectl. Right? You mean the kubectl? To it? Yeah. Alright. So you wanted to get pods dash a and just pick one. Exactly. Yeah. Thanks, Noel. That's what I was looking for. crictl exec. Okay. Exec.
1:15:58 IT. Oh. IT. Pending pending. Oh, and they're all jobs. Oh, no. They run for about thirty seconds each. Okay. That's okay. That's good enough. Right? So we just filter we can get this done in thirty seconds. We just put a sleep. Like, if we are executing into it, it will not kill until we kill the thing. Right? But these are things that run in the cluster, and I keep coming back to that first process that ran after we scaled the deployment. Yeah. But that They are pretty much the same workload.
1:16:50 I'll give that hint. But was it running in the cluster? It was not. Yeah. See. So So the original thing that triggered it was not running in the cluster. Does it Yeah. And I'm assuming that's still running. Yeah. And so you're so even if we find a way to block not virus by looking into it, it will just spawn more not a virus after it. Right? Well, yeah, I'd like to get to that first one first because the because it's gonna be the same code, the same binary, the same script. As soon as we get to that, we'll
1:17:21 understand it more and then we can thing. Now on the worker So what was the script called that we saw? Set up and run. So maybe just find through the entire disk and grep for that. Just do okay. Let's just What is upowerd? upowerd is a power management thing. I think it's what is /root/nerdctl? That looks nerdctl is the new That looks containerd CLI. Ah, okay. But it shouldn't be in the root directory. Exactly. That makes me and it shouldn't be and this it shouldn't be installed on this
1:18:06 kind of node by default. Right? So I won't look at the bash. It's empty anyway. So /root/nerdctl is running. It's interesting. But that might also be part of that con wait. So can I try something? Please. Let me just By all means. Let's see. Yeah. That makes sense, but I want to know if it finds it in other places. What makes sense? Sorry. I'm scratching my head right now. No. It's only inside containers, but interesting. But there might be a container running, right, that was spawned on the worker node directly outside of Kubernetes.
1:18:53 It's listening to everything that starts the whole shebang. So that's not how can we find out all the containerd processes that are not pods spawned by the kubelet? All of the pods are in the Kubernetes namespace for containerd. Oh, I have a way. We can stop the kubelet, then we can stop all the containers in containerd. Yeah. Let's cordon that. I don't trust this node anymore. It's dead to me. But does it make sense if we stop the kubelet, then at least we oh, yeah. Cordoning makes sense as well then. Yeah. I don't know why we took
1:19:40 so long to do that when we have a rogue process running. But sometimes you just don't think. Yeah. But you didn't drain it. Right? You only cordoned it. Right? Do you also want to The cordon does the drain. Oh, no. I meant to drain, which does a cordon. Ugh. Sorry. So that should just leave the DaemonSets. So, like Yeah. No worries. So now let's see if there's still something running. Ordering by dates probably also helps in this view. And then not the container you're looking for. Still running thirty seconds So that's cleaning up there. So
1:20:19 I say we run crictl ps again. Is that liveness Prometheus? Is that out of place? There's no Prometheus on the cluster. Yeah. That looks interesting. Right? Let's do an info on that liveness container ID and see what it is. Sure. But it was also created two days ago, which gives me the feeling it might not be the thing we're looking for. But Docker image. It's part of Cilium, it seems like. Is it? Is it or not? I don't know. Oh, no. That's just the CNI I was getting. Alright. Let's see. Info command
1:21:27 isn't ideal. What else do we have? I'm not that familiar with crictl. Inspect. Mhmm. Why did I give us a machine with so many disks? Or is that that's TTY. Yeah. These are just all the devices that are oh, this is part of Ceph. It's part of this. So okay. We put. Okay. So there doesn't seem to be anything nefarious running anymore on the worker nodes right now. So maybe it's something nefarious running on that. I think so. This all looks fine to me, except that Cilium operator was created a minute ago. That's kind of suspect.
1:22:26 Yeah. There's, like Okay. Let's do one more thing. K. Let's see what images we've got on this machine. We did get the hint that it was the same workloads. Right? So Yeah. So it's the SIP plus cluster. Exactly. A few versions of it. So That was just me developing it. Exactly. Yeah. Okay. So o o three is the only one that's running. Okay. You know what? Let's do something weird. Okay. Let's block docker let's remove the image and block docker.io in DNS. I was gonna pull it and actually take a look at the script and say that
1:23:23 it might look Oh, yeah. That we can also do. Yep. That's if that doesn't work, we're gonna leave all the other images intact, nuke this one, and then make sure we cannot contact Docker Hub anymore. Yeah. We we can add a we could we could definitely block access to that file. I hope this doesn't frame my machines that oh, start Docker. There we go. It it should not. That's Yeah. I I I don't plan. I mean, I'll run it, but I'm gonna override the entry point. So we should be okay. That's the main you've been kind enough to
1:24:00 leave a shell in the image. Oh, yeah. Well, there was a shell script running. Right? So that should be fine. Yeah. Go on, Docker. Oh, it's a private image, though. No. Oh, but we can get the credentials out of Kubernetes then. Right? So Yeah. That. So do you want me to So you were very close when you saw the nerdctl. I couldn't get crictl to work. That's why nerdctl even was in the logs at all. Mhmm. So we were close with nerdctl. Interesting. But why is it not on the or
1:24:38 why is it not anywhere? Oh, yeah. It's alright. Need go ahead and Let's just do How is that updatedb and locate? There's lots of things. For our lit nerdctl. K. This has helped me. There's lots of processes named nerdctl running. Okay. So I guess interesting. So maybe those containers are running nerdctl. No. It's still running because we stopped all of those. Right? So Yeah. Interesting. So we have a lot of there's a network namespace named nerdctl. And how about no. Damn it. Do we How is this okay. So this is our nasty process.
1:26:02 Okay. So we've got a process here. Right? Yeah. We can do a process ID. Alright. Let's go in here first. There's a way to n s. And this is our namespaces. So why don't we do an n s? Do we have nsenter? And then Yep. We do. We can tell it to enter the. So we just give it the ID. Space this. I'll redo this again. This is where we wait for somebody in the comments to tell us. Just to drive it. Maybe it's f, maybe it's not t. I think it might just target
1:26:04 Accessing the Rogue Container
1:26:44 PID to alright. Okay. Did I just go the wrong way around? So is it the nsenter mount Alright. Oh no. We're not doing PID because we know. Oh we could do PID. Okay. How did I tell the namespace then? Target. Or dash m for the mount namespace. Mhmm. Okay. So nsenter mount /proc. No? I can't remember how to enter. I'm feeling Me neither. So we'll figure it out together. No. That's fine. Did we get a hint? Thank you, Noel. I must have been I gave a hint since I think cleaning up will be more interesting than continuing to
1:27:47 hunt. Yeah. Good. Thanks. K. So let's did that not work? Okay. No. That's our normal host. I don't think we're in the Interesting. What was your hint? ctr c list. Right. Okay. Let's do that. And then exec into the container itself. Yeah. Let's do this. Did that not work? Because why not? I think maybe I just got the wrong way around. Maybe. Yeah. I think the IT comes here. No. Like I said, I couldn't get crictl to work, so I installed nerdctl instead. K. So we But then I deleted it. I can reinstall it real quick?
1:29:03 I feel like the nsenter should have worked, though. Yeah. That was also a bit surprising to me. Just enter a p. I've not got a PID anymore. P f three x grip. Okay. Yeah. Noel's always on the ball with this stuff, isn't he? I forgot the runtime endpoint. I think you got it right. No. We got an error. Oh, so my container Failed to find container. Oh, that's okay. I can fix that. So Yeah. But it's not yeah. But the funny thing is it doesn't show up here. Right? Maybe there are two instances of containerd running.
1:30:07 No. The problem is that crictl, I think, defaults to the Kubernetes namespace. It only shows the Kubernetes pods. I see. So I just plopped a nerdctl binary into the root directory if you wanna use that. Awesome. Slash root. There we go. Alright. Exec, this, and bash, and IT. Hey. We're in. K. That So what is what okay. So let's look at that script. Right? Oh, there's lots of stuff here. Set up and run. Let's look. Oh, there's a read me. How about we look at that first? How kind?
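The reason the rogue container never showed up is that containerd keeps containers in separate namespaces: the kubelet's CRI integration uses `k8s.io`, while a container started by hand with nerdctl lands elsewhere (nerdctl's default namespace is `default`). As a small sketch of the check that would have surfaced it, the helper below takes a mapping of containerd namespace to container IDs, which could be collected with `ctr namespaces list` and `ctr --namespace <ns> containers list`, and returns everything outside the Kubernetes namespace. The function name and the sample data are hypothetical.

```python
from typing import Dict, List

# containerd namespace used by the kubelet's CRI integration.
CRI_NAMESPACE = "k8s.io"


def containers_outside_kubernetes(
    by_namespace: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return containers living in any containerd namespace other than
    the one Kubernetes uses; empty namespaces are dropped."""
    return {
        namespace: ids
        for namespace, ids in by_namespace.items()
        if namespace != CRI_NAMESPACE and ids
    }
```

Anything this returns on a node that is supposed to run only Kubernetes workloads is worth a closer look.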
1:30:40 Examining Rogue Container Files (README)
1:31:04 That's just my notes. But there's probably good info in there. Yeah. Do you want to look at the there's no vim here. Right? No. K. Okay. We'll just scroll up. It's not that long. K. Let's see what it does. This is a very nice hacker to leave us the notes on what he did. So there's to dos that have been done. Create Docker secret before pod, protect against failed kubectl calls. Create a pod which kills some clustered test workloads, replicates itself N times. Create a private Docker registry image. Copy to worker. Run container.
1:31:54 K. It's on nerdctl because you couldn't get the thing to work. Okay. What I think we can do, though, is just we can stop this container from running and then nuke everything from the Kubernetes cluster. Right? And then that should Yep. Help us. But do we want to know why it is stopping things from scaling, or do we just first get rid of the problem? I think we should run. It's not doing anything too intelligent. K. Yeah. Okay. Let's just get rid of it. K. Good. K. Let's now go back to the control plane
1:32:24 Cleaning Up Rogue Namespaces
1:32:34 node, I suppose. Right. And then we can do your delete namespaces command again that we did before. This one. And, hopefully, it should terminate this time. Or does it? Depends on how many namespaces we have. Yeah. Everything is cordoned as well. Right? So we shouldn't have new things starting either. The worker is cordoned. Yeah. K. You're getting praised for your Makefile. Yeah. A Makefile, a readme. It's good work. This is taking a little hoping it was just the only thing you had done was scale our deployment to zero. Yeah. Yeah. I was like, oh, yes. We
1:33:30 got it. It's easy. We got this. This is tough to test without having things go completely haywire. So I was very nervous at that moment whether or not it would actually work. Mhmm. This is a lot of namespaces. It's oh, you cancel it. Turn it. Yeah. Is the number going down? That's the question. Yeah. How many do we have? Yeah. Yeah. Wanted to check. K. 3,086. And if you run it, it's going down. It's just gonna take a while. Well yeah. We have to get rid of all of them first, I think. Yeah.
1:34:14 If you do --wait=false, it might parallelize it. Because now every time it's waiting for the namespace to be cleaned up before it executes the next command, so maybe we can Fire and forget. Exactly. No. That didn't make it much quicker. Darn it. We can maybe have xargs do it in parallel. Is there a flag for xargs to do things in parallel? Can you use? I have no idea. You want me to issue the same command from my laptop and see if we No. I think it's going right. Yeah. Yeah.
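To answer the question asked here: xargs does have a parallel flag, `-P`, which sets the maximum number of processes to run at once. A different way to cut down on round trips is to pass many namespace names to a single `kubectl delete namespace` call together with `--wait=false`. The sketch below only builds those argument lists and runs nothing; the batch size of 50 is an arbitrary assumption and the function name is made up for illustration.

```python
from typing import Iterable, Iterator, List


def delete_commands(
    namespaces: Iterable[str], batch_size: int = 50
) -> Iterator[List[str]]:
    """Yield kubectl argument lists that delete namespaces in batches
    without waiting for each namespace to finish terminating."""
    batch: List[str] = []
    for name in namespaces:
        batch.append(name)
        if len(batch) == batch_size:
            yield ["kubectl", "delete", "namespace", *batch, "--wait=false"]
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield ["kubectl", "delete", "namespace", *batch, "--wait=false"]
```

Each yielded list could be handed to `subprocess.run`. With roughly 3,000 namespaces and batches of 50 that is about 60 kubectl invocations instead of one per namespace, although the API server still has to finalize every namespace.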
1:34:16 Accelerating Cleanup and Redeploying Workloads
1:34:59 Slowly. Is there a faster... Yeah. I mean, because that node is now cordoned, none of those workloads should be able to run and continue to self-replicate. So I suspect that once we get them cleaned up, everything will just work fine. But did you modify the maximum pod limit on the node? Or no? I guess they're just queued up while they wait on each other exiting. Alright? The jobs. Yeah. I think Kubernetes makes sure that it doesn't spawn all the jobs at the same time. It queues them up. Doesn't it? I don't know, actually. Yeah. Not familiar
1:35:44 with how that handles it either. But I just need to wait another twenty, thirty minutes for this to delete. Yeah. But we can do it in parallel though, I think. Yeah. Well, let's hope it or not. Let's leave that deleting. Right? We're pretty confident we're not gonna get that back. I've opened a new control plane one. So let's try and get our deployment updated. But we can't uncordon that node, probably, until all those namespaces are away. And we could edit... we'll add a toleration to our clustered thing and just schedule it on the control plane.
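The relationship between a node taint and the toleration that lets a pod schedule past it can be sketched like this. It is an illustration under assumptions, not the edit made on stream: the transcript does not name the taint, so the key below is the usual kubeadm control-plane taint, and the helper names are hypothetical. The field names (`key`, `operator`, `value`, `effect`) are the ones Kubernetes uses for tolerations.

```python
from typing import Dict, List


def toleration_for(taint: Dict[str, str]) -> Dict[str, str]:
    """Return a toleration that matches the given node taint."""
    toleration = {"key": taint["key"], "effect": taint["effect"]}
    if taint.get("value"):
        # A taint with a value is matched by key, value and effect.
        toleration["operator"] = "Equal"
        toleration["value"] = taint["value"]
    else:
        # A taint without a value is matched on key and effect alone.
        toleration["operator"] = "Exists"
    return toleration


def add_tolerations(workload: dict, taints: List[Dict[str, str]]) -> dict:
    """Add matching tolerations to a Deployment or StatefulSet pod template."""
    pod_spec = workload["spec"]["template"]["spec"]
    tolerations = pod_spec.setdefault("tolerations", [])
    for taint in taints:
        toleration = toleration_for(taint)
        if toleration not in tolerations:  # avoid duplicates on repeated edits
            tolerations.append(toleration)
    return workload


if __name__ == "__main__":
    # Assumed taint: the default kubeadm control-plane taint.
    control_plane_taint = {
        "key": "node-role.kubernetes.io/control-plane",
        "effect": "NoSchedule",
    }
    deployment = {"spec": {"template": {"spec": {"containers": []}}}}
    add_tolerations(deployment, [control_plane_taint])
    print(deployment["spec"]["template"]["spec"]["tolerations"])
```

Adding the toleration to the workload, rather than removing the taint from the node, keeps every pod without that toleration off the control plane.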
1:36:24 Is that cheating? No. No. I guess not. Yeah. Let's just try and wrap this up in the next few minutes. I know we're a little bit over, so let's see if we can get through this. Did you join? Wait. One second. I did not join. Wait. Which one is it? The other one that, of course, was created a minute ago. Yep. I'm in. Alright. My connection dropped for a minute. Hopefully, you can hear me now. Oh, yeah. Yep. We hear you. Great. Do we want to move it to the control plane?
1:37:24 We can remove the NoSchedule thing from the control plane, or we can add a toleration to our workloads. Either works, I guess. I don't remember how to type a toleration off the top of my head, do you? Yes. No. So what I'm gonna do instead is edit the node, the... Because then those jobs may schedule on it. Oh, good point. So we cannot do that. Yeah. Exactly. So I do... Let's k edit the control plane node. We can do it. We can do it. Where are the things? I think the syntax comment is we
1:38:15 could have done xargs with a -P for parallel. I'll try and speed that up just now. And do -P 16 or something. It's a very... Yeah. We still got a few to go. 2,000 or so. Yeah. I mean... Yeah. Well, this machine has, like, 48 cores. Right? So we can do 48. Is this going faster? I don't know. In the meantime, how do I select this text? Usually, I have a global buffer in Vim. Now I don't know how to... like, it's actually selecting the text in Vim now instead of... I copied it
1:38:50 Adding Tolerations & Redeploying Clustered/Postgres
1:39:01 if that helps. Okay. So now we're just gonna edit the deployment, and then we're just gonna paste it in the spec here. And then we change the word taints to tolerations, and then... Is that it? Yeah. I think so. Oh, sorry. Off you go. K. Right. Okay. 1,700 to go. I think this should help. Yeah. Here we go. It's great. We'll have to do the same for Postgres. Right? Yes. Correct. Edit. Is that a StatefulSet? It is. It's, again, your turn to do your magic. Oh, yeah. My magic. I'm the paster. That's where we
1:40:11 are. Right? Well, you can do the rename. That's your skill. Exactly. Do you need me to do anything here? No. The tabs. You've done enough, Sid. K. There you go. We might have to... Yeah. But I think we'll need to do a little delete there. Gotta reschedule. Because it tries to be friendly towards stateful services. We've made that more difficult than we had to. We could just have added nodeName to both of those, with the control plane node, rather than going through the scheduler. But it's... Okay. It's up, I guess. Well, it's
1:40:59 not ready yet, but yeah. Now it is. Okay. We have it. Alright. Do you wanna do the image upgrade and then we'll do the port forward? I don't think we need to port forward to p one just... Yeah. I didn't really touch anything system-wise, so I suspect it'll work just fine. Okay. Let's look at this version. I'm just gonna, for a little laugh, take a look at our namespaces. 1,200 to go. I don't think that's gonna finish by the time we wrap up. Rollout deployment status. Rollout status deployment. Successfully rolled out. There we go. Well, let's
1:41:03 Upgrading Clustered App to V2 (Success)
1:41:41 give it a shot. Let's give it a shot. That's the dance. Nice work. Damn it. Both of you. That was tough. Yep. Yeah. I like the cordon strategy. I had the same idea as Arian to fix it, in terms of block Docker Hub and then clean things up. Yeah. You know, I don't know if it's just the pressure or what, but it's always so obvious that, yeah, we just have to cordon to stop this replication happening. Like, we knew it was replicating, but still we
1:41:43 Explanation of Cluster 2 Brokenness & Conclusion
1:42:35 were focused on trying to understand the why rather than actually stop... First, stop the cause and then investigate. Right? Instead of the other way around. Now we've got ourselves in a situation where we need to run delete for half an hour. I think that's the context that's important. Right? We know this is not a real production service, so we don't actually think about all of those other things first. And I think I need to try and flip that mindset in my head. Yeah. The cordon helped, but that was a pretty nefarious script.
1:43:08 That was... So what did it do? We actually didn't read it. And I guess the whole API server thing was... we were just confused about some output that didn't have anything to do with it. Yeah. Or did it? Yeah. I mean, it's just based on a kubectl container. I copied in my credentials. It polls to see if the clustered deployment has any pods. If it has pods, then it basically flips a trigger and from then on, self-replicates. A worm. You gave us a worm. Yeah. And so I put it in
1:43:40 a container on the worker because I figured if I put it directly on the worker, it might be easier to find. But if you're just seeing, oh, containerd is running some stuff, that seemed like it would look normal-ish. Yeah. I think we knew there was a process, but trying to find it was pretty challenging. Nice work, both of you. Yep. Very well played. Unfortunately, two really tough clusters. I'm gonna have to make sure whoever's on next week makes it a lot easier for
1:44:12 me, because I... We can then edit the deployment and scale it up to one and then we can celebrate. Like... Yes. That was it. I need someone just to come on and just do something really simple because these are tough. Okay. They seem to be just getting harder and harder. But hopefully, our exploration is still providing valuable knowledge to everyone else. And hopefully, not giving too many people bad ideas for how to break clusters. That would be unintended. And I do think that that will be a side effect, maybe. I don't know. Yeah.
1:44:43 I wonder how many people are gonna go write their own little worm container that does self-replication and brings down the machine. Well, so I got the idea for it because I do some work with Kubeflow, and I wrote a recursive Kubeflow pipeline and it got out of hand on me one time. And so the way that I stopped that was to delete the image that it was pulling and then go clean stuff up. But... Yeah. I think our containerd alias would have had a very similar effect there, just to stop it doing that.
1:45:11 I know we went over, but thank you both for sticking with us. Thank you for sharing your knowledge and horrible ideas with us. And I hope you both had fun and I'll speak to you all again later. Yeah. It was fun. Thanks for having us. Alright. Have a great day. I'll speak to you soon. Bye. See you. Ciao. Ciao. Take care.