About this video
What You'll Learn
- Trace API server restart behavior and stuck Pods by reading cluster logs and runtime diagnostics step by step.
- Reproduce Cilium networking failures, compare plugin and shim binaries across nodes, and verify containerd sandbox behavior with crictl.
- Detect and remove a malicious systemd-collector-d binary, then fix remaining Rawkode service settings to restore API server access.
Noel Georgi returns to debug Kluster 015, where a corrupted Cilium CNI binary and a malicious systemd-collector-d binary hijacking the containerd shim leave pods stuck. crictl, shim comparisons, and Rawkode service fixes restore it.
Jump to a chapter
- 0:00 Viewers Comments
- 0:20 Introductions
- 0:22 Introduction and Previous Recap
- 1:24 API Server Fix (Resource Limits) Recap
- 1:30 Kluster 015
- 3:28 Initial Pod State (Completed/Init) Investigation
- 7:12 Investigating Cilium (CNI) Pods
- 10:01 Debugging Init Containers and Sandbox Errors
- 12:26 Using crictl to Inspect Containerd
- 16:52 Corrupted Cilium CNI Binary Found
- 21:51 Replacing the CNI Binary on Control Plane
- 27:00 Persistent Sandbox Issues & Containerd Debugging
- 45:20 Comparing Containerd Across Nodes (Shim/Cgroup Differences)
- 52:27 Identifying Malicious `systemd-collector-d` Binary
- 1:10:18 Removing Malicious Binaries & Restarting Services
- 1:11:35 Verifying Containerd Fix & CNI Status
- 1:18:42 New Issue: API Server Communication Problems
- 1:23:56 Checking etcd & API Server Logs
- 1:28:16 Debugging Rawkode Application Pods
- 1:32:36 Fixing Rawkode Service Configuration
- 1:34:36 Cluster Fixed & Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:22 Introduction and Previous Recap
0:22 Hello and welcome to today's episode of Rawkode Live. I'm your host, Rawkode. This is Klustered part seven, part two. Work that one out. Last night we had an episode of Klustered, and I unfortunately had a power cut which put an abrupt end to our debugging. So I'm gonna be joined again today by Noel. We're gonna dig back into cluster 15 from Tim Hawken and see if we can work out what went wrong and why we couldn't work it out yesterday. As always, please remember to subscribe to the channel and click the bell. This means you
0:55 get notifications for all the awesome content as well as helping other people discover it. We have a very active Discord where we're always talking about cloud native, Kubernetes, and a mix of other technology-related subjects. So come in, join us, have some fun. Of course, my employer lets me do this on their time. Thank you very much, Equinix Metal. If you wanna check it out, you can get a $50 coupon by using Rawkode live. This will get you around one hundred hours of compute. Alright. Let's get back in the scene. Hey, Noel. How are you today? Hey.
1:24 API Server Fix (Resource Limits) Recap
1:26 Good. Are you ready for some revenge on this cluster? Yep. Okay. We're just gonna dive straight into this, I think. So let's get the screen share up. I've got the control plane node here. If you can just join, type echo, and then we'll get ourselves hopefully fixing a broken cluster. Well, let me try and recap while you do that. What kind of happened yesterday? Oh, you're already in. Magic. Yep. I'm in. After my power cut, I think you decided, oh, I'm gonna know what was wrong with this. And you dove back in and
1:30 Kluster 015
2:06 you uncovered a couple of things. So do you wanna just explain what happened after my power dropped? Yep. So I went in and started looking at the logs, and it looked like the API server was basically restarting, but we kind of missed that because we were just kind of restarting it ourselves. It was restarting with a stack trace. The stack trace was so huge, I had to scroll down a lot. Then I diffed the kube-apiserver files between the working cluster and this one and found the limits were, like, two,
2:37 though. So what happened basically was the API server was getting throttled. So it never properly, completely started up. So I think our, well, I think, yeah, our hypothesis last night was that there was something ruining communication between etcd and the API server. The problem was it was just never fully starting. A silly thing to miss, but I believe you have removed the dodgy resource limits and we have a working API server. Is that correct? Yes. We should have. I'll let someone run back and... Yeah. It looks like it's been running for
3:13 Yep. What is that? Thirty minutes? Yep. And, I mean, that's interesting, but I would expect it to be up for, like, twelve hours. Yeah. Bit longer? Alright. Okay. So does that mean... let me set up our aliases again. So we've got kubectl. Yeah. We have an API server that works. Okay. So, yeah, did you poke around anything else? I saw those pods were completed, but then you said you were going to reshare it, so I stopped there. I need to refresh. So kind of almost all the pods are kind of borked.
3:28 Initial Pod State (Completed/Init) Investigation
4:06 All the pods are completed. That is just wonderful. Alright. What? Completed? That's weird, isn't it? Yep. So when you describe it, what does it actually say? So if I just do... okay. Let me just add the completion too. I'll just tackle a couple of comments just now. Rawkode says, oh, the API server was restarted. Yeah. I don't know how we missed that. I mean, we were restarting it ourselves. Even when I watched it back, I realized I said let's skip the resource limits and mounts because I didn't know the API server was restarting. I just assumed it
4:53 wasn't an issue. Very naive of me on this show. I should always check everything. And will you just ask where the resource limit was set? I believe it was the API server. Yeah. Yep. It was a manifest. Nice. Okay. It could be something... this is something interesting. Oh, shit. Sandbox. Sandbox changed. I wonder if it's something to do with the pause pod. Yeah. It's making me think that either the root secret is... no. Maybe containerd has been messed with? I checked that containerd systemd service. It seemed fine. So I kind of stopped there yesterday. So
5:48 let's check it again. Yeah. Cool. Containerd. It looked fine. I couldn't really find anything wrong with it. Or there must be a config file for containerd. I don't... we should find a mistake there. Yeah. Well, yes. So normally, I've just got to see the environment file. I don't see anything. Let's jump into the /etc directory and see if there's a containerd .yaml or .toml. Yeah. No. Alright. I think it's fine because it's... Yeah. Yeah. Can do that. I'm not gonna assume anything is fine ever again after yesterday. Yeah. There's no config, it just seems to
6:40 run as containerd. Do you mind if I... what happens if I shut down containerd? Bad things happen, right? I was gonna run it manually on the command line just so we could actually see the log output, but I guess the more sensible approach would be just to use journald. That seems fine, I think. I don't see errors. Alright. Let me throw an idea at you then. So even before we fix the Klustered pod, right, if we don't get the networking up, none of the other pods would come up,
7:12 Investigating Cilium (CNI) Pods
7:19 right? That's true. Do you think we should focus our efforts on maybe the Cilium namespace first? Yep. Let's just try the Cilium pod and see if it's the same error. Okay. Cool. It could... Yeah. I think... why do we have more? Oh, wonderful. Yeah. Why not? Alright. We've got a question in the chat, so I'll tackle that. Navin Kumar is asking, what is a pause container? That is a great question. Yep. The pause container is what holds the namespaces and cgroups for all the other processes, and the networking as well. So it's created to bootstrap all of the
8:07 cgroups, namespaces, and networking stack, and then the other processes are launched into those namespaces so that they can work. And as far as I know, it just literally runs a sleep infinity, and that's all it does. It's just there to be the parent, the boilerplate, that substrate for the containers. Is that right, Noel? Yeah. Yep. That's what I also believe. It's there just to set up all those namespaces and cgroups, then you can just start up the rest of the pod. Like, a pod is a
8:34 collection of containers, right? And since the containers share the same network namespace, it's there to set up all those things. Yep. Because if our main container hosted it and it died, then we'd lose all the other processes and stuff like that. And sleep should never crash, I guess. Maybe that's why it uses that. I'm not sure. Okay. So that all was fine. So I'm just going to check the logs. Yes. Go for it. The fact that it says completed makes me think that we got an exit zero from
9:11 it, which I think is a little bit strange. Oh, we got... because this was a... oh, so it's restarting. I'm gonna throw something out there. If we do a Cilium get pods, we've got a whole bunch of stuff here. I'm kinda curious what happens if we just delete one of the pods, like, if it comes back and completes again or if it runs. I'm curious if this is affecting all new pods. Yep. Let's delete the one that's not running. Yeah. Let's not delete the one thing that is working on this cluster at the moment.
9:51 That's a great idea. But I guess it'll just keep on restarting. Oh, yeah. I'm hoping it goes to running. If it completes again, I'd be a little confused. And my first assumption was that something with containerd was tweaked. Like, is it something like the cgroup or the sand... I think the sandbox is related to the pause container. I could be wrong, but that's what I believe. Also, the other one is completed, right? But it didn't... oh, it didn't get deleted. It's still terminating, so I'm just gonna try and see.
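For readers who want the pause container answer in concrete terms, here is a rough sketch of the job described above, assuming a Linux host. As far as I know the upstream pause program is a tiny C binary that blocks in `pause()` and reaps exited children rather than literally running `sleep infinity`; the Python below only illustrates that idea, and the names `reap_children` and `run_pause` are made up for this example.

```python
import os
import signal
import time


def reap_children() -> int:
    """Collect every exited child without blocking; return how many were reaped."""
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped += 1
    return reaped


def run_pause() -> None:
    """Block forever, waking only to reap children or to exit on SIGTERM/SIGINT."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_children())
    signal.signal(signal.SIGTERM, lambda signum, frame: os._exit(0))
    signal.signal(signal.SIGINT, lambda signum, frame: os._exit(0))
    while True:
        signal.pause()  # sleep until any signal arrives
```

Calling `run_pause()` never returns, which is the point: the process does nothing except stay alive so the namespaces it holds stay alive with it.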
10:01 Debugging Init Containers and Sandbox Errors
10:30 So we've not got a new one. Oh, no. We do have a new one. It's running init. Okay. Yep. Okay. Let's wait for a bit. Nice coffee. I don't think that's doing anything, right? Yep. If I describe, I think we would probably get the same thing. Yeah. Shall we check the event log just in case something happened in there? Yeah. Let's check the events. Oh, so it got assigned, but it's, like, waiting on init. So it's probably the same thing. So the init could just... do you wanna... okay. We've got a describe. What's the name of the init container?
11:26 I think we have autocomplete. We should be able to log. clean-cilium-state. Should we try and get the logs from that? You'll need to load autocomplete. I don't think I loaded it. No. It's there. Oh, it's there. I loaded it. Cilium, what is the pod name? It is F 25. Yep. And it was clean-cilium-state. Yeah. Let's try to check the previous one. Tested keys. No. Not here. There is nothing. Oh, so it's just that the pod is not initializing. Oh, is it something to do with the scheduler? No. The scheduler has applied the label to
12:17 it though. Oh, yeah. It has. Okay. PodInitializing would tell me that it can't set up the pause container. Do we have crictl? Pod. Yeah. We should probably create that file, right? Yeah. Let's create that config. Yeah. Yeah. It's somewhere in this file. One of these files actually has the configuration for it. I just saw it yesterday. Well, it said no. I don't know if that's for us or not. But, well, he'd feel free to drop a little bit of extra detail there. What did I say that was wrong? Let's see, Kubernetes. Just tell us to create the containerd config.
12:26 Using crictl to Inspect Containerd
13:17 No. Not this one. Which has... do you know the containerd config? I have to, like, figure it out. It's, like, a container runtime socket, I guess. Yeah. Hold on. I'll find that. I just keep using my lead message from Discord many, many minutes ago. Right. This is a file, right? crictl.yaml? I've got it here. Or cri.yaml? That's it. Okay. crictl is working. So we should be... So the Cilium on this node is fine. So we're gonna need to configure containerd
13:59 on the node that the other Cilium was scheduled on and then see what containerd tells us is running on that, right? Yep. Okay. So let's jump over. I mean, it's not running on any other one. So let's jump onto g j. And I will create a containerd config. What was it? crictl.yaml. Yep. Weird bug. Right. Okay. Then crictl, that's saved. Yeah. Okay. So there is Cilium. That's not the new one, though. Can we just... But when we list the containers, you should see the pause container. I had never
14:56 like, when it's bad, I saw it with Docker, but I don't know if it shows... there should be pause. There should be a way to show the container. Or it's just like, oh, we have ctr. ctr. Test. Maybe it's not... I think we had to put in the namespace, k8s.io. There's also a c list. Do we need the namespace? Does that work? No. No. I think the namespace is k8s.io. I think it's... what is it? Get, man. C p r. Oops. I don't know. Let's try that.
15:43 Okay. Must be for the CRI runtime config? Where is, like, would it pass in the... Well, ctr would list every container on the host. I don't think anything is running on this containerd. Well, it is also suggested we try crictl ps, which may or may not display what we need. Let's move those comments up there. Yep. So nothing is running. That kind of makes sense. So let's try it. Get container. That seems fine. Let's try... do you wanna, should we get the, yeah, get the logs? Let's see if this container
16:29 is completing or exiting. Why is it not autocompleting? Oh, did you see that? Failed to find plugin. Yep. cilium-cni. Okay. I'm gonna stop this so we can try and maybe process some of this. So these logs are coming from the containerd runtime or from the containers that it's running? Like, do we have an /opt/cni? Or do we? Okay. Yep. We do have it. So it's probably that the Cilium pod failed to start, right? But the problem is none of the pod containers are running.
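The crictl config file created in this chapter is small. A minimal sketch of generating one follows, assuming containerd listens on its usual socket at `/run/containerd/containerd.sock` and that crictl reads `/etc/crictl.yaml`; the transcript does not show the exact contents used on stream, and `render_crictl_config` and `write_crictl_config` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: the default containerd socket path. Adjust if the host differs.
CONTAINERD_SOCKET = "unix:///run/containerd/containerd.sock"


def render_crictl_config(endpoint: str = CONTAINERD_SOCKET, timeout: int = 10) -> str:
    """Return crictl.yaml text that points crictl at a CRI runtime socket."""
    lines = [
        f"runtime-endpoint: {endpoint}",
        f"image-endpoint: {endpoint}",
        f"timeout: {timeout}",
        "debug: false",
    ]
    return "\n".join(lines) + "\n"


def write_crictl_config(path: Path, endpoint: str = CONTAINERD_SOCKET) -> None:
    """Write the rendered config to the given path, e.g. Path('/etc/crictl.yaml')."""
    path.write_text(render_crictl_config(endpoint))
```

With that file in place, `crictl ps` and `crictl pods` talk to containerd directly, which is how the empty container list on the worker node shows up.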
16:52 Corrupted Cilium CNI Binary Found
17:24 Like, if I do a crictl ps on the other node, I can see the pods. And here, there's, like, nothing. Why? Okay. Let me go back to the other one and get the nodes and see if it's actually ready. Okay. The nodes are ready and the pods are scheduled. Okay. So there's an error message in the containerd logs that said failed to reserve sandbox name. Okay. Yeah. What do you say? Like, grep pause. That is actually good. It's not running. And I suppose... Should we restart the CLT? Trying to do
18:26 is... give me a minute. So I was trying on the other terminal. It's like in the PCM, right? I just want to see which nodes they were scheduled on. So the one that is running is on, I guess, the master, the control plane, and the other ones that are, like, not completed must be on the other nodes. Yeah. Something has to be wrong with containerd. It's unable to start any container whatsoever. So should we try the other approach? Like, just run... So there's a warning, not blocking, devmapper.
19:16 Let's see. That's it. Oh, so there is a conflict. Why is it not killing? Oops. Just gonna kill it. Oh, sledgehammer. Alright. I approve. It must be the systemd one that's starting it up. Yeah. We need to stop it and then run it manually, I think. Okay. And let's make sure we run it with debug if possible. Debug. We'll do a log level debug. Yeah. -l debug. Okay. --log-level. Oh, that was... oh, what is this? Failed to destroy network for sandbox. Oh, /opt/cni/bin, and the folder was present.
20:40 Yeah. The folder was present, but we didn't have that binary. See if we can find something that's missing. Well, there's something interesting on the control plane node. Do you see that? So I thought I'd jump on the control plane and see if we've got the cilium-cni there. And we do, but it's also been replaced. Yep. And that's probably the reason why it's crashing. But it's working on the control plane, right? No. Actually, the pod is kind of restarting. So if you look at the restarts, right, it was, like, probably,
21:25 like, 360 restarts. Okay. So we revert the binary? Well, yep. Let's, yeah, just save it. Not that I don't trust you, Tim, but I am gonna move cilium-cni to cilium-cni maybe broken. And we'll bring back that old... Oh, yeah. cilium-cni. And then how can I restart? Let me just force this. I'm gonna kill our only running pod. Yep. Whatever happens. Hopefully that works. Is it possible that, if the control plane node does not have a working CNI, the other nodes would be unable to
21:51 Replacing the CNI Binary on Control Plane
22:25 start, or is that an unrelated issue? That I'm not sure. Okay. So Cilium is running. Let's see if it gets healthy. Could be... yep. It's running, right? Yeah. And I wonder if the other one is stuck in init because it's waiting on the control plane node to come online. Oh, no. We should start the systemd containerd on the other machine. Yeah. That's it. Okay. That still seems happy, at least for now. I wonder what that old binary was. I think it would not be able to start the other one on the other nodes because the binary was kind of corrupt. So
23:11 then, like, if it's not, then it won't be able to set up networking for the other ones. So I started containerd, and I'm gonna start the unit. It seems healthy. Yep. Yep. Okay. That must be, like... oh, it still had errors. And let me... yeah. Let's check that. Why is it still in init? We should probably delete that one thing, or is it not healthy? Well, do we need to get the binary in that directory, or does Cilium pull that itself? See, the init container is supposed to install
24:09 the CNI. Or is it not? I don't know. I know, at least for Flannel or anything other than that, I think the init container is supposed to install the CNI. So I guess it's the same thing for this. Oh, let's check the init container for Cilium and see what it actually is. So I'm just gonna pull up a Cilium manifest. It's, like, the Cilium quick start. Init containers. So it's supposed to run the init-container.sh script. And the name of it is... okay. It's called clean-cilium-state, so that seems fine.
25:17 Hold on. So you went to the init containers. It's got an init-container.sh, and this is pulling the binaries? Yep. Yep. Okay. So that being unable to start, it must be due to the sandbox problem. So we need to fix this. Although that's because it can't find the plugin. Where's the first error? Let's make sure we're not skipping anything. In fact, I'm gonna stop containerd. Again? And just run it manually. Oops. So I think it first says failed to destroy. So whatever was there previously, it was not destroyed. And then it's probably... so let's delete that
26:20 pod that is already running, which is stuck in the init state. Deleting stuff, I can do. Alright. Let's... No. Let's keep containerd running. So I'll just keep containerd running. Force, grace period. Alright. Those Ciliums are being nuked from orbit. I see the same error. No. Oh, we should check... let's open up the Cilium manifest and see if the volume mounts are right. If it doesn't have the hostPath mount, it won't be able to find that path, right? Okay. So edit the Cilium. Now... Yep. That was quite a lot. Let
27:00 Persistent Sandbox Issues & Containerd Debugging
27:21 me jump down. Let's switch to hostPath. That's easier. Okay. We've got /var/run/cilium. Looks... I mean, it's in... Okay. We have /opt/cni/bin. Yep. Yep. And where is it referenced? It should be referenced in volumeMounts, and the mount path should be /host/opt/cni/bin. Yep. That seems right. Alright. So we might be missing something related to actually bots and bug grab and probably we'll see them not coming up. Magnum was asking, are we allowed to give tips? The answer is 100% yes, please. If you've got anything, send it to us.
28:18 So we've got sandbox errors, containerd on the worker nodes. What are we missing? It feels to me that containerd has something stored, some state stored that we need to be able to flush somehow. We could, like, delete the /var/log containerd stuff. Does that have a cache? No. Does it have a... No. /var/lib? Oh, yeah. There's stuff out here. I'm half tempted to say maybe. Yep. But you should first stop containerd just to... yeah. Oh, it's not running. It's not running. Yeah. We've been running it manually. Okay. Yeah.
29:10 Why don't we... Yep. We quit. Maybe... Why don't we move it? Okay. So now there is nothing in there. Yep. It just doesn't even exist. Where are the other locations? I think it should create it. So is there gonna be... no. There shouldn't be anything else, right? Yeah. I think if you start again, it should just recreate it. Yeah. So we got a whole bunch... ah, look at that. Cleaned up orphan ID. Yep. It's cleaned up. It's, like, garbage collected. So yeah. There's loads of stuff in /run/containerd. And, oh, it's running now?
30:02 No. Oh, it's the other one. I mean, we're not getting any sandbox errors. That's good. Yep. So let's just start the systemd service, like, the normal one, and I'll just restart kubelet. Good call. Okay. That seems fine. Oh, it's doing something. It's doing stuff. Yep. Okay. It's good. Same here. Darn it. I keep doing this and I don't think it's ever been right yet. But what if Tam pulls a dodgy image? I'm gonna do my favorite change and change the pull policy. Why is that not... But it may seem right. It's getting it
31:12 from... okay. Okay. The local ones. docker.io/cilium. So I don't know why it's pulling from Docker Hub, like, the new manifest actually pulls from Quay, but I guess it's fine. So rollout... what was the command again? Oh, there were two images, right? What was there? There should be two, right? One for the init container and one for the normal one. Good call. Yeah. And we should do a rollout or, like, delete the pods. I mean, I'm gonna... what is it? Rollout, restart, ds. Restart. Okay. And then let's run a wee get pods watch on it. Okay.
32:10 Terminating, pending, pending, init, init. Come on. Yeah. It's not gonna move on from init, is it? Yeah. So the chat people are saying that you could kubeadm fix this. Yeah, of course. Like, we could have kubeadm fix the entire cluster, but we wanna work it out. I think the mission is more to work out what was done to the cluster and how it was affected rather than fixing the... fixing the bonus. It says it can't find... anyway, if you look at the errors, it says it can't find the plugin cilium-cni, and the
32:49 control plane has the right binary. So should we manually copy it over? Well, yeah. We could manually copy it over and see if... it's supposed to be automatically installed, but there must be something we are missing. So we could copy it over manually. And if I... let's see, sorry. /opt. See if I can. Oops. What just happened? Oh, let me log in again. I just logged out. Oh, I know. I might. Alright. Let's get that binary. So the CNI, then, hopefully there's a... hopefully, I can just get... in fact, we could just copy the script.
33:49 Right? What was this script? What was this CNI install? Here we go. Yep. I mean, we can just run this on the host. But I guess the binary is actually inside the container, so we actually need to get the binary out of that image. Well, I was hoping there was gonna be... yeah. Actually, there's a better way. So it register machine. So if I do, I think... if it has the image, I should be able to find the binary. You wanna find that from the actual image cache on the host? Yep. It's /var/lib, right?
34:36 /var/lib/containerd. Yeah. Right. find -name. Oops. cilium-cni? Yeah. Was that it? Yep. So let's see what we've got here. It must be in one of the... yep. For tires. I thought it was pulling the images. Snapshot. Yeah. Is there... let me try to see. I see here. It should list the images, right? Does it? Yep. So it's not... it's not pulled the image. I guess, is this image kind of correct? Let's run this on the host and see if the image ID matches... Yep. On the control plane.
35:48 Oh, there's a version 3.2, and here we have, like, 3.1. Yeah. It has the right ID. Oh, wait. So in one of the manifests the pause container might be specified, and in that it might be specified as 3.0, and we're just trying to pull 3.1. So if you look at the kubelet configuration, for that kubelet, right? Okay. Let's actually go there. Was it updated? This looks like... Where does it specify? Oops. What did I do? Let me create it. Where does it specify which pause image it uses?
36:53 That must be some configuration. I thought it was in the binary itself. Can I? No. It must be in one of the configurations because, in offline environments, we'll have to pre-pull this image. So it must be in one of the configurations, I have seen it. But let's check if we actually have access to the Internet. Okay. That's working. Can we use ctr to pull the image? Oh, sorry. Yep. I think we are on the same page there. ctr. I think, yep, we can just do pull... what is it? k8s.io/
37:34 gcr. Oh, what is that? k8s.gcr.io/pause. k8s.gcr.io/pause and then 3.2. And I would suggest we just pull the Cilium one too. 185. Let me just... so we have an updated one. Let me pull the Cilium one over here. Okay. So I never saw any error messages. Oh, wait. Hold on a second. When we ran containerd, it told us that all the images were there. Let's check for some weird configuration. Maybe it's missing some plugin or, like, the configuration doesn't... or it doesn't know it. Like... I think I broke it.
38:57 Alright. Start a new session. Just kill that process. Okay. I'm here. Okay. So you said let's check out the containerd config. I didn't think there was one. Yeah. And we already blew the state away. I don't think there's anywhere... I don't think there is a containerd config. Let's check in /etc. Let's see. No. There's nothing in /etc for containerd. Just try my... this is out. No. And why... let's do that? Why is it possible to specify? Alright. I wanna take a look at the containerd logs again. I'm not happy with something, but I don't
40:07 know what yet. Yeah. Something with... oh, okay. Finally, I actually got full screen. I don't know what happened, but the Teleport thing, right? Like, it's actually easy to read. Mine's just... containerd. I just see something must be running. Okay. So I can't read that. Sorry. Stop. Containerd. I don't get it. So it's something. containerd dash... The filter. Okay. Yep. I wish... we're running with debug logging. I'm about to say, I wish it wasn't so verbose, but I don't think the debug logs are helping. I could just run it without debug logs.
41:03 Yeah. I think you're right. Let's just do containerd. Wow. Oh, it's nice. Oh, are the containerd binaries okay? I never really saw anything that I wanted to see there. Oh, okay. I got... yeah. I don't know why that's great. Like, I'll do a Ctrl-C and kill the process. There we go. Just stop it and, like, yeah, pkill. Probably. I should probably do a -9. And let's look at the containerd binaries. /usr/bin/containerd. containerd only. That is... Yeah. It's the same binary. Yep. Okay. The containerd binary hasn't been touched,
42:41 but it's not working. And there is a containerd-shim, right? Well, yeah, there's a whole bunch. Why does the containerd-shim-runc-v1 have, like, a different timestamp? Yeah. But that's the same on the other machine, so I'm okay with it. Okay. Okay. Although, can it be... oh, no. Because we do it. Okay. That's why there's a different amount of them. Yeah. Those look fine. So, okay. The other machine also has the right one, right? Okay. Well, it suggested we run the status to see if it left the config file.
43:35 It's in a modprobe over there. Supposed to use a cgroup driver instead of a six fifty driver? Well, let's try the status on the control plane node to see if... Yep. If anything's different. No. This is a working node, right? Okay. I have, like, so many terminals open. I don't know which one. Okay. Got it. B p. Okay. Am I in two worker nodes? Yeah. So I'm on the same machine trying to compare the binaries. I'm such an idiot. Too many times. See, that's different. So if we look at the control plane, it's actually using the containerd
44:19 shim runc driver. But on the other one, it's actually trying to use the systemd cgroup driver. So it must be somewhere in the containerd config. Yeah. Good catch. Okay. Also, let's look at the binaries on the control plane since we are here. So there is something called containerd. So something was modified. And if we do... So I bet the shim has been swapped out. That's what it is. Yep. Yep. Oh, not here. /usr/bin. Alright. We wanna go... Yep. I think the shim was probably swapped out. Probably just
45:12 reinstall containerd. Yeah. Yeah. That's like Hertz, right? Oh, I mean, just uninstall it. Now let's do the apt reinstall. It should really fix it. I forgot how your computer. There we go. Okay. But we should also check the systemd config file, because it must be there and specify which cgroup driver to use. Well, I don't think there is a config, which means it's just using the shims that exist. Ah, okay. Yeah. Let's try to start again. Yeah. Let's do a status and see if it looks a bit healthier. No? Okay. No. It's using something different.
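Eyeballing timestamps across terminals is error-prone, as the mix-up above shows. One way to make the binary comparison mechanical is to checksum the binaries on each node and compare the results. This is only a sketch of that idea, not what was run on stream; `hash_binaries` and `diff_manifests` are hypothetical helpers, and the `containerd*` pattern under `/usr/bin` is an assumption about where the binaries live.

```python
import hashlib
from pathlib import Path


def hash_binaries(directory: Path, pattern: str = "containerd*") -> dict[str, str]:
    """Map each matching file name in `directory` to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_manifests(good: dict[str, str], suspect: dict[str, str]) -> dict[str, str]:
    """Describe every file that is missing, extra, or different on the suspect node."""
    findings = {}
    for name in sorted(set(good) | set(suspect)):
        if name not in suspect:
            findings[name] = "missing on suspect node"
        elif name not in good:
            findings[name] = "only on suspect node"
        elif good[name] != suspect[name]:
            findings[name] = "checksum differs"
    return findings
```

Run `hash_binaries(Path("/usr/bin"))` on a known-good node and on the broken one, copy one result across, and `diff_manifests` lists exactly which binaries were swapped or added. The same approach works for `/opt/cni/bin`.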
45:20 Comparing Containerd Across Nodes (Shim/Cgroup Differences)
46:03 So it must be let container d service. Yep. It was I don't think this file should have been when we reinstall, this should be the right one, but let's confirm. If it exists, I think it would keep it the same. No. Let's look at the right font from the control pane also. System. Let's see. Let's get continuity. So there must be config file that says this used the wrong one. Is there any, like There must be something hidden. Right? Yep. There's a specific configuration files. We could say, like, which driver to use. You could, like, change it, but, like, it
47:09 must be in this continuity config. I don't know which yep. Let's see if it actually can list the config. Yep. Dumb. Yeah. And it it says the shim is container d shim. False. And let's try that container d config dump. Get driver. Well, need to just check-in our system d system. We don't have a system d c group equal to false. And what is the other one? System d c group false. Yep. It's saying why is it using the wrong one? Oh, there it okay. This is a Alright. What does the control plane have for
48:14 plug in or disabled plug in, required plug in? Is there anything in there? Okay. Let's minus c five plug in. So You can just less it. I'm just I'm gonna go through it from the top. So why don't we try that today? Sure. So I've got version two. Version two. Root. Yep. On you go. I'll let you know that if it's different. Okay. So there's a root. There's a state run container d, plug in directly, disabled, home score, required plug in. GRPC, that seems yeah. I'm done. Yep. C group part, t t r c. What's the time out section? So I'm up
49:03 to the time out sections. Debug is fine. Time out. Okay. Plug ins. Should do the report. Oh, so this is where the cut image was specified, the sandbox image. Yeah. We've got 3.1 over here. It's, like, too much con should we copy it over and do a dev? Yeah. Let me copy the worker then. Okay. It's much easier than, like yeah. Them worker conf. Wait. We this is the same working on itself. So there's no point in fixing it here. We should copy it from the control plane. Oh, this is the plane. Sorry. Okay. Sorry.
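Reading two config dumps aloud section by section works, but a diff gets there faster. A minimal sketch, assuming the output of `containerd config dump` from each node has been saved as text; `diff_configs` is a hypothetical helper, and the sample values in the usage note are made up rather than taken from the cluster.

```python
import difflib


def diff_configs(control_plane: str, worker: str) -> list[str]:
    """Return unified-diff lines between two config dumps (empty if identical)."""
    return list(
        difflib.unified_diff(
            control_plane.splitlines(),
            worker.splitlines(),
            fromfile="control-plane",
            tofile="worker",
            lineterm="",
            n=0,  # no context lines: show only what actually differs
        )
    )
```

For example, if the two dumps differ only in a line such as `sandbox_image = "k8s.gcr.io/pause:3.2"` versus `sandbox_image = "k8s.gcr.io/pause:3.1"`, the result contains just the file headers, one hunk marker, and that pair of lines. An empty list means the dumped configuration is identical, so the difference has to come from somewhere else, such as the binaries or the systemd unit.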
50:00 We're comparing this to a containerd... Oh, yeah. Dump. Can I just, like, dump, save the file? So, diff the worker. There's... Oh. So why is the worker not using the systemd... oh, hold on. I think it's, well, this could be done by an environment variable. That would be sneaky. That is right. So okay. This is on the control plane. I'm looking at that. Containerd. No. I'll just grep. I guess I can't understand. Right? So we need to remember the issue is CNI related. I think we both believe that containerd is unable to
51:01 start the CNI because of the systemd scheduler. Yep. It's using some wrong, like, shim. Let's see systemd. Find dot dash type f, grep. Oh, that's gonna be the containerd configuration here. Alright. Let's try and use the containerd... Like, it's gotta be able to, not the... Yep. What was that, though? Yeah. Like, there has to be something. I just find it so strange. Oh, so it sees cgroup as system. Okay. Yeah. It's using the cgroup. And if I check it on the control plane, it's using the system.slice containerd.service and, but, yeah, it's still been
52:17 then it's starting the containerd process, which starts the containerd-shim-runc-v1. And here, it's running it inside... I assume kubelet has nothing to do with this. Right? Well, yeah, containerd would be running independently on the host and speaking over gRPC. So... No. I want to, like, because of that namespace. Right? Like, if you look at the status, it says, like, the namespace k8s.io. So this must be some very silly configuration which we have... Okay. Let's think about where can we provide systemd overrides?
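The comparison attempted above (dump the containerd config on the worker and on the control plane, then diff the two) can be sketched in Python. This is only an illustration, not what was run on stream: `diff_configs` is a hypothetical helper, and the two config snippets are made-up, flattened sample text rather than real `containerd config dump` output.

```python
import difflib


def diff_configs(worker: str, control_plane: str) -> list[str]:
    """Return only the added/removed lines between two config dumps."""
    lines = list(
        difflib.unified_diff(
            control_plane.splitlines(),
            worker.splitlines(),
            fromfile="control-plane",
            tofile="worker",
            lineterm="",
            n=0,  # no context lines, only the differences
        )
    )
    # The first two lines are the "---"/"+++" file headers; "@@" hunk
    # headers are dropped by the prefix check below.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]


# Hypothetical sample text (assumption: values chosen for illustration only).
CONTROL_PLANE = 'version = 2\nroot = "/var/lib/containerd"\nSystemdCgroup = true\n'
WORKER = 'version = 2\nroot = "/var/lib/containerd"\nSystemdCgroup = false\n'

changed = diff_configs(WORKER, CONTROL_PLANE)
print("\n".join(changed))
```

Lines prefixed with `-` exist only in the control-plane text and lines prefixed with `+` only in the worker text, which is far quicker to read than paging through both files side by side.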
52:27 Identifying Malicious `systemd-collector-d` Binary
53:10 Did we really reinstall on this machine or the... Yeah. On this machine. Grep dash r dash i. Okay. So how about our lib? Lib. Systemd. Okay. Let's take a look. Oh, let's just go into the target. Multi-user. I wonder if this dash, minus, is that disabling the overlay? We don't want that. Right? I think dash is something. It's just loading the overlay module. Okay. But, like, it says... okay. Let's try that, though. But I guess that's fine. I think, like, I have seen dash. Yeah. It's a dash thing. It's normal in systemd config. I don't
54:19 know what it really does, but I'd have to read the man page. Yeah. Okay. Status. Alright. There's, like, a file that, like, there's, like, /etc/systemd/system.conf.d. Okay. Where do you wanna go? /etc/systemd and system.conf. Oh, this is system.conf. Right? We don't have it. Let's try /etc/default. That's where, like, sneaky things sometimes go. Let's see. Default. Okay. Oh, we do have a default. Grep. There's nothing in there. Let me see how it is getting started. Let's... systemd... What if this is, what if it's a
55:49 separate service? Right? What's the... Oh, wait. Wait. Wait. I guess, so if I get the file, oh, no. It's in the same thing. Right? So I was wondering, like, if it goes into /lib/systemd, it might just use it for the systemd thing. Okay. So control plane. Something's weird. Right? We've restarted containerd a whole bunch of times. This systemd-collector stuff has been running since April 6. I think it's running independently and it's just, maybe, I think we can just shut that down. If we think it's causing a problem, I think we could just shut it down.
56:24 Yep. Probably. Why is systemd-collector-d there then? Yeah. I'm killing it. Right? So... Alright. pkill systemd-collect... or collector-d. Yep. D. No. That's it. Okay. I think it's just come straight back. Yeah. Okay. Systemd has a command that tells you which unit file started the process. Right? But I can't remember what it is. If you do a cat or status, it shows, like, the file. Right? Let's do a tree. Like, if I get it... There's, maybe it's not tree. Let's just do a status completely. Right. It's saying it's registered under that slice.
57:42 Oh, so if I check the same thing on... it was just systemctl status. Right? Yeah. Yeah. Please don't tell me it's the same. No. It's different. Okay. Looks the same. The one there is degraded on the control plane side. It's degraded. Oh, look. There's a systemd-collector on all the services. Why is this? I don't know what this is. So the control plane doesn't have anything related to systemd-collector. There must be some config file we are missing. Just because... Ah, let's see. C dash. I'm gonna check the other cluster. I need to know if this process is a red
58:51 herring and we're chasing something that is just completely irrelevant. Oh, yeah. I hope I don't see it here. So this is the control plane. Yep. It's, no. It's not using collector-d. It's just the... Yeah. Control plane. This is the... So the degraded is, like, fine. Yeah. We need to get rid of that collector thing. Wherever that is, that needs to go. That's kinda where my head's at right now. Do you agree? And can you log in to a control plane also from the other cluster just to make sure if you're using the same thing?
59:31 Okay. It's the control plane, like, a worker node. Oh, the worker node. Yeah. Of course. Yeah. Okay. So we need to get rid of this. This systemd thing is doing something bad to our containerd, and we need to get rid of it. Let me try to find something like... Oh, there's a containerd.conf in /etc/modules-load.d. Let's check that. Yeah. Just this. Already. I'll just confirm it's the same thing. I guess it's the same. It should be. Overlay, br_netfilter. Those are, it just needed all those
1:00:32 modules. That's fine. Trap hunting. Okay. There's, like, one file. So we could do... did we have a grep for systemd-collector? I wonder if this is the systemd config. Like, if we go to... I wonder if it's some sort of process that's supposed to do nice things and it's currently being abused here. No. Yep. Boy, I think it's hard to use lsof, or we could even, like, run containerd with strace and see which config file it is loading. But it's, let's see. Let me... if we run the status, right, and filter for collector,
1:01:30 it runs on every unit. Like... Every unit has this thing now, or almost. The containerd ones. Right? Teleport and containerd. Teleport. Why? Oh, that is interesting. So it must be, like, a CPL or something. Should we just Google for whatever that systemd-collector-d is? Just Google about it. Yeah. I did a quick Google. I didn't find anything that had me worried. I got just the stable. The stable. Oh, s as in Munich dinner? It's I just want to see which one is in it. Presses, which says started as been in it export.
1:03:12 I wonder if systemd is running in a container. No. It must not be. That would be an impressive trick. Yep. I feel like, I don't know. I am so confused. It's strange that Teleport is also doing the same. Does Teleport use containerd under the hood? No. Teleport is installed on the nodes because I don't trust people not to break it if I run it in a container. Okay. Those are all binaries. I was just looking if there was, like, a symlink or something. And it's definitely using /usr/bin. Right? Let's check. Yep. Yep. /usr/bin/containerd. /usr/bin/containerd-shim.
1:05:16 Systemd... Yeah. You could go ahead and... it seems like the most innocent process in the world. But yeah, I feel like it's the cause of all our problems right now. And I can't work out how systemd is, like, it's running as a subprocess of containerd and Teleport, but at the same time, hence the... Should we just stop the, like, if there's, like, a systemd-collector-d, systemctl status systemd-collector-d. Right? No. There is nothing. Yeah. Last year on entry. Grep. Yes. It's fine. I'm tempted to remove the binary and restart
1:06:21 containerd. Is that too aggressive? Fine. We could, like, stop containerd, like, reinstall it again just to make sure. Like, if we just say /usr/lib/systemd, systemd. In fact, what's the timestamp on that file as well? That's innocent enough. But if we... what else is in that directory? I don't trust anything. It's tricky. The timestamp is strangely unique there. Oh, okay. The binary. And on the control plane? Yeah. Let's check the worker, the control plane. You're right. Yeah. It's under /usr/lib/systemd. Oh, there's nothing called... He just left everything in that directory.
1:07:43 Yep. Yeah. Our timedated has got a different timestamp too, and there's no collector. Collector-d? Is it, like, we could just delete the collector-d binary and it could just pick up the shim? Yeah. I think we can delete that file, and I'm gonna... Move it to a, yeah. Let's just move it to, like, a dot old or something and, like, yeah. I'm gonna try something. Hold on. Systemd. What was it? /usr/lib. The /usr/lib. Okay. Okay. Slash systemd. Yep. What are you? gRPC address, so it's something related to
1:08:34 containerd. Yeah. Let's let's go in time just now. Right? And then I'm not happy with the systemd, the timestamp. Timedated. What are you? Yeah. You look nefarious. Let's get you out the way. Systemctl daemon... I don't think we need a daemon-reload, but restart... Yeah. Containerd and kubelet. Yeah. That's, like, the status of containerd. I checked the status of every... oh, why is that... oh, we have to kill that process. What? Can we just remove it? Yes. Let's do that. Yes. You do. Yeah. Let's check there. Grep collector-d... Sorry. Go
1:09:48 away. Yeah. Go away. Oh. It did. I have no idea what that file was, but it scares me now. Let's run get pods. Yep. Look at that. It's working. Ah. So, like, you just needed to, like, kill all those. Correct. Remove the binaries and do it on other nodes. Yeah. I think we need to go into the other... do you wanna grab x c, and I'll grab k d? Sure. That's /usr/lib/systemd. Move systemd-collector to /tmp. Move systemd-timedated to /tmp and then restart containerd. Timedated too. Yep. It's using the shim now.
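The check that exposed the rogue binary (listing the same directory on a healthy node and on the broken one, then looking for extra files and odd timestamps) can be sketched like this. It is a minimal illustration under stated assumptions: `compare_listings` is a hypothetical helper, and the file names other than `systemd-collector-d` and `systemd-timedated`, along with every date, are invented sample data.

```python
def compare_listings(
    suspect: dict[str, str], reference: dict[str, str]
) -> dict[str, list[str]]:
    """Compare {filename: modification date} listings from two nodes."""
    # Files present on the suspect node but absent on the reference node.
    extra = sorted(name for name in suspect if name not in reference)
    # Files present on both nodes whose modification dates disagree.
    changed = sorted(
        name
        for name in suspect
        if name in reference and suspect[name] != reference[name]
    )
    return {"extra": extra, "changed": changed}


# Hypothetical listings (assumption: dates are made up for the example).
CONTROL_PLANE = {
    "systemd": "2021-02-01",
    "systemd-journald": "2021-02-01",
    "systemd-timedated": "2021-02-01",
}
WORKER = {
    "systemd": "2021-02-01",
    "systemd-journald": "2021-02-01",
    "systemd-timedated": "2021-04-06",
    "systemd-collector-d": "2021-04-06",
}

report = compare_listings(WORKER, CONTROL_PLANE)
print(report)
```

A differing timestamp is only a hint, since package updates can legitimately change it per node, but an extra binary that no healthy node has is a strong signal worth chasing.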
1:10:18 Removing Malicious Binaries & Restarting Services
1:11:21 So it's getting past the init container. Should we, like, diff the files and reinstall? Because the shim timestamp was kind of, like, different. Right? Got it. Should we just, like, do a reinstall on other nodes also, the containerd reinstall? Let's check the nodes. Let me check the nodes and see if it actually... Well, we're still not getting our Cilium pods. Yep. I'll just, like, reinstall it. Yeah. Don't have the node. Yeah. I'll do it on the k d. It's weird. It's like, I don't know if we've missed the systemd configuration
1:11:35 Verifying Containerd Fix & CNI Status
1:12:28 option or if the systemd configuration picks up the fact that that binary is there and runs it somehow. It could be that containerd picks it up because when it did a help, right, it said one of the arguments was, like, something related to containerd. So it must be something like, if it's present, I think it'd, like, just pick it up. Come on, five. Containerd is... Let me see. Not removing. Oh, maybe I need... I guess I should have stopped it first. Yep. We should have killed it. Yeah. I'm gonna do that there.
1:13:09 Update another terminal now. Mine doesn't stop. I see, like, a lot of sleep. So on the other node, I actually see a lot of sleep when I do a systemctl status on the containerd thing. Yeah. A lot of sleep sixties. Oh, I killed it. Yep. A lot of sleep sixties. Okay. Entering. Good. There we go. I'm gonna make sure that's definitely gone. Oh, okay. It's back up. Let me check the node pods. Yeah. My containerd seems healthy. Why does it say, like, inspector? Wow. I know it's needed. So what, yeah. We deleted the /var/lib/containerd, didn't
1:14:49 we, as well? Maybe we need to do that on the other machine. Yep. Oops. I should have... Mhmm. X c. Oh, it's okay. I'm just gonna stop kubelet also just in case and just start it again. rm -rf. I think the star. Star. Only Tim can explain what the systemd-collector-d is here. Like, Google itself doesn't know what it is. Okay. So containerd is exiting because... Yeah. If we do that, if we get the logs again, you'll see there's an exit. Compare the new template. I run Cilium. At the binaries also of CNI.
1:16:21 Exiting because of a signal. Oh, but if you look at the Cilium CNI timestamp, it is also kind of... in this node that I am in, like, the x c node. It was just modified today. Not... Oh, no. We did that. Right? We renamed the broken one and put the other one back. Yep. And the control plane was fine. I meant the worker node, the one that has, like, x c, the one I'm in. Oh, yeah. What was the directory called? /opt/cni/bin. /opt/cni/bin.
1:17:11 Oh, yeah. That's because it downloaded it when the image finally started. Alright. Okay. Yep. Let's find that. Yeah. I think that's okay. I think there must be something running on these devices sending a signal. To Cilium. Let's check all the processes. Wonder what this is. Why? Let's see. JavaScript. Why is it Java? Hubble? Potentially. Let's grep that process ID. Pgrep. Not there. Well, it's gone now. Is it like a cron? That would be sneaky, wouldn't it? Yep. I could, like, do a crontab -l. I do normally tell the breakers
1:18:42 New Issue: API Server Communication Problems
1:18:47 I won't look in cron, but I think given how much time was spent on this, like, screw it. I'm gonna crontab dash... Oh. That's bad. That's bad. Alright. Let's check the other nodes. Yep. Okay. I'll fix it on mine, the one that's in x c. Wow. Every minute. Did we get them all? Yep. Let me check. I think so. Okay. I guess so. Thankfully, this was not on the control plane. So that explains the signal and why CNI was shut down. Yep. And why is it, like, taking too long to respond?
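The cron discovery above (an entry firing every minute) is easy to check for mechanically. The sketch below is an illustration only: `every_minute_entries` is a hypothetical helper, and the sample crontab, including the `pkill` command, is invented, since the actual entry is not readable from the captions. It only recognises the literal `* * * * *` schedule, not equivalents such as `*/1`.

```python
def every_minute_entries(crontab_text: str) -> list[str]:
    """Return the commands of crontab entries scheduled as '* * * * *'."""
    flagged = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # environment assignments (PATH=...) or malformed lines
        schedule, command = fields[:5], fields[5]
        if all(field == "*" for field in schedule):
            flagged.append(command)
    return flagged


# Hypothetical crontab text (assumption: not the real entry from the cluster).
SAMPLE_CRONTAB = """\
# m h dom mon dow command
PATH=/usr/bin:/bin
0 3 * * * /usr/local/bin/backup.sh
* * * * * pkill -f example-agent
"""

suspicious = every_minute_entries(SAMPLE_CRONTAB)
print(suspicious)
```

Feeding it the output of `crontab -l` from each node would give a quick answer to "did we get them all?".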
1:19:46 Oh, we probably just opened up a new can of worms, haven't we? Oh, let's check. Okay. I would probably, like, let's check if any crontab exists here also. Maybe, like, slowing it down or something. And while he was talking about the containerd thing, but that was never there. Right? So it's, yeah. We've just broken the control. Should we check the manifest to see if it bothers, like, let's do the verbose logs. Yeah. Go for it. Please feel free. Dash v. Ah, so if you went ahead and, like, so it's hitting the server. It's not
1:20:40 like any SSL thing this time. Actually, that's, like, too verbose. Probably, this is enough. Two doesn't really go with anything. No. I don't want to, like, go over, like, but it went up to the API server. Right? Like, it says, like, we can see the request went up to, like, eleven five hundred. That's fine, I guess. Okay. Let's look at the server logs. Yeah. Yep. Oh, we forgot to check the crontab. Right? I did. It's all good. Yes. Nano, not Vim. So, but there was nothing in there. Oh, okay. Yeah. Let's check the API server, the last
1:21:28 container. ls, less kube. Context... It's an etcd timeout. That is not a metav1.Status. Just gonna scroll up a little. I guess even if we had fixed the limits issue yesterday, we would have been still going down a big rabbit hole. This is one tricky damn cluster. Is that today or yesterday? That's today. Right? Yeah. Looking at creating, what is this? Mandatory flow control. I just saw something. What is it? Yep. Do you think we should go take a look at the manifest again? Yep. Testing keys. It's not restarting, though, is it?
1:22:40 Like Yeah. Let's not restart. Oh, it's still running. Oh, so there must be one of the issues to API server. Let's search for limits first. Oh, sorry. Now this I actually removed all the limits, but I guess it's fine. Oh, no. It was there. Okay. Oh, it's running now. I just typed it wrong. Okay. So Should we try get again? Just to Try a different namespace just for no real rhyme or reason other than I'm running out of ideas again. Yeah. Let's just check the default, which has, like, the least. Okay. Let's do dash dash v. Oh, dash
1:23:35 v equal to nine. And we'll less on it. Oh, so it's kinda stuck here. So it's the API server taking time to respond. Is etcd running? Oops. Give me a minute. I accidentally closed it. Are you happy? Yeah. You're happy. Should we check the... No. It's not. That's only been running for two seconds. What? Date. Date. Yeah. Okay. It's been up for, like... Yeah. The time on the server is wrong. What? 1:24 PM. Oh, no. No. Okay. Yeah. That's... Yeah. That's right. So etcd has only been running a few minutes. So probably it's, like, the API server that's kind
1:23:56 Checking ETCD & API Server Logs
1:25:00 of timing out, which is, like, since etcd is not really up. Wow. It got restarted again. Right? Yep. We need the etcd logs. Yep. What log? Like, wrong folder. Yeah. Thank you. /var/log. There we go. Yeah. Let's find the last one. Alright. I think I find yours. That looks okay. Oh, we should check the limit for etcd. Probably the limits for etcd are also set too, or it probably, like, takes a long while too. And since we are now hitting it with more load, yep. Done. This better be the last fix. Come
1:26:13 on. Okay. So it's restarting just now because we don't have etcd. I'm gonna do a quick search for the limits in this directory. Alright. Okay. Okay. Yeah. Good. Good. It's taking its time. Taking a long... is there, like, limits set for kubelet too, like, from systemd? I'm gonna restart the kubelet. Yeah? No? Sure. Am I just being... Let's wait for one more minute, Cap. I'm just being impatient, aren't I? That's a long time. Should we just, like, kill it so that it starts it? There we go. Okay. Yep. It's fine. K. I think you will have to get
1:27:23 healthy. So... Yep. We can always... Well, let's check the API server. Yep. Health. Health. Okay. No. I feel like we're close. Let's get it. That's errors. Top and tech okay. Yeah. That's errors. Okay. Yeah. Pod security policy. No. How could it, like, break one sister fixed Centimeters? Object to... That doesn't appear to be too healthy. Oh. Oh, it's back. Oh. I hate that show. Okay. Ceph is not up. Let's start very important. Yeah. Are we using any more volume for that? Yeah. Funny, Valid. Sorry. I know what you're saying there. So Ceph is not running, but I was
1:28:16 Debugging Rawkode Application Pods
1:29:04 asking, like, are there any pods that use, like, a PVC? Yeah. The Postgres one. Ah, okay. I think it might just take a bit. Let's check if the Cilium pods are all up. I think we should just, oh. Oh. So we have two running. So there's only one that's in crash loop now. They're like... so how many nodes do we have? Four? The worker node and... one control plane and three workers. Three. Right? We fixed Cilium. Right? That was the cron job kill. Unable to reach API server. Okay. So we did restart some stuff. So I'm going to
1:29:54 restart Cilium. That's better. Right? Yep. Sometimes, see, the eventual consistency. Sometimes, like, at times, it takes, like, forever to, like, reconcile. That's good. There we go. Things are happy. Yep. Oh, great. And... Don't worry, Miles. We are all over it. We are restarting everything within our power. Rawkode gets... is it DS? Oh, no. It's Rawkode. Yeah. Get deploy. There's only... You should copy. Those are coming back, right, I think? Yep. I guess it's, yeah. Let's check the pods. Maybe you should just, like, yeah. Don't know. Do the pod. Okay. I don't wanna spend too much longer
1:31:01 on this. I think we've, it's been good fun, but... Yep. We've fixed a lot. And I'm hoping if we just keep an eye on this Rawkode namespace for a moment, we'll start to see that come up here. There was a lot in this story. Yeah. It's our name. So Rawkode has so many components. Yeah. It's also because of the machines I'm installing these on. So, like, these are quite beefy machines, and it creates an OSD for every disk that is available to be consumed by the CSI driver. And, you know, Rook and Ceph are really cool, but the more
1:31:39 No. I think CSI is, like, kind of pretty heavy if you look at the number of pods it creates. Almost... I think I run Longhorn. It also has a ton of pods running almost all the time. Okay. I guess it's all back up. Yeah. This is getting healthy. Let's check our actual application, and I'm gonna go to my local terminal. But let's see if we can port forward. Yeah. Rawkode. Yep. Rawkode. Clustered. 16. Export config. K. Get pods. Port forward. Alright. Let's see. Okay. So we're not, probably, Rawkode thing. Right? We should probably just delete the
1:32:34 Postgres one. Ah, he's changed the service. Just one last kick in the shins before you... alright. Let's see. Target port. Thank you. We can just remove target port. It's not a big deal. Alright. Port forward. Yep. DNS. Oh, let's check if CoreDNS is actually up. Should we just restart Postgres? Because since we had, like, three restarts, like, the vault. Oh, the back end load controller is actually... That's okay. I don't care about the, the Packet one I think is alright. I think I like your idea about restarting
1:32:36 Fixing Rawkode Service Configuration
1:33:35 Postgres. I thought the service... So when you... Sorry. Go. Oh, so when I list the pods from the machine, the control plane, I only see, like, two pods, but, like, you see three. Oh, that's weird. Yep. So if I do just get pod, I just see, like, two pods and, like, you see three. I don't know why. Is it, like, the right cluster? Oh, no. It's the fifteenth cluster. It's 16. Wow. Alright. Okay. K. K. Port for... well, at least we found another bug injected by Matt Anderson then. Thanks, Matt. Yep. That cluster was fun too. Right.
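On the Service fix mentioned a moment ago (removing the changed target port): in Kubernetes, a Service port with no `targetPort` falls back to the value of `port`, which is why deleting the field can be enough. The sketch below illustrates that rule with plain dictionaries. The helper names and the port numbers are hypothetical, and because declared container ports are informational, a mismatch here is only a hint, not proof of a broken Service.

```python
from typing import Union


def effective_target_port(service_port: dict) -> Union[int, str]:
    """targetPort defaults to the Service's own port when it is omitted."""
    return service_port.get("targetPort", service_port["port"])


def service_reaches_container(
    service_port: dict, container_ports: list[dict]
) -> bool:
    """True if the effective target matches a declared port number or name."""
    target = effective_target_port(service_port)
    return any(
        target == cp.get("containerPort") or target == cp.get("name")
        for cp in container_ports
    )


# Hypothetical values (assumption: the real ports are not in the transcript).
CONTAINER_PORTS = [{"name": "http", "containerPort": 80}]
TAMPERED = {"port": 80, "targetPort": 9999}
REPAIRED = {"port": 80}  # targetPort removed, so it falls back to 80

print(service_reaches_container(TAMPERED, CONTAINER_PORTS))
print(service_reaches_container(REPAIRED, CONTAINER_PORTS))
```

The same check accepts a named target such as `"http"`, since a Service may refer to a container port by name instead of by number.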
1:34:34 Ta da. Yeah. This is just a Firefox bug. It doesn't like the encoding on my video. That's the only reason it doesn't... No. It's fine. Yeah. But we have fixed, I guess. We have fixed this cluster. Yep. Very close. Yeah. I don't think this video works on Firefox. I need to re-encode it with something else. Phew. There we go. Well done. Yep. Alright. Well, I know we put a lot of time into these two clusters last night and today. So thank you for joining me again, Noel. Yep. That was a really, really tricky one, but
1:34:36 Cluster Fixed & Conclusion
1:35:14 we got there in the end. Perseverance definitely prefer. So thank you again for joining me. Have a great day and I will speak to you again soon. Yep. Bye.