About this video
What You'll Learn
- Trace an application update failure in cluster 17 by isolating a container runtime error during pod startup.
- Find and stop a suspicious node-level debugger pod while validating other deployment and networking signals.
- Resolve cluster 18 by detecting CoreDNS NXDOMAIN behavior and removing a broken mutating webhook.
Marcos Nils joins to debug two Kubernetes clusters: cluster 17 from Sascha Grunert (a rogue node debugger pod and a containerd 'honk' error pointing at BPF) and cluster 18 from Billie Cleek (CoreDNS NXDOMAIN rule and a malicious mutating webhook).
Jump to a chapter
- 0:00 Viewers Comments
- 1:23 Introductions
- 1:24 Introduction & Show Overview
- 2:56 Introducing Co-Host Marcos
- 3:47 Starting Cluster 17 Troubleshooting
- 3:50 Kluster 17 - Broken by Sascha Grunert
- 4:43 Initial Cluster 17 Checks
- 7:37 Cluster 17 Application Upgrade Failure (v2)
- 8:33 Debugging OCI Runtime Error ('Honk')
- 10:03 Investigating Cluster 17 Configurations
- 19:08 Investigating Node-Specific Issues
- 21:58 Discovering Rogue 'Node Debugger' Pod
- 24:04 Examining Rogue Pod Manifest
- 25:33 Debugging Inside Rogue Pod
- 26:40 Finding Suspicious Host File
- 27:50 Analyzing Rogue Killing Script
- 28:29 Stopping Rogue Service & Pods
- 32:16 Retesting Application Upgrade (Cluster 17)
- 33:15 Cluster 17 App Works, 'Honk' Remains Mystery
- 46:40 Switching to Cluster 18
- 46:55 Kluster 18 - Broken by Billie Cleek
- 47:16 Cluster 18 Initial Diagnosis (API Server Down)
- 49:08 Debugging Control Plane Components (Cluster 18)
- 54:39 Cluster 18 Application Networking Issue
- 55:48 Debugging Networking from App Pod
- 56:47 Identifying DNS Issue (Cluster 18)
- 57:35 Checking CoreDNS Configuration
- 58:30 Found: CoreDNS NXDOMAIN Rule
- 58:47 Fixing CoreDNS
- 1:00:59 Discovering Mutating Webhook Issue
- 1:05:40 Deleting Problematic Mutating Webhook
- 1:06:03 Verifying Cluster 18 Control Plane Health
- 1:08:17 Cluster 18 App Connectivity & Upgrade Test
- 1:08:40 Cluster 18 Resolved
- 1:10:00 Kluster 17 Revisited
- 1:10:13 Revisiting Cluster 17 ('Honk' Mystery)
- 1:12:46 Searching for 'Honk' Binary/Config
- 1:15:58 Debugging Containerd on Worker Node
- 1:21:45 Inspecting Containerd Configuration
- 1:24:22 Cluster 17 Conclusion (BPF Suspected)
- 1:27:18 Wrap-up & Thank You
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:24 Introduction & Show Overview
1:24 Hello, and welcome to today's episode of Rawkode Live. I'm your host, Rawkode. Today is Klustered, the thing I hate every single week, but oddly enough, kind of enjoy: having fun fixing broken Kubernetes clusters. So I'm gonna be joined today by Marcos. I'll introduce him in just a moment, but we are gonna attempt to fix two Kubernetes clusters. The catch is they are broken by members of the Kubernetes community. I will introduce each cluster as we start it and hopefully as we fix them. A little bit of housekeeping before we begin. Please, if you're not subscribed to the YouTube
2:01 channel, do that now. Click the bell, get notifications every time a new episode goes live. I make it part of my role and I'm, you know, supported by my employer, Equinix Metal, to produce cloud native learning materials so that we can all learn this vast landscape together. There's also a very active Discord channel. There's a couple of hundred people in there now. We're talking about cloud native. We're talking about Kubernetes, and we're talking about technology, Rust, Go, loads of cool stuff. Feel free to pop in there. If you're not watching live, it is the best way to ask questions
2:32 about any cloud native technology and to suggest new episodes. Win win. And thank you to my employer. Again, they sponsor my time to do this. You can try out Equinix Metal. It is a bare metal cloud, with the $50 code Rawkode live. This will get you around one hundred hours of compute. So, you know, use that as wisely or unwisely as you want, small instances, big instances, whatever. Have some fun. Alright. My cohost today is Marcos Nils, a software engineer in the gaming industry and all-round container and Kubernetes ecosystem member. Marcos, why don't you tell us a little bit
2:56 Introducing Co-Host Marcos
3:06 about yourself, please? Thanks, David. Welcome, everyone, to another episode of Klustered. My name is Marcos Nils. As David correctly said, I'm a software engineer at Wildlife Studios. It's a gaming company. And I've been in the cloud native world for several years already, since the early stages of Docker. I've co-contributed and co-authored Play with Docker and Play with Kubernetes, which are online environments where you can try and play with Docker and Kubernetes. And happy to be here. Looking forward to seeing what the cluster breakers brought us today. Looking forward is a bold word, but I
3:47 Starting Cluster 17 Troubleshooting
3:47 like the optimism. So let's just get started on cluster number... oh, this is number 17. Wow. Good fun. Alright. We have, as always, Teleport. Marcos and I both have access. I am going to quickly click connect on the control plane of this cluster. If you can join that session and give us a little echo hello to confirm that you are here. Getting there. I will do that thing where I resize the bottom because we always cut it off and it's super annoying. And I will introduce the breaker of this cluster. This cluster is broken by Sascha Grunert,
3:50 Kluster 17 - Broken by Sascha Grunert
4:21 who is the SIG Release co-chair or chair. I can't remember. Yeah. Chair. SIG Release chair, involved in the Kubernetes security space, and just an all-round awesome Kubernetes contributor. So thank you, Sascha, for taking the time to break this cluster. I can see that Marcos has given me a smiley face, which means we are ready to start. I don't know if you've watched one of these before. I think you have, but it's always nice to start with an optimistic get pods, get nodes and see if we have any access to the control plane. So
4:43 Initial Cluster 17 Checks
4:53 why don't we just why don't we configure We need to do all the Yeah. All the bootstrap right here. All the boring bits. So we'll get kubectl working. I'll set up the kubeconfig, and then I'll give you the honors of typing whatever Kubernetes command you want to see if we have access. The stage is all yours. Go for it. So here we go. So let's try to get nodes. Oh, it's fixed. We can move to the other one. Right? I wish it was that easy. Okay. So the nodes, apparently, they seem to
5:24 be ready, which is cool. First thing that I'm gonna do, David, because we have been bitten by this, is this. Okay. This is very nice. Yeah. Because, you know, we have been bitten by this before. But in any case, let's try to see the pod status. So let's try to see all pods in the cluster. Everything seems to be running, which is Sorry. I broke it. Went to the other screen. I need to remember Yeah. I need to remember to hit control r on the right screen. There we go. No worries. So it seems like pretty much all the
6:02 pods are ready. So let's see on the default namespace, what do we get? Apparently, everything seems to be fine. So I think we're gonna find some issues regarding communication with the pod, maybe networking related stuff. I think we need to try port forwarding and see if we can access the service at this point since everything seems to be okay. It does look like a working cluster. Right? I mean It seems so to me. Yeah. No. Okay. Well We can try to see the service maybe. We can try to see
6:40 the service of klustered. It's going endpoint. Endpoint? Yeah. So seems to be okay, Adam. I would just try port forwarding and see what we get. Alright. Let me pull up another terminal. It's a bit easier to do that from here. And I have my Rawkode cluster directory with cluster 17, export the kubeconfig, and just make sure I can run get pods. Yes. Okay. So let's do port forward. Klustered. Itty. Itty. Itty. Our klustered is working. Ah, see. I told you. Nothing to fix here. So that's definitely interesting. Let's try the upgrade just in case. So
7:37 Cluster 17 Application Upgrade Failure (v2)
7:42 if we can yeah. Good call. Let's do an edit on that deployment and see if we can modify the image, see if it works. And you should just be able to change it to version v two. Right? Yeah. Oh. Oh, okay. Okay. Now. Now, I was starting to worry there that someone had forgotten to break a cluster, but it looks like It was going to be way too easy. So let's see. Oh, to copy and paste this, I need to use insert and shift insert. Right? Yeah. Control and insert and shift and insert should keep you
8:26 Yeah. Keep you going. So back off failed container. Honk. Oh, OCI runtime create, honk, unknown. Oh, interesting. There's just never any shortage of honks on this. Let's show that's what's here. Yeah. So the first thing that I would do is I don't know the image pull policy that we have here. So let's see the edit, deploy, klustered, image pull policy always, so seems to be okay. This is the right address, I believe. Right? That is correct. Yes. So maybe we can try creating just a raw pod, a raw deployment to
8:33 Debugging OCI Runtime Error ('Honk')
9:20 see if that's gonna fail also. I'm just trying to see if we see the same thing on, like, a regular deployment or just on this deployment because maybe this deployment has a specific configuration. Oh, makes sense. Maybe let's do k run --image nginx nginx. Get pods. Container creating. And running. Oh, so it's running. Interesting, because this means that this is only affecting this specific deployment. So what could affect the deployment within our cluster? Okay. I'm thinking maybe two things right now. I'm thinking admission controllers perhaps. Okay. K get admission controllers. I need to be valid.
10:03 Investigating Cluster 17 Configurations
10:12 If you've got completion, that's gonna help. But validating webhook admission, whatever that big long thing is, or type api-resources. Yeah. Yeah. I can never remember the name of it. Yeah. It's way too long. Let's do a grep. Okay. K get validating. It's not completing for me. We'll try and fix that. Do it at the end. Dash A. I don't know if these are yeah. There's nothing here. We can try mutating. Yeah. Let's try the mutating one. Webhook configuration. Just in case. Interesting. Interesting. Because it seems like since this is only affecting this
11:03 this deployment let's see the deployment, though. Let's see if we see something funny there. Yeah. Good call. And we accept suggestions from the audience. Strategy, nothing weird. Metadata, labels. We could try changing the label because maybe something is watching that label and doing something funky. That shouldn't affect anything in particular. We have the protocol, the image, resources seem to be okay. Mhmm. Termination message policy. The end policy seems to be okay. Default scheduler seems to be okay. Reset quality okay. Yeah. I don't see anything weird here. Shall we go take a look at the
11:52 so there's also static admission controllers, which would be configured on the manifests on the API server or kubelet. I can't remember which, but we should probably go check if any of those have been turned on. Yes, sir. Queue. But look, I have suggested that autocomplete isn't working because of the alias. Yeah. We'll check that in a second. You might be right. No. So the only admission plugin is NodeRestriction, which is fine. I would expect to see that there. No authorization. No. No. That's oh, it's this
12:30 one. Yeah. Yeah. Yeah. So there's the NodeRestriction one, that is fine. That's pretty standard Yep. That I remember. Let's see if these files were changed. Good call. They don't seem to have been changed. They have not changed. Yeah. Yeah. So someone suggested something on the chat. Waleed is suggesting we take a look at containerd and see if the configuration for that has been modified in any way. That would make sense. The error message that we've seen did say that it was an OCI container runtime problem.
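The admission plugin check discussed just above can be sketched as a small helper. This is only an illustration, not what was run on stream: the function name `enabled_admission_plugins` and the sample argument list are hypothetical, and it assumes the API server's arguments have already been copied out of its static pod manifest into a Python list. The `--enable-admission-plugins` flag itself is a real kube-apiserver flag.

```python
def enabled_admission_plugins(command):
    """Return the plugins named by --enable-admission-plugins in an argv list."""
    prefix = "--enable-admission-plugins="
    for arg in command:
        if arg.startswith(prefix):
            # The flag takes a comma-separated list of plugin names.
            return [name for name in arg[len(prefix):].split(",") if name]
    return []


# Hypothetical argv, shaped like the command list of a kube-apiserver static pod.
sample_command = [
    "kube-apiserver",
    "--authorization-mode=Node,RBAC",
    "--enable-admission-plugins=NodeRestriction",
]

plugins = enabled_admission_plugins(sample_command)
# Anything other than NodeRestriction would be worth a closer look in this scenario.
unexpected = [name for name in plugins if name != "NodeRestriction"]
print(plugins, unexpected)
```

With only NodeRestriction enabled, as seen in the episode, the `unexpected` list stays empty, which matches the conclusion that the static admission configuration had not been tampered with.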
13:08 So Yeah. It was in /etc. I think it was containerd, the .toml. Yeah. It doesn't have to exist. So you might wanna do a systemctl cat containerd and see if there's any overlays on that. Yeah. I don't seem to see anything here. No. That looks good. Maybe on system, systemd, containerd, find containerd. Maybe here. Let's see. Yeah. I think at this stage, I'll look at everything. Yeah. Same thing as we saw before. Nothing that's that's that's what tells all the way around. The inner key. Nothing. Looks good.
14:19 Yeah. I don't see anything weird here. Can we see the error message again? The error on the, yeah, on the deployment? Yeah. So I need to It looks like our pod. Did our pod go away? Maybe it rolled back to the other no. It didn't. Oh, is it running now? Where's the Oh, it's running. Where's the crash loop back off? Yeah. We have nothing. It's fine. An intermittent error. Let's put it So it pulled and it started. I mean, that honk, that's injected. Right? That was not an intermittent error. That's intended
15:09 Yeah. To attach. Honk unknown. That's weird. OCI runtime create failed: honk: unknown. However, we didn't get that on the other one. Let's try changing the deployment label, but that's gonna affect the service. Right? So we need to change the service as well. Why don't we try and downgrade the container back to version one and see if we get the same error again? Just to confirm that it's affecting only this deployment. Image. That's the thing. Sometimes the breakers, their intended breakage can be a little bit subtle, and that honk in the error message is
15:53 is gonna annoy me, so I wanna get to the bottom of it. Yeah. Yeah. Yeah. The honk thing is, like, very mysterious. Alright. So that just worked. Let's try upgrading it again to v two. K get pods. This is okay. V one. V two. Yeah. Same thing. Okay. So it seems to potentially be either in the same But now it's running. Yeah. It's intermittent. Describe pod. I think Alright. Let What do we We looked at all the static manifests. Can we take a look at the kubelet unit file? If that's been modified in some way. Feeling
16:59 that, I think we need take a random, save the control plane somewhere. Well, have I lost you? I'm talking to myself. All right. Okay. I'll do some typing just now until Rawkode is back. So let's see. I don't like subtle errors because oh, almost like Marcos is potentially still typing. At least I can communicate through the power of Teleport. Thank you. Alright. Marcos can hear me. So we'll continue for now. I am gonna take a look at the systemd unit file for the kubelet. And see if something's been passed in that I don't expect to see.
18:16 Marcos has now dropped. Maybe maybe I'll be back in a minute. So it's oh, hey. You're back. Okay. Let's see. Go keep the restart always blah blah blah. Boding. Boding. Boding. Can you hear me? I can hear you. Hello. Hi. Sorry. I'm back. Yeah. I'm just trying to see if I can spot any of it. Right now, I'm thinking there's something weird with the kubelet, but just because the static manifests haven't been modified, containerd doesn't have a config. We haven't looked to see what node the pod has been scheduled on to see if
19:01 that affects, like, where the NGINX one was scheduled versus where the klustered one was scheduled. Perhaps one of the worker nodes has been called. Oh, yeah. Maybe one of the kubelets in the worker nodes are yeah. That would be the case. Has the CRI runtime changed? It's not containerd and it's something else. It's like a honk binary or something. Yeah. It's just these things are difficult to work out when they're subtle and intermittent, but it looks like NGINX ran just fine on this FFST. Our klustered one is on that, running fine.
19:08 Investigating Node-Specific Issues
19:36 I wonder if the failed ones were scheduled to another node potentially. Although, I don't think a restart should reschedule it, if my understanding is correct. Yeah. Yeah. Yeah. That's why maybe it's working now because it's landing into that node, basically. Right? So maybe what we can do is we can go to that node and bring down the unit so the pods get rescheduled and see what happens. That's a good idea. Okay. Let's pop open FFFT eight and I will shut down the kubelet. Oh, I wanna drain it first so that we actually see the
20:11 it is key. Yeah. Yeah. We can just drain the node as well. Right? Or like Oh, yeah. I can do that from here. Yeah. We could just cordon the node and then see if I Yep. I was gonna try and do it from the node itself. That would be silly. There we go. And the little parts and then drain it. Yeah. Just drain it. Yeah. No. DaemonSets. I thought drain would reschedule that already. Oh, because of the I think we've got a storage error. Right? Cannot delete pods managed by We can use ignore daemon
20:57 sets. Yeah. When in doubt. Blue dash nine. I hope we haven't scaled the only working node. We broke everything pretty much. Right? Yeah. Alright. We got yeah. Things are evicting now. Okay. Let's see what happens now. We can always I think maybe restarting the deployment should have been easier because now we have all these pods that are not gonna be able to get scheduled. Right? Yeah. Rook and Ceph is gonna be mighty unhappy with me. That's a good point. Yeah. It is what it is. Yeah. It wouldn't be a Klustered episode if I
21:49 didn't make stuff worse before we fixed it properly. So, yeah, I'm getting used to it now. And we don't have to watch we didn't look Sorry. You go. One thing we didn't look at is if there are any rogue pods in the cluster. Right? Because maybe there's something weird running. I don't know. That's a good shout. Yeah. So it's running on that node. Shall we change? One thing that is weird, David, is that this only happened when we bumped to v two. Right? When we were using v one. When we rolled back to v one, we
21:58 Discovering Rogue 'Node Debugger' Pod
22:31 didn't get that error. Right? We didn't get any restarts, I believe. Not that I saw. You're right. Let's bring that node back just so Rook and Ceph won't yell at us too much. If we just uncordon, then we can test your hypothesis. We'll downgrade klustered, then upgrade it again and see. We need to uncordon this one. Right? Yes. And that should get Rook and Ceph at least to stop warning us. Pending. Pending. Okay. This should be okay. Let's see if we get any yeah. We don't get any errors on this
23:08 one. So it's definitely something, like, looking at this deployment. I see a raw workload. Speaker? No. That's MetalLB. The node debugger. Node debugger. So that is not ours? It's definitely not part of a standard Cluster API cluster. No. It's the only thing that looks different. Is it a it's not an RS. It's a pod. It's, like, twenty nine hours ago, so it seems like it could be a rogue thing. It's a it's a. Seems like. Right? Oh, there's three of them. Yeah. So potentially a. Yeah. Don't see it as a. Right? Alright. Must
24:04 Examining Rogue Pod Manifest
24:07 be a deployment just with replicas set to three. I didn't see it twice in at least in QC. We're gonna have to describe it then. Let's describe one of the pods and see if we can get some information out of there. To me. So let's do k. This is. System. Describe pod. What do we have here? Priority zero. This is out time. It terminated. Seems like they are just Thunderbots. Weird. I mean, it's got a hostPath mount of root. That's not good. Oh, I didn't see it. Oh, this is yeah. Definitely not good. Let's delete this pod.
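The red flag spotted here, a pod that mounts the host's root through hostPath, can be checked for mechanically. What follows is a rough sketch rather than anything used in the episode: the function `risky_host_access` and the sample manifest are hypothetical, and the input is assumed to be a pod object as a dict, for example one item from the output of `kubectl get pods -A -o json`. The field names (`spec.volumes[].hostPath.path`, `spec.hostPID`, `securityContext.privileged`) are standard Pod spec fields.

```python
def risky_host_access(pod):
    """Return the reasons a pod manifest (as a dict) has broad access to its node."""
    spec = pod.get("spec", {})
    reasons = []
    for volume in spec.get("volumes", []):
        host_path = volume.get("hostPath")
        if host_path and host_path.get("path") == "/":
            reasons.append(f"volume {volume.get('name')} mounts the host root")
    if spec.get("hostPID"):
        reasons.append("shares the host PID namespace")
    for container in spec.get("containers", []):
        if container.get("securityContext", {}).get("privileged"):
            reasons.append(f"container {container.get('name')} is privileged")
    return reasons


# Hypothetical manifest, loosely modelled on what the hosts describe seeing.
sample_pod = {
    "metadata": {"name": "node-debugger-example", "namespace": "kube-system"},
    "spec": {
        "hostPID": True,
        "volumes": [{"name": "host", "hostPath": {"path": "/"}}],
        "containers": [{"name": "debugger", "image": "alpine"}],
    },
}

findings = risky_host_access(sample_pod)
for reason in findings:
    print(reason)
```

A pod that trips any of these checks is not necessarily malicious, since some node agents legitimately need host access, but in a cluster where nobody remembers deploying it, it is a good place to start looking.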
25:07 Well, I'm just do we want to delete it? I mean, I made this mistake before. Should we jump on it and look at the bash history or something? You want to check on this and see what's in here? I don't know. I don't know. This? Yeah. Let's do some debugging. Right? Let's go inside that pod. The image seems to be, like, a standard Alpine image. Right? So it's not something funky here. Yes. But if we at least go into the container, we can go into the proc
25:33 Debugging Inside Rogue Pod
25:45 file system. Oh, no. Because I've got the whole root one. Depends on the namespace. Like, what I'm curious about is, is it running any commands on my host right now? So that's why I don't want to delete it. Let's do a ps aux and see if we can see something rogue. I don't know. What is this even running? Because it's oh, maybe they created this. That's it. So Nuno is asking, who is the villain? That would be Sascha. So we can do exec, system, exec -ti. We can go to the one that is running,
26:19 which is this one. Yeah. Waleed is suggesting the same as me: we should be taking a look at that shell history and see what our sneaky sneaky people have been up to. Oh, this is in the okay. In the PID namespace, the same PID namespace as the host, from what I see. What do you see? Sorry. I think I'm a little bit behind you. I'm inside the pod, the container basically on that pod. Oh, so it's I mean, same PID namespace as host. So this is like a privileged. Yeah.
26:40 Finding Suspicious Host File
26:51 I was worried about using hostPath. Okay. So it's got hostPath root. It's got hostPath. Is definitely It's on yeah. On host. What's that k.service? Where? /host/k.service. Let's take a look at that. That looks new. /host/k.service. Oh, yeah. Oh. Oh, /bin/k. Yeah. This is something definitely interesting. What is this /bin/k? I'm hoping there's gonna be a shell script with a shebang. Should we go peek at it? Do we have five here? No. That's it. But it's on the host path. We could probably jump out. Oh, yep.
27:50 Analyzing Rogue Killing Script
27:50 So I'm gonna get outside here, there. Uh-huh. Oh, yeah. This is messing up with yeah. So it basically oh, randomly kills containers. It's like another way of doing crontab. Nice. See? It was about crontab. Yep. So it's an I just wanna connect it to speak to containerd that is just running a kill command. Got it. Yeah. runc, and it's randomly then killing the container, basically. Okay. I think if we kill the node debugger pods. How would you call the service? I think it's being called from that node debugger service.
28:29 Stopping Rogue Service & Pods
28:42 It's this one. Right? K. Yes. Status k. So we can do systemctl disable k. But it's not running, though. When is this running? And we can do stop k. Okay. This is not running. So I guess we can delete the node debugger? Right? Now that we know what they're doing. But we don't know how they're there unless they're just pods that are named sneakily to look like other objects. Yeah. Shall we delete them? Yeah. I think they're gonna come back. I think they're in the static manifest directory with
29:29 a hidden somehow. That's fine. Delete. But Goodbye. Oh, I forgot the grace period. Hopefully, you might need to force it, especially since that one's in an error state. Goodbye. Let's see if it comes back before doing this because system. Oops. Get pods. It didn't come back? Yeah. It didn't come back. So let's go with the other one. Okay. And the last one is in error. Do you want to leave it here just in case for the sake of, you know, life? Or Kubernetes is an error state. Can we see the logs? Do you have previous on it
30:37 so we can get anything out of it? Or even just check the logs of the running one. I guess this is cache. Did I type Wrong namespace. Oh, no. It's right namespace. No. It's okay. Two fifty logs. Yeah. That's right. Dash. Dash p after logs. Yeah. --previous, -p, I think. Go for it. Exchange that says not found. Weird. I don't know why it's saying Where's it scheduled? Let's do it. -o wide. Sounds good. Let's see if then get pods -o wide. S two. Two g l v. Let's pop on s two.
31:27 Oh, we're on s two. That's the control plane. Oh, firewall. /var/log. Right. /var/log/containers. This is called node. Right? Node debugger. Yeah. That's there. Yeah. Cat node debugger. Oh, it's basically what I did on the on the on the scene, I believe. Oh, daemon-reload. Oh, we have the evidence now, so we can chase the criminal. So there's some daemon-reload stuff here and systemctl enable --now. Okay. So, yeah, it's like I think that's it. I think we much. Yeah. Yeah. Sorry. Let's see if we get the same Yeah. Go for it. K get pods.
32:16 Retesting Application Upgrade (Cluster 17)
32:24 Edit, deploy, klustered. Let's go to image here. View let's do v one again. I don't know if the breakers are on the chat, but I don't know if there's anything else we should be looking at. I believe Billie is here. I've seen a comment earlier. I haven't seen anything from Sascha, so I'm assuming that he's not available. But let's see if we running, And let's go back to v two. Yeah. If we can upgrade this, I'm gonna assume that the k service was the I don't know how that was gonna honk
33:02 there though. Seems like it's running just fine. Seems like I don't know where the honk was coming from. Yeah. You're right. I think that's fixed. I mean I think the objective is for us to be able to upgrade the application, browse to it. It's running. I mean, I could do the port forward. We did check the static manifests, Malik. We've never seen anything nefarious. The only thing Static, but we didn't see any changes on the static manifests. Right? Yeah. Nothing had changed. We sent it an all host while it's it's it's He's telling us to be thorough.
33:15 Cluster 17 App Works, 'Honk' Remains Mystery
33:42 And Billie has said, hey. Hey, Billie. Alright. Let's just for completeness, we will jump on to each of the worker nodes, and we will check the static manifests. Yeah. Billie, your cluster is coming up in a minute. Sounds good. I'm following you. Yeah. I think Sascha's just modified the thing we we we think. I'm feeling confident. Could be. Yeah. I'm counting this as a win. I say we move on to the next cluster. Sascha can let us know in his write-up if we missed anything. The honk is bugging me slightly, but the
34:25 objective is I agree. To upgrade our application and we haven't browsed it, but I will do that for completeness. So Let's do one more thing. I'm wondering if the honk is still there, but only when it tries to schedule the pod in one specific node because one worker node was particularly you know? Do we have pods in all the nodes, by the way? Sorry. Can you say that again? K get pods. If we have pods in all nodes. Maybe there's a particular node that it's not scheduling anything. No. We have pods in all
35:05 nodes, so it doesn't seem to be a containerd Yeah. The DaemonSet is running. There's stuff running everywhere. Oh, yeah. Yeah. You're right. Yep. Yep. Great. Alright. I don't know what else to do. Yeah. I'm gonna just check if there's been anything deployed to here that would be not part of my standard Cluster API cluster. But Yeah. That's the thing though. Like, Sascha is involved in security and seccomp and all this weird stuff that I wish I understood even 10% of what he did. So there are so many subtle things he could have done to
35:45 this cluster, and I don't wanna go chasing them just because we think that they're nothing's broken. It seems to be working. Unless someone has any recommendations on where to look at for the honk thing. Waleed is always full of suggestions. He's telling us to scale our klustered thing. So let's make that the call. If we can scale klustered to every node on this cluster, that's it. It's fixed. Alright? Sounds good. Yep. I think Waleed suspects there's something else, but I mean, I think we all do. But if we don't see it, we don't see
36:22 it. There it is. Are you doing -o wide to see where it got scheduled? Maybe we can catch it. No. It got oh, they all went to the same node. That's weird. Alright. Waleed, I should never have listened to you. Okay. Let's go back then. Yeah. I have been yeah. Seems like it's I think one node is particularly That's the thing. Right? No. I was gonna say, like, that systemd service that was running and killing containers, we stopped that. So when we get a run container error, that's confusing me. Alright. Let's do So that's on our control plane. Oh, it's
37:23 the same node. Yeah. Yes. Node, like f f f f s t eight. Why only the yeah. OCI runtime create failed: honk: unknown. Damn honk. It's weird because it's a warning on the Kubernetes event. Right? It's not an error. Failed to create pod sandbox. Honk. OCI runtime create failed: honk: unknown. So It has to be containerd. Something containerd or the kubelet. Shall we check out the kubelet config? We didn't check that, right, on the systemd. Okay. Yeah. Let's go into you wanna go into that FF worker.
38:27 Right? So we can do oh, it's in that, yeah, in that worker only. I think I want to do something. I would like to cordon that node specifically and scale up the deployment to see if it happens on the other node as well. Because I don't know if it's only that node or all of them, basically, that run container error. Just scale it. Just scale it to 30 replicas and then run a wide watch, and we'll see if it happens on others. And we can always grep that if we want just to filter.
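The idea in this exchange, scaling up and then seeing which nodes produce the failure, amounts to grouping unhealthy pods by node. A minimal sketch of that grouping is below. It is an assumption-laden illustration: the function `waiting_reasons_by_node`, the node names, and the sample data are all made up, and the input is assumed to be the parsed output of something like `kubectl get pods -o json`. The fields used (`spec.nodeName`, `status.containerStatuses[].state.waiting.reason`) are standard Pod fields.

```python
from collections import Counter


def waiting_reasons_by_node(pod_list):
    """Count containers stuck in a waiting state, keyed by (node, reason)."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting")
            if waiting:
                counts[(node, waiting.get("reason", "Unknown"))] += 1
    return counts


def make_pod(node, state):
    """Build a tiny hypothetical pod object with one container status."""
    return {"spec": {"nodeName": node}, "status": {"containerStatuses": [{"state": state}]}}


# Hypothetical pod list: two workers, failures on both, one healthy pod.
sample_pods = {
    "items": [
        make_pod("worker-a", {"waiting": {"reason": "RunContainerError"}}),
        make_pod("worker-a", {"running": {}}),
        make_pod("worker-b", {"waiting": {"reason": "RunContainerError"}}),
        make_pod("worker-b", {"waiting": {"reason": "CrashLoopBackOff"}}),
    ]
}

summary = waiting_reasons_by_node(sample_pods)
for (node, reason), count in sorted(summary.items()):
    print(node, reason, count)
```

If the same reason shows up on every node, as it did once the hosts scaled the deployment, the problem is unlikely to be a single misconfigured worker.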
38:59 So replicas. Let's see. Alright. Just go to stupid number. These are big machines. K get pods -o wide. Container creating. Had detailed Yeah. There's an error here. Oh, yeah. It seems like it's happening in all nodes. Yep. Yeah. Okay. So replicas. We can come back to three. I don't know. So let's see the kubelet configuration on any of the worker nodes. Right? Yes. Okay. Let's I'm on FFST eight if you wanna join this session. I'm gonna go there. Join session. I'm there. Okay. So we've got a pretty standard unit file. Let's take a Yeah.
40:01 Yeah. We should see a config on the environment file. Like, both path. Yeah. That looks normal. So Of course, didn't that weird thing. Okay. Oh, sorry. I was just gonna do a cat on the No, it's no worries. On the kubelet thing again and see and just walk our Oh, I'm doing the wrong file. And just walk our way through these files. So we've also got how do I do We have the page. Do I mean there we go. Oh, maybe we can check the logs of the kubelet. Something might pop up there. Yeah. Or even the containerd logs might
40:57 actually be a good shout too. I always forget the logs are a thing. Alright. So let's do a journalctl for containerd. Continue and time to get drivers. I mean, I see errors, but nothing that helps. Hey. You wanna see the There's a runtime class in Kubernetes. Right? Runtime class name, which basically specifies the CRI to use. Third. I know if we check that that resource. Okay. So why cluster? Yeah. I don't think anything is configured there. Oh, I'm on the workers. Yeah. I'm gonna try to go to the master to the control plane.
42:11 K api-resources. Grep runtime. K get runtime classes. Nothing here. Nothing here. How is he getting nothing? Do we check the kubelet logs on the worker? Yeah. They're pretty noisy from Rook and Ceph. Oh, yeah. Try searching for honk, maybe. I need to take the follow off if I Oh, yeah. Alright. There we go. Okay. So it's actually affecting all containers. We can see it on Ceph containers here too. Yeah. I'm wondering if systemctl I'm wondering if we have, like, a systemd service here. Maybe that's also, like I don't even know.
43:34 There's nothing in Kubernetes. I think we're confident enough in that. Yeah. The control plane seems to be healthy. I think something has to be happening on a host. But I'm curious if we've got any We have okay. Yeah. Honk. No. Maybe there's the binary somewhere. I can't recall if there's any configuration we can do, like, by putting it in files, that changes the CRI of the kubelet. Because the kubelet just talks to containerd and I'm not sure. Audience, what you got for us? Come on. We'll give this another few minutes and then
44:27 we'll move on to the other cluster. And if we have time, we'll look back. I don't wanna spend all of our time on one, but whatever Sascha has hidden is good. Let's do systemd. Yeah. It's a good comment from both then. Oh, so we have the k we have the k stuff here as well. Sorry. What was that? We have the k systemd service here as well. On the worker? On the worker. Yes. Here it is. I wanna check the containerd binaries. Yeah. I actually ran a which containerd the
45:11 other and ran file on it, and I kinda assumed that was Yeah. This is okay. Yeah. I think that's okay. Oh, I don't know if we had here on the other nodes the containerd config. No. It's okay. Yeah. It is okay. I don't think containerd is, you know, why I don't know. It's the same idea of status containerd. Do we see something here? We don't. Right? Let's look at. I don't know. Okay. Container. I'm lost, man. I don't know where I have to look. Maybe we could always do the break glass
46:18 button, which is to go to the root filesystem and do a find by modified in the last twenty four hours. But I don't wanna reach that point of desperation. But yeah. No config here. So something should be Okay. Let's jump over to cluster 18. If we have time, we will Now good. Maybe something will come to us as we move on. But right now, I don't know where that honk is coming from. Sneaky Sascha. It's big game. I'm serious. Alright. I'm gonna close cluster 17 and we will come back to it. Alright. Oh, yeah. This is 18 and that's yeah.
46:55 Kluster 18 - Broken by Billie Cleek
47:00 Okay. Got it. Let me just shut that down. Okay. So this cluster, cluster 18, is broken by Billie Cleek from DigitalOcean. Thank you, Billie, who is also available in the chat if we get stumped. I'm popping open the control plane. We will get the basic setup. You jump in and then we'll check and see what we are working with. I'm getting there. Joining now. What's the completion? Source. Source, you forgot the minus sign. Yep. And there should be a minus sign in front of the yeah. There. Like, the open thing, this one.
47:16 Cluster 18 Initial Diagnosis (API Server Down)
47:59 Oh, yeah. Of course. Less-than sign. Yeah. Yeah. Basically, that one. Cool. So you wanna do the honors? Yep. Billie's wishing us luck. That's a good start. Thank you, Billie. No control plane. Of course. Oh, of course. Yeah. Actually, I think I prefer when I don't have a control plane because I know where to start looking. I know what we need to check. Okay. No API server running. Please, please don't mess with etcd. No etcd running, like okay. We should check the manifest. Right? We should check the manifests. Of course, they were modified
48:42 because YOLO. Let's see kubelet. systemctl. That's kubelet. Okay. Kubelet seems to be running. Error updating node status. There seem to be some errors in the logs. Yeah. If you add --no-pager to that, you'll be oh, you can just scroll to the side. Yeah. So let's check etcd first. So etcd, should we check the logs first? Oh, yeah. The fact that the manifest exists, I wanna know why the process isn't running. So, yeah, I think getting a log Yeah. Probably a good start. Okay. Let's go tail. There we go. Received
49:08 Debugging Control Plane Components (Cluster 18)
49:45 terminate signal now. Shutting down. It's static, basically. Create transport fails. Couldn't connect to itself. This could just be a misconfiguration then. Reconnecting. Interesting. Let's restart the kubelet to see, because it seems like it didn't fail to start up. Right? It started up, and then it received the Oh, terminate. Oh, yeah. Second one. Yeah. Let's restart containerd too? I mean, I've done that before, and it's led to bigger problems. I'm just throwing that out there. You've been warned. YOLO. So etcd is running now. But for how long? Looks alright. Yeah. You wanna cat the logs? Maybe it's gonna
50:48 shut down eventually for some reason. Maybe a crontab. No. I need to check. Yeah. Yeah. I need to do it, you know, just for the is API server running? No. Okay. The API server is not running. But, actually, it seems to be okay so far. So Can you tail the logs? Shall we go we'll go. Cat. See. Oh, I have two, right, which is the new one. Let's do Pick one and see. It's the second one. Right? The etcd one. This one. Yeah. Seems to be okay. That is one healthy etcd. Cool. Yeah. Let's do that. The API server.
51:44 kube-apiserver. Oh, what am I missing? Which is the last one? Can't have etcd. Although those logs are five minutes old, so it's not restarted when you restarted the kubelet. It didn't? Well, based on those logs, no. I don't believe it's been running for the last six minutes. Maybe did I oh, I got the old API server. This one is the last one. Right? Yep. That's the right one. Quota admission added. Okay. That's the last one. I need Do we see an error here? Because it seems to be okay. Right? Looking. Looking. Looking.
52:40 Looks like we have our webhook configured for localhost, but I think there's either maybe an admission controller plugin enabled that doesn't exist. Oh. Let's go check the API server manifest. Sounds good. And we have admission plugins, NodeRestriction, etcd servers. We need to check if etcd is listening here, which I believe it It was here. Believe. A lot of this is okay. Preferred address. I believe this is okay. Yeah. That looks fine. At a first glance, I don't see anything weird here. We should check that we are using the right images maybe, though
53:31 the API server seems to be okay. Yep. So why is the API server not starting, though? That's the main let's see. Oh, it is running now. Alright. Okay. Let's try to get nodes. Okay. Okay. Are we in the right cluster, David? Yeah. We are. 018? Okay. And the pod seems to be running. So can you try accessing that service at least in the beginning to make sure that I'm gonna be monitoring this, but weird. Please, we need not another honk when we are updating the service. Uh-huh. With no networking. Okay. So So let's check the endpoints on the service
54:39 Cluster 18 Application Networking Issue
54:45 before we start diving into networking, I think. Describe service clustered. Right? Correct. Oh, Postgres. Right? We need to it's called Yeah. Check both. Yeah. Looks good. So the endpoint is there. The port seems to be okay, I believe. That's correct. Yeah. So looks like it's networking. So what we can, yeah, what we can do is the Postgres running an actual database? If the application just tries to connect through, like, standard JDBC or, like Yeah. It's just a database driver? Postgres driver over that port, TCP layer. Pretty standard. Yeah. So let's see the
55:39 oh, the pods seem to be running since the same date. Why don't we jump inside the clustered container and see if we can hit Google and stuff like that? Let's see it. Let's write off a few things first before we Yeah. Okay. Good old rabbit hole. What am I missing? Oh, I lost the wait. It's not pasting my okay. This is the pod and hash. You're inside the container already. Oh, it worked. I don't know. Yeah. Okay. It's just the node name and the container name is ridiculous. It's on there.
55:48 Debugging Networking from App Pod
56:29 We have an nslookup? No. Is it a minor? I think I've been to. Yeah. You got it. Okay. Which one is it? Net utils? No. You're oh, look at that with no DNS. Oh, that's all yeah. That's all DNS. Oh, I don't have ping also. Okay. You do have ping, but try it. Yeah. Okay. So it's just DNS. Oh, okay. Right. And that's the right IP address? Yeah. So let's check the CoreDNS, the kube-dns service in the kube-system namespace. Yeah. Let's go back here. Oh, we didn't check out the pods. Right?
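The distinction drawn here, name resolution failing while the network path itself is fine, can be sketched as a two-step check. This is a minimal sketch using only Python's standard `socket` module; the function name, the return labels, and the injectable `resolve`/`connect` parameters are assumptions made so the logic can be exercised without a live cluster.

```python
import socket


def classify_failure(host, port, resolve=socket.getaddrinfo,
                     connect=socket.create_connection, timeout=3.0):
    """Return "dns" if the name does not resolve, "connect" if TCP fails, else "ok"."""
    try:
        resolve(host, port)
    except socket.gaierror:
        # Name resolution failed: the problem is DNS, not the service behind it.
        return "dns"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # The name resolved, but nothing accepted the TCP connection.
        return "connect"
    conn.close()
    return "ok"
```

Inside an application pod this could be called as `classify_failure("postgres", 5432)`, where the service name and port are placeholders for whatever the application actually connects to. A result of "dns" points at the cluster DNS service before anything else.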
56:47 Identifying DNS Issue (Cluster 18)
57:15 So seems to be running. Look at all those restarts. etcd cluster control plane. Yeah. Yeah. etcd is not happy, but that's not our biggest problem right now. Our biggest problem right now is DNS. So Yeah. It's nothing compared to that. It has one restart. Do I just Oh, let's see the yeah. I was gonna say describe the kube-dns service. Let's see if we've got some endpoints there. Sounds good. kube-system. That service, it's Uh-huh. Oh, I forgot an f. Wait a minute. Okay. And now we can do describe.
57:35 Checking CoreDNS Configuration
58:06 Endpoints. Endpoints seem to be there. We have the metrics endpoint, the DNS endpoint, which is okay. Shall we see the configuration for kube-dns, maybe? I really hope someone hasn't modified the config again. CoreDNS. So this is CNAME in Google.com. Yeah. This seems to be Yeah. This is There you go. There's the problem. NXDOMAIN. So, basically, for ANY ANY, you return an NXDOMAIN resolution. So we need to change that one. Alright. Let's get that fixed up. Walid called that as well. Good call. How was the name? Hey.
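A rule like the one found here can be spotted by scanning the Corefile for forced response codes. This is a minimal sketch; the exact rule from the video is not shown in the transcript, so the sample Corefile below is a hypothetical reconstruction that uses the CoreDNS `template` plugin to answer every query with NXDOMAIN, and the function name and marker list are assumptions.

```python
def suspicious_corefile_lines(corefile, markers=("NXDOMAIN", "SERVFAIL", "REFUSED")):
    """Return (line_number, line) pairs that force an error response code."""
    hits = []
    for number, raw in enumerate(corefile.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore Corefile comments
        if any(marker in line.upper() for marker in markers):
            hits.append((number, line))
    return hits


# Hypothetical Corefile, reconstructed for illustration only.
SAMPLE = """\
.:53 {
    errors
    health
    template ANY ANY {
        rcode NXDOMAIN
    }
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
}
"""

print(suspicious_corefile_lines(SAMPLE))
```

The Corefile text itself would come from the `coredns` ConfigMap in `kube-system`. A hit is only a pointer for a human to read the surrounding block, since legitimate configurations can also return these codes for specific zones.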
58:47 Fixing CoreDNS
59:03 Okay. So this one goes This one goes away. That's it. Right? I think it is. Unicode is banned, Walid. No one will ever mess me up with Unicode. Let's restart the pods. Shall we do a rollout? I might as well do it the nice way. So now we restart DNS. What's the service? CoreDNS. Yeah. And restart service. Deploy. Deploy. Right? Or is it yeah. Yeah. There it is. Deploys. So it can take thirty seconds. I normally just delete them. Yeah. That's appropriate. Especially if we're having etcd problems. Get pods, and I can do
1:00:10 -l app coredns or kube-dns. Oh. You've got get get pods. Oh, yeah. Yeah. Get pods -l app. I think it's k8s-app. Which one? Oh, this one. Like this? No. Do you mind if I type? Yep. Go ahead. I think the core Kubernetes services are Okay. There's something. Okay. Oh, there you go. So delete them. Go for it. And just for Oh, we just Oh, okay. So we just got a of course. api-resources. Grep. k get. Mutating. Suspicious. Let's try validating. But is kubectl
1:00:59 Discovering Mutating Webhook Issue
1:01:32 still working? Maybe something happens to be Yeah. We've lost everything there. Oh, maybe it's down again. Maybe it got killed or something like the what about etcd? Yeah. Oh, etcd got killed again. So something is probably killing etcd. Alright. Let's see what happened to etcd. Tail the log. etcd. Yeah. Same thing. Received terminate signal, shutting down. We saw that the crontab wasn't there. So something is we didn't see if we have any rogue, maybe pods that could be killing etcd on the on the Yeah. It could be a rogue pod.
1:02:30 From within the cluster. Right? I just quickly I guess we need to restart, yeah, we need to restart the kubelet again so it starts etcd again and see if there's anything on the cluster that could be killing it. Yeah. We do need to restart. Yeah. Just to get things ticking along. I'm just doing a quick scan of the process table. Oh, people are super sneaky. Let's do oh, you've restarted. Okay. So we just need to do the wait. Yeah. Is this new? It's not, right? No. It's four minutes old. What did I wait. What did I yeah.
1:03:23 I restarted the kubelet. Let's go to the static manifest directory and just modify the file. Yeah. I mean, the modified timestamp on all of them, I don't know if that's a red herring or, you know, changes have actually been made here. So shall we change anything here? You could just set there Maybe an annotation? Yeah. Yeah. Just throw a label on it. Let's see. Yeah. There it is. Okay. That's why we've got an API server hopefully running. One thing we noticed was it never got healthy. Right? So let's check out the probes on that etcd
1:04:18 manifest. Oh, you're right. Yeah. Okay. So k get pods -A. We could just open the etcd manifest in front of unless it's done by a mutating because we had so many different ways to fuck up our cluster. Yeah. Do you wanna check the mutating admission webhooks first to see if there's anything weird there? Yeah. I'm worried if that's gonna cause the crash. Let's see that. Okay. Get. I'm gonna go with mutating first because I think it there we go. Yeah. There's something there which shouldn't be there. So what you are thinking of is that maybe
1:05:05 he triggers something when we try to access a specific resource that's killing etcd? Well, yeah, because we've now just lost our API server. Oh. So whenever we try to access a mutating webhook, we just lose etcd. It appears that way, doesn't it? We still got an API server. Do we still have it's still running. Oh. Was that intermittent? Wait a minute. Oh, it's here again. Okay. Let's try that AIDA again. Okay. What is it doing? It's running. etcd. See? Yeah. Object selector, component not in etcd. Let's just delete it. I don't trust it.
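Reviewing what each mutating webhook targets, as done here before deleting the suspicious one, can be sketched against a saved listing. This is a minimal sketch, assuming the configurations were exported as JSON (for example with `kubectl get mutatingwebhookconfigurations -o json`); the function name and every value in the sample are hypothetical and do not reproduce the webhook from the video.

```python
import json


def summarize_mutating_webhooks(config_list_json: str) -> list[dict]:
    """Flatten each webhook into one row: what it is called and what it selects."""
    rows = []
    for config in json.loads(config_list_json).get("items", []):
        config_name = config.get("metadata", {}).get("name", "")
        for hook in config.get("webhooks", []):
            resources = {
                resource
                for rule in hook.get("rules", [])
                for resource in rule.get("resources", [])
            }
            rows.append({
                "config": config_name,
                "webhook": hook.get("name", ""),
                "failurePolicy": hook.get("failurePolicy", ""),
                "objectSelector": hook.get("objectSelector", {}),
                "resources": sorted(resources),
            })
    return rows


# Hypothetical sample data for illustration only.
SAMPLE = json.dumps({
    "items": [{
        "metadata": {"name": "suspicious-hook"},
        "webhooks": [{
            "name": "mutate.example.test",
            "failurePolicy": "Fail",
            "objectSelector": {"matchLabels": {"component": "etcd"}},
            "rules": [{"operations": ["CREATE", "UPDATE"], "resources": ["pods"]}],
        }],
    }]
})

for row in summarize_mutating_webhooks(SAMPLE):
    print(row)
```

Any webhook that selects control plane pods, or that nobody on the team recognises, is worth reading in full before trusting it.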
1:05:40 Deleting Problematic Mutating Webhook
1:05:51 Well, of course, don't. Yeah. Think. Gone. Okay. What about validating? Okay. Nothing there. Okay. Let's see the etcd pods to see if they make sure that the I think etcd Yeah. Stuff right there. etcd is now healthy. It's passing its probe. No. That's for DNS. Right? No. etcd. The etcd cluster, it's okay. It's ready, and it seems to be Yeah. But that wasn't doing that before. So now I think we need to focus our energy on this controller manager. Oh, sounds good. Oh, and the scheduler's down too. So Oh, they are okay. Let's see the yeah.
1:06:03 Verifying Cluster 18 Control Plane Health
1:06:47 Let's describe the pod. And we still need to check our DNS. Well, DNS should be ideally, it should be fixed, I believe. Right? But let's check the, yeah, the controller first. I'm just gonna quickly jump in here. Yep. Sounds good. Let's rule that out. So Do we need to recreate those? Oh, no. Yeah. Now it's running. Nice. DNS is fine. Let's focus on the controller manager. Nice. Yeah. Let's check the healthiness probes there. Thanks for joining us, Walid. Oh, I'm in there. Yeah. You're in there. Nice. Nice. Good call. Just because I see that there call. Yeah.
1:07:50 Okay. Very good. Okay. So if we Let's see the deployment of the controller manager or the daemonset. Oh, it's all healthy. In the controller manager logs here, there's a reference to that mutating webhook. I think that was just causing Oh, that's the issue with yeah. It was causing issues. Shall we check the deployment now to see if we have connectivity? Yes. Good call. I don't know if the DNS actually fixed it. Let's find out. Yep. Nice. Okay. Let's try that. Bump it. Yep. Yeah. I'll give you the honors. Thank
1:08:40 Cluster 18 Resolved
1:08:46 you, sir. Yeah. There it is. Image. No honks now. Please delete. ContainerCreating. Welcome, Adrian. No. We are not breaking things. We are fixing things, hopefully. Since we're running, if you can do the honors and check the port forwarding Okay. Just to make sure. You know? So, Billie, is there something that we are missing besides the DNS and the mutating webhook? Because we don't know if this is gonna go down again. Maybe etcd is gonna get another terminate signal, and we could be missing something. And we could assume, yeah, that seems
1:09:33 to be running. Yep. This is well, from what I could tell, you know, we've deployed our upgrade. We've browsed our application. It's speaking to Postgres and delivering quotes. The video doesn't work because Firefox doesn't support the codec. I need to fix that. But I think this cluster is operational. See, to me, that was, like, a rest from the other one so far. Yeah. Those mutating webhooks are brutal. Like, there's so much carnage that can be caused by those things, it's wild. Yeah. We don't have anything fancy here. Yes. I guess that's pretty much it here.
1:10:13 Revisiting Cluster 17 ('Honk' Mystery)
1:10:13 Do you want to, like, go back to the other one for a few minutes? I have twenty one minutes until my next meeting. So I don't know. Let's see if we can find the honk. Billie, if we missed anything, do let us know, and we're back on seventeen. Feel free to join the session. We're looking for a containerd honk. Containerd honk. I don't even know where to look. We did kill service k. Oh, have I lost Marcos again? 2017. Yeah. So that's old. No binaries have been modified or injected except for k, which I'm
1:11:26 just gonna remove. And that wouldn't give us a honk. Bye. I'm back. Hey. Welcome back. Yeah. Just a little bump here. I'm gonna sneeze you. Billie has informed us that we did miss one thing, but they've seen it on our screen multiple times. Oh, okay. Maybe on the manifest or yeah. But the deployment was working though. Right? So does it mean that etcd will go down again, Billie, eventually? Or Yeah. Let us know. Okay. Let's think about this honk. I have to remove that k binary. We don't need it there. Don't care about the systemd service.
1:12:16 The binary is gone. I sorted the ls here by time. I just want to know if there was anything else modified. Doesn't appear to be. Has to be You will join there. There's nothing called honk. Oh. Which I was hoping for. What else? I mean, I'm really tempted just to go for the sledgehammer here and see how'd you do it? Linux find modified files. There's a way to find, like, everything modified in the last twenty four hours. That's the one. Plus the one. Marcos. Maybe that was a bit too wide in scope. Shit.
1:12:46 Searching for 'Honk' Binary/Config
1:13:16 Three d active I'm joining now. I'm there. I've kind of broken that session, I think. Oh, okay. It's not responding? No. My find was a little bit too wide in scope. So I think we'll need to spin up a new one there. Sounds good. I'll try that again, but this time I'll just target take a look at /usr. What could it be? Well, it looks like he potentially installed the Linux headers, which means I'm thinking of a compiled binary on this device somewhere. It leads me to think just because we've the kernel headers potentially oh, it's doing that weird thing.
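The "modified in the last twenty four hours" search, scoped to one directory tree so it does not overwhelm the session, can be sketched as follows. This is a minimal sketch using only the standard library; the function name and its parameters are assumptions, and the root directory is whatever subtree is under suspicion.

```python
import os
import time


def recently_modified(root, max_age_seconds=24 * 60 * 60, now=None):
    """Return sorted paths of files under root modified within max_age_seconds."""
    cutoff = (time.time() if now is None else now) - max_age_seconds
    found = []
    # os.walk does not follow directory symlinks by default, which keeps the scan bounded.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime >= cutoff:
                    found.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(found)
```

Calling `recently_modified("/usr")` would limit the walk to that tree, in the spirit of the narrower second attempt described here. Modification times can be forged, so an empty result does not prove nothing was changed.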
1:14:13 Yep. I'm gonna run that one more time but grep -v l and grep Module. Modules. He'll start No. I was hoping for more. Yeah. I'm not I mean, please don't tell me that there's some weird kernel module doing some funky stuff here. I don't see any. Should we check the bash history and see if he's been sneaky enough to remove it? We can. Let's do it. crontab -r. Maybe there's a oh, systemd. Well, that was us. No. But not the enable. Not the systemd enable. This was me just before we went live,
1:15:11 which means the only commands that Sascha has left us here are these. So he removed the crontab and then, yeah, he added the k binary? So I'm gonna assume he's cleaned up the bash history and just left a few things in to laugh at us. Okay. Where's the honk? Where is the honk? It's weird that we couldn't find any containerd reconfigurations that could be causing it, and why is it that only the first time the container fails and then when it restarts, it comes back to life. No? So something
1:15:57 I like to do is to stop actually, it's quite sledgehammery, though, but it's to stop containerd as a service and run it manually. Should we try that? Yep. That's good. Because it means that we can run it in debug mode. Sounds good. Okay. So why don't we pop open another shell and scale clustered up to 30 again and see if we get anything in the containerd log. Oh, you'll need to pop open another Yeah. Yeah. Another tab. Yeah. Sounds good. Want me to do it, or do you wanna do it? I'll do You could just join.
1:15:58 Debugging Containerd on Worker Node
1:16:42 Yeah. You go for it since I'm just joining. Yeah. Okay. It seems like the Teleport UI is a bit heavy, and it takes some time to download it. Yeah. It's here. Okay. So I'm gonna do alias k, kubectl, export, and I'm gonna edit the deploy, clustered. Find replicas. Sorry. I just joined. Runtime. Yeah. Run container error. Nothing on containerd. So it must be an external process. It's kinda where I'm Yeah. The weird part is that it gets rescheduled, and we don't have any mutating webhooks or any the manifests weren't changed also.
1:17:45 Okay. Let's scale that back down. Checking the mounts, see if they have been messed with. Maybe we're not even looking at the right static manifest directory or I don't know. One thing that I would love I'm gonna create a session on one of the workers. I would really like to know if there's maybe we can run containerd in debug on the workers instead of on the because you did it on the control plane. Right? Yeah. Maybe we're just looking on the wrong machines. Although, I'm pretty sure we did see
1:18:28 the error across a whole bunch of nodes. I lost the history here now. Yeah. It's gone. Oh, wow. Okay. Alright. Let's join worker d case seven. I created a session there. Yep. Alright. Okay. I don't wanna join your session. I will flash your IP address unless you're okay with that. Yep. Sounds good. Alright. Okay. I will join your session. Oh, yeah. It should be the last one I yep. Yep. So we were stopping containerd. Right? Yeah. Yeah. We'll stop it. We'll run it with log level debug. We'll scale up the deployment and see if we can see anything. containerd
1:19:15 --log-level=debug. Oh, look. Oh, yeah. No. Is this expected? Oh, it's starting the container server. Right? Yeah. Oh, yeah. Okay. Alright. So I'll give it a second to settle. Oh, wait. Why was there whoami? What do you see? I see that whoami run. Yeah. Here. whoami? And that's normally Oh, what is this? Something nice containers don't often run whoami, I would say. But this just looks I think it's just Rook and Ceph. I can tell. It's the number of PGs I wonder.
1:20:09 Okay. Let's see. Has it settled? No. It's still spitting out quite a bit, isn't it? Alright. Yeah. But it is doing some weird stuff. Right? Accepting I do think so. Yeah. Yeah. It doesn't seem to be, like, a happy container. What is this big chunk of JSON? Seems like a container. Nothing fancy here. Rank election. Kraken, Octopus, How's the controller manager? Controller manager? Oh, then oh, no. There's node debugger, but it's in error. Yeah. All the pods seem to be okay in controller manager at least. Okay. I'm stumped. Yeah. All the daemonsets
1:21:20 are happy. It's weird that it is complaining that much. Right? So something's definitely Bogdan thinks it is just liveness probes running, which I mean, whoami as a liveness probe? I guess so. Perhaps. Okay. Let's kill containerd. The other thing we can do with containerd is print the config. Oh, you're right. Yep. Don't. Continue to get that solved. I'm not that familiar with this. Handbook. Yeah. Yeah. Exactly. There could be some crazy stuff here. Well, at least I see SELinux isn't enabled. That would have fucked me up.
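Checking a printed containerd configuration for the default runtime and for stray keywords can be sketched as plain text matching. This is a minimal sketch, assuming the text came from `containerd config dump` and uses the version 2 config layout, where the CRI plugin's default runtime is stored under `default_runtime_name`; the function names and the sample snippet are illustrative, not the configuration from the video.

```python
import re


def default_runtime(config_text):
    """Return the value of default_runtime_name, or None if it is not set."""
    match = re.search(r'^\s*default_runtime_name\s*=\s*"([^"]*)"', config_text, re.MULTILINE)
    return match.group(1) if match else None


def lines_mentioning(config_text, keyword):
    """Return the stripped config lines that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line.strip() for line in config_text.splitlines() if needle in line.lower()]


# Hypothetical excerpt in the version 2 config layout.
SAMPLE = '''
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
'''

print(default_runtime(SAMPLE))
print(lines_mentioning(SAMPLE, "honk"))
```

A default of `runc` and no keyword hits would match what the hosts report a little later: nothing in the containerd configuration explains the error.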
1:21:45 Inspecting Containerd Configuration
1:22:41 Default runtime, runc. I think this is fine. You don't see any honks in there? I mean, I have grepped for honks, but I see no honks. That. Create failed honk unknown. Let's see on the yeah. Disability. Right? Let me see system. systemd. Yeah. Nothing here, bro. I'll leave containerd. These are only the error logs from previous containers. Create sandbox failed. Create sandbox failed. Yeah. This is the error for the audience. This is the error that we're getting. It only happens the first time when we are trying to run a
1:24:03 container on these nodes, and then it goes away. So when the scheduler reschedules the pod on the same node, it just starts It works. Successfully. So, yeah, it is very weird, actually. Is there a I mean, our application is working. We did upgrade it. So Yeah. Yeah. It's just one of those subtle breaks, and if it wasn't for that honk, I wouldn't even be looking anymore. But the fact that the honk is there. The yeah. There's teleport of the Excuse me. The cluster is working. Thank you very
1:24:22 Cluster 17 Conclusion (BPF Suspected)
1:24:42 much. Our application is upgraded, and we've browsed it. There's just a wild goose on the loose, and we don't know where. Yeah. It seems like there's some really low level stuff happening here and all that. Well, it could be simpler though, but it's very difficult to spot. Yeah. We've seen this with other clusters. There are subtle things that breakers put in that don't really have the same effect that they thought, and the cluster kind of works but kind of doesn't. And those are really difficult for us as fixers to find and debug because
1:25:17 if we don't have a symptom, it's really difficult to work out what is breaking it. Yeah. This okay. systemd, kubelet. We checked this one. Right? Yeah. We did. Yeah. No such thing here. Yeah. And /var/lib/kubelet, we have to check it. Cloud provider external. Container runtime remote. Perfect. The containerd. Yeah. Containerd. Yeah. I can't cat this one. So because it's a socket. I mean Oh, maybe there's a plugin here in the kubelet. This is a reach. Yeah. Well, that's for sure. I mean, you never know. Right? No. Nothing here. I mean I
1:26:27 don't know, man. I don't have any idea. Just because I know the space that Sascha works in, I have to assume this is a BPF program with something in the kernel to reject the syscalls to create the sandbox, but it only does it randomly on the first one. I just don't know how we'd be able to pull out the BPF programs to see. I think BPF is awesome, really really cool, but working with it day to day is not something I have experience with. So I think what's Yeah. Leave that one up to his report and see what
1:26:58 tomfuckery he was up to. But Yeah. I don't even know where to look. If it is an eBPF program, then forget about it. I tried checking the kernel modules, but eBPF is like completely different stuff. Right? So Yeah. And he installed the Linux kernel headers. So, I mean, that's gotta be a telltale. Oh, yeah. That makes sense. Alright. You know what? Our applications are working. We did upgrade them. We did browse to them. Mission almost successful. We missed one thing on Billie's. We missed one thing on Sascha's. We will get the breaker reports in the
1:27:18 Wrap-up & Thank You
1:27:30 next few days. I think we'll call that I'm gonna call it a win. Yeah. Same thing. I'm gonna call it a win. I'm gonna have, like, a sweet no. No. Not a sweet. Like, a sour flavor on the cake, but it's okay. I mean, we managed to do it. Applications are running. Clusters are not restarting. I mean, we're happy. Right? I am definitely happy. We got a couple of final comments. Papa's just install. Falco, Sysdig. Of course. Yeah. We could install stuff like that for better visibility. We're not going to do that now. No suggested lsmod. We did list all
1:28:03 the kernel modules. We did say BPF, but it's really difficult for us to work any deeper than that. And we're now at a stage where it's the end of our session. So how was that, man? Did you enjoy it? It's been a pleasure. It's been a pleasure to me. So really looking forward to the next episode and to read the post-breaking mortem, if you wanna call it that way, to see what did we miss. Alright. Well, thank you very much for joining me today. I think we did really well.
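The BPF suspicion on cluster 17 is left unverified in the session. As one possible starting point, this is a minimal sketch that groups loaded BPF programs by type, assuming the listing was captured as JSON with `bpftool --json prog show` and that each entry carries `id`, `type`, and `name` keys; the function name and the sample entries are hypothetical and say nothing about what was actually loaded on those nodes.

```python
import json
from collections import defaultdict


def programs_by_type(bpftool_json: str) -> dict:
    """Group loaded BPF programs as {type: [(id, name), ...]}."""
    grouped = defaultdict(list)
    for prog in json.loads(bpftool_json):
        grouped[prog.get("type", "unknown")].append((prog.get("id"), prog.get("name", "")))
    return dict(grouped)


# Hypothetical listing for illustration only.
SAMPLE = json.dumps([
    {"id": 12, "type": "cgroup_skb", "tag": "6deef7357e7b4530"},
    {"id": 57, "type": "tracepoint", "name": "example_probe", "tag": "0123456789abcdef"},
])

for prog_type, programs in programs_by_type(SAMPLE).items():
    print(prog_type, programs)
```

Program types that hook syscalls or tracing points, and names nobody recognises, would be the first entries to inspect further.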
1:28:34 We got both our applications upgraded. Like I said, I'm calling that a win. Thank you to Sascha, thank you to Billie for, you know, for the breakers and the fixers, to you, Marcos. Right? This takes time. Breaking a cluster is not always a it's not something that you can just say, yes, I'm gonna do without allocating time to it. So thank you very much, breakers. And Marcos, thank you for setting aside time to join me today. To everyone else Thank you, David, for inviting me. My pleasure. And to everyone who watched, have a great day. We'll be back next
1:28:58 week. I'll see you then. Thanks a lot. Bye. See you the next one. Take care. Have a good one.