Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Debug cluster access problems by checking kubeconfig endpoints, kubelet and API server logs, and transport socket errors.
  2. Identify broken control-plane components by removing stale encryption provider config and validating etcd and control-plane health.
  3. Locate performance kill-switches in cron jobs, disable harmful system services, and restore startup with Cilium and Kyverno fixes.

Hans Kristian Flaatten and Zach Wachtel tackle two broken Kubernetes clusters, debugging a bad kubeconfig IP, an encryption provider config, a disk IO throttle cron job, plus Kyverno webhooks and a Cilium egress policy.

Chapters

  1. 2:41 Introduction
  2. 2:58 Guest Introductions (Hans & Zach)
  3. 3:56 Discussion on Breaking Kubernetes Clusters
  4. 5:13 Show Mechanics & Sponsors (Teleport, Equinix Metal)
  5. 5:50 Starting with Zach's Cluster (Hans Fixes First)
  6. 6:08 Teleport Connection Issues Debugging
  7. 9:55 Hans Shares Screen & Initial Cluster Access
  8. 10:16 Inspecting KubeConfig (Wrong IP Address)
  9. 12:11 Checking Kubelet Status and Logs
  10. 13:28 Debugging API Server Logs (Socket Error)
  11. 15:40 Examining Kube API Server Static Pod Manifest
  12. 17:55 Investigating Encryption Config File
  13. 19:06 Removing Encryption Config Reference
  14. 20:25 API Server Logs: Etcd Timeout Errors
  15. 21:30 Slow kubectl Commands and Pod CrashLoopBackOff
  16. 30:21 Diagnosing Slowness (Hint: Etcd/IO)
  17. 35:11 Identifying Suspicious Defunct Process
  18. 44:56 Discovering the IO Throttle Script
  19. 45:56 Disabling the Throttle Systemd Service
  20. 46:47 Cluster Control Plane Recovery
  21. 48:20 Pods Pending, Cilium CrashLoopBackOff
  22. 49:09 Attempting to Deploy Klustered v2
  23. 50:09 Zach's Cluster Fixed
  24. 51:07 Starting Hans' Cluster (Zach Fixes Second)
  25. 51:12 Zach Takes Over Screen Sharing
  26. 53:27 Hans' Cluster: Pod CrashLoopBackOff & Slow kubectl
  27. 55:08 Identifying Mutating Webhooks (Kyverno Annotations)
  28. 57:06 Removing Kyverno (Dealing with Finalizers)
  29. 59:03 Post-Kyverno Status & Klustered Service Issue
  30. 59:29 Database Failure & DNS Resolution Error
  31. 1:21:32 Checking Network Policies
  32. 1:22:39 Discovering and Deleting Cilium Egress Policy
  33. 1:23:07 Pod Still CrashLooping (App Check Error)
  34. 1:24:38 App-Level Check: v2 Image Already Present
  35. 1:25:30 Forcing Image Pull Policy Always
  36. 1:25:50 Hans' Cluster Fixed
  37. 1:26:02 Recap of the Breaks Found
  38. 1:26:24 Cron Job Location Revealed
  39. 1:26:45 Conclusion and Thanks
Transcript

Generated from the English captions.

2:41 Introduction

2:41 Hello, and welcome back to the Rawkode Academy. My name is Rawkode slash David, and I'm your host for today's episode of Klustered. It's been a little while, but we're gonna have a lot of fun taking two very broken Kubernetes clusters and doing our best to fix them. Today is solo edition, and I am joined by Hans and Zach. Hello, how are you doing? Hi, David. Alright. Well, thank you both for joining me. I'm really excited that we get to do another episode of Klustered today. So if we could just start with you, Hans, and then Zach,

2:58 Guest Introductions (Hans & Zach)

3:12 say hello and tell us a little bit about yourself. Hi. I'm Hans. I work as a platform engineer for the Norwegian Labor and Welfare Administration. Been using Kubernetes for quite some time, but really excited for today. I haven't fixed many of these broken clusters, so this will be cool. Yeah. Alright. And hi. I'm Zach. I work for Microsoft. My team at Microsoft works on, you know, deploying Kubernetes in all the places that aren't, you know, the public Azure cloud. And, you know, we call ourselves, like, the AKS Edge team, and we do a lot of exciting

3:48 things. And we deal with a lot of broken clusters. So I'm excited to, you know, share some of my learnings and fix one live. So yep. Well, yeah, let's face it. It is not difficult to break Kubernetes. So No. I say that leaving a cluster to its own self for, you know, a week or two without touching it, it's gonna go bad. It's like, you know, it's like the gremlins. Well, yeah, because we have this little linchpin in Kubernetes called etcd, which cannot really go uncatered for, unnurtured and uncared for. So,

3:56 Discussion on Breaking Kubernetes Clusters

4:20 yeah, always fun, always a challenge. But it just means that when we come on to episodes of Klustered like this, you know, I always tell people the hardest part is probably deciding how to break it. Right? How did you both find that? Like, there's a million and one ways and yet you can't seem to think of what you like. Yeah. Absolutely. I've broken my fair share of clusters, but once I sort of, oh, yeah, forty-eight hours to really break this cluster here, I just totally froze. But luckily, I do have some good buddies that

4:49 sort of made the thinking go a little bit smoother. Oh, yeah. Exactly. I mean, as soon as I told everyone I am gonna be on the show, they were like, oh, like, remember this time we looked at a cluster where that happened and when that happened, then, you know, using, like, that past experience of, like, seeing some, you know, messed up clusters. I was like, how can I do that on purpose? Awesome. Well, we are about to find out. So this is how Klustered works. One, I share my screen here.

5:13 Show Mechanics & Sponsors (Teleport, Equinix Metal)

5:22 We have Teleport. Teleport is a tool that will allow us to all access the same machines at the same time, typing in the same terminal, which means we can try and fix these clusters together. Teleport also sponsor Klustered, so thank you very much to Teleport. Also, all these machines run on bare metal. They've got a healthy amount of CPUs, a healthy amount of RAM, and that is all thanks to Equinix Metal, who also sponsor. Well, thank you both for providing the tools that make this possible. We are gonna start with Zach's control plane one. That means, Hans,

5:50 Starting with Zach's Cluster (Hans Fixes First)

5:55 you're fixing first. I have opened the session. You should be able to see it on the active sessions list. If you could please join and just type echo or hello or your name, anything, but let us know that we are in. It's joining here and it says access denied. Disconnected, actually. Oh, dear. Could I let's refresh and join one more time because it shows up here twenty six seconds ago. Still disconnected now. You know, we got this last time, and I forgot all about it. So let's see if I can fix the helper. The problem was the no. I did update

6:08 Teleport Connection Issues Debugging

6:45 it here. Game of the role. Oh, but this is the join that's broken. Alright. Let's do it the other way. Can you open a session, please? So if I connect Yeah. I think you're just missing the join permission, so I'm gonna try and join. Okay. So I'm joining control plane one. Yep. I've joined. So it's a challenge accepted. We have sort of broken Teleport. No? I didn't touch Teleport. No. You know what? It's just been so long since I did a and we did run into this on the last episode, and I 100%

7:42 forgot that we have the problem. Now it's not the end of the world, and I can always do the typing if required, but I'm wondering if we can fix this. I bet we can. Yep. Because we have the role editor here, and there's an editor role, which if I copy, grab all of this. Or I could just give myself the editor role, couldn't I? Leasing in the Teleport tweak. Perhaps. We have to expand the scope of the show from Klustered up. Technology. There were three simple rules. It was sort of, like, don't break

8:33 the bootloader, don't exhaust the disk, and don't break Teleport. Yep. Alright. Let's copy all this. Look at our. That's a camel. Let's try again. Resources are under road. This is why we want something like CUE instead of YAML when you're writing configurations. I feel like CUE. No. I haven't used it enough to make sort of a good time. Let's not debug that. I'll just do the typing for you. Because oh, I mean, you could share your screen if you want. I mean, this is solo edition. Are you happy to share your screen?

9:49 I'm more than happy to share my screen. No worries. Alright. Awesome. You share your screen. Let's do it that way. Okay. Maybe I'll try and fix teleport in the background so we can have it working later. But let's just do the screen share from there. Okay. It's Very professional and sort of cluster. You can see all over this. Google Chrome is required for screen sharing then. Give me some I'll I'll just take that. Okay. Yep. Okay. So let's see what we have here. I'm just going to expand this image here so I can actually see.

10:16 Inspecting KubeConfig (Wrong IP Address)

10:22 Alright. Well, you can expand it by right clicking essential controls. It's very hidden by my streaming software, but it is there. So I have exported the kubeconfig for you. These are kubeadm clusters, which means that we are able to use admin.conf to be able to speak to the Kubernetes control plane. Okay. Cool. So let's do a kubectl version just to see if we are able to oh, and it says the oh, the connection was refused. That IP there doesn't really look right. From my recollection, it should have been sort

10:59 of more of a local private IP. So let's go ahead and look into that admin.conf, see what we are dealing with there. Okay. So I'm flashing Kubernetes credentials to the world. Yeah. Maybe not the Anyone that could copy off these X.509 is good for them. Exactly. Oh, man. Yeah. So I really believe that that IP there shouldn't look like that. If you could bring up ifconfig and just look at the what was the yep. So Looks good to me. Oh, yeah. Okay. But there's internal address

11:52 as well. Okay. Okay. Yeah. So it's definitely there. Let's see. Can we okay. So is the kubelet even running? What does journalctl say for the kubelet? Running? By then? It's just material. Yeah. Yeah. Definitely. What does it say here? Unable to write event. So probably it's, same as us, not able to connect to the API server. Let's check if there's any containers in the k8s.io namespace with ctr. And is it containers ls? Okay. So I see an API server at least.
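
The kubeconfig check Hans walks through here (read the server address out of admin.conf and ask whether it looks like the node's private address) could be sketched roughly as follows. This is an illustrative sketch only: the function names are made up for this example, the `server:` line is matched with a simple regular expression rather than a YAML parser, and the addresses in the demo are placeholders, not the ones shown on stream.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def server_endpoints(kubeconfig_text):
    """Return every `server:` URL found in the kubeconfig text (hypothetical helper)."""
    return re.findall(r"^\s*server:\s*(\S+)\s*$", kubeconfig_text, flags=re.MULTILINE)


def is_private_endpoint(url):
    """True/False if the URL host is an IP address, None if it is a hostname."""
    host = urlsplit(url).hostname
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Not an IP literal (for example a DNS name), so no verdict.
        return None


if __name__ == "__main__":
    # Placeholder kubeconfig fragment; the address is an assumption for the demo.
    sample = "clusters:\n- cluster:\n    server: https://8.8.8.8:6443\n  name: kubernetes\n"
    for url in server_endpoints(sample):
        print(url, "private address:", is_private_endpoint(url))
```

A public address in a kubeadm admin.conf is not wrong by definition, so a result like this is only a hint that the endpoint deserves a second look, which is how it is treated in the session.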

12:11 Checking Kubelet Status and Logs

13:15 Can we how do we get the logs for this then? I don't recall if there were any logs output directly from ctr. We could use ctr. But because we're on this machine, I tend to just use /var/log. Mhmm. Yeah. I also like crictl. Yeah. So we have Cilium, CoreDNS. We have the kube-apiserver over here. And then If we have more than one. Yeah. And is it complaining about etcd? I don't recall if I saw any. It says it's retrying. Zach would have to be very cruel to mess with etcd.

13:28 Debugging API Server Logs (Socket Error)

14:23 Yeah. No one really knows etcd. Dialing temp socket file connection, no such file or directory. Would that be too cruel for the game? No. But I have polled our previous guests slash victims on Klustered before. And the number one thing, almost unanimously, that people do not want to see broken is etcd. So there's definitely something here. Address connection, create transport, failed to connect to, and then it says temp socket file dot sock. So dialing Unix. So it's trying to connect to a socket. So that's suspect here, right? Yeah, yeah.

15:22 Somehow, sort of the Kube API server is trying to connect to a socket here. So we need to look at the manifest files for the Kube API and see what's going on over there. Which manifest would you like to look at? It is a it's it is a Kubernetes manifest there, and there should be a Kube API server with their We can be modified. Let's look at that one. So we do have the advertised there. And let's see here. Can I move my the it says servers, so it does has do you have an address there? I don't

15:40 Examining Kube API Server Static Pod Manifest

16:18 see any socket sort of configurations It would indicate that it's trying to connect to a socket yet. Signing TLS encryption provider. This looks looks normal. I say that all the time, but, you know, what the heck? Dave David sees it. It's it's it's it's Oh, I need I need to squint more with my eyes here, and let's see what have I missed here. There's one thing here you just you probably wanna cast your eye on for sure. Okay. Sorry. Maybe if I just move my cursor. And is is that the one encryption encryption provider config?

17:47 Because I know. I mean, I don't remember enabling encryption on etcd. So Oh, man. Yeah. Encryption provider config. No. That's yeah. I Let's take a look at it. Right. Yeah. Let's do that. Something tells me we're not done in that file. But Okay. So Hello? I'll let you you could tell us. What do we see? So there's the socket file at least. Exactly. Yeah. Yeah. So in my sort of simple head, what if we just remove this from the kube-apiserver? Yeah. I guess we're about to find out just how

17:55 Investigating Encryption Config File

18:53 cruel that has been. Right? Like, did he even encrypt anything? Is it just a pseudo misdirection? Oh, man. I don't like it. But we could we could certainly comment that for now. Mhmm. Yeah. Angel's scan. Oh, no. How have I lost David's trust so quickly? It's because I came late to the because I came late to the the pre the pre show. Alright. What are you thinking, Hans? Just comment to oh, and let's let's see what happens? Yeah. That's that's tested out at least. Of course, if it's if it's already encrypted and we need to

19:06 Removing Encryption Config Reference

19:43 we need some keys to decrypt it, then we need to figure out where the keys are. Well, we'll give that a moment to restart. We can maybe nudge it with restarting the kubelet, but we'll give it thirty seconds. There it goes. And it's back. Alright. Okay. What's your next play? Well, we need to look at the logs again. And I think we are still unable to speak to etcd. Yeah. I think you're right. So what does it actually say here? Request timed out. Is it a connection error? Connection to the cluster. Now let's was running?
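
The edit made in this part of the session (commenting the encryption provider config reference out of the API server static pod manifest) might look something like the sketch below. It is an assumption-laden illustration: `comment_out_flag` is a made-up helper, it works line by line instead of parsing YAML, and the manifest fragment and file path in the demo are hypothetical rather than copied from the stream. It also leaves any related volume mounts untouched.

```python
def comment_out_flag(manifest_text, flag):
    """Comment out list items of the form `- <flag>` or `- <flag>=value`.

    Line-based on purpose so the rest of the manifest is left byte-for-byte
    intact; the flag name is supplied by the caller.
    """
    out = []
    for line in manifest_text.splitlines():
        stripped = line.lstrip()
        if stripped == f"- {flag}" or stripped.startswith(f"- {flag}="):
            indent = line[: len(line) - len(stripped)]
            out.append(f"{indent}# {stripped}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical manifest fragment; the config path is an assumption.
    manifest = (
        "spec:\n"
        "  containers:\n"
        "  - command:\n"
        "    - kube-apiserver\n"
        "    - --advertise-address=10.0.0.5\n"
        "    - --encryption-provider-config=/etc/kubernetes/enc.yaml\n"
    )
    print(comment_out_flag(manifest, "--encryption-provider-config"))
```

As Hans notes, this is only safe if nothing in etcd was actually encrypted with that provider; otherwise the keys would be needed to read the existing data.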

20:25 API Server Logs: Etcd Timeout Errors

21:19 I didn't recall if we if that's So Appears. Here. Yeah. Might be worse to be running. Yeah. Just run kubectl get pods and see if yeah. Okay. Can you just tell us to run that to make us Who? Oh, it responded. Slowly. Faster than that. Interesting. Yeah. My video is sort of blocking the lower part of the screen, so it's a little bit hard to see the Well, it would appear that kubectl get stuff works. Somewhat works. Definitely slow. That's suspicious. Maybe I was seeing her. Oh, wow. And that timed out, so there's

21:30 Slow kubectl Commands and Pod CrashLoopBackOff

22:49 definitely something some did get the pods. I can get services too. You need services. Alright. Well, I mean, the objective is to get Klustered v2 deployed. Yep. Maybe you can from this position. Okay. Let's see if we have the deployment even though we don't have the pod. But That's interesting. Yeah. Let's go ahead and describe that. Definitely something funky going on since it's this slow. Yeah. Something's not quite right. It got there. Okay. It's one unavailable. Yeah. Yes. That's interesting. Let's look at the replica sets. That's So I think this guess is a good indication

24:40 of some behavior here. So you wanna what Can we I would I'm just can you do a rollout restart on the deployment? I mean, I could delete deployments pretty easily, but rollouts? Come on. Come on. He pays by the keynote. Alright. Okay. Let's do get deployed RF Mhmm. So we have no change. So, yeah, I think this confirms my thoughts here. I can share, but I'll give you an opportunity to kinda Is there something going on with our kubectl binary? So just can you do a which kubectl? That's not a lot of trust, Hans.

26:00 And I can't see what's the output because that line is blocked by my window. But It looks like we have kubectl at the right location, as an ELF 64-bit statically compiled binary. So Yeah. It looks alright. I'm not gonna say it looks good. I'm gonna say it looks alright. Just to be sure, can we just use /usr/bin/kubectl and then get pods, just to make sure that it's not something else messing with the kubectl, not an alias or anything? Okay. Yeah. So the same result there. So I think

26:52 that we have a working, a semi-working API server, to the point where it is returning data, and we can get pods, deployments, describe, etcetera. However, the API server clearly isn't being updated, which means the kubelet maybe isn't functioning as we expect. Okay. We're able to actually report back that our pod is no longer running. Even though it's not listed under get pods, I'm a bit confused. But I feel like we've got some disconnect there that we probably wanna track down. Okay. So we need to check the kubelet logs then. I mean, listening to me comes

27:25 with its own risk, but It's an API server request timeout. Error syncing pod, failed to start container. So, yeah, failed to update status. CrashLoopBackOff. So kube-controller, let's see if we can get the logs for the controller manager. Mhmm. I'd say that's mostly as expected. Yeah. Would you expect it to lose leader election? Oh, no. I never saw that thing. Oh, yeah. But there's not one healthy line in that log. Yeah. We got a yeah. Okay. Error retrieving resource lock. Yeah. Everything there is broken. I think we should just do a, you

28:52 know, kubectl get pods. Oh, let's just see, like, how everything's failing around. Oh, yeah. Nothing worked. So at least okay. So our controller manager and scheduler aren't working. Yeah. We can't even control the manager. I don't care about the scheduler, but we do need the Yeah. And the only failed to renew lease, timed out waiting for the condition. So it's the leases. I know there is this lease object, but I have no idea really what it's the kube-node and controller manager, but really don't know what I'll just say, like, let's think about Yeah.

30:21 Diagnosing Slowness (Hint: Etcd/IO)

30:21 You know, every command is pretty slow. The pod can't grab its leader election in time. So how would you slow down workloads on a Linux machine? Most of these are containers. Maybe you should inspect them. Yep. Is there something sort of are they sort of resource constrained? Are they Well, there's one container. So let's do that control one. I would say, yeah, I think there was a path where you were on the right one earlier on when we were focusing more on, like, core components here. Yeah. Who would ever mess with etcd?

31:51 Very much of us like it when you talk, Zach. It's a Halloween special. A haunted cluster. Definitely a haunted cluster. Yeah. So, yeah, so something is making it behave really, really slow. Snapshot count. Experimental watch progress notify interval. Yeah. I really don't know if those are normal. Yeah. Yeah. Maybe it's, like, the logs. Well, those two flags are, like, weird. Let's follow Zach's advice to check a log. Well, just I don't know if they'll yeah. Let's do this one. Did I get wrong there? Oh, no. I copied it.

33:14 I totally just blew it out loud, didn't I? No. Hold on. I'm just gonna make it harder for you, Hans. Yeah. Because that's the most recent one. Right? But that was unhelpful. Yeah. Maybe logs aren't helpful. Sorry. We do have a slow fdatasync here. Apply request took too long. Well, that doesn't look great. Defunct. No. Good. Do you mind if I kill it? Yeah. Let's do that. The defunct, the zombie is definitely not gonna go away now. But I thought I'd try. Oh, is this something? It is. Yeah.

35:11 Identifying Suspicious Defunct Process

35:12 I think that's an old maybe the process name isn't might be s a s a d. I think that might be an old, because I'm beat anyway. Alright. Let's make a small change to this. I'm gonna suggest that we comment these two experimental flags out. Just to force etcd to restart, the kubelet to restart it, but also just because I don't like the sight of them. So Are you happy with that? Yeah. That's The other thing just not that I have this wonderful cheat sheet that I always use.

35:53 Super. It tells us how to configure it. Yep. One purple link, but I always do. And which will allow us to run a status as well. We might as well see what etcd thinks of etcd. Mhmm. Assuming it does get restarted by the kubelet. Mhmm. Alright. We have no. I wouldn't. Maybe it's still coming up. I don't know. I would look at things. I didn't do anything here. You should be able to restart everything. Yeah. Yeah. Yeah. Alright. Let's see. This is actually surprisingly healthy. Like, I like it when I see node not found, but

37:03 it just means they can't. But I would still like it to start the container. I don't know if this grep is capturing. Maybe try running. You call again. Never. Well, let's see what ctr says. Right. Control. So, yeah, two days ago. Try Is it not rmp? Oh, no. It is. It's just telling me to stop it first. Because there are two etcd running. Oh, there are. There's one fifty-three seconds ago. I think it's running. I don't know what this other That might be a it's too runny. If so, that was an accident.

38:46 Pures are hard. These are the same three log files. Something's Yeah. We do have two. And there was one seems to be here. No log. I mean, I think we just have to set fire to this machine. I mean, I don't know where this other let's kill this other etcd, but it's not related to the break. Okay. Yeah. It's not. Unless this is a symptom of my break. But I think from their point of view, they shouldn't care about this one. No. No. Did the pod ID. Oh, you Actually, just stop. Oh, wait. Wait. The crictl

40:13 is showing oh, you're doing crictl pods. Doesn't that always show the pod, not the container? So let's stop this pod. Let's stop this pod. That's the two etcd pods. Mhmm. Oh, I think there's a leaked pod, but not a leaked container. If you do crictl containers oh, sorry. Just crictl ls or ps. ls or no? Or ps. Alright. Alright. So this is the new one. So it looks like it created a new pod, but so, like, there's, like, the pod it creates and the container. I think it leaked the old pod.

41:23 Cool. But this is a little bit of a side rabbit hole. Yeah. I don't know why it's not starting. I think it is. If you do kubectl pods, I think it's Yeah. We don't have an API server. We got a kubelet. Got a kubelet. API server is restarting. But I think if you do that crictl ps again, maybe, it might or ps -a. We'll see if I y see, it's saying, oh, it's a new etcd, not oh, this is. So etcd is still not touched. We might

42:23 maybe kill it from here. Okay. Container ID. Alright. Well, I might have learned the valuable lesson here about what not to play with actually. Alright. Let's do something we've never done before. I'm gonna suggest we reboot this box. We jump over to the other cluster, and we start tackling that, and we come back to this one and hope the reboot gets sent to were your changes left through a reboot? No. I don't think so. Well, yeah. So how much do you wanna answer that? Well, the changes will not persist, but there might be something else on the machine that would

43:39 rekick it. But I don't think I'd set that to I don't think on a reboot, that wouldn't start again. I don't know what to do about that zombie process. I think I I I don't know. Yeah. I mean, hey. Reboot's fair game. Right? Do we have any other options? I mean, we could try to stop the thing that's causing that maybe is causing a hang here. Like, there might be a service here that's doing something that's maybe just making somebody else not happy. Right. Okay. Let's see if we can find that service. It's probably named Teleport or something to sort

44:36 of disguise itself. You're good spying things out. What do you think this is? Oh, it's real ugly. Alright. So we find a malicious actor. Let's see. For thirty seconds. It is running crictl ps. Looking for it. Yeah. I think this is maybe related to our problem. Alright. What's it doing? It's getting a container ID. Ah, right. Okay. So you're modifying your throttle. Change on the IOPS. So Yeah. I'm setting an IOPS, and it's a per second limit. Alright. So we got secrets directly. There's two things. Right? We can first kill that script, and, he's not headed in

44:56 Discovering the IO Throttle Script

45:36 the system to somewhere. But It's a service. You might wanna disable that. Yeah. The block IO service, isn't it? Oh, it's a little bit down. Okay. And how do you flush this? Well, the zombie's gone. I guess someone holding on to some file there is not making someone happy. Hey. Alright. We may just have to wait for the API server to come online. We'll give that a minute. Yeah. That definitely did fix etcd then. So you had that script running as a systemd

46:47 Cluster Control Plane Recovery

47:25 service every thirty seconds. It was changing the throttle parameters on the disk IO too. I'm assuming just zero. Oh, yeah. I mean, honestly, I just wanted to change them once, but the problem is, like, there was no, like well, I guess containerd is adding support for it, but I guess it's, like, 1.7 or something it comes in. Kubernetes has no support for, like, IO limits. So I was like, okay. I could change it for this one cgroup. But then if he just accidentally kills that process, then it's and he's, you know, he's free. You

47:55 know, what's the fun there? So figured, like, if we had, like, a service that could just reset it in case of the world where, like, you do restart the pod and then, you know, a new container would come up. And, like, for thirty seconds, everything would be happy, and then the limits would be back. But, clearly, it did a little bit more harm there. I guess the script was a little too hacky. Kudos. But things appear to be getting better. We do have a Klustered pod trying to be scheduled. It's currently in a pending state. So let's see

48:20 Pods Pending, Cilium CrashLoopBackOff

48:31 if we can take a few more minutes on this cluster and get something working. What's your next step? Okay. So let's check this Cilium operator that's currently in crash loop. I would maybe just delete that pod. The other one seems to be healthy. Yeah. Okay. Yeah. I feel like that's a red herring. Although, again, I'm wrong more than I'm right. But I don't feel that we should worry about that. Oh, no. The cluster is running. Yep. So let's call on port 666. The whole the node port is 30,000, but the cluster Oh, yeah. Yeah. Yeah. Yeah. Yeah. Alright.
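
The throttle Zach describes in the previous chapter (a service that kept re-applying an IOPS limit to the etcd container's cgroup) is the kind of thing that can be spotted by reading the cgroup's IO limits. The transcript does not say which cgroup version or file the script wrote to, so the sketch below simply assumes the cgroup v2 `io.max` line format (`MAJ:MIN rbps=... wbps=... riops=... wiops=...`); the function name and the sample values are invented for illustration.

```python
def throttled_devices(io_max_text):
    """Return {device: {limit_name: value}} for every limit that is not 'max'.

    Assumes cgroup v2 io.max formatting; an empty result means no throttle
    is set in the text that was passed in.
    """
    result = {}
    for line in io_max_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        device, settings = parts[0], parts[1:]
        limits = {}
        for setting in settings:
            key, _, value = setting.partition("=")
            if value != "max":
                limits[key] = int(value)
        if limits:
            result[device] = limits
    return result


if __name__ == "__main__":
    # Invented sample: a very low read/write IOPS cap on device 8:0.
    sample = "8:0 rbps=max wbps=max riops=10 wiops=10\n"
    print(throttled_devices(sample))
```

Because the service in the session re-applied the limit every thirty seconds, clearing the value alone would not have been enough; stopping and disabling the unit is what let etcd recover.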

49:09 Attempting to Deploy Klustered v2

49:26 You have v1. Okay. So v1. Let's edit the deploy and then change it to v2. Cannot be that simple, but Oh, no. I think we're good. It's terminating the old one. Mission successful. Yay. That's good for us. Oh, man. I hate etcd. Oh, man. So, yeah, just yeah. I didn't know. Slow etcd is really just a headache. It's so hard to figure out, like, when that's even the problem. So Yeah. I guess slow everything instead of yeah. Yeah. It was really weird, though, how, you know, for the first while, we were

50:09 Zach's Cluster Fixed

50:46 able to see etcd running. We were getting slow responses, and then suddenly it just kind of zombied on us. I'm not sure what triggered that or how that happened, but there you go. Very sneaky. Alright. Let's see how cruel Hans has been. Ready, Zach? Sorry. I think my thing paused for a second. I was just saying I hope Hans has been really cruel. Good luck. Oh, yeah. I probably deserve it after that one. Do you want me to share my screen, or do you just wanna type it type continue?

51:12 Zach Takes Over Screen Sharing

51:28 Yeah. If you could share, go for it. Alright. I'll share my screen. I I am using Chrome. Don't tell my employer that I'm not using Edge for this, but okay. So you see my screen? In just a moment. Pop over here. I'll quickly duplicate your face. There we go. Alright. Got it. Alrighty. Can you Let's start. A window or a desktop? Desktop. I couldn't figure out window. Do you want me to do window? I I just kind of only give me a No. You can you can do the window, but I'll just I'm gonna ask you to resize it,

52:17 but so it fits in the frame. That's all. Oh, sorry. I don't know how do I have to stop? Let me share again. Oh, it should've been there. Oh, window. There's a window option. Here we go. Okay. That should be better. Alright. And if you could zoom in three times. That's it. Good. I'd I'd go to one more. I know it's gonna look annoying on your site, but, yeah, that looks good. Yeah. Okay. We we may have to just if you drag the left hand side of the window to the right oh, the wrong way. Yep. Wrong way.

53:04 Make your window yeah. That's it. Perfect. Right there. That's the one. Good. Hey. Good luck. Did you break etcd, Hans? No. Anything I wait. What did you say, David? I said, did you break etcd? No. I did not. Oh, yeah. CrashLoopBackOff? Can you make your window slightly smaller vertically? Just a smidge. Good. Audrey, try and make it 16 by 9, which I know isn't a possible request. But Did I lose you? Yeah. We've lost having 10 We've lost the sides. What? We've lost the sides. Like, if you could just I know.

53:27 Hans' Cluster: Pod CrashLoopBackOff & Slow kubectl

54:23 Yeah. K. Okay. Let's go. That's it. Don't touch anything. All done. And now I wanna touch my keyboard. Okay. So it was not found, but it kept positive showing. Oh, we're doing something funky here. Oh, no. We're just kind of Interesting. So is the something quick here. Generation one. What is going on? So are you touching this? Two seventy five is pretty high. Is this going up constantly? I suspect so. Yeah. I think there's a trigger. Let's see. We got a lot of time to trigger. This weird Klustered is my favorite and also worst tool that I've

55:08 Identifying Mutating Webhooks (Kyverno Annotations)

56:08 I've never enjoyed and hated something so much. Yeah. I just remember, like, looking at myself this morning being like, why did I reach out to be in this? I blame the Ship It podcast. I kinda wanna check that out. Oh, we got Kyverno. Funny. Yeah. Kyverno caused a lot of headaches for me at my last job. Shout out to any of my accomplished coworkers on the call. We all know the fun of Kyverno. My favorite thing about Klustered is that I give everybody forty-eight hours to break it. When we look at resources, it always says modified

56:42 an hour ago. Okay. This is fun. Should we just get rid of Kyverno? I wouldn't judge you. All from a release. I just wanna make sure I take a pair of part to burn it on the right way, but let's just Helm list is namespaced, if that helps. Oh, yeah. Thank you. I guess I could find the Helm release and delete it properly, but I think I can get enough of it gone to not cause me a headache. The finalizer enters the chat. Yeah. Famous last words. Oh, it ran away. There you go.
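
Finalizers come up here because objects that carry them stay in a terminating state until the finalizer list is emptied. A rough sketch of the idea, under stated assumptions: the object dictionaries below are simplified stand-ins for what the API returns, the finalizer name is hypothetical, and both helper names are invented for this example. The patch body is a JSON merge patch, where a `null` value removes the field.

```python
import json


def stuck_objects(items):
    """Names of objects that are being deleted but still carry finalizers."""
    return [
        obj["metadata"]["name"]
        for obj in items
        if obj["metadata"].get("deletionTimestamp") and obj["metadata"].get("finalizers")
    ]


def clear_finalizers_patch():
    """JSON merge patch body that removes metadata.finalizers."""
    return json.dumps({"metadata": {"finalizers": None}})


if __name__ == "__main__":
    # Simplified, made-up objects; real API objects carry many more fields.
    items = [
        {
            "metadata": {
                "name": "policy-a",
                "deletionTimestamp": "2024-01-01T00:00:00Z",
                "finalizers": ["example.com/cleanup"],
            }
        },
        {"metadata": {"name": "policy-b"}},
    ]
    print(stuck_objects(items))
    print(clear_finalizers_patch())
```

Removing finalizers skips whatever cleanup the owning controller intended, which is acceptable when the controller itself is being torn out, as it is in this session, but is worth a pause elsewhere.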

57:06 Removing Kyverno (Dealing with Finalizers)

57:48 Never seen that before. Alright. Well, after if you ever have a cluster like this to install, you like, you get, like, scarred from, like, trying to delete a namespace. Yeah. I got a special alias now that deletes finalizers from all objects in the namespace. Yeah. It's like like, once you get finally get a and you say it's finally deleted or You can do delete dash dash all as a friendly tip. Ah, thank you. Okay. I've told no one's my favorite. Got 10 more. It's 30,000. Oh, I'm supposed to trust you now? Just I just you could you could say 30,000 and, like,

58:50 in my head, like, I'll just type out, like, the wrong number. I feel like I'd see it. Connection refused. Alright. So let's just recap here. Right? So... Yeah. We jumped into the cluster. The pod was definitely rotating regularly. You think that was a mutating admission controller and you have purged Kyverno from all existence. Yeah. I'm actually not sure. Maybe that is... it's still doing some weird stuff. So, like, it's possible that was a red herring. I don't know. I meant at least I'm doing what it used to be doing. I think that ReplicaSet ID is still

59:29 Database Failure & DNS Resolution Error

59:29 changing. Let's apply patches. Let's see. Although the last-applied-configuration annotation there does show Kyverno integrating. Yeah. A seccomp profile. Oh, why is it back up? Oh, boy. Who is adding this Kyverno again? I swear I'm not in a machine and doing something. That would be evil. Sneaky. So is there, like, a boxer controller or, like, a controller that's installed? No. Don't see anything. The Hans controller. People on my Discord have suggested that we run a Twitter controller so that people could tweet at a bot and it runs commands

1:00:54 at the cluster. I've not been feeling that brave yet. Okay. Might do something nuclear here. Looks like not nuclear. I just might turn on audit logging. Let's get the... I'm trying to think if that's, like, too much. It's like, because I don't really know who the heck is doing it. Man, I could try this. I just hope. Yeah. I might do that. Alright. So something is causing Kyverno to come back. There's no controller in the cluster. So the thought process here is something on the host.

1:01:45 It would... well, it could be other hosts too. Right? If anyone that has visibility in this API server... so I... I mean, the other two nodes don't have permissions to talk to the API servers. Oh, they... okay. Gotcha. I mean, they can subscribe to the API server and the kubelet can run pods, but they can't create workloads unless someone has tampered with that too. But yeah, I don't think he's done that. And that was a good suggestion, David. Looking on the host. Was a suggestion? Let's say I did consider using a

1:02:32 service, but they are spot on that something is causing Kyverno to come back. Let's list all processes. I'm just, like... Is it ps aux without the dash? Or... You can do either. Yes. Okay. There. Sounds... but why not? I mean, I would think, like, what I always find telling, right, is that the things people debug in the other cluster are usually things related to what they've done themselves. So think about what Hans' questions were early on in your cluster. You mess with... oh, no. You did. Probably what you suggested. Sort of end up becoming completely paranoid. Was

1:03:52 at least sort of my... all of these things going through my head, debugging that, oh, can I trust this command? Can I trust the output here? Trust nothing. Yeah. Trust nothing. Oh, did he rename something? That would be sneaky. But I think you've already... I think you're the trigger, if that helps. Yeah. So let's see if I just now use this for that one. When does this come back? Yeah. It comes back. And so you use the service to sort of keep it running. Right, Zach? I think I use something else.

1:04:49 Yeah. I'm trying to... I mean, I just need to... You trust kubectl? No. I think there's just something. Maybe a cron job or something. I just wanna get all the processes. Why is ps aux not showing everything? It doesn't understand. Am I? What's up with the whatever session I'm in? I'm... Try what? That's weird. But what did I just do? Oh, I just got... oh, just did and that's enter. Tell me your... I have your session back. Oh, yeah. I didn't notice that behavior earlier.

1:06:03 I just created a session on Hans' control plane one and ran ps aux and see the interesting output. That's funny. I think I know what that is. I'm sorry. Yeah. Just open a new session. Yeah. I don't think ps is ps. Try running type ps. Wait. Let's see. He messed up the actual ps, but no. Because, like, I would see something in top of this. I would start by checking the .bashrc. I mean, this has gotta be a function or an alias. Right? Or what can it be if it's

1:06:55 not one of those? Oh, you enabled kubectl completion. That was very nice. Yeah. Yeah. Thank you. But you are on the right track that ps is not necessarily ps. Yeah. It could just be a script. Right? Oh, man. It's already in your trap. Alright. What are we doing here? Quite funny. K. So you're... I mean, ps should be a bash builtin. Right? So you can probably just remove that script altogether. Yeah. You can just move pss to ps, and then you're good. Let's remove this. Yeah. No. You just move

1:08:23 /usr/bin/pss to /usr/bin/ps to rename it. Is pss not built in? No. I guess not. Okay. There we go. I was losing my mind for a second. I was like, is aux not what gets all processes? Now I'm just looking in the field for whoever is doing something bad. Oh, okay. So we... Are we doing a...? But you did mention something earlier, Zach, something about cron. What did you... But I'm not sure if it would show up on the process list, though. Right. You can do crontab -e.
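
The trick uncovered here was a script sitting where the real ps binary should be. As a hedged illustration, the Python sketch below resolves a command on the PATH and looks at the first bytes of the file to tell a compiled ELF binary from a shebang script. The function names are made up for this example, and it does not detect shell aliases or functions (that is what running type ps in the shell covers).

```python
import shutil

ELF_MAGIC = b"\x7fELF"  # first four bytes of a Linux ELF executable


def classify_executable(path: str) -> str:
    """Return 'elf', 'script', or 'other' based on the file's leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == ELF_MAGIC:
        return "elf"
    if head[:2] == b"#!":
        return "script"
    return "other"


def inspect_command(name: str):
    """Resolve a command on PATH and classify the file it points to."""
    resolved = shutil.which(name)
    if resolved is None:
        return None
    return resolved, classify_executable(resolved)


if __name__ == "__main__":
    # On most Linux hosts ps is expected to be an ELF binary; a 'script'
    # result would be a reason to read the file before trusting its output.
    print(inspect_command("ps"))
```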

1:09:34 Did I mess with crontab? No. I did consider sort of messing with crontab as well to sort of remove when you're sort of viewing. But... Yeah. It does not appear to be a cron job. Or... but can you run a cron job without it being in crontab? Yes. Yes. I just, like, disabled your guide. Okay. Did you... I'm gonna... I don't know if you copied off the... You could disable the cron service and the atd service if you don't want any of these jobs to run. Oh, okay. Or you could just, yeah, break

1:10:34 his access with the admin.conf. I think both would be pretty successful. Unless I've copied the credentials. I know. I know. I was... that's what this... Let's rotate the certs. Come on. Yeah. Well, this would be a good point. Let's see Kubernetes had replication. You know? Server application. Could be okay. I would stop atd as well. What's that? atd. The at daemon. I still wanna break them up. Okay. Pass the back off. Bye bye. At least it's not changing anymore. But it depends whether that was modifying the pod

1:12:05 or the deployment. You may find this added something to the deployment that you need to purge. I'm assuming we maybe have a last-applied annotation on this. Maybe not. What is this error? Never even got to the bottom of that. Well, I thought it was the AppArmor or seccomp profile being added by... But maybe I'm wrong. Oh, we probably need to... I would have to kill this. So let... get rollout now without modifying that. If, yeah, if Kyverno modified the pod, then deleting the pod should... Oh, we got some Kyverno in the webhooks too.
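
Leftover webhook configurations like the ones spotted here can keep mutating or rejecting pods after the controller behind them is deleted. The Python sketch below is one assumed way to find them: it filters the JSON that `kubectl get mutatingwebhookconfigurations -o json` returns for names containing a search string. The function name and the sample data are invented for illustration and are not taken from the cluster on stream.

```python
import json


def webhooks_matching(config_list: dict, needle: str) -> list[str]:
    """Return names of webhook configurations that mention `needle`.

    `config_list` is the parsed List object from
    `kubectl get mutatingwebhookconfigurations -o json`.
    """
    needle = needle.lower()
    matches = []
    for item in config_list.get("items", []):
        config_name = item.get("metadata", {}).get("name", "")
        hook_names = [hook.get("name", "") for hook in item.get("webhooks", [])]
        if any(needle in text.lower() for text in [config_name, *hook_names]):
            matches.append(config_name)
    return matches


if __name__ == "__main__":
    # Made-up sample mimicking the shape of the kubectl JSON output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "example-kyverno-mutating"},
       "webhooks": [{"name": "mutate.example.svc"}]},
      {"metadata": {"name": "unrelated-webhook"},
       "webhooks": [{"name": "inject.example.io"}]}
    ]}
    """)
    print(webhooks_matching(sample, "kyverno"))
```

The same filter would apply to validatingwebhookconfigurations, since both resources share the metadata.name and webhooks[].name fields.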

1:13:37 You're a Go developer, Zach? I am a Go developer. What? Your single dashes. Yeah. Oh, no. I just... you know, when typing on livestream, every single keystroke I manage to get correct is surprising to me. Alright. I think you're gonna have to take a look at that deployment. Wow. Look at all those. Oh, we got it. Or describe the pod and see what the error message is. I can kill this one. Just receive a code. Keep it. It doesn't log anything. Right. I'm sorry. No. I guess... It's running, so your curl may

1:14:37 work there. Not quite. Yeah. It's hanging. Oh, man. There we met. It should be running on port 666. And there should be an iptables rule to forward 30000. There's nothing for 30000 here. Maybe not. Actually, I can't remember if kube-proxy runs in these clusters or whether I would feel silly and remember. There's a golden rule of running get services, and you never run get services on its own. Call my EP. I mean, anyone at least has service on its own is a braver person than I am. Right. It looks good. I think it

1:15:51 should work. Does this localhost have to work? Oh, wait. Wait. Where is it scheduled? Oh, yeah. Yeah. I would try the 10-dot IP address there. Yeah. You could use the cluster IP if you need to. Yeah. Here we go. Okay. Use the cluster. But then it's 666, right, on that IP? It should still be. I have it as the 30000. Alright. Cluster networking. And this should have worked by now. If you do kubectl get nodes dash... I was gonna... I'm also gonna go into one of these other nodes and see what's

1:17:02 happening. Yeah. You can run localhost 30000 on that. But I am not... Because it's a worker node that should work. I don't know if it's scheduled on this one. Yeah. The NodePort should have worked, though. That's two. Yeah. I don't think it's working because I've just tried it externally and it's not... Oh, it is. It's just very slow. Oh. Because... Yeah. It is. Database failure. Oh, database lookup. So we're getting a timeout. A failed to look up address information on the database. Okay. So we messed up the database.

1:18:18 I guess that was fair game. Let's go back. Just because you didn't open that in a browser, the error message is "failed to look up address: temporary failure in name resolution". Gotcha. So, okay. Is the application reaching out through a Kubernetes service? It is. It uses postgres as the service name. It's hard-coded in the app. Okay. This looks okay. It does look okay. Yep. How else can traffic within the cluster be manipulated? Yeah. Both database and DNS should be running fine. Yeah. There's a lot of ways to do that.
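
The error quoted above is a resolver failure rather than a refused connection, and that difference narrows the search. As a minimal sketch under stated assumptions, the Python below reports whether a lookup or the TCP connection is what fails. The `diagnose` helper is hypothetical, `postgres` is the service name mentioned in the stream, and port 5432 is only the usual PostgreSQL default, not something confirmed on stream. The resolver and connector are parameters so the logic can be exercised without a network.

```python
import socket


def diagnose(host, port, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=3.0):
    """Report whether name resolution or the TCP connect is what fails."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Resolver errors such as "Temporary failure in name resolution" land here.
        return f"dns-failure: {exc}"
    address = infos[0][4][0]  # first resolved IP address
    try:
        with connect((address, port), timeout=timeout):
            return f"ok: {address}"
    except OSError as exc:
        # Refused, unreachable, and timed-out connections are all OSError subclasses.
        return f"connect-failure to {address}: {exc}"


if __name__ == "__main__":
    # Assumed service name and port; adjust for the workload being checked.
    print(diagnose("postgres", 5432))
```

If the lookup fails but connecting straight to the service's cluster IP works, the problem sits in DNS or in whatever is blocking traffic to it, not in the database.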

1:20:37 That's it. What's this eDNS? Trust AP option. I don't know it. I deleted. If it's a DNS-specific problem, then you should still be able to communicate directly with IPs. I'm not sure what's inside the container, if there's anything like curl or something to test that. I changed the... is it hard-coded, the... It's hard-coded. It's off to Postgres. I would maybe look for network policy. Or just netpol for short. There's also CNPs and CCNPs because it's a Cilium cluster. CNP? That's the Cilium network policies and CCNP. Cilium. And excuse me.

1:22:39 Discovering and Deleting Cilium Egress Policy

1:22:39 Cilium policy. Yeah. That's a Cilium clusterwide network policy, which is blocking all egress traffic. Oh. From the cluster pod. That's pretty easy. And your favorite, the --all command? Hey. Let's see if we can get this upgraded to v2. Something's not happy now. Well, yeah. Good luck. Back. V2 already present on machine. Oh, man. This is the last one, and then you're home free. K. That's good. Oh, wow. You shrunk my image for me down to nine meg. I'd love to know how you did that. Probably. That image with a 200 meg video in

1:24:38 App-Level Check: v2 Image Already Present

1:24:57 it. I don't know how you got that to nine meg. Magic. You may find it easier just to do imagePullPolicy: Always from the control plane. Yeah. Yeah. Cool. That's your fault. So what is the nine meg, Hans? Oh, it's just a single Go binary that just stops and just has some error message, if you would have been looking at the logs, to just throw it off. Yeah. Up five more lines. That's it there. Yeah. IfNotPresent. Oh, yes. Now do we need it? Pull it out. Yeah. You'll just do... oh, I think it's

1:25:50 Hans' Cluster Fixed

1:25:50 fixed. Oh, it's ta da. Hey. Well done. It's fun. Alright. These are both very cruel people. Lots of cool breaks there. Yeah. The one was really funny when it kept coming back. It's like the convert because, like, from the dead. You know? Yeah. Is always not bad, like, and they're just like, it's back. But, yeah, I can't find the etcd break. We never got the answer. How were you hosting the... It's the /etc/cron.d directory, and it doesn't show up when you use crontab.
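
The reveal is that jobs dropped into /etc/cron.d are not part of any user's crontab, so the crontab command never shows them. As a hedged sketch, the Python below walks the system cron directories and prints every non-comment line it finds. The directory list is an assumption based on common Debian and Ubuntu layouts, and the function names are made up for this example.

```python
from pathlib import Path

# Assumed locations on a typical Debian/Ubuntu host; other distros may differ.
SYSTEM_CRON_DIRS = [
    "/etc/cron.d",
    "/etc/cron.hourly",
    "/etc/cron.daily",
    "/etc/cron.weekly",
    "/etc/cron.monthly",
]


def list_cron_files(directories):
    """Return every regular file found directly inside the given directories."""
    found = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # directory missing on this host
        for entry in sorted(root.iterdir()):
            if entry.is_file():
                found.append(entry)
    return found


def active_lines(path):
    """Return the non-empty, non-comment lines of one cron file."""
    lines = []
    for raw in Path(path).read_text(errors="replace").splitlines():
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            lines.append(stripped)
    return lines


if __name__ == "__main__":
    for cron_file in list_cron_files(SYSTEM_CRON_DIRS):
        try:
            for line in active_lines(cron_file):
                print(f"{cron_file}: {line}")
        except OSError as exc:
            print(f"{cron_file}: unreadable ({exc})")
```

The system-wide /etc/crontab file and any pending at jobs are separate places to check and are not covered by this sketch.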

1:26:24 Cron Job Location Revealed

1:26:41 There you go. Alright. Awesome. Thank you both for, a, taking time to break a cluster, which I know is not easy. It takes effort and time to think about how to do it and to actually apply it, and we've seen some really fun breaks today. And, again, I hope we never see etcd again. But, also, for joining me live in front of an audience and being vulnerable enough to try and fix some clusters. So I really appreciate both of you taking the time to do that. Thank you again. Hopefully, you know, maybe we'll see you back

1:26:45 Conclusion and Thanks

1:27:10 again in the future. But for now, I'll say thank you very much. I'll say thank you to Equinix and to Teleport and to everyone watching. Any last words, Hans or Zach, before we finish? It just reminds me how much I hate etcd. So kudos, hats off to you, Zach. Really, really clever. Really enjoyed it. Yeah. Yeah. No. Yeah. This is a lot of fun. And I love seeing Kyverno in the center of it. I really love Kyverno, actually. It's really great. So it's a good shout out to them. But,

1:27:43 yeah, that's a lot of fun, Briggs. This was fun. I enjoyed this. Alright. Thought it would. Alright. Well, thank you both again. I'll speak to you soon, and to everyone watching. Have a great day.
