About this video
What You'll Learn
- Walk through live debugging of failing Kubernetes clusters, from broken node status to service-level outages.
- Diagnose admission webhook failures and broken pod recreation behavior to restore stable application state.
- Fix kubelet, scheduler, API server, and Cilium policy misconfigurations blocking database and network connectivity.
Teams from Carta and Fairwinds debug each other's broken Kubernetes clusters live, tackling admission webhooks, rogue pod re-creation, kubelet and scheduler misconfigurations, an API server manifest typo, and a Cilium network policy.
Jump to a chapter
- 0:00 Holding screen
- 1:35 Introductions
- 1:39 Introduction & Welcome
- 2:50 Team Carta
- 2:57 Introducing Team Carta
- 5:40 Starting Challenge 1: Debugging Fairwinds' Cluster
- 8:21 Initial Break: Incorrect `kubectl get nodes` Output
- 11:05 Debugging Application Connectivity (Database Failure)
- 15:26 Identifying Admission Controllers
- 27:00 Debugging Persistent Webhooks
- 30:05 Debugging Rogue Pod Re-creation
- 43:26 Consulting Challenge 1 Hints
- 54:14 End of Challenge 1 Attempt
- 55:00 Team Fairwinds
- 55:21 Introducing Team Fairwinds
- 57:43 Debrief of Challenge 1 (Fairwinds Explains the Breaks)
- 59:51 Starting Challenge 2: Debugging Carta's Cluster
- 1:00:01 Initial Break: Cannot Access Nodes
- 1:01:35 Missing Pods & Deployment State Mismatch
- 1:05:04 Identifying Worker Node Taint
- 1:11:07 Using Hint 1: Fixing Kubeconfig
- 1:12:30 Debugging Scheduler Errors in Logs
- 1:19:03 Debugging Kubelet Configuration
- 1:22:28 Identifying API Server Manifest Typo
- 1:25:31 Finding Disabled Controllers
- 1:26:40 Application V2 Running (Goal 1 Achieved)
- 1:27:36 Debugging Database Connectivity
- 1:32:32 Using Hint 5: Fixing Scheduler Name Typo
- 1:33:57 Database Pod Running
- 1:34:10 Debugging Service Connectivity (Continued)
- 1:36:25 Fixing Cilium Network Policy
- 1:37:09 Application Fully Fixed (Challenge 2 Complete)
- 1:37:27 Wrap-up & Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:39 Introduction & Welcome
1:39 Hello, and welcome to today's episode of Clustered at the Rawkode Academy. Today, we have teams from Carta and Fairwinds who have broken some Kubernetes clusters for us, and they will attempt to fix each other's clusters live on today's session. Before we kick that off, there's just a little bit of housekeeping. First, thank you to Teleport, the sponsor of Clustered. And if you've not seen Clustered before, then, well, you're wrong. You should. Teleport is a tool that we've used since the very first episode. It allows us to have access in a pairing fashion through a web browser or through the command line client
2:17 and actually type in the same terminal together and pair on problems, an invaluable tool when debugging Kubernetes, and we're gonna see our teams using it today. So thank you to Teleport for that. Also, you should really subscribe to the Rawkode Academy. There's a button right below this video that will allow you to get updates on all new episodes of Clustered, and I do my best to try and provide tutorials, guides, and walkthroughs for all of the projects in the cloud native... cloud native landscape. I can't talk today. So please subscribe, follow along. And if you wanna
2:48 come and chat, there is a Discord server as well available at Rawkode.chat. Now I'm gonna stop talking before I make any more mistakes and jump over to introduce our wonderful first team. Hello, Team Carta. How are you both today? Hello. What's up, everybody? How's everyone doing? Better now that the pressure's off me and now firmly on both of your shoulders. So can you just do us a favor and just say hello? Tell us who you are and share whatever you wish to share, please. And start with you, Mario. Sure. Awesome. Thanks. My name is Mario Loria.
2:57 Introducing Team Carta
3:23 You may have seen me tweeting Meminetti's randomness and funny things, not incredibly serious. I am a CNCF ambassador, currently a senior SRE at CartaX along with this guy next to me. We're going to be showing you about great our skill set is in fixing things. I also do cloud native consulting from time to time, so I've seen a lot of different various scenarios. Really invested in helping the CNCF with cloud native TV, KubeCon conference, which was just announced the schedule as well, and really love the overarching community. My specialties are kind of in the networking
4:01 realm, and the nuances of doing things in auto scaling and just building resilient cloud infrastructure, of course, on Kubernetes. And I love all the little projects that come about, and things like Cilium really get me excited about the future of containers and orchestration. So really happy to be here, and thank you so much, David, for setting this up. Awesome. Thank you, Mario. And, Mudi. Hey, everyone. My name is Mahmud. I also go by Mudi. I'm based in New York City. I've been a senior SRE at Carta for a few months. But before that, my
4:38 background was more working with financial companies to setting up platforms, Kubernetes platforms, as well as AI companies to do large scale data processing. So I've I've been on mostly doing mostly doing software engineering, but more recently, been doing more SRE type of infrastructure type of work. So it's been exciting. I'm also very excited about the upcoming KubeCon. Hopefully, we'll see a bunch of you guys in the audience there. And, yeah, super excited about today. Awesome. And you've been on before and absolutely smashed it. Right? So this is gonna be a cakewalk. Oh, I don't know. This
5:24 team is a little intimidating because it sounds like we're working against four very talented folks from Fairwinds, so we're very excited to see what they have for us. Yeah. Alright. Awesome. Well, that's the niceties out of the way. We're gonna get into the fun of debugging Kubernetes. So I'm going to pull up my screen share here, and I'm going to attempt to use the Teleport CLI application for today, which allows me to list the nodes in the Fairwinds cluster. I am going to connect to the control plane like so. You both should be able
5:40 Starting Challenge 1: Debugging Fairwinds' Cluster
6:02 to join through your own session. And if you could just give me an echo hello, an echo Moody, echo Mario, whatever you'd like, let me know that you're in the session, and we'll kick things off for today. Thank you, Moody. So, Rawkode, I'd love to go off the beaten path, and I would love it if you made another session just for me and shared it on the screen. We have dual sessions, and people can watch and see what we're both doing at the same time distinctly. Now I think maybe Moody I'll let Moody drive, and maybe I'll do some research looking around,
6:34 and he can make some of the actual changes. But I'm gonna join this session right now. Thank you so much. This is a great idea. I just wanna make it more exciting for the fans. Well, this is a a clustered first. We have never had a split screen action on the go. So nice idea, and let's see what happens. Blades and trails. And I see that the Carta team have left you a nice little hello message as well. Isn't that sweet? Oh, yeah. They're saying, hello there, Carta. We're excited you're here. We left you fun treats and
7:04 three fun treats, and don't forget the hints. Cool. Thank you. So, Mario, shall we just see if we have access to this cluster? Should I drive, or would you... Go ahead. Take it away. Yeah. So we'll say kubectl get nodes, I think. Oh, okay. So we've got Mario taking a look at the hints, and we've got Mudi checking the nodes above. So at least you have a working control plane. Right? Of course. Good old Fairwinds with that RBAC Lookup. That's their own homegrown tool. Of course they'd play it into the process. I love it.
7:44 Well, I was gonna say wait. I don't think we should read the hints yet. Right? Oh, okay. I was just gonna ask. Okay. Alright. Yeah. You can do whatever you want. Right? It's up to you. I think we should read the hints when we get stuck. I think it's a little early to read those. Alright. Okay. If Fairwinds are telling you that, don't open the hints yet. So... Oh, man. Alright. I am gonna start your forty-five minute timer. Thank you, Noel, for the reminder. You almost had
8:15 infinite time there. We now have a little timer in the corner. Take it away. Good luck. So so I didn't see the hints, so we'll see how this goes. But the what I'm seeing is get nodes is returning what looks like pods. So these are not these don't look like nodes to me. So that's the first very suspicious thing that's going on. Can I get pods? Can I get anything other than pods? Namespaces. It looks even like it's a saved it's a saved output. Like, the seconds are not changing. So it's probably like an an alias or something
8:21 Initial Break: Incorrect `kubectl get nodes` Output
8:54 or a function. I'm gonna say which kubectl, see what that... so, yeah, let's go there, /usr/bin, and let's ls kubectl. We have the Fairwinds team in the chat rooting you on, I'm sure. Let's see. Let's see if I... oh. Nope. It is a binary. Gonna say file, just to see what kind of file this is. Statically linked executable. This is great. k9s on my screen is just showing up really well on the stream, isn't it? I'm sure everybody's loving this scrolling. That's wonderful. So I do have access, Mudi, to the
9:37 cluster. So it's definitely a kubectl-specific thing, it sounds like, or something static. Yeah. So I don't know why kubectl is acting this way. You're assuming it's kubectl. Right. So, Mario, what do you think we should do here? Or should we just use k9s since that client seems to be working? I kinda wanna just use k9s. I'm worried about... oh, it actually displays pretty well for you, yours. Okay. Yeah. Let's just use k9s. Like, I think most of the added actions and research is gonna be in there for the most part, I think. So
10:16 Okay. Yeah. Sounds good to me. So oh, I think the terminal is getting messed up now. It was working, and now it's, like, a little. Yeah. Uh-oh. Just give it a little How about how about now? A little drag. A little drag. Okay. How about now? Yeah. Nope. Let me just start over. Yeah. Nines. Oh, man. Having trouble getting it to fit on the screen. Yeah. That's because I resized my window. Sorry. I'll drag it back. Oh, there we go. Oh, there we go. Thanks. Nice. So okay. So let's first find out if the application's running, I guess. So
11:01 should we try to look at the services? Oops. Service. And so it's on Port 30000. Is is is there an ingress that we can test with, David? Or Yeah. I know he can click the entry and Yes. So if I pop over to the web interface, we do have the clustered application exposed securely via Teleport. I can click the button, and we will wait. Oh, yeah. I'm testing it. It doesn't look like it's gonna it's not doing anything, it looks like. So interesting. Okay. So let's see. Let's find out why. It did load and we got failed to connect to the database.
11:05 Debugging Application Connectivity (Database Failure)
11:58 So for anyone who's not familiar, the Clustered application is a small Rust application that speaks to a Postgres database inside the same namespace to display a quote. It cannot reach that database for now. Gotcha. So, yeah, let's take a look at maybe... if there's any logs. No logs. Okay. Maybe I'll describe the deployment and just take a look at what's going on in the image. So it looks like it's using the right image name here, so that's good. I wonder if it's like a network policy thing potentially. Or
12:42 So I do see a network policy actually, and it looks like it's got a pod selector for... it is allowing, though. It's allowing egress traffic to the Postgres database. However, I have a feeling, Mudi, maybe we just delete it, and you can view that if you do, in k9s, just a colon network policy. I have a feeling we should just nuke that one, though. The default one. Right? Mhmm. Yep. Yeah. It seems like it's denying egress, it looks like. Oh, you see a deny rule? Or is that an... oh, it's an allow
13:29 rule. I see allowing egress traffic. Yeah. But I think for safe measure, maybe just removing it would be... Okay. If it's not in the way. Let me delete it. And now let's test the application again. Oh, there we go. So the application loads now. Yep. Excellent. And you get the watch face video, which means we have version one of the Clustered application. Awesome. So let's see. Shall we try to upgrade it? What do you think? Yeah. I'd say let's just go for it, and then we'll see what breaks from there. And I'm actually reading the comments right now
14:07 too because this is fun, and it's a nice dopamine hit. And, Kelly wrote, so you don't wanna fix kubectl. And, my answer to that is I do, but, like, our main priority is upgrading from version one to version two. And so I kinda just wanna get that done. I definitely do wanna figure out what they did because I'm interested in how they broke that and got some static output, but it's a binary. It's kind of it's kind of interesting to me. So maybe we'll come back to that one. But the big thing great question, Khalid, is,
14:40 like, I just wanna solve the main thing that we're trying to do first, and we have API access. So k9s is, yeah, letting us do that. So here's what I just found out: I can't update the image tag for whatever reason. It keeps sticking back to v one even though I changed it to v two. So there we go. Let me try it again just to make sure I did it right. Oops. Yeah. v two. And if I describe it again, yeah, it still says v one. So admission controllers,
15:26 Identifying Admission Controllers
15:28 you think? Yeah. I think that's probably the the logical logical one. Now I'm interested. Yeah. Yeah. Indeed. There's a there's an admission controller right here. It's saying Okay. Well, I I can't really see it because it's the window's so small, but trying to yeah. Create the operations are create pods down here for a mutate. Yeah. So shall we just delete it? I think that's probably a safe bet. Yeah. Let's do it. The sledgehammer approach. Let me look at the deployment again and change to v two again. Okay. I don't think it worked this time either.
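When an edit appears to succeed but the image snaps back to the old tag, it can help to read back what the API server actually stored rather than trusting the editor. The following is a minimal sketch, assuming a working kubectl; the function names are made up for illustration, and the namespace, deployment name, and image tag in the usage line are hypothetical.

```shell
# deployment_images NAMESPACE NAME
# Print the container images in the deployment's pod template (space separated).
deployment_images() {
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# image_applied NAMESPACE NAME EXPECTED_IMAGE
# Report whether the expected image is what the API server actually stored.
# A mutating webhook that rewrites the image, or a validating webhook that
# rejects the change, both show up here as "reverted or rejected".
image_applied() {
  case " $(deployment_images "$1" "$2") " in
    *" $3 "*) echo "applied" ;;
    *) echo "reverted or rejected"; return 1 ;;
  esac
}
```

Usage might look like `image_applied default clustered example.dev/clustered:v2` right after an edit.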
16:30 Uh-oh. Interesting. So maybe that was like a that was like a red herring. I'm gonna look at the replica sets. Oh, so there's a bunch of replica sets here. It looks like the one that's running right now was it's forty hours ago. Right. Interesting. I wonder what happens with yeah. Oh, go ahead. I was just gonna say, I wonder if we just if we actually clear those all out and just let the deployment object recreate them naturally, so we we kill anything that's currently running and just let just to get any semblance of a modified replica set.
17:12 Just delete everything. Delete them. Delete the cluster. Just delete all all all of these. Right? No. Yeah. All I would say all of those. But, again, just kind of a a blanket strategy. Nothing too focused. Yeah. Let's see what happens. Do we okay. So we get a new one. We got a new pod. If I describe this one, we do it. Did it. Still v one. Okay. Let's look at the deployment again. Edit. It's a nice effect that changing the image, there's no new state in admission controller and it's still v one. I like that.
18:00 It's not doing anything. So it's not detecting any changes to the deployment. I'm thinking we should look at, like, maybe the controller manager. What do you think, Mario? I just took a quick look, and I didn't see anything. But, yeah, let's look through it once more. Maybe you'll see something that I did not. Yeah. So go to the manifests. Oops. Kubernetes manifests. K. I'm gonna do a vi kube-controller-manager. Swap file. Oh, are we both editing the same file or something? I wasn't editing, but I'm out of it now. So go ahead.
18:56 Oh, okay. Oh, there we go. So, yeah, I wonder if there's anything in here that's suspicious. The image... what image are we using? So... We may as well upgrade to 1.22 while we're at it. Right? I don't know about that. Wait. The control plane is running 1.21 now, or is it 1.22? 1.21, it should be running. 1.22 came out yesterday or today. So I was just making a joke about upgrading at the same time. YOLO. YOLO upgrade live. Yeah. So, yeah, I'm not seeing anything that interesting. We could look at maybe some of these
19:44 files. Leader elect true, port zero? Port zero was... Actually, that's normal. Yeah. That's normal. And then, yeah, nothing... Something that's really weird to me, and I don't know, maybe, David, you can verify this, is, if you go back in that file, Mudi, go to the options, if you will, and scroll down. There's a request header client CA file, and I don't remember seeing that on our cluster, but that might be something that's normal. I could be wrong, though. It's just kind
20:23 It is normal. That's okay. It is normal. Okay. Okay. Mhmm. Cool. Yeah. So okay. So let's think again. So what else is watching the deployment? So there's the so we checked the the webhooks. Should we look at validating webhooks maybe? I didn't really check that. Yeah. That's that's true. I although I I think we would get an error Yeah. That would It wouldn't be something wrong. They're laughing at you in the chat. I'm just just letting you know. That's all. I'm just a messenger. That's fine. Just a messenger. Yeah. Yeah. There there is one right here. So
21:04 I'm trying to see what it does, but should we just delete that one as well? Okay. Yeah. Oh, wait a second. I'm interested. Does it have what was its config? So, yeah, I'm here. So it says Just to see a bundle. Service is cube system admission. It would mostly fail. Apps v one create resources. Deployments are in there. Yeah. Side effects, none, and I'm not seeing anything about notifications. So I'm just very interested in how it's not, like, notifying us. The failure policy was fail. So you should and, clearly, get a message if you try and update the deployment.
21:49 Yeah. Okay. So Fairwinds is giving us a break here and telling us k9s doesn't show the validating admission webhook messages. We're gonna have to file an issue on that because k9s is supposed to do everything. So... You could always fix kubectl and, you know, look at it that way. I really wanna fix kubectl. Yeah. I'm looking forward to that. But let's first check this out. I'm gonna delete this guy, and then let's edit the deployment. I will be so impressed if it doesn't update the image, though.
22:25 Yeah. If there's even another break here, that would be interesting. Still? That's awesome. Well done, Fairwinds. Fairwinds. Mhmm. Really? I have an idea, but I'm holding it. I'm keeping it. Very interesting. Okay. Let's see. Yeah. It's still v one. I'm gonna look at replica sets. Just gonna kill this. Well, actually, no. Let me look at deployment and just make sure that it was edited or it was not edited. Yeah. Set is right. People are enjoying breaking kubectl these days. This is not the first time that's happened. Yeah. So Ivan's telling you, you should
23:12 have kubectl by now. And everybody wants to see you get kubectl working. That's all I'm saying. Should we just do that? Let's do that. Yeah. So I don't know where to start. So let's look at maybe, like, is there anything weird in the bashrc file? Take a look here. Wait. What was that? Oh, there we go. Look at that. It's right here. Uh-oh. Uh-oh. A function, because it doesn't show up in a which. I've used that break before in Clustered. Interesting. Thanks a lot, guys. We're just copying all the stuff other people have
23:56 done. No originality factor. Come on. kubectl? Oh, wait. What? You'll need to re-source your bash shell or maybe even just run bash again until you... yeah. Bash. kubectl. Okay. It's back. It's back. We have kubectl, guys. Nice. So now, yeah, what I was trying to do was edit. Yeah. Do an edit from there. Yep. And let's see what that tells us. Yeah. Clustered. So you're suspecting that k9s was actually tricking you here or leading you astray. Potentially. Right? It was. Look at that. Okay. So clustered could not be patched. Admission webhook denied.
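The kubectl break found in the shell startup file can be reproduced in a few lines. This is a sketch of the general technique only, not the exact function Fairwinds used: the canned output below is invented. It shows why checking the binary on disk found nothing wrong, while asking the shell itself with `type` reveals the function.

```shell
# Hypothetical reproduction: a shell function named kubectl that prints
# canned output instead of running the real binary.
kubectl() {
  echo "NAME        READY   STATUS    RESTARTS   AGE"
  echo "fake-pod    1/1     Running   0          40h"
}

# `which` only searches $PATH, so it never sees shell functions.
# `type` asks the shell itself, and functions take precedence over binaries.
kind_before=$(type kubectl)
echo "$kind_before"

# Dropping the function restores normal lookup for the current session.
# The definition in the startup file still has to be removed, otherwise
# every new shell gets the function back.
unset -f kubectl
kind_after=$(type kubectl 2>&1 || true)
echo "$kind_after"
```

Running `type kubectl` (or `command -v kubectl`) early is a cheap first check whenever a familiar command prints output that looks frozen or unrelated to its arguments.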
24:48 Sorry, Carta. Love, Fairwinds. No. I can't get that. Okay. Oh, there's an eighth burn from George there. It's not original, but you fell for it, Mario. Yeah. That's true. That's true. So let's look at the API server actually. So I'm thinking maybe there's some admission webhook plugins enabled. So maybe if we look at manifests, kube-apiserver. Oh, no. Someone else is editing the file. I'm in it right now as well, but I don't see anything that would lead me to... I don't see any other admission plugins, though. NodeRestriction only. Yeah. Wait a second.
25:38 I'm trying to figure out how they're... I mean, they're running some webhook somewhere for sure. I mean, the fact that they had a custom response saying love, Fairwinds. I mean, there must be a web server somewhere. So I'm gonna just look at all the pods that are running. Just get an overview here of what we have. Okay. I'm gonna do some research. Oh, there's... okay. API server. API server. Interesting. I love it when people say interesting because it means I have no idea what's going... I have no clue what's going on. Services. So I can share I
26:38 can share a thread I think is worth investigating if you want, or I can give you some other you can have some more time to kinda How are we on on time, I guess? You've twenty six minutes left. Okay. How about in in, like, if in five minutes we don't know, we we we Alright. Go for it. Oh, Mario, are you are you seeing Yeah. So the webhook? Had we looked at the validating webhook configurations yet or no? I don't I thought we deleted it. Maybe they did. I thought so too, but it's it's back.
27:00 Debugging Persistent Webhooks
27:14 Yeah. Great. Kubectl dot yeah. Yeah. Go ahead and do it from just kubectl. And so we had deleted the validating webhook and the admission webhook. Right? The the the mutating and the validating, and they're both back, it looks like. Okay. Right. Something is something insists on having those around. So let's try to delete them from here, actually. Yeah. So I'm gonna say delete mutating webhook defaults. Oh, sorry. Is it called one? Oh, there. Yeah. So it says deleted. Now if I get it again, done, Does it come back in a few minutes? It does. There it is. Back almost instantly.
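The delete-and-watch-it-return loop above can be captured in a small helper. This is a minimal sketch, assuming a working kubectl; the function names and the ten-second default wait are arbitrary choices for illustration. Resource names are passed in the `TYPE/NAME` form that `kubectl get -o name` prints.

```shell
# list_webhooks
# Print every mutating and validating webhook configuration as TYPE/NAME.
list_webhooks() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name
}

# webhook_reappears TYPE/NAME [SECONDS]
# Delete one configuration, wait, then report whether something recreated it.
webhook_reappears() {
  wr_name=$1
  wr_wait=${2:-10}
  kubectl delete "$wr_name" >/dev/null || return 2
  sleep "$wr_wait"
  if kubectl get "$wr_name" >/dev/null 2>&1; then
    echo "recreated"
  else
    echo "gone"
  fi
}
```

A result of `recreated` means a controller, pod, or host process is reconciling the object, so the next step is finding that owner rather than deleting again.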
28:05 Yeah. A nice comment there from Fairwinds. Kubernetes is the self healing system. And another comment from Ivan, customer custom admission controller response is all part of the Fairwinds customer service. There you go. So Rachel from the Fairwinds team has said if you need hints, let them know. What about We we also have those hint files for sure. Should we look at Cron? Maybe? Maybe. I don't I don't really Something is recreating those admission controllers. Right? Yeah. I'm trying to think what it could be because I didn't see any pods that were, like, suspicious in the cluster. I mean, we could search
29:00 around and kinda... Yeah. We could definitely look around. I was thinking maybe doing a, let's see, like, image. I'm gonna say describe pod everywhere and grep image. Oh, it's Image with a capital I, I think. Let's just do that. Okay. So here are all the images that are running. Can we see anything weird? I see something weird. Oh, there is a... wait. I see a Cilium op... wait. Hold on. That's on my screen. And yours, right here. Right? VIP. What's that, Luis? Do you see this image right here? The cluster... Data. Dev cluster
29:55 dev, Sudar Sudar Manjur. Interesting. I don't know how to read that name. Sorry. Yeah. You find that part. That seems suspicious. Okay. So let's maybe do a, I don't know, before five or something. And well, actually, we can just say dev. Cilium. A Cilium pod is running a clustered image. What's the name of that? Hold on. Hold on. Why? Oh, it's Okay. What namespace is that in? I'm gonna go further back. What's what's the name of that one? C four 50, whatever. I'm not seeing that. Oh, it's it's on the okay. So the pod name is CiliumJVKNR.
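Grepping `describe` output works, but a JSONPath query keeps the namespace and pod name on the same line as the image, which makes a mismatched pod easier to spot. The following is a sketch under the assumption that kubectl is working; the function names and the default search pattern are illustrative only.

```shell
# list_pod_images
# One line per pod: namespace, pod name, then its container images.
list_pod_images() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'
}

# suspicious_images [PATTERN]
# Keep only the pods whose line matches a pattern (default: "clustered"),
# for example an application image hiding behind a system-looking pod name.
suspicious_images() {
  list_pod_images | grep -i -- "${1:-clustered}"
}
```

Calling `suspicious_images` with no arguments would list any pod, in any namespace, that runs an image containing "clustered".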
30:05 Debugging Rogue Pod Re-creation
30:56 There we go. I was looking at the container. Yep. Okay. I'm gonna mute that. Yeah. So I'll switch to your screen to see what you're you see. So JV So this is the Did you tier one. Did you kill it already? I killed it already. And it's back. It's back. Same image. What's managing it? What's under the metadata? Good. That's a good well, it can't be is the VIP something that's there? Is what I use for ingress. Okay. Yep. Got it. Okay. That's what I was figuring. Yep. So this is the the the main data inside is fine.
31:36 So something is managing that That image. This is what it appears to be because I have a feeling. No. That is the actual operator. Okay. I'm gonna try to describe the well, is so the pod got a new name when you deleted it? Is that what happened? A new pod spawn, which means I'm guessing there's can't do that. I'm guessing there's a controller behind it. Yeah. So if you describe it, right, you should be able to see what's managing it and what created it. Long night terminal is weird here. So It was this one. I'm assuming API tier control plane.
32:31 Fairwinds worker two. Alright. So, yeah, are you describing it, or should I? I'm looking, yeah, I'm looking at the describe output right now. It's gonna be assigned, pulled image. I'm gonna try with the... Okay. -o yaml as well. Alright. Interesting. Control plane. Do you see... I'm trying to see if there's anything. So, David, you said somewhere in the metadata, we should see something, maybe? So I wanted to know where it was getting its name from. But if you look under the spec, it's using the generate name. Yeah. So it's making it look like a pod
33:43 even though it's not a pod. Yeah. And I'm trying to see if if there's any thing that says how it was scheduled or who scheduled it. I wonder okay. So I wonder if we can see anything in the events eventually. No. This is sixteen minutes ago, so nothing oh, perhaps in the namespace cube system. Right? Because events are namespace. So here we go. Oh, did you stop a container three minutes ago? I guess that's the one. The the seven f is the new one. Yeah. So there's no owner ref. So yeah. It it's hard to trace this.
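The owner check described above can be done without scrolling through the full YAML. This is a minimal sketch, assuming a working kubectl; the function name is hypothetical, and the pod name in the usage line is only a guess at the spelling of the name read out on screen.

```shell
# pod_owner NAMESPACE POD
# Print Kind/Name for each ownerReference on the pod, or "<none>" when the
# pod has no owner. A pod that keeps coming back yet reports "<none>" is
# being created directly by something outside the usual controllers.
pod_owner() {
  po_owners=$(kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}')
  if [ -n "$po_owners" ]; then
    printf '%s\n' "$po_owners"
  else
    echo "<none>"
  fi
}
```

For example, `pod_owner kube-system cilium-jvknr` would print something like `DaemonSet/cilium` for a pod managed by a DaemonSet.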
34:35 Yeah. So yeah. It doesn't look like it's a cron thing. It doesn't look... You have eighteen minutes. Did we check for cron jobs in the cluster? Sure. I wonder if that's what's going on. Get cronjob. Nope. Any job in the cluster? Oh, man. I think it can only be one of two things. Right? Either there's a rogue process on the Linux host or... Yeah. The API server image is not the API server image. Which I thought we checked. Did we? Let me... Well, you checked the manifest has the correct image name, but not that the image
35:27 on the host is necessarily the correct image. Oh, okay. So could we use, like, what, containerd? Yeah. Yeah. You can use crictl images, I think. Okay. You'll need the runtime endpoint. I can type that for you. Oh, thanks. Yep. Var run containerd, containerd. So kube-apiserver, controller-manager. I mean, we could potentially inspect it. I mean, you could re-pull it and see if the image ID changes. And if it doesn't, there's... well, I mean, I don't wanna say too much. I'm here if you wanna ask questions, or there's hints as well. Yeah. Thank you. So I'm
36:28 just gonna copy the runtime flag. There's a favorite trick on Clustered that's been done multiple times with images, which I can talk about if you want. But I'll give you some more time. Thank you. Maybe is there an inspect? Yeah. Inspect images. We're gonna try that. So if I say inspect image and then copy the kube-apiserver, and it was version 1.21.2? Okay. So let's see what we got. A nice comment from Noel as well. containerd is not distributed, but I don't think that applies in this
37:17 case, Noel, because the API server is running on this machine. There's yeah. There's only one control plane node. Right? So Yep. I mean, this is this looks like the right instructions as far as I can tell. Doesn't look like a random Docker file. Kubernetes authors. I mean yeah. This I mean, it looks like the right image as far as I can tell. Mhmm. Should we look at controller manager, maybe? I mean, I I actually think we're I don't feel like that's what's going on, it could be. Mario, what do you think? Yeah. I kind of agree. I don't really
38:04 feel like that's what's going on, but we also could be wrong. So Yeah. Go ahead and take a commander. Unless it's like Not really seeing any problems, like, weirdness with the API pieces. So or I'm sorry. The cube system pods for the most part. So Yeah. They all look pretty normal. Right? Right. So let's step back and see what what was the ish the last issue we faced. So we were we're trying to delete the webhooks, and somehow they still come back. Something's creating them, and we think we we tried seeing if it was the
38:49 images, and the images look good, at least the API server image. We could probably check the others, like controller manager. But yeah. I guess we could look at the host processes and see if there's, like, something in the background that's running. Let me try that. Let's just... Let me fix that for you. Oh. So... Oh, okay. Thank you. Well, Fairwinds in the chat have suggested that logs are often useful. And Rachel from the Fairwinds team says trust your intuition. Which one of them? Which intuition? There were so many of them, but let's take a look at the... yeah. I'm
39:53 looking if there was any processes that have been kind of running in the background doing Alright. Thirteen minutes. So oh, the other thing we I just remembered there's a pod that every time we kill comes back as well. Right? The Cilium dev one. So should we look at the logs of that pod eventually? Shoot. It's one of the Cilium more recent Cilium ones. This one. Yeah. Oh, yeah. I see here validating webhook. This is definitely receiving the requests, it seems, for the for the validation webhooks. So Yeah. So it's probably the the yeah. It's probably what's creating
40:48 those back. Is there a... what's it called? Can we edit it so it can't get scheduled, possibly? Oh, not sure. What am I trying to run here? I'm trying to see, there was an API resource for... Mhmm. Was it called API resources? What was it called? Oh, wait a second. Your command is not found, which is weird. Oh, thank you. Did you set up an alias? Yeah. kubectl api-resources. Yeah. Can I get out of my session? I'll let Mudi drive. Thank you. So, yeah, I'm taking a
41:35 look here. So I'm trying to see if there are any API services. That's what they are. So get apiservices. I keep using the alias. Oops. So auto scaling. Are there any weird API services here? No. Looks normal. Yeah. I can't figure out what schedules that pod. Are there DaemonSets? Did you check the process table? Did you go through it looking for rogue non-Kubernetes processes or even rogue non-Kubernetes containers? So do you mean like this? Like, yeah. Like, literally just running that and going through it and seeing if there's anything that doesn't look like a standard
42:39 Linux process. Which is always the needle in a haystack. But, I mean, something something that's running on this machine. Yeah. I try Bash. I mean, this looks like our sessions, and then if we go up a bit Oh, you want me to scroll? Sorry. Or Oh, I I'm do you see me the do you see me scrolling? Or No. I I we won't see you scrolling, unfortunately. You can just let us know what you find. Yeah. Yeah. Go ahead and, yeah, scroll up. You you do have hints, remember. We can always read those things. You know what? Let's maybe
43:26 Consulting Challenge 1 Hints
43:26 we should use that now. The Venus is suggesting the bpfilter is a rogue process. No. I've thought that before in previous episodes. It's always there, so now I ignore it. I think we're okay there. Yeah. Let's check out those hints, I think. Yeah. So what are we checking? Sorry. I got lost. I think they're in the root directory, but you can check out whichever ones you want. Alright. Let's start from the top. Hint one. Having fun with functions yet? Yeah. We've already figured that one out, I think. So as slow as our
44:15 the RBAC lookup will help you find RBAC permissions that are difficult to discover. I think we're past that one, or it doesn't seem relevant right now, would you agree? I'm not sure what it actually solved. What was the issue for the RBAC lookup one? I don't really know. I'm wondering if it's something with the webhook or something like that, but I would rather solve the issue in front of us. Wait a second. Okay. So this is a command. I just realized something. Static pods. Could it be a static pod?
44:55 Yeah. That's kinda what I'm thinking, but I'm not sure where, like, it's defined at all. And I think it's in here. So I looked. I didn't see it in there at all. But there's no static pod at all. And the manifest directory? I mean, you could store multiple definitions in a single file. And I didn't see any in there, and I looked through them as well. They all looked valid. Well, you know what to search for? Like, clustered. So... Oh, right. True. Yeah. I didn't I
45:30 didn't grep. Yeah. You could definitely grep. Yep. /etc/kubernetes/manifests. Term first, I think. Right? Yeah. It might be term first. Yeah. I'm gonna just say dot dev. Yeah. Nothing. So what about Cilium? Nope. Yes. It's not a static pod. Alright. At least we ruled something out. So back to what we were trying. You've got more hints? Yeah. Let's look at hint three. Did we look at that one? Oh, yeah. I think you've identified the rogue pod, but you're not sure how it gets there. Right. Yeah. The best practice prevent attackers from injecting malicious content and doing extra damage.
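Note: the static pod check discussed here amounts to searching every file in the kubelet's manifest directory for a term. A minimal Python sketch of that search follows. It assumes the directory is `/etc/kubernetes/manifests` (the path named in the conversation) and that `clustered` is the search term; the function name `find_term_in_manifests` is made up for illustration.

```python
from pathlib import Path


def find_term_in_manifests(manifest_dir: str, term: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for each line containing term, ignoring case."""
    hits = []
    for path in sorted(Path(manifest_dir).glob("*")):
        if not path.is_file():
            continue
        lines = path.read_text(errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if term.lower() in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Assumed path and term; prints nothing if the directory is absent or clean.
    for name, number, line in find_term_in_manifests("/etc/kubernetes/manifests", "clustered"):
        print(f"{name}:{number}: {line}")
```

Because a single file can hold several YAML documents, scanning line by line catches a definition hidden behind a `---` separator as well.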
46:34 Go to this URL. Okay. I don't really know if this has anything to do with what we're facing. I don't know, I could be wrong. It's definitely... I think this is it. Yeah. This is the last hint. So the Fairwinds team is saying that the call is coming from inside the house. And you have six minutes. Fun. The call is coming from inside the house. Let's try. Sorry. I'm gonna... Yes. Should work. Right? ps? Oh, no. I just can't spell runtime endpoint correctly. Oh, ps is fine.
47:54 So yeah. So there's no... Two days ago, twenty six hours ago as the Cilium agent. So that's suspicious. I mean, we checked that and that was... I mean, you could look at the logs maybe too, but I'm pretty sure we looked at that and that was an issue. You could actually do a describe. I think crictl has a describe or something. I know Docker did info. I don't know. But I wonder, yeah. Let's try that. You said, like, a describe on the container ID. Is there a describe command?
48:42 I don't know. You'd have to scroll up on the help output maybe. There's attach, create... There's inspect and inspectp. Yep. Yep. I was just gonna say inspect is your go to probably. inspectp and then... You could add in the pod ID. Is this it? Oh, no. Pod ID. The other one on the end. Mhmm. There we go. So what do we look for now? So this one, we suspect, the one that has the bad image. Right? The colon dev image? Correct. So let's find it. Wow. We probably should have grepped.
49:39 Yep. No. It's not in there. Mhmm. Okay. I actually grepped in the bottom, but you guys can't see. But... Oh, did you? Okay. Okay. Got it. The pod recreates itself. So what about containerd's configuration, actually? Maybe? Should we try that? Or is it... My initial guess was that they were using a containerd feature to pull in their own image instead of the official one, and their API server was rogue. But I'm not entirely convinced it feels like that anymore. I don't think that's what's happening at all. Yeah. Do you guys think it could be, like,
50:23 the containerd configuration itself? I mean, you can run containerd config dump, and it will allow you to take a look and see. So same command. No. containerd, the actual daemon process, then config, space, dump. Failed to load TOML file. No such file. That normally works. Is there any other... oh, the directory is not there. Yeah. I mean, maybe I just forgot the command, but I thought it was config. Yeah. So we just don't have one. It normally prints up a. I don't know if that's something new, different, or not. But
51:16 yeah. If you don't have a config, then it's unlikely they're using that cheat to pull in the alternative image. And you've got one minute and forty five seconds. This is a tough one. I gotta say, really tough one. Cilium operator. Yeah. Do Cilium namespace, dash o wide. Oh, can you repeat? Sorry. Get pods in the Cilium namespace with a dash o wide on it. Yep. Oh, it's in kube-system. Right? Right? Oh, yeah. It's... What? So I was... it's not like it's a rogue Cilium process. Something is modifying the Cilium pod only on the control plane
52:23 node with that image. Like, if you just get all with output YAML and filter for clustered, we only see one. Right? Yeah. Let's see. Describe pod in kube-system. Wait a second. You find something? I don't know what we did, but I was messing around on k9s, and I was able to change the version to two, and it's sticking. Oh. So I don't know if we did something where we disabled it. So all I did is kill the pod that had come back again, and nothing else came back. And then I
53:05 edited the deployment, and it looks like it has successfully stuck to v two. And you can go view that. Is that the proper version, Rawkode? Normally, it takes a couple of refreshes and stuff before it finally Gotcha. Fills it. Right now, it looks like v one. No. It still says v one. Yeah. I'm not sure. Yeah. I could be something else as well, but you do a describe on that pod, Moody. Describe pod? Sure. Yeah. Which pod? Yeah. Which pod? Or just describe on the the deployment, whatever you wanna do. Yeah. Clustered pod. Extra PO.
54:05 Oh, sorry? We've got an extra PO. You can just delete that. Oh. It's still v one. Yeah. Alright. We are out of time. Yes. Oh, man. That was tough. That was tough. Seen Street was The hint did not tell us much. I couldn't tell from the hints what's going on, but thank you, Fairwinds team. That was fun, though. Yeah. Well, the Fairwinds team, I'm sure they'll tell us in just a moment what happened, and there'll be a pull request to the repository with the breaker notes. And Rachel from the team is saying you were so,
54:14 End of Challenge 1 Attempt
54:42 so close. So I'm gonna have to say goodbye to you both, and we'll find out what it was soon, and we're gonna tackle cluster number two. So thanks again. It was an absolute pleasure, and I'll speak to you both soon. Thanks. Awesome. Good luck, guys. Bye. Yep. Alright. I will pop over here and encourage the Fairwinds team to come and join us. And I'll also take a moment just to say thank you to Teleport for sponsoring Clustered. If you want to support the show, you can go check out Rawkode.live/teleport and check out their awesome product.
55:00 Team Fairwinds
55:14 You've seen it in action. I encourage you all to check it out. Alright. Let's pop over. And we have our first member of the team. Hey, Rawkode. How are you? I'm good. How are you? Oh, there we go. Alright. Your team is now filtering in. Hello, Rachel. Hello. How are you? You guys are a mean team, I gotta say. Carta, you guys are amazing. Y'all did such a good job, and y'all were so, so, so close. So thank you for being good sports because, honestly, that was amazing. Just how close were they?
55:21 Introducing Team Fairwinds
55:59 It's hard to tell. I didn't see how the pod actually got deleted. So I think they were close. But They they have one more item to do. And yet hint number four four. That was the last one for us to fix. Ah, unlucky. Alright. Well, I'm pulling up our cluster now. Oh, sorry. Luke is here too. Hello, everyone. Alright. Let's start with a little bit of an introduction. We'll just start with Rawkode and and kinda look around to Luke, Andy, and Rachel. But if you could just tell us your name, who you are, and maybe something you like.
56:45 Well, I'm Rawkode. I'm just a, you know, engineering manager. Just joined Fairwinds recently, and really excited to be here. Rachel? Hey, everyone. Rachel Sweeney. I've been at Fairwinds about a month. Absolutely loving it. I work with customers, help them with all their problems that come up. Yeah. Super excited to see what Carta has done for us and dive into this cluster. Andy? I'm Andy. I've been with Fairwinds for about three and a half years, working with Kubernetes for five, and something I like is Kubernetes, quite honestly. I know that sounds silly, but I really enjoy it. So go ahead, Luke.
57:29 Hey, y'all. Luke. I've been here for a couple years. Mostly working on internal tooling and open source stuff and still doing some SRE stuff as well. Excited to try out this cluster. Awesome. Perfect. Alright. People are asking in the chat. How was that pod getting there that was doing the naughty thing? I posted the code in the chat. There's a link to the repo that I just opened up for everybody, but it's just a static pod, or not a static pod. Sorry. A naked pod, but it catches a SIGKILL and recreates itself and catches the signals and then creates a
57:43 Debrief of Challenge 1 (Fairwinds Explains the Breaks)
58:05 new version of itself every time you try to delete it. So you scheduled the first one manually, and then it would schedule itself autonomously after that. That is so sneaky. I like that. And then the one last... Sorry. Go ahead. The last piece is that we took the v one image that needed to be updated and we tagged it as v two and took the image pull policy and set it to only if present, or IfNotPresent. Nice. Alright. Very sneaky. Okay. Let's pop over to our screen share. Hopefully, you all have access to this.
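Note: the self-recreating pod described in this debrief relies on trapping a termination signal. The team's actual code is in the repo linked in the chat and is not reproduced here. The Python sketch below only illustrates the general pattern, under two stated assumptions: the signal handled is SIGTERM (a process cannot catch SIGKILL, and deleting a pod normally sends SIGTERM first), and `request_respawn` is a hypothetical stand-in for the Kubernetes API call that would create the replacement pod.

```python
import os
import signal

# Records each respawn request so the behaviour is observable without a cluster.
respawn_requests = []


def request_respawn(reason: str) -> None:
    """Hypothetical stand-in: a real pod would ask the API server for a copy of itself here."""
    respawn_requests.append(reason)


def on_sigterm(signum, frame):
    # Runs when the process is asked to stop; SIGKILL would never reach this handler.
    request_respawn(f"caught signal {signum}")


signal.signal(signal.SIGTERM, on_sigterm)

if __name__ == "__main__":
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the stop request sent on pod deletion
    print(respawn_requests)
```

The second break mentioned here, a v1 image retagged as v2 with a pull policy that reuses any local copy, is independent of the signal trick: the node never pulls the real v2 image because one with that tag is already present.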
58:39 I am going to run tsh ls and I'm gonna SSH onto the Carta control plane one. You should all be able to connect to this session. If you can give me an echo hello or something to let me know that you are there, and we will get started with this. And a lot of love and oh my gods for that SIGKILL capture reschedule thing. That was super sneaky. Alright. We've got all of you there, I think. So I would suggest you export your kubeconfig, check for our control plane, best of luck, and let's go for it.
59:25 Cool. Andy, are you driving? And I will start your timer. Oh, yeah. The teleport command lines is not like them, so I am gonna join from here too. One. Alright. They did leave us some hints. I'm gonna leave those for now. Let's just see what we've got here. Alright. Can't access resource nodes. So let's take a look at our... oh, so it looks like they've moved the original admin config. Let's just take a look at the one that's in the dot here. I think I'll just list the one we have. Let's see. Let's try the .admin.conf.
1:00:01 Initial Break: Cannot Access Nodes
1:00:57 Have the control panel? Access. Hey, John. Well done. See, he's all there. I don't know what's wrong with this one. And looks... well, the user was changed from kubernetes-admin to cluster-admin. Nice. Alright. So are we trying to update it? That's what I was gonna do. First, I was gonna look for any rogue pods, but that's probably just me thinking along the lines of what we did. There is nothing in the default namespace. Is the app just not there? Is that... Seven hints. What have they done to this cluster? I don't know. Can we check and see if the app has
1:01:35 Missing Pods & Deployment State Mismatch
1:01:45 even loaded? We can. Let me pop it open. Do we have a deployment? The deployment exists in the cluster. Oh, but there's no pods. Got it. Okay. Yep. Alright. Let's do describe here. That's interesting. It says one ready, one available, one up to date. That's a true none. Yeah. That's interesting. Yeah. That's a nice effect as well. Can we check RBAC? Can we check RBAC on the user that we're authenticated as? Yeah. See if we can view pods. So the pod's running, but we can't see it. Yeah. I don't make the SDF available on my Kubernetes clusters normally.
1:03:03 Oh, I thought I had it in my bash. Wanna see. But it must not happen. Oh, I forgot to install it. That's right. I have it sourced, but not... looks like we had a cluster. We were the kubernetes-admin user. So what are you looking at right now? I was just looking at RBAC. So your hypothesis is that kubectl get pods isn't working because of permissions. Right? I think so, but that doesn't make a ton of sense because I'm not getting the... I mean, you could run getting any pods. A different error. auth can-i. Right? That
1:04:00 that would show it. Yeah. And I did dash a, and I can see pods in other namespaces, just not the one for the replica set. That's what made me think it might be scoped. So it would be maybe a role binding, not a cluster role binding for the namespace. What would make the API think the pod exists? The pod doesn't exist from the perspective of get pods, but not the perspective of the deployment. So we can see the deployment, you said? Yeah? Yeah. Do we care if we can see the pods? Can we just try and edit the
1:04:41 deployment? Oh, we get an internal server error from the app, potentially. I didn't actually see... We did get an internal server error. So it is not running. It's definitely not running. That's correct. Yeah. Does it look like Postgres maybe? Or it could be on a worker node. As I said, do y'all see the container itself running on the worker nodes? Take a look. There's a taint on the worker nodes for the clustered app. There's what? There's a taint for the clustered app. NoSchedule on the worker nodes. Okay. That's just so it doesn't get scheduled
1:05:04 Identifying Worker Node Taint
1:05:28 on the master. No. It's on the worker nodes. So it can't be scheduled on the worker nodes. Oh, but then why would the deployment say it's ready? Yeah. So there's one ready, but we can't see... Really freaking me out. Got it. Okay. So let's see if I can remove those taints first. I need to resize my window or something here. Hold on. Yeah. Yeah. Let's see what's in the scroll disappears, if you just kinda resize a millimeter, it kicks it back into shape normally. Got it. Okay. It definitely looks like I was doing some
1:06:52 weird stuff there. Still no pods, though. Yeah. And there is a selector in the deployment. And that selector... that selector is normal. It's matching the pod to the... it's already v two. It's doing your job for you. What's going on? Yeah. Could we do the other jobs? How nice of you, Carta. Did y'all find the container running on the worker nodes by chance? Nothing's obvious. Nothing's running. Nothing's staying running on... nothing's on worker two except for Cilium. Unless it's hidden, it's something like Cilium.
1:08:10 The database is also not running. Same deal. It says it's got one ready pod. We should maybe look at logs. Yeah. API server logs. Some of our own advice from the last one. Alright. Somebody's gonna have to have me with control. Is it just... there we go. Wait. On our cluster, there was actually a pod for the logs, or for the API server. Sorry. Now the container's running, but there's not a pod for it. I'm not missing it. Oh, you're not. Alright. Because node name Carta worker one
1:09:51 does not exist. Is it specifying the node in the deployment maybe? It does have a node selector for app clustered. That's not a node selector. That's the deployment selector for the pod, isn't it? People are always so mean to these clusters. Very interesting. They really are. Come on. Is it time to dive into a hint? I'm pretty sure. Yeah. That get pods not showing anything is interesting. Yeah. So there is a node selector in the deployment, Andy. Yes? Okay. The hint one was the first thing we found. The broken admin.conf. Right? Yeah. Seems... Oh, there it is.
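Note: the worker-node taint found above keeps a pod off a node unless the pod carries a matching toleration. The Python sketch below models that matching rule as Kubernetes documents it (an empty toleration effect matches any effect; `Exists` ignores the value; `Equal` compares key and value). The taint `app=clustered:NoSchedule` used in the example is an assumption, since the exact key and value are not readable in the transcript.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists with no key tolerates everything; with a key it ignores the value.
        return not key or key == taint.get("key")
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def blocking_taints(node_taints: list[dict], pod_tolerations: list[dict]) -> list[dict]:
    """Return the NoSchedule/NoExecute taints that no toleration on the pod matches."""
    return [
        taint
        for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]


if __name__ == "__main__":
    # Assumed taint for illustration only.
    taint = {"key": "app", "value": "clustered", "effect": "NoSchedule"}
    print(blocking_taints([taint], []))
```

A pod with no tolerations is blocked by any such taint, which is why removing the taint from the workers is the direct fix tried in the session.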
1:11:07 Using Hint 1: Fixing Kubeconfig
1:11:16 There it is. We just removed that. Yeah. Yeah. We don't need that. Uncurrent, undesired, zero running, zero waiting, zero succeeded. What is... Alright. Let's go for two. There's seven of them. Right? You got that one. Yeah. Alright. Love the hours, by the way. Good job, Carta. Find the birthplace of the unborn pods. Do pods have parents, and are the parents still alive? Interesting. Pods. You have parents? Control. Right. The scheduler's not there either. Oh, no. It's right there. Oh, it's back. Yeah. There's a scheduler. Maybe we should check the config. Moody in the chat says it's payback time.
1:12:30 Debugging Scheduler Errors in Logs
1:13:01 That's fair. Fair. That's fair. Fair? That's fair. system:kube-scheduler cannot list resource configmaps in the API group in the namespace kube-system. So maybe check logs of. And we've got a bunch of failures to watch, failures to list, so we should check the permissions on the scheduler. What do you think? What were you saying, Luke? Sorry. That we should look at the log of kubelet. I'm looking at them now on worker two, and there's some errors. So I might need to look at the configuration
1:13:51 if we wanna open up. Failed creating a mirror pod? What about the RBAC in the kubelet config? I think mirror pods are related to static pods. That is correct. I've never known why it calls them mirror pods, but they're static pods everywhere. No. I don't know if that's it. I do need rbac-lookup. We've relied too much on our own tools. Feel free to show off. I don't think I've seen rbac-lookup before. Show me the coolness. Get it. I'm gonna get it. So they... You really don't see any cluster. Carta's cluster does not seem to have Fairwinds
1:15:14 Insights installed. No longer is it. Oh, boy. Sorry. Could not resist. Shameless. Shameless. You know, it's software we create for best practices, cost analysis, stuff like that. So I'm not seeing anything in static manifests that looks unusual. So what does rbac-lookup do? So rbac-lookup is giving me... I just searched for kube-scheduler, or just for scheduler. It's giving me the service account kube-system kube-scheduler, which has the role kube-scheduler, and the role binding system::leader-locking-kube-scheduler. So it's basically just finding all mentions of scheduler across RBAC and then linking them together
1:16:02 so I can see, like, roughly where the permissions for the scheduler come from. So it has the role system::leader-locking-kube-scheduler, and system:kube-scheduler for the cluster role. So maybe we marked with cluster role system. Yes. Get, get. Get, get, get. Classic. This looks vaguely correct. What was the... alright. So that's the wrong path. We go to a worker node and look at the logs of kubelet there. I mean, they're able to join the cluster, but I see some errors and some other things. This is... yeah, there's a kube-apiserver error.
1:17:28 So I probably don't see that pod. I've opened a session on worker one, if anyone wants to drive on it. I can switch between them. Oh, great. So it looks like the kubelet was stopped a couple days ago. And then when it started up, it just started spewing these. So... Yeah. Let's look at the service itself, maybe, the definition. Let's see. In a while. Looks like the API server and the controller manager were modified. At least they have different timestamps than some of the other pieces. Okay. There we go. Kubeconfig. So what lead are we chasing right now?
1:18:53 Is this the scheduler complaining it can't read config maps, or are we looking into something else now? I'm looking into the kubelet configuration right now to see if it's been messed with. Don't see any pod security policies. Static path is as expected. I do find it interesting that we don't have the API server in the pod list. Yeah. That it's running. Yes. It does not like me right now. There we go. Definitely a bit stuck. Anybody got any issues here? Something still tells me it's. But yeah. Do you call that? So it's failing to
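Note: the lookup described here, finding every mention of a subject across RBAC bindings and linking it to the roles it receives, can be sketched in a few lines. This is not rbac-lookup's implementation, only an illustration of the idea over plain dictionaries shaped like binding objects. The sample bindings are modeled on names mentioned in the conversation and should be treated as assumptions about this cluster.

```python
def find_bindings(bindings: list[dict], term: str) -> list[tuple[str, str, str]]:
    """Return (subject, role, binding name) rows for subjects whose name contains term."""
    rows = []
    for binding in bindings:
        role = binding["roleRef"]
        for subject in binding.get("subjects") or []:
            if term.lower() in subject["name"].lower():
                rows.append((
                    f'{subject["kind"]}/{subject["name"]}',
                    f'{role["kind"]}/{role["name"]}',
                    binding["metadata"]["name"],
                ))
    return rows


# Assumed sample data for illustration; a real run would load bindings from the API.
SAMPLE_BINDINGS = [
    {
        "metadata": {"name": "system:kube-scheduler"},
        "roleRef": {"kind": "ClusterRole", "name": "system:kube-scheduler"},
        "subjects": [{"kind": "User", "name": "system:kube-scheduler"}],
    },
    {
        "metadata": {"name": "system::leader-locking-kube-scheduler", "namespace": "kube-system"},
        "roleRef": {"kind": "Role", "name": "system::leader-locking-kube-scheduler"},
        "subjects": [{"kind": "ServiceAccount", "name": "kube-scheduler"}],
    },
    {
        "metadata": {"name": "cluster-admin"},
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": None,
    },
]

if __name__ == "__main__":
    for row in find_bindings(SAMPLE_BINDINGS, "scheduler"):
        print(" -> ".join(row))
```

A scheduler that "cannot list configmaps" would show up here as a subject whose linked roles lack that rule, which is the gap the team was looking for.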
1:19:03 Debugging Kubelet Configuration
1:20:30 create it's failing to create the API server pod here over and over and over again. So that's quite likely our issue. Although, if I did pause that the API server is responding to me at all, what were you saying, Luke? It says I'm available to validate pod request there. So maybe there's something wrong with manifest. Something else here that we missed, maybe. Nope. That's just responding to me over and over and over again. We just got a meeting in ten minutes. Alright. There's no webhook configurations that I can see. Just in case, replace the API server with something.
1:22:13 I'm not sure I would catch the YAML error live like this if there was one. Right. So most of that, it's okay except for one line. For one line. Where did your cursor start? I remember that from watching the digital. It was exactly where the cursor started. Where did your cursor start? Oh, yeah. Admission controller line. There's an extra one. Enable admission plugins: NodeRestriction, PodSecurityPolicy. Anybody familiar with admission plugins? Because I don't remember what the default is now. The default is node registration. Registration? Interesting. There were no pod security policies that I
1:22:28 Identifying API Server Manifest Typo
1:23:12 saw, but NodeRestriction over that registration. Yeah. It should be node registration instead of restriction. I think there was one more change. Other registration? Yeah. Next one with you. It's not rough. The chat. Very quiet for a change, which is unlike some. I think everybody is pretty stumped. Yeah. Good job, Carta. You got twenty minutes left. Time flies when you're having fun. When you're live on YouTube. Yeah. Bye. So kube-apiserver is not ready when you do crictl pods list. We'll see if we can get logs on that one. So there's a second pod. Oh, there's already
1:24:28 one that started a few seconds ago. Can we just kill the other one? 19 hours ago as well. Four? I would change the node registration back to restriction. I think that was a slip of the tongue, but it should be restriction. That's the default value. Okay. Got it. Failed to start, and that's probably our old message here. Let's just... Yeah. See what we got. Okay. We have an API server. It's pending. If you look at the kube-controller-manager under controllers, there's a lot of minuses. Minus replica set, minus deployment, minus stateful sets.
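Note: a one-word typo in the API server's admission plugin list is hard to spot by eye, as this exchange shows. A small check against known plugin names can flag it. The Python sketch below is only an illustration: `KNOWN_PLUGINS` is a partial list of real plugin names, not the full set, and `NodeRegistration` is a hypothetical misspelling chosen to mirror the slip in the conversation (the exact string in the manifest is not shown in the transcript).

```python
import difflib

# Partial list of real admission plugin names; extend it for the version in use.
KNOWN_PLUGINS = [
    "NamespaceLifecycle",
    "LimitRanger",
    "ServiceAccount",
    "NodeRestriction",
    "PodSecurity",
    "PodSecurityPolicy",
    "ResourceQuota",
    "DefaultStorageClass",
    "MutatingAdmissionWebhook",
    "ValidatingAdmissionWebhook",
]


def check_admission_plugins(flag_value: str) -> dict[str, list[str]]:
    """Map each unrecognized plugin name to its closest known names."""
    problems = {}
    for name in flag_value.split(","):
        name = name.strip()
        if name and name not in KNOWN_PLUGINS:
            problems[name] = difflib.get_close_matches(name, KNOWN_PLUGINS, n=2)
    return problems


if __name__ == "__main__":
    # Hypothetical flag value containing a misspelled plugin.
    print(check_admission_plugins("NodeRegistration,PodSecurityPolicy"))
```

An unknown plugin name makes the API server refuse to start, which is consistent with the static pod failing repeatedly until the line was corrected.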
1:25:31 Finding Disabled Controllers
1:25:38 You disable the video control things. Uh-huh. And looking at, if you do crictl pod list, that one was modified nineteen hours ago, seventy two days ago. You're right on it, Andy. Up at the top, one of the flags. Yeah. No replica set controller, no deployment controller, no stateful set controller, bootstrap signer, token cleaner. They went to town on that. Yeah. A few lines on that. I think we just... I'll retrofit it. Remove that. Just remove that line. Right? Yeah. Just leave the asterisk. Or do you... yeah. Yeah. Yeah. Yeah. Because asterisk should be everything.
1:26:08 That's right. Sneaky. Sneaky. Sneaky. So we have still one... oh, yeah. Controller manager. An API server. There you go. Yeah. How are you getting responses from an API server anyway? Weird. I don't know. I think it was just rejecting the pod registration, but the containers... I have no idea, actually. We have the containers ready now for clustered. Woo hoo. Hey. Hey. Starting off. You get that? A couple. Okay. Two replica sets. I'd be lying if I said I thought we were out of the woods. Yeah. It's probably more. Alright. Let's see.
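Note: the controller manager's `--controllers` flag takes a comma-separated list in which `*` turns on the default controllers, a bare name turns one on, and a name with a leading `-` turns one off. The Python sketch below parses such a value to show what has been disabled. The sample value is an assumption reconstructed from the controllers named in the conversation, not a copy of the manifest.

```python
def parse_controllers_flag(value: str) -> tuple[bool, set[str], set[str]]:
    """Split a --controllers value into (star present, enabled names, disabled names)."""
    star = False
    enabled: set[str] = set()
    disabled: set[str] = set()
    for item in value.split(","):
        item = item.strip()
        if item == "*":
            star = True  # all on-by-default controllers
        elif item.startswith("-"):
            disabled.add(item[1:])  # explicitly switched off
        elif item:
            enabled.add(item)  # explicitly switched on
    return star, enabled, disabled


if __name__ == "__main__":
    # Assumed value for illustration.
    sample = "*,-replicaset,-deployment,-statefulset,bootstrapsigner,tokencleaner"
    star, enabled, disabled = parse_controllers_flag(sample)
    print("disabled:", sorted(disabled))
```

With the replica set and deployment controllers off, a Deployment object can keep a stale status (one ready, one available) while no pods exist, which matches the mismatch seen earlier in the challenge.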
1:26:40 Application V2 Running (Goal 1 Achieved)
1:27:01 That started. It says it's running. Postgres. Happy? Is pending. Can we try updating the deployment, or should we try the front end first? Oh, we can. I feel like it's gonna say it can't reach the database. Yep. I think we're about to get a angry man time out. Alright. So not getting any reason why Postgres won't start. Let's go to specified in the deployment because I don't think I see it in there. It's a stateful set. Stateful set. Okay. Great. Let's see. What? So it's got a config map mounted to it. That could be missing.
1:27:36 Debugging Database Connectivity
1:28:48 I would have expected to see a message about that, though. Generation two, so it has likely been modified. Yeah. Can we look at the replica set and just go back to the other one? There's no replica set for stateful sets. Oh, right. Right. Right. Right. I was thinking of... I wonder if scheduling or update strategy. Have you described, like, the pod? Yeah. I did. There were no events. But what causes pending? Yeah. It can't be scheduled. Yeah. Normally scheduling, I think. Right? Oh, yeah. Or scheduling. So I had selector on that one too,
1:29:56 maybe. I had a nice look. Or, like, the update strategy. Maybe that's... Yeah. I hate myself for that. Like, the service is Postgres, and the stateful set is PostgreSQL. That's on my to do list. To the strategy. Yeah. Not sure. Just checking things. I just see, you know, something from scheduling or, like, for the other strategy. Right? Otherwise, does it have anything on the pod called nodeName? It says that it has been scheduled. That's what the scheduler puts into the pod once it's running? Yeah. So if you edit the pod, I would expect to see the
1:31:11 value nodeName somewhere. I mean, you could always add it manually to see if they confirm it as scheduling. And I'm just saying my advice is often wrong and leads you down the wrong path for twenty minutes. So, you know, use it at your discretion. Just take the scheduling logs. Next thing I would do is. Found the watch. That's been going on since it started. Although, I did just restart because we didn't have it before. I don't know. It's been quiet since then, it looks like. We got twelve minutes. Alright. It's time to open hint five,
1:32:32 Using Hint 5: Fixing Scheduler Name Typo
1:32:34 I think. It's the final resting place for power nodes. So why can't this one be scheduled? You got a very interesting comment in the chat from Don O'Neil. Not all default schedulers are default schedulers. Default schedulers misnamed. Oh, interesting. Maybe I noticed he was paying attention. Tables are Schedule in the end. Yeah. Once you see it, I'll set you can't unsee that. Oh, it's too blurry. Oh, so sneaky. That is good. One letter off. Yeah. It's crazy. Yeah. My wife's actually standing outside my window with a sign that says scheduler typo right now. Hey. We got a we got a database.
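Note: a pod stays Pending with no events when its `schedulerName` does not match any scheduler that is actually running, because no scheduler ever picks it up. The Python sketch below shows that comparison over pod dictionaries. `default-scheduler` is the real default name; the misspelled `default-schedular` and the pod name are made up for illustration, and the transcript does not show which side carried the typo in the episode.

```python
DEFAULT_SCHEDULER = "default-scheduler"


def unclaimed_pending_pods(pods: list[dict], running_schedulers: set[str]) -> list[tuple[str, str]]:
    """Return (pod name, requested scheduler) for Pending pods no running scheduler will claim."""
    stuck = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        spec = pod.get("spec", {})
        if spec.get("nodeName"):
            continue  # already bound to a node, so scheduling is not the problem
        wanted = spec.get("schedulerName") or DEFAULT_SCHEDULER
        if wanted not in running_schedulers:
            stuck.append((pod["metadata"]["name"], wanted))
    return stuck


if __name__ == "__main__":
    # Hypothetical pod and scheduler names.
    pods = [{
        "metadata": {"name": "postgresql-0"},
        "spec": {"schedulerName": "default-scheduler"},
        "status": {"phase": "Pending"},
    }]
    print(unclaimed_pending_pods(pods, {"default-schedular"}))
```

This also explains why checking for `nodeName` on the pod was a useful step: the field is only set once some scheduler has bound the pod.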
1:33:57 Database Pod Running
1:33:57 Will it blend? Oh, alright. Well, looks like we still don't have an application. So I would start looking at the clustered service. Yeah. What's the connectivity look like? So Teleport expects clustered to be running as a NodePort service on port 30000, I believe. We're not getting anything anymore? Because before, we were getting a database failure. I think we've had internal server error every time. Yeah. We expected a database error, and we never got one. Never got it. Okay. Oh, there it is. We can also try curling localhost on 30000 just in case they broke Teleport, but
1:34:10 Debugging Service Connectivity (Continued)
1:34:55 I don't know if that's okay. Just in case some noise changes. What was that last thing you said? I'm sorry. You could curl localhost on 30000 to confirm if the NodePort service is working. Thirty. Yeah. I'm gonna guess it's not. Death by a thousand cuts here. Did you just... oh, you deleted it. Okay. Never mind. Yeah. I restarted it. But it seems like... would Teleport give us an internal server error if it just couldn't connect at all? I believe so. Yeah. K. How much time do we have? Eight and a half minutes. There are a few replica set
1:35:58 versions for clustered. I checked: the image pull policy is Always, and the image is correct. So I don't think it's a rogue image. I don't like that we can't curl localhost on 30000. Yeah. That seems networking-ish. Yeah. I don't know if you caught Mario's introduction, but he did talk about his networking skills. Yeah. We talked about this. Cilium network. Cilium network policy. We talked about doing this, and we didn't do it. Yeah. We did. Yes. That was lucky. And Steven Augustus. That's awesome. You got the dance. Nicely done.
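Note: the localhost check suggested here only needs a TCP connection attempt to the NodePort. A minimal Python sketch follows, assuming port 30000 as mentioned in the conversation; the function name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False


if __name__ == "__main__":
    # Assumed NodePort from the conversation.
    print(port_is_open("localhost", 30000))
```

A refused connection usually fails immediately, while traffic silently dropped by a network policy tends to hang until the timeout, so how the check fails can hint at where the problem sits.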
1:36:25 Fixing Cilium Network Policy
1:37:04 Seven minutes to spare. Left out some furthest. Harder. We left with Steven Augustus. You would have kicked yourself. It was a Cilium network policy that blocked you from finishing it. So I'm glad you caught it. Nice work. Alright. Carta, that was amazing. That was a lot of fun. Yeah. So much fun, Beverly. Well, thank you all for joining me, and congratulations. You got through that with a whole seven plus minutes left to spare. And your signal-capturing, respawning process was definitely ingenious. I'm not gonna forget that one in a hurry. Thank you. And so thanks again It
1:37:27 Wrap-up & Conclusion
1:37:44 was wonderful. For joining us. Hopefully, we'll get you back another time, but I'll say goodbye for now, and have a great day. To everyone else, thank you for joining us. Thank you for watching Clustered. Thank you for engaging us in the chat and supporting both the teams here. Thank you to Carta, to Mario, and to Moody. This was a fantastic episode. Thank you to Teleport as well. Please check that out at Rawkode.live/teleport. We'll be back next week. Have a wonderful day. I'll see you all soon. Thanks.