About this video
What You'll Learn
- Fix expired Kubernetes API server certificates, static pod manifests, and restart flows to recover initial cluster access.
- Diagnose etcd readiness and resource limit problems by removing probes and tuning limits to stabilize control plane.
- Debug application breakages caused by invalid webhooks, Postgres DNS, and Cilium kube-proxy replacement networking during app rollout.
Skyscanner and DigitalOcean each take on a broken Kubernetes cluster. Fixes span expired certs, etcd probes and resource limits, a kube-vip VIP, a rogue validating webhook, kube-monkey, and Cilium with kube-proxy replacement.
Jump to a chapter
- 0:00 Holding screen
- 1:50 Introductions
- 1:51 Introduction and Sponsor Mention
- 2:55 Introducing Team Skyscanner
- 3:00 Team Skyscanner
- 3:36 Skyscanner Team Introductions
- 5:35 Skyscanner's Challenge Begins: Initial Cluster Access
- 6:17 Debugging Connection Refused Errors
- 8:03 Checking Control Plane Components (Kubelet, Static Pods)
- 9:12 Investigating API Server Config
- 13:04 Checking API Server Logs (TLS Issue Found)
- 14:22 Renewing Kubernetes Certificates
- 15:10 Restarting API Server and Checking Logs
- 16:37 Forcing Static Pod Restart via Manifest
- 17:40 Continued API Server Log Analysis
- 18:41 Investigating Etcd Logs
- 19:37 API Server is Functional
- 19:49 Identifying Unready Nodes
- 20:36 Using k9s for Cluster Overview
- 21:37 Diagnosing Pod Connectivity to API Server (from k9s errors)
- 22:14 Considering Application Upgrade Strategy vs. Fixing CNI
- 22:54 Persistent API Server Connection Issues (from k9s)
- 24:15 Checking API Server Logs Again (Context Deadline)
- 25:00 API Server Received Terminate Signal
- 25:26 Etcd Member Status Check
- 26:36 Chat Suggests Resource Limits Issue on Etcd
- 33:30 Kubelet logs show Etcd container failing
- 34:09 Checking Etcd Logs (Shutdown signal)
- 35:38 Host Suggests Removing Probes and Limits from Etcd Manifest
- 37:16 Checking Etcd Logs After Modification (Looks Healthy)
- 37:58 Checking API Server Status (It works!)
- 39:22 Attempting Application Update (Deploying v2)
- 40:02 Application Accessible (Skyscanner Mission Success)
- 40:26 Skyscanner Team Reflection
- 41:02 Skyscanner Team Signs Off
- 41:19 Intermission & Sponsor Thanks
- 41:59 Introducing Team DigitalOcean
- 42:00 Team DigitalOcean
- 43:50 DigitalOcean Team Introductions
- 45:00 DigitalOcean's Challenge Begins: Cluster Access
- 46:41 Initial `kubectl` Fails (Unable to Handle Request)
- 48:20 Checking Kubelet Logs
- 49:07 Checking Etcd Status and Logs (Looks Okay)
- 51:39 API Server Logs (Failing or Missing Response)
- 54:06 Investigating API Server Manifest and Flags
- 55:49 Verifying Virtual IP Address
- 1:01:08 Cluster Info Command Fails
- 1:12:12 Host Hints at API Server Startup Logs
- 1:16:06 Looking for Anomaly in API Server Manifest
- 1:18:31 Killing and Restarting API Server Process
- 1:25:10 API Server Responds After Manual Restart (`kubectl get nodes` works)
- 1:30:20 Investigating Webhook Configurations
- 1:31:40 Deleting Validating Webhook Configuration
- 1:34:30 Attempting Application Update
- 1:37:10 Checking Application Access (Internal Server Error)
- 1:42:40 Debugging Application Connectivity (Postgres DNS)
- 1:46:57 Installing DNS Utils in Pod
- 1:50:10 Checking Postgres DNS Resolution (Works)
- 1:54:30 Checking Application Logs (No Logs Available)
- 1:56:30 Re-examining Application Deployment/Binary
- 2:00:30 Checking Services (Clustered and Postgres)
- 2:01:08 Discussing Networking Issues (Cilium)
- 2:01:30 Checking Network Policies
- 2:01:58 Checking Cilium Pods/Daemonset (Recently Restarted)
- 2:05:00 Checking Cilium Config Map (kube-proxy replacement disabled)
- 2:09:00 Finding and Scaling Down kube-monkey Deployment
- 2:12:55 Enabling kube-proxy Replacement in Cilium Config (Probe)
- 2:16:00 Forcing Cilium Daemonset Rollout (Add label)
- 2:22:45 Checking Application Access Again (Works!)
- 2:24:00 DigitalOcean Mission Success & Reflection
- 2:31:00 DigitalOcean Team Signs Off
- 2:31:48 Outro and Sponsor Thanks
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:51 Introduction and Sponsor Mention
1:51 Hello, and welcome to today's episode of Clustered. This is episode three of Clustered Teams, and we have two wonderful teams joining us today. We have team Skyscanner and team DigitalOcean. Now before we introduce them and get started today, there's just a little bit of housekeeping. Firstly, I wanna thank Teleport for sponsoring Clustered. Teleport is a product that we have used on this show since the very first episode even before they sponsored us. I think it's a fantastic product. You're gonna see it in action today as we debug these two broken clusters, and you should check it out for
2:26 yourself by checking out rawkode.live/teleport. Also, you should subscribe to this channel. There's lots more content focused on cloud native Kubernetes and everything in between. It is a vast cloud native landscape, and I hope that these videos will help you navigate it and all learn together. We also have a Discord server available at Rawkode.chat, so come on, say hello. There's over 400 and something odd of us in there now chatting, so I look forward to meeting you. Alright. Let's get started with today's first team. I am joined by team Skyscanner. Hello, team Skyscanner. How are you?
3:00 Team Skyscanner
3:04 Hello. Good. Thanks. You? I mean, I don't need to fix a cluster, so I'm feeling pretty great. How are you feeling, Guy? Moderately terrified. Well, you should be because I'm sure people might remember your little voice and your smirky attitude as you broke a cluster for me previously with a Unicode character. Not that I'm bitter, but, you know I I I feel I was appropriately apologetic about it. No. It's an absolute pleasure to have you, and I'm really excited to see you work on this cluster together. Can we start with our round of introductions, please? We'll start with you, Guy.
3:36 Skyscanner Team Introductions
3:41 Sure. Yep. I'm Guy Templeton. I am a software engineer at Skyscanner, where I'm partly responsible for the shared container platforms that we run the website on. Outside of that, I'm also one of the co-chairs of SIG Autoscaling for Kubernetes. So, please feel free to come and harass me if you think autoscaling in Kubernetes doesn't work the way it should. Okay. Matteo? Okay. Hi, everyone. My name is Matteo Ruina. I'm a software engineer at Skyscanner. Been here for four years now, and I've been working with Guy and Alex for such a long time. I
4:30 work as well on the container platform and recently focus on deployment, where we are doing some open source work around ApplicationSet to release a progressive sync controller. So if you're interested, please come to me. We're trying to scale GitOps for our needs, and I'd love to hear your opinion. The good news is it loaded. The bad news is I only see two machines. I'm gonna open a session on our control plane node. Please, if you could join the session, we'll keep an eye on the number in the top left, but feel free to type
5:35 Skyscanner's Challenge Begins: Initial Cluster Access
5:49 an echo hello to let us know that you are in. And we've got one, two, one more to go. Oh, there we go. Perfect. Awesome. Supposed to be fine. Alright. Well, this is a kubeadm cluster, so you will have to export your kubeconfig. And I suggest you check for a control plane and take it away. Best of luck. Okay. Do you wanna steer the first obstacles? Sure. I found that in the comments someone says someone broke Teleport again, we really have a lot of the previous episodes. There's What a surprise. There's there's rule, people. I mean, come on.
6:17 Debugging Connection Refused Errors
6:44 It's not as hard. But yeah. K. We've got a connection refused. Did we specify the right host and port? I mean, just can can you add the prompt? That's that'd be really that'd be really cheeky, but I think it should be fine. Maybe it's just pointing to something something DigitalOcean, though. K. I don't see a cluster. This is Yeah. Context. Is it off? Or is the Maybe maybe don't open it. There is no case. Oh, there you go. Where's the scroll? Yeah. Yeah. As the scroll goes, just give the window a little jiggle. There we go. That's an xterm.js
7:36 bug. Hoping someone's gonna fix it one day. Is that right URL on the server? Yeah. Hopefully, the right IP address. Should we check should we check if we've actually got a kubelet running and containers? Yeah. In the run up to this, I've had to learn containerd. We do have a kubelet. Is it happy with? We're getting nodes. Okay. And let's check for the manifests. We do have a kube-apiserver in there. Let's get to our controller manager. Alright. Here's here's the question. Can I remember how to use containerd runtime? Minus n k8s.io. Yeah.
8:03 Checking Control Plane Components (Kubelet, Static Pods)
8:55 And then c l Containers. C for short. Okay. So we have a kube-scheduler. We have a kube-controller-manager, and we do have a kube-apiserver. So maybe we want to have a look at what the config of the API server is. Sorry. Say, were we happy with the API that we have in the admin.conf with the IP address? That's know the universe of this now. That is a good point. Yeah. So there is a server yes. So that URL is what it's supposed to be. Shouldn't be localhost. If we only had the chance to access
9:12 Investigating API Server Config
10:00 another cluster and look at the keep it all together. Oh, yeah. Guess I can't remember. So right now you're trying to you're you're speculating that potentially they modified the IP address in that file. Okay. So you're gonna wanna check ip addr on the loopback interface. That is where the BGP advertised EIP is for the control plane. ip, space, addr. Yeah. You must do ifconfig. This should get you some It was being deprecated twenty five years ago. Still works, though. So this is your
10:51 BGP advertised address here, 147.75.80.168. That should be your control plane. K. It's right in memory. Yeah. I think that is. Server? 147.75.80.168. Looks good. Okay. K. So that's working. That that's not that's not wrong. Sorry. The chat thinks it's DNS, but I think they're trolling me. I think they might be, and I totally deserve it. Okay. So what was the what what error message were we getting again? Connection refused. It's trying to go somewhere, but either it's not listening. So maybe they're binding the port for the API server? Yeah. It is the the the Kubernetes config,
12:06 which is in a ConfigMap, which you can't get, or somewhere on disk. We can see what the Kubernetes server is being run with, though. If I remember to grep it. Now what you do, you shouldn't see that as a service because it's a static pod. Right? Can you hear them? Yeah. Yeah. It should still be Used to a ps auxf. I don't know if it's just there. This is what we get for not really Kubernetes. Yeah. K. I'm gonna go ahead and raise a ticket with the AWS and see what happens. These are bare metal clusters. Come on.
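Editor's note: "connection refused", as discussed above, means the host answered but nothing was listening on that port, which is a different failure from a timeout where no reply arrives at all. A minimal Python sketch of that distinction follows. The function name `probe_tcp` and the 2-second default timeout are illustrative assumptions and were not used in the episode.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open" if something accepted the connection, "refused" if the
    host actively rejected it (nothing listening on that port), "timeout"
    if no reply arrived in time, and "error" for any other socket failure.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

On a control plane node, `probe_tcp("127.0.0.1", 6443)` would be one way to check whether anything is listening on the port kubeadm uses for the API server by default.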
13:04 Checking API Server Logs (TLS Issue Found)
13:04 I mean, I'm gonna throw something out there, but, you know, you haven't attempted to look at the API server logs. That is that's a Very good point. Do you know where they are? No. We we we use docker. Proper proper runtime. /var/log/containers. Okay. So we have could you really keep it kube-apiserver? The command you're looking for is tail. Yeah. I was I was gonna go with less, but failed to find any PEM data in the key the key data. Nice. TLS. Is that all it is? That probably should have won a cubic in.
14:15 So we do have kubeadm certs. So in one of the previous episodes, there was someone regenerating all the certificates with kubeadm. Maybe we can do the same. You can indeed. kubeadm and certs. Certs. kubeadm something something. Yeah? And I'm pretty sure that the help the help the page with with now as well. Certs? kubeadm certs renew. And maybe there is, like, an all or something. Yeah. There there is a renew. Wait for it. All? Look at that. We've never used a tool and just guess the CLI. We need it to get CLI. There you go.
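Editor's note: the team guesses their way to `kubeadm certs renew all` here. Before renewing, it can help to confirm that a certificate has actually expired; `kubeadm certs check-expiration` reports this directly. The sketch below shows the same arithmetic in Python, assuming the expiry date was printed with `openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt` (the kubeadm default path, an assumption about this cluster). The helper names are hypothetical.

```python
import ssl

SECONDS_PER_DAY = 86400.0


def parse_enddate_line(line: str) -> str:
    """Strip the 'notAfter=' prefix from `openssl x509 -noout -enddate` output."""
    prefix = "notAfter="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'notAfter='")
    return line[len(prefix):]


def days_until_expiry(not_after: str, now: float) -> float:
    """Return the number of days until the certificate expires.

    not_after: date in the form "Jan 11 00:00:00 2024 GMT".
    now: current time in seconds since the epoch, for example time.time().
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY
```

A negative value from `days_until_expiry(parse_enddate_line(output), time.time())` would be consistent with the TLS errors seen in the API server logs.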
14:22 Renewing Kubernetes Certificates
15:09 Okay. Freshly minted certificates ready for your service. So restart Kubelet. I can never remember. If we restart Kubelet, that won't necessarily bring back up the restart all the the pods, will it? The static pods too. It says It won't because the containerd runtime is still running. If I even restart now, it wouldn't restart it. But it is a crashing static pod, it will restart manually. I'll just have a look at logs again, see if it I guess I'm doing unless it's crashing. That one isn't That was three minutes ago. Yeah. But have we got a new one yet?
15:10 Restarting API Server and Checking Logs
16:03 Just minus l on it. Yeah. Yeah. And capital t, I think, the time. Not that. No. Letter t. So I think I got a Steam 42, but nothing okay. So it's not immediately restarted. I mean, you can modify the static manifest to encourage it to be restarted about the car. Yeah. Looking at the manifest of the this is the manifest. So I mean, on the basis of modify something. Something people don't take advantage of, and it shows Vim remembering the cursor from the last edit. That's that's a nice hint. How how how far did I move it?
16:37 Forcing Static Pod Restart via Manifest
17:22 You weren't paying attention. I wasn't. Feels like it might have been around the port. Yeah. You should check for logs. That's there should be Yeah. There should be something there. Yeah. The Vim thing's always funny. So I can't remember what time. Here we go. Chat is surprisingly quiet. They're normally in there with all the advice. This wow. This, the scrolling issue happens on every on every command. Yeah. It's it's an idea. Now if we could all just make our window the same size, it would go away. Yeah. That'll that'll happen. Okay. So we're still getting errors.
17:40 Continued API Server Log Analysis
18:26 And that's it. Connection refused. Yeah. Interesting. Might have to restart that as well, which is or look at the last one. Yeah. So we'll look at the manifest or the logs. Where is It's in /var/log/pods as well. We can indeed. Nice. The set. K. So it looks happy enough. Yeah. I I think that maybe etcd restarting after the certs have been renewed. Yeah. And the port was the right one. Two three seven nine and Yeah. There it said, 2379. I mean, that's well, that's that's also now a minute ago. So actually
19:37 API Server is Functional
19:42 Ta da. There we go. You have a control plane. We have two unready nodes. Try something. k9s. Goodness. I I I why not? Again, some encouragement from the DigitalOcean team. Shahar is saying you're doing great, smiley face. There you go. The rest of the chat are just laughing. I don't blame them. Hey, Khalid. I think I think the problem is we are teaming with Guy and everyone remembers him from the previous episode. So this is why the chat is quiet and no one is cheering. Yeah. There's a there's a comment from Waleed saying this is team unique, Khalid. So there you
19:49 Identifying Unready Nodes
20:29 go. We're we're like a yeah. We're like a friendlier. You you see Oh, who stuck k9s on this? Just you just saw me do it. I'm sorry. I was looking at the chat. Oh, we've we've not had anyone bust out k9s yet. That's nice. You start seeing things that are nice and red like that thing which we started a lot. Yeah. Oh, and cheeky CDM as well. That's the the new command from previous logs. This is complaining. I'm talking to the API server, unable to Yeah. Communicate. Oh, turn on wrapping. 1477580. I mean
20:36 Using k9s for Cluster Overview
21:16 I don't know the notes. I I quite like if they click time stamp time stamps on those What's the Better policy? Why? Everything happens. So what's the symptom you're seeing just now? Your Cilium pod can't reach the Kubernetes API server? Yeah. Think so. Yeah. So a few things are where things are happening as well, but a few things fail to find out. Alita's asking, what is that tool? That is k9s. You should check it out. And yeah. So if you hit the pods one that's having problem on worker two. So worker two seems to be our our next
21:37 Diagnosing Pod Connectivity to API Server (from k9s errors)
22:12 problem. I I mean, I guess the the question is, our our aim is to upgrade the app. Do we try it and see what breaks? So the problem is the Postgres is a worker two. So unless we move it, we probably need to talk to worker two because otherwise It's different. What's that? You're just so sneaky guy. Like, we're not gonna fix the broken thing. We're just gonna make the thing work, which is a good strategy. A good strategy. Okay. Yeah. What what are we trying to achieve? We're trying to give, you know, the the viewers access to your lovely new version
22:14 Considering Application Upgrade Strategy vs. Fixing CNI
22:44 of the website. Yeah. That's that's the main thing. We can come back and fix things after. I think that's why. First. I don't see any Level set. No API server connection. Where? In the bottom? See the the red Ah. Yeah. Do do do we have, like, a do we have, like, a a network a network role that does random dropping or stuff and stuff like that? Yeah. So that's why I was seeing here when I was trying to So last night, I had I had my first look at the Cilium documentation, and and they have something called network policy.
22:54 Persistent API Server Connection Issues (from k9s)
23:23 Yeah. Like You want to send that out? But now it looks like we've had to to go on, which means That was a nice shout, though. Yeah. I mean, that's what I would do. We got a comment from someone called Spedge who says he's very proud of you, guy. Pointing hair manager likes likes the being able to cite Skyscanner's principles. Yeah. Let's pretend Skyscanner.com is done. Come on, guy. Let's get this thing fixed. Get it fixed. A comment from Noel saying, just schedule the workload on the control plane nodes. Go for it. Yeah. Yeah. Yeah. So, Michelle, should
24:15 Checking API Server Logs Again (Context Deadline)
24:15 we look at the logs again? The the tailing of the API server. Yeah. Yeah. Because if the end of the certificate expired again, then maybe that is Context context deadline. This is a time out. You just have twenty minutes remaining. That's fine. So that's it, Jean, that, like, had a wee problem. I'm gonna take over. Received terminate signal. From who? From who? At, yeah, five minutes ago, which is roughly when we started seeing weird things happen, and it started back up again. Tests for Rats again. Mhmm. Anything up there? Oh, looks like it's got two members.
25:26 Etcd Member Status Check
25:29 I need to make my screen the exact same size as David. Listen URLs. It's got two IP addresses, but it's one of them's public. I'm gonna start using the command line, teleport client, I think. Anything obvious? The license. Does it come here? Anything obvious? No. Key points to members local on private. It's a hundred oh, the and is it OOM killed? Oh, do you see that? Someone in chat said it's a hundred megabytes of RAM. I think it's the hundred millicores on the CPU. I know. Know. I know. Haven't profiled etcd myself, so I'm not
26:36 Chat Suggests Resource Limits Issue on Etcd
26:47 sure if that is a good or a bad value. But I think it should be alright for it to be, to be fair. Maybe not. How how does I see things. So we've got Did it say, like, I don't want anything about it? No. It's just in the container. This is to make sure I'm on the right page. Do we have a working API server at the moment? Kind of Intermittently. Yeah. Let's go and look at it as a pod then. It's If you don't see we should see it. Does it run as a static pod, did it? Or does it
28:01 run? It should be running as a static pod. Yeah. And it should be visible in the kube-system namespace. Yeah. Okay. Okay. Should we have a look at the kubelet logs to see why it's not? Yeah. There we go. Failed to create a mirror pod for no priority. No. Let's I'm going sideways. GOMAXPROCS. Okay. Context deadline exceeded for API server. Connection refused to the API server. You you were using So is there yeah. Is that have they got an admission webhook then? Looks like it. There is there is in the logs as well. So many people.
29:15 If if we can catch it whilst it's up. The admission rather yeah. Give it to TL. I mean, should should we set up the auto complete whilst we're doing this as well? Because I did not trust myself to go, yeah, I've typed that correctly so I could trust what I've I've got back. Yeah. Nice. Yeah. That's a. And you missed the API server window. Yeah. What's the difference here? Even that? You just have fifteen minutes of fun left. So did we check if for the if the what priority is in the manifest? No. You have to come.
31:09 I I I think Mateo has been confused by that. Do you want to take over, Mateo? Yeah. Now I got you. I was gonna sit down and wait. Yeah. It was it was about something something that they were drinking. I was like, sure. Is that what goes as well, brother? Clarity class system not critical. Yeah. So I was relaxed in the chair, so what were we doing? See the visual webhooks. Yeah. Billy from the Digital Ocean team has said you're doing great. If you want a hint, just see. Maybe not yet. Yeah. Let's let's see if we can
31:58 Have a look at the validating 1.2? Web validating. Yeah. You're gonna have to fix that I said the API server. You you can't debug without this. Yep. Yeah. So is it the API server that's restarting, or is it etcd that's restarting? What times do we have if you run a ps aux? What did you do for f? Just yeah. It has all of the okay. Yeah. This whole process is a bit scroll up. Oh, wait. Just watched I may wanna just grep for API server and and etcd. In fact, there's the API server there.
32:59 So that's literally just restarted. Yep. And what about etcd? When did that actually start? Okay. It's not running. So Yeah. What did you see in kubelet, Guy? The I missed it. It was I was trying to post to an API, and it was getting refused. Was it paused? So the API server is probably rebooting because etcd is flapping. You're gonna wanna look at etcd. So if you have a look at the the kubelet logs again, Matteo. It's just Yeah. And then with controller. So if you have a look at, there was yeah. There's error syncing pods.
33:30 Kubelet logs show Etcd container failing
34:00 Error syncing pod failed to start container for etcd with CrashLoopBackOff. Alright. There you go. It's crashing. We need to get into the etcd logs, I think. Yeah. Go have a look at the logs. We we looked at them through pods last time, but we can This just gets a shutdown. Yeah. Do we have something sending this signal to etcd? So this is gonna be probes. That's my hunch. Yes. It's it's getting a nice terminated signal. So We should see that in the kubelet logs. Because the kubelet controls it. So just grep this for searches with.
34:09 Checking Etcd Logs (Shutdown signal)
34:58 Connection refused. Error syncing. Question back. Unable to write event. You were hot earlier. Now we're cold. Label to write the end. Does it mean that there are there is no space? Okay. So we have ten minutes left. So I'm gonna throw something out there, and you can ask for a hint if you want one. But I would remove all the probes from the static manifest and the resource limits. Like, let's not give it any reason to shut this pod down. Some of the cover? Some I see. Manifest. I'm just I think this point just That that is that is a subtle way
35:38 Host Suggests Removing Probes and Limits from Etcd Manifest
36:03 to say that I was typing too slow. Yeah. I mean, not so subtle, but Yeah. Remove the limits. Yeah. And remove the liveness. Yeah. Just delete the whole file. It's nothing important in there. Right? I mean, those are just requests. So but you can remove them anyway. Startup probe. I'd remove remove the startup probe as well. Sublime is all And if anything past commit is on 16, thank you very much. And should we save this? Cool. Coming from Rawkode's in Rawkode's smash. Yeah. I'm not as subtle or elegant with some of my fixes. That is
37:02 you've gotta rule out anything that you think could be getting in the way again as a fix. Right? I mean, you can add back probes later. You can add back resource limits later. Yeah. Really, you just need a working etcd. You wanna check those? Let's see those etcd logs. Okay. It won't be two twenty anymore, though. Won't even do this thing anymore. Probably just because the pod ID has changed. Use /var/log/containers. That way you don't have to worry about the directory changing. Okay. That is a healthy etcd Yeah. For for now. They are namespaced as well, remember.
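Editor's note: the host's suggestion is to strip the probes and resource limits from the etcd static pod manifest so the kubelet has no reason to kill the container, then add them back once the cluster is stable. In the episode this was done by hand in an editor. The sketch below shows the same transformation in Python on a manifest that has already been parsed into a dictionary (parsing the YAML file is not shown). The function name and the sample values are hypothetical and are not the cluster's actual manifest.

```python
import copy

PROBE_KEYS = ("livenessProbe", "readinessProbe", "startupProbe")


def strip_probes_and_limits(pod: dict) -> dict:
    """Return a copy of a Pod manifest without probes or resource limits.

    Resource requests are kept: they influence scheduling but never cause
    the kubelet to kill or throttle a running container.
    """
    cleaned = copy.deepcopy(pod)  # leave the caller's manifest untouched
    for container in cleaned.get("spec", {}).get("containers", []):
        for key in PROBE_KEYS:
            container.pop(key, None)
        resources = container.get("resources")
        if isinstance(resources, dict):
            resources.pop("limits", None)
    return cleaned


# Hypothetical example manifest, trimmed to the relevant fields.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "etcd", "namespace": "kube-system"},
    "spec": {
        "containers": [
            {
                "name": "etcd",
                "livenessProbe": {"httpGet": {"path": "/health", "port": 2381}},
                "startupProbe": {"httpGet": {"path": "/health", "port": 2381}},
                "resources": {
                    "requests": {"cpu": "100m", "memory": "100Mi"},
                    "limits": {"cpu": "100m"},
                },
            }
        ]
    },
}
```

Because the kubelet watches the static pod manifest directory, saving a changed manifest is what triggers the pod to be recreated, which is the same mechanism the team used earlier to force the API server to restart.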
37:58 Checking API Server Status (It works!)
38:15 Dash dash all namespaces like an absolute monster. We've taken that. That's working on the I think it failed. It's still there. Ta da. That looks suspicious. Now you gotta look at it or delete it? Curious. You gotta look. I got my boss. Nice. They did. I create pods star. It went to a object selector just for this. Alright. Remember your mission. You got seven minutes. K. So if the question's looking slightly healthier, I'm not overly happy with that. But No. Let's think first is to this. Anything else? And it already failed, but we'll try that
39:22 Attempting Application Update (Deploying v2)
39:35 first. Nothing within container creating. Mhmm. It's pulling the image. Yeah. There you go. But can't does it work? Would you like me to try? Give it a shot. Yeah. Why not? So it's exposed via Teleport, and we can click Clustered. That's the successful one. Mission successful. No. I think you still got a particularly broken cluster, but you did manage to deploy an update to your application and browse to it. So well done, team Skyscanner. It's eventually consistent. How was that? You enjoy that? Yeah. It's good for me. There's some tricky ones in there. I I think as always, the most
40:26 Skyscanner Team Reflection
40:33 irritating ones are the intermittent ones where you go, okay. Go on to the oh god. No. No. Still broken. Alright. So now you know what you've fixed and you know what you've broken. Who who's meaner? I think we we might be meaner. Alright. We're dying. So Well, you don't keep my reputation up. I don't think it's up, but whatever, Guy. No. It was a pleasure. Thank you, Guy. Thank you, Matthew. Thank you, Alex. I hope you enjoyed that. Please feel free to watch from YouTube, be in the chat if you're available. I'm gonna invite Digital Ocean to come and
41:02 Skyscanner Team Signs Off
41:11 join the session, and I will jump back over to here for just a moment. So thank you. I'll speak to you soon. Alrighty. So while we wait for DigitalOcean, I would remind you all to like, subscribe, and click that bell, and another awesome thank you to Teleport for sponsoring Clustered. They made this show possible with their support. We use the product. We love the product. You should check it out too. Run Teleport on your infrastructure to commoditize access through secure protocols, remote pairing. Right? That's pretty cool. And application proxying as well. Okay. We've got,
41:19 Intermission & Sponsor Thanks
41:48 I've still got to kick them out. Skyscanner just hanging around. Definitely outstaying their welcome. Welcome, Shahar. Oh, it's got that's your team coming in there. There we go. Welcome, Billy, Jeremy, and one more. So this then affects everything. Right? So the the they they kinda fix it, but they worked around the etcd problem. Right? So, like, it I did the same etcd thing I did last time where I just changed the the liveness probe to be healthc instead of health so it would crash every two and a half minutes and then be down for a while. And I
42:00 Team DigitalOcean
42:36 noticed it because your Vim cursor started there. I even gave them the hint, but they they they didn't they didn't take it. Yep. Alright. Wait. There there was there the the validating admission or the the mutating admission webhook would have all all it did was it changed the v two tag to v one. So they would have changed it in the deployment and then the pod still would have been running v one even though the deployment had been changed. Sneaky. I like that. Definitely cool. You could do some serious amount of damage with those mutating webhooks. Right? And you
43:11 have no visibility into what's happening, which is which makes it even painful to debug. Yeah. I have another idea for doing the same kind of thing, but without without without without a mutating webhook. You keep that to yourself for now if we can bring that one. We're still missing one person. Colin. Colin. Alright. Let's get you hooked up with teleport. Oh, no. We haven't done introductions. I'm so rude. Apologies. Let's start with a a quick round of introductions, and hopefully, Colin will join us, but we'll let him see oh, there we go. Perfect. Welcome back. Alright. So
43:50 DigitalOcean Team Introductions
43:50 Alright. Can we do introductions? We just take it away. Shahar, do you wanna go first since you're top right, then just loop around. Yeah. Hi. I'm Shahar. I'm a software engineer at DigitalOcean working on the app platform. And, yeah, excited to be here. Billy, you wanna go next? Sure. I'm Billy Kleeck. I'm a staff engineer at DigitalOcean. I've been doing Kubernetes stuff for almost four years, running internal platforms at DO that host the vast majority of our internal services. Jeremy, you're you're you're next on my screen. Hi, everyone. My name is Jeremy Morris. I'm
44:29 a software engineer. I primarily work on DOKS. That's DigitalOcean's managed Kubernetes product. Hello. Hi. My name is Colin. I I work with Jeremy on the containers team, primarily focused on the the registry product right now, hoping to get into Kubernetes more later. Awesome. Thank you all, and thank you for joining us. That was a sneaky cluster. It's now your turn to see what Skyscanner have prepared for you. So I'm gonna pop open my screen share. I will open a session on the control plane node and if you can just let me know when you're in there and ready and feel
45:00 DigitalOcean's Challenge Begins: Cluster Access
45:08 free just to check for a control plane when we get a after we've got a few echo hellos. Oh, we're missing a face. Let me get that working. There we go. Okay. I see active sessions? And you should be able to see, yep, an active sessions. Oh, no. I mean, fresh. I'm definitely a little nervous after after they said that they're they think they're meaner than us. I'm looking through some some of Guy's previous breaks, so we'll see. Yeah. I did swear I'd never have him back on the the show. But Yep. That's that's probably for the best.
45:52 Anyone that brings Unicode to a cluster fight isn't good in my book. Alright. How are we getting on there? We're still waiting. I might just tag along then. I'm trying to, like Just to make sure you're on Skyscanner.cluster.live and not digitalocean.cluster.live. Right? Yes. And you went you went to active set you went to activity in active sessions and joined the active session? I I'm not able to oh, I might not have oh, my bad. I have not set up correctly. That's on me. Alright. Sorry. Whenever you're ready. If someone wants to take over and check for a control plane, we'll get this thing
46:38 underway. Best of luck, team DigitalOcean. Alright, Jeremy. Are you you gonna drive? Yeah. I'll start driving. Alright. Let's get a let's first things set up get set up config and and our alias. I love that that alias is the first thing even before the kubeconfig. That that's my kinda party. Yep. And then, tipped a bit for the auto completion as well. I don't remember that. Do you wanna type that in? Yep. This is still a lot. We can run this command. Yeah. I'll cut it. There we go. There we go. Cool. K. And
46:41 Initial `kubectl` Fails (Unable to Handle Request)
47:30 so maybe look at the node or something. Okay. Wonderful. Okay. There we go. So, couple of things. Let's go check those manifests. First thing, /etc/kubernetes/manifests, and let's see what the timestamps are. What I love is that is an actual response from the API server. That is a working-ish API server. Oh, yeah. That's right. Yeah. How do I get the oh, there yeah. It's just timestamps. Currently unable to handle the request. Did you do export? You did. Oh, boy. Okay. This is fun. So
48:17 does anything work? Can you do k get pods? K. Okay. Let's check the logs for kubelet. How do I click that? Sweet. journalctl slash u kubelet dash f. Just follow them for a bit. No. It'd be dash u kubelet first. Yeah. K. Second event. K. Let's check to see if etcd is running. So component statuses is not something we can do for the controller, but I think it does show etcd. Right? Or is that not even used at all at this point? I don't
48:20 Checking Kubelet Logs
49:27 I don't think we're gonna be able to use kubectl at all for a little bit. So let's just check to see whether etcd is running with ps ax, grep for etcd. Minutes. K. Alright. So can we check logs for etcd? etcd or something like that? I think we're gonna have to look at /var/log/containers to see them, because etcd isn't being run by systemd, it's running as a static pod. Yeah. So check those etcd logs. Yeah. Checks okay. Okay. That's interesting. Okay. So it looks like we can connect to Kube
50:40 Kube API server, but it can't by the way. I'm not sure what's up with that. But Yeah. So there is a small bug in xterm.js. If you can't scroll, you just, like, resize the window a millimeter. It just kinda fixes it. I know it does get a little bit frustrating, especially as it breaks on my side and then I do it. But eventually, we'll get something that works. I'll just resize it as needed. Awesome. Alright. So what should we do next? Let's just do a k get nodes again just to make sure that they didn't
51:16 do something like what we did where etcd is crashing. No. Okay. Server is currently unable to handle the request. Let's check the Kube API server logs. That was the handle of the crash. Failed to watch. Failed to list. What's that top one? Failed failing or missing response from? Should that be six okay. Let's go check that's interesting. Oh, we can't check that. So is there, like, a cluster role that's for this? Yeah. Yep. So but I don't know how we would check it without Kubernetes, without being able to so but let's try. Like, maybe
52:29 maybe you can see its own cluster roles. So just do k get clusterroles. Let's see what's there. See if it'll tell us anything. No. Nothing. And do we have a log that will tell us about this in Kube API server? So this this values pay that we just did. See the logs again. So maybe it'll give us some hint about what we just failed. That doesn't look recent. It doesn't? What time is it? Yeah. It's not too bad. Jeremy, was that the end of
53:13 it, or was that the This You did 10 end of it. Yeah. Yeah. Yep. Good catch. I didn't notice that timestamp, but that is indeed a loop. And there's two of these pods, though, and we're Yep. Yeah. Could be like an underdog. Agreed. So wanna take a look at the other one, Jeremy? Yep. So not 40 Yeah. That's the one. There we go. Okay. Alright. So data watch server is currently unable to repush request in ESpace. I will sync it right. Okay. Let's go take a look at the Kube API server manifest. I think that's a good idea.
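The timestamp catch above (reading a log file whose last entry was not recent) can be sketched as a small check. This is an illustration only: it assumes the files under /var/log/containers use the CRI layout of an RFC 3339 timestamp ending in Z, then the stream name, a tag, and the message. The sample line, the function names, and the five minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def parse_cri_timestamp(line):
    """Parse the leading timestamp of a CRI-style log line (assumed to end in 'Z')."""
    stamp = line.split(" ", 1)[0].rstrip("Z")
    if "." in stamp:
        whole, frac = stamp.split(".", 1)
        # strptime's %f accepts at most six digits, so drop nanosecond precision.
        stamp = f"{whole}.{frac[:6]}"
        fmt = "%Y-%m-%dT%H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)

def is_stale(last_line, now, max_age=timedelta(minutes=5)):
    """True when the newest log line is older than max_age, i.e. probably a dead container."""
    return now - parse_cri_timestamp(last_line) > max_age

# Hypothetical last line of a container log file.
sample_line = "2021-06-03T10:15:30.123456789Z stderr F failed to watch"
print(is_stale(sample_line, datetime(2021, 6, 3, 11, 0, tzinfo=timezone.utc)))
```

If the newest line is stale, the file most likely belongs to an earlier container instance and the other log file for the same pod is the one to read.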
54:13 The reason I'm doing that is it says that the server is currently unable to service the request. So I'm wondering if these IP addresses are correct. And I know that from looking at so because I did a little bit of reconnaissance, since Guy had done this before, to go see what kinds of things he had broken before. Because I know I repeated some things, I thought he might as well. So secure port looks right. We should probably check these IP ranges for the services and for
54:48 what we're listening on. So I don't think I wanna take a look at the flags here in just a minute too. We resize the window again. Hold on. K. There we go. Alright. Let's go up to the top here and take a look at the flags that Kube API server is running with. Gonna go to the top again. Oh, okay. Let's go down. Yeah. So is that IP address for the advertise address? Is that right? I think so. How do we check that? I think we'd have to take a look at the IP addresses the
55:46 virtual IPs that we have. Is that the ip addr command? Yep. That'd be something else. ip addr l? There we go. Okay. I think that's the same. 10.80.46.143. I think that's right. Can you just double check the flags again? 10.80.46.143. 10.80. Yep. And because Team Unicode, let's make sure that I oh, let's just type that IP address in. I like it. What am I doing with this? We're just gonna retype. So it's 10 so I just would go to the equal sign and just do an insert and put
56:51 that same IP address in, delete what they had. Oh, so just delete this and retype it? Yes. K. So it's 10.80. I don't remember it. 40 no. I think it was 46.143. Jesus. Okay. I mean, you could just do a slash and search for that address and make sure it matches as well. But if you hit escape and then forward slash and then search for 10.80.46.143. And then you can use n to make sure it matches the one at the top, but I think it will. Press enter and then press n. And it's matching it to them all, so
57:43 I think you're good. Yep. I think so too. But my concern, David, is that there might be a character here that looks like a valid character. But I think you're right. Since he searched them, it wouldn't have Guy has been in the chat to confirm there is no Unicode list. Thanks, Guy. K. Shouldn't we look at the maybe it's obvious. I'm not sure. But should we have looked at the config for Sure. Config? Maybe that was messed up. Maybe not. Yes. Well, potentially.
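The worry about a look-alike character hiding in the manifest can also be checked mechanically instead of retyping the address. A minimal sketch, assuming the manifest text has already been read into a string; the sample fragment, its 192.0.2.10 address, and the function name are hypothetical.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line_number, column, character, unicode_name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical manifest fragment: the second line starts its address with a
# FULLWIDTH DIGIT ONE (U+FF11), which renders much like an ASCII "1".
sample = (
    "- --advertise-address=192.0.2.10\n"
    "- --etcd-servers=https://\uff1192.0.2.10:2379\n"
)
for hit in find_non_ascii(sample):
    print(hit)
```

An empty result means every character is plain ASCII, which is the same conclusion the vim search reaches by matching the typed address against each occurrence.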
58:22 Yeah. It would have to be connecting to a different server. Right? Because we know that we're getting requests to our instance of API server from the logs and from the failures that we're seeing, don't we? Is anything, like, off to anyone? Turn on the what's the hang on a second. I wanna take a look at Teleport real quick. The error that we were getting wasn't able to get some endpoint. So Yeah. So and because I did recon. Another thing that Guy had done previously was to, like, mess with Kube controller
59:14 manager to shut down a controller. I don't know if Kube API server has anything like that, or if that could affect this, but it potentially could. I think I'd still like to go back and look at the Kube API server config some more. So it's failing to watch all of these resources. Or if it makes sense to look at did we already look at the containers? Maybe we should look at that. Sure. K. Real quick, before you do that, let's do ls -l on these manifests again. IT Sploit is asking if the IP is
1:00:31 correct. Yeah. I think we've confirmed that. So Yeah. Not it. Can is asking how can I get this terminal? It's Teleport. Go check it out. Rawkode.live/Teleport. Naveen noticed an IP mismatch in your admin.conf. That's expected here. The BGP advertisement address of the control plane is different from the node itself. Just a quick question to if we can do this, does that mean that we might have access to other commands? Or Potentially. Yeah. So what this means, like, it is talking to the server. Right? So we can talk to the server, which means the server's up and it's running.
1:01:17 That makes it seem like it's permissions, but I don't know of a way to mess with the permissions in a way that we could like, with anything like that. But it could certainly be a flag on Kube API server or something. Oh, wow. I'm not sure that it's permissions. I think a permission error would say something in those logs that would be very specific, like, cannot get resource because of, like, x y z, where it's saying, like, failed to watch, unable to handle the request, which makes it seem that the resources that it's trying to
1:01:49 watch might not be responding, or they might not be listening on the port that we're watching them on, or they might not be up. So you will see a lot of those failed to watch errors in logs kind of all the time anyway for reasons, but you're right about I think you're right about the Can I throw something out there? Or can't hit the endpoint. What's that? Can I throw something out there? Yeah. So I think we're maybe missing the most important error message in the logs by using tail, is kinda my hunch. And I think
1:02:20 Okay. etcd is a problem. If you remember that first ps, etcd had only been running for three minutes. Oh, okay. And it seems to me that kubectl version is a subcommand that doesn't speak to etcd and it's working. So Mhmm. Sure. Where is that? Where is the etcd? That one. And how long has this one been running? Can we see notes? Kube API server and etcd have been up for the same amount of time, it looks like. Yeah. You might be right, actually. Let's see the API server logs again.
1:03:18 I said maybe this is a shotgun approach, but should we look at, like I just try to spot check the manifest files? So we only looked at the API server. Right? I think that makes sense, but because we know we can talk to Kube API server, I wonder if there's some other things we could do. Like, let's try a cluster info, which I know is cluster info and component statuses, which is deprecated, that would be useful, then also get events. See if either one of those Oh, okay.
1:03:45 Cluster info dump. But is there a way to, like, filter that so we don't cover the screen? I think just bot. You don't even need a dump. Just do cluster info by itself. It should be done for now. Okay. Oh, yeah. So I guess we do that. Yeah. But it's telling the same thing. It can't handle the request. Okay. Let's try the get events. I tried that in the beginning. We can't do that. K. Server's currently unable to handle the request. K. Let's check the etcd logs. etcd logs are the worst.
1:04:37 True that. Because no one deals with it all the time. So okay. So compaction is happening. It seems to be healthy. Is etcd listening on the same port that Kube API server is connecting to? And what would that error look like if Kube API server couldn't talk to etcd? I think it would make sense if the error would be something like what we're seeing, where it's not able to process the requests for, like, watching stuff. Yeah. Let's check those ports and
1:05:15 IP addresses that etcd is listening on, make sure they match the Kube API server config. Do you real quick. Do any of you guys know the reason for the discrepancy? Like, I see the store index and finished scheduled compaction. It just happens sometimes, I guess. Maybe because I'm only tailing the logs. Yeah. It's because you're just tailing, but, like, that store index and compaction, that happens periodically with etcd as expected. I don't know that that was, like, triggered by some action that is causing our issues. So what should I
1:05:47 check out next? Let's go see what etcd is listening on. So let's go check its manifest. We're gonna look at a couple of things. We'll take a look at the startup probe, the liveness probe, the health probe. Probably also wanna take a look at it and then see what etcd is actually listening on, and then compare what etcd is listening on to what Kube API server is configured to do. Okay. So it's at the 46.143 IP address. That IP address looks fine. K. Liveness probe is /health, not /healthz, so we know that's right.
1:06:26 Yeah. And is that the right port? Is it 2381? I believe that's right for etcd by default, because it's 2379, 2380, and 2381. Right? I believe you're right. Yes. Can we curl that endpoint? 2381? Yep. Slash curl. So the flags you wanna use, if any? No. No. I don't think so. There you go. 2381 slash health. Mhmm. Great. Okay. Let's go back to that manifest. Alright. So well, how's the readiness probe look? There is no readiness probe. There's liveness and the startup. Can you scroll down so
1:07:37 we can look at the startup too? Startup probe. K. K. I just wanna make sure the startup probe wasn't set to, like, something way in the future, like a super long delay or something, so it wasn't responding. Okay. Well How do we so, like, the things that it's saying it can't get, are they up? Like, if we what are the logs again? It was like So, like, it's saying it can't it can't it's Can't watch these okay. So it would be etcd.
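The port check and the curl against 2381 slash health can be written down as a short sketch. It assumes the usual kubeadm-style layout (2379 for clients, 2380 for peers, 2381 for a plain-HTTP metrics and health listener on localhost) and that the health endpoint answers with a small JSON document containing a "health" field. The function names and sample bodies are hypothetical, and nothing here contacts a real cluster.

```python
import json

# Assumed kubeadm-style defaults, not values read from this cluster.
ETCD_CLIENT_PORT = 2379   # kube-apiserver talks to etcd here
ETCD_PEER_PORT = 2380     # etcd members talk to each other here
ETCD_METRICS_PORT = 2381  # metrics and health listener used by the probes

def health_url(host="127.0.0.1", port=ETCD_METRICS_PORT):
    """Build the URL that the curl in the transcript targets."""
    return f"http://{host}:{port}/health"

def is_healthy(body):
    """Interpret a health response body; anything unparseable counts as unhealthy."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    # The value may arrive as the string "true" or as a JSON boolean.
    return str(payload.get("health")).lower() == "true"

print(health_url())
print(is_healthy('{"health":"true"}'))
```

A healthy answer on 2381 only says that etcd itself is up. It does not show that the API server process is actually connected to it, which matches what the team finds later.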
1:08:23 Potentially. It could this is a tricky one. I still think we're not seeing enough of those logs. Yeah. He's still, like But I'm stumped. I've gotta say, I have no idea. I've not seen this before. Really? I'm gonna reset my screen again. K. Just give me so let's just take a minute. Let's just kinda read through these, see what's useful here. Server's currently unable to handle the request. I wonder if it's so this could be a couple of things. Right? It could be service account credentials for
1:09:15 a Kube API server or yep. Tilt retrieve. It's code. Service available. So how do we confirm whether or not it's service account credentials or not? How do we confirm what? You said it might be service account credentials? Yeah. Potentially. So I think we'd wanna go with a couple of things. We'd wanna take a look at the Kube API server manifest again, but we'd also want let's take a look at the /etc/kubernetes/pki directory and see if anything in there has been changed. So I'll ask myself. The Skyscanner team have said, remember there are
1:10:10 hints under /root if needed. And another comment from a Skyscanner member saying, when we broke this, we did panic that we couldn't fix it. Good job. Well played. Okay. New solution will want to check out the API server. Hold on a second. What was that? What was your question, Jeremy? This is what we want to check the timestamps and permissions on. Yeah. Can you also check that etcd directory? K. So I think that's interesting. Like, they were a little panicked that they wouldn't be able to fix it. I'm trying to
1:11:05 think of things that you could do that you wouldn't be able like, did you guys mean, like, you didn't know if it was possible, or you didn't know if you would be capable of doing it if you had the same problem? Because those are very different things. Yeah. Let us know, Team Skyscanner. And I do believe based on conversations we had earlier that they attempted the fix and that did work. So it's fixable, but I think they were panicked. As far as I know, I could be mistaken. Do you wanna take a look at the
1:11:33 hint? I was so proud of them for not taking the hint right away. I don't wanna take it right away either. I wanna keep looking. Okay. You've got time. You've got twenty minutes. Yeah. Yeah. Okay. So all these that they edited one of these, like should we just do, I just renew the service or something like that? Sure. Instead of just checking up individually and because I know there's a change that we did, and then they just wiped it out with that. So Yeah. So that's certainly true, Jerry, but I will point out
1:12:05 that, like, all these timestamps are around the same time, and they're not all exactly the same. They're, like, within a second or so of each other or a minute, which I think probably means that they haven't been touched. I don't trust them. I think they've changed something to reset a timestamp on it. Okay. Can't you do touch star, right, to reset the timestamp? So, David, are we allowed to check the bash history? Yeah. I don't think they'll have left, you're more than welcome to check the bash history.
1:12:34 Let's check it. Just run history, Jeremy. Where is this? Does this how yeah. I don't think you go. Echo Jeremy at the top, so not useful. Yep. But good job. Nice try. We covered our tracks too, but you just never know. Right? Is that one cert file supposed to be five bytes only? The s a dot s s cert? Yeah. Are we able to connect to the other nodes? Like, does the API server see the worker nodes? We don't know. Right? Because we did k get nodes, and we don't
1:13:14 we don't get anything. Yeah. Yeah. That's a good point. And we know kubelet is running. Right? Anyone have any crictl commands? What's that? Skyscanner team are saying nice try on the history. People forget to cover their tracks. I would still like to go take a look at the Kube API server manifest again and check those IP addresses for etcd and whatnot. This feels like it's permissions related, though. But I don't know off the top of my head what would cause that. So there's a couple things. Can I do a
1:14:02 ls -al on the manifests? I'm not sure if that makes sense to do that. Yeah. Very good. We've done that a couple of times already. Yeah. Those are the same-ish. Back in this file now. If you try to just, like, ls in /var/log/containers like, what containers are running right now? Like here? Yeah. Just wondering. Okay. Cool. So we have controller manager. We have scheduler. We have we know we have etcd, Cilium, Kube API server, and kube-vip. Can we check kube-vip? I don't think that'll make any difference.
1:15:04 Sorry. Jeremy, I actually meant the manifest. Let me see what the manifest is. Oh, the manifest. Okay. Yeah. I have to say that the fact that we are on Teleport via a browser means it is probably okay. Okay. I stupidly use the BGP advertised address for Teleport, which I really need to fix. But everybody's mentioned that, like, almost every Clustered episode. I know. And then I'm like, I'll fix it. I'll fix it. I never fix it. Can I make another suggestion? Just because this one has piqued my curiosity to no end, but I'm gonna suggest you kill
1:15:54 the API server on purpose and catch the logs from startup. There has to be a clue in there. Yeah. It just feels like that. There's something we're missing there, and I can't Yeah. That's a good idea. Is there supposed to be missing value there for the one key? The? Oh, yeah. Yeah. It looks like there's a missing value field for Yep. Yep. Yeah. That's okay. The password is null, so it's acceptable. Okay. Okay. Yeah. I agree with David. Let's tail those logs. And can we
1:16:34 is it possible for us to easily can we connect with another session to Feel free just to, yeah, open one on your own side to kill the API server and leave the logs on the displayed one. That's all good. Okay. I'll do that. Jeremy, do you want me to tail the logs? Tail the Kube API server logs? Yeah. You'll need to wait for the new pod ID to come in. But you should be able to get it pretty fast. Give me one second to get this set up. I was gonna do get nodes.
1:17:15 K. Done it. And so Let's take a look at this. What am I doing? I expected a new one, to be fair. Yeah. Me too. Did we kill the API server? Okay. And I should kill minus nine on the process. Yeah. So, Jerry, what Dave is asking us to do is kill the Kube API server, the process, so it'll get restarted so we have fresh logs. Then I'll run a command so we can maybe spot where the problem is. Just, like, try to give you a
1:18:01 catch up or something like that. Or Yeah. API s should be enough. And so now we're gonna kill 1985. That's the one we wanna do? Yep. Nope. Not that one. No. No. 1984 was. But Glad there's no teleport. Down. Measure twice, cut once. Oh, you got something different? PFF. There we go. Okay. Alright. So are you tailing these logs now? I'll run a get nodes command. Yeah. Tail dash f on that. Just do tail yeah. Okay. Just get done here. You ready? Yep. A dash f on that. Sorry. I get
1:18:54 we got a response. So that worked. Jeremy, stop this. Yeah. And just do k get nodes. Please. What if it was process 1984 was the problem? Okay. Okay. Good luck in everyone. I did the quick fix. Maybe. Yeah. Who was expecting that? I wasn't. So I will say I wondered about this kind of thing, because we did something similar with our webhook where, like, you can change a manifest in some way to get it to spin up a bad process and then change the manifest back so it doesn't look like it's out of
1:19:34 sync, but it actually is. Okay. So now component statuses, cluster info I don't really care if a statuses. Right? I would like to see it. Just curious. Okay. Skyscanner are clapping Good. And seeing I don't saw anything. Just scheduler and controller manager, we can ignore that. It always looks like that. So I tell you, this takes me back to my Windows days. Like, oh, it's broken. Just restart it. Turn it off and on again. That's right. Yep. Okay. So we talked through, like, some things to do kinda from the get go.
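The fix here was killing the running API server process so the kubelet restarts the static pod from the manifest on disk, and the team nearly killed the neighbouring PID. A small sketch of picking the PID from ps ax style output by executable name, so that a grep line or a nearby process is not selected by accident. The sample output is hypothetical, including which process owns PID 1985.

```python
def find_pids(ps_output, executable):
    """Return PIDs whose command starts with the given executable name."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header line or something we cannot parse
        program = fields[4].split()[0]
        # Compare the basename so "/usr/local/bin/kube-apiserver" also matches.
        if program.rsplit("/", 1)[-1] == executable:
            pids.append(int(fields[0]))
    return pids

# Hypothetical `ps ax` output; only PID 1984 for the API server echoes the transcript.
sample = """\
  PID TTY      STAT   TIME COMMAND
 1984 ?        Ssl    0:42 kube-apiserver --advertise-address=192.0.2.10
 1985 ?        Ssl    0:03 etcd --data-dir=/var/lib/etcd
 2101 pts/0    S+     0:00 grep kube-apiserver
"""
print(find_pids(sample, "kube-apiserver"))
```

Matching on the first word of the command is deliberate: a plain substring search would also return the grep process, which is the kind of slip "measure twice, cut once" is about.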
1:20:18 Can we so we can see let's check the mutating webhook configurations, validating webhook configurations. I can type it. All these shortcuts, and they don't have shortcuts in it. I don't understand. You can hit your auto completion that we created. Oh, it's validating. Oh, nope. And it seems to do that. Hey. Okay. Alright. Let's delete it. Uh-huh. Delete it. Uh-huh. I don't even care what it was. Just delete it. Well What's up? That's what I hope they didn't do the kind of thing that we did too or that we talked about doing, Shahar. So
1:20:59 one thing we talked about doing is putting a valid webhook in that would replace the valid webhook configuration. So I just have one in that would, like, replace reschedule a replacement. Oh, I need this part right here. Yes. I know this. That's a great idea for future episodes, actually. Like, have a broken cluster with a mutating webhook that fixes new commands. How do you delete this? Well, let's go do so just do k delete validatingwebhookconfigurations --all. Oh, okay. That's easy. Cool. There you go. Can I get it again? Yep. Cool. Okay.
1:21:40 Cool. So now let's To the fix? Well, let's update? I was gonna say let's try to we've got a v two and see what happens. Yep. I did that deployment. I suspect we are not done, but Mhmm. Better now. It's Hey. You don't need to update it. That's it. Did we know if it was actually working, though? Like, did anybody check the web page? I don't care if it was working. Yeah. It's like always a good thing too. Well, because I tried checking it, and then I'm not sure if I'm using the right
1:22:21 URL, but it's the internal server. And I can click on applications here, click clustered? Let's see if he typed in the right. Work. Right? Yeah. I'll do those. Wonderful. It's Okay. So now we may be dealing with an application error. Let's go make sure our Postgres pod is running. So I'm gonna check the services and the pods that are running in the default namespace. I guess what we're looking for is there's no space between the comma and pods. Because what we wanna see is we wanna
1:22:56 see if the application, all of its components, are up and going. So it looks like they are. Let's check the clustered. So, David, one thing I didn't notice I was looking for earlier is the IP address that clustered connects to for Postgres, is that hard coded in a container? Yes. We'll use Postgres as the DNS name to reach the database. Cool. Then with that in mind, Jeremy, I think we should check CoreDNS to see what the Corefile says. So we should be able to get pods
1:23:34 or get config maps in the kube-system namespace. I wanna see that. So I don't think they're gonna mess with it. It doesn't look like, but I would like to see that CoreDNS config map. That's what Guy changed last time for the record. I don't think that's would be So no. To get oh, yeah. I'm off. Because I wanna see the whole thing. Okay. Hang on a second. I don't know what that lame duck health is. Do you know off the top of your head, David? Is that legit? No. And I should know better by now because
1:24:32 I've had to debug Kubernetes many times in this show, but I don't. I will point out though that we got internal server error on the application and not the DNS error, which would be displayed to us with angry man from movie. So if there were DNS problems, the clustered application would still say It would tell us that it failed to reach Postgres. But, I mean, I left a shell in the clustered pod. You could go ahead and run a curl if you want to see the resolution. You should be able to run bash in
1:25:06 the clustered pod if you wanna confirm. Yeah. Let's The chat is telling us that lame duck is acceptable and fine. Thanks, Noel. Cool. Thank you. My deal with the cuddle. One sec in. So kubectl exec. Alright. Just under ten minutes per minute. It's getting I'm gonna do that. Gonna copy it. Yeah. One thing that I found myself trying to do last night was I do all my work in tmux, so when you're messing with things trying to break it up, I'm trying to do the tmux. What is Short prefixes to like, or
1:26:03 something like that? Yeah. Bash is available. Yep. So what do you wanna do now? You are in the pod. Yeah. Yeah. Curl from here. What are we curling again? Just Postgres. You just wanna see it resolved. Yeah. I kinda wanna see what dig Postgres resolves to. I know I read dig post search Postgres. You're gonna have to remember if it's dnsutils, dns-utils, bind-utils. Yeah. Okay. So it's dnsutils. I'm getting this utils. You wanna type it? Sure. We've got some support in the chat from Rawkode saying go Team DO.
1:26:57 Mhmm. Oh, got install install. Yeah. I think it's dash utils. Is it like that? I think so. Yeah. I have to do this every time. I should really remember it by now. I would've just googled this by now. It's all good. I like to get it wrong 14 times and then Google it. So That's right. There we are. You always gotta do that update. It Postgres? Is that the name of that? Postgres. Service? Yep. Okay. So it can resolve that. Let's You have a I'm gonna get that out here. I wanna see the
1:27:55 logs. Like, of the actual application? Yeah. Because that might just tell us what the error is. Oh, yeah. We didn't run the logs command. Yeah. There's not a lot of logging in the application, to be fair, but we can see. Yeah. Well, thanks, David. Alright. So we're back in the container now. Jeremy, I'll let you start driving again. Alright. Alright. So what could this so we're pretty sure that they haven't messed with the deployment itself. They might have. Right? So let's go. What's running here? ps ax. Okay. httpd makes sense.
1:28:54 Let's exit out of this. That is not my binary for the record. Okay. Good to know. Let's go take a look at the deployment again. Skyscanner. Oh, Guy particularly says, you are doing a great job in getting close. And Alex confirms he read the Rust code. There is zero logging in my application. Oops. Alright. So I wanna check command line arguments and image name and things like that up here. Matthew says it's written in Rust that doesn't need logs. I'll translate roughly what he was saying there. He's right. Memory safety, it can't
1:29:41 crash. Yeah. So if it can't crash, you don't need to know what's going on. Right? Exactly. Right. Jeremy, can you go to the top here? Let's go take a look at the image and the container arguments and things. On inside of this Here we go. What like, inside this call or something else? No. Scrolled up. I want yeah. For the deployment. There we go. There we go. Oh, there's a lag. I don't know. Oh, yeah. It's lagging. It's crazy, I guess. Alright. Jeremy, you gotta slow down so we can
1:30:20 I'm not clicking anything. It's just Oh, sorry. That was me that was scrolling. No, David. If you could please sabotage us, we'd appreciate it. Okay. So There is? So I'm seeing ghcr.io, Rawkode, clustered v2. Is that the right image name? That is well, it's the right value for the image in the YAML. Yes. Okay. And image pull policy is always. Let's go check the service. Juan, is it possible that I somehow ex like, left? Oh, now it's updating. I don't know what's going on. Alright. What are we checking now?
1:31:14 Let's check the clustered service and the Postgres service as well. But I think that David gave us a really good hint with that's not the httpd. That's Apache. That's not his binary. Binary. Okay. I guess I can't do it like that. If you so you can do so if you want if you want to determine, you specify multiple services with a space between them. So resource types are comma separated. Resources are space separated. So 10. Uh-huh. 1011112197. External driven. Node port, 308080. And that was that's for what? That's This is clustered. What's clustered.
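The rule David states (resource types are comma separated, resource names are space separated) is easy to get backwards while typing. A minimal sketch that only builds the argument list and runs nothing; the helper name is hypothetical, and the service names follow the ones mentioned in the transcript.

```python
import shlex

def kubectl_get(resource_types, names=()):
    """Build a `kubectl get` argv: types joined by commas, names passed as separate words."""
    return ["kubectl", "get", ",".join(resource_types), *names]

# Several resource types at once: one comma-joined word, no spaces.
print(shlex.join(kubectl_get(["services", "pods"])))

# Several named resources of one type: separate words.
print(shlex.join(kubectl_get(["services"], ["clustered", "postgres"])))
```

This is also why the earlier attempt with a space after the comma failed: the shell splits on the space, so kubectl no longer receives the two types as a single comma-joined argument.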
1:32:23 Is it possible that the service is, like, only exposing, you know, that, like, broken maybe, like, Apache server instead of, like, the regular application? It certainly is, but what I'm confused about is if that's not David's binary Mhmm. Then how is that the thing that's running if it's actually using his image? So there's an image pull policy of always. Presumably, they didn't somehow get access to the Rawkode Docker registry, right, or Docker repository and replace it. So how is it running the wrong thing? So we go take a
1:33:00 look at the look. Can we go take a look at the deployment again and check those arguments? Because the last time we were doing that, we got scrolled away from it, and I think I got distracted. Alright. So we wanna go take a look at those arguments, Jeremy, the image up in the pod spec or the deployment pod template. Yeah. I can scroll it. Yep. There we go. Alright. I'll just hang out here for just a minute. Alright. So clustered, ghcr.io, Rawkode, clustered v2. That all looks right. So what's responsible for pulling that image?
1:33:43 So the image pull policy, but it's set to always. I don't What actually pulls the image to the machine? It's not the kubelet. Container. Yeah. containerd? Oh, okay. So I don't know a lot about containerd and where things start to how you can configure everything. Yes. So you can need the help. You can do a crictl images to list the images, and there's also a super cool command called oh, you'll need the --runtime-endpoint, and let me type that in for you just
1:34:20 in the interest of time. Thank you. Oh, so this is how you avoid the socket issues that you were having, David? For containerd. Yes. Yeah. And the other thing I would suggest is containerd config dump, which will list the containerd configuration. However, my advice might be getting you colder as Guy has just said in the chat. But I don't know if that's my advice or what we were doing beforehand. But I would still take a look at containerd config dump before moving on. Yeah. I agree. Alright. So Although my image isn't listed there, by the
1:35:06 way. Yeah. No. Since it's on a worker machine. Sorry. Where's this pod scheduled? Yeah. A get or something. Let's see. Or -o wide. Yep. And then we can see the node it's scheduled on. Cordon... hey, Jeremy. Cordon the other two nodes. Delete the pod. Okay. So I like that. What's the cordon command? Is it, like, k cordon and then the node? And then the node name, sure. Which one do you wanna do? Worker one or worker two or something? I want to cordon both of them.
1:35:53 Skyscanner dash worker one and Skyscanner dash worker dash two. Okay. So... Is Skyscanner worker dash here? Yep. Mhmm. No. Space. These are resources. So resources are space separated. Resource types are comma separated. Remove the comma after the one. Okay. And then something. You'll need to remove the taint from the control plane node. What is that? Like, get taints or something? Or... You can just do edit node and just delete the actual... yeah. No. We don't need to do it the nice way. We could do it through taint with the dash syntax, but it's not important.
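The cordon and taint-removal steps discussed above can be sketched as a small command builder. This is a minimal sketch rather than what was typed on stream: the node names are hypothetical, the taint key `node-role.kubernetes.io/control-plane` is an assumption (older clusters use `node-role.kubernetes.io/master`), and the functions only build argument lists instead of running anything.

```python
import shlex


def cordon_commands(worker_nodes):
    """Return one `kubectl cordon` invocation per worker node."""
    return [["kubectl", "cordon", node] for node in worker_nodes]


def remove_taint_command(node, key="node-role.kubernetes.io/control-plane",
                         effect="NoSchedule"):
    """Return a `kubectl taint` invocation; the trailing '-' removes the taint."""
    return ["kubectl", "taint", "nodes", node, f"{key}:{effect}-"]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    workers = ["worker-1", "worker-2"]
    commands = cordon_commands(workers) + [remove_taint_command("control-plane-1")]
    for argv in commands:
        print(shlex.join(argv))
```

With the workers cordoned and the control plane untainted, deleting the pod leaves the scheduler only the control plane node to place the replacement on, which is the workaround the team was aiming for.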
1:36:55 And control plane. This is the worker node. Yeah. We want to remove the taint from the control plane. The Skyscanner dash control. Yeah. And just search for taint. In fact, it's there. Top of the spec. Just... No. All the way. Three lines. You're right there. Under spec, taints, the NoSchedule. Just remove... Oh, yeah. That. That's... Yep. K. And what's the next thing you wanna do? Delete how you want it. Delete that pod. Yep. Is that the clustered one? Or... Yes. No. It's okay. Yes. The clustered one. Bye bye, broken clustered pod.
1:37:56 I figure if this fixes it, then working around the pull is as legitimate as... Remember a lot. I just broke from my TD. Right? Right. Just schedule our workloads on the control plane. That's funny. Like, managers are happy. Just get pods. Oh, no. I'm still, Jeremy, just getting an internal server error. Get pods. Okay. It's still responding with internal server error. Yeah. It's going to because it's... Well, it has to get ready first. Well, I was thinking, if the application itself isn't up, you would get a different error besides internal server error.
1:38:43 Yeah. That's very possible. So what we're gonna wanna do is go take a look at that clustered service again. Clustered service. Alright. I might have made a boo boo. I think I did call that binary HTTP. Woah. Alright. I just checked the Dockerfile and now I feel awful. So... It's okay. That's alright. Oh, it's ready now. Okay. Let's hit it. Let's see what happens. It's still internal server error? Yeah. That's what I was saying. Like, the application doesn't need to be up to get internal server error. It's probably happening before it hits the application.
1:39:30 Try 30,000. 30 thousand. I'm sorry. What is it? 30,000. Oh. So there should be a 30,000. A NodePort service with that. So we need to go check the service again, Jeremy. Clustered is the name of the service we wanna check. And not wide. We wanna check... I think you pulled it up. Yeah. Okay. So ports: NodePort 30,000. Port 8080. Target port 8080. No endpoints. K. Ta da. Let's go check. Do get endpoints just to make sure. Very clearly, it seems like that's the case. But can you show the clustered one with
1:40:39 the YAML? Alright. I'll give these another five minutes because of my horrible detour. So I'm wondering if this is networking. Is this Cilium? Are there any network policies defined right now? Good question. There's a command for that. I just remember calling it... or something. Yeah. Can you, like, auto complete with network? Oh, something happened? Yeah. All of them? Okay. So they... There's also Cilium versions. So if you can use the all, it'll complete Cilium as well. I'd also like to, after this, do the pods in the kube-system namespace again. Let's see
1:41:52 how long this pod's been running. Oh, it all restarts. Oh, no. Is that good or bad? Do -o wide on that, would you? So it looks like the Cilium pods themselves were set up... Right. A couple hours ago. Yeah. That's why I wanna do -o wide. I wanna see where those are running. Guy says you're getting warmer and remember their sense. One. Not... so this is the operator. Right? So Jeremy, if you'll just do that same command but with -o wide, because I wanna see what nodes,
1:42:39 the Cilium... Rawkode two? No. No. That get pods -n kube-system -o wide. I just wanna see what nodes the Cilium pods are running on. Oh, okay. Because Cilium restarted not too long ago. Restarted not too long ago. Mhmm. Let's go check the Cilium configuration for that daemon set. Yeah. So... You wanna do, like, get -o yaml? Or... Yeah. kube-system. Yeah. You're gonna have to do -n kube-system. Okay. And we're gonna need to see this whole thing. So let's start from the top.
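The question being asked here (which nodes the Cilium pods run on, and how often they have restarted) can also be answered from JSON output rather than by reading the `-o wide` table. The sketch below assumes the input is shaped like the output of `kubectl get pods -n kube-system -o json`; the pod and node names in the sample are hypothetical.

```python
def cilium_pod_placement(pod_list):
    """Return (pod name, node name, total restarts) for each Cilium pod."""
    rows = []
    for item in pod_list.get("items", []):
        name = item["metadata"]["name"]
        if not name.startswith("cilium"):
            continue  # skip everything that is not a Cilium agent or operator pod
        statuses = item.get("status", {}).get("containerStatuses", [])
        restarts = sum(status.get("restartCount", 0) for status in statuses)
        node = item.get("spec", {}).get("nodeName")
        rows.append((name, node, restarts))
    return rows


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = {
        "items": [
            {
                "metadata": {"name": "cilium-abcde"},
                "spec": {"nodeName": "worker-1"},
                "status": {"containerStatuses": [{"restartCount": 2}]},
            },
            {
                "metadata": {"name": "etcd-control-plane-1"},
                "spec": {"nodeName": "control-plane-1"},
                "status": {"containerStatuses": [{"restartCount": 0}]},
            },
        ]
    }
    for row in cilium_pod_placement(sample):
        print(row)
```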
1:43:53 Oh, wow. Okay. So this was reapplied at least once. Generation seven. Keep going down? Yep. What I'd probably do is just scroll down the page and... let's stop right there and take a look. So we're gonna need to take a look at that Cilium config map. So you want to... Yeah. Yeah. Look at that? Not yet. Both config maps. Okay. Going. Yeah. Actually, I kinda wanna get in our heads the things we need to go look at. So, obviously, we wanna check liveness probes and readiness probes. I
1:44:38 don't think that's gonna be an issue. K. This is the kind of thing that's gonna be hard to spot if it's in here. What's that? Hold on. Where were you? I think go down. No. Right there. That volume mount, that mount path, host PID one NS, is that right? I don't know. Where? Right here. Under volume mounts, mount path. I just highlighted it. I'm waiting for it to show up on... okay. There it is. Yeah. So slash host is mounted into these pods, but I don't know. I don't think that
1:45:37 the host PID one NS... I don't... maybe we should take a look at the mount definition. I like it. We enter volumes. Oh, yeah. There we go. Do you see it? Not yet. Volume mounts or volumes? So go back into volume mounts so we can... Oh, there it is. What we're looking for. It's the third one. What's that? Oh, yeah. It's right there. Yeah. So, certainly, host proc NS is legit. Right? So then it's just mounting it into host PID one NS. So I suspect it's right. Can we search this for host PID one NS?
1:46:34 You might need to do an edit just so you can search in Vim. I mean, the pod is running. If the mounts were incorrect, we'd probably get an error when the container was starting. Yeah. Is this too many? The host PID one NS. Oh, okay. Host PID one... One NS. NS. Yes. K. That's not... Okay. Go on. Oh, hit n to go to the next one. Alright. Should we take a look at a couple of the hints and get this? Yeah. Yeah. Yep. I think it's probably time. What do you guys think? You want a hint?
1:47:28 Think so. Sure. I wonder if DNS policy is something too. So where's the hint? In the root? /root. Yep. Let's do hint one, I suppose. Oh, thanks for the red herrings. "I think that sometimes you need to call IT." What is the... So turn something off and back on. That was what? The kubelet with the API server that we did? Yep. Okay. Oh, there's a link. Alright. Did it recrawl? Think he's... Oh, nothing. That's just the API server hint. Yep. Yeah. I think you've worked that out. We know it's networking. Yep.
1:48:43 Wonderful. Okay. So I am not a Cilium expert. I haven't done much with it. So... The component they're talking about, do you want me to say something, or is Shahar a Cilium ninja here? I do not know what it would be. Oh, so, well, Cilium can be set up to replace, like, all of the networking, I guess, DNS, couldn't it? So I think you're talking about kube-proxy. kube-proxy. That's right. So let's check the config. The config maps for Cilium.
1:49:26 Yeah. It's gonna be in the kube-system namespace, Jeremy. There we go. There you go. And so this should just be a flag that allows it to do that? Stop. What's the kube-monkey config map? There is a kube-monkey pod running as well. I wasn't sure if you noticed that earlier. Okay. I'm assuming that's not a... Yeah. So that's probably the one, I think. I think we're gonna need to take a look at the Cilium config to figure out what's happening there, because we're probably gonna have to fix the Cilium config,
1:50:04 and then we might have to delete the kube-monkey config map. We need to play the. Alright. Look for kube-monkey in here, would you? I would do an edit on this just so that we don't have to scroll. Yep. And then what about kube-proxy-replacement? It was set to disabled. Okay. Did we have a kube-proxy running in the kube-system namespace? I don't think it's a question. What am I typing again? Is it pods in kube-system? Okay. Now we have kube-monkey. Oh. So that kube-monkey is a deployment. Let's just delete it and its config
1:51:17 map because it shouldn't be running. I don't know what it is, but that seems highly suspicious. Or actually, no, let's not delete the pod. Let's edit the deployment and scale it to zero. That way, if it is necessary for something, we don't lose it. So just do two. So this is the least you can do. Yep. That'll work. And what are we editing? Replicas. So you said the scale? Oh, zero. Got it. Scale it down. Yeah. Scale is DOCC. Replicas is the Kubernetes name for that. And then we'll do this guy.
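Scaling the suspicious deployment to zero instead of deleting it, as suggested above, can also be done with `kubectl scale` rather than by editing the replicas field. A minimal sketch follows; the deployment name `kube-monkey` and the restore count of 1 are assumptions, since neither is read out on stream, and the helper only builds the command.

```python
import shlex


def scale_command(deployment, namespace, replicas):
    """Return a `kubectl scale` invocation for a deployment."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]


if __name__ == "__main__":
    # Assumed deployment name; scaling to zero keeps the object around
    # so it can be restored later if it turns out to be needed.
    print(shlex.join(scale_command("kube-monkey", "kube-system", 0)))
    print(shlex.join(scale_command("kube-monkey", "kube-system", 1)))
```

The design choice matches the reasoning in the conversation: the deployment and its config map stay in place, so the change is reversible.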
1:52:05 Something should have happened. Right? Yeah. So you're probably going to want to turn on the kube-proxy replacement because we have no kube-proxy, so there's no cluster networking. Right. So should that be enabled then? There are two potential values. There's probe and there's strict. I would go with probe. What do those do, David? Probe allows it to do checks to see if it should take over, and I think strict forces it to run and do eBPF routing. I can't remember exactly, but I think that's correct. The clusters that I work on mostly use
1:52:51 kube-router, so I haven't done much with Cilium. You'll need to do a rollout of the Cilium pods. But there was a keen-eyed comment from Noel in the chat suggesting that they noticed this Cilium deployment was actually mounting the monkey config map. So you may wanna... Oh. Check that too. Okay. Go to the Cilium deployment and let's edit that out. In the kube-system namespace. The Cilium deployment. Yeah? I think it'd be the Cilium operator. Nope. So we were looking at the Cilium daemon set, and we searched for monkey and didn't see it. Yeah. So I would just
1:53:41 edit the daemon set, change any value, or do a rollout, because you need it to pick up that. The Cilium one right here? Yeah. Alright. So search for monkey again, would you? Noel has retracted his statement. Guy. Well, there might not be a monkey, but just make an artificial change to rotate those pods over and see if we get... If we can't curl 30,000, our networking isn't working. So an artificial change, what would that be? Just add a new label. Yep. What? Add a new label. So under the template... There we go. Yep.
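The diagnosis reached a few moments earlier (no kube-proxy pods running, and the replacement set to disabled in the Cilium config map) amounts to a simple check. The sketch below assumes the setting is stored under the key `kube-proxy-replacement` in the config map data; that key name matches older Cilium releases, and the stream only confirms that the value was "disabled".

```python
def service_routing_gap(cilium_config_data, kube_proxy_running):
    """Return True when nothing is handling Service traffic.

    That is the case when kube-proxy is absent and Cilium's kube-proxy
    replacement is explicitly disabled. A missing key returns False,
    because no conclusion can be drawn without knowing the default.
    """
    mode = cilium_config_data.get("kube-proxy-replacement")
    return (not kube_proxy_running) and mode == "disabled"


if __name__ == "__main__":
    # State as described on stream: no kube-proxy, replacement disabled.
    print(service_routing_gap({"kube-proxy-replacement": "disabled"}, False))
    # After the suggested fix (value changed to "probe").
    print(service_routing_gap({"kube-proxy-replacement": "probe"}, False))
```

Under these assumptions the first case reports a gap and the second does not, which is why a NodePort request could fail before it ever reached the application.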
1:54:39 Nope. No. No. No. Not that. That's not gonna do it. You have to go down to the spec template labels. Just add a label underneath. See where it says labels? Just a few lines down from there. It says k8s-app. Just add an additional label. Right. Yeah. There you go. I can just put a null or something. It doesn't matter. Yeah. Empty string will work. And we should see those pods moving. Yep. I mean, we could also have done a kubectl rollout restart daemonset in the end, but who's got time for that?
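The alternative mentioned above, a rollout restart instead of an artificial label change, could look like the following. This is a sketch only: the daemon set name `cilium` is an assumption (it is the usual default, but it is not read out on stream), and the helper builds the commands without running them.

```python
import shlex


def rollout_restart_commands(kind, name, namespace):
    """Return the restart command followed by a command that waits for it."""
    target = f"{kind}/{name}"
    return [
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


if __name__ == "__main__":
    # Assumed daemon set name and namespace.
    for argv in rollout_restart_commands("daemonset", "cilium", "kube-system"):
        print(shlex.join(argv))
```

Both approaches change the pod template so the controller replaces the pods, which is what lets the agents pick up the edited config map; the restart simply avoids leaving a meaningless label behind.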
1:55:26 I'm more of a delete pods --all kind of person. Me too. Yeah. You could probably try localhost and see if that's working. Yep. Oh, look at that. That's bear. Right? There we go. Oh, yeah. It's up. Cool. Go to the cluster. Oh, Stay the skirt. Oh, that's good. Very nice. Finally. We got to use the hints. I'm glad. It's all good. Actually, I mean, we read the hints, but I think you were already there. Like, you were ahead of the hints. Like, none of the hints were surprising to you. Right?
1:56:15 So... Right. Yeah. Yeah. Yeah. Yeah. This is fun. Two tough clusters. Thanks. Damn. Thank you, Skye. I think breaking is a little easier, but... Alright. Well, thank you so much for joining me. It was great watching you as well. Thanks for having us. I'm glad we stuck with that. We were so close for a while there. I'm really, really sorry for leading you down the HTTP binary path. I'm gonna kick myself for that. That makes it more realistic. That's how it works in production for real. I think it was great. Alright. Well, thank you again. Have a wonderful
1:56:51 evening. Enjoy your weekend. Thank you to all of our viewers for watching. Thank you to Teleport for sponsoring. We will be back next week. Thanks all. Bye, everyone. Bye all.