About this video
What You'll Learn
- Fix expired Kubernetes API server certificates and regenerate trust with practical kubeadm commands.
- Trace scheduler and controller-manager issues by checking static manifests, port settings, and pod restart behavior.
- Resolve resource quotas, taints, containerd images, and registry redirects to restore app and database pods.
Container Solutions and Civo Cloud race to repair broken Kubernetes clusters, fixing expired apiserver certs, a tab-corrupted scheduler manifest, a malicious validating webhook, node taints, scheduler port mismatches, resource quotas, and a containerd registry redirect.
Jump to a chapter
- 0:00 Holding screen
- 2:00 Introductions
- 2:04 Introduction to Clustered Teams
- 2:24 Housekeeping & Sponsor Thank You
- 3:00 Team Container Solutions
- 3:08 Challenge Begins: Team Container Solutions
- 3:14 Introducing Team Container Solutions
- 5:21 Task for Team Container Solutions
- 5:53 Container Solutions Debugging Starts
- 7:07 Debugging API Server & Certificates
- 10:10 Renewing Certificates with Kubeadm
- 13:58 Persistent API Server Issues
- 17:05 Debugging Missing Pods (ReplicaSets)
- 18:48 Clue: Scheduler Not Creating Pods
- 19:39 Identifying Missing Controller Manager
- 21:25 Debugging Scheduler Static Manifest
- 23:05 Investigating Admission Controllers
- 24:14 Finding and Removing Malicious Webhook
- 25:42 Debugging Pending Pods
- 26:45 Discovering and Removing Node Taint
- 32:09 Application Pods Running (v1)
- 33:12 Upgrading to Application v2
- 33:48 Success: Application v2 Verified
- 34:08 Container Solutions Debrief
- 34:40 Team Civo Cloud
- 34:42 Intermission & Sponsor Message
- 35:11 Introducing Team Civo Cloud
- 36:37 Civo Cloud Team Introductions
- 38:10 Civo Cloud Debugging Starts
- 40:53 Debugging Scheduler Port Mismatch
- 43:06 Fixing Scheduler Configuration
- 44:21 Restarting Scheduler Pod
- 52:51 Debugging Resource Quota Limitation
- 53:19 Inspecting and Editing Resource Quota
- 57:08 Deleting Resource Quota
- 58:21 Forcing Pod Reschedule
- 58:42 Application Pods Running (v1)
- 59:58 Debugging Incorrect Image Pulled (Containerd)
- 1:03:12 Checking Images on Worker Node
- 1:05:32 Deleting Incorrect Image
- 1:09:54 Debugging Containerd Registry Redirect
- 1:11:39 Fixing Containerd Config
- 1:13:29 Restarting Containerd
- 1:14:30 Database Connection Error
- 1:15:20 Debugging Postgres Pod Placement
- 1:15:50 Rescheduling Postgres Pod
- 1:17:55 Connection Timed Out / Network Policy Check
- 1:18:55 Finding and Deleting Network Policy
- 1:19:42 Success: Application v2 Verified
- 1:19:51 Civo Cloud Debrief and Conclusion
- 1:20:14 Final Wrap-up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:04 Introduction to Clustered Teams
2:04 Hello and welcome to today's episode of Clustered on Rawkode Live. Wow. Today is the first version of Clustered Teams and I'm very excited to have a team from Container Solutions and a team from Civo Cloud here to fix some rather broken clusters, each broken by the other team. Now before we begin on that, there's just a little bit of housekeeping. First, if you're not subscribed to the YouTube channel, please do so now. This is gonna get you alerts and notifications for all new episodes as we explore the cloud native landscape together. Also, Teleport joined us last week as a sponsor
2:24 Housekeeping & Sponsor Thank You
2:42 of Clustered, and it was just the easiest decision I've ever had to make when it came to this show. Teleport has been the software we've used ever since the second episode and makes debugging and fixing and pairing on broken clusters, well, as easy as it can be, but, you know, still painful. Right? It's Kubernetes. So thank you, Teleport. It's been a pleasure working with you and I'm excited to bring Teleport to more people. Okay. Housekeeping done. We're gonna start by fixing some broken clusters and we've got a great lineup today. Popping over here, I am joined by Team Container Solutions.
3:14 Introducing Team Container Solutions
3:18 Hello all. How are you today? Hello. Hey. Panicky at all? Not panicky at all. No. No. I have no idea how today's gonna go. Like, I'm so excited for teams edition, but also logistically, I'm like, this is gonna be interesting. I'm really hoping that as three colleagues, you're all just gonna be pairing together perfectly and harmoniously. Right? Yes. Yes. I think we should start by pointing out, although we work together, we're not on the same team, we don't answer as well. I met Adrian today for the first time.
3:58 Okay. Well, why don't oh, apparently I'm a bit crackling. Alright. I'll fix my mic, Kai. Thanks. Can you all do me a favor, starting with Charlotte, Ricardo, then Adrian, just say hello, tell us who you are and we'll get started on today's cluster. Of course. Hi, everyone. I'm Charlotte. I'm an engineering manager at Container Solutions. Still do some technical things every now and then, so we'll see how this goes today. Ricardo? Oh, sorry. Hi. I was looking at Adrian's. Yeah. My name is Ricardo. I'm a cloud native engineer at Container Solutions. I've been playing around with Kubernetes for the over
4:41 a year. I'm having a lot of fun there. I'm based in Portugal. Hi. I'm Adrian. I should have my surname because I don't know why you can't call me as Adrian. But, anyway yeah, I'm chief scientist at Container Solutions, but I really look after our labs projects. So it's like all our open source projects at Container Solutions. So whilst I do do a fair bit of Kubernetes, it's on the development side, and I'm not really used to maintaining clusters. I've done notice I'm getting my excuses in fast. Nothing wrong with getting an odd excuse in
5:16 early. All part of the game. Alright. So here are the rules. You have forty minutes to work on the Civo Cloud cluster. Your mission is to upgrade the cluster deployment from the v1 image to the v2 image, be able to access the application, see the quotes, and the video. Simple. I'll also caveat this. I have no idea what either team has done to their clusters. I will be of no help whatsoever to you. I am here purely for moral support. Now, I'm gonna pop up our screen share like so. This is Teleport. I will open the first session
5:53 Container Solutions Debugging Starts
5:59 on the control plane node for you. If you can all join and just give me an echo hello to let me know that you're there, and if you wanna nominate one of you to start with the typing, that might work best. I'm not sure. We'll see. And I'm just gonna spend today's session looking at the chat. Alright. We got an echo hi. So at least one of you and two. Alright. Who quotes their echoes? Come on. That makes sense. Alright. It looks like we have three guests all in this session.
6:45 Disappointed with the lack of all caps from you guys but this is okay. Alright. It's best not to start fighting just yet. You've gotta work together as a team to fix this. Alright. You'll need to set up your kubeconfig and any aliases that you need and then if you want to do the honors and check for an API server, it's usually a pretty good place to start. Yeah. You be an existing. If you want me to set up the kubeconfig, I'd be happy to do that. If you want to do it yourself, please feel free.
7:07 Debugging API Server & Certificates
7:17 Ticket Charlotte says you type in. Right? Yes. Is that alright? Am I so slow? Works for me. There's a feature request for Teleport, actually. We need colored cursors so we know who's typing. Certificates. Why? Current time is after last year. So certificates expired. Classic. Oh, it's a date. No. Hang on. Oh, jeez. Check the current date. It's a date in the past. Oh, current time today. That's fine. Current time's after. Oh, yeah. Okay. So that's when the certificate expired, is it 2020? Who wants to go to renew certificates? For all the certs, under manifest? API server.
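Editor's note: the "current time is after" error quoted above is what a client reports when a certificate's notAfter date has already passed. Below is a minimal Python sketch of that comparison, assuming the expiry is available as text in the format printed by tools such as `openssl x509 -noout -enddate`. The function name `is_expired` and the sample dates are hypothetical and are not taken from the episode's cluster.

```python
import ssl
import time
from typing import Optional


def is_expired(not_after: str, now: Optional[float] = None) -> bool:
    """Return True if the certificate's notAfter time is already in the past.

    not_after is a string such as "Jul 20 10:00:00 2020 GMT".
    now is seconds since the epoch; it defaults to the current time.
    """
    # ssl.cert_time_to_seconds parses the notBefore/notAfter text format
    # used in certificates and returns seconds since the epoch.
    expiry = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return current > expiry


if __name__ == "__main__":
    # Hypothetical expiry date, for illustration only.
    print(is_expired("Jul 20 10:00:00 2020 GMT"))
```

On kubeadm-built clusters, recent kubeadm releases also offer `kubeadm certs check-expiration`, which reports the same dates without manual parsing; whether that subcommand is available depends on the kubeadm version installed.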
8:16 There. Why So what's because that number does oh. Alright. I was just gonna say what cert does the kubelet use? The API server or something. Is that API server cert? I mean, I know it's in the kubelet config. Don't know. The 2019 is what I mean about here. Yeah. I'm loving the timestamps on those files. Oh, I can scroll in here. Yes. You can each scroll independently. If there's something you want me to show to the audience, let me know and I'll do my best to scroll on my side so they can keep up. Oh, cool. Okay. Okay.
9:28 I mean, these ones. Right? Is one of you looking at part of your new certificate? I thought we just set the date two years in the past and see what happens. We had a comment from that in the chat. Diego is suggesting just change the date on the machine. There are rules to Clustered and messing with the dates will break all certificates and couldn't Teleport. So we'd probably lose access to the machine, but I do not recommend it. That's good to know. But then we're done. We just go home. Yeah. I have no idea how to rotate the
10:06 certificate to be perfectly honest. So you can use kubeadm to renew certs. You don't need to learn OpenSSL commands or anything like that. Yeah. There is actually kubeadm. So I see you're onto it now. I can't see the bottom of the screen. You can't see it. Can't scroll for some reason. If you can't scroll, if you just like resize your window a tiny little millimeter, it seems to fix it. It's just a wee bit frustrating. But like if you just like drag the corner of your browser window and make it a little bigger and then a
10:10 Renewing Certificates with Kubeadm
10:44 little smaller, it fixes the scroll. It's like a sync event that happens when people join. I'm not really sure what the problem is. Alright. What's that? The Civo team are vocal in the chat and they say they're here if you want hints. No. We'll try and fix this first. That's gonna be the stone snippet when I publish this after you fail miserably, Adrian. We'll be fine. Sorry. That's harsh. That's harsh for me. Okay. They're all involved. Up next, it's Jason. Yeah. I guess we're pointing out. Doesn't matter what the project does. It's not just shell. Does that
11:37 sorry. Let me go. Okay. We can't even do that because we can't talk to the Good. I need to start with the API server. Yeah. No. Yeah. I think there's no dash. Isn't there renew everything that's expired? Is it dash dash l? Oh, yeah. I'm doing my best on ace. It's just the all there. And there's no Did you try to renew all before? Didn't like it. I tried this with minus minus, but. No. Look at that. Okay. And let's restart. Restart everything now though. systemctl. Yeah. Oh, those are the pods. Right? Oh, if
12:34 I restart the kubelet on this one, it should restart everything. So these are It's looking healthy. Good. Hey. Think Okay. Okay. Yeah. Looks good. I hate whoever does start breaking, by the way. No offense to see what he can has been broken, so I'm gonna be completely useless again. Alright. Although maybe it could just be affected by the So some configuration error in Cilium. That's your first ten minutes down. You've got thirty five left. I'm sure that's all they did. I'm sure Civo were really nice to you. I hope. This node shell. What's that node shell completed?
13:41 That looked weird. That worries me. This node shell completed the job thing. Must keep the system. Kube API server isn't running. Oh, no. This isn't ready. There's a lot of restarts we've got there. Yeah. Are you seeing a different screen from me? Me? Because mine says cube based, so we're fine. It looks fine over here too. Mine is fine as well. Adrian's working with one of his own clusters, I think, here. That's called up. Okay. Just gonna look and see them for now. It says restart once because it worked when restarted. That's all it is. I was looking
13:58 Persistent API Server Issues
14:41 at the old one. Oh, you're scrolling in your own version? Okay. Yeah. It still works. So what are you thinking right now? Any ideas? I'm thinking we cannot connect to the Cilium agent on the worker, which makes sense because the worker isn't running. I'm getting a lot of helpful info from the pod itself. I thought you're gonna say for me. Which would be accurate if I tell you anything about Cilium. I mean, can we I think you already Yeah. So it looks like the Cilium CrashLoopBackOff is on worker x c x. Would you like me to open a
15:50 session on that machine? Yes. But give me a second first. Of course. So I'll check if it's in the logs. Where? I lost my screen. Okay. I'm back. Difficulty isn't oh. I do love the delete all approach. That's one of my favorite techniques on this show. You've been criticized about several times I believe. I know. Because that's how I also usually upgrade applications and people keep giving me grief. To be fair in this case, the only ones I wouldn't have had to restart are the Hubble ones and they probably don't break anything. Uh-oh. Uh-oh. What yeah.
16:49 Must be sticking that. That's the API server IP. Correct? That's not looking healthy. Yep. I lead the star approach. The delete hammer of doom. So what does it mean if you've got deployments, replica sets and no pods? So usually, I'd say that's something with a scheduler, but since oh, right. I checked for the pod and not the record. Oh, because it's let's check here. What was that? Is that the screen port? No. It's for the record. Yeah. Replicas two current, two desired, zero running, zero waiting. Desired as a two. So you mentioned scheduler. What was your thought process there?
17:05 Debugging Missing Pods (ReplicaSets)
18:11 Usually, when a pod doesn't show up, it doesn't show up at all, like, that means it isn't scheduled. Right? Can you Shall we check the configurations there and see? You could do crictl and see the pods are running there, but it needs to be on the worker node. Give me a second. These ones were worrying for you. Right? And maybe we wouldn't delete any more pods for a bit. It's really not those. Feel like he's walked into a bear trap there. So I'll give you some info. Like if it was the scheduler, the pod would be created.
18:48 Clue: Scheduler Not Creating Pods
18:52 It just wouldn't be assigned to a node yet. It wouldn't be able to be created. But I don't see anything there. I don't see anything to resemble a pod. Nobody could just schedule and you'll know it doesn't even stuff there. Sure. Some great tips from Waleed in the chat if you want me to pop one on the screen. You. Network policy. Alright. The replica set is not creating, then check the controllers and the API server. No controllers. Look at the API resources as well. They're not there. That might be Yeah. There's no controller manager.
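Editor's note: the reasoning in this exchange (a broken scheduler leaves pods created but unassigned, while a replica set with no pods at all points at the controllers or the API server) can be written down as a small rule of thumb. The Python sketch below is a deliberate simplification for illustration. The function name `suspect_component` and the wording of the returned hints are hypothetical, and real clusters can fail in ways these three questions do not cover.

```python
def suspect_component(replicaset_exists: bool,
                      pods_exist: bool,
                      pods_have_node: bool) -> str:
    """Suggest where to look first when a Deployment has no running pods.

    The three flags answer, in order: is there a ReplicaSet, has it created
    any Pod objects, and have those Pods been assigned to a node?
    """
    if not replicaset_exists:
        # Deployments are turned into ReplicaSets by a controller that runs
        # inside kube-controller-manager.
        return "controller manager (deployment controller)"
    if not pods_exist:
        # The ReplicaSet controller creates Pod objects; admission webhooks
        # on the API server can also reject every creation request.
        return "controller manager or API server admission"
    if not pods_have_node:
        # Pods exist but stay Pending: scheduling is the next suspect,
        # including taints and quotas that block placement.
        return "scheduler, taints or resource limits"
    return "kubelet or container runtime on the node"


if __name__ == "__main__":
    # The situation described at this point in the episode:
    # ReplicaSets exist, but no Pod objects show up at all.
    print(suspect_component(True, False, False))
```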
19:39 Identifying Missing Controller Manager
19:45 Okay. Really suggest that Civo have invented a new technology, invisible pods. This isn't particularly useful, unfortunately. Civo do build some pretty cool stuff, but I'm not sure if that's on their road map. Where did the static pod logs go? That was /var/log/containers. What is this node shell stuff? Have a look at that. Is that something normal? No. No. What's in the log? There's no logs in there. That's rubbish. Okay. Can we run systemctl as well? Maybe there's something Still looking at the scheduler logs for a second. It looks fine, though. Isn't it? Just your systemctl status
20:57 as well. That's on yeah. That's on here. Yeah. One for sure. Do those have services, the scheduler? I don't know. Yeah. So have confirmed that node shell is nothing to be worried about. It's just part of k9s, which I think they were using for playing around. Okay. Line 38 in Oh, look. Could not process manifest. There you go. I think you found it. Yeah. Okay. So there's something wrong and there was Kubernetes manifest. Yeah. A shame there's no GitHub Copilot for Vim. Yeah. Did you hit Vim there? Because the first
21:25 Debugging Scheduler Static Manifest
21:45 the place where it started might have been That's always a good giveaway I found. Oh, yeah. Because that's a tab. What's the Vim command to show spaces and tabs? Something? I don't remember. Was just curious if anyone knew. I don't remember. Can you see the whole of that on your screen? If I scroll to the right, you can't? Oh, maybe I can. Don't know. So with journalctl, there's arguments so that it wraps it properly. If you need to do it again, I'll throw one out at you. Admission control. Sorry. Yeah. Is there an admission controller
22:42 running? Let's just kill any of the admission controllers. Burn them in hell. I'm gonna get a t shirt that says it's always admission controllers except when there's finalizers. Admission controllers are added here somewhere. Right? API server? Depends what type of admission controller. Yeah. It run you know, when that is API resources or just gets admission controls. I gotta agree with Nolan. This putting tabs in YAML is evil. Yeah. Actually, it's gonna. Sorry. No. No. No. No. That's It's a great tip to cat -t apparently. I'm assuming shows tabs. Didn't know that. Oh, really? Yeah. Cool. Oh,
23:05 Investigating Admission Controllers
23:37 nice. Okay. So webhooks. I can't remember. Who's googling? We can just do kubectl get under webhooks now. Yes. You can do a kube control get, and there are two types of admission controller. There's validating webhook configuration and mutating. Yeah. Yeah. So kubectl get and validating admission. Yeah. I suspect there's gonna be mutating, but anyway. It was validatingwebhookconfigurations. And unpacked. And mutating. Yeah. You can also run kubectl api-resources to get a list of. Yeah. Oh, look at that. Describe it, I suppose, or that or kill it. I would have failed it if the describe
24:14 Finding and Removing Malicious Webhook
24:45 works. You know, I don't know. Deleting pods got you in trouble earlier. So, you know. Yeah. Failure policy: Fail. So it's not gonna yeah. So that's not gonna let you start new pods. So that was that's probably what's So the So the starting. Yeah. So that would make sense. Right? The pods didn't come up, and it's because that was stopping it. All of these faults are things the Civo team have done or had customers do to themselves. That's good to know. Real world situation. Themselves. Like, it was all your fault, customers. Did we get rid of the machine tool
25:29 or what's that one? Yes. That was deleted. I deleted it. Yeah. It's always funny when those things come back, though. Nope. I guess. Yeah. Yeah. I got any pods starting up? No. Oh, if you're looking to see your news, there's pods coming up. Cilium Oops. Pending. So Nice. Okay. Who's got the show in the background? Mommy. Yeah. I think it might be on my side. Sorry about that. It's all good. You're in Portugal, so can't even make a football joke. That'll be too easy. Fifteen minutes ago, I was still there. Okay. Oh, it's kind of funny when I scroll.
25:42 Debugging Pending Pods
26:40 I'm not quite sure. Well, at least you've got pods that are just pending there. Yeah. This failed to fetch token doesn't sound great, does it? That's thirty minutes ago. That was thirty minutes ago. Think that was still a certificate. Yeah. Can you describe the replica set or the pod? Did it give us an indication why it was pending? Yeah. If you just describe the pods, you'll get the events. Because it's still Yeah. It's still mentioning the certificates. Yeah. Because they're not assigned to any node yet. But if we describe the replica set, will
26:45 Discovering and Removing Node Taint
27:37 not tell us what's If you describe the pod, it will tell you what it's waiting for. I mean, events are usually at the bottom. Right? So that should There's a Charlotte's first guess was right. No. No. There's none. Scheduler? Indeed. So We fixed the static manifest. What do you call them? We fixed the controller manager. We did not fix the scheduler if that is broken. There's two. Maybe I looked at the wrong one. Why don't we just do now the star shell and see what the edit is? I'll do a find dash dash anything. Find slash. What are you looking for?
28:50 I'm just one of the edited other commands, other stuff in the it's the Kubernetes. So Adrian thinks to modify the static manifest for the scheduler. Yeah. You're doing ls -al. They might have changed the Oh, nice. K. What's the problem there? Is. Yamal. Yamal. Sneaky. We're doing ALS. ALS just to see those. Oh, so very long. Mhmm. You have fifteen minutes left. No pressure. That's fine. Oh, fine. But that's Is it fair to tell you? These are finding problems and fixing them. I'm impressed. So what's currently broken? It's still the pending pod
30:19 because the scheduler is still not there. We still have no scheduler. Where? I think it might be coming up. So I was hoping. But some of them are running. Yeah. Yeah. The scheduler is running, but the pending is still there. Hubble's down, but do we care? I don't. Hubble. Come on. What happens? Taint. There we go. One node just oh, I hate things. What was that? Taint and then minus. Right? There's Maintenance true. There you go. No scoreboard buttons. Maintenance true taint. Yeah. How did I do that? I have no idea. No idea. And note
31:28 You can do a taint And then the minus at the end. And then the yeah. With the minus. Oh, yeah. That's a bit weird syntax, isn't it? They love the minus syntax in Kubernetes. It just means that it's dash. This one? Yeah. That should work. Dash dash all. Sorry. Alright. Back to Kubernetes. I'm not using word certificate. The kubeadm thing anymore. Okay. Well, that's impressive. Okay. Oh, look at that. Alright. Something's actually running. So how do we see? Yeah. You wanna check it now? Yes. Yeah. What's the but that should
32:09 Application Pods Running (v1)
32:21 be public. Right? Because we have the There's a NodePort on it, 30000. Yes. But I can't see the external IPs from on here. That's alright. I can pull it up here or you can just curl it from the machine, whatever you prefer. You know, you don't sound particularly confident that this is working. It's working. There you go. You got version one running. It's hard to tell which oh, yeah. Version one. Oh, okay. No. Don't update it. Can should we check it's okay. What do you mean? I want to see the web page. That's all I mean. Oh, I put it up.
32:58 Sorry. Mean We did cam thing. That was good. Yep. Back up again. That's me looking at my watch saying you're running out of time. Oh, yeah. I see it. No pressure. I didn't do what you said. I don't have the other thing to the side. Yeah. It's probably just pulling the image unless they've left any more sneaky breaks on there for you. It's pulled. You could do -w, you know. It is running. Could. But this is better for my anxiety to just constantly do the same thing over and over again. You
33:12 Upgrading to Application v2
33:46 got the dance. You refreshed. Oh, really? Is that v2? Yep. Charlotte, you did this all by yourself, by the way. I think Yeah. You did. I was doing business. Well done. Team Container Solutions moral support. Also, thank you for people with the hints. I'm very glad and thankful and grateful for any kind of hints always. Alright. Well, Team Civo, if you wanna start making your way to the link. Team Container Solutions, thank you very much. Well done. You smashed that and you still had nine minutes left for further debugging. Sorry Charlotte. You smashed it. Adrian and Ricardo. Thanks
34:08 Container Solutions Debrief
34:25 for joining us. Alright. If you just want to drop off, thank you again. I really appreciate that and I'll speak to you all soon. Thank you. Alright. Thanks. Bye bye. Alright. You're stuck with me while we wait for the Civo Cloud team to join us. And I'll just take this opportunity to say thank you to Teleport again for sponsoring. It's a pleasure to have them on board. We use this tool every single week on Clustered and it is just a fantastic tool. If you have a Kubernetes cluster, you should deploy it. Check it out. It's great on
34:42 Intermission & Sponsor Message
34:59 bare metal too. Okay. Let's hope the latency is not too bad and Civo are gonna start piling in any moment now. Oh, we do. Perfect. We got Dinesh. Hey, mate. How's it going? Oh, it's I am too. Thank you. Awesome timing. Let's get your names on the screen. So we're just waiting for David to join us as well and then we'll get started with the next cluster. How was watching that? Yeah. It was They did really really well. Did they They did. I gotta say and there were some really really fun see as soon as I seen
35:11 Introducing Team Civo Cloud
35:43 the certs I was like, ah, that's because those are so painful to fix. So that was cruel that you went for that but some nice ones, the YAML I got I love the double a on the YAML as well. I didn't even notice it as well. Like I just thought it looked fine to me so very sneaky. Okay. Hopefully, you all got registered with the Teleport for Container Solutions. So I will pop open a session in just a moment, but first, if we could just get some introductions from y'all, please feel free to share a
36:14 little bit about you and what you do and then we'll get started on our next cluster. Dinesh, we'll start with you and then we'll move around. Do you want to introduce yourself? Oh, sure. This is going well. Yeah. So Dinesh, director of innovation at Civo. I've been working with Kubernetes for years and years and years. And, yeah, it was nerve-racking watching that. Well done. Our brains if David's in our brains, our excuses are coming out now, but we've been on a P1 since midday today. So we actually got it fixed. I
36:37 Civo Cloud Team Introductions
36:55 was on one of the default calls ten minutes before this started. We're leaving it all to say I'm here. No. You're not allowed to practice before a cluster. Come on. That's too much. Especially in production scenarios. No. I'm glad you got that fixed, and I'm glad you could make it today. Thank you, Dinesh. Sam? Yeah. You need no introduction, but I'm gonna let you do one anyway. Yeah. Working as director of technical evangelism at Civo. And, yeah, I'm a CNCF ambassador. Do tons of stuff for the community and run my own YouTube channel as well. I'm
37:32 happy to be here. Let's see what Container Solutions have for us. Thank you. And, David? Yeah. I'm one of the site reliability engineers for Civo. The only American part of the engineering team over here. I don't have the pleasure of being in the same country as these guys, but we get a lot done. Just like to say, good job, Container Solutions. We tried to come up with some ridiculous stuff. We came up pretty well on that. Awesome. Thank you all for joining me. Let's get this party started. I don't know if I can call it a party, but I'm
38:08 gonna call it a party. Alright. I have here our Teleport instance from Container Solutions. I am clicking connect on the control plane node. Please use the activity menu on the left. Find the active session and join and if you can type echo hello, make sure you're all in there and then we'll get this underway. Colored cursors, Teleport. That's what we need. Please make it happen. Alright. I think we got everybody there. Awesome. Okay. I think as customary on Clustered, you'll wanna set up your kubeconfig and check if you have a control plane. Let you
38:10 Civo Cloud Debugging Starts
38:45 just take it away from here. Good luck. Have to decide who's gonna type first. Who have we got typing on there? I guess me. Gonna do an export of kubeconfig. Yeah. You wanna do that? I'm driving. I type slow. Hey, Mark. Welcome to the party. Oh, it's content. Someone from the Container Solutions team has said there are some hints in /root if required. You will need those hints. Know it. Things seem like they're running, which is interesting. Job done. Well done. You fixed it. I doubt that highly. Well, there's a deployment that's not running.
40:26 Looks alright to me. Hey, Jeremy. Welcome to Clustered. Comment from Kubernetes community days UK that someone forgot to break the cluster. Clusters can be deceiving. Even when they're working, they're probably fucked anyway. Air mounting pod filters of sandbox. Uh-oh. On the control plane and the scheduler, What time was that? A while ago. It did say that the scheduler was running, didn't it? Mhmm. Four hours ago. That's worrying. Copy. Paste. Failed to list nodes. Connection refused. Check iptables real quick. Kai is asking is it just a flesh wound? Oh, it's looking a bit different now.
40:53 Debugging Scheduler Port Mismatch
42:01 We'll see. Alright. Unless what's the default policy, it's fine. Do you think anyone's ever enjoyed working with IP tables? No. Hang on. 6444. That's not 6443, is it? And that one's been changed. Interesting. Is it in the config scheduler config? Or is this just been properly messed with? My head is spinning with ideas here. Yep. There you go. There it is. Yep. There it is. See, they're getting back at us for YAML. Should get restarted. Yeah. They could be able to detect the change. It may just take thirty seconds to a minute for that to
43:06 Fixing Scheduler Configuration
43:41 respond to it. Oh, wait. No. It won't. Because we've changed the scheduler config, not the manifest itself. Uh-huh. So let's just touch that, and that should sort that out. Should sort that out. I hope that wasn't someone's pager. No. Spam runs. Do we delete it? Well, you could run a ps aux and see how long the process has been running. Or afx. Nice touch. You can always touch the ps flags. Keep sketchy that. One forty six. Oh, that just done it. No. Take away and kick the pod. Delete. Nothing ever bad happened from deleting a pod.
44:21 Restarting Scheduler Pod
45:05 I'm just saying. Yeah. That's still not done it. Let's delete it. Khalid is asking who's winning. The audience is winning as we're all absorbing knowledge today. Here we go. Yeah. That looks better. I mean, does look better till you did that. Yeah. Can we reschedule that? That does look like it should have been changed. Oh, I know. Copy paste. I'm assuming that So why would that not be picking up the right config file? Do wanna just restart the kubelet maybe? Oh, shit. These all been spelled right. Is that you checking for Unicode? That looks right.
46:47 Looks okay. Yeah. What's happening right now is we have a scheduler running. You're looking at the manifest. It's using scheduler.conf, which you've just modified, and it still appears to be broken. Yeah. It's using proper image. So it's not that that's been messed with or has it? Is your scheduler.conf definitely still 6443? Yep. Yeah. The logs from the pod are 6444. Interesting. Is there a service config file in /etc/kubernetes for the scheduler itself? Sorry. I gotta what did we say? You wanna drive? At yeah. Looking at /etc/kubernetes, so we can look here.
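Editor's note: the mismatch being chased here is between the API server port the team expects (6443) and the port that shows up in the scheduler's logs (6444). A minimal Python sketch of pulling the port out of a kubeconfig-style file follows. It assumes the file contains a single `server:` line and does not use a YAML parser. The function name `server_port` and the sample address are hypothetical and are not the cluster's real values.

```python
from typing import Optional
from urllib.parse import urlsplit


def server_port(kubeconfig_text: str) -> Optional[int]:
    """Return the port from the first `server:` URL in kubeconfig-style text.

    Returns None if no server line is found or the URL has no explicit port.
    """
    for line in kubeconfig_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("server:"):
            # Split only on the first colon so the URL itself stays intact.
            url = stripped.split(":", 1)[1].strip().strip("\"'")
            return urlsplit(url).port
    return None


# Hypothetical file content, for illustration only.
SAMPLE = """\
clusters:
- cluster:
    server: https://10.0.0.10:6444
  name: kubernetes
"""

if __name__ == "__main__":
    port = server_port(SAMPLE)
    if port != 6443:
        print(f"server port is {port}, expected 6443")
```

Reading the file the running process actually mounts matters as much as the value itself, which is the point the team circles around in the following minutes.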
47:59 There might be another. So we have that one and kubelet one. That was my idea. So just check them once again. What's in there? Probably looks like install time. You definitely restarted the scheduler. Right? That restarted. It's definitely a new one as of three minutes ago. Should we restart the kubelet? Maybe we should burn down the node. Oh, it's getting hard. I'll get them. Copy paste. That's another thing for Teleport. Copy paste on highlight. I was gonna ask if you just want a hint, but I don't even know if that's wrong. But I've
49:41 got an idea. What's the idea? Right now you're working on the assumption that the static manifests live in that directory. That is true. So this is a kubeadm cluster, isn't it? It is a Kubernetes cluster. /etc/kubernetes/manifests. Damn. Which is where we were. So the other thing I'm thinking is that it uses host path volume mount. So unless there is the scheduler.yaml somewhere else on the file system. It's a tricky one. Yeah. Is. This is Ricardo has dropped a hint in the chat for us, but he says a small edit
51:34 in scheduler.yaml may have been better than a touch. Okay. I'm not sure if C is Charlotte or not, or maybe Adrian with a third account, who knows? But they're saying don't overthink it. I think that was aimed at me, so I'm just gonna shut up. Oh, yeah. Thanks, Pop. The key is do the opposite of what I say. Alright. That was it. Thank you very much. We'd done it. It just hadn't decided to do its thing. Alright. So we've still not got that up and running. Nodes are still ready. Welcome, Sam. Thank you for joining the academy.
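The mismatch discussed above (scheduler.conf expected to point at 6443 while the pod logs mention 6444) is the kind of thing that can be checked mechanically. Here is a minimal Python sketch, assuming a kubeconfig-style text with `server:` lines. The sample text, the address in it, and the function name `server_ports` are illustrative and not taken from the stream.

```python
import re

def server_ports(kubeconfig_text: str) -> list[int]:
    """Return the port of every `server: https://host:port` line."""
    pattern = re.compile(r"^\s*server:\s*https?://[^\s:/]+:(\d+)", re.MULTILINE)
    return [int(port) for port in pattern.findall(kubeconfig_text)]

# Hypothetical kubeconfig fragment; the address is a placeholder.
SAMPLE = """\
clusters:
- cluster:
    server: https://10.0.0.1:6443
  name: kubernetes
"""

EXPECTED_PORT = 6443  # the API server port the team expected to see
ports = server_ports(SAMPLE)
mismatched = [p for p in ports if p != EXPECTED_PORT]
print("ports:", ports, "mismatched:", mismatched)
```

Running the same check against the file the running pod actually mounts, rather than the file just edited, would help show whether the pod has picked up the change.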
52:51 Debugging Resource Quota Limitation
53:09 So now you're looking at the replica set and trying to work out why we have no pods, right? Yeah, why it's not being scheduled at all. Exceeding quota. Can you spell that? So those are set on the namespace, aren't they? The limits are... Are they... What was more challenging, this or your P1? Well, we still are not entirely sure why the P1 happened. I mean, we're back up and running, which was the key thing, but the root cause is still unknown. Use limits. So where do the limits get set? Is it get resources,
53:19 Inspecting and Editing Resource Quota
54:17 was it? Or how do you get all of the types in Kubernetes? It's kubectl get-resources. Sorry, api-resources. Forgive me. No get. Just api-resources. Thank you. Network policies. What ranges? Oh, gone. Yeah. You can do get limit ranges or quotas. Yeah. You've basically got it. I see we can just edit that resource quota, right? Edit, delete, or get; pick your poison. I think edit it. Give it load. 50m. Does anyone really understand the m's on CPU cores? No. No. It's really CPU. Is that self-explanatory?
55:28 So 500m would be 500 millicores, right? So half of one CPU. Half of one. Yeah. What's wrong with 0.5? Come on. YAML. YAML. Damn. Yes, Adrian. It is cube control and I will never hear the end of it now. Thank you. That's the status. You don't need to edit that. So I think once the spec is done, you can just save it. And that was edited while you were editing it. So unless you're... maybe, okay. Get pods. Has that been scheduled now? No. Just wanna describe that new replica set, 15fdb.
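The "m" arithmetic from this exchange can be written down in a few lines. This is a minimal Python sketch of the conversion the team talks through (500m is half a core, and 0.5 means the same thing), plus a rough check of whether a set of CPU requests fits under a quota. The function names and sample values are hypothetical, and it only handles plain numbers and the `m` suffix.

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '500m' or '0.5' to a number of cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # 'm' means millicores: 1000m equals one full core.
        return int(quantity[:-1]) / 1000
    return float(quantity)

def fits_quota(requests: list[str], quota: str) -> bool:
    """Check whether the summed CPU requests stay within the quota."""
    # Compare in whole millicores to avoid floating-point drift.
    total_millis = sum(round(cpu_to_cores(r) * 1000) for r in requests)
    return total_millis <= round(cpu_to_cores(quota) * 1000)

print(cpu_to_cores("500m"))                  # half of one CPU
print(fits_quota(["250m", "250m"], "500m"))  # two pods under a 500m quota
print(fits_quota(["500m"], "50m"))           # one pod against a 50m quota
```

A quota set far below what a single pod requests would explain a replica set that reports exceeding quota and creates no pods.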
56:54 It's still there. Should we edit the pod and delete the limits and the requests? Thanks, group daily. I mean, I would just delete that resource quota. I don't think it's doing you any favors. That's my advice. Yeah. It's all about changing your mindset from being destructive with things to being like, we just need to massage it into working, because you don't wanna cascade into more and more errors. But as someone who has personally cascaded himself into many subsequent errors on Clustered, I will still defend deleting things. Yeah. The chat's agreed. Delete the quota.
57:08 Deleting Resource Quota
57:49 There we go. Has the quota actually gone? This is still nine minutes ago. Maybe. Yeah. Has the quota definitely gone? I love it when you delete a resource that comes back. I think you're good. Delete the replica set and that should force it to be rescheduled. Just delete rs --all. You're twenty minutes in, twenty to go. Twenty five. I can't count. There we go. Hey. The pineapple juice, Willit. Wish I had some. Pop, I will not stop talking, although I do agree that Dinesh has a very soothing voice.
58:42 Application Pods Running (v1)
59:08 Put that on the community chat as well. I feel bad about talking all the time. Can we check that, David? What would you like to check? Sorry, Sayam. This looks like we have changed to v2, right? Yeah. You can curl localhost on 30000 if you want, or I can pop it open in a new tab. Yeah. I want to see the fancy... You wanna see the dance. That's it, right? Yep. Yep. Localhost. That's not gonna work. I can't do localhost. But... Oh, look at that. They have swapped out the something.
59:58 Debugging Incorrect Image Pulled (Containerd)
59:58 That is not my custard. So I know in Docker world... yeah, I can do that with crictl. Images. Sneaky, Mary. Mhmm. So I'll give you a hint, just because it's not always easy. We're gonna need to set the runtime endpoint on crictl. So --runtime-endpoint equals... or do you want me to type it? Yes, please. So it's /var/run/containerd/containerd.sock. What did I get wrong? Here's me sounding all confident as well. Runtime endpoint. Maybe... Do you need a unix in front of that
1:00:58 to say it's a socket? Or... I'm just gonna try. There we go. It just had to come before the command. Adrian seems surprised that the image showed. Well played. We might need to check on the MacBook now. Your noise gate has pulled you out. No noise gate, MacBook today. The we wanna check that on the work that happened. I got none of that. Did anyone else? I think David suggested checking crictl on the worker node where it's scheduled. Yeah. So if you tell me where it's scheduled, I'll open a new tab. You'll need to set up cluster too. Yeah.
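Two details come out of this exchange: the endpoint wants a `unix://` prefix, and the flag has to come before the subcommand. Here is a minimal Python sketch that builds the command line that way. The socket path is my reading of what is said on the stream, the helper name `crictl_command` is hypothetical, and nothing is executed; the sketch only assembles and prints the arguments.

```python
import shlex

# Socket path as read out on the stream, with the unix:// scheme in front.
CONTAINERD_SOCKET = "unix:///var/run/containerd/containerd.sock"

def crictl_command(subcommand: list[str], endpoint: str = CONTAINERD_SOCKET) -> list[str]:
    """Build a crictl argv with the global flag placed before the subcommand."""
    return ["crictl", "--runtime-endpoint", endpoint, *subcommand]

list_images = crictl_command(["images"])
print(shlex.join(list_images))
```

Keeping the arguments as a list means the same helper could later be handed to `subprocess.run` on a node where crictl is installed.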
1:02:06 There's cluster of all the names. That is on S9R6L. Alright. There you go. You have a session opened on that machine. Can I open that as a tab, or does it need to be a... like, kinda close this and... No, no. You can just open a new tab. If you just go back to your first tab, you should be able to join the active session. It'll pop open in a new tab for you. I hope. I think I'm not opening up a new one. Let me close these and see if I can put it.
1:02:55 I'll leave that session. Alright. I have you all on this node. Is that me? That is you. Yep. Yep. Sam's there. Must be getting late for you, Sam. Yep. It's eleven. Thank you for staying with us. I appreciate it. Alright. What are we doing on this worker node then? I have that in my paste buffer. Do we just wanna delete it? Yeah. And make sure it gets pulled again? If you're driving, David, do you wanna... yeah. Look. There's a 457 meg one there. That looks... Yeah. I love the image ID across v1, v2, and latest there.
1:03:12 Checking Images on Worker Node
1:04:22 Is there a delete on this one or is it just list? I am not entirely sure if you can delete an image with crictl. Hey. Come on, chat. How'd you delete an image on containerd? Delete anything? Request. rmi. That would be a good chat. Has someone got the YouTube page open on the side? One or more image ID. Noel has said you can use ctr to remove images. So check out the ctr command. Or Charlotte is suggesting crictl rmi will work too. So... Yep. I think it worked, right? Someone got
1:05:32 Deleting Incorrect Image
1:05:36 the YouTube page open on the side. I think someone does have the YouTube side. Someone's YouTube is, actually. I hope it's not me. Any others we want to do? All of the Clustered ones. Delete them all. There's a matching tip from Seth in the chat: you could always change the image pull policy to Always. It's a good shout. Yep. Delete it. Delete it. Delete it. Do an ls again? Because I wonder if it will keep it around if it's actually running a pod, which it doesn't look like it has. Right. So I guess if we go back
1:06:22 to the master, the controller thing. Indeed. That tab is front and center again there. And if I drive... which node did we... it was S9 that was known good, wasn't it? It was S9. Correct. Let's make sure it doesn't go over there. Copy. Paste. Delete the pod to cause it to be recreated. I guess you want me to check the image? Oh, that's still not working. I saw you were making a bit of a whale of an assumption there. Did you cordon the right worker node? Because I think it was on S9.
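Seth's tip about the image pull policy connects to why a cached image on a node keeps being reused. When a container does not set `imagePullPolicy`, Kubernetes picks a default from the image reference: `Always` for `:latest` or a missing tag, `IfNotPresent` for any other tag or a digest. Here is a minimal Python sketch of that defaulting rule, with a hypothetical function name and placeholder image names.

```python
def default_pull_policy(image: str) -> str:
    """Return the pull policy Kubernetes defaults to when none is set."""
    if "@" in image:
        # Pinned by digest, for example name@sha256:...
        return "IfNotPresent"
    # Only look for a tag in the last path segment, so a registry port
    # such as localhost:5000 is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
    if tag is None or tag == "latest":
        return "Always"
    return "IfNotPresent"

# Placeholder image names, not the ones used on the stream.
for image in ("registry.example.invalid/app:v2",
              "registry.example.invalid/app:latest",
              "localhost:5000/app"):
    print(image, "->", default_pull_policy(image))
```

Under this rule a `v2` tag defaults to `IfNotPresent`, so whatever a node already holds under that tag is used without a fresh pull unless the policy is set to `Always` or the image is removed first.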
1:07:30 Yep. So I cordoned the other one. So you're making the assumption that the break here was Adrian had pre-pulled a broken image, right? It was. So the other thing it could be, if we go back to the node, is check for dodgy hosts entries. Yeah. There is a new image actually on the worker again, which is a different one, I think. Nothing in the hosts entries. So, yeah, the assumption was either that it was pre-pulled or built on the node. I think I know what it is, but I think it gives
1:08:18 you a better time. I'd just do an images ls again. Check for images to see what's been pulled. I think someone's been abducted by aliens. Images. Images. Yeah. That 7b7cb looks remarkably similar, from my memory, to hashes that I've seen briefly passing by on our screen. That looks like it's... So maybe we just... yeah. I would agree with that. Yep. So all we've done is change v1 to v2. It's pulled the right image in theory. There's nothing pointing it to a different container unless there is a crictl
1:09:34 sort of proxy that you can do. Sort of like the Docker proxy. Is that a thing with crictl? So I guess, I mean, I'm assuming I know what it is and I don't wanna give it away, but there is a containerd command that will dump its config. Okay. Should we get that dumped on the worker node? Inspect. Oh, no. Inspect the invite. No. Config. Yeah. So it's not on crictl. It's on the actual containerd daemon binary. So if you just type containerd help on its own.
1:09:54 Debugging Containerd Registry Redirect
1:10:40 Or you can look in the config.toml; that will reveal some of it. Yeah. There is a honk.trowel.io that looks remarkably suspicious. Good catch. So containerd has this feature, which I don't think I have. It allows you to redirect container registries to other registries. Guess, Simon, you're driving. Do you wanna remove that? You could also have done containerd config dump, I think, as the command. I have to say, I've still not played around with CRI-O or containerd, and with the releases of Kubernetes where they're deprecating and getting closer and closer, we really need to start to sign up.
1:11:36 You can you can cut call the first three level render two end Should we do the containerd dump one or do we delete from here? No. Just delete that line and the line above it. And that should be fine. You probably have to delete the image again. Nice break, Adrian. I like that. It was rmi, wasn't it? Yeah. You can just run it from your history because it's the same ID. Alright. And then if you get back to the master... oh, sorry, the control plane, and delete the pod again.
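The redirect found here lives in containerd's config file as a registry mirror entry, which is why removing "that line and the line above it" is enough. Here is a minimal Python sketch that scans config text for such entries. It assumes the older CRI plugin layout, where a `registry.mirrors."<registry>"` section header is followed by an `endpoint = [...]` line; the layout on the stream's nodes is not shown in the transcript, and the sample config and mirror URL are placeholders.

```python
import re

# Section header for one mirrored registry in the older CRI plugin layout.
SECTION = re.compile(
    r'^\s*\[plugins\."io\.containerd\.grpc\.v1\.cri"\.registry\.mirrors\."([^"]+)"\]'
)
ENDPOINT = re.compile(r'^\s*endpoint\s*=\s*\[(.*)\]')

def find_mirrors(config_text: str) -> dict[str, list[str]]:
    """Map each mirrored registry to the endpoints it is redirected to."""
    mirrors: dict[str, list[str]] = {}
    current = None
    for line in config_text.splitlines():
        section = SECTION.match(line)
        if section:
            current = section.group(1)
            continue
        if line.strip().startswith("["):
            current = None  # some other section begins here
            continue
        endpoint = ENDPOINT.match(line)
        if endpoint and current is not None:
            mirrors[current] = re.findall(r'"([^"]+)"', endpoint.group(1))
    return mirrors

# Hypothetical config fragment; the mirror URL is a placeholder.
SAMPLE = '''\
[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://mirror.example.invalid"]
'''

print(find_mirrors(SAMPLE))
```

Any registry that maps to an unexpected endpoint is worth a second look, and as the stream goes on to show, containerd has to be restarted before an edit to the file takes effect.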
1:11:39 Fixing Containerd Config
1:12:41 There you go. I never wait for that to delete. Yeah. It's gone. It should have gone. What it's doing is it's waiting for the next pod to start up and become running. So the theory being that once it's deleted and it starts up, you're like, everything's fine and you can carry on, but everyone's too impatient. So I'm still getting... Does containerd need to be restarted or anything? Is that a thing? Yeah. You probably will have to restart containerd. I got that. It would be particularly evil if they had found another way to have that image delivered.
1:13:29 Restarting Containerd
1:13:47 Do you wanna delete the image again, David, while you're on that? Yeah. That will have been pulled again. It will have been. It looks like... yeah. Delete it. You should be good. Adrian says it's on both workers, but they have cordoned the other worker, so we should be alright. Taking a little bit longer to come down. Or is it not even coming down? No. It will be coming down. It's just a 500 meg image, I'm afraid. There we go. Let's see. Failed to connect to the database. Oh, you got more. Damn you, Container Solutions. You got the angry man. So you have
1:14:30 Database Connection Error
1:14:41 one more problem that looks like the fix. What pod is that? Postgres? Oh, sorry. What node is the Postgres pod running on? Sorry. Whoever was driving can drive now. No one's driving. Everyone's gonna be hands off. So at least you've got a Postgres pod, right? That's a good start. So that's on our other worker as well. Suggest we delete it, and it'll get rescheduled back on the right one. And then if there's anything messed around with iptables, but... Yeah. You didn't do MTD around this, did you? That'd be terrible. Unless Rook's been messed around with as well, but
1:15:20 Debugging Postgres Pod Placement
1:15:37 it looks like Postgres is running. Hey. I took Rook out of the equation when I built this application that uses a container and no PVC, so you should be good. Alright. I deleted that. Just get wide again. I'm just making sure that this is... Nice one, Seth: Container Solutions, more like container problems. Yeah. I was trying to come up with a funny pun, but none of them were nice. Do we have any endpoints on our Postgres service? It's running now on the same node. So let's check the service. It's the next thing.
1:15:50 Rescheduling Postgres Pod
1:16:29 Yeah. I would describe the service. Let's check for endpoints. Even if I could spell, we'd be doing good here. Alright. There is an endpoint on it. Does that match the pod? IP matches. Yeah. How about the ports? Port's fine, 5432. Is it fine on the port, though? Alright. He's got a few more minutes left to think of this last problem. If it's the last one. Yeah. Ports are right. Ports are right. I have not... so I've got connection timed out. Is CoreDNS running, and hasn't been messed with? Well, can you check the config map for CoreDNS?
1:17:55 Connection Timed Out / Network Policy Check
1:18:17 It's not DNS. Container Solutions have confirmed that it's not DNS. Oh, so they're not able to connect. Can we get a shell on the Clustered application? Suggested: if you look in the hints in the root directory, hint three for this problem. I think it's /root. And then Charlotte's in there with a hint even in the chat: what prevents traffic? That's a pretty big hint, I think. Well, it's gotta be potentially iptables or policy. Yeah. Network policy. Yeah. I think you're right on that. I think a network policy. Don't look. Don't look. Check network policy.
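The symptom here (endpoints and ports all correct, yet the connection times out) fits how a NetworkPolicy behaves: once a policy selects a pod for ingress and lists no ingress rules, traffic to that pod is dropped. Here is a minimal Python sketch of that check for a single policy. The policy contents and labels are hypothetical, since the transcript does not show the actual object, and the sketch ignores other policies that might allow the traffic.

```python
def selects(pod_labels: dict[str, str], match_labels: dict[str, str]) -> bool:
    """An empty selector matches every pod in the namespace."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())

def ingress_denied(policy: dict, pod_labels: dict[str, str]) -> bool:
    """True if this one policy selects the pod and allows no ingress at all."""
    spec = policy["spec"]
    match_labels = spec.get("podSelector", {}).get("matchLabels", {})
    policy_types = spec.get("policyTypes", ["Ingress"])
    return (
        selects(pod_labels, match_labels)
        and "Ingress" in policy_types
        and not spec.get("ingress")  # a missing or empty rule list allows nothing
    )

# Hypothetical policy shaped like a "deny all ingress" rule for a database pod.
deny_all = {
    "kind": "NetworkPolicy",
    "spec": {
        "podSelector": {"matchLabels": {"app": "postgres"}},
        "policyTypes": ["Ingress"],
    },
}

print(ingress_denied(deny_all, {"app": "postgres"}))  # database pod
print(ingress_denied(deny_all, {"app": "web"}))       # unrelated pod
```

Deleting the policy, as the team goes on to do, removes the selection entirely, so the pod goes back to accepting all traffic.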
1:18:55 Finding and Deleting Network Policy
1:19:12 We've had a hint over the chat, so I'm sure it'll be the same one. I hope it's the same one. There you go. Just delete it. You sure you don't want to edit it? No. That one seems mean, and I've probably learned my lesson. Alright. So that is a network policy. I should be able to just refresh this pretty quickly. We have the dance. Awesome. Great job. Lots of great problems across both clusters there, and great fixing from both teams as well. And we're on time, both clusters fixed. What more can we ask for? That was
1:19:51 Civo Cloud Debrief and Conclusion
1:20:01 awesome. All right, great job, Civo. Thank you for joining me. It was a pleasure. I hope you had fun with that too. Absolutely. All right. Awesome. Great. All right. I'll say thank you to Container Solutions. Thank you to Civo. Thank you to Teleport for sponsoring today's episode, and thank you to everyone that tuned in and watched. We are looking for more teams to join us on Clustered. So if your team works with Kubernetes and you wanna come on and have some fun with us, reach out on Twitter or the GitHub repository or email me at david@rawkode.com.
1:20:14 Final Wrap-up
1:20:35 To both the teams, thank you again. I'll leave it there. Thank you. That was absolutely awesome. I couldn't be happier with the first episode of Clustered Teams. I hope you enjoyed it too, and I'll see you all soon. Thanks. See you all. Bye.