About this video
What You'll Learn
- Debug broken Kubernetes control-plane health by inspecting kubelet status, static manifests, and API server errors.
- Trace worker-node scheduling failures through node status checks, webhook diagnostics, and CNI init-container behavior.
- Repair control-plane defects by fixing kubelet binaries, API-server manifest settings, and Cilium image pull policies.
Lee Briggs joins to debug two broken Kubernetes clusters live: missing kubelet binaries, a renamed API server manifest, rogue validation and mutating webhooks, a stopped containerd, and a Cilium CNI image pull policy bug.
Jump to a chapter
- 0:00 Viewers Comments
- 0:50 Introductions
- 0:53 Intro and Guest
- 2:51 Initial Assessment (Cluster 1)
- 3:00 Cluster 007 by Dan Pop and Matt Moore
- 3:43 Debugging Cluster 1 API Access
- 7:18 Debugging Cluster 1 API Server Manifest
- 11:05 Fixing Cluster 1 API Server Manifest Filename/Port
- 13:10 Debugging Cluster 1 Kubelet Binary and Permissions
- 15:45 Fixing Cluster 1 Kubelet Binary and Permissions
- 19:06 Debugging Cluster 1 API Server Connectivity (429 Error)
- 33:40 Debugging Cluster 1 Admission Controller (Validation Webhook)
- 36:40 Fixing Cluster 1 Validation Webhook
- 38:23 Checking Cluster 1 Node Status
- 39:25 Debugging Cluster 1 Worker Node & Pod Scheduling
- 46:40 Debugging Cluster 1 Allocatable CPU (Red Herring)
- 55:11 Debugging Cluster 1 Mutating Webhook
- 57:12 Verifying Cluster 1 Fixes
- 58:16 Initial Assessment (Cluster 2)
- 59:00 Cluster 008 by Akos Veres
- 59:37 Debugging Cluster 2 API Access and Components
- 1:00:01 Debugging Cluster 2 Kubelet RPC Errors
- 1:01:28 Checking Cluster 2 API Server Health (429)
- 1:10:27 Debugging Cluster 2 etcd Connectivity (Red Herring)
- 1:17:15 Debugging Cluster 2 Containerd (Exited)
- 1:18:29 Fixing Cluster 2 Containerd
- 1:19:22 Debugging Cluster 2 Pod Scheduling
- 1:22:19 Debugging Cluster 2 CNI Init Container Error
- 1:30:57 Identifying Cluster 2 CNI Image Pull Policy Issue
- 1:31:16 Fixing Cluster 2 CNI Image Pull Policy
- 1:31:47 Verifying Cluster 2 Fixes
- 1:33:13 Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:53 Intro and Guest
0:53 Hello, and welcome to today's episode of Rawkode Live. This is the Klustered series. Today, we are gonna be taking two broken Kubernetes clusters and live debugging and hopefully fixing them. I am not smart enough to do this alone, so I am joined by a wonderful guest. Today, I'm joined by Lee Briggs, a developer and developer advocate at Pulumi Corp, and the very first breaker. Hey, Lee. Hey. Thanks for having me. It's really exciting to be here. I've watched all the other episodes. I think I did the first cluster. You did. You broke the very first cluster.
1:30 And I was kind, and I just edited a config map. And it looks like since then, things have escalated fairly significantly. So, yeah, prior to being a developer and developer advocate, I was building out Kubernetes platforms from scratch, and built them before EKS and all those kind of things were things. But it's been a while, and I'm excited to see if I still have all of that knowledge and memory there or if I'm gonna be any use to you at all. Well, you will. I think, you know,
2:03 based on my experience of having three of these under me, you just forget how to think in a weird way. Right? And having a guest here and being able to talk about what's actually happening will slow us down. Hopefully, not get us down too many rabbit holes, but, you know, it's tough. And as you can see from the comments, everybody is so supportive. They're all on our side. I know. It feels like collective schadenfreude at this point. Everyone's just really enjoying our sense of doom and dread as things go along.
2:34 And I've been on all three sides of it now. So breaker, observer, and now fixer. So this is gonna be exciting. And I also will say it's the first at at least we're united Great Britain from at this point as well. So Yeah. Well, I think we just need to rip the Band-Aid off here. We need to get into the first cluster. Let's see what we're dealing with and then take it from there. So this is cluster zero zero seven. This was broken by Matt Moore and Dan Pop. We should just get this on with
3:00 Cluster 007 by Dan Pop and Matt Moore
3:11 not looking forward to this one. Gotta be on it. So as always, we are well, as always, as of, like, two whole episodes ago, we are now using Teleport to aid us debugging these clusters. Really cool tool. You should check it out. It gives us access to each of the nodes so we can try and fix things together. The superpower here is we have a shared terminal from David. See, I can't even type echo. That's not a good start. Lee, do you wanna just confirm with it? There we go. Nice. Okay. There we go. So we've not looked at anything yet. We
3:43 Debugging Cluster 1 API Access
3:45 don't know what state this cluster is in. We don't know if we have an API server. We don't know well, we don't know anything. So I usually like to start just by using the admin token and seeing if we can get the nodes, get the pods to get anything at this point, and then we'll work out from there. My personal preference if we go straight for nodes, like, kinda get an get an idea of if yeah. Okay. There we go. Oh, it's it's bigger, right, With no API. Yeah. Yeah. Excellent. It's getting all far too familiar with this.
4:16 Okay. I wish I could say it gets easier. I think that it gets worse. It definitely gets worse. Okay. What do you think? What do you wanna look at first? Six nine six nine looks like a weird port. So I think running docker ps would be a good start. Make sure the actual, oh, sorry, we use containerd. We do use containerd. Actually checking that containerd is running the control plane pods would probably be a good start, and see what port they are on, and then go from there is probably my first thought. Well, that's one
4:54 of those things I've never actually bothered to look up. There is a crictl, but it doesn't do anything without configuration. And I've never actually got it. Weird bug in Teleport where I have to reload my page randomly or it doesn't scroll. Yeah. I'm gonna have to file that as an issue. I should be a good open source citizen. So if we do crictl pods, it never works. It never points to the right socket. I know Walid has just commented saying run that command. If you remember what the flag is to point it to the
5:20 socket, drop it in the chat and I'll use it. However, I think you're right, Lee. That's not the right port. Yes. So let's just sorry. There you go. No. We basically need to get an idea of whether the control plane is running at this point, and old school would be docker ps and actually check if the actual containers are running. Because, obviously, the port could have changed in the admin.conf, but they might also not be running as well. So with containerd, I'm not entirely sure how you would run ps. I
6:02 Oh, I've just tried ps to look at every process on this host, and there's nothing that resembles an API server. Excellent. Okay. So is the kubelet running? We have a kube controller manager. That's it. Cool. systemctl status kubelet would be probably a good start. Sorry. Say that again? Oh, yeah. Just check, yeah, systemctl status kubelet and see if that is running. It is. But it looks like there's a bunch of logs there, and, obviously, it can't connect to That's weird. I never saw kubelet. Oh, I ran, okay, I ran Kubernetes. That
6:49 was my mistake. Okay. No. Why? Where's my oh, no. That is okay. Yeah. We've got a kubelet. We've got a kube controller manager. Okay. So the kubelet is running, which means it should be starting the static manifests, which run the API server, the scheduler, and the controller manager. I'm guessing they're in /etc/kubernetes at this point. Yep. They should be in this directory here. Okay. We do have an API server, but it is not running. So let's scan this. And I'm really bad for this. I keep flying through these and messing
7:18 Debugging Cluster 1 API Server Manifest
7:33 stuff and the audience. I just used it. There we go. Thanks, Dan. Alright. Yeah. I'm really bad for skimming through these too quickly. So I'm gonna slow myself down. We can see kube-apiserver, the advertise address looks like it might be alright. Lee, let me know if you see anything that looks fishy. It all looks good to me so far. I wonder if it's just the port in the kubeconfig. Yeah. We might be well, I think we might be getting there. We should probably fix that first, and it it
8:17 it should be 6334, I think. So it's 443, not 334. There we go. Make sure there's no more. Yeah. Okay. So kubeconfig. Probably gonna have to restart the kubelet, I think. Oh, no. Because we only modified the we didn't modify the Oh, yeah. Yeah. We just modified the config. So I did try and speak to this here. But, of course, we do know that by running ps, we didn't have an API server either. So we're gonna need to work out where our API server is. Now the manifest is here. It didn't get started. I guess we wanna start looking at would
9:06 you say the kubelet logs? Maybe where we wanna head next? Yeah. Yeah. The kubelet logs, so a journalctl on the kubelet logs. I think it's just gonna tell us the kubelet can't connect to the API server, more than likely. Okay. Kubelet. No pager. And if I can remember, is it since fifteen minute. That only works. Right? Alright. Guess not. I mean, I'd less. Just dump the whole thing. I see a CrashLoopBackOff. My screen is definitely locked up. Oh, okay. Yeah. We do have a CrashLoopBackOff. Something referencing Cilium.
9:58 Yeah. I may have to refresh this tab. Well, it's not happening. Yeah. It's crashed. Okay. Yeah. I'm not getting any scrollback in Teleport at this point. So Alright. Let's go to /var/log/containers. Let's see what's recent. Where's the API server? Oh, there it is. Okay. kube-apiserver. There we go. There we go. That looks pretty always and we're gonna have to regenerate all the certificates. No. If anyone's been that harsh, they're getting a swift kick the first time I see them. I'm hoping I just, handling keys is an absolute nightmare.
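The odd kubeconfig port spotted earlier in this chapter is the kind of thing that can be checked mechanically. The following is a minimal sketch, not what was run on stream: it assumes the conventional kube-apiserver secure port 6443, scans kubeconfig text with a regular expression rather than a YAML parser, and uses a made-up sample address.

```python
import re
from urllib.parse import urlparse

# Conventional secure port for kube-apiserver; adjust if your cluster differs.
DEFAULT_APISERVER_PORT = 6443


def unexpected_server_ports(kubeconfig_text, expected_port=DEFAULT_APISERVER_PORT):
    """Return (url, port) pairs for every `server:` entry not using expected_port."""
    findings = []
    for match in re.finditer(r"^\s*server:\s*(\S+)\s*$", kubeconfig_text, re.MULTILINE):
        url = match.group(1).strip("'\"")
        port = urlparse(url).port  # None when the URL carries no explicit port
        if port != expected_port:
            findings.append((url, port))
    return findings


# Hypothetical kubeconfig fragment with a tampered port (address is illustrative).
sample = """
clusters:
- cluster:
    server: https://10.0.0.1:6969
  name: kubernetes
"""
print(unexpected_server_ports(sample))
```

A real check would read the file the admin kubeconfig lives in and would probably use a proper YAML parser; the regular expression only keeps the sketch dependency-free.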
11:05 Fixing Cluster 1 API Server Manifest Filename/Port
11:08 I'm hoping it's just Oh, yeah. API server dot turd. How wonderfully juvenile. But we already had the port six nine six today. I know. Right? Yeah. There's a theme going on here. I think we're definitely gonna have to restart the kubelet at this point. Yeah. Let's do that. Okay. So our API server wasn't available. We have seen the error from the /var/log/containers directory. We've restarted the kubelet, fixed that file. Hopefully, we have an API server. We do not. And let's just see if we get any
12:00 additional logs from kube-apiserver. Not yet. So it's maybe just not restarted yet. I had this weird issue last week as well, to be honest. And I think the solution was to move the API server out and back in. Let's see where that gets us. When was that? That was a few minutes ago. I guess we can just wait. I know the kubelet logs are? Let me have a look at what the journalctl command is. So the journalctl command is for follow. So Matt's comment is saying, at least there's only
12:49 one control plane node this time. Yes. I learned that lesson. Nobody wants to see me replicate the same fix across different machines. I think we can do minus u kubelet. No such file or directory. Oh, the kubelet binary is gone. Is that right? It looks like it. A which on /usr/bin/kubelet? Now if I'm gonna learn anything from what we've seen so far, this is gonna have an interesting name. My ls -al /usr/bin and grep. Sorry. Nope. Yeah. May just If you deleted it altogether,
13:10 Debugging Cluster 1 Kubelet Binary and Permissions
13:57 we might have to download it. Why don't you try ls -lht, which should sort by time, and then we can just see what was changed recently. Yes. Do a head on that so we see the first. Yeah. Oh, no. That's not working at all. I think it's just been removed. I mean, there's every possibility here it's just been completely deleted. Oh, wrong directory. Which, by the way, is really mean because the kubelet was running, and obviously, the binary is being used by systemd, and so it's
14:48 been removed. And as soon as we restarted it, we're now in a worse position than when we started, which is just really mean. So I think what we're probably going to have to do how does your setup install the kubelet? You know when I something I'm gonna take a shot in the dark here. No. When I did the when I moved the manifest this last time, I didn't Oh, wait. Wait. Wait. Wait. Wait. Wait. Wait. Somebody in the chat just said it's there. They've renamed it to Pubelet. Yeah. Look. It's the third line up here.
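The "sort by time and see what changed recently" idea can be sketched in a few lines. This is an illustrative helper, not the command used on stream; the function name, the `field` parameter, and the demo file names are assumptions for the example.

```python
import os
import tempfile


def newest_entries(directory, limit=10, field="st_mtime"):
    """Return up to `limit` entry names in `directory`, newest first by a stat field."""
    stamped = []
    for entry in os.scandir(directory):
        info = entry.stat(follow_symlinks=False)
        stamped.append((getattr(info, field), entry.name))
    stamped.sort(reverse=True)  # largest timestamp (most recent) first
    return [name for _, name in stamped[:limit]]


# Self-contained demo in a temp directory (file names are made up for illustration).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, mtime in [("bash", 1_000), ("kubectl", 2_000), ("pubelet", 3_000)]:
        path = os.path.join(demo_dir, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))  # force a known modification time
    demo_result = newest_entries(demo_dir, limit=2)
    print(demo_result)
```

One caveat worth keeping in mind: a plain rename usually leaves a file's modification time alone, so sorting by `st_mtime` may not surface a renamed binary. On Linux, passing `field="st_ctime"` (inode change time) is more likely to, though that depends on the filesystem.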
15:28 So let's do Bash history is off limits, I'm afraid. Nice idea though, but we do have an agreement. So Okay. So now we should be able to start this. How does everybody see that? See, the blinkers have just been here. That's what it is. Yeah. I think yeah. So then let's see. systemctl status. Kubelet is running now. Is it? It's loaded. Is it running? No. I don't think it's running. What are the permissions on? Alright. Just give a wee nudge here. What happens if you run a ps and grep
15:45 Fixing Cluster 1 Kubelet Binary and Permissions
16:32 for kubelet, is it? Alright. You can just do a systemctl start kubelet. Maybe it's just not enabled or something, or is it failing? It's exited. What next? Oh, we're getting a stack trace from apimachinery. That's not a good start. Do you wanna let's run this through. No pager. And then put it through in there. We've got At the top there. Yeah. Okay. Okay. So we need to go to /var/lib/kubelet. I'll let you do the typing just now. What did it say? Look at the permissions
17:39 on that config. Well Oh, excellent. I think this is the cat and mouse cluster. Thanks, Matt and Dan. I'm wondering if they've set chattr on this or something. So I'm gonna try this with not the okay. At least they haven't set chattr on it. That's a good start. Let's give it a restart now and see if that's happy now. Well, I mean, don't try this at home, folks. I'm gonna do that. No. No. That's not a good idea. Let's do 644. There's no judgment on this show, mate. What's
18:23 that? 777. I just wanna get it started. I mean, this is a production incident. You said no judgment, and then you type service kubelet restart like an absolute I know. I'm old school, so Alright. Let's get It's not quite there. Yep. Four seconds ago, auto restart. Not being good. Alright. I've keep the config. No such file or directory. Oh, wait. That's probably old. Oh, is that a different error? Looking Oh, no. This is the March 9. Right. Yeah. This is a long time ago. Let's I wonder why my since fifteen minutes didn't
19:06 Debugging Cluster 1 API Server Connectivity (429 Error)
19:17 work. Anyone in the chat remember what the time parameter is on journalctl? Yeah. Let's see. There's so many logs that Teleport can't handle the scrollback. So I think I've just crashed my session. Alright. Let's just reload. That's just quite okay. Yeah. We need to be careful with that. Okay. So journalctl dash u dash dash tail. Yeah. I usually use follow, minus f, which will give you real time streaming at least. But I don't think we could scroll back up with that, can we? No.
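For reference, the journalctl options being hunted for here can be assembled as follows. This is a small sketch rather than the exact command typed on stream: the helper names are hypothetical, while `-u`, `--since`, `-n`, `-f`, and `--no-pager` are standard journalctl options and `--since` accepts relative values such as `-15min`.

```python
import subprocess


def journalctl_args(unit, since=None, lines=None, follow=False):
    """Build an argv list for journalctl limited to one systemd unit."""
    args = ["journalctl", "--no-pager", "-u", unit]
    if since is not None:
        args += ["--since", since]   # relative time, e.g. "-15min"
    if lines is not None:
        args += ["-n", str(lines)]   # only the most recent N entries
    if follow:
        args.append("-f")            # stream new entries as they arrive
    return args


def recent_unit_logs(unit, since="-15min"):
    """Run journalctl and return its output; only meaningful on a systemd host."""
    result = subprocess.run(journalctl_args(unit, since=since),
                            capture_output=True, text=True, check=False)
    return result.stdout


print(journalctl_args("kubelet", since="-15min", lines=100))
```

Bounding the output with `--since` or `-n` also sidesteps the scrollback problem mentioned above, since far fewer lines reach the terminal.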
20:02 Okay. I need to remember what the syntax is. I guess you could Is it? You could bust out the man page. Woah. Dash. Yeah. Oh, there we go. Thanks, Matt. So what's that? journalctl dash u kubelet dash n 2 3. Okay. Need no pager on that too, actually. Otherwise, I don't know if need to read that. Is it a political opinion if we start talking about systemd and journalctl at this point? Yeah. Go for it. It's all good. And this is all easier when you could just tail log files, in my opinion. I like
20:51 h Come on. No need to bring the Scottish roots into this. So the green Alright. We need to go modify the kubelet configuration. So /var/lib. Definitely a cat in this episode. Haggis. Lovely. What's our CEO's goal? I'd like to just take a moment to say to everybody on the chat who hasn't tried haggis: it is delicious, and you should definitely go out of your way to try it. Yeah. I guess it's amazing. C a dot p s there. I don't even know what one we need. C a dot s there, isn't it? Give that a second.
21:44 Oh, it's minus five minutes. Is that right? No. Just drop that. If that works, I'll be so happy. I knew it was cool. There we go. Perfect. Thank you, Noel. Alright. As discussed, I'm crying. No. I'm not crying. Thank you. Activate. Look at that. So let's check if our API server is running now. Oh. No. So at least we got the kubelet running. Oh, did we move the manifest back? Yes. Yes. We did. Alright. So I'm gonna change that. Oh, you can jump into /var/log/containers. Oh, yeah. Tail the API server. Well, log
22:44 containers. Tail etcd. Anyway, I can't remember anybody breaking the etcd setup yet, so maybe we're gonna get lucky today. Yeah. You wanna tail the API server so we can see why it's not starting? Oh, that's what I thought. Oh, no. I did the control plane. My bad. A kube dash API. Just tail that. Oh, sorry. Yeah. My bad. Alright. Livestream bunkers. It's tough. More PKI changes. Alright. It's just changed in the context, so we need to change it back to be apiserver-kubelet-client.crt instead of .pop. Yeah. Is it in the /var/lib/kubelet
23:37 config dot yaml. Oh, sorry. Yeah. /var/lib/kubelet config. Carlos is asking who broke this cluster. It is Matt and Dan Pop, former friends of mine. It's not in there. Okay. So that means it might be in wherever that be configured. Oh, the static manifest perhaps. Yeah. I thought what my yeah. That was that. Yep. Pop is there. Oh, goddamn it. I've not got my Vim config. I feel completely You want me to do it? Yes, please. I'm completely useless without my IDE at this point. I feel so there we go. Okay. Now because we've changed the manifest, let's just
24:36 give it that we've I mean, I usually find that just kicking the kubelet will be enough, I think. Well, it's gonna restart I mean, because it's a static manifest, it's gonna try to there we go. It's gonna try to restart it in a few seconds. Alright. I'll let you take it. Do you wanna see if we can get nodes on this thing? I can't remember the magic incantation. I know. kubectl dash dash kubeconfig. Oh, yeah. Sorry. Yeah. kubectl. kubectl. I've been get nodes. Post is kubeconfig. Oh, yes. Sorry. Yep. That's it. Oh, wait. What? The nodes aren't
25:15 joined though. But we should be able to get the pods, and it should okay. That's interesting. I was kinda hoping that was us. Hoping that was a fix. Right. Okay. So none of our nodes are connected. Although the API server wasn't running. Let's give that a minute because the other machines are gonna have to jump back in, get connected. And the API server not running means we actually probably wouldn't be able to run a bit pods anyway and open that are running. You wanna do a ps aux and grep for CoreDNS? Let's have a look at the API server logs
26:05 while we're in this directory. kube-apiserver. Matt has said, welcome to phase two. Minus n. Let's do a hundred. Storage error. Key not found. Oh, now they have screwed with etcd. That looks things. I don't maybe I need to refresh. I don't think I'm good. So I don't know. Unable to remove old endpoints from Kubernetes service. Key not found. Okay. Yeah. So this is etcd, I think. This is the I believe. Okay. So let's think this through. Let's get nodes. It's not returning even this local
27:23 node. Yep. We have a QUI run. So the API server, the process is running, but it's still not healthy. It's still not started up correctly, and it's probably not actually responding to all of the nodes that are trying to connect, is what I'm seeing at the moment. So the API server logs that we're seeing here, I don't know if you can see that in your Teleport session, but there is a log entry that is telling us when the
28:10 API server is trying to connect to, can't remember if it's a controller manager or etcd. But there's something wrong, basically. So Kubernetes manifests, kube-apiserver. Yeah. So we're connecting to the local etcd here, and I think the local etcd has been Wanna check it's running? I don't think we've looked for that in a ps command. I accidentally looked for it earlier. But yes. So it is running. It's not That's the API server. Okay. So our static manifest for etcd is parked as well. Oh, god. Yeah. Also we
28:59 also don't have kube-scheduler running, so we're gonna need to fix that too. Okay. So let's go back into the static manifest directory. Vim /etc/kubernetes/manifests/etcd. Okay. Let's see. This set your mean, my skim is that there's nothing modified. So it could just be the scheduler. Don't know if it's static manifest. Wonder if there's anything in the etcd log here. I'll state as okay, but That's right now. That's, like, two seconds ago. Maybe it's just got off the scroll. Am I not seeing it? Oh, it is running. Okay. I think our scrolls are maybe different.
30:11 There is an entity there. There's a recent Yeah. It's that. It's yeah. It is running. I don't think so. I have to refresh. Oh, no. It's not a scroll. You're alright. Okay. Let's do yeah. It's running. Is that five minutes? No. March time. Okay. etcd is okay. Sorry. etcd is running by the looks of things. Okay. Let's go work on the scheduler. The scheduler is not running. And then we'll come back to the API, the nodes not being registered. That way, I want the whole control plane online before I start debugging on. Yeah.
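Getting "the whole control plane online" starts with knowing which static manifests the kubelet can actually see. Here is a rough sketch of that audit. It assumes the file names kubeadm conventionally writes, which may not match other installers, and the `.bak` file in the demo is invented for illustration.

```python
import os
import tempfile

# File names kubeadm conventionally writes; an assumption for other installers.
EXPECTED_MANIFESTS = (
    "etcd.yaml",
    "kube-apiserver.yaml",
    "kube-controller-manager.yaml",
    "kube-scheduler.yaml",
)


def audit_manifest_dir(directory, expected=EXPECTED_MANIFESTS):
    """Return (missing, unexpected) file names for a static-manifest directory."""
    present = set(os.listdir(directory))
    missing = sorted(set(expected) - present)
    unexpected = sorted(present - set(expected))
    return missing, unexpected


# Demo with a temp directory standing in for /etc/kubernetes/manifests.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("etcd.yaml", "kube-apiserver.yaml",
                 "kube-controller-manager.yaml", "kube-scheduler.yaml.bak"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_audit = audit_manifest_dir(demo_dir)
    print(demo_audit)
```

The audit only compares names. A manifest that is present but points at the wrong certificate or port, as happened earlier in this cluster, still needs to be read by hand.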
31:01 Yeah. I agree. I agree. So We got a few comments. Scotland sucks. Well, he says, no distractions and admission webhooks. Yeah. I mean, definitely stuff we need to look at. Carlos. Yep. Think about life cycle to register node. And Matt has given us a little bit of confidence. He thinks that the API should be fine. So let's get the scheduler online, and then we'll take it from there. Hang on. Authorization kubeconfigs. That certainly seems better. Before we even try and get the scheduler back online, let's look at the controller manager,
31:58 which I think when we were doing our earlier ps runs, they were all fine anyway. So Yeah. I just gotta scan, sir. Come on. Kubelet's running. API server is running. Hang on. etcd, API server, kubelet. Still no sign of the other ones. Scheduler. Cannot open. Oh, wait. I'm logged on to you. Yeah. Oh, yeah. It's coming. Okay. That looks better. Not like you know history, Rule. I feel like that I would definitely be logging in and being like, what did I do earlier that I've forgotten? Well, I think they have forgotten. I mean, even Matt was like, oh, yeah. Get
33:08 nodes. Still no nodes. Run a ps. Can we see the scheduler? Just grep for scheduler, I guess. Alright. Okay. Cool. Yep. Okay. So now we have to work out what's going on. Let's look at the API server logs again. I don't know if that's actually a red herring or not. Maybe Matt or Dan, if you wanna drop a comment before we start going down the wrong path here. And now we're into Oh, there we go. Validation webhook. We have webhook. Yeah. So okay. So we're getting there at least. So how maybe I mean, it's been a
33:40 Debugging Cluster 1 Admission Controller (Validation Webhook)
33:50 year since I've done this. How would you specify a validation webhook statically with the We can check. So it depends if it's dynamic or not. So oops. Sorry. Do you mind if I type? Yeah. Go for it. Yeah. So it could be that they have added a webhook on the API server manifest. I guess the API server is now responding, so we should probably be able to do get validating webhooks. Right? Depends how the In theory? Yeah. It depends if it's a dynamic admission controller or if it is here. But you
34:29 can see we've got this NodeRestriction here. Enable admission plugins. So I'm gonna There we go. We'll let that restart the API server, and we should be able to have our nodes register. Doesn't feel great to sit here and cross your fingers, does it? Like I'm sorry. Oh, go ahead. You do the kubectl. I just wanted to check if we got a restart on this. Yeah. Yeah. No. Not yet. Did we make it worse? We made it worse. I wonder if I can't remove that flag
35:05 just wildly with that. So API. Hasn't auto completed properly, I don't think. Yeah. Dot log. There we are. Which one do I want? F f or c? Pass. The newest one. Quota admission. We don't have a node validation anymore. Did I not just remove it? Okay. There's the API server. Let's see what happens with should set this up beforehand so people don't need to keep watching me do it. I have an alias locally as well. So yeah. Alright. Can you get the admission controls? Yeah. Validating webhooks. And just do dash dash all name. Admission.
36:14 I can't remember the actual name. No. Me neither. This is when Google call Yeah. Resources. Yeah. Oh, what was that again? kubectl api-resources. Oh, yeah. It's not get, is it? It's just that. Yeah. And grep for validating. Okay. So now we wanna do a k get, And we just delete that. Before you do that, can we do all namespaces? I don't think these are namespaced, but yeah. Yeah. Yeah. Sorry. Yeah. They are global, aren't they? Yeah. Okay. Well, I wouldn't delete it. Let's have a
36:40 Fixing Cluster 1 Validation Webhook
36:55 look at it first, see what they're doing. Because if we have to get it back manually, I don't know if I'm gonna be able to do that. Yeah. I think this just needs to delete these. Well, I think a better way is to remove the resources. So, again, I'm thinking yeah. If you make that an empty array, I think that might be yeah. Nope. They didn't. Is that immutable? Oh, no. Required field. Just maybe foo as a resource as one array
37:45 because that's not a valid resource. Right? Webhook rules resources is required. It's there. But I think it's empty. So I think you might be right. We might have to just delete it. We could just do oh, hold on. Yeah. Cool. There we go. You wanna try to get nodes? No. Is no a valid answer? There we go. Nice. Okay. So those are two. Worker nodes, I'm hoping that maybe they've not changed anything on the workers. They should come back online now that the registration should be alright. Yeah. That won't work with the alias. But
38:23 Checking Cluster 1 Node Status
38:41 you can just do k get nodes dash w. I don't like the output. There we go. Okay. That's good. Alright. Let's see what the comments are saying. I think people are agreeing with it. Nice. I do think we can wait for the worker nodes to register, but I do think it might be the case they have messed with those as well. Maybe I'm being very cynical. So we might wanna switch to a worker node and see what the life cycle looks like on there. Yeah. Alright. Let's go. I'm connecting to the first worker on
39:25 Debugging Cluster 1 Worker Node & Pod Scheduling
39:30 the list, JBW7V. Yeah. I have to see the active session, so I'll join that. Okay. Oh, there won't be a kube admin on this. Only way I'd be able to do it locally. Okay. We can just check the basics then. So we expect to see A kubelet. A kubelet running here. K. That's a good sign. Let's just restart it and see if it helps speed up the joining process. And if I really quickly pop back. Journal. No. We I'll look. Do we come back? No. It's registered. It's still not ready. Okay. I think I guess, actually, probably
40:14 I think we should kick all the other ones real quick and make sure they all join. And then I would feel pretty okay once this worker node has joined. I have I received your call. We should also check our workloads are running too. K. Well, WordPress is still pending. Let me describe it. Describe your NGINX. We've changed the max pod limit. Yeah. I can't remember where that is. If you know where it is, then please feel free to take over. It's in the controller manager, I think. Where'd you set the max pod limit again?
41:30 I think the scheduler, sorry, not the controller manager, which seems Kubernetes manifests. Kubernetes scheduler. Nope. What's the port number on that? Zero? I think that's valid. I seem to remember that being valid. Yeah. I believe so. Let's see what else we got. Yeah. Port zero again. Maybe that's just a red herring. I'm gonna just look it up. Kubernetes. Oh, wait. We're looking at the wrong node. We're looking at the control plane. The kubelet on the actual node is probably where the max pods are set. Alright. Let's jump onto the first one again.
42:31 Because this obviously isn't gonna get scheduled on the control plane. So the kubelet config on here is probably Weird bug. Alright. Okay. Let's see. Yeah. Let's just Google it. It could be on the unit file if you wanna check there. On the systemd unit file? Not much help. It might be in this environment. I mean, it could be set in as an environment variable, I guess. Do you wanna quickly look it up, see where else it could be set while I just randomly type commands and hope it
43:41 gets fixed. Oh, I'm in the control plane node. Yeah. So did you look at the unit file? I did. Sorry. I didn't realize we were back on the master node. Sorry. The controller node. Alright. I'm drawing a blank. Well, I mean, they definitely have set a max number of nodes. It's just figuring out where it's actually been, like, where all the possible places are that it's been set. So the capacity is a 10 on our first worker node. So I did a describe on that node.
44:39 I'm on the control plane again. Sorry. Oh, we just have a NoSchedule uninitialized on this. I'm gonna remove that taint. I don't even know if that was them or if that's our CCM. Wait. Do we remember earlier on there was a log message about Cilium? We haven't actually looked at all of the kube-system pods to make sure they're all scheduled correctly. I'm gonna remove this taint anyway just to see what happens. Yeah. And then I would do k get pods minus yeah. Woah. Okay. Some stuff's running. I'm gonna add all of those nodes.
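Whether a NoSchedule taint like the one just mentioned keeps a pod off a node comes down to Kubernetes' toleration matching rules. The sketch below restates those rules over plain dictionaries. The taint key in the usage example is an assumption (the one usually applied while a node waits for an external cloud controller manager), since the transcript only says "uninitialized".

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty key is only valid with operator Exists and matches every taint key.
    if key == "":
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def blocking_taints(node_taints, pod_tolerations):
    """Return NoSchedule/NoExecute taints that none of the pod's tolerations match."""
    return [
        taint for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(t, taint) for t in pod_tolerations)
    ]


# Assumed taint key for illustration; check `kubectl describe node` for the real one.
demo_taints = [{"key": "node.cloudprovider.kubernetes.io/uninitialized",
                "value": "true", "effect": "NoSchedule"}]
print(blocking_taints(demo_taints, pod_tolerations=[]))
print(blocking_taints(demo_taints, pod_tolerations=[{"operator": "Exists"}]))
```

A pod with no tolerations is blocked by the taint, while a blanket `operator: Exists` toleration, as DaemonSet-managed agents often carry, is not. That difference is one reason some system pods can be running while ordinary workloads stay pending.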
45:37 I'm just gonna remove that. I'll just do it from now. Taint. I'm a bit confused why they aren't registering. Like, is the CNI initialized? Okay. So we have got a pod. See, almost? Well, I mean, maybe we could I think we just need to kill these pods and have oh, no. Because we already ran that NGINX container and that didn't get scheduled, did it? Okay. Matt, Dan, you're in the chat. Have we fixed everything that you've broken and this is an unrelated error? Or is this still
46:31 part of your phase two? We're gonna allocate a few more minutes to it before we move on to our second cluster. Greatly. There are some pending Cilium pods here. Somebody in the chat said allocatable we must have missed this. The allocatable CPU is only 10m, and I think we were just kind of scrolling past a lot of stuff because we Allocatable CPU. Where is that setting, Lee? On the worker? I think it might be on the worker. Yes. Okay. Within the flag. Alright. Let's use the control plane. Right? This is what this is good at. So get
46:40 Debugging Cluster 1 Allocatable CPU (Red Herring)
47:24 nodes, describe nodes, whatever workers. Right? And someone in the chat suggested that allocatable CPU is small. Yeah. Look. See, the CPU requests are a percentage. But I think that's just the number of cores. Right? I'm probably misreading. Invalid capacity 0 on image filesystem. Something wrong with our disk. I don't know if that's us gone down the wrong path again or not, though. Well, let's describe the node. Yeah. And then. Oh, yeah. Allocatable CPU 10m. Okay. So where do we change this? Is that just on the node.
48:28 Right? Yeah. Yeah. I think so. Okay. Delete. Wait. You're removing that from the status, so I don't think that's gonna matter anyway. I'm removing it from everywhere. Yeah. Okay. Let's describe our nodes and see if that is gone. No. It's still there. So this must be set and provided somewhere. Oh, all y'all. Sorry. That was my fault. I started typing. Oh, you're allowed to type. I'm the one that shouldn't be allowed to type. Okay. How do we fix this? I'm gonna Google it. You can if you wanna click the. Yeah. I'm just gonna look at the notes.
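To see why an allocatable CPU of 10m stops almost everything from scheduling, it helps to turn the quantities into millicores. The sketch below is a simplified model of that arithmetic, not the scheduler's actual code. The function names and the 100m pod request are hypothetical.

```python
from decimal import Decimal


def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "10m", "2" or "0.5" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def fits(allocatable: str, existing_requests, new_request: str) -> bool:
    """True if the new request still fits under the node's allocatable CPU."""
    used = sum(cpu_millis(request) for request in existing_requests)
    return used + cpu_millis(new_request) <= cpu_millis(allocatable)


# A node reporting 10m allocatable cannot take a pod asking for 100m (hypothetical request).
print(fits("10m", [], "100m"))
# The same pod fits easily on a node with 2 full cores and a couple of existing pods.
print(fits("2", ["100m", "250m"], "100m"))
```

Under these assumptions, 10m is one hundredth of a core, so any pod that requests more CPU than that stays pending on the node.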
49:34 Interesting. K. There are flags on the kubelet, kube-reserved, that we should probably go check out. So we're gonna wanna check the systemd unit file, the kubelet flags, and. So what are we looking for? Sorry. Well, I think we need to go on a worker node. Look at the kubelet configuration. There's nothing in this file. Right? Doesn't look like it. I'm almost stuck in Vim. Okay. Your flag. I didn't see anything there. That would be confusing. By the way, Matt and Dan are saying that they. I'm not ready to give up yet. So
50:28 I don't wanna know the answer yet. I feel like we're close. Let's check the defaults. Is there anything here? No. Alright. Status. Okay. Everything we need to know is in this unit file. Everything, I think. So I'm just gonna cat it. Nope. Not that one. Where's the other one? If you just do systemctl cat, it shows you, like, all of the drop-ins and then the rendered, like, the full actual. Yeah. So the environment file is empty. Sorry, the second /etc/default environment file is
51:13 empty. The environment file, kubeadm-flags.env, isn't there. Sorry, it didn't have anything in it. I wonder if. Look at the bottom. Right? We've got, like, are these environment variables? Could they be provided? No. They have to come from environment files, don't they? So. Oops. Alright. I can smell it. We must be staring right at it. That's the thing. I think we probably are. I'm in the kubeconfig. We checked that file. Oh, wait. There could be an /etc/kubernetes kubeconfig or an /etc/kubernetes bootstrap kubeconfig. Let's go check those.
52:14 Come back on it. So that one doesn't exist. I think we looked at that. Yeah. I can't modify that. That's just to connect to the API server. Yeah. Alright. I'm gonna try /var/lib/kubelet. Well, it says /etc/default/kubelet. I did look there. I didn't see a kubelet file. Yeah. It doesn't even exist, I don't think. Yeah. It's not even there. Look. No. It's just the 10-kubeadm.conf. I'm not sure where you've seen that.
53:00 Oh, that's the. See, look, this is right here. That's the drop-in file that, like, systemctl cat concatenated all together. I feel like we're so close. I'm not ready to give up yet. Like, I think this is hiding stuff from us. Can we open that 10-kubeadm.conf? Because it's going off the screen, and I think there's a flag on the end of it. So it's under /etc/systemd/system. Yeah. I'm gonna wrap it here though. Where is the hiddenness? I almost want to journalctl -u kubelet -f.
54:16 All that stuff. We've spent too much time on this cluster. Feel free, anyone, just tell us where the setting is, because we've checked everywhere I can think of except maybe, like, /etc/profile. Yeah. If they added anything in there. Okay. That was a cool command. I like that cat thing. I wasn't aware of that. I always end up just trying to find the unit file and then getting confused. If you actually did that, Matt, that would have been hilarious. I bet you set up an alias in the actual system configuration, which won't be loaded
55:03 by Teleport. But they tried to replace Vim with Emacs. Oh, horrible people. Oh, we haven't looked at the mutating webhook. Oh, that's so mean. That's the worker, so you have to go back to the control plane. And it's mutating webhook. You put. Yeah. I'm lost without my autocomplete. There's a defaulting one. Dot k8s.io. Look at that. I actually didn't know you could do that. That's relatively new. So I think, again, if we change the resources to foo like this. Not actually sure what this is doing. Modifying.
55:11 Debugging Cluster 1 Mutating Webhook
56:15 Right? Side effects, none? Oh, no. It's calling this external URL, which is doing the mutation. So they're serving it from their own thing. I really wanna just see what this is. Yeah. You need to pass it a body and stuff. Alright. That's good. I like that. Bastard. So it looks like the mutating webhook is modifying the node as it joins the API server to mutate the available resources. Am I right in what you had done there? Yeah. They've added that mutating webhook. It calls a third-party external URL, and they're writing in that weird stuff too.
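For anyone wondering what such an external URL hands back to the API server, here is a minimal sketch of a mutating webhook response. It builds an AdmissionReview reply carrying a base64-encoded JSON Patch. The patch path and value are assumptions chosen to match the symptom (allocatable CPU of 10m); the session never showed the breakers' actual payload, and the function names are illustrative.

```python
import base64
import json


def admission_response(review: dict, patch_ops: list) -> dict:
    """Build the AdmissionReview that a mutating webhook sends back."""
    patch = json.dumps(patch_ops).encode("utf-8")
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],  # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(patch).decode("ascii"),
        },
    }


def decode_patch(response: dict) -> list:
    """Decode the JSON Patch operations carried in a webhook response."""
    return json.loads(base64.b64decode(response["response"]["patch"]))


# Hypothetical incoming request and patch, for illustration only.
incoming = {"request": {"uid": "example-uid"}}
ops = [{"op": "replace", "path": "/status/allocatable/cpu", "value": "10m"}]

reply = admission_response(incoming, ops)
print(decode_patch(reply))
```

Because the patch is applied on the way in, editing the object afterwards does not stick if the webhook rewrites it again on the next update, which fits the "it's still there" behavior seen earlier.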
57:04 That's good. I always forget to check the admission controllers. I really need to get into better habits. Alright. So let's look at our pods. Of course, it runs on. There we go. Look. Do you want me to say Knative or native, Matt? It's such a powerful part of Kubernetes that you can basically modify any resource as it turns up in the API server. Like, if your Kubernetes cluster was to be attacked for some reason, you would basically be chasing things forever.
57:12 Verifying Cluster 1 Fixes
57:46 If you can host your own URL as well, that feels pretty powerful. Alright. I am calling this one fixed. Good job, Lee and the audience. Let's get on to our second cluster. Is it too early to have a drink at 10AM? I think I'm starting to feel like I might need one. Yeah. I'm not far from my victory beer. But oh, come forward. Yeah. My losing beer. Consolation beer. Right. Okay. Our second cluster is broken by Akos, at puck on Twitter. I guess we're just going to do the exact same thing again and
58:16 Initial Assessment (Cluster 2)
58:29 Did you set up the alias here just in case I. Yeah. K get nodes. Nope. Oh, god. I can't remember the syntax. Did you join your own session? Oh, no. I think I joined the one we set up earlier. So hold on. Let me get into that one. Alright. Okay. No. It's okay. I'll join that one. It's fine. Oh, no. No. No. I've closed it now. I need to close it down again. So if you set up a session, I'll join it now.
59:00 Cluster 008 by Akos Veres
59:00 Okay. I'm in a session. We have our alias. I am unable to run get nodes. Excellent. I'll be with you any minute now. Your colleague Paul has said it's never too early. It's 6PM. Okay. From our breakers as well. You know what. Okay. That's timing out. I'm not sure what's going on there. So I'm gonna let you drive. Let's just do the pulse check on all the control plane components and see where we are. And so it's running. API server's running. Controller manager's running. Kube-proxy's running. Kubelet's running. Let's check. The kubelet is actually
59:37 Debugging Cluster 2 API Access and Components
59:50 healthy. I have a theory that I don't want to share just yet. Okay. So we're getting RPC errors as it's trying to connect to the API server. I wonder. Oh, and, yes, I don't know what the. There we go. Have we had somebody mess with the iptables? Reload it. It's gonna not be a scrolling thing. There we go. So you think it's an IP. No. I was just making sure that, like, there was no. Because the kubelet's trying to connect to the API server at the moment, and it can't,
1:00:01 Debugging Cluster 2 Kubelet RPC Errors
1:00:53 and it's, like, timing out even. And if you look at the errors here, we're getting a lot of RPC errors. And my theory was that we've rejected all the traffic, but I think that was a poor theory at this point. Okay. But we do have an API server running. Right? We've seen that on a previous test. Okay. Interesting. Can we curl it or whatever? Wanna drive that? I think you probably want the IP address and then healthz or something like that to make sure it's
1:01:28 Checking Cluster 2 API Server Health (429)
1:01:39 or you could just cheat and. What we need is here. So advertise address. URL. HTTP. This is why I shouldn't drive. Okay. HTTPS, add insecure. K. Yeah. healthz. Too many requests. Okay. So Akos has confirmed the iptables was not touched. Thank fuck. Seriously. Okay. So if the API server is unhealthy for some reason, what could have caused that? Nothing wrong with drinking at 10AM, Carlos. You enjoy. I think Lee will. I'm on vacation after this too. So I thought I need, like, two weeks off in order to get over this. So
1:02:43 Okay. So we got too many requests. So we need to work out why it's rejected. So you have just pulled up, is this the API server manifest? Yes. Okay. And I'm just trying to see if we have restricted the allocations to this, and it doesn't look like it. But that's not even the right place to look, because it would be in the kubelet, I think, wouldn't it? But the API server is rejecting our curl. So something's wrong with the API server. But the too many requests is probably. The pod that's running probably has
1:03:26 a limited number of actual OS allocation. I don't think that's the right way of phrasing it. Are you following along with what I'm trying to say? Like. No. So the API web server has got allocations at the pod level, like actual operating system allocations of, like, CPU cycles and memory and all that kind of stuff. And even just a standard control plane with, like, a few nodes and a few pods is basically overwhelming the API server because it's serving too many requests, is what I think this is saying.
1:04:05 So my working theory is that we've restricted the allocations at the pod level to, you know, give it only 10 megabytes of memory or something silly like that. I could be way off, but that's what I'm currently thinking. Okay. Can we not check that from the proc file system, I think? Yes. And you're gonna have to drive that because I am out of touch with the. If we just grab the process number, the process ID. What we got? There's a way to see that kind of stuff. If it was at a
1:05:02 cgroup level, we should be able to kinda get something from here. Too long. Okay. Limits. That throws my theory out of the window then. Yeah. I think that's fine. So let's go back to the manifest. What do we got here? Right? We got our etcd. Oh, the timestamps, and none of that has changed. So there goes that plan. Right. Yeah. What about /var/lib/kubelet? I just wanna see if the API server. Did we look at these yet? Not yet. No. Wait. You just looked at this. I think you pulled these up, didn't you? Alright.
1:05:58 Okay. Alright. So you want to go to device-plugins, pods, pod-resources. I don't know where those directories are. Interesting. Man, the old cluster, know. So I remember seeing those before. I have no idea. I don't think. Okay. I David. I don't even know where to start with the too many requests thing, to be quite honest. Like, it's not very promising. What's our symptom again? Can't hit the API server. And your hypothesis? API server's overwhelmed. It's serving too many. I guess what we could probably do is just restart the kubelet here.
1:07:11 I mean, that's not gone into any trouble at all yet, has it? Has to be the API server. Let's run gantlet. In fact, let's try that curl again. Too many requests. Okay. Let's get some headers out of this. We're getting a 429. Boom. Mhmm. Which is too many requests. I just say the HTTP code. I just stated it for my own sanity more than anything else. Did we modify the manifest? I don't know what's going on there. Alright. When in doubt, API server 429. Overloaded API server sends a 429. We're not getting a lot of
1:08:34 help from the audience on this one. I'm just looking up ctr. C s. A. That's interesting. I would've expected that to list something. Certainly, I brought one of these courses, it did. No. So valid sent me this command to fix that, which I think is needed in these environments. So if we do crictl.yaml and drop in this path, think no, okay. Didn't quite work. Well, he says he wants to see the logs on the API server. So let's do that. kube-apiserver. Wish I had monitoring on these machines. I need my time
1:09:44 series. Why don't you just open that in Vim? A log file? Yeah. Let's just. Alright. So that started okay. That's when I spun up the machine. So let's just shift-G right down to the bottom. Like, I haven't seen the actual error yet. Right? Alright. Just gonna scroll up very quickly. I'm looking for my pattern matching. I wonder if etcd has been messed with, and that's why the API server is saying too many requests. Because the API server connects to etcd, and if it can't write to etcd for some reason or. Well, you'd suggest is there some for runtime?
1:10:27 Debugging Cluster 2 etcd Connectivity (Red Herring)
1:10:51 I don't see it. Let's go down. Let's assume your theory is maybe right then. So let's confirm etcd is healthy or at least running. We never actually checked. I think we did see it earlier, and there should be a health endpoint. I think it is. etcd cheat sheet. I can get us quickly to the health check. But let's do a ps and make sure it's running. Or just curl if you want. Yeah. Yeah. It's there. Okay. And then I think it's l
1:11:34 k l c. I'd expect that to work. I guess I could just do, that's client URLs. For etcd, we'd have to pass in the certificate, the key from there slash. Oh, cost reward. Let's just get Kubernetes as you do. I just use this all the time. I never remember the etcdctl command. That's what I. Are we on the right track, Akos? Like, I'm pretty stumped here. I'll be quite honest. Like, I think I said it's SEB. Can't remember. etcd client. Cool. Okay. So etcd is configured to work, and it's timing out. So there we go. So we need
1:13:09 to fix etcd. Oh, Akos has crushed my theory. etcd is fine. etcd is fine. Not touched. Excellent. This is great. So etcd is broken, though. Akos suggested we look at the kubelet log file. You feel like we've looked at this, and it didn't help much? How did we not. No. It looks like we were too focused on the API server. So it is running. You wanna just get the last few hundred lines and see if we can pinpoint that error? Okay. What we got? A lot of errors is what we've got.
1:14:21 Like, a lot of errors. See if I can get this to start. I guess what we could do is just restart it and then tail it as it comes up so we don't have to go through all the scrollback. I feel like I'm committed now. Okay. Yeah. Let's restart the kubelet and then jump straight into the tail. And if we need to, we can always modify the unit file to have it not restart. K. So it's starting up. It's connecting to it. Oh my god. That was fast. Lots of errors. I'm scrolling. I'm scrolling. I'm scrolling. So many
1:15:24 errors. If any Kubernetes developers are watching, can you not, like, panic and give us a full stack trace like this? That'd be great. Oh, yep. I broke it. The error scrolled off the screen, and it's restarting. So I'm going to grab 5,000 of these. Put them into a file. Okay. I don't think I've even got the start of it. That looks different. Okay. No. Not yet. Just a wall of text. It's just a complete stack trace of all of the. Oh, wait. So this looks. There we go. There's stuff. Okay. Okay.
1:16:36 K. Keep going. I think it's because no NVIDIA devices were found. Like, that was a log message that seemed relevant. Keep going. I guess that's the error that was mentioned earlier, and we never saw. But we've got an unknown service runtime. Failed to run kubelet. Failed to create kubelet. Get remote runtime typed version. Okay. So something has gone wrong with containerd. Which we were on that track a little while back and then got off that track. Okay. So let's do status containerd, journal containerd. There we go. We never checked this. Like, we never we
1:17:15 Debugging Cluster 2 Containerd (Exited)
1:17:54 never we just assumed this was running correctly. This started ten minutes ago. Alright. Let's. That'll do it. Sometimes it's a simple thing that we overcomplicate. Alright. Let's see. K. Seven seconds ago. So I should be able to do this now. Oh, wait. No. What is it? crictl. crictl pods? Okay. containerd is fixed. I wonder if that gets us back online. Do you wanna try to get nodes? Alright. Looks pretty good. Akos is laughing at us. Okay. Yep. The simple things, and he's also said that was. Sorry. I assumed you'd. But at least we
1:19:22 Debugging Cluster 2 Pod Scheduling
1:19:35 have an API server now. So I mean, there's every chance this has happened on all the worker nodes too. But it looks like our control plane is generally coming up okay. And, like, once our DNS comes up, we should be okay, I'm guessing. Well, Akos said there's a second problem, so we'll need to try and see if we can work that out. But yeah. Let's see if we get CoreDNS. It doesn't. Do I describe that pod? Where are these supposed to get scheduled?
1:20:20 Well, if you just describe it, we'll see if there's any conditions that are failing and then. Sorry. Let's see. No events. Initialized but not ready. Pod scheduled. Nothing obvious there. Can I run get pods again? Okay. So we've got a problem here. Okay. We're on the control plane node. CoreDNS wouldn't be scheduled there. Should we go check out the worker node? I think that's a good plan. Okay. Let's assume they've made the same change. These people hate us. I'm just joining your session. Hold on. There we go. Cool. Do you wanna grab the
1:21:21 what bottom marker v c and then just fix the containerd config and then we'll see where we are. Servers connect. Okay. So. And it was in cat /etc/containerd. Yeah. Just delete. I'll figure out how it works. K. So we've got we're working already now. Third one, I'm assuming, is coming. Yeah. I just restarted it, so it should be up any minute now. Second. And then we'll try and work out why that CoreDNS isn't getting scheduled, and we'll do it all under eight minutes. Alright. Can't wait. We got some CrashLoopBackOffs on the
1:22:19 Debugging Cluster 2 CNI Init Container Error
1:22:21 CNI, which looks interesting. Right. Okay. So are you driving? I was gonna describe. I'll watch. Describe po. Oh god. I've got something in my. That is how you accidentally cause production outages, everybody. Exec init-container.sh, no such file or directory. Cilium has that init container that is. Oh, wait. K get cm cilium -o yaml. We've got to scroll past all of the. Oh, wait. No. That's gonna be in the pod definition. K get ds cilium. And if I upgraded the cluster to 1.21, we wouldn't see all that managed
1:23:26 fields. I need to remember that for next time. Okay. So there's an init container. And it looks like the command for the init container has been edited to add a forward slash to it. Does that look like the right thing to you? Hold on. Containers. Containers. Where's init? Oh, there they are. Okay. No. The /init-container.sh does look okay. Do you wanna copy the image name? Oh, we don't have Docker. Let's see. Let's just run it and poke around. But what was the error message again? It said that the. Let's look at the
1:24:14 said that it's trying to execute init-container.sh, and there's no such file or directory. Okay. Just because my screen is shared, I'm gonna copy that image name and run it locally and take a look in the root directory. You alright with that? Yep. That works for me. Yep. Because we got six minutes. Easy. Alright. So, docker container run --rm -it, override the entry point. Bash. Assuming it's got bash in it. I mean, normally, it's like entrypoint.sh in the root of the container. I'm
1:25:02 not familiar with the Cilium one. So I think it should be there if we just look at it. It looks like Akos is saying that the first problem is not fixed. So. Oh, they followed up with is fixed, I think. Right? Oh, problem one is really fixed. Why is my connection going slow? Maybe it's the Docker Hub rate limit. The connection is pretty fast. I wouldn't expect the 25 meg file to slow me down. Okay. Come on. I'm impatient. Gonna start pulling all these images out of
1:26:00 the fence. Alright. So well, let's think about what this is actually saying, though. Right? Because it's, like, the default command in a container. Like, somebody in the chat is saying to remove the command, and we should just use the default, which is totally valid. But they could have put it there for a reason. Right? Yeah. Yeah. What's a little confusing is, how would you change that mount point to, like, remove. Like, if this is the upstream Cilium image, how have you managed to remove the default command?
1:26:43 I wonder if they've added any. Do you wanna edit that DaemonSet again? I wonder if they've added any volume mounts. That would just make that. That's kinda what I was trying to. Yeah. K edit ds, minus n Cilium, Cilium. Right? That's for volume. Account type. All you might mount. No root there, is there? Oh, wait. This is the command. It's not the entry point. Change command to entry point. Like that? Uh-huh. I think that's it. Nope. No. No. We can print it. It says command and args. Yeah. Yeah. That's what
1:27:58 I. Yeah. Okay. Where's the entry point gone? There we go. Goddamn it. Why is my. Yeah. So this should be command. Right? Yeah. That was just me being silly. The audience is suggesting the command shouldn't start with a slash. I don't think that's a problem. I think it is a problem. I'm with the audience. So I'm gonna edit this and see what happens. Container error. Alright. Do you wanna get the logs out of that again then? Yeah. That doesn't change anything. It's still. Script
1:29:07 is still in the root directory. It's still called init. Is it called init. I'm gonna copy this verbatim. Hold on. Copy. I'm gonna go over to here. Assuming it worked, it didn't crash. Yeah. I got zero. Why is it not working here? Like, what could cause that? Let's put that back because it made it worse. So we could remove the init container and hope it's not needed. But I think it will be. But I don't think that's the fix. Slash init. Okay. Init container. It must be. Someone suggested we could always have command as bash and then args
1:30:22 dash c slash, and I can do that. I mean, I'm not convinced that's a problem because there is a shebang in the file. Although, assuming, is there? I mean, the only way that I can think of to override, like, to remove. Like, this is the upstream 1.8.5 Cilium container. The only way that I can think of to remove that init container script from the root file system is to mount a volume over the top of it. And we don't see any volumes in the DaemonSet. Oh, well, what if
1:30:56 they. Sorry. Pre-pulled the image, and then the pull policy on it is IfNotPresent. The image could be a random image. That's a great point. Image pull policy. IfNotPresent. IfNotPresent. I set it to Always. There we go. Do you wanna use the. Oh, wait. We're running out of time. Do you wanna do that? Because I'm out of this Vim. Yeah. Done. Oh, okay. Oh, look at that. That was very sneaky. Yeah. So, I mean, like you say, you pre-pulled the
1:31:47 Verifying Cluster 2 Fixes
1:31:55 image and then removed the init container script from the image. That's really, really mean. So yeah. For all we know that image has been retagged or something. We fixed it. We're two minutes over, but we've. What? Two minutes over. I feel like that's a victory. I do feel like we got there in the end. We went off in the wrong direction to begin with and then got there in the end. Yeah. Nicely done, Akos. That was good. It's one of those things, like, you have a moment of clarity,
1:32:34 you just don't think about it. That was good. Two really good problems there. And you're just checking if we can run stuff. I know we're over time, but I feel like I'm not gonna sleep tonight if I don't actually try and run a pod. So there is a CrashLoopBackOff happening, but it's Ceph, and who needs storage in a. Oh, no. Ceph will take a lot of time to get healthy. Yeah. There we go. It's running. Excellent.
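The image pull policy trick behind the second break can be modeled in a few lines. This is a simplified sketch of the documented policy semantics, not kubelet code, and the function name is made up for illustration.

```python
def kubelet_contacts_registry(pull_policy: str, image_cached_on_node: bool) -> bool:
    """Whether the kubelet asks the registry about an image before starting a container."""
    if pull_policy == "Always":
        # The registry is consulted every time, so a tampered local copy gets replaced.
        return True
    if pull_policy == "IfNotPresent":
        # A local image with the right name and tag is trusted as-is.
        return not image_cached_on_node
    if pull_policy == "Never":
        return False
    raise ValueError(f"unknown imagePullPolicy: {pull_policy!r}")


# With a doctored image already on the node, IfNotPresent never checks upstream.
print(kubelet_contacts_registry("IfNotPresent", image_cached_on_node=True))
# Switching to Always makes the kubelet go back to the registry.
print(kubelet_contacts_registry("Always", image_cached_on_node=True))
```

That is why changing the policy to Always was enough here: the node stopped trusting whatever had been placed under the Cilium image name locally.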
1:33:07 Okay. Nice. Alright. I think we got there. Yeah. Thank you to all of our breakers. Not to Matt. Really, I'm just thanking Akos. Those were some good problems, actually. And I mean, it's difficult, right, when you're staring at this and you're trying to work it out and just feel a bit of pressure. It's easy to go down rabbit holes. And I think we had a few, but I'm really happy. I think we did really well there. That was great. Oh, you did. Yeah. And when we got we
1:33:13 Conclusion
1:33:38 got two working clusters. It's a team effort. We got two working clusters at the end, and I learned about containerd. When I was doing this a year ago, the default, you know, container runtime was Docker. So I've learned a lot about containerd today. So I feel like I come away with something. Alright. Lee, thank you. That was awesome. We did it. We fixed it. Well, it's got a good comment. You could change your background now. Everything is fine. We have actually. Yes. Absolutely. Yeah. I can change it to, you know, me on vacation, I guess.
1:34:14 Nice. Alright. Well, thank you again, Lee. Enjoy your vacation. We will be back next week with more Klustered. Have a good time, everyone. Bye. Thanks, everyone.