About this video
What You'll Learn
- Troubleshoot Kubernetes LimitRange restrictions by tracing controller-manager events and adjusting pod resource settings.
- Diagnose ImagePullBackOff by validating image pull secrets, registry credentials, and image policy consistency.
- Repair node scheduling and Cilium startup issues by auditing taints, resource quotas, and containerd mirror rewrites.
Mahmoud Saada and Marques Johansson debug two broken Kubernetes clusters, tackling a LimitRange blocking scaling, a bad image pull secret, resource quota and taint issues, plus a Cilium outage caused by a containerd registry mirror rewriting images to honky.io.
Jump to a chapter
- 0:00 Viewer Comments
- 1:38 Introduction to Klustered
- 1:47 Housekeeping & Announcements
- 2:28 Guest Introductions (Mahmoud & Marques)
- 3:58 Starting Cluster 1 Debugging (Marques's Cluster)
- 4:50 Initial Cluster Checks & Scaling Deployment
- 5:44 Scaling Deployment & Observing Failure
- 8:22 Investigating Control Plane / Controller Manager
- 11:16 Analyzing Controller Manager Logs: Resource Limits
- 12:13 Identifying & Deleting LimitRange
- 14:20 Teleport Disconnect / Session Issue 1
- 15:15 Teleport Recovers / Coincidence?
- 16:06 Debugging Image Pull BackOff (Bad Secret)
- 21:46 Fixing Image Pull Secret & Pod Runs
- 23:10 Finding Missing Database (StatefulSet)
- 23:58 Scaling StatefulSet & New Scheduling/Quota Errors
- 25:58 Debugging Resource Quota & Persistent Errors
- 27:21 Debugging Node Scheduling (Taints & Selectors)
- 32:33 Teleport Disconnect / Session Issue 2 & Removing Node Selector
- 35:59 StatefulSet Pod Runs & Cluster 1 Appears Fixed
- 37:12 Testing Application Access & Confirming App Version (Cluster 1 Fixed)
- 43:30 Transition to Cluster 2 Debugging (Moody's Cluster)
- 1:05:05 Fixing Controller Manager Image
- 1:08:07 Restarting System Pods (Cilium, etc.)
- 1:12:28 Applying App Update (v2)
- 1:14:17 Debugging Slow Image Pull & Final Fix
- 1:17:24 Recap of Breaks & Conclusion
- 4:41:11 Hint for Cluster 2
- 4:44:19 Starting Cluster 2 Debugging: Widespread Failures
- 4:56:13 Debugging Cilium: Honky.io Image
- 5:05:15 Identifying Containerd Image Mirror Issue
- 5:23:33 Modifying Containerd Config & Restarting on all Nodes
- 5:40:11 Containerd Fixed: Cilium Pods Initializing
- 5:58:03 Investigating Persistent Honk Image Issue (Static Pod Manifest)
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:00 Viewer Comments
1:32 If I wanted to write comments during... Hello. Oh, look. Yeah. Hello, everyone. Welcome to today's episode of Rawkode Live. This is Klustered, the show where I get to look really silly fixing clusters broken by members of the Kubernetes community. And before we get started, there's just a little bit of housekeeping. I would encourage you all to join... where's my button? I swear I've done this before. It's been a break. Please join the Discord channel, Rawkode.chat. There are 300 and something people in there now talking all things Kubernetes and cloud native. Come and say hello. It'd be a pleasure
1:47 Housekeeping & Announcements
2:07 to have you there. Also, if you've not subscribed to the YouTube channel, you should do so now. Remember to click the bell and you will get notifications for all future episodes of Klustered and other episodes of Rawkode Live. I will do my best to cover all of the different technologies in the cloud native landscape, giving us a hands-on tutorial on how to get started. Alright. Now for Klustered today, I am joined by Mahmoud and Marques. I almost called you displague there because I was looking at your face and your name. But thank you both
2:28 Guest Introductions (Mahmoud & Marques)
2:36 for joining me today. Mahmoud, how are you? Thank you. Thank you for having me. I'm doing great. My name is Mahmoud. I also go by Moody. I'm very excited to be here and work on what we described as a car crash of a cluster that Marques has set up for us. I'm very excited for that. The clusters are always fun. And, Marques, please tell us who you are. Yeah. Marques Johansson. I work at Equinix Metal on all sorts of integrations, Kubernetes and Terraform and SDKs, etcetera. And, yeah, also looking... looking forward is, like, not the
3:14 right set of words, but this will be an adventure. You're right. Definitely. I totally forgot to introduce myself. Sorry. I work at Carta as an SRE, and we're hiring. If anyone's interested, please DM me. That's it. I just assumed you didn't want to share with us. That was all, you know, but no. Thank you for that as well. It's interesting that you say you're looking forward to it and we talk about car crashes. Like, it's definitely the most daunting thing that I do on a weekly basis. But at the
3:50 same time, it's weirdly fun and I'm hoping we all enjoy this today. Okay. Well, let's get started. I will pop up my screen. I'm hoping you both have access to our Teleport clusters. Cluster number 29 on episode 13 is from Marques here. So will I call you Moody? Is that what you prefer? Yeah, sure. Moody. Right. Okay, Moody. We're jumping into this one together. Are you ready? Sounds good. I'm excited and scared at the same time. Oh, good Firefoxes. Oh, there we go. My computer has been struggling today. I hope this is not part of episode thirteen's curse,
3:58 Starting Cluster 1 Debugging (Marques's Cluster)
4:37 but I am in a session. I will type echo hello. If you wanna just do the same, let me know you're in the same session. There we go. Perfect. Yay. Wonderful. Alright. So we've got a little bit of setting up to do. I'll do this for you and then I will give you the honors of running whatever command you want to check if we have an API server. Oh gosh. The flow... Should I be able to see the session too or housekeeping? I'm hoping you can see my screen.
4:50 Initial Cluster Checks & Scaling Deployment
5:13 I can see your screen. I can't see the session in the list of active sessions. It's a different cluster. You'll just need... because you're the breaker, you don't get access to the terminal. You're... Why not? Alright. ...to the broadcast view. Okay. Cool. I guess, should we get started? Yeah. Go for it. I recommend get nodes, get pods, pick whatever one you want. It's always good to work out if we have an API server first before we start doing anything else. Yeah. Oh, thanks for the alias. Yeah. I guess we'll check. Oh, we have
5:44 Scaling Deployment & Observing Failure
5:46 we have a response. That's good. Around 1.20.4. Awesome. Let's check what's going on with the nodes. Are our nodes okay? Looks fine. We can check pods. Alright. Let's check the default first. The default namespace, no real pods. Let me check if there's any deployments, maybe. See that this was scaled down. I don't know if this is part of the puzzle, but maybe there are notes here for us that we can look at. So let me see. Nope. No notes here. Okay. So services, ingresses. Any ideas? David, please. Yeah. Yeah. So let
6:38 me catch up there. I had to refresh my screen just because there's a weird bug in Teleport that's not gonna affect. But you ran get pods -A. Right? And I kinda missed out there. So what we see is this is a pretty healthy looking cluster with the exception of when you run get deploys, it looks like it's scaled down. Mhmm. I would suggest we just edit that deployment and check what the replicas are set to, try and scale it back up. I'm assuming it's not gonna be that simple. However, we'll find out. Trying to
7:08 look at Marques's face and see if he's laughing or smirking or not. Yeah. We need to monitor Marques. So I'm gonna say one replica. Does that sound good? Yeah. Go for it. One of the things I love about this show is just seeing the amount of ways that people modify and interact with the manifest. I'm a kubectl edit kind of person, but a lot of people like doing like deploys and scales and stuff like that. Yeah. Sorry. I hope you can't hear that noise. I'm in New York, there's a
7:33 lot of noise outside. Hopefully you can... No. All good. Yeah. Alright. So okay. So yeah. Go ahead. Sorry. I was just gonna say you've scaled up to one replica. We can see we have one desired but we're not getting anything here. No pods. Yeah. We've ran a replica set. Mhmm. Our control plane is potentially being played with a little bit here. I wanna look at, yeah, I wanna look at the deployment. Think it's called... Yeah. Klustered. Yeah. Thanks. Alright. We just see that it's scaled. We're not seeing any other events listed here.
8:12 FailedCreate, ReplicaFailure. This could be interesting. Yeah. Let's look at the concern that the break isn't the break that actually happens. My first thought is the flags on control plane components. I've removed the replica set controller. So... The flag on where? Sorry. On the API server, it's actually a controller of controllers. And you can actually disable individual controllers like the replica set controller or the pod controller. I suspect that could potentially have been messed with. Let's look at kube-system, actually. Let's see if all our
8:22 Investigating Control Plane / Controller Manager
9:04 what's running in here? etcd, API server, controller manager, scheduler, CoreDNS. I mean, everything seems suspiciously normal in here. Oh, wait. I do see... oh, that's the name of the cluster. Right? displague. That's your handle. Right? That's his handle. Right? Every time I see it on screen, I'm, like, getting highlights. It's like Slack highlights going off in my head. I can't help but try to respond to the screen. Yeah. So, David, you were saying you wanna look at one of those controllers? Yeah. Let's go to the
9:45 manifest directory, and we'll see if Marques has been kind enough to... it doesn't look like anything's been modified, but that doesn't really tell us anything. So my first explanation was they should... Sorry? All those folks who go after file times, like, this can be manipulated so easily. Yeah. Yeah. I think on the cluster I broke, I set them all to like 1942 or something like that. Just to confuse people. Alright. So my gut tells me that we should be looking on the API server here. There is a
10:24 No. What are we... are we looking for honks? What are we looking for? Yeah. Well, maybe I'm looking at the wrong one, but there's a flag you can pass to the API server to tell it which controller. Oh, in fact, it's controller manager. I'm being ridiculous. Yeah. Maybe controller manager. Let's see. So what do we got here? Yeah. Here. So this modified controllers. Now the star means run everything, but you can actually do replicaset to disable that controller, but I don't see that. So there is my hypothesis. So the controller of controllers
11:07 is most likely running that specific controller. The, like, replica set controller. Right? So right now, I would assume it's running. So my next thought is we probably wanna take a look at the logs of the controller manager. That's not sent to you. Okay. Yeah. Sure. Let's find the pod in kube-system. And I'll say logs, tail, I don't know, 100. Oh, I have to say kube-system. Sorry. Alright. There you go. Interesting. Minimum CPU per container is 1. Request is 500. Minimum memory usage per container is 256, but required 128.
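For readers following the controller manager detour, here is a minimal Python sketch of how the `--controllers` flag on kube-controller-manager is documented to be read: `*` enables all on-by-default controllers, `foo` enables a controller by name, and `-foo` disables it. This is a toy model, not the real implementation, and the controller name `replicaset` is used only for illustration because names can differ between Kubernetes versions.

```python
# Toy model of the kube-controller-manager --controllers flag.
# Documented semantics: "*" enables all on-by-default controllers,
# "foo" enables controller foo, "-foo" disables controller foo.
# Controller names used below are illustrative assumptions.

def controller_enabled(name, controllers_flag, disabled_by_default=frozenset()):
    """Return True if controller `name` would run for this flag value."""
    has_star = False
    for entry in controllers_flag.split(","):
        entry = entry.strip()
        if entry == name:
            return True       # explicitly enabled
        if entry == "-" + name:
            return False      # explicitly disabled
        if entry == "*":
            has_star = True
    # Without an explicit entry, only "*" can enable the controller,
    # and only if it is not one of the off-by-default controllers.
    return has_star and name not in disabled_by_default


if __name__ == "__main__":
    for flag in ("*", "*,-replicaset", "deployment"):
        print(flag, "->", controller_enabled("replicaset", flag))
```

With the flag left at `*`, as seen on screen, the ReplicaSet controller is running, which is why attention moves to the controller manager logs instead.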
11:16 Analyzing Controller Manager Logs: Resource Limits
11:57 Oh, this, okay. This feels familiar. I'll say what I think is going on. It might be a limit range going on in that namespace perhaps. Yeah. Either the limit range has been set on the namespace or the deployment has been modified. Yeah. I think we should check both. Okay. Do you wanna do one of them? Sure. Okay. So I know what the manifest should look like. So I'm gonna jump straight in... Yeah. Into here. Alright. You're the main author. That's right. Yeah. Yeah. This is my million dollar application. So Marques, did you say something? Sorry.
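The log line read out above is the kind of rejection a LimitRange minimum produces when a container requests less than the namespace allows. The sketch below is a simplified model of that comparison, not the real admission code. The units (`500m`, `128Mi`, `256Mi`) are assumptions, since the captions only give bare numbers, and real LimitRanges also handle maximums, defaults, and ratios that are not modeled here.

```python
from decimal import Decimal

# Only the suffixes needed for this example; real Kubernetes quantities
# support more (assumption: units in the log were m and Mi).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
}


def parse_quantity(text):
    """Convert a quantity such as '500m' or '128Mi' to a Decimal."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def limit_range_violations(container_requests, minimums):
    """Compare one container's requests with LimitRange-style minimums."""
    violations = []
    for resource, minimum in minimums.items():
        requested = container_requests.get(resource)
        if requested is None:
            continue  # defaulting of missing values is not modeled
        if parse_quantity(requested) < parse_quantity(minimum):
            violations.append(
                f"minimum {resource} usage per Container is {minimum}, "
                f"but request is {requested}"
            )
    return violations


if __name__ == "__main__":
    requests = {"cpu": "500m", "memory": "128Mi"}   # assumed app requests
    minimums = {"cpu": "1", "memory": "256Mi"}      # assumed LimitRange min
    for message in limit_range_violations(requests, minimums):
        print(message)
```

Either raising the requests on the deployment or removing the LimitRange makes the list of violations empty, which matches the two options weighed in the session.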
12:13 Identifying & Deleting LimitRange
12:52 No. I said that's just it's cheating that he has the whole he's got, like, a built in differ in his head. Yeah. So I think it's probably gonna be limit ranges. Do you wanna pull them out? Yeah. Let's let's look at that. So limit range. Oh, look. Default created on today. Sometime today. Okay. So let's describe it. Limit range default. Oops. I got it here. Okay. So so that's it looks like we found the first main issue. Let's see what happens. Should we turn it off or delete it, or do you wanna change the resources on the deployment? Either one
13:37 would work. Right? We could just delete that limit range. That's not doing us any favors. Alright. Let's do it. So delete limit range. I'm gonna say... you know, it's funny because I'm laughing because I had similar ideas when I was thinking about the cluster that I would break. So that's awesome. Yeah. The limit range. And that cluster I broke, I used limit range as well because I thought they're so esoteric. Nobody would know to look there. And apparently, everybody just knows limit ranges are a thing. So I guess so.
14:16 If anyone didn't know there, now you know. Okay. So should we look at the logs again? Because I'm not seeing the pod show up. So... Yeah. I think that's a good idea. Oh, I'm typing, but I'm not getting any feedback. Do you mind typing it instead? Because it's not typing for me. Yeah. Danger. Uh-oh. Okay. I'll refresh and rejoin the session. Yeah. I'm gonna reload too. Reload. Uh-oh. Uh-oh. While I was breaking it, my Wi-Fi, like, cut out while I was doing it, and I worried that I had
14:20 Teleport Disconnect / Session Issue 1
14:59 done something I had not intended to do. Oh, man. I don't think I have access to... Yeah. I think Teleport is down. Marques, did you... Teleport? My hands are right here. Wait. Did we do something that broke Teleport? I mean, all we did was delete the limit range. Could that have... delete-happy kind of debugger. I suspect the max of one gig on the memory potentially let something loose. But I didn't see any other pods running on the system. Teleport, did it go out of memory or something?
15:15 Teleport Recovers / Coincidence?
15:50 I don't know. Something happened for sure. Well, let's ask Marques. Should Teleport still be working? It should still be working as far as I know. Yeah. Although it's great to watch you, like, just, like, paint this bridge that you're gonna drive down and... Alright. So as a backup, if you wanna just jump over to my screen share, we'll try and SSH up to the host. Yeah. Yeah. Yeah. Sounds good. The host is down. Oh, there we go. Okay. So Marques, what have you done? How are the other nodes? Well, yeah, I guess. I've got... I've pulled up
16:06 Debugging Image Pull BackOff (Bad Secret)
16:27 the portal, but, yeah, why don't I try another node first? Oh, yeah. Yeah. Let's try the worker. The timing is uncanny, but... Yeah. I really don't think I triggered this. Let's see. So yes. Something happened. When we deleted that limit range, there's been a causal event. It feels, yeah, it feels almost like a direct side effect... Teleport's back. It's back. It's back. Okay. Okay. Go. Go. Go. Alright. Refreshing. Our session is still there. Yeah. Let's see if the CPUs went wild. The API server, no. That's okay.
17:13 Load average actually hasn't changed. Okay. That that timing is uncanny, but I think it was Yeah. I'm looking at the I'm looking, yeah, I'm looking at the the the what's it called? The three CPU metrics, and everything looks looks okay. Yeah. I I think it's fine. I think that was I I don't really believe in coincidences, but it must have been a coincidence. I mean, we can we can look at the teleport pod and see what happened. Right? Yeah. It's running on a host because I don't trust people not to kill it in a container. Oh, gotcha. Gotcha. Okay. Okay. So,
17:54 Marques, how are we doing? Give us, like, maybe a little hint. How are we doing? Are we okay? Are we doing good? Or... No. Doing good. Like I said, I love the paths that you're setting and that you're exploring. I think it's kinda giving me some insight into things to look for in your cluster. So keep playing those tabs and... Okay. And take note. Alright. So looking at logs, it looks like we got a pod created. Oh, ImagePullBackOff. Okay. Good. Let me just pop this terminal up so that I'll cut off
18:37 that. And, Khalid said, just joined, what am I missing? I'll do a quick ten second summary while we do that. Currently, Moody and I are working on Marques's cluster. Marques used a limit range to stop our Klustered pod from being created, which we've deleted, and that blew up our cluster, but magically it came back. We don't know why. We're not gonna... yeah. Actually, if that just now. We're now seeing an ImagePullBackOff error on our containers, our pod. And it looks like we're getting a forbidden from GitHub Container Registry, which is a four...
19:15 403. Yeah. So is that... well, I'm gonna assume this is part of the plan and that there's probably some, like, maybe... do we have an image pull secret, I guess, on this? Yeah. I think so. Like, this is a public repository and a public image. The only way this error I think would happen is if someone added a Docker-cred-style secret for that domain, which we don't need. So let's, yeah, let's take a look at the... I guess there's two places I can think of: the deployment itself or the pod,
19:52 which actually we have up here. So let's look. Is there an image pull secret here? I'm not seeing one. If we run a k get secrets, we'll be able to see if there's anything configured on the namespace. Is there... oh, yeah. There is. There's a dockerconfigjson called kubernetes. Kubernetes. Yeah. Which I think is to try and hide the fact that it's weird. But would it be here, you think? If I say... I don't know. Yeah. Get the deployment in YAML format. Would we hopefully see a
20:28 kubernetes image pull secret here? Image pull policy. The secret will be configured with a registry domain. Do you mind if I type? Yeah. Sorry. Go for it. Yeah. So if we do a get kubernetes -o yaml. I really need to upgrade these clusters to 1.21 and get rid of all this in here. Am I missing it? I'm thinking maybe it's in the service account. Like, service accounts can have image pull secrets. Maybe we can describe that. I don't know. But, yeah, let's look at this first.
21:11 Bananas. There we go. There it is. But it was attached somehow to that service account. Right? So can we describe the default service account? I don't think that's how the... oh, maybe I'm completely wrong. So I don't know. I'll do it just for the exercise. Yeah. Please. Yeah. I'm curious, Niro. Because I thought... I think the secret exists, it gets hard. There you go. Yeah. I knew it. I felt it. Nice. See? I'm learning shit. So I'm happy. Oh my god. Marques, our thought process is very aligned even though... yeah. So that's crazy. That's
21:46 Fixing Image Pull Secret & Pod Runs
22:01 awesome. Alright. So we should either modify that service account to remove the pull secret... What do you think? Yeah. Should we edit... you like to do kubectl edit, so let's do that. I love a good edit. Put me in Vim. I'm happy in Vim. So... Alright. So I guess we can remove the... no. No. The top image pull secrets. Yeah. Yeah. We could just delete those. Okay. So now I think if we delete the pod, it should just fix itself, hopefully. Hopefully, it's my favorite part too.
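The pull secret hunt above rests on two facts: a `kubernetes.io/dockerconfigjson` Secret stores base64-encoded JSON keyed by registry domain under `.dockerconfigjson`, and a pod that names no pull secrets of its own picks up the ones listed on its service account. The Python sketch below models both under stated assumptions: the registry `ghcr.io` and the secret name `kubernetes` come from the session, while the username and password are made-up placeholders, and the fallback function is a simplification of the service account admission behavior.

```python
import base64
import json


def make_pull_secret(name, registry, username, password):
    """Build a dict shaped like a kubernetes.io/dockerconfigjson Secret."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {"auths": {registry: {"auth": auth}}}
    encoded = base64.b64encode(json.dumps(config).encode()).decode()
    return {
        "kind": "Secret",
        "type": "kubernetes.io/dockerconfigjson",
        "metadata": {"name": name},
        "data": {".dockerconfigjson": encoded},
    }


def registries_in_pull_secret(secret):
    """List (registry, username) pairs a pull secret would be used for."""
    raw = base64.b64decode(secret["data"][".dockerconfigjson"])
    auths = json.loads(raw).get("auths", {})
    found = []
    for registry, entry in auths.items():
        username = entry.get("username")
        if username is None and "auth" in entry:
            username = base64.b64decode(entry["auth"]).decode().split(":", 1)[0]
        found.append((registry, username))
    return found


def effective_pull_secrets(pod_spec, service_account):
    """Simplified model: pod-level secrets win, else the service account's."""
    own = pod_spec.get("imagePullSecrets") or []
    return own if own else service_account.get("imagePullSecrets", [])


if __name__ == "__main__":
    # Placeholder credentials, not values from the session.
    secret = make_pull_secret("kubernetes", "ghcr.io", "nobody", "not-a-real-token")
    print(registries_in_pull_secret(secret))
    default_sa = {"imagePullSecrets": [{"name": "kubernetes"}]}
    print(effective_pull_secrets({}, default_sa))
```

Bogus credentials for a registry domain are still presented on every pull from that domain, which is one plausible way a public image ends up answering with a 403; removing the entry from the service account restores anonymous pulls for newly created pods.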
22:43 Yeah. I mean or it will take us to the next piece of the puzzle. We'll see. Don't need to wait on that. Let's just see what's happening. We got container creating. Creating. Looks good. Alright. I think we can maybe look at the logs if we have to, but this looks good. Nice. Good job. Well, we're still missing our database container. Oh, so do we have a stateful set or something? Oh, yeah. Okay. So I'm I'm internally fuming because I think one of my barriers didn't go up. But Maybe you could tell us a bit of
23:10 Finding Missing Database (StatefulSet)
23:27 that when we go through the next hard build. I think that's the best part about Klustered. That you do all these things and then you hope that they fall in a certain way and then it just doesn't really work that way. Alright. Do you wanna describe our Postgres and we'll see what's happening here? Sure. StatefulSet postgresql. Nothing. I didn't know sts was a shorthand for StatefulSet. Oh, yeah. I use it all the time. There was a... was it api-resources that shows you these things? I believe. Yeah. So you can look and
23:58 Scaling StatefulSet & New Scheduling/Quota Errors
24:03 you can run kubectl api-resources and see all the abbreviations in here. So I just picked them up as I go. So you'll probably find stateful... yeah. There's the StatefulSet one. Nice. Cool. So, but yeah. Back to what we're doing. I think we just need to scale up this StatefulSet. I think it's scaled down. So let me just scale it up if that sounds good to you. We got more people joining late. This is the episode 13 curse because I had the YouTube time wrong, I'm afraid. But thanks for joining anyway.
24:44 And you can always catch the... Maybe it's not a curse because that means more broken clusters, more broken things, which is I guess a positive in this context. Yeah. I'm not seeing anything here. Interesting. Yeah. We got a pod that can't terminate and we got our StatefulSet which hasn't been scheduled. Do we have a replica set for a StatefulSet? Oh, no. Didn't get replica set. This would be a part. Oh. Oh, a little more limits. Another limit issue. Failed quota. Uh-huh. Default, must specify limits.
25:25 Oh, this, okay. So now we're... oh, this is different. So this is almost like either... well, it's coming from the StatefulSet controller. So this is the first thing I look for, is where this event is coming from. So that means it's probably a native Kubernetes thing such as pod security policy or something like that. Well, there is this resource quotas on the namespace that we should probably check if they exist. Oh, yeah. Yeah. Okay. So get resource quota. I've rarely had to use the resource quotas, but I'm learning. So requests.cpu, requests.memory. So it
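The "failed quota ... must specify limits" event follows from a documented ResourceQuota rule: once a quota tracks compute requests or limits in a namespace, every new pod has to set those values or it is rejected. The Python sketch below is a rough model of that check, not the real quota admission code. The quota keys follow the Kubernetes naming (`requests.cpu`, `limits.memory`), while the quantities and the container spec are invented for illustration.

```python
def quota_missing_fields(containers, quota_hard):
    """Return tracked compute resources that some container leaves unset."""
    missing = set()
    for tracked in quota_hard:
        kind, _, resource = tracked.partition(".")
        if kind not in ("requests", "limits"):
            continue  # object-count quotas and similar are not modeled
        for container in containers:
            if resource not in container.get(kind, {}):
                missing.add(tracked)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical quota named "default" and a container with requests only.
    quota_hard = {
        "requests.cpu": "1",
        "requests.memory": "1Gi",
        "limits.cpu": "2",
        "limits.memory": "2Gi",
    }
    containers = [{"requests": {"cpu": "500m", "memory": "128Mi"}}]
    missing = quota_missing_fields(containers, quota_hard)
    if missing:
        print("failed quota: default: must specify " + ",".join(missing))
```

Adding limits to the pod template or deleting the quota both clear the rejection; the session takes the second route.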
25:58 Debugging Resource Quota & Persistent Errors
26:10 looks like it's... I'm trying to read the complaint here. So it's forbidden because you must specify a limit for each of those. I see. So we can either specify the limit or delete the resource quota? I'm a big fan of the sledgehammer delete approach. But don't let me guide you. Oh, you wanna do it? Go for it. Yeah. No. No. I said don't let me lead you down a dark path. If you don't want me to delete it, just tell me to shut up. But if you're happy to delete it, I think we delete
26:41 it. I love it. I love it. Sledgehammer. Let's do it. Alright. Let's hope that the StatefulSet controller picks this up. I might have to scale up and down or something to trigger it. It's not... Yeah. We may have to describe the StatefulSet again. Describe statefulset postgresql. That's still upset. Maybe there's a retry. It's like doing a back off kind of thing. So maybe... yeah. That could be a validating webhook, maybe. Okay. Yeah. Let's look at that. ValidatingWebhookConfigurations. Yeah. No. All namespaces? No? Okay. I'll also look at mutating just so
27:21 Debugging Node Scheduling (Taints & Selectors)
27:42 we know. We should have done this earlier. Just, like, so we can rule this out. Good. So none of that. Can we look at pods again? Oh, it is not. Pending. Pending is not normally a good thing. Describe pod. What is another piece of the puzzle maybe? Oh god. Okay. So alright. So we have... how are we on time, by the way? I wanna make sure we're doing good. Oh. We're good. We got fifteen minutes there. Oh, nice. Okay. Alright. So, obviously, affinity, there's a taint that we're not tolerating that's called
28:27 master, okay. So maybe we edit the StatefulSet and either remove that or, let's see. Maybe the nodes are tainted. Right? So if I say get nodes, is it show taints? Is that a flag? I know there's a show labels, look at that too. But maybe the labels were edited. Just do a get nodes -o yaml and just let it list them all straight up. Like that? Yeah. And then I do like that show labels. Yeah. It comes in handy. Oh, go ahead. Sorry. I like the way you're, like, spelling out mutating webhook configuration
29:17 as though that's not what you used in yours. So I'm on to you. I'm not gonna answer just to keep things suspenseful. I'm looking for taints here. Should I just grep or something? This is a little noisy. It is noisy. Right. Yeah. Just why don't we just describe one of the nodes maybe or do that. Yeah. Yeah. Taints, so I see here only the master node... I'm guessing this is the master node that has this thing. So we should just edit the StatefulSet, I think, and just look at what the heck is
29:57 going on with affinity or node selector or something like that. What's that? Sorry. Yeah. You're right on it. There's a node selector for name PostgreSQL. Yeah. I don't remember seeing, like, a label like that. But why is it try... Remove it. What... Yeah. We should remove it. I'm just really wondering something else now, is why did it say that it can't tolerate the master. I guess it's just one of the things that failed, I guess. So it's trying to find this label. It can't find it and it also doesn't
30:39 tolerate the master. Yeah. So I think here the node selector is trying to play it. Let's yeah. You're right. Let's walk it through before we go delete it. So let's drop it at s and run get nodes again and then let's describe our control plane node or show labels. Yeah. Maybe that would maybe that shows what we need. So yeah. So we wanna look for like a Postgres here. Sorry. I too much text. Okay. So there's no Postgres label anywhere. Yet somehow it was complaining about the not tolerating the master taint or something that looked kinda like this, like mast IO slash
31:17 master. Yeah. I expected to see a label for the node selector to resolve to the control plane and that didn't happen. That's an inter... yeah. I find that interesting, that the error is almost misleading. Let's look at it one more time. Sorry. So we solved that. Oh, sorry. The pod. So describe pod, postgresql-0. Too many s's. Sorry. Yeah. So... That's the only clue. That's the only hint I give. Very generous. Thank you. So oh, I see. So one node had a taint, master, that the pod didn't tolerate. Three nodes didn't
32:08 match the pod. Okay. Got it. Got it. So I only read the first part. So the control plane, it couldn't schedule there because of not tolerating that taint. I wonder if this is like a default behavior. If your node selector fails, does it automatically default to, like, the control plane because it would assume those exist and then the taint there is causing the weird error message. Yeah. So we... Like, if we change the... I'm curious. If we change the node selector to be name Moody, do we still get the same error? Like,
32:33 Teleport Disconnect / Session Issue 2 & Removing Node Selector
32:41 is it defaulting to that control plane? Got it. Yeah. Yeah. Uh-oh. Has it gone again? Think my session... oh, no. Oh, no. Did we cause Teleport to crash again? Oh, no. Yeah. Teleport is down again. I'm on. I often think that this show should be, like, sponsored by Teleport, and then you run into these things. And I'm like, ah, I know why the sponsor's on here. I don't think that Teleport is the problem here. I think it's you, Marques, and your breaking of clusters. Hopefully, it comes back again in a minute.
33:29 I really do think it's just timing. Yeah. Do you have any rogue processes on this machine doing nefarious activities? Is there a Bitcoin miner on my hardware, Marques? There are no Bitcoin miners on your hardware. Only Ethereum, Cardano, and... Yeah. The comment. I saw messing with Teleport was a big no-no. It is a big no-no. We don't break Teleport. I think we're back. I think it just came back online. And what are all the rules again? Just to restate them all. Like, rules and what the expectation is. You cannot break the
34:07 bootloader, and you must not break Teleport. And then what is the goal of the whole session, to have what happen? For us to be able to modify the Klustered deployment from version one to version two and then browse to it and see a cloud native quote. Okay. So let's test our theory there. We are back. I don't know. What kind of hardware did I throw in this? Is this just a shitty box? I don't know. Yeah. Oh, yeah. At this point, we're investigating the Teleport issue, but I think we should maybe
34:48 Yeah. This box is fine. Okay. Hopefully, it doesn't happen again. Okay. Let's so I had a theory and I'm curious, but please if you've got anything better, let's say, let's jump. Yeah. Yeah. Shoot. But I think I wanna try editing this node selector to see if Oh, yeah. Still get the same message. I'm wondering if it if it's doing some sort of weird default policy here where if it can't match the selector, it just says, okay, you probably just wanna run this on the control plane. Although I would expect a better error message. Yeah. Let's take a look.
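The scheduling message puzzled over here ("one node had a taint the pod didn't tolerate, three nodes didn't match") can be reproduced with a small filter model. The Python sketch below is a simplification that assumes each node reports only its first failing check and that taints are checked before the node selector, which is consistent with the message read out but is not the real scheduler code. The taint key `node-role.kubernetes.io/master`, the label `name=postgresql`, and the node names are assumptions for illustration.

```python
from collections import Counter

# Assumed taint key; the captions garble the exact string.
MASTER_TAINT = "node-role.kubernetes.io/master:NoSchedule"


def first_failure(pod, node):
    """Return the first reason this node rejects the pod, or None if it fits."""
    for taint in node.get("taints", []):
        # Simplification: a toleration is modeled as the exact taint string.
        if taint not in pod.get("tolerations", []):
            return "had a taint that the pod didn't tolerate"
    for key, value in pod.get("nodeSelector", {}).items():
        if node.get("labels", {}).get(key) != value:
            return "didn't match the pod's node selector"
    return None


def schedule(pod, nodes):
    """Return (feasible node names, count of nodes per rejection reason)."""
    feasible, reasons = [], Counter()
    for node in nodes:
        reason = first_failure(pod, node)
        if reason is None:
            feasible.append(node["name"])
        else:
            reasons[reason] += 1
    return feasible, dict(reasons)


# Hypothetical four-node cluster: one tainted control plane, three workers.
NODES = [{"name": "control-plane", "taints": [MASTER_TAINT], "labels": {}}] + [
    {"name": f"worker-{i}", "taints": [], "labels": {}} for i in range(1, 4)
]

if __name__ == "__main__":
    print(schedule({"nodeSelector": {"name": "postgresql"}}, NODES))
    print(schedule({"nodeSelector": {"name": "moody"}}, NODES))
    print(schedule({}, NODES))
```

In this model no node is treated as a fallback: the control plane is rejected by its taint regardless of the selector, and any selector value that matches no worker label yields the same one-plus-three split, so only removing the selector (or labeling a worker) lets the pod land.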
35:25 Oh, it looks... oh, wait. Seventy five seconds feels too long ago. I'm gonna delete it. Yeah. Sounds good. 3Period04. The sledgehammer just keeps coming. Yeah. I know. I used to tell a story about when I was an SRE formerly. And like if I got paged in the middle of the night because a disk would fill up, I would just blow away Varlib log and go back to sleep. I do like my sledgehammers. Awesome. Okay. So... Same message. Exact same error. So I guess the part I'm looking at is the second part. Three
35:59 StatefulSet Pod Runs & Cluster 1 Appears Fixed
36:06 nodes or, I guess, worker nodes probably didn't match the pod's node affinity. So let's just take it out. Yeah. Let's just remove the node selector. Do you want me to do that? Yeah. Go for it. Yeah. And let's just throw it in there. Let's make sure there's no other scheduling annotations or anything. Okay. Yeah. So let's look for tolerations. Nothing. Let's look for affinity. Nothing. Nothing. I think this should be it. Yeah. Let's test. Let's see. Edited odds. Come on. Sure. I love how we're just killing database pods. Like, it's like, oh, yeah. It's just a database.
36:54 Just kill it. The data's loaded in a container, but we are past pending. So... Oh, yeah. Okay. Looks like we... yeah. We got past this part. Let's see what else is gonna come up now. Alright. It's good. Awesome. Okay. Let's try and browse to our application. I'll run a port forward locally. Okay. I was trying to be time sensitive too. Like, I'm wondering if I'm, like, coming right up on a forty minute solve. But, yeah, as I'm seeing things, it's like, you know, there was that opportunity. There was that thing I considered doing that
37:12 Testing Application Access & Confirming App Version (Cluster 1 Fixed)
37:31 I didn't do, and that would've been great right about now. Yeah. I mean, did we finish do you think we finished everything that you wanted us to work on that you planned? Did your evil plan work, or did we not finish the task? Well, I Alright. Our application is working. Although the video doesn't work in Firefox, so let's confirm. This this website broke my Chrome really hard. Like, the the Chrome the UI, the the the borders the window borders on a Mac were flickering red, which I don't think is a thing that you're supposed to be able to do.
38:07 It was only my only time I've ever seen it is loading that website. I've I've had this Chrome issue. I had to turn off I've I hate this bug, by the way. It's a bug in Chrome when accelerated graphics are turned on. If you turn off accelerated graphics in Chrome, it might fix the issue. And Yeah. It's We have a fixed cluster. Do we know that it's the right version? How do you know if it's the right version? Oh, the version of The image? The image. Yeah. It's a different video. It's a different video. I think the text If you can't see,
38:53 it's a different thing. Come on. Work. I I think the title on top was was different. May your bag something something? I don't know. Oh, yeah. That's it's just cloud native quotes from It's a brand name. People of the Kubernetes community that tweeted. I mean, I can always open this in Chrome and then my computer Just see, like, what the running pod is. There's something that I'm not sure if it's if it's working as expect if it's not working as expected. Alright. Let's try opening Chrome. And, hopefully, my computer doesn't fall over. Here. If you could just check the
39:34 I'm a fan, dude, that you need proof. I mean I need proof. I need proof. He needs proof. No. I maybe that's I expected something I mean, there's still something that's staring you in the face that's, like, you're ignoring. But it's it's not a problem if if the site works, so fair enough. Nice. Yeah. He's giving us a clue a clue, David. He's he's trying to help us. Yeah. Well, I mean, if your if your perspective is that everything is working, then everything's working. So I I can't I can't argue with that. Port forward. Running container
40:13 ET. Okay. Reload. I really have to rebuild this app and make it less shit. Yeah. I I struggled with the video streaming as well, but I don't know why. Yeah. I don't I didn't I I don't expect that I broke the video. I'm just how do we know that this is the right version of of the app? Let's yeah. Let's let's look at the the actual deploy the pod, maybe? The pod describe the pod and see what the heck, yeah, that's going on. It could it could very well be the right one, and I'm just
40:55 I'm just salty that things didn't go as planned. Gonna yeah. I'm gonna describe the pod and see. It's saying the image looks right. Yep. V two It looks good. Alright. Salty Marcus is then we Salty is. Yes. Fixed the cluster just in the nick of time. Good job. Nice. So so are you gonna give us the the debrief, Marcus? Yeah. So, I mean, the the thing that I thought there there were the thing that I thought would have triggered here is the reason why you have the terminating pod still, and that's that there's a finalizer that I
41:41 put into the into the deployment spec. And when I was when I was messing with that earlier, it seemed like it prevented the scaling it prevented the scaling from working and prevented the upgrading from working or at least that's what appeared to happen for me. So I thought that that would have blocked you from getting the new pod. I kinda thought it would have torn down the old before starting up the new. So I guess that's a a scaling parameter that I could have specified, didn't specify. You know the position on finalizers. Like, I'm actually offended that you even went down that
42:21 route. Yeah. I it was just an annoyance. Purely for deleting all finalizers and clusters. They get on my nerves so much. Like, I understand why they exist. They're amazing except when I wanna do something with a sledgehammer. And then I'm like, no. Go away finalizers. Awesome. Yeah. I think that Yeah. Oh, go ahead. Yep. The other thing that I wanted to, like, keep the entire problem scope, like, limited to things that a user could shoot themselves in the foot with. I didn't reach out to, like, system processes or configuration, but it it's fun to me to watch you, like, suspect that
43:04 it might be something more nefarious. I think David's been bit too hard. I think he's, yeah. This is PTSD right there. Yeah. After 13 episodes, I trust no one anymore. Alright. Awesome. Good work, Marcus. Great work fixing that. That was awesome, Marcus. Yeah. Thank... Let's see. Now it's time to cry. Yeah. Let's see what else we got. This is cluster 30. Marcus, I will open a session and type hello. Please let me know when you are available. And I will make the font a bit bigger. Oh, I got that weird bug. I think
43:30 Transition to Cluster 2 Debugging (Moody's Cluster)
43:46 that means you've maybe joined. There we go. I'm just gonna reload now because it's annoying. I need to reach out to Teleport and ask them to fix that, because I wonder if it's Firefox. Maybe I should just start using Chrome. Anyway, let's not digress. Let's set up our kubeconfig and then work out which awful things Moody has done to this Kubernetes cluster. I'll give you guys one thing that I wanna say, which is that there's a text file in your home directory. That's all I'm gonna say. If anyone has seen Klustered before, it is
44:23 never a good sign when someone leaves you breadcrumbs. Good. Alright. Nice SKR. Alright. My NGINX pod won't come up for some reason. Can you please take a look? Securly, I'm at. Alright, Marcus. Do the thing please. Do we have a control plane? You went for API version. That's... oh, I do have... We got a... Yeah. I'm, like, kinda curious if there's anything odd installed. Okay. I think these are all your standard installs. Right? We'll find out all in good time, I'm sure. Let's try some of the basics. Yeah. There we go. We've got our worker nodes.
45:22 We got things and stuff. Okay. It's all stuck in ContainerCreating. Yeah. And those all look like they're in bad shape. So let's go straight for the Podigo. Take a look at a deployment first. Looks good. Let's describe that Klustered pod and see if we can get anything there. Oh, you're gonna need the name again though. You have... It's like the various forms of injecting things into the shell have changed, so I'll just throw it into temp where somebody else can write into it for two seconds before I source it. And now we can have more fun.
46:37 Hopefully. I find a lot of the time the completion doesn't always work, but let's hope. Oh, it doesn't have to spelling clustered. Right? Did I? You spelled cub. Cub. Cub? Alright. I see q. At the end, you put cub and then hit tab instead of cla and then tab. Oh, alright. Sorry, guys. Sorry, guys. This does work. Yeah. Yeah. Nice. Yeah. And then I'm gonna avoid the tab complete and try to copy paste it. Yeah. We did defeat the purpose. Alright. Failed to find plugin, CNI. Awesome. Great. Let's check our kube-system namespace
47:28 and see if our extended control plane is happy. Yeah. Okay. Let's get pods everywhere and see if anybody is acting unusual. Nope. Nope. Nope. Nope. That's a lot of broken things. So, yeah, we don't have our Cilium, CoreDNS. CoreDNS, Cilium. Who else is broken? Take inventory. kube-proxy. And it's image pull backoff. It's episode 13, so you can't expect anything good. Do you know? CCM's busted. That'll allow b is not happy. CSI's... we're in trouble. No. I think this is a cascading failure. The fact that we're seeing so many
48:18 ContainerCreating tells me... I think that's a byproduct of Cilium potentially being fucked. So let's focus on Cilium first. Maybe. Cilium deployment or is it a... That's even cycled Cilium. And maybe in the Cilium namespace. I think Cilium goes in... Oh, okay. This one's kube-proxy. Do we have any hints in here? No. I wouldn't have thought so. Alright. Cilium. Let's take a look at one of the pods. I think we need to get... What have you done to this cluster? Alright. Let me do a little bit of... because I'm just... Oh, see what
49:28 yeah. Okay. I see it when I'm like scroll wheel scrolling, it's for my own benefit. Nobody sees me doing that. Yeah. Yeah. You you pager. Oh, look at that. We got our first honky honk. Okay. That at least is an easy fix. The plot seconds. Is there any containers? That's the only image here. And this is you're looking at the the daemon set? Yeah. So is it they modify the replica set or the pods? Yeah. There won't be back propagation on any of that stuff. Right? So oh, no. But the daemon set create doesn't create replica sets. It's just straight up pods.
50:29 But how is the image changing from the daemon set? This could... if you added a containerd on this. I'm looking to see if he's laughing. Are you familiar with that feature, Marcus? Poker face. That's my poker face. This is... I hate it. It's the worst option ever, but someone bit me over it in the past. What is it that you're suspecting? Okay. So containerd... I was thinking that maybe the webhooks, that maybe he's taken us down that path since he pretended to not be able to spell it. You are 100% correct that we
51:08 should definitely check for mutating webhooks. But the thing that really bothered me though is the daemon set hasn't been updated. And if there was a mutating webhook, we would still see the change here. Right? Because this comes from the API server. I need to satisfy my curiosity. There's a dump config on this but I can never remember how to do it. Where is it? Config. Gonna have to go to it. Is it containerd? Yeah. Config. So this is the containerd configuration. And it allows you to say, any time someone requests this image, give them
52:17 something else instead. And you can actually swap out the images that way. Oh, we should probably take a look at this file as it exists, but I would expect to see something in here. There was no honk. Right? Oh, there we go. There's a plugin for honky.io. So honk. We will remove all honks. I've I've actually never been to that domain, so I don't recommend anyone to try. It's IO, so it's gotta be tech related. Okay. So we need to restart. That was one of the paths I was gonna I was gonna register something similar to
53:11 ghcr.io. Unicode advantage of DNS. Yeah. No more Unicode changes with o's that look like o's. I'm not having it. Done. Alright. Where are we? So, get pods. I'm gonna have to roll those out, aren't we? Do you remember the rollout command? Or will I just go sledgehammer? We tab our way through it. Yeah. Go for it. I think it's rollout restart. Yeah. And then, yeah, daemon set Cilium. Cool. I like it when people configure tab complete. I think that's my lesson for the day. Which one in the Cilium? Yeah.
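The containerd mirror trick found above can be checked for mechanically. The sketch below assumes the registry mirror section of the containerd config has already been parsed into a plain mapping of registry name to endpoint list; the `honky.io` endpoint comes from the episode, while the trusted host list and the function name are illustrative assumptions. As the fixers discover shortly afterwards, a check like this has to run against every node, since each one has its own containerd.

```python
from urllib.parse import urlparse


def suspicious_mirrors(mirrors, trusted_hosts):
    """Return (registry, endpoint) pairs whose endpoint host is not trusted.

    mirrors: mapping of registry name to a list of endpoint URLs, as it might
             look after parsing the registry mirrors section of a containerd
             config (the parsing step itself is not shown here).
    trusted_hosts: set of hostnames that are acceptable mirror targets.
    """
    findings = []
    for registry, endpoints in mirrors.items():
        for endpoint in endpoints:
            host = urlparse(endpoint).hostname
            if host not in trusted_hosts:
                findings.append((registry, endpoint))
    return findings


# Hypothetical parsed config: docker.io pulls are redirected elsewhere.
mirrors = {
    "docker.io": ["https://honky.io"],
    "quay.io": ["https://quay.io"],
}
trusted = {"registry-1.docker.io", "quay.io"}

for registry, endpoint in suspicious_mirrors(mirrors, trusted):
    print(f"{registry} is mirrored to untrusted endpoint {endpoint}")
```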
54:04 Alright. Let's keep an eye on those pods and see if we can get some image pulls at least. No. Alright. Let's describe that first one. That's how long ago? That's about when we did it. Right? Let's check that containerd config again. Is there maybe... I can... do you guys want a hint? No, not yet. You be quiet. You little person. I'm not here. Okay. If the honk's definitely gone and I did restart containerd, maybe we need to give the kubelet a nudge. I'll pass the control over to you. You wanna restart that? Yeah.
55:20 It says container gaze. Okay. Because I would expect that error to have been fixed. And you know what? I am just gonna go sledgehammer here. Actual sledgehammer. The problem with the sledgehammer approach is sometimes you make it worse for yourself. Image pull. I'm assuming that's gonna give us the same error. So did my containerd restart not work? Oh, the shims. That shouldn't matter. This is where my knowledge of this is gonna be... Like, containerd is responsible for pulling the image. Right? So this process was restarted. I'm just sort of wondering if
56:18 he also played the finalizer game with those not deleting. Container. They're still honking at me. Extra containerd configs? I can't remember why. I wanna tell you guys so that you don't get too stuck in this stage. Like, does containerd run in one place? Of course. You modified every host. That's it. All future episodes are getting one machine. But we only need one machine to make it work. Right? Yeah. I should have thought of that. That was me just being silly. I always fall into that trap because we're always
57:11 treating Kubernetes as this distributed thing. And then containerd is that one thing that's, just like the kubelet, all over the place. So... Yeah. I guess you would have to set it on them all. Otherwise, it would be an intermittent bug and that would be a pain in the ass. You're actually being kind with your devilishness. Alright. So refrain. I'll do it the nice way. Hey, we have a pod initializing. That's better. Alright. Thanks for the heads up. Oh, fuck off. Alright. Colin says check crictl logs. It's still honking. Is that expected,
58:18 Amit? Yes. Yes. Okay. Yeah. Yeah. Alright. Just checking it's not just me being an idiot with my containerd restarts. I think, Marcus, you were probably right earlier. Will we check for mutating webhooks? What was the other... Validating, but format. Validating webhook configurations. But they can't actually modify the resource. So okay. Let's think this one through. I'm curious now whether the daemon set has been modified or there's still something happening on the host. Let's check the image. Yeah. Okay. So something's happening between this and the controller to run that. And if there's no... sorry.
59:41 Go, Marcus. No. There's a dump command that... yeah. I've been trying to find ways to just get everything, and then we just search everything in one blast for honk. Uh-huh. I thought it was a dump command. What is it, though? I mean, we could do get all -o yaml, grep honk. Yeah. You've got a lot of honks. Okay. So... Yeah. All only shows a few resource types too. I was hoping for something that would show even more. What's the format for the before and after?
1:00:23 Oh, there we go. Yeah. There's also... well, in git grep, I recently learned about -p, which gives you the context of the function you're in. I'm not sure what that would do in YAML and if that's supported in this one. Alright. We need more lines. I don't like all this honky business. You can also just pipe it to less and then use less to search. Oh, without the grepping. Just use less for... Yeah. That's right. Bet right there. In fact, why don't you just take over? Thanks. Alright. Honk one. Cluster direct like a set.
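The "search everything for honk" idea loses context when done with a plain text grep, which is the problem being worked around here with extra lines and less. A rough alternative, assuming the resources have already been loaded into Python dicts and lists (for example from JSON output), is to walk the structure and report the path of every matching string. The function name and the sample pod below are hypothetical.

```python
def find_strings(obj, needle, path=""):
    """Recursively collect (path, value) for every string containing needle."""
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else str(key)
            hits.extend(find_strings(value, needle, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            hits.extend(find_strings(value, needle, f"{path}[{index}]"))
    elif isinstance(obj, str) and needle in obj:
        hits.append((path, obj))
    return hits


# Hypothetical pod object; the image name is made up for illustration.
pod = {
    "kind": "Pod",
    "metadata": {"name": "cilium-abcde", "namespace": "cilium"},
    "spec": {"containers": [{"name": "cilium-agent", "image": "honky.io/honk"}]},
}

for where, value in find_strings(pod, "honk"):
    print(f"{where} = {value}")
```

Reporting the path makes it obvious whether the match sits in a pod spec, a controller's template, or only in an annotation.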
1:01:10 Did we get the name letter? I saw that all the managed fields weren't in a more recent... Yeah. 1.21. Using. And this is a 1.20 cluster, sadly. It says it's Klustered itself, the Klustered pod. Are they all this? Did we already look at that? Let's do the get all in this... I mean, we've got bigger issues still. Right? Yeah. We've got bigger issues than the Klustered pod. Let's do a get all on the Cilium namespace and less it for honk. That was silly of me. Alright. So there's that image.
1:02:13 So every image... That's the best. Every image everywhere is the honk image? Is that maybe what we're getting at? So that's interesting. Something's modifying the pod spec, but not the daemon set spec. Image. Yeah. So we never actually checked that. Do you wanna describe that pod? And instead of just looking at the header, we'll see the image is the honk. So... Yeah. Yeah. Cool. So the image on them is... yeah. That makes sense. I don't know how it's getting in there, but it makes sense now. So the daemon set... what is this magic? There's a comment from,
1:03:16 Khaled who says, try another name. Also, the fire flash says there can never be enough honks and rick rolls. Right. Could the crictl logs tell us anything useful? We've that. No. This isn't containerd anymore because the pod spec has actually been modified. Yeah. So... Alright. But the DaemonSet has the correct one. So I'm trying to work out... Anything running? Because whatever is running could be the instigator. And we also know... oh, no. We don't. I was gonna say we know that this Teleport session was working, but you say
1:04:21 that's not managed by Kubernetes. So this is the static manifest. And we've got our kube-controller-manager here, which has not had this timestamp reset, which is always fun. And I'm looking for a... So, like, you can have mutating webhooks, which are dynamic admission controllers, but we can also have static admission controllers defined in this file. And I'm curious if there's just something in here. Yeah. Looks good. The honk controller. Yeah. Get the image. Yeah. Sneaky. Alright. So I can give you guys the actual image that was there before. Would that help you
1:05:05 Fixing Controller Manager Image
1:05:25 guys? Yeah. So it's k8s.gcr.io/kube-controller-manager. Is that right? Yeah. And it was version... version number? Version 1.20.4. Yeah. Yep. Sneaky. I think there's a v in the tag. There's a v at the beginning of the version. Yeah. Awesome. So our controller. There we go. So you modified that. Yeah. Of course. Like, so the daemon set doesn't have any other controllers that create the pods except speaking to the pod control. So you just modified the controller manager to actually change the image on all pods? Exactly. Yeah. On the
1:06:23 create pod step, so there's a part that where it creates pods. I can actually put the code here somewhere if you if you if you want. But it's like a little block that basically says replace every container's image to honky honk, basically. So the container d alias was really just to throw us off this the track. Right? So so it wasn't a throw off. It was another piece of the puzzle that I was trying to add. So so I was trying I was hoping that after you do this fix that you would have other things to look for.
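The effect of that change explains the main puzzle of this cluster: the DaemonSet spec looked correct while every pod it produced carried the honk image. The following is an illustrative Python model only; the real modification was a few lines of Go inside the controller manager, and the image names and function names here are hypothetical.

```python
import copy

# Hypothetical replacement image, standing in for the one used on the show.
HONK_IMAGE = "honky.example/honk:latest"


def create_pod_from_template(template, rewrite_image=None):
    """Model of a controller stamping out a pod from its owner's template.

    When rewrite_image is set, every container image is replaced at creation
    time. The template (for example a DaemonSet's pod template) is never
    touched, so inspecting the owner shows nothing wrong.
    """
    pod = copy.deepcopy(template)
    if rewrite_image is not None:
        for container in pod["spec"]["containers"]:
            container["image"] = rewrite_image
    return pod


def image_drift(template, pod):
    """Return {container: (wanted, actual)} where the pod differs from the template."""
    wanted = {c["name"]: c["image"] for c in template["spec"]["containers"]}
    actual = {c["name"]: c["image"] for c in pod["spec"]["containers"]}
    return {name: (wanted[name], actual.get(name))
            for name in wanted if wanted[name] != actual.get(name)}


template = {"spec": {"containers": [
    {"name": "cilium-agent", "image": "quay.io/cilium/cilium:v1"},
]}}

pod = create_pod_from_template(template, rewrite_image=HONK_IMAGE)
print(image_drift(template, pod))
```

Comparing the owner's template against the live pod, as `image_drift` does, is the check that rules out containerd: a registry mirror changes where an image is pulled from, not the image string recorded in the pod spec.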
1:06:59 Actually, let's see if any of them worked, but something tells me they probably didn't. Like, my own disruptions don't seem to have taken effect, but I tried a couple of things. I did the containerd configuration to basically say, instead of pulling from docker.io, pull from somewhere else, honky.io. And then the other thing I tried to do was something very similar to what Marcus did, which was to change the service account image pull secrets, right, so that it fails to pull the image. So well done, Marcus, on that.
1:07:47 One of one Cilium looks good. Yeah. The image put back also the older ones. We need to roll out the... you know what? Can I just slide time on this? Needs a -A for good measure. That may be step two, I think. But that should get us at least the Cilium daemon set running, the Cilium operator running, which we also need for all of this to even work. It should hopefully allow us to start restarting pods in other namespaces and not be fumbled by the CNI plugin. But we've no idea what else, if anything else
1:08:07 Restarting System Pods (Cilium, etc.)
1:08:27 to expect. That's looking healthy now. You wanna start in a bit of digging and see what else we've got? So what do we know about... our Ciliums are coming up. They look like they're okay? Like... Yeah. They are. They look... Alright. Let's take a look at the Klustered. Line wrap is messing me up here. Okay. And the pods, those... Oh, fancy. It worked. Oh, yeah. Because the name is different, but now I can tab complete it. Right? True. And so we're still... yeah. We need to rotate these because we're still getting honky.
1:09:42 And you killed everything in Cilium namespace or everywhere? Just the Cilium namespace. I'm curious about those NGINX deployments and why they're there. But I'm gonna ignore them for now. It's The user decided to create their own NGINX deployments. I don't really believe it's NGINX. I think that's maybe that's my cynicism coming through. Maybe it's just NGINX. Maybe you just really wanted the hello world NGINX thing running on my cluster. That's fine. So what did I get pods show? Anything? I was just looking for those finalizers because those deletes didn't go through there. What did the get pods show? Yeah. Just get
1:10:35 pods. That's vanilla. Yeah. Let's get rid of that. Okay. Good. That looks a little too good. Alright. Let's do the test. Let me port-forward. You got it? Or do we need to check something about the NGINX app as opposed to the Klustered app? The NGINX is not mine. I'm gonna ignore it for now. I am gonna do a port-forward to our Klustered pod though. Looks like it's spinning, which would tell me we probably got a failure to reach the database. Yeah. We don't need to wait for that. So either... yeah. I'm assuming CoreDNS probably
1:11:41 needs to be restarted. Right? Do you wanna do a check of kube-system or something? Yeah. Yeah. We've still got carnage in here. I'll let you decide the best way to get all those pods running again. How long do you have? Don't try this at home, yeah, for the audience. Never... I mean, this will recover. It will recover. Yeah. You're right. That's the point. But not something we recommend running in the real world. Probably a bad idea. Oh, look at that. Kevin's just signed up to be on the show. There you go. He's got some naughty ideas
1:12:28 Applying App Update (v2)
1:12:34 he wants to try. I'll reach out to you afterwards, mate. Does that help our port-forward at all? I would suspect. Or maybe I hope. Not sure what the right word is there, but yeah. We got our app. So we should try and update to v2. Yeah. Does port-forward go through kube-proxy, I think? Yeah. Well, no. Or does it? Always forget. Yeah. Oh, sure. Yeah. The networking state of Kubernetes is where I've been completely ignorant my entire Kubernetes career. It's just that other people deal with that. I don't know the fancy command to
1:13:20 upgrade the v1 to the v2. So I'm just gonna do it the dumb way. We're going to two? Yep. Looks alright. They're creating. I mean, that's hopefully just pulling the image. How big is this image? It's just a Rust binary. Although it does have an MP4 in it, or .mov, whatever it is, with the thing, but it's not too big. That seems unusual. Yeah. It does seem unusual. I'm looking at you, Moody. Well, this is unexpected, so I'll do that. It should... yeah. I think this is this
1:14:17 Debugging Slow Image Pull & Final Fix
1:14:42 is probably more a GitHub thing than... yep. Or is it, actually? Yeah. I don't know if there's a simple way to pull images just to test that, that we could do. I've never done that with crictl or anything. All images. That is gonna list them. Oh, wait. That was the first one. I'm looking at these clusters. So pull the r. I couldn't figure out how to use, like, crictl to run an image. It seemed like to do so, you need to set up a manifest. Like, you need, like, a
1:15:30 at least, like, a five line YAML, and I just couldn't accept that. Was looking for ways to run it. I read the answers. There we go. I ran into this. Yeah. You need the runtime endpoint to be the... Oh, yeah. Yeah. The run socket thing. Yeah. The containerd socket, I think. Yeah. I don't know why it's not the default. I always wondered that. Yeah. Oh, is it slash run? I think it's slash run. There we go. I just missed a containerd. Correct. Yeah. Yeah. Nice. I think it's runtime endpoint.
1:16:12 I don't know if we should be seeing any progress. Unless we... Okay. I mean, that should be fast. Yeah. Okay. So maybe it's just... okay. Something wrong with DNS networking. Dig at google.com. I'm using Cloudflare to do Google. That made me laugh at least. I don't think we've got DNS. There's a debug flag that we could try in that crictl. Awesome. Holy. I don't know if it's just pulling in slowly or... I'm gonna try it locally. Let's try it. Maybe it's image. Maybe we could try an entirely different image. Oh, no. It pulled.
1:17:16 There we go. Oh, it pulled. It's just maybe just slow. Yeah. That's it. It worked. Yeah. Good job, guys. So yeah. Is there anything else? I think you did. There was one more thing. I mean, it's so subtle. I didn't think you guys would notice, but in the containerd configuration, there was a platform field where I put ARM64, hoping that it would only pull ARM64 images, manifests, so that what would happen is you would find other unexpected issues in the process itself because
1:17:24 Recap of Breaks & Conclusion
1:18:01 the ARM64 version of NGINX, for example, would have completely different behavior on an AMD64 platform. But, yeah, I couldn't get that to work. So if anybody knows, I would love to get some input. But yeah. Alright. Awesome. Great breaks, both of you. Great fixing too. Terrible. They were terrible breaks. Nobody should have to ever deal with these problems. I don't know why everything is so, like, happy. You know? But these things do happen though. Right? Like, misconfiguration is definitely entirely possible. I mean, I'm not sure I'd ever end up with a honky controller manager. That was
1:18:39 a nice touch. But I did not expect you guys to go that route at all. That was that was unexpected, like, trying to debug Cilium first. It made sense, but it was never part of the plan. But I can see how that ended up that way. The idea was to kind of scatter everything so that you kind of figure out what the root cause is. So it kind of hopefully, you guys have enjoyed it. Yeah. Yeah. Running get pods all and seeing a literal clusterfuck on every pod was not something I expected to see. Yeah. So where where is the code for
1:19:14 your controller at? It's... I don't know if I can... can I share screen? Maybe I can. You can. Yeah. If you want. Okay. Let me see if I can... I'm curious about, like, the exclusions that you had to bake into it. Yeah. Here. Let me try this share screen window. This guy. Does that work? Yeah. Yeah. Cool. So if you guys can see, in the Kubernetes project under pkg/controller, controller utils, there is a function called CreatePods. You can guess what that does. It basically creates pods for any
1:19:55 controller that's inside the controller manager. For example, the daemon set controller, the replica set controller, the deployment controller, the stateful set controller, they all basically call this function. Yeah. Sorry. So, yeah, you can kind of read this four- or five-liner. That's exactly what I did. Yeah. Awesome. It would have been interesting if there was, like, an init container, and we would have noticed that, yeah, it didn't pick up that. I was thinking about ways to use an init container for mine to
1:20:36 to just cause extra havoc, but didn't. In this case, that would have maybe... well, I don't know if they would have pointed us in a better direction. Having the honky controller manager, let's say, in retrospect, it seems like that's the only way that could have been done. Yeah. Yeah. Like, the part where you look at the deployment, you're like, that should be the image that gets created and it's not. Right? Like, so that was the main clue or puzzle. Yeah. Awesome. Well, thank you both. Two very different
1:21:12 breaks there. We got the nuclear approach from Moody. We got the death by a thousand cuts by Marcus. Both very interesting to fix. Thank you both for joining me. It was a real pleasure. I had a lot of fun with these clusters. And hopefully, the cluster 13 curse will not follow us into future episodes. Alright. Well, that's us for today. Thank you for taking the time out to join me. If you're watching and you wanna take part in Klustered, reach out to me in the Discord or on Twitter. We're always looking for new people and new breaks and new fixers.
1:21:44 And it'll be a pleasure to have you all. Okay. Have a great evening, afternoon. Yeah. Afternoon, I guess, for you both. Thank you very much. I'll speak to you all soon. Take Friday off. Cheers. Bye. Bye.