About this video
What You'll Learn
- Remove node taints and unschedulable flags to restore pod scheduling on a disrupted cluster.
- Audit and repair CoreDNS failures by fixing ConfigMap entries and restarting DNS pods.
- Recover a Kubernetes control plane by correcting etcd permissions and resolving loopback mount disk exhaustion.
Adam Szücs-Mátyás and William Lightning debug two broken Kubernetes clusters. Fixes cover cordoned nodes, a CoreDNS ConfigMap, a Harbor image redirect via containerd, plus etcd permission and disk-full recovery from a loopback mount.
Jump to a chapter
- 0:00 Holding screen
- 1:42 Introductions
- 1:43 Intro, Housekeeping & Sponsor
- 3:01 Guest Introductions
- 4:12 Is it Scary or Fun to Break Clusters?
- 6:42 Debugging Adam's Cluster (Part 1)
- 6:45 Cluster by Ádám Szücs-Mátyás
- 7:30 Initial Checks: Nodes Unschedulable
- 8:30 Application is Down
- 8:55 Incorrect Application Image
- 11:42 Fixing Unschedulable Nodes
- 15:57 Database Connectivity Issue
- 16:08 Debugging Postgres
- 20:50 Checking Network Policies
- 23:10 Network Debugging Tools & Failed Ping
- 31:06 DNS Resolution Failure
- 33:24 Fixing CoreDNS Config
- 34:17 Restarting CoreDNS & App Works
- 36:11 Upgrading Application to v2
- 37:17 Adam Reveals His Breaks
- 39:04 Debugging William's Cluster (Part 2)
- 39:15 Cluster by William Lightning
- 41:10 Control Plane is Down
- 41:50 Investigating Static Manifests
- 43:55 Flushing IP Tables
- 46:01 API Server Error: Cannot Reach etcd
- 46:58 etcd Permissions Issue
- 48:09 etcd Disk Full Issue
- 50:07 Fixing etcd File Permissions
- 54:18 Disk Space & Loopback Mount Issue
- 57:56 Searching for etcd Backup
- 59:50 Restoring etcd Data & Control Plane Fixed
- 1:02:41 Nodes Not Ready (CNI)
- 1:03:53 Scheduler Crashing (Volume Mount Issue)
- 1:08:54 Manually Scheduling Pods & App Works
- 1:14:50 William & Adam Reveal Breaks
- 1:17:10 Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:43 Intro, Housekeeping & Sponsor
1:43 Hello, and welcome to Clustered on the Rawkode Academy. I'm your host, David Flanagan, although you will know me across the Internet as Rawkode. And we have a great episode of Clustered today with two fantastic guests. Before we move on to that, there's just a little bit of housekeeping. Please remember to subscribe to the channel. The button is right there. Click the bell and you will get notifications for all new episodes of the Rawkode Academy. We also have membership options available where you can join for a modest fee and have access to some of the new courses that are
2:14 being released on a weekly basis. We are currently working through the InfluxDB complete guide. We also have a Discord server available at rawkode.chat where you can come and chat with 500 other cloud native and Kubernetes enthusiasts. Alright. We're gonna say thank you to our sponsor. Teleport has recently started sponsoring Clustered, and it's been amazing to work with them. We've used Teleport on Clustered since the very first episode. It's an amazing tool. You're gonna see it in action as we debug these clusters. So if you want to support the Rawkode Academy, you can go to rawkode.live/teleport
2:54 and check it out. You'll have a lot of fun. I promise you will love it. Okay. Let's get started with some broken clusters and introducing our wonderful guests today. I am joined by Adam and William. Welcome to Clustered. How are you both doing? Thank you. I'm great. And, I'm Adam, and I'm working currently at GE Healthcare. But, probably, it's not too relevant anymore because I just resigned today. So Congratulations. I mean Yeah. I guess. I'm showing something else. So Awesome. Thank you, Adam. And William? Hi. I'm William Lightning. I work for Field Medical Group, an ENT practice consultancy across the
3:01 Guest Introductions
3:43 US. Our jobs are to be kind of the best in the business, and my job specifically here is, I like to think of it as, greasing wheels. People come to me with problems, especially programmers, and I try and fix them, and I try and get them past whatever is blocking them. So I've always told people, you know, my specialty is to fix things. And today, I'm a little nervous because now I have to prove it. Alright. So, I mean, I've never really asked anyone this before, so I'm gonna put you both on the spot for the first time ever.
4:12 Is it Scary or Fun to Break Clusters?
4:17 But is it fun breaking a cluster, or is it scary? Like, what's your thoughts on that? Who wants to go first? Well, you're literally gonna say something. It was scary, mostly because I'd never run it before this week. But it was a lot of fun once I got it going, and it gave me some respect for the Kubernetes developers because it did a lot of healing itself as I went. So, yeah, it's a lot of fun, but it was nerve-racking yesterday at, like, 4:45, and I had to go home
4:57 at five, and my cluster wasn't broken. Well, that is a problem. I mean, most people would celebrate that, but, of course, not when you're coming on to Clustered. What about you, Adam? Scary or fun? Yeah. Absolutely. I had some idea how to break it, but, unfortunately, two of my ideas didn't work for a very different reason on 1.22, because that's the current cluster. And, also, I tried to, over the weekend, prepare for this because we migrated away, like, eighteen months ago from kubeadm-based clusters, and we use clusters based on Talos.
5:36 Those guys were here a few weeks ago. So I tried to install it in Oracle Cloud. I failed, so I'm not sure how it makes it justified that we switched. But I guess I didn't break it so that it cannot be fixed, for sure. And, also, it could be done on Talos as well, but it would be much easier to detect because it has a much smaller attack surface than a regular OS. So Alright. Thank you. And also, we are assured that it can be fixed. That's always nice to hear before we kick the episode off.
6:17 So There are two ways to fix it, at least. Alright. Well, the one is just yet. We're gonna start our cluster in just a moment. So I'll just say that every time I reach out to people to come on and join us on Clustered, people are always really happy to come on and do the tech sweep, but people really struggle to actually wanna break the clusters. And I always thought it'd be the other way around. It's nice to hear your opinion there. But enough with the pleasantries. It is now
6:42 Debugging Adam's Cluster (Part 1)
6:44 time for Clustered. So we are gonna start with Adam's cluster. William, you and I are up first. We've got a slightly different setup today. William likes to keep notes as he debugs some mushroom problems. You will see his notepad at the top right. We have Teleport available here. I'm gonna connect to the control plane node. William, if you could connect to that and give me an echo hello to let me know that you're there and we will get started trying to fix Adam's cluster. You can do it. There we go. Yeah. But one of the
6:45 Cluster by Ádám Szücs-Mátyás
7:24 problems already. That's not there we go. Perfect. Alright. So this is the Kubernetes cluster. I would suggest that you export the kubeconfig and pick any command you like and see if we have a control plane. It's usually a pretty good first step. You've already got your commands there ready to go. Perfect. Oh, yeah. Oh. Oh. Well, that's already a bad sign. Scheduling's disabled. But we do have a control plane. Like, not many people are that lucky. So let's take it. That's good. Alright. So we have Postgres. We have the pod there. Get
7:30 Initial Checks: Nodes Unschedulable
8:09 services. Oh, don't have my shortcuts here. Alright. And we got our service. We got 3,000. So it should be running in its current form. I come from a kind of a development background, so I like to know things are working. Would you like me to browse to the application and check? Yeah. Would you mind? Happy to. So we have it exposed via Teleport secure access. We click launch, and this should create a nice secure tunnel. Our application is not working. I think Adam has been cheeky. Alright. Now it's we got first okay.
8:55 Incorrect Application Image
8:58 So let's take a look at kube-system. Everything's running. So we have control plane, CoreDNS, Cilium, API server, controller, scheduler, the okay. So that's all the things I think I'm used to seeing as of yesterday or the day before. Okay. So Yeah. This looks way too easy. But you know what? This is one of the most deceptive things in Clustered. Right? There's so much attack surface to break things that you just never ever know. Famous last words. Exactly. So there's no logs on the cluster. Your go
9:52 to. I'm really sorry. I'm a terrible developer. There is no log output whatsoever on my crappy application. Really? Okay. Let's see. I know. I Hey. There's there's our output. It was in it's not my image. Well, that might explain a lot. Okay. So I get pods and we'll do Let's see. Can I tab complete? Yes. Alright. What do we got here? Oh, wait. Now supposedly scroll works on this. It does. Alright. Are you using the web client or the CLI? I'm using the CLI. Nice. Okay. Cool. Where's our image? Image ID clustered b. So that doesn't seem
11:03 right to me. Yeah. The Because Go for it. That's somebody else's. Yeah. That is not the Rawkode official stamped Clustered release application. Yeah. I'm like, that doesn't look right. The chat is right there with you. But what I have up is the Git repo, which does have the right image. Yeah. It looks but, of course, our scheduler is disabled, so I'm probably gonna break this. Well, sometimes you gotta, you know, break it before it gets better. See, this is what? Clustered. Can I do this? Well, I would modify the deployment rather than the
11:42 Fixing Unschedulable Nodes
12:02 pods. So the pod spec will be immutable, so I don't think that will work. Oh, that that that's okay. Oh, looks like somebody already applied it. Okay. Now this one, you'll be following me. Mhmm. Okay. So that's I wanna set image pull policy to always, but I shouldn't do that. Not yet. Oh, you're scanning for more changes. There's a lot of stuff here that Chris, we'll just start with that. Yeah. I think that's a good idea. And because we don't have a scheduler or the scheduler is broken at least, I don't think it'll get beyond pending.
13:17 What makes you think the scheduler's broken? Well, on the nodes, it says scheduling disabled for worker one and worker two there. So I don't know if that's a scheduler broken or if it's just tainted. I've never done anything with taints, so this would be fun. It's either tainted or cordoned. Oh, good point. Yeah. Node worker. I probably only need one of these. There's well, there's a taint, clustered app, NoSchedule. Mhmm. So It's also unschedulable: true, which would suggest it's potentially cordoned as well. Is there an idea of which how do I remove a taint? How do
14:19 I remove a taint? That's the good question. But this is what documentation is for. So you can just do a kubectl edit node and then the node name to remove the taint. Oh, sweet. Let's see. Worker. Start with one. That way we have second one. Well, just a bunch of stuff here. Now this unschedulable is not I can change that. Right? You can, but I'd probably run kubectl uncordon rather than modifying the spec. Okay. Yeah. So we're just gonna drop these taints then. Let's try that. Yeah. Go for it. Scheduling is disabled,
15:14 but that's a flag. Okay. So kubectl uncordon? Mhmm. And then just a node name. Says, I wish I knew Kubernetes to understand all of this. Well, watch a couple of them. You'll pick it up really fast. I promise you. Okay. That's ready, and we're running. So how are we running? Are we running alright. If you go to the web page, we can see something different. You have connectivity issues. We are unable to reach the Postgres database. Progress. We're making progress. Okay. I'm gonna do that to the other one, to the other worker, and alright.
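The taint removal and uncordon steps discussed here could be sketched as a small shell function. This is a hedged sketch, not the exact commands typed on screen: the node name and taint key are placeholders, since the captions do not show the real values.

```shell
# Sketch: make a node schedulable again.
# Arguments are placeholders: $1 = node name, $2 = taint key to remove.
restore_scheduling() {
  node="$1"
  taint_key="$2"
  # Show current taints and the unschedulable flag before changing anything.
  kubectl get node "$node" -o jsonpath='{.spec.taints}{"\n"}{.spec.unschedulable}{"\n"}'
  # A trailing "-" removes every taint with this key from the node.
  kubectl taint nodes "$node" "${taint_key}-"
  # Clears spec.unschedulable (shown as "SchedulingDisabled" in kubectl get nodes).
  kubectl uncordon "$node"
}
```

A call might look like `restore_scheduling worker-1 example-key`, repeated for each affected worker.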
16:08 Debugging Postgres
16:29 Let's see. Oh, yeah. When I okay. Let's try that. Oops. That's not the one I wanted. Notes. Let's go. I haven't had to use my notepad much. This is strange. Okay. So Postgres, let's see if we have any output from Postgres. I love Russell's comments. Step two of x completed. Because you just never know how much is broken on this thing until you get to the It's true. What do you mean pod not found? Oh, I'm looking at services. Wait. Was there a service in your definition? I guess there probably was. Okay. Yeah. The Postgres service
17:33 is there. Get pods. That's what I'm man, I cannot type. Alias k. That's the greatest structure in the world. There we go. Like that? Yeah. And then just k everything. These are ones I use a lot. And you only need to start that in court. This may trip me up because I'm so used to all my other shortcuts, but I'm giving myself at least one. Okay. So we wanna do k logs postgres. Now, aliases break tab complete, don't they? They shouldn't. Well, it sure didn't wanna do it. Okay. Database was shut down. Database is ready to
18:38 accept. That looks good. Let's look at our definition because this is a deployment, not a stateful set. Are you sure? Because the name of the pod is dash zero, which Oh, you're right. Just stick with it. So we might have pull up your thing. Edit. Or k. Can I use that? No. I gotta do sts. sts? Yep. Oh, you're I'm gonna have to remember that one. Describe. There's not one for mutating admission webhooks or validating webhook configurations. Those are dire need to I'm so glad that we have not run into that yet so far. Yeah.
19:40 Never too late. Never is too late. You know what? I'd rather describe this thing. Let's edit it. You're gonna edit the StatefulSet? Can I ask you why? Mostly because I wanna see the YAML. I could do get -o yaml. Type that in the last Are you not happy that our StatefulSet is working and our pod is working? Yeah. I mean, kinda. Because it's a liveness probe. I mean, there's something wrong with it. Of course, I could hack into PostgreSQL, but then I'd have to install the Postgres client or find the Docker container for that.
20:28 And that sounds annoying. This should have there's no storage, though. Right? No. So I have this configured to the it's actually the data is bootstrapped through an init container. That's what I'm thinking is that the and it's missing. But if you think back to the error message that we got on our browser here, it's unable to connect and time out connecting to the database. Someone finally messed with iptables? I hope not. There. I'm gonna put this in less so you can see my scrolling. Let's see here. DROP all anywhere. KUBE-FIREWALL.
20:50 Checking Network Policies
21:37 I have a working cluster that I set up. I'm gonna compare the output. Is that cheating? No. You're definitely allowed to compare it. You can do everyone. That's it. There's almost no rules on Clustered. Just don't break Teleport. That's pretty much it. And UTF-8. Right? Oh, really good. I would suggest maybe describing your service. Okay. I don't think we did that yet. Right? No. Because we do have endpoints, which means the service should be able to receive traffic on the DNS name, on the IP address, on the port, etcetera. So what could block that?
22:33 K. So our selector is good. Single stack, 5432, 5432. K. I agree, Russell. Let's see here. Yeah. The chat is with us. No UTF, no Unicode shenanigans, no STD shenanigans, all the stuff that makes me cry. See, I don't know if I can do this from the control plane, but can I ping these IPs? Probably. You should be able to ping the pod IP. Yes. No. That didn't work. You'll need to do it. So you'd probably wanna exec inside the clustered container to do it. Yeah. Other than setting up your networking rules.
23:10 Network Debugging Tools & Failed Ping
23:30 Throw up Oh, no. After your eBPF cat on my machine last week, that is also banned. I'm gonna have to publish the ban list, I think. Yours is a Rust application, isn't it? It is. There is a shell available, and you should be able to do a kubectl exec into the clustered pod, and then you should be able to it's Alpine, so apk update and apk install whatever you need, but ping will be available by default. Okay. So oh, it is Alpine. Okay.
24:12 Exec. Pretty quiet there. You having fun? Yeah. I'm looking at how you try to debug my breaks. So if you need any help, then I can give some hints. Alright. So you have a shell inside. You should be able to run ping postgres. Oh, no. That'll be the service. So ping the pod IP. Although I believe, and I'm sure Noel and Russell can help us out in the chat, that the ability to ping a service IP is coming in Kubernetes 1.23 or 1.24, which is really exciting. Oh. Okay. Well, that ping is not returning anything.
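The check described here, exec into the application pod and ping the database pod's IP rather than the service, might look like the sketch below. Both pod names are placeholders, and it assumes `ping` exists in the application image, as the host says it does.

```shell
# Sketch: ping one pod's IP from inside another pod.
# $1 = pod to exec into, $2 = pod whose IP is pinged (both placeholders).
ping_pod_from() {
  from_pod="$1"
  target_pod="$2"
  # Look up the target pod's IP; a service ClusterIP may not answer ICMP.
  target_ip="$(kubectl get pod "$target_pod" -o jsonpath='{.status.podIP}')"
  # Three echo requests are enough to tell "reachable" from "silently dropped".
  kubectl exec "$from_pod" -- ping -c 3 "$target_ip"
}
```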
24:57 Okay. So, yeah, you've got some sort of policy blocking traffic to that. Where'd you wanna look first? I mean, Cilium is the I think that's a great idea. Alright. One of my colleagues, a former guest on this show, gave me into trouble for leaving Hubble on all these clusters because he said it's too good at network debugging. People still have access to it. Let's just do this. PATH equals PATH colon home. Why are you putting in this Cilium NRA? Well, that's our CNI. Right? Yeah. And so that's I don't know. It's the
26:00 first thing I go to is what's the tool to use for the job. I get rid of that. No. I don't need but, of course, that's happy from there because he operator, it's the versions. Alright. We are halfway through. We got twenty minutes left. I'm gonna throw some suggestions out. I think the chat said this as well, but you should probably look for network policies on the Kubernetes API. Yeah. That's right. Is that gonna be We use the -A and maybe just check on them. Capital A. Okay. So there's no basic standard Kubernetes network
27:10 policies. However, this is a Cilium cluster. So Cilium has its own types of network policies that we could check out. I don't remember them all off the top of my head, but you could run. CNP and CCNP: CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy. Yeah. All of those. So you could run kubectl api-resources and grep for Cilium, and that'll give you a list of all the resources available. But Adam also just listed them. Hey, Welcome. Yeah. CNP and CCNP. It's easy. CCNP. Network policy. And, like, network policy. Yeah. CCNP. Correct. Gotcha. K. And then there's a Cilium cluster wide network
27:59 policy. This is the one. There is a CNP. That's the Cilium network policy. That's the namespaced one. Yeah. Let's try kubectl api-resources. Let's just get a list of them, and no get. That's right. And that's gonna dump us a bunch of stuff, so let's grep for Cilium. Yeah. Yeah. I'd start looking into these things. I guess the CCNP is the cluster-wide network policy. Thank you, Adam. And there's also CNP. CNP? CNP? Yeah. I was really hoping it was a network policy, Adam, what you did here. Mhmm. That's obviously.
29:06 Have you if you need some hints, I can help you out. Okay. So let's talk about the symptoms here, William. We've got what we believe is a working clustered pod. We have a working Postgres pod. We have a service with endpoints. However, we cannot reach the database from the application. We've ruled out network policies, so Kubernetes network policies, and Cilium cluster-wide network policies. And we can get in to the other service. Right? Because that's how I load the web page. So Teleport creates a proxy and uses the
29:49 NodePort service. So you could do a curl localhost:30000 and you should see the same thing. And that delay there means we're gonna get the time out. Yeah. Okay. So that's a NodePort. The other ones are ClusterIPs. Correct. Maybe? I don't know. So why don't? I don't know. I'm trying to think here. What to do next? Oh my god. Sixteen minutes left here. Loads of time. Loads of time. Let's edit the clustered deployment again. Hunch right now is are we actually that's it's hard coded, isn't it? I think I hard coded DNS lookup. I don't
30:48 think it's passed in through configuration because I'm a terrible developer. Alright. What am I looking at? I'm going I have no idea. Basic check of DNS. Is the service named properly in the deployment for clustered? Okay. So I think what we need, we need some more network debugging tools. I would exec back into the clustered pod, and we probably wanna start playing with the dig command. Oh, okay. You want me to get back into Yeah. I think so. Where'd you go? There it is. Yep. You said Bash was available. Right? It is. Yeah. Sweet. Okay. Let's do a if I
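The policy sweep talked through above, standard NetworkPolicy objects first and then the Cilium-specific kinds, could be collected into one function. This is a sketch that assumes the Cilium CRDs are installed under the `cilium.io` API group, which is where current Cilium releases register them.

```shell
# Sketch: list every kind of network policy that could be blocking traffic.
list_network_policies() {
  # Built-in Kubernetes NetworkPolicy objects in every namespace.
  kubectl get networkpolicies --all-namespaces
  # Resource kinds registered by Cilium's CRDs (assumed API group: cilium.io).
  kubectl api-resources --api-group=cilium.io
  # Namespaced Cilium policies (short name: cnp).
  kubectl get ciliumnetworkpolicies --all-namespaces
  # Cluster-wide Cilium policies (short name: ccnp).
  kubectl get ciliumclusterwidenetworkpolicies
}
```

If all of these come back empty, as they do in the episode, a policy is unlikely to be the cause and the search moves on.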
31:06 DNS Resolution Failure
31:33 just maybe a Debian, we'll do an apt update, see if it works. Yeah. I know that. Oh, hey. It is Debian based. Can't remember the name of the package, so let me I'll just quickly guess and type. But I think I can never remember what it's actually called. I really should, like, take note. There we go. dnsutils. That will give you the dig command. So now you can start to query CoreDNS and see if we're getting any responses. Let's check resolv.conf. Yeah. It's a good idea. I can't see that on my local one.
32:17 Here, let me clear this. Yeah. That was try. There we go. Some of the Teleport bits do weird stuff on my end. Okay. So that looks good. Dig postgres. Yep. Fully qualified, just to be sure. Maybe need Yeah. That's a problem. Let's go look at our DNS. Can I I'm gonna pop out of this? Yep. Q. I do appreciate you keeping everything in the default namespace. It makes life easier. We won't do a dash f. CoreDNS. CoreDNS is also notoriously quiet. It doesn't particularly log too much, but it does have a ConfigMap in
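A DNS check along the lines described, reading the pod's resolver configuration and then querying a service name with `dig`, might be sketched like this. The pod name and the service name are placeholders, and the sketch assumes `dig` has already been installed in the container.

```shell
# Sketch: check in-cluster DNS from inside a pod.
# $1 = pod to exec into, $2 = DNS name to resolve (both placeholders).
check_dns_from_pod() {
  pod="$1"
  name="$2"
  # Which nameserver and search domains does the pod actually use?
  kubectl exec "$pod" -- cat /etc/resolv.conf
  # +short prints only the answer section; empty output means no record came back.
  kubectl exec "$pod" -- dig +short "$name"
}
```

With the default cluster domain, which is an assumption here, a fully qualified name for a service called `postgres` in the `default` namespace would be `postgres.default.svc.cluster.local`.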
33:24 Fixing CoreDNS Config
33:31 the kube-system namespace that we probably wanna explore. So I would get the ConfigMaps in kube-system. And there's a CoreDNS one there. I see a honk. A honk? Yeah. I see a break. Got it? Mhmm. Oh, there you go. Yep. But if we just edit this, it'll pick it up. Right? Might have to kill off the pod. This wonderful thing that nobody in Kubernetes land likes. But when you modify the ConfigMap, you have to rotate the pod. Yeah. So annoying. Don't know why we don't have a controller for this shit. Or we do, if someone
34:17 Restarting CoreDNS & App Works
34:33 can share it with us. There is one that was presented at KubeCon a while back. Someone had one that embedded the hashes. Oh, yeah. I do see that a lot on the Yeah. Yeah. But there's someone who actually wrote a controller for it way back when. Or you can just use file system notifications. We use it in a couple of our applications, and it's working great. So except that it's having a couple of false triggers when it's syncing from etcd, so you have to ignore a couple of file system events for reapplying the same permissions from etcd. So
35:09 but that's a known issue for a couple of years, so nothing unexpected. You can always judge a person based on how they restart the pod. I am definitely a kubectl delete all kinda guy, but I see that William You are. A very sensible approach, doing a rollout restart. I like it. Yeah. Because I have to remember things like deploy slash coredns. There you go. Okay. Yep. 50% of the time, DNS or always DNS. Oh. Alright. Ten minutes, and we've got v one working. If you refresh. Mhmm. Not sure why my video is playing weird,
36:08 but we do have the watch point. Let's try and get this upgraded. Okay. Alright. I I I feel like I should check the images on the on things, but let's just trust it and go. So we'll get pods. Edit. Oh, wait. Deploy. Cool. And I'm I'm guessing we just need to edit this v two v one to v two? Correct. So just change that up for a two and apply and free. There you go. Alright. And we have the dance. How's that, William? Nine minutes to play. Alright. Great job about that. Adam, did did we get everything? Was there
37:17 Adam Reveals His Breaks
37:19 anything that didn't trigger properly, or did we get Yeah. It should not work. What the yeah. It should not work. Can we get the kubectl get pods -o wide? I love the CTL. Don't work. Get pods -o wide. Yeah. This is interesting. It should not work. Maybe I forgot something. Alright. Well, we look forward to the break notes in the repository, and we can see what was there. But good to go, William. I can even tell it right now? Yeah. Go for it. What would Yeah. So, basically, I set up a Harbor
38:07 instance somewhere. Oh. And I changed the containerd config to trust the certificate, and changed the /etc/hosts files on the worker nodes so it would resolve to my IP instead of GitHub. So I like it. That's my kinda break. So yeah. So you could pull the v two image, but it was the same image pushed. I'm not sure why it didn't work. Maybe I forgot to change something. That's sneaky. So bad. Oh, I love it. Yeah. But I guess you would struggle to find that because it would actually pull
38:44 the image, and it would look fine and everything. Did you update the containerd config on worker one and two to do that redirect? I think I did. I think maybe I forgot to change /etc/hosts on worker two. I'm not sure. Well, never mind. Alright. Yeah. Let's switch clusters then. So we're gonna kill this session. Alright. So this and I'm gonna switch over to 1Password. Come on. Not now. Alright. I'll just do that over here. I don't trust typing my password on my live screen anymore. I've flashed it too many times.
39:15 Cluster by William Lightning
39:33 I'm gonna start my victory drink. What have you got? It's just a root beer. Just a simple Mug root beer. Alright. I have 1Password. Okay. So this is William's cluster. Adam, you're now pairing with me on this. I am opening a connection to the control plane node. If you could do me a favor, join our active session, and then we will check for a control plane. I guess I see my weird screen flash, which means I know we're in. Perfect. Okay. We are in. So let me install k9s. I've seen that a lot lately. I think
40:40 that's the second time we've had someone install k9s. I'm not sure about the pronunciation. I accidentally caught it when it was released, like, two days after it. And since that time, a fan. So Nice. It's a very cool tool. Okay. So let's check whether if export. KUBECONFIG. Let me see. Okay. Let's check. kubectl version. Okay. It's not accessible. So, obviously, something is going on. And I think even the IP address is not correct. So let's check. Okay. I could probably check in my cluster, but I assume that it should be localhost
41:10 Control Plane is Down
41:28 anyways. So let's check. No. I think the IP is okay. That's the BGP advertised IP address. Ah, okay. But that should be okay. I think we just don't have a running control plane. Okay. We can check that for sure. Seriously, my keyboard is messing with me. So let's check that. crictl dash r, and let me copy that. So, pods, I think. kube-apiserver is running, but the scheduler was messed with. So that's something to look for for sure. So I think where are the static manifests? I think in /etc/kubernetes. Right? Yep.
41:50 Investigating Static Manifests
42:31 You need a cd. Yeah. Yeah. You always forget you're not on your local setup, and you expect some of that to cd manifests. And let's check what's going on with kube-controller-manager. The image looks right. Yep. Looks okay to me. Yeah. So why is the API server not running? Let's look into that. kube-apiserver. Okay. Bind underscore. Fishy. At least seems so. So something is going on why we don't have access. So really, I've looked into iptables, so I guess it might have to do with that. I think it's iptables -L
43:48 or something like that. So DROP all anywhere. I think that should not be there. Yeah. I think that should not be there. So I think it's how you edit. I always forget how to edit this stuff. So Yeah. I'm not entirely sure how to modify iptables myself. Typically, I would just flush them all. So anything that's not processed either, we just delete, and then we will lose our Kubernetes and Cilium stuff, and a pod restart will fix that. Yeah. Sure. We can do that. But then I am now infamous for breaking
43:55 Flushing IP Tables
44:40 clusters more than I am fixing them. Yeah. I'm just trying to figure out how to fix iptables because I'm not that good with it. So I'm trying to figure out how to You want to flush them all? Yeah. Yeah. We can just do iptables -F. Okay. And then when you run -L again, you'll see we have a clean setup. So Okay. Let's check again. Yeah. That may fix our API server or may not. I don't know. I can see William's sniggering. Yeah. Let me try kubectl version. It still doesn't work.
45:19 So it's still refusing. What else it could be? Have I played a game, Russell? Okay. So, yeah, that didn't work. So we took a look at the API server manifest. Right? Yeah. Do you mind if I just pop it open again? Sure. Ah, it was cut off. Okay. That was weird that it's so I don't see anything that concerns me yet. Maybe we should take a look at the API server logs. Yeah. Sure. So we can check with guess it's less /var/log. Not Kubernetes. Container. Container. You can use pods, but the directory tracking containers is a lot easier.
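The list, flush, list sequence used here could be sketched as follows. It is a blunt instrument: `iptables -F` with no table argument flushes every chain in the filter table, including rules that Kubernetes and the CNI manage, which the speakers expect to be re-created once those components restart. Run as root on the affected node.

```shell
# Sketch: inspect, then flush, the filter table on a node (requires root).
flush_filter_rules() {
  # List rules numerically; look for an unexpected DROP near the top of a chain.
  iptables -L -n --line-numbers
  # Flush every chain in the filter table, managed rules included.
  iptables -F
  # List again to confirm the chains are now empty.
  iptables -L -n --line-numbers
}
```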
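When the API server is down, container logs can still be read straight from the node's disk. The sketch below finds the newest log file for a named component; it assumes the kubelet's usual `/var/log/containers` directory, and the directory can be overridden for other layouts.

```shell
# Sketch: print the path of the newest container log matching a name.
# $1 = component name (e.g. kube-apiserver), $2 = log directory (optional).
latest_container_log() {
  name="$1"
  dir="${2:-/var/log/containers}"
  # Newest matching file first; prints nothing if there is no match.
  ls -t "$dir"/*"$name"*.log 2>/dev/null | head -n 1
}
```

For example, `tail -n 50 "$(latest_container_log kube-apiserver)"` would show the most recent API server output.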
46:01 API Server Error: Cannot Reach etcd
46:21 No. It's no. /var/log. That was the issue. kube-system, kube-apiserver, and error by dialing. Okay. So the API server cannot reach etcd. Yeah. So maybe then something is going on with etcd. Let's check that. What would grow free audience? Don't fuck with etcd. Right? I think that's what it was. Yeah. That's evil. Advertise client URL. Maybe that's correct. Was that running when you did the control command? I don't remember. Yeah. We can check that for sure. Yeah. It was running. You're very calm, Adam. What? You're very calm.
46:58 etcd Permissions Issue
47:25 Why would I stress more than I need to? So I really like it. I mean, I'm already a little bit pissed that my break didn't work as intended because that would be fun to watch, you figuring out why you don't find the v two part. But, anyway, it's Next thing. What the fuck? Let's check why it's kube-system. etcd. Okay. Oh, it's ls. Stupid. Okay. So the recommended permission, to prevent unprivileged access to data. So I guess etcd has been messed with. So the permission is 755,
48:09 etcd Disk Full Issue
48:20 but the recommended permission is 700. So Yeah. Okay. Let's check the etcd folder, I guess, which should be okay. Where etcd stores the data. Let's check the manifest again. So less etcd.yaml. And this is where is the data? Am I stupid, or I just don't see where it's storing the data? Data. Okay. So ls. Okay. This is looks fill me up. Oh, this doesn't look right. So let's remove that. rm -rf. Let's see the fill me up. Okay. There's a large file there, William. What was that? I just had to get
49:37 there. 50 gigabytes. Yeah. It's pretty nice. So maybe now we have a running control plane. Let's check that. Permissions are still Oh, yeah. Sorry. So under guess it either could be the keys or the content, but this folder looks okay. So let's check what is under member. It still looks okay. This does not. So we can do a chmod -R, and I think it should be 700. What is the something is not Then I guess it's the wrong order of the command. Yeah. But you'd have to change the permissions on the files.
50:07 Fixing etcd File Permissions
50:34 So 700 would probably be for the directory and then 600, I think, for the files, because you wouldn't need the executable bit on the... Oh, yeah. So I can change it to 500. Right? No. 600. Six. Yeah. Six. So let's check if it works. Okay. That should be fine. And then there was another folder. What was it? The write-ahead log, but I guess that wasn't messed with. So okay. Let's check again what's going on with the PKI. So it's etcd, pki... okay. So /etc/kubernetes, etcd.
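The permission fix discussed here (700 on directories, 600 on regular files) can be sketched in Python. This is a minimal illustration, not what was typed on screen: the helper names `tighten_permissions` and `mode_of` are made up for the example, it assumes a Linux-style permission model, and the demo only touches a throwaway temporary directory rather than a real etcd data directory.

```python
import os
import stat
import tempfile


def tighten_permissions(root, dir_mode=0o700, file_mode=0o600):
    """Apply dir_mode to root and every directory below it, file_mode to every regular file."""
    os.chmod(root, dir_mode)
    for current, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(current, name)
            if not os.path.islink(path):  # leave symlink targets alone
                os.chmod(path, dir_mode)
        for name in filenames:
            path = os.path.join(current, name)
            if os.path.isfile(path) and not os.path.islink(path):
                os.chmod(path, file_mode)


def mode_of(path):
    """Return only the permission bits of path, e.g. 0o700."""
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical stand-in for a data directory; nothing here touches a real cluster.
    with tempfile.TemporaryDirectory() as data_dir:
        os.makedirs(os.path.join(data_dir, "member", "snap"))
        db = os.path.join(data_dir, "member", "snap", "db")
        open(db, "w").close()
        os.chmod(db, 0o644)
        tighten_permissions(data_dir)
        print(oct(mode_of(data_dir)), oct(mode_of(db)))
```

Treating directories and files separately is the point of the exchange above: a single recursive 700 would also mark every data file as executable, which is not needed.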
51:27 No. What was that? pki, etcd. Okay. These certificates have too-broad access as well, so let's change them as well. So /etc/kubernetes... pki. Let's see, and yeah. That should be fine. Okay. That's okay as well. So let's check; maybe it has healed itself by now. Or not: connection still refused. So something is still not okay. So crictl. How can we restart the container? Maybe stop. Okay. Yeah. You could also just remove it and let the kubelet set up a new one. Okay.
52:42 So remove is rmp. So crictl. Okay. So the pod name is, let's see. Okay. So copy. And crictl, what was it? rmp. Okay. So it didn't remove it. Or let's check the logs. So less /var/log/pods. No. I think you passed in the pod name. I don't know if that works or if you need the pod ID. Oh, yeah. Okay. So let's try with the ID. That looks better. Yeah. It's gone. It has a new... There we go. Yeah. It has a new ID.
54:09 So let's check the logs again. So etcd 217. It's running. Uh-huh. It's still "no space left on device". So what the heck is going on? df is... So there's this weird thing. It's so annoying. With etcd, when an alarm goes off, you have to manually clear it. So you're gonna need to get etcdctl installed and clear the alarm. Okay. But it's still full. So something is going on with this mount. It's 200 megabytes, and it's full. Or this is, yeah. So something sneaky is going
54:18 Disk Space & Loopback Mount Issue
54:56 on in this list. So I guess, let's check again. Let's see. Maybe there is a... So these files, the DB is very big. Oh, it's six megabytes. It's okay, I guess. Let's check the write-ahead logs. The write-ahead logs are big, so I guess we don't really need them anymore. So let's just... Well, they may not be compacted yet. I wouldn't delete them just yet. Okay. If you say so. But there was something interesting there, and I'm not sure, but /var/lib/etcd was a loopback mount. Like, we could just unmount it, I think,
55:53 to remove the 100% usage. Like, we'd have to stop etcd, unmount it, and hopefully that would expose the directory from the root filesystem. And that's... Okay. I'm making some very large assumptions because I have no idea what William's done. But I don't believe it should be on a loopback mount in there. Okay. We can stop etcd. I guess it's not restarting. Oh, it's re... no. It's not restarting. Okay. That's good. Yes. So I guess it's umount. Right? Yeah. And then /var... It switched back to the Hungarian keyboard again. I'm using a US English keyboard. So, oh,
56:52 alright. Is this the command? Yeah. Okay. So let's check df -h again. Okay. So nothing is full anymore. Mhmm. So we can start running etcd again. You wanna check we've still got an etcd database under that location? Yeah. There was a db file. It was six megabytes or something like that. But we can check again. So what was it? And then member... no? Yeah. I was worried that was gonna happen. Yeah. So, oh, shit, I guess. So how do we restore the database now?
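The disk usage check that is repeated before and after unmounting can be approximated with Python's standard library. This is only a sketch under assumptions: `usage_percent` and `is_nearly_full` are illustrative names, the 95% threshold is arbitrary, and the number can differ slightly from what `df` prints because `df` excludes blocks reserved for root from its percentage.

```python
import shutil


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold=95.0):
    """True when the filesystem holding `path` is at or above `threshold` percent used."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    # "/" and "." are just examples; on the broken node the interesting path
    # would be the mount point that reported 100% usage.
    for mount_point in ("/", "."):
        print(f"{mount_point}: {usage_percent(mount_point):.1f}% used")
```

Running a check like this against the mount point rather than the whole disk is what exposes the trick in the episode: a small, separately mounted filesystem can be full while the root disk still has plenty of space.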
57:51 I think we just go home and say goodbye. So there's gonna be a backup somewhere, and that loopback image is gonna be on the device somewhere. So we could go old school and maybe run, like, a find from the root and look for another etcd directory. William, any ideas? I think that's a good track. I think you'll find something at least. I mean, most people stash stuff in /tmp or /root. So I would say we look in there. People aren't very creative when it comes to hiding stuff. Let's... okay.
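Searching from a root directory for another copy of the etcd data could look roughly like the sketch below. It assumes the usual etcd v3 on-disk layout, where a data directory contains `member/snap/db`; the function name `find_etcd_data_dirs` is made up for this example, and a real search on the node would start from `/` or from the directories mentioned above.

```python
import os
import tempfile


def find_etcd_data_dirs(root):
    """Return directories under `root` that contain member/snap/db (assumed etcd layout)."""
    found = []
    for current, dirnames, filenames in os.walk(root):
        if os.path.isfile(os.path.join(current, "member", "snap", "db")):
            found.append(current)
    return sorted(found)


if __name__ == "__main__":
    # Build a throwaway tree with one fake backup so the search has something to find.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "backup", "member", "snap"))
        open(os.path.join(root, "backup", "member", "snap", "db"), "w").close()
        os.makedirs(os.path.join(root, "unrelated"))
        for path in find_etcd_data_dirs(root):
            print("possible etcd data dir:", path)
```

A search like this only finds data that is visible in a mounted filesystem. A copy that lives inside an unmounted loopback image file, as turns out to be the case here, has to be mounted somewhere first.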
57:56 Searching for etcd Backup
58:32 So nothing here. What about /root? Downloads. I cannot type today. No. Okay. Not here. I mean, no question, I should have suggested that we, you know, debug the loopback mount before removing it. But, still, we can find it. Yeah. I mean, if there's no objections, I'm just gonna suggest that we do. Okay. There is one in local. Yeah. So let's check that. Okay. A loopback file image. So we're gonna have to mount that again to a new location and copy the files over. Okay. So I need some cheat sheets for
59:46 that. I think we can just do... I don't know. Will this work? I don't think I've done this before. I think it should be mount. There we go. Okay. So now we have to check whether we have etcd running. Oh, well, we'll need to... Yeah. We have to restart it. Move member first. And we've now got a fresh etcd started. Okay. So we're gonna have to stop the kubelet, stop etcd, delete /var/lib/etcd, and then restore that member directory to /var/lib/etcd. Okay. So yeah. Guess we... Yeah. He is absolutely
59:50 Restoring etcd Data & Control Plane Fixed
1:00:52 stop. kubelet service. And then we have to stop all the etcd processes or pods. Yeah. So etcd is running. There are two of them, actually. So it's very good. We have two of them. So rmp and then one more. Okay. So now we can delete /var/lib/etcd. I know. Or just the member directory at least, and copy over. Thanks, William. We appreciate that. He said in the chat he created other backups just in case we blew away the loopback image. No worries. And the a directory. Yeah. So I mounted the loopback to the a directory.
1:01:51 Okay. Actually, you'll need to move a/member to /var/lib/etcd, and that should, yeah, that should hopefully be a working etcd, so we should now be able to restart the kubelet and hope for the best. We're halfway through this, Adam. Thank you for helping me out. So let's start kubelet. Maybe we will have an actual API server now. I hope so. Let's check what is running. So etcd is starting up. Let's check. Let's be hopeful. So version. Hey. We have one. So we can start with k9s. So let's figure out what's
1:02:41 Nodes Not Ready (CNI)
1:02:41 going on in this cluster. Okay. So all namespaces. Nothing? Okay. This is weird. We don't have any pods, so let's check the nodes. Okay. Not ready. Not ready. That's always nice to hear. I guess it was intentional. So kubectl get pods, all namespaces. Network plugin not ready. CNI plugin not initialized. Okay. So I guess he might have removed... let's check the DaemonSets. Okay. There is no... but we also should have it in. So it is there, and it's running or at least seems to be. So let's check with the control plane. So
1:03:40 now we have some logs. Let's check the nodes again. Maybe they are self-healing. Okay. Now we got something. Good. Okay. So kube-scheduler. I guess something fishy is going on with it, so I'll check the ConfigMap. I think that's fair. Okay. Let's go back to pods and check why it's going into CrashLoopBackOff. Cannot find kubeconfig to mount into... okay. So I guess it has been messed with. I actually don't have a cluster with me, but it's very weird that it's expecting a kubeconfig. I think it should not do that.
1:03:53 Scheduler Crashing (Volume Mount Issue)
1:04:25 Yeah. We'll probably want to take a look at the static manifest. I would suspect something... Yeah. Changed. Yeah. So let's quit and go back to /etc/kubernetes, and where is it now? manifests. And then let's check. less kube-scheduler. What's going on? So it's mounting the scheduler config. Nothing else. And I guess it's expecting the kubeconfig. So I actually should check on my cluster, because I did not break it, what's going on. Yeah. You can also peek into that /etc/kubernetes/scheduler.conf just to see if anything there has been tweaked or changed. Yeah.
1:05:24 It's config, so debugging it may be a bit of a pain. Okay. So it cannot find this, or this is not the right one. So there could be... this looks okay. So something is going on for sure. Maybe just wanna check that IP address. Yeah. We can check that. No. It's just the bind address. So maybe that's okay. So this is the kubeconfig. I guess we can... I mean, we don't need a working scheduler, to be fair. Yeah. Well, obviously, we can work around that by
1:06:17 assigning manually. But it would be nicer if we had one working. I'm thinking what to do now. So the scheduler doesn't work for some reason. And that's /etc/kubernetes/scheduler.conf. Maybe the file permission was messed with, so we can check that. Wouldn't be the first one. So this seems to be okay if it's running as a root container. Let's check if it's running as a root container. Oh, there's an error message. It says cannot find the volume. Did you see that? Oh, I won't. The first line: cannot find
1:07:18 volume kubeconfig to mount into the container. So the static manifest must have some sort of volume mount configuration that isn't actually being declared. Yeah. Interesting. So let's use the... It's making more sense sometimes. So the mount is, so FileOrCreate. That's okay, I guess. Read-only should be true. And the volume, wait. volumes, hostPath, name kubeconfig, and volumeMounts. It's in the container. The name is matching as well. So it's interesting why it cannot mount the volume. It's always this. What the heck is going on with this?
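The hypothesis raised here, that a container mounts a volume name the pod never declares, can be checked mechanically. The sketch below works on a pod spec that has already been loaded into a Python dict (for example from a parsed manifest); `undeclared_volume_mounts` is an illustrative helper name, and the sample spec is invented, not copied from William's cluster.

```python
def undeclared_volume_mounts(pod_spec):
    """Return (container name, volume name) pairs whose volumeMounts entry
    has no matching entry in the pod's top-level `volumes` list."""
    declared = {v.get("name") for v in (pod_spec.get("volumes") or [])}
    containers = (pod_spec.get("initContainers") or []) + (pod_spec.get("containers") or [])
    missing = []
    for container in containers:
        for mount in container.get("volumeMounts") or []:
            if mount.get("name") not in declared:
                missing.append((container.get("name", ""), mount.get("name")))
    return missing


if __name__ == "__main__":
    # Invented example: one mount refers to a volume that is never declared.
    spec = {
        "volumes": [{"name": "scheduler-config", "hostPath": {"path": "/tmp/example"}}],
        "containers": [
            {
                "name": "kube-scheduler",
                "volumeMounts": [
                    {"name": "scheduler-config", "mountPath": "/config"},
                    {"name": "kubeconfig", "mountPath": "/kubeconfig"},
                ],
            }
        ],
    }
    print(undeclared_volume_mounts(spec))
```

In the episode the names in the manifest appear to match, so a check like this would come back empty and the cause lies elsewhere.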
1:08:33 No logs. So only the... so the hash is okay. The mirror pod is okay, and it should mount. Oh, maybe... and it's running on control plane one, so that should be fine as well. Yeah. We've got twelve minutes. Why don't we ignore the scheduler and focus on the workload and then come back to the scheduler? Yeah. Just like you said, we can skip this. So there is a hello. I guess this is a CronJob, so let's delete that. We don't care about this. Sorry, William. Jobs. Okay.
1:08:54 Manually Scheduling Pods & App Works
1:09:18 That's okay. Now it's cleared up. So let's check. Not ready. Not ready. So it's not scheduled, it's completed. So completed is not the pod status. So I guess there should be a job or something like that. A pod can complete if the process exits with a zero. Okay. Let's check the manifest. So this is v1 for sure. Okay. Let's check the deployment. I cannot type. Okay. So describe. No volume. Now let's check the replica sets. There is only a single replica set. So it has been either cleaned up or something
1:10:18 else is going on. And based on the age, something else should be going on. Why is it not ready? So the other nodes were fine. Yeah. And the Cilium pod started fine. So that should be okay. Okay. I'm not sure why it's quitting. So do we have no replica sets at all? No. We have our replica sets. Okay. So let's try just updating the image to v2. Sure. We can try that. You don't feel confident that's gonna work. Yeah. I'm mildly hopeful.
1:11:19 We won't. Okay. So let's check. It's pending because we don't have a scheduler. So should we directly edit the pods? Yeah. That's it. So where is it? You could just add nodeName to the spec. Yeah. Here. Right? Yep. It's like the last case and h. I cannot edit this. Yeah. We'll need to... oh, yeah. Oh, no. We should be able to change the... I must have the wrong field name. Oh, no. nodeName is there. Yeah. It cannot be changed. That's what I hate about pods. Let's just add it to the deployment on the template spec.
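Setting `nodeName` on the pod template is what lets new pods bypass the broken scheduler: a pod created with `spec.nodeName` already filled in is picked up directly by the kubelet on that node. A minimal sketch of the patch body follows; `node_name_patch` is an illustrative helper and `worker-1` is a hypothetical node name, since the real node names are not given in the transcript.

```python
import json


def node_name_patch(node_name):
    """Build a merge-patch body that pins a Deployment's pod template to one node."""
    return {"spec": {"template": {"spec": {"nodeName": node_name}}}}


if __name__ == "__main__":
    # "worker-1" is a placeholder; substitute a node that reports Ready.
    print(json.dumps(node_name_patch("worker-1")))
```

The printed JSON could be passed to something like `kubectl patch deployment <name> --type merge -p '<json>'`, which has the same effect as editing the deployment by hand as done in the video. Pinning pods to a node this way skips all scheduling checks, so it is a workaround for the session rather than a fix for the scheduler.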
1:12:38 Sure. So deploy. We're not fixing your scheduler, William. There you go. So let's find where the pod spec is. So it's here. So pod spec. That's under containers. So pod spec. This is here. I hate that Vim does that. Ah, I always forget. Okay. Let's see what's happening with the pod. Here we go. Okay. ContainerCreating. I could check the logs, but it's not my NGINX container that always gives back an internal server error. Hopefully, it's just pulling an image. There we go. Okay. So we have... okay. So we have to fix the StatefulSet as well.
1:13:46 So let's put it on the same node. It should be fine. So I think we may have to delete that pod because it's a StatefulSet. Yes. It won't roll out automatically. Sure. I'm curious how you got the completed state on that, William. You need to share that with us when we're done. That is better now. Shall I try and merge the application? Yeah. I guess we can check it. Maybe it will work. Maybe not. Oh, we got that. Oh. So I could not fix the scheduler,
1:14:50 William & Adam Reveal Breaks
1:14:52 but at least we got there with an almost dead cluster. Yeah. That was interesting. Like, I must ask, are there any breaks there that we navigated around, William, or was etcd the big one? etcd was the big one. Yeah. I figured that would be it, and I just did a one-line change on the kube-scheduler. I'm pretty sure you guys broke it while messing with etcd, though. Yeah. Unfortunately, I had one more idea to mess with your cluster, but it did not work either because
1:15:31 of 1.22. So I had some more ideas, but I ran out of time, and that was probably for the best given how long that took. This was really great. Yeah. Maybe a heads up for the future. Cilium currently doesn't support it, but there is a nice Service property in 1.22: you can create a Service with a Local internal traffic policy. That's why there was a change there, because I tried to do that, but I figured out that Cilium still doesn't support it. And because David installed the cluster without kube-proxy,
1:16:09 it didn't work, because supposedly with kube-proxy, if I would restore that, it would respect the rules. Damn me and my kube-proxy-free clusters. Sorry. Yeah. That was a sad thing to happen. Alright. Well, tell me, how did you get the completed status on the cluster's Postgres pods? I have no idea. That is not something I broke. And people say Kubernetes is hard. The etcd, and compressing that into a small volume and then trying to get it to fill the disk, that was really
1:16:45 hard. Etcd really didn't want to die. And then I just configured the scheduler to write its config out and die. There's one command line option in there that just... so that's why it was dying. So all the volume stuff, I think, was when you guys were messing with the permissions, because I didn't mess with the permissions anywhere. STD license is great. I guess that's what happened there. Alright. Well, thank you both, Adam and William. Those were two great clusters, and you both did really well working through that. In fact, I had a pretty
1:17:10 Conclusion
1:17:16 much hands-off experience. You both absolutely smashed that. So great job. It was fun. Thank you, William. It was a lot of fun. Thank you for joining me. Thank you to Teleport, our sponsor. Remember to go check out Rawkode.live/Teleport. You've seen us using it today; it's an awesome product. I encourage you all to install it in your clusters. And thank you to everyone who's watching, and for your comments along the way. We'll be back next week with a couple of members from the Discord community. We have Carlos and Eric. It's gonna be another amazing episode. So once again, thank you, Adam.
1:17:46 Thank you, William. Have a great day. Bye. Thank you. You too.