Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Trace Cilium CNI failures by comparing control-plane manifests, daemonset logs, and iptables state when pods will not schedule.
  2. Diagnose Kubernetes control plane issues by auditing swap settings, kubelet configuration flags, and service status across all nodes.
  3. Restore API server connectivity by fixing etcd endpoints, validating node firewall rules, and correcting cluster CIDR parameters.

Rawkode and Walid Shaari debug two broken Kubernetes clusters deployed via Cluster API on Equinix Metal. They track down a Cilium iptables misconfig, swap-enabled kubelets, UFW interference, and a bad etcd endpoint.

Chapters

  1. 0:00 Holding screen
  2. 0:30 Introductions
  3. 0:38 Introduction & Guest Welcome
  4. 1:40 Setting the Scene: The Broken Clusters
  5. 2:35 Initial Debugging Plan
  6. 3:45 Starting with Cluster 1 (Lee's Cluster)
  7. 4:00 Kluster 001 by @briggsl
  8. 4:40 Initial Health Checks (Cluster 1)
  9. 5:40 Testing Pod Scheduling
  10. 6:47 Scheduler Appears to be Working
  11. 7:09 Investigating Component Status
  12. 8:40 Port Discrepancy in Component Status
  13. 9:54 Examining Control Plane Manifests on Node
  14. 29:50 Kluster 002 by @thebsdbox
  15. 1:11:55 Hypothesis: Missing Port in Manifest
  16. 1:16:36 Verifying Listening Ports with Netstat
  17. 1:17:30 Identifying the CNI Problem (Cilium)
  18. 1:18:39 Debugging Cilium DaemonSet & Logs
  19. 1:20:51 Fixing Cilium Configuration (iptables rule)
  20. 1:24:16 Verifying Cluster 1 Health & Workload
  21. 1:26:15 Cluster 1 Component Status Red Herring?
  22. 1:27:03 Switching to Cluster 2 (Dan's Cluster)
  23. 1:27:17 Initial Check (Cluster 2): API Server Down
  24. 1:30:36 Attempting Node Access
  25. 1:35:47 Gaining Node Access via Port 2222
  26. 1:37:18 Investigating Kubelet Status
  27. 1:39:34 Identifying the Swap Problem
  28. 1:40:35 Fixing Swap on One Node
  29. 1:42:42 Realizing Swap Problem on All Nodes
  30. 1:48:30 Disabling Swap on All Control Plane Nodes
  31. 1:49:09 API Server Status After Swap Fix
  32. 1:50:03 Firewall Interference (UFW)
  33. 1:50:32 API Server Disappears Again
  34. 1:51:42 Investigating API Server Logs & etcd Connection
  35. 1:57:31 Fixing API Server etcd Endpoint Configuration
  36. 2:00:08 API Server & Cluster 2 Restored
  37. 2:01:32 Addressing CoreDNS and CIDR Problem
  38. 2:03:41 Fixing Cluster CIDR in Kubeadm Config
  39. 2:05:13 Final Verification (Cluster 2)
  40. 2:06:05 Discussion, Realism of Problems, and Conclusion
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:38 Introduction & Guest Welcome

0:38 Hello, and welcome to this episode of Rawkode Live. I am your host, Rawkode. Today, we're doing something a little different. We're gonna do some live debugging of production-ish Kubernetes clusters that have been broken by some Kubernetes community members. Now I'm not gonna do this alone. I am joined by Walid Shaari, a very active Kubernetes community member and all-around awesome dude. Hey, how is it going? Thank you. Hi David, how are you? I'm very well, thank you. I'm very excited for today's, I'm gonna call it a mission. I think this is gonna be a very interesting

1:14 episode and I hope we can, you know, through the process of what we're gonna do today, share some of our knowledge and the way that we work with everybody, and everyone will be a little bit more familiar with Kubernetes. I'm hoping to do so also. I mean, I don't know what I have put myself into, but yes. Yeah, I mean, I've definitely got some concerns. Let's set the scene, right? We've got two Kubernetes clusters. They're both deployed with the Cluster API project on Equinix Metal. They're highly available control planes with the worker nodes each. I've thrown on

1:40 Setting the Scene: The Broken Clusters

1:51 some sample workloads onto them, a WordPress blog and a MySQL database, and I was silly enough to say, hey, who wants the kubeconfigs for these and is happy to break them? And we had two people offer. Well, we had more than two people, but we're gonna do two today. So the first two clusters were broken courtesy of Dan Finneran, who is also a colleague of mine at Equinix Metal, and Lee Briggs, a developer and developer advocate for Pulumi. Now I don't really know anything, which is I think what has us both slightly worried. We don't know

2:25 if the API server is gonna respond, we don't know if they've messed with the workloads, we don't know anything. We don't know anything about these clusters. So I guess what we need to do is just get our screen shared and then run some basic health checks and see what happens. Yes. So we need to, like, day one on the job, discover what the clusters are, what's working, what's not working, get to know the clusters, and see if there are any apparent errors. Just assess the situation first.
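
The "assess the situation first" step can be sketched in code. The following is a minimal illustration, not what was run on stream: it parses the text output of `kubectl get nodes` and lists nodes whose status is not exactly `Ready`. The function name and the node names in the sample are hypothetical.

```python
# Minimal sketch: find nodes that are not "Ready" in `kubectl get nodes` output.
# SAMPLE is made-up output for illustration; node names are hypothetical.

SAMPLE = """\
NAME        STATUS     ROLES    AGE   VERSION
cp-0        Ready      master   26h   v1.19.0
worker-0    NotReady   <none>   26h   v1.19.0
"""


def not_ready_nodes(get_nodes_output: str) -> list:
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = [ln for ln in get_nodes_output.strip().splitlines() if ln.strip()]
    if not lines:
        return []
    # Locate the STATUS column from the header row instead of hard-coding it.
    status_idx = lines[0].split().index("STATUS")
    bad = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_idx] != "Ready":
            bad.append(cols[0])
    return bad


if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

In practice the input could come from `subprocess.run(["kubectl", "get", "nodes"], capture_output=True, text=True).stdout`, assuming `kubectl` and a working kubeconfig are available.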

2:35 Initial Debugging Plan

3:01 And then I'm going to use my favorite strategy from a British documentary, a British IT documentary. It's called The IT Crowd. So they have a principle. Their first principle is turn it off and turn it on again. Did you turn it off? So that's the principle that I usually use in troubleshooting. Well, we'll certainly give it a shot. And just to, I don't know if it makes matters worse or better, but we have a comment from Dan, who's watching along with us today. So, yeah, if we need Yes, please. If we need tests, we will definitely

3:39 I'm not watching the YouTube, but yes. Okay, so let's see. I have my screen shared and ready. We've got the Kubernetes documentation. Walid and I certainly do not know everything, nor do we remember everything we learned at one point in time, so we will be leveraging the documentation where we need. We also have the IP addresses of all the nodes, we do have SSH access, and hopefully we have kubectl access, but we will see that in a moment. For anyone not familiar, Walid has an amazing set of resources on his GitHub profile, so that is Walid Shaari.

4:00 Kluster 001 by @briggsl

4:16 The username, really helpful for anyone looking to take their CKA, CKAD, or CKS exams. And I have an etcd cheat sheet because I can never remember how to work etcd. So step one, we have the kubeconfig for cluster one here. This one is the cluster broken by Lee. So first things first, should we run a get nodes? Is that Yes please. But can you run cluster-info first? Let's see the API. Okay, anything. Yeah, let's see if the API is responding. Yeah, that's good at least. That is good. That's good. The network is working. Okay, we

4:40 Initial Health Checks (Cluster 1)

4:59 can tell that the CNI is working because all the nodes are ready. And we can tell the API is responding because it responded to our kubectl command. So we can assume the controller is okay. Can we do get pods -A to see? Oh, I've run get componentstatuses and we can see the scheduler and the controller manager. The scheduler. Okay. Controller manager. Okay. But I'm really happy that etcd is happy. Okay. So what we can do, we can check the manifest for the scheduler and the controller manager on the master node.

5:35 In that case, we have three master nodes. Do you want to see the output of get pods, all namespaces, first? Most likely we'll see pending. Do we see pending? Can we grep for pending? Yeah. Sure. Strange. I mean, the scheduler is so basically these were run. So nothing Okay. Can you schedule Can you run an NGINX image? Can you do a kubectl run pod? Run nginx, image nginx. Yeah. Dash dash. Okay, let's see. Container, oh. So basically maybe one, okay, maybe not all the schedulers are down because we just need one scheduler to be

5:40 Testing Pod Scheduling

6:35 up. Well, I would expect the image to pull a little faster than that. Let's give it a few more seconds and see if we actually get a running container, but it does. If you do, can you do kubectl describe on the NGINX, because we will see that it was scheduled. So successfully assigned, you see. So it was assigned to the node. That means that the scheduler was working. So maybe not all the schedulers are down. We can go one I mean so it's helpful if you have a better SSH command like pdsh or Ansible or whatever.
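
The reasoning here (a "Successfully assigned" event means the scheduler did its job, even if the container never starts) can be sketched as a small parser over `kubectl describe pod` output. This is an illustration only; the function names and the sample text are hypothetical, and the sample is abbreviated.

```python
# Minimal sketch: read the Events section of `kubectl describe pod` output
# and check whether the scheduler assigned the pod to a node.
# SAMPLE is illustrative, abbreviated output; pod and node names are made up.

SAMPLE = """\
Name:         nginx
Namespace:    default
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               2m    default-scheduler  Successfully assigned default/nginx to worker-0
  Warning  FailedCreatePodSandBox  1m    kubelet            Failed to create pod sandbox
"""


def event_reasons(describe_output: str) -> list:
    """Return (type, reason) pairs for every event row after 'Events:'."""
    reasons = []
    in_events = False
    for line in describe_output.splitlines():
        if line.startswith("Events:"):
            in_events = True
            continue
        if not in_events:
            continue
        cols = line.split()
        # Event rows start with the event type; header and rule rows do not.
        if len(cols) >= 2 and cols[0] in ("Normal", "Warning"):
            reasons.append((cols[0], cols[1]))
    return reasons


def was_scheduled(describe_output: str) -> bool:
    """True if a 'Scheduled' event is present, i.e. a scheduler placed the pod."""
    return ("Normal", "Scheduled") in event_reasons(describe_output)


if __name__ == "__main__":
    print(was_scheduled(SAMPLE), event_reasons(SAMPLE))
```

A pod that is scheduled but stuck creating its container points away from the scheduler and toward the kubelet, the runtime, or the network plugin on that node.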

7:09 Investigating Component Status

7:15 But let's go the traditional way, one node at a time. Okay, so let's talk about what's your hypothesis? Do you have anything right now or are we still just kinda looking around? What are you thinking here? Okay. Let's do, if you do kubectl -n, the namespace kube-system, and get pods, yes. And let's grep for scheduler. So I see Yeah. We have three schedulers here. Do you wanna get the logs? So they are running, correct? It looks like it, yeah. Let me check. So everything is running. Scheduler cluster, yeah. Let's check the logs of

8:01 one of Okay, good. Successfully acquired lease, yeah. There was a leader election done. Okay. Interesting. K. Interesting. Yes. Can you do again the kubectl cluster-info? So control is running. kubectl is fine. Okay. Now do the component again. Get componentstatus. So get. It's doing get. So the unhealthy, connect, connection refused. So if you look at the address, yeah? Mhmm. It's a local loopback address. Yeah, so this is the probes on the scheduler, should we take a look at Yes. So if we just do get pods, describe pods, pop open a scheduler. I can't scroll in tmux. Let's use

8:40 Port Discrepancy in Component Status

9:12 a pager. Yes. Yes. Okay, I want to check if there is a reason or anything. State started, yeah, ready true, that's fine. This one it looks The ports are different. The port on this is 10259, 10259, whereas our error message was 10 Let me run that get cs again. 10252, 10251. I wonder if the probes are changed on one of the schedulers. Maybe we should check them Let's be open minded and let's check the API server. What's the configuration for the controller manager at

9:54 Examining Control Plane Manifests on Node

10:00 what port it's listening to? Sorry, can you say that again? So our source of truth for port numbers would be the API, the kube-apiserver. Correct? So if we check the manifest for the kube-apiserver to see what ports are configured for the controller manager. Okay. What do you think? Yeah. Let's let's so you want to describe I want I want to one of the nodes. Right? Oh, I want to go to one of the master nodes. Yes. Alright. Alright. Let's grab this IP address then. So we can do an SSH. And these should be deployed through a static

10:51 manifest. Is that right? Yes. Correct. So let's ls -l. Yeah, can we or yeah, the kube-apiserver? Okay, you would need You want the API server? No, no, it's okay. In fact, this is the future, right? You could type in my machine. Yes. There you go. I'll just I'll just chill back a little bit. You've got this. Okay. I don't see anything related to the controller. Secure port. Okay. Actually, it's the other way around. So it is 10257. This is the liveness probe. Okay. So I've got a question. Right? Like, we run a get pods on the

12:07 kube-system namespace and what we see is three schedulers that appear to be healthy. One controller manager that appears to be healthy. However, component status is saying unhealthy. Now how are we getting this disconnect here? Do we know what could cause that? It could be the health checks are the wrong ones. Yeah. So basically it is listening on a different port, but the health checks are doing a different port. One zero two. Sorry. Two. So okay. So now I saw a roll of course there. Oh, okay. Sorry. No. No. It's all had. kubectl.

13:07 Minus n kube-system. You'll need to yeah. We're not gonna have access to that on the node. You'll need to use the admin kubeconfig. The admin.conf is a kubeconfig. Yeah. I've just dropped into my machine. Okay. Grab that. Okay. And the current controller and we need the kube-system namespace. Yeah. It's already there. It's at the start. You're good. Oh, okay. Okay. Let's describe one of them. Yeah. Okay. Excess, you're all good. K. It just auto expands. Oh, okay. Good. And I'll copy and paste that controller manager for you if you want. I'm not used to this.

14:06 Yeah. I'll I'll here. I I could do it. Right? We wanna do a describe. Describe. Yes. And not the kube-scheduler. No. The kube-controller-manager. Okay. And what was it? Control name and then m g, let's say. Okay. We have to do less. What happened? Hold on. Let me fix it. There you go. Are we typing together? Okay. Let's see. Can we see the ports? Can you go less? Okay. So You see port and host port, there is none. I don't think this is normal. Let me do you want to go to the other

15:04 cluster and use it as a check? Okay. To see what's the baseline. I don't think this is normal to have the port as none. Which container is this? Yeah, this is the controller manager. It should be listening on 10 to 50, is it? Okay, why is it saying none? Okay. Okay. So let let's I think I understand what you're suggesting here. So let's make sure we make this clear for the people that are So we are missing maybe configuration, yes. I'm thinking, are we missing a configuration? Yes, so when we do a get pods and everything looks healthy, this is because

15:43 the liveness probes on the controller manager and the scheduler are using the local IP address 127.0.0.1 and a port which will work, right? However, get componentstatuses was reporting that our controller manager and our scheduler are unhealthy. Now the component status checks aren't gonna use the 127.0.0.1 address, they're gonna use the cluster address for it. What Walid has noticed is that when we take a look at the description of this, there is no port or host port defined on this. Which means that that cluster address would not work and component status would think that things are unwell.
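
The port mismatch being discussed (health checks probing one port while the process listens on another) can be checked mechanically. The sketch below is an illustration under assumptions: it extracts the loopback ports mentioned in a component status error message and compares them with the listening ports in `netstat -tulpn` output. Both sample strings are illustrative rather than copied from the stream, and the function names are hypothetical.

```python
import re

# Minimal sketch: which ports does the health check probe, and is anything
# actually listening there? Sample strings below are illustrative only.

CS_ERROR = (
    'Get "http://127.0.0.1:10252/healthz": '
    "dial tcp 127.0.0.1:10252: connect: connection refused"
)

NETSTAT = """\
tcp        0      0 127.0.0.1:10257         0.0.0.0:*               LISTEN      2041/kube-controlle
tcp        0      0 127.0.0.1:10259         0.0.0.0:*               LISTEN      2102/kube-scheduler
tcp6       0      0 :::10250                :::*                    LISTEN      1187/kubelet
"""


def probed_ports(message: str) -> list:
    """Ports on 127.0.0.1 that appear in an error message, sorted."""
    return sorted({int(p) for p in re.findall(r"127\.0\.0\.1:(\d+)", message)})


def listening_ports(netstat_output: str) -> set:
    """TCP ports in LISTEN state from `netstat -tulpn` style output."""
    ports = set()
    for line in netstat_output.splitlines():
        cols = line.split()
        if len(cols) >= 6 and cols[0].startswith("tcp") and cols[5] == "LISTEN":
            # The local address is the 4th column; the port follows the last ':'.
            ports.add(int(cols[3].rsplit(":", 1)[1]))
    return ports


def unreachable_probe_ports(message: str, netstat_output: str) -> list:
    """Probed ports that nothing is listening on."""
    listening = listening_ports(netstat_output)
    return [p for p in probed_ports(message) if p not in listening]


if __name__ == "__main__":
    print(unreachable_probe_ports(CS_ERROR, NETSTAT))
```

If the probed port is absent from the listening set while the component's own liveness probe passes on a different port, the unhealthy report may say more about the check than about the component.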

16:20 So I think that's a good lead. But the controller manager, okay, because we did the pod, maybe we should try a ReplicaSet or a Deployment. But before we're doing this, let's try with the traditional tools. Let's do netstat -tulpn and check. You'll want to be on the machine, right? We are not anymore, yeah. Yeah, we can just do it. Race conditions we had. There you go. Let's say netstat -tulpn to list listening TCP, UDP, and port numbers as numbers, and grep for controller. Okay. It's listening on 10257,

17:13 which is correct as in the health check. Yep. So I think you're right. Why don't we let's assume that our second cluster doesn't have this problem. We wanna be able to correlate what we think our hypothesis is to confirm it. So But are we actually are we causing us problems? This looks healthy. We managed to run a pod. We didn't manage to run a pod. Did we? No. I thought it said container creating. Okay. So container creating. Okay. Can we run events again against it? So basically, the scheduler did work. And okay. Failed to create pod sandbox. Kubelet, combined from similar

18:07 failed to create pod sandbox. Failed to set up network for sandbox, unable to connect to Cilium daemon. Okay. So it looks like can we do again? So it's here. It's saying that our CNI is Cilium. Can we do How did we miss that? I don't know. I didn't miss it. Because we did a get -A. Yes. So can we look at the daemon set that is responsible for that? Can can Yes. I'll just get daemonset -A. Cilium namespace. Is it cilium or kube-system? Oh yeah, it is cilium. Okay. So none none of them is ready.

19:08 Okay. Okay. None of them is ready. No. I think we're going to have to get the logs of one of these Ciliums to see if we can see any other messages. Yes. Let's describe first. Usually when there is a CrashLoopBackOff, I do the logs because it's usually the application. But with describe, there is a reason and sometimes this reason is quite good enough. Can you repeat again with less or grep for capital Reason? Capital R, easy. Yes. So just saying CrashLoopBackOff. Okay. Let's do the logs then. Let's do the logs with the previous log so to

19:48 see what the error was. Minus minus p for previous. Yeah. Detected, mounted, valid table, prefix. L7 proxy requires iptables. Is the so Cilium can use Cilium uses BPF. Correct? It can do. Yes. So is the config for Cilium cluster? Okay. This is Let's see. You're going fast. So I'm just looking for anything that looks like an error message. Okay. So that is the first error message. So we got a What is it? We have a warp number of and then we have L7 proxy requires iptables rules, install iptables rules. Are we just missing one parameter?

20:55 Proxy requires iptables rules. So it's telling us it requires certain iptables rules. Okay. Let's use the British documentary for this one. Let's do iptables -L. But the problem with iptables, we'll see lots of stuff. Let's let's try. What is the problem here? I I don't know if it's iptables on the host or Cilium is telling us that it doesn't have access to the conflict stuff and now it's trying to get that Is it using but it doesn't have the flag on the template, right? Is that what it's saying? Yes. So

21:34 if we do an edit ds cilium, unless you jump down to all these parameters that we pass into the pod. Man, that's too fast. Yeah, sorry. Just get to the bit we need to do. Where's args? It was most likely it is in a config map. Of course. Or secret. But most likely it's a config map. Okay. In the cilium namespace, can you do get all? Yeah. Haven't created Cilium before. Okay. I think yes. Okay. Let's see what we have. Always use our. So let's try that cm again. Yeah. There we go. It helps if you spell it right.

22:38 Oh, you need minus n? You did minus n, correct? Oh, but I didn't. No, but but yes. cm, sorry. Yes, I think we could just add notes here. It's not iptables rules. Oh, wait. What happens if I don't want to use IP, because it says enable-ipv4 on top. But what happens if I want to use eBPF, install-iptables-rule false? Yeah. Try this one also. Why is it both? Ah, so it's a duplicate. I think there's a typo, and then there's a false But which one is correct?
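
Near-duplicate keys like the "rule" versus "rules" pair spotted in the config map are easy to miss by eye. As a hedged illustration, the sketch below flags pairs of keys that are almost identical using `difflib` from the Python standard library. The key spellings are reconstructed from the audio and the threshold is an arbitrary assumption; the other sample keys are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Minimal sketch: flag config keys that differ by only a character or two.
# KEYS is illustrative; spellings are reconstructed from the audio, not
# copied from the cluster's actual ConfigMap.

KEYS = [
    "enable-ipv4",
    "install-iptables-rule",
    "install-iptables-rules",
    "tunnel",
]


def near_duplicate_keys(keys, threshold=0.95):
    """Return sorted pairs of keys whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(keys), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for a, b in near_duplicate_keys(KEYS):
        print("possible typo or duplicate:", a, "vs", b)
```

The threshold is a judgment call: set too low, legitimately similar keys get flagged; set too high, a one-character typo in a short key is missed. Which spelling is the valid one still has to come from the component's documentation or its logs.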

23:10 Because you have rule and rules. Now time to hit the manuals. Time to hit the man pages or the documents. Well, the logs told us, so we go to the bottom of this log, rules, plural. So Okay. Let's restart our, let's restart our Cilium pods and see if they get happy. Yeah. I'm glad that you are here. I'm glad that you're here. We'll deal with this together. So I've just restart What did that? Sorry to interrupt. But that's a realistic that's one of the realistic problems in Linux and Unix and Kubernetes in general is

23:46 typos. Most of the problems are us making a typo in a configuration map or somewhere. Yeah. The difficulty here is, I don't know if Lee has broken our BPF configuration and fixing it by enabling iptables is still a valid way to fix it, but we may not be fixing the real issue, but if our customer's happy that's a win in our books. Cilium namespace, so. I must specify the resource cookie. Yeah. Sorry. I I would type. No. Go ahead. Go ahead. k -n cilium get pods. Okay. Let's do a get cs and see

24:31 if that fixes. Healthy. If you do the the thing is when we did k get nodes, everything was ready. So there was communication between the nodes. That pod they're scheduled, so maybe a component status is just a little behind, we're gonna have to wait for the check to catch up, but we may have fixed this. We got out, we were able to schedule our pod. Let's see if our application is still running. We maybe should have done that from the start, but So port forward, s v c, WordPress, 80. So the fastest way to check,

25:14 I mean, that your cluster is working is basically you run a workload. And running a pod is the simplest workload to have. I would say a deployment will exercise the replica set controller. So let's create a deployment. We we have one here. Let's work with this as our sample workload, which is is running just fine. We never changed the area. But the thing is, if it was running before, nothing would happen. Okay. Unless you delete the pods, okay, if you want to try. So you don't need to create a new deployment. Just delete the pods for the WordPress.

25:56 Okay. Alright. Let me close that. So get pods. So the controller manager is several controllers. One of them is for nodes, the node controller. One is for the replica set, among other daemons and stuff like this. And they are registered. So I don't think we have a problem with the controller manager anymore. I am so worried about this get component. But the port was, it was not 10252, was it? The port was, I think I thought the port had seven when we did netstat. Yeah, this might be a red herring. Let's just say that this was deprecated in

26:38 1.19, this cluster is 1.19, but maybe this just doesn't work. Maybe it just doesn't report healthy here. We could check, let's why don't we I'm happy that our workload is running, we fixed the Cilium problem, Cilium started working, I don't think we have any pods that are not running. Yes. So let's do okay. Let's move on to cluster two and then we'll see if the get component status fails there. And if it does, we'll consider that just a 1.19 thing. And let's see if we can tackle the next problem. And then if we need to,

27:14 we'll come back and take a look at the other problem on this. Can you do, just for my peace of mind, can you do kubectl get events and let's and grep for okay. Let's see. Just get events. It's like okay. And grep -v Normal. V only. Oh, sorry. I just added that. It's case insensitive. Error. Volume. Volume is already used. Okay. So failed to set up network sandbox. This was four minutes. Network sandbox. I think So problem. It's it's fine. But it's running, isn't it? Yeah. Mhmm. Okay. Okay. So I mean, if I do get nodes and if I

28:13 do get pods -A, and if I do cluster-info We have a working cluster as far as I can And if you run and if you run, like, a deployment and you expose the service, let's expose the server. There's no ingress configured on this cluster? Just as a service, as a ClusterIP. Expose NGINX. Oh, I can't remember the full thing. The server does not have resource. Expose SVC slash NGINX. But I have to tell it what to No, no, actually, we don't have the service yet. Yes. By default, I think it's ClusterIP. It's

29:04 clever enough to figure out the port, isn't it? Okay. Good. Get service. Can you do get endpoints for this service? Get ep, endpoints. I just described the service. Okay. That's better even. Yeah. Yeah. And we have endpoints. So networking seems to be fine, at least at the intra-cluster level. And creating a pod and the scheduler is running, the API is running. I think who broke this cluster? Lee Briggs, who is in Seattle, so will not be available just now to confirm anything for us. However, he will send a pull request to the cluster repository to tell us

29:50 Kluster 002 by @thebsdbox

29:50 what he broke. So hopefully we'll fix Okay. If not, I think we did pretty well there. So let's move Do you want the good news or the bad news? So Dan's left us a comment. We're having fun, there is no bad news. I fear I've opted for a less nuanced approach and gone full sledgehammer on the cluster. Well, I always knew you were the wild card, Dan, so that's alright. So Daniel is the one who broke this cluster? He is. Broke Okay, thank you Daniel. I appreciate. So let's I just want you to be realistic,

30:28 so that we can so maybe we can actually learn something out of it. Okay, let's try. Time man, time Can you use minus v? But it looks like okay. Can you do kubectl config? I want to view. Okay. Can we curl curl the address? Yeah. 6443 slash api or Yeah. Minus LB. I love minus LB. So basically, it looks like we don't have an API server. Alright. Let's get on this node then. Okay. Clustered to I have a lot of projects here. Cluster oh, right. The only issues I had with API servers in production were from the load balancer.

31:32 That the the load balancer was set one time as a layer seven load balancer, not a network load balancer. And another time, basically, during the installation, this was OpenShift. So OpenShift starts as a Cluster API. You have a node that is a management node, and And they forgot to include this management node as part of the back ends. We have no SSH. We don't have SSH? No, we're going to have to go in the back door. Fortunately, could expand But yeah. Has an out-of-band console, so that's gonna take us over the private link. On a non standard on a standard port,

32:21 he said, okay. I guess we could port scan it or I could just go in the back door. Does he mean non standard or does he mean standard port? This machine's also been up for more than twenty four hours, so I don't have the root password to actually log in here. So we've got two options. How do I quit? Turn it off. I thought control square bracket was my get out of here. Yes, yes, in general, yes. Let's just open up. Control square bracket dot, control square bracket dot usually. Yeah. How do I open a new tab on

33:08 tmux? Control B C. Alright, okay, let's just use this. Should we port scan this machine and see if we get anything? Okay, go ahead. nmap, do you think, let's just the normal range. Or do you think he's creative? I have no idea. No idea what he's done. It's blocking ICMP. You can try nmap -sU, s capital U. So we do it here. You are requested to scan, which requires ah, you need sudo for that. Also the first one probably. I don't like it today, Dan. So this is scanning UDP. There is another one,

34:05 SSP, b I think that basically doesn't send the the first the first SYN packet. It starts with sending the second. I've got an idea. He he may have done this on all of the control plane nodes, I bet he hasn't done it on the worker nodes. I don't know if he's that evil, so why don't we just try sneaking in through. You have a trust between them? No I'll have to forward my SSH if I can remember how to do that but I wanna see if I can even get on one of these nodes. No, he has done it on all of

34:37 them. Of course he did. Okay. Nonstandard. Okay. He's saying nonstandard. Let me just a second. I mean, I would pick 600666. I would pick 20202121. But we can scan scan using Okay. I'm gonna take one of the non important nodes, reboot them into a rescue operating system. So that we get on it and poke around the disk. Okay with Just a second. Let's try. Anyway, this I'll just this is just a worker node. So I'm gonna let that do that. Oh, you are in the worker node? Are you in the worker node? No, no, no. I'm rebooting a worker node into a

35:38 rescue, a RAM-based operating system, so we can get on and poke around. So while that does that, we can try a few other things this way. So do you wanna try that nmap again? He says we are so close with those port numbers. Ah, 20. Not -sU. -sS. To to I don't know if he's been nice to us here. 2121, 2121 usually. Or 2222, 2222. Yeah. What happened to 22? There is an nmap scanner. Let's see if this is in its rescue OS yet. No. It'll take a couple of minutes.
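
The hunt for a relocated SSH port can also be done with a plain TCP connect scan when `nmap` is not to hand. The sketch below is an illustration, not a replacement for `nmap`: it tries a full TCP connection to each candidate port and reports the ones that accept. The function name and the candidate port list are hypothetical, and a firewall that silently drops packets will simply make each attempt wait for the timeout.

```python
import socket

# Minimal sketch of a TCP connect scan over a short list of candidate ports.
# Only scan hosts you are authorized to test.


def open_tcp_ports(host, ports, timeout=0.5):
    """Return the subset of ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            try:
                # connect_ex returns 0 on success, an errno value otherwise.
                if sock.connect_ex((host, port)) == 0:
                    found.append(port)
            except OSError:
                # Name resolution failures and similar errors: treat as closed.
                pass
    return found


if __name__ == "__main__":
    # Hypothetical guesses at a "non-standard" SSH port on the local machine.
    print(open_tcp_ports("127.0.0.1", [22, 2020, 2121, 2222, 6443]))
```

A connect scan needs no root privileges, unlike the SYN and UDP scan modes mentioned in the conversation, at the cost of being slower and more visible in logs.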

36:46 Okay. So he has just sophisticated the port 2020, 2222. Doesn't seem to work, but he seems to suggest You mentioned to try triple six. I might have used the wrong IP. Alright. We're on a node. Okay. Which node is that? Control Control So what is it? 2222, correct? Yes. Okay. Can we do, what is it, Docker or CRI-O, or what kind of containerd. Can you do containerd? Okay. Can you do crictl pods? Can we do we do systemctl and see that containerd is running? systemctl status containerd. You did the whole stuff.

37:48 containerd. Yeah. Okay. Active, running, which is good. And okay. Can we see the kubelet? No. Can we check the again, with the sys with yeah. Failed. Loaded. Okay. Enabled. And activating, auto-restart. So basically, it's trying to restart and it cannot restart. Certificates. Of course it is. Oh, man. I told you there would be certificates. Okay. Can we check the configuration directory, /var/lib? What does it say? Certificate? Yep. We're here. Certificate. Kubelet client currently loading. Okay. I see it looks like the certs were okay. Right? I do actually see an error message in the journald log there.

38:51 Can you highlight it? Can you highlight the error message that you saw? No, there isn't one. I I saw the certs and got worried. Run journalctl -x. Run journalctl -u kubelet. Do you wanna type? Sorry, messed up. Yes. Oh yeah, that's You'll need to add --no-pager. Fun. We reached there. It was up for twenty four hours. Okay. kubelet.service failed with exit code. Oh, I need What was the exit --no-pager. Did we? Code dot main process. I wanna see if our API server is okay. It will not be okay because the kubelet

40:28 is the one that starts it. Basically, if you the kubelet No, it's not running. It's restarting. If you look at the you look at systemctl status. So basically, there is an error in the configuration file, most likely. If you look at the activating, auto-restart, this is your key. It means, in six seconds, It keeps restarting it, but it hits the same problem. And this is one good thing about systemd, that it will go forever. It will until the problem is fixed, hopefully. There you go. This is the everybody. The one good thing of

41:10 It's the IT Crowd principle in action. Restart it. Restart it until it's over. So do we take a look at the kubelet static manifest? Yeah. Yeah. And not not it's in the /var/lib. Yeah. You see it? /var/lib/kubelet. No. /var/lib/kubelet. ls config.yaml. Okay. Now, cluster DNS, cluster local. One thing we can do, we can grab the config from the other one and try here. Did we actually see an error message here? There is an error message, but I'm not sure. Let's see if we can. I couldn't find the code. What code was

41:57 it? It says failed. Okay, another way we can look at the Wow. That's a lot of Yes. There's a log file also. I mean, other than journalctl where we can do grep against has been deprecated. Ah, here. We need to remove this flag. I don't think that would stop the kubelet starting, would it? Let's see. What does it say? The parameter should be set by the config flag. Let's see these flags. Are they there or not? He says it's right in front of us. So we're looking right at it. Show the result. Exit code. Main process.

42:53 Yeah. It's always right in front of us. It's mains. Main process exited here. Okay. So let's let's see if that flag is here. Eviction. Eviction. Eviction. Okay. Can you comment this out? Or make a copy of the file? That's not the same one though, right? Grep. Eviction. Yeah. That's not the same one. Eviction hard has been deprecated. This okay. Maybe okay. Let's find where is the kubelet, grep, grep config.yaml on this one. What does this message mean? CPU Can you do where? Sorry. We've got a cgroups-per-qos enabled, but cgroup root is not specified,

43:54 using slash. Can you highlight it? Yeah. And we've got the swap error too. Swap error. If this is a kubeadm cluster? Yeah. We should just turn swap off. Right? Yes. If it's kubeadm, it should be off. What do we have, swapoff? Minus a, I guess. Let's see if that restarts our kubelet. But I want to see the systemctl. I just want to see if the configuration file is in a different place than normal. So you see here. The kubelet is running. Oh, okay. So just swap. That's really silly, no? We complicated it.
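
Checking for active swap is quick to automate across nodes. As a minimal sketch, the function below reads text in the format of `/proc/swaps` on Linux and returns the active swap devices; an empty list means swap is off. The function name and the sample contents are hypothetical.

```python
# Minimal sketch: list active swap devices from /proc/swaps-style text.
# SAMPLE_ON and SAMPLE_OFF are illustrative contents, not taken from the nodes.

SAMPLE_ON = """\
Filename                                Type            Size            Used            Priority
/swapfile                               file            2097148         0               -2
"""

SAMPLE_OFF = """\
Filename                                Type            Size            Used            Priority
"""


def active_swap(proc_swaps_text: str) -> list:
    """Return swap device or file names; the first line is the header."""
    lines = proc_swaps_text.strip().splitlines()
    return [ln.split()[0] for ln in lines[1:] if ln.strip()]


if __name__ == "__main__":
    print(active_swap(SAMPLE_ON))   # swap in use: kubelet may refuse to start
    print(active_swap(SAMPLE_OFF))  # no swap devices
```

On a real node the input could be `open("/proc/swaps").read()`. Note that `swapoff -a` only lasts until the next reboot if a swap entry remains in `/etc/fstab`.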

44:39 I think we we went down a rabbit hole there. Okay. So let's So the second principle is keep it simple, I think. And don't jump to conclusions right away. Alright. Let's see if we got an API server at least. So what does he say? Ah, okay. Please disable swap. Who's that? Oh, I mean. Alright. So we're I mean, our API server is still broken, so bring them back on the node. No. No. Can you go and check if the pods start running now? Check if what's running? Can we see on the the master node,

45:16 can we see the kubelet has started the pods? Okay. Can we do docker ps? There is no Docker, so basically, crictl ps. Oh, there is no okay. Can we restart the containerd? System. Good. Good. Good. Can we check the status of the containerd? It should show that. Looks alright. Okay. Yep. Okay. So it was a container. Okay. API server's been running. Can you try crictl ps? So basically, crictl ps will show us the container images. I mean, the pod is running. Process so can you do systemctl status,

46:30 the containerd? Let's see again. Yeah. Status, containerd. Good. Okay. So active, running, and basically looks good. Can we do journalctl minus u? Journalctl minus u and containerd. No pager, and maybe follow, minus f, also. So you want to try from the other tab on tmux to check with kubectl? It should work now. It doesn't. But It doesn't? He did joke about iptables rules at the start. Yeah. Yeah. The problem with iptables rules is that there are too many. KUBE-FIREWALL, anywhere, anywhere. So we got Yeah. We've got this.

47:48 Which one? Nothing. Nothing yet. Start looking. That's oh. You did the main part is startup. I hate iptables. Let's Okay. Let's Okay. Is it iptables, systemctl? And let's check if firewalld is on. What kind of system is it? Status? It's Ubuntu, so UFW. Okay. And we've all the rules We could see the allow rule for port 2222. Let's just go through the list. Let's see. Okay. Let's try to find the Should we just add a rule just to allow that? No. I want to check 6443. Can I do a slash 6443?

49:25 Okay. I shouldn't actually. I don't think we'll have a rule to allow that. I'm assuming we've got some very heavy handed dropping going on here. So why don't we just add a rule to allow that port and then see. But why okay. Can we do a trace against a trace? Where did our API server go? That was running a minute ago. So somebody is asking, Daniel, he's saying disable UFW. Is UFW like on top of iptables? I'm not sure about that. Yes. We're looking at the rules for it right now and I could disable it, but

50:11 I thought that would be a bit heavy handed too. Figured we could just allow the port, but sure. Why not? Since we are, I mean, I guess we don't need to fix this gracefully. We just need to fix it. I still don't see an API server in our ps output, which was there a few Can you check the status of the kubelet maybe? Yeah. Yeah. It's happy. It is happy. crictl pods still bad. Yeah. Something wrong with that. crictl ps. Can we do a strace against it? See where it's stuck.

50:58 Oh, you don't have strace. We can fix that. Something I know how to do. There we go. Okay. So it's trying to do Futex, open. No such file or directory. Use that. Failed to connect. Failed to connect. So still, we have some network issue. Contact the okay. You mentioned etcd. Correct? Yeah. The container would use etcd. Right? Why not? Because it doesn't use etcd. I mean, if we were the Kubernetes class for containerd. Yeah. Yeah. But here, this error message, usually you see it on failed to connect, context deadline exceeded. I see it on etcd when the certificates

51:55 are wrong, but I might be wrong in this case. Okay. Let's take a look at Kubernetes. So the systemd file is in /lib/systemd/system, then containerd. That's useless. Okay. I would look at the logs first. Okay. Returns looks like containerd is doing its job. It's creating containers. Yep. Do you have a Docker client here? No. We can install it though. No. No. You don't need to install. I mean, crictl should be doing the job for It's not a WW. Okay. So we know that crictl cannot talk, and usually it talks to a socket.

53:23 Unless so This is a hard one, Dan. Thanks a lot. I said this is a hard one, Dan. Thanks a lot. It's bugging me that when I grepped for API server previously, it was showing. No, it's not that I'm not getting past that, hold on. Let's try let's restart our kubelet system. Something's not right here. I mean, many things are probably not right here, but something specifically here is not right. Okay. Why are these manifests not started? Do we have etcd? Right. Okay. We do. Do we have controller? We don't have a controller. And so we have a scheduler. So there's something

54:34 wrong with API server and controller manager manifests here. Okay. And so basically, it is the Kubernetes manifest. Okay, let's, so any of these started? He suggested we look at the kubelet again. So the kubelet has a configuration where it looks where the manifests are. And it's under /etc. If you go see the me go. Yeah. Go for it. /etc/systemd/system, kubelet, less, and this one. Okay. Now here so environment, kubelet, kubelet args, bootstrap, it's Kubernetes bootstrap-kubelet.conf, kubelet.conf, kubelet config.yaml. Is it not configured to have static manifests? Should we get the docs up on this

55:46 one? Yes. Let me check another cluster. Let me check another cluster. It has a pod manifest path. It's not being passed via kubelet args. You can check the other cluster, by the way. Let's just add it to this. Kubernetes manifest. No. It's not manifest. It's Kubernetes manifests. The environment file basically. No. Am I wrong? What did you do? I didn't see. So in this file, I added a flag to our kubelet to say pod manifest path is /etc/kubernetes/manifests. And it's Make sure the kubelet's actually got that. Okay. That's a pain in it. It has to

57:00 read that. No. You have kube-apiserver up. Oh, I do now. As of ten seconds ago. Okay. Cool. No. Can you but it died. Okay. We can look at the logs, by the way. So the logs let's do this a second. Can you give me control? Yeah. Take it. Okay. Cat kubelet. kube-apiserver. So there's API server struggling to speak. Oh, the 10. It's 18. Admission controller. Yeah. Can't reach etcd. Endpoint. Parsed scheme, endpoint. Parsed scheme, endpoint. ccResolverWrapper. 2379. Yeah. This looks like an etcd thing. Yeah. The context deadline exceeded.

58:17 The port is wrong. That's why 2379 versus 2381. So Okay. Let's We can fix that, right, by going to The manifest again. Yes. Yeah. Okay. Find endpoint. Yeah. That's it. No. You didn't fix it. Did you fix it? Yeah. You are very quick, man. No. It's still 2381. It should be It was 2379, and then I changed it to 2381 because when I grepped on etcd, the port was 2381. It's here. Oh, no. But when you did the grep on etcd just a second. This is somebody is running. Where is it? Yeah. So

59:07 is on port 2381. Part of which process? No? Oh, yeah. Okay. Dennis. So maybe you should fix it from here. Well, which is the valid, the real etcd port? It is to here. Initial cluster, 2379 and 2380. Here it's 2379. You just went on a port random thing and started messing up ports. So this is for the peers, initial cluster. 2379, 2379, 2380 here. This one. This is metrics. This metrics. Listen peer, 2380. Liveness probe. Okay. So let me fix the API server again. Sorry. Because I said okay. So you're saying I've changed that the wrong way around, which

1:00:15 makes sense. So What was it before you changed it? 2379. So let's put it back to that. Did you make a change in the etcd one? Did you change anything here? No. I didn't. Alright. Let's check this kubelet thing again. So kube config. Kube config. What manifest? It is easy to manipulate the manifest. Correct? Config. Well, the config. Runtime. /run/containerd/containerd.sock, eviction hard, memory. Yeah. That's alright. Our API server. Check this one. It has to be the API server. It has to be the API server. Can we check the socket?
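
Editor's sketch: the port mix-up above (client traffic on 2379, peers on 2380, metrics on 2381) is the sort of thing a small script can catch by comparing two manifests. This is a minimal Python illustration, not what was run on stream. It assumes the API server manifest carries `--etcd-servers` and the etcd manifest carries `--listen-client-urls`; the helper names and the sample manifest fragments, including the `10.0.0.1` address, are hypothetical.

```python
import re
from urllib.parse import urlparse


def flag_urls(manifest_text: str, flag: str) -> list:
    """Collect the comma-separated URLs passed as --flag=value in a manifest."""
    match = re.search(rf"--{re.escape(flag)}=(\S+)", manifest_text)
    return match.group(1).split(",") if match else []


def etcd_port_mismatch(apiserver_manifest: str, etcd_manifest: str) -> set:
    """Ports the API server dials that etcd does not serve client traffic on."""
    wanted = {urlparse(u).port for u in flag_urls(apiserver_manifest, "etcd-servers")}
    served = {urlparse(u).port for u in flag_urls(etcd_manifest, "listen-client-urls")}
    return wanted - served


# Hypothetical manifest fragments; addresses are made up for illustration.
APISERVER = "    - --etcd-servers=https://127.0.0.1:2381\n"
ETCD = (
    "    - --listen-client-urls=https://127.0.0.1:2379,https://10.0.0.1:2379\n"
    "    - --listen-peer-urls=https://10.0.0.1:2380\n"
    "    - --listen-metrics-urls=http://127.0.0.1:2381\n"
)

print(etcd_port_mismatch(APISERVER, ETCD))
```

A non-empty result means the API server is pointed at a port that etcd does not use for client connections, which matches the 2381 versus 2379 confusion in the conversation.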

1:01:19 Yeah. It is there. So why is crictl ps not working? Hold on. Can you look at the, the minute there's two gig of memory available, we're gonna start evicting stuff. Oh, yeah. That doesn't seem right. Can we check the try Check the kubelet Where was that kubelet thing again? Do you wanna jump back into that? I can't remember where that was. The kubelet. Okay. It's under it's in your system. System, kubelet, and there are two, by the way. There is the unit, which is in a different place. Yeah. And this is the override for the

1:02:33 unit. So do we have to have no. There's no eviction here, is there? No. So maybe we need to go to the so see that in this flags file or it might be in the unit file. The journalctl would show us that. Okay. Can you go again? The beam can description, documentation on software. You're really fast, man. I'm sorry. I know, I really so it's not in the unit file. It wasn't in the kubeadm flags. It wasn't in the config that you found. So where else is it going to Bootstrap config. It's the can we check the Bootstrap the this one?

1:03:16 This file? Yeah. No. There is no file bootstrap. So do we need this file? So he's passing a flag to the kubelet. But do we need this file to bootstrap? No. I don't think we do. I think Where was the can you go back to that other kubeadm config you had, please? This one? No. Which one? We are in the current directory. I could have just done cd dash, probably that would have worked. Do cd dash. Where are we? Yes. Strange. Why I don't see it. Yeah. Hold on. Can I type?

1:04:44 Yeah. Okay. Yes. Yeah. This default should put it. Eviction. That's it. You found it. Remove the call. Comment the call. Sorry. I'm just reading that and laughing. Alright. Let's do daemon-reload. Restart. No, that's so the API server was running, it was disappearing. I'm glad I didn't make that up when I was doing that ps. So now we wanna test the API server again. Okay. Back on the node. No. He didn't. We definitely have an API server now. So I'm gonna test it with the admin.conf. Okay. So let's confirm that this is the API server's

1:06:08 running on 6443. How many problems did you put in this thing then? Really? 443. 4 0. Correct. And we've got is that IP And you can check using let's let's say also. Yeah. Go for it. Oops. The API server is not It's not Is it down again? Well Did you restart it? Yeah. Was running. Yeah. Search for API. Okay. API server. Okay. Advertise address. Is this the correct address? Let's check. There's not really anything out. 75180. And we've got an elastic IP on the bond of 8045. What was the netstat output?

1:07:35 Sorry, the ps output. Yeah. Oh, it's down again. It's gone. Let's check the logs. Log containers. Kubernetes. Maybe I k. We need the last log. So we need the last log. You want me to copy and paste all that? Oh, happened? The other section. Okay. It's just too weird. I'm still my That's q. Yeah. Cluster002. Tab tab. Okay. V18. 32 log. Yeah? Yeah? 32. Oh, 33. Let's do 33. Context deadline exceeded again. Keyword. Edit. Loaded. Plugin. Loaded. Parsed scheme. Endpoint. CC server. Can we check 2379? Yeah. Endpoint. Context So it was trying to speak to etcd

1:09:36 on our local Yes. The thing is my problem is Yeah. We're talking My problem I need I want to know why this is not working. I want to know why crictl is not working. Should I So we have two big problems right now. One, the API server keeps rebooting because it can't speak to etcd, and we have no way to interact with the containerd process. Yes. So if crictl is working, I will actually use etcdctl from the etcd container to check etcd health because this looks like etcd is not responding on time. We could

1:10:33 try nsenter and jump into the etcd namespace. What process number is it? 7553. What? Is that rewritten as well? Oh. Let's see if we can catch it before it goes away. So we want let's see if it's still there. Yeah. It's a key enterprise client, 2379. Server cert, client certification directory. Let me grab my etcd cheat sheet. Where is it? No. No. It looks fine. I went through the manifest and it was a server key. It's in client. I think he said earlier, etcd wasn't touched. Yeah. The problem is in etcd. So

1:11:51 okay. Can I do we have etcdctl here? Just want to check. No. We definitely install it. Yeah. That's why I was gonna try and do an nsenter into its namespaces so that we could just pretend we're in it even though we're not. I want to check why crictl is not working. Because Dan hates us. Okay. I'm trying to check which file it opens. Failed to connect. Dan's given us a hint, of course. The etcd cluster won't be healthy because of the firewall, probably on all the control plane nodes, so we'd probably have to disable it

1:12:48 on them all. Oh, yes. Because we did only one node. Yes. Yes. Yes. Yes. So you have to log in to each node then. That's why it's useful to have a parallel SSH like pdsh. Yeah. I wonder if he added default things to them all. Of course he did. And this one. I'm not sure what failure scenario Dan had in his head here. I'm assuming our config management was so messed up that we rolled out this change to everything. Okay. And stop. UFW, and just because I wanna know. Maybe I need to just

1:13:57 tackle things a little bit here. Just encourage it and we'll do that for the previous node as well. I'm not sure if it actually reloads the default params or if they get cached, so Okay. Alright. Let's just jump back onto node one and see if we can see what's happening. Okay. Where are we? API server. No. So etcd might just need a minute, I guess. Just a second, system. I want to check if he played with other network stuff. Minus a. Grep. Forward. Sysctl. What? ipv4.ip_forward. That's

1:15:24 no. It's okay. So I didn't play with that one. Disable the firewall on the other nodes. I just did the control plane nodes. I don't think I would need to do the worker nodes. Feel free to correct me. You don't need to do the worker nodes because basically what we wanted is the quorum for etcd. So etcd is like, every now and then, they will check who's the leader. Also, the scheduler and the basically, they will do a leadership election. Let me just confirm. Good tip there from a different Yeah. I didn't know that.
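
Editor's sketch: the quorum point made here is simple arithmetic. etcd needs a strict majority of its members to be able to reach each other, so a firewall that isolates control plane nodes can leave the cluster without a leader. The Python below is a minimal illustration; the member count of three is an assumption for the example, since the stream does not state how many control plane nodes the cluster had.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True when enough members can talk to each other to elect a leader."""
    return reachable >= quorum(members)


# Assumed cluster size for illustration only.
MEMBERS = 3
for reachable in range(MEMBERS + 1):
    print(f"{reachable}/{MEMBERS} reachable -> quorum: {has_quorum(MEMBERS, reachable)}")
```

With three members, fixing the firewall on a single node is not enough: at least two members have to be able to reach each other before the API server can get answers from etcd.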

1:16:31 So I bet you I need to start it. You can do iptables flush and then restart the kubelet. So iptables minus capital F. Yep. Already done. Let's get the other one done. So we disable, flush, restart. And back onto our first control plane node. Can you do crictl ps? Because I think this is what was stopping my crictl from okay. I'm with minus flash. Good. You start. Okay. Type impressions. I love the duplicate. Yes. No. Do we need to restart also the containerd? No. Here, I'll restart the whole thing if I need to.

1:17:48 Like you said, just turn it off and on again. Yeah. I don't know why is that. Alright. We have an API server. I'm gonna try that didn't reject it. Alright. Host. Do we still have an API server? No. K. Okay. Let's go back one step. Alright. Well, we have ten minutes left. So let's oh, Dan, I really hate you. He turned on swap on all the nodes, of course he did. Right. So you disabled swap on all the nodes. Correct? No. I just disabled one node. I'm a cowboy. So that means swapoff

1:19:01 Smiley isn't it? Restart kubelet. Dart mode. Okay. Swapoff. That's a good thing for me to remember for future episodes of this, since people are cruel and will break all the nodes and not just one of them. I think Dan is cruel. Okay. Let's jump back onto the first node. We'll try a couple more things and then we'll just ask Dan to maybe we'll just get this working before we wrap up for today. So let's see what happens with yeah. Let's try get nodes. No, we're still broken. Do we have an API server? No.

1:20:04 So maybe etcd has just not got a quorum yet, which would mean that the server wouldn't be able to start. Okay. Okay. Should we be able to check the server logs to see for a different error, I guess? The so the etcd logs are in the same place. /var/log/containers, etcd there. So if I cannot get to the container, at least I can get its logs. Okay. The /var/log/containers. Right? Yes. etcd. No. It would be just the and it It's a scheduler. And let's grep the etcd. Yeah. I don't know why it's not all

1:20:44 complete. Yeah. So let's get one of these. Let's, for example, this one this one. Did we get the right one? I hope so. Oh, sorry. You type. Okay. No such file. Yeah. I think it's only pasted half of it. Well, thanks for that vote of confidence, Dan. He thinks we've fixed it. And the other Daniel, Daniel Louen Lou Ellen has said that we may have just overwritten that log file. We might have what? Overwritten that log. Maybe. But there are two of them. Okay. Why don't we just do cat etcd star? Okay. Let's see if that works. There.

1:22:22 Helpful k? So etcd is not an issue. There was indicating a symlink. Yeah. That's only healthy for thirty seconds. So why don't we test, get nodes? Okay. So we just had to wait for etcd to be happy. Settle. Okay. Can we try our pods? Okay. Can we do that? Okay. That's good. Cluster oh, CrashLoopBackOff. The controller manager. Alright. Node affinity. Don't tell me he did the node affinity also. You want to check the node affinity or do you want to check the controller manager? Controller manager, I think, is the best. Yeah. Okay. So we have one Right. Not a problem.

1:23:34 I never used minus f for logs. What is my oh, minus f, follow. Okay. Yeah. I would do the for a crash loop, I would usually do the previous because the previous will show will terminate in an error if there is an error. But, also, I would love to describe first just to check the reason. And if you grep for reason Reason CrashLoopBackOff. Very helpful indeed. Okay. Okay. So what do we got? I think it might just need to restart it. Quota, quota monitor. Quota monitor. Restart the resource quota. Ah, there is a

1:24:40 resource quota. Can you check OC okay. OC. kubectl get quota. Good. On the kube-system. There is no no resource. Okay. How about network policies? Netpol? Okay. It's not that evil. Yeah. Hold on. Let's find a real error here. But it says I saw some quota messages. Error starting node IPAM. Failed to mark CIDR 192.168.0.0/24, idx, occupied. 1323. Dan thinks these are red herrings. He says it's not him. Okay. They read they read they read the pods and let's see. Yeah. Oh, it looks

1:25:35 like CoreDNS, controller manager. Is one of them running? One of them is no. But it's not ready. It's running, but it's not ready. And we have pending. Why is it pending? Can we check why it's pending? Two seconds. kubectl. My second. Describe. Nope. Okay. You're a terrible person, Dan. Terrible person. Container image. Okay. Already present on machine. Change the and recreate it. kubectl. Get rep controller. So let's jump on to the first node. The first node. Update another location. Warning. Invalid disk capacity. This was six minutes. Can you do kubectl describe node this

1:26:54 node? Cluster. Yeah. And allocated resources. I don't okay. So do you remember can you go up? Let's do less. Okay. Conditions. Network unavailable. Memory is good. Disk pressure false. PID pressure. Ready. So what's the capacity, CPU? 48. So why did I get this this case with there? Okay. You have a metrics server? Why does it say disk capacity is wrong? Okay. Let's run a kubectl events and grep for warning. Oh, yeah. kubectl get the events. Yep. Okay. Why does it say invalid disk capacity? This was fifteen minutes ago, fifteen minutes ago, and it's all in cluster two. Maybe they

1:28:31 forgot cluster two. You didn't remove the eviction and stuff. I thought You got it to press on image file system. On image file system. Yeah. Or two over He said he set the CIDR range to localhost. Testing. Where did the controller manager go? Try again. Oh, if you did you need to restart the kubelet if you change the network because you need to restart all pods. So basically, restarting the kubelet will be actually, restarting the node will be the easiest. But restarting the kubelet might Yeah. We've got a controller manager. He said he did it on the first control

1:30:13 plane. So the problem with network settings, all pods will have the wrong settings, so you have to basically restart them all. When we ah, basically, restarting the container runtime will get you the same effect. Actually, restarting the node would be better. He left again. Oh, you're funny. I think our cluster's healthy now. Right? I'm not going to use it. I'll stay away from this kind of clusters. So we fixed I mean I bet. You can help it. Oh, you have self response. He's probably done that same change on the other nodes. Right? Oh, no. I got the cluster CIDR wrong.
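
Editor's sketch: the node IPAM error read out earlier, about failing to mark a node CIDR as occupied, is consistent with a node's pod CIDR falling outside the cluster CIDR given to the controller manager. The containment check can be written with Python's standard `ipaddress` module. This is a minimal illustration; the function name is hypothetical, and `10.0.0.0/16` is a made-up wrong value, not the one Dan actually set.

```python
import ipaddress


def cidr_inside(cluster_cidr: str, node_cidr: str) -> bool:
    """True when node_cidr is fully contained in cluster_cidr."""
    cluster = ipaddress.ip_network(cluster_cidr)
    node = ipaddress.ip_network(node_cidr)
    # subnet_of raises TypeError across IP versions, so compare versions first.
    return node.version == cluster.version and node.subnet_of(cluster)


NODE_CIDR = "192.168.0.0/24"  # node range quoted in the controller manager error
print(cidr_inside("192.168.0.0/16", NODE_CIDR))  # cluster CIDR mentioned on stream
print(cidr_inside("10.0.0.0/16", NODE_CIDR))     # hypothetical misconfigured value
```

If the check fails for any node that already has a pod CIDR assigned, the controller manager's cluster CIDR setting is the first thing to compare against the value used at install time.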

1:31:39 Oh, 192. What was it? Well, I thought our cluster CIDR was 192.168.0.0/16. Let's see. But maybe it's actually 172.0.0.0/16. Because our pod CIDR yeah. Our pod CIDR would be Yes. Yes. Yes. How are you doing for time while we wrap this up? I guess so. Thanks, Dan. It was fun, at least. So what do you think? How realistic was it? So UFW, I think it could be realistic. I saw it I saw some basically, someone is going to install something on a node. And this is the problem if you have fat nodes, Ubuntu, Red Hat, or something like this, not

1:32:37 like container operating systems like Flatcar or Fedora CoreOS or something like this. So they do an update or they want to install an extra software to test or something like this, and they will add another rule. And they forget that this is, like, a different node and it's specific, especially in a test cluster. This will never happen in production. No? I don't think this will happen in production. Hopefully not, Daniel. Well, that yeah. That was the only problem yeah. We had the firewall, we then had Yeah. The weird eviction rules, no idea how those got there. Yes. We got

1:33:14 the four or 15 terabyte files in the root partition and the root file system. There was the cluster CIDR tweaks across the controller. The swap could happen. The swap, I saw this happen. The swap, yes. For the kubeadm cluster, that's definitely happening. I mean, for managed clusters and for on-premise turnkey clusters like OpenShift or Rancher, I don't need to play with the manifest. I don't need to play with the configurations of the API and the scheduler. Scheduler. The CIDR, I only needed in the beginning of the installations. Well, I think we had some fun.

1:33:57 That was that was really interesting. I think, you know, even just the the way that hopefully the way that we work with the tools, looking at the different components, the logs, system d, where configuration lives, I hope a lot of people find this useful and it helps them debug on Kubernetes clusters in the future. Thank you to Lee and to Dan for their tire fire clusters. Hopefully we fixed Lee's. I'll speak to him afterwards. And I'll be back next week with two more broken clusters. Thank you, Dan. Dan will not be back. Sorry, mate. You're you're out.

1:34:30 Well, thank you for taking some time out of your day joining me today, sharing your knowledge. It was great just to see your thought process and how we work with Kubernetes. Thank you very much. Thank you. Have a great day. Thanks for watching everyone. Yes, that's fine. Bye. Bye.
