Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Identify an etcd version mismatch and fix it by comparing manifests and binary compatibility.
  2. Track kubelet startup failures caused by invalid flags, then repair node boot by updating config and binaries.
  3. Resolve cluster networking and control-plane blockers by checking NetworkPolicy, PodSecurityPolicy, and missing CoreDNS/control-plane pods.

Dan Finneran joins Rawkode to debug two broken Kubernetes clusters, unpicking an etcd version mismatch, a bad kubelet flag, an etcd "no space" alarm and quota, a block-all NetworkPolicy, and a PodSecurityPolicy affecting static pods.

Chapters

  1. 0:00 Holding screen
  2. 1:00 Introductions
  3. 1:11 Introduction and Guest Welcome (Dan Finneran)
  4. 2:56 Recap of Last Week's Klustered Challenges
  5. 6:45 Debugging the Wrong / Working Kluster (Oops)
  6. 7:40 Starting with Jason's Cluster (Cluster 3)
  7. 10:08 Initial Cluster 3 Checks (Workload, Pods)
  8. 16:47 Realization: Cluster 3 is Healthy
  9. 17:52 Switching to Jason's Broken Cluster (Cluster 4)
  10. 17:55 Kluster 004 - Broken by Jason DeTiberus (@detiber)
  11. 18:25 Cluster 4 Initial State: Unready Nodes
  12. 19:01 Debugging Node Issues: Checking Kubelet Status and Logs
  13. 22:17 ETCD and API Server Issues on Node 1
  14. 25:39 Examining ETCD Manifest (Node 1)
  15. 30:41 Identified & Fixed: ETCD Version Mismatch (Node 1)
  16. 33:19 Checking API Server After ETCD Fix Attempt
  17. 36:05 Debugging Worker Node Kubelet Issue
  18. 38:20 Identified: Kubelet Unknown Flag (`bootstrap-kubeconfig`)
  19. 39:52 Examining Kubelet Configuration Files
  20. 40:35 Fixing Kubelet Configuration (Removing Flag)
  21. 50:27 Hint: Check Kubelet Version
  22. 54:50 Identified & Fixed: Old Kubelet Binary Version
  23. 56:23 Cluster 4 Nodes Becoming Ready (After Kubelet Fix)
  24. 57:20 Kluster 005 - Broken by Walid Shaari (@walidshaari)
  25. 58:30 Reading Cluster 5 Readme Hint
  26. 59:55 Cluster 5 Status After Initial Node Fixes
  27. 1:02:05 `kubectl get nodes` Times Out (Investigating ETCD)
  28. 1:03:03 Examining ETCD Logs (Node 1)
  29. 1:03:35 Identified: ETCD "No Space" Alarm
  30. 1:04:09 Examining ETCD Manifest for Volume Configuration
  31. 1:10:13 Using ETCD CTL to Query Status
  32. 1:11:55 ETCD CTL Status Shows Healthy, But Alarm Exists
  33. 1:13:05 Getting ETCD DB Size & Attempting Compaction
  34. 1:22:47 Hint & Examining ETCD Manifest for Quota
  35. 1:24:51 Fixing ETCD Quota Setting
  36. 1:31:11 ETCD Logs Still Show "No Space" Alarm
  37. 1:34:57 Hint: Check ETCD Alarms
  38. 1:36:07 Disarming ETCD Alarms
  39. 1:36:36 ETCD Status Healthy After Alarm Fix
  40. 1:36:50 Cluster 5 Overall Status Check
  41. 1:38:59 Testing Internal Pod Connectivity (Failure)
  42. 1:39:41 Identified: Missing CoreDNS and Control Plane Pods
  43. 1:43:24 Troubleshooting Missing Control Plane Pods (Logs)
  44. 1:51:03 Hint: Check Network Policies
  45. 1:51:35 Identified & Deleted: "Block All" Network Policy
  46. 1:52:31 Hint: Check Pod Security Policies
  47. 1:53:15 Identified & Deleted: PSP Affecting Static Pods
  48. 1:54:38 Final Hint: Check Kernel Settings (Sysctl)
  49. 1:56:01 Conceding Failure on Final Issue
  50. 1:56:18 Conclusion and Wrap-up
Transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:11 Introduction and Guest Welcome (Dan Finneran)

1:11 Hello and welcome to today's episode of Rawkode Live. This is the special Klustered edition, part two. We are going to be attempting to fix some broken Kubernetes clusters. These Kubernetes clusters are broken by friends, colleagues and members of the community. We have two clusters today and, to help me fix them, I am going to be joined by my friend and colleague Dan Finneran. Hey Dan, how's it going? It's going well, thank you very much. I am very excited and slightly nervous for today, but it should be fun. It will definitely be fun; whether it will be successful, we have yet

1:54 to see. To the people watching us and joining us at home, please remember to subscribe and click the bell on YouTube; it helps other people find these videos. We also have a Discord chat which will be visible throughout our session. Feel free to jump in and join us there. Now, why don't we just start with a quick introduction then, Dan? Why don't you tell us a little bit about yourself and then we'll take a look at last week's clusters. Excellent. Yes. So my name is Dan Finneran. I am part of Equinix Metal, part of the

2:24 developer relations engineering team. I spend a lot of my time in Kubernetes clusters or in the container space. I came to Equinix Metal from Heptio/VMware, and prior to that, I was at Docker. So I'm hoping that through that I've accrued some knowledge that may help me today. I feel a little bit as though this is punishment for being mean last week with breaking my clusters. But, yeah, I'm excited. Well, why don't we quickly talk about that then. So we actually have our first comment from none other than Walid himself, who says, hey Dan. Hey Walid, thank you for joining us again.

2:56 Recap of Last Week's Klustered Challenges

3:06 I'm gonna get you. Well, let's cover what happened last week then. I've got the repository up, so, like, we want this to be a valuable learning resource as well as entertaining for the people that are watching and following along. So what we're gonna try and do week after week, as we air new episodes of Klustered, is publish the manifests that we use to create the clusters using Cluster API on Equinix Metal, but also we're going to try and document what went wrong, what the symptoms were, and then we're also going to have each of the breakers

3:37 just contribute and tell us what actually happened. So Walid joined me last week; Walid and I were on team fix, and our clusters were broken by Lee Briggs of Pulumi and Dan, who is with me now. You'll see there's a readme in each of these. We can see that cluster one from Lee had a small typo in the CNI/Cilium configuration, which broke networking on the cluster. We discovered the symptom was that our Cilium agents were in CrashLoopBackOff. We actually missed that, I'm not sure how; I'm gonna blame tmux, so we're not using it

4:07 today, and scrolling in tmux is just not something I was good at. Anyway, we managed to fix it, so that was good. It took us twenty-five minutes, it was good fun, thank you very much Lee. And then there's the novel that is Dan's cluster. So, discovered symptoms: we worked out, well, first and foremost, SSH wasn't running on the standard port. I mean, that was just a major fuck you right from the start, Dan, come on. I spent ten minutes trying to work out how to get on the thing. Fortunately you were joining us and left us a comment

4:37 to help us out. So we did get on. The port I should have guessed, but again you get these blinkers on when you've got people watching over your shoulder. We had an unresponsive API server; it was in CrashLoopBackOff. We discovered a few contributing factors, which we documented, and then Dan very kindly provided all of the breaks he applied to that cluster. So lots of interesting stuff in there to debug and fix. I went a little bit crazy, I think. A lot of it is kind of issues that I tend to do when I'm kind

5:06 of quickly throwing things up though, like building a machine that by default has swap enabled, turning swap off, installing Kubernetes, and then rebooting. And then the kubelet is basically like, well, I'm not starting because swap's enabled. Yeah. So I kind of reverted back to the sort of issues that I create myself. I think what I loved and hated at the same time about everything that you broke as well is you didn't just do it on the control plane; you actually replicated it to every node in the cluster, which made fixing it all the

5:34 more painful. Yeah. But it was good. Yeah. One of the issues I think was interesting, I guess I hadn't really considered it, was that once etcd got to three nodes as part of its configuration, breaking that meant that etcd would never actually reach quorum. So fixing the first node still left you at a point where nothing was actually working. Yep. Alright. Sorry. But that was last week. Walid, thanks for joining us today. Thanks for joining me last week. It was a whole lot of fun. In fact Walid just said no, don't worry

6:09 Dan. It was fun. Which just means leads at me. I'm curious to see what Walid... Walid is one of our breakers today. So a couple more hi's. Hi Benziden. Nice to see you. Hey Mark, how's it going? Good to see you. Of course we've got Nuno, thanks for joining us. Yeah, awesome. Oh sorry, Frank, I missed you, there we go. There's all the hellos done. Feel free to jump into Discord or use YouTube comments as we go; we will definitely appreciate the help. This is a little difficult at times. Alright, so we're not using tmux today.

6:44 I have been slightly more prepared, and I want us to still have that tmux or tmate functionality where we can actually share a terminal and type together, but with actual decent scrolling. So today we're using Teleport, which means we have an SSH connection open on our first node; I ran ps to make sure it works. Dan has the same on his side. Dan, type hello. You've lost the window. Should have warned you. There we go. So we're gonna try and take it a little bit slower and methodological... I'm Scottish, I think a... Methodically.

6:45 Debugging the Wrong / Working Kluster (Oops)

7:29 Methodically. Thank you. Try and fix this together, take turns, try to follow each other's hypotheses and really explain what we're thinking as we go. So I guess really what we wanna do first, and I'm gonna jump to my local terminal first just because I do have a kubeconfig there, is just run get nodes. I guess, I mean, that's the best way to get started, isn't it? We're on cluster three; this was broken by one of our colleagues, Jason. Does that strike fear into you a little bit? It does. I'm intrigued.

7:40 Starting with Jason's Cluster (Cluster 3)

8:09 I know he is well into the internals of Kubernetes, so I don't know what he may have done, but I guess it's time to try and find out. As you mentioned, I guess, do this methodically, so we'll kind of test from the outside in by poking it from the outside, seeing what works and what doesn't work, and then starting to go layer by layer until we hopefully find out what he's done. Nice. Alright. Let's just tackle a few more and we'll get started. Walid asked if we have etcd backups. That's terrifying.

8:48 Alex is joining us. Anise, you're definitely not too late for hellos, thanks for joining us. And Sandy Blaber, hey. Alright, let's fix some stuff. So one thing I got stuck on last week was that I didn't realize that get componentstatuses was actually deprecated, so this is... well, I'm on the wrong kubeconfig, gave myself a heart attack already. Well, it's not responding anyway. Just a little bit slower. I can't remember where I stuck this cluster actually. I think maybe in Chicago, so there might be a little bit of latency, but Teleport will fix that. Anyway, component

9:25 statuses was a red herring last week. This has been deprecated for a few releases, I've now found out, so we can ignore those errors, and I'm really happy that we even got a response. Anyway, we can even run a get nodes. That's a healthy... This is worrying though; this is where you immediately start tripping yourself up and thinking we're on easy street. Yeah, yeah, definitely got that. So, so that we can type together, I guess the best thing for us to do is just export a kubeconfig here and use the admin token. Mhmm. Make sure we can still get nodes.

10:04 Okay. Okay. So I'll let you type for a little bit. Why don't you do a little bit of exploration and see if we can find out... So these have something running on them, don't they? There is a WordPress workload that runs on all clusters, which we should be able to... I mean, I can port-forward to it locally and browse to it if you want, or you can curl, whatever you feel like, but you should be able to hit that IP on port 80, or I guess you can use the node port there.

10:08 Initial Cluster 3 Checks (Workload, Pods)

10:32 If you wanna do local host and 30314 or do that. Oh, I forgot I configured load balancing. Yeah. So So that's Our workload works. Okay. Well, I put forward to it just for the and see if we can browse to actually in a real visual browser. I want I'm just gonna have a quick look at some other well, let's have a look at actually what's running on there first, I think, and then we'll if that's okay. Yeah. Of course. Go for it. Before we before we start looking into applications, let's let's actually see what's what's kinda running on

11:16 here. I have often wondered if a cluster would automatically fix itself in some way, but I hope that never happens. So we have a Cilium namespace with all of the various components that are required for Cilium. We have the default namespace, where we have our application running. And then in kube-system: API server, various proxies, CoreDNS, the Packet, or Equinix Metal, cloud controller manager. And I'm looking here for things that have rebooted a number of times or have been restarted. There are a few things, but normally we would see things restart, namely during the convergence of all these components

12:01 actually coming up, and all of the Ceph components for our persistent storage. So, I mean, from the outside, everything is kind of looking okay. Did he forget to break our cluster? I'm sure not. I did see him saying that everything should be... I'm sorry. I saw a message from him saying that he had finished breaking everything last night. So I guess the kind of thing is that we're gonna have to determine what's broken here. So how do you want me to go about that? I mean, do you want to

12:43 deploy a new workload on there and see what happens? Or Yeah. Why don't we try and modify the WordPress? Let's try and pretend we're doing an upgrade of our WordPress component. We'll change the tag on that WordPress image and we'll see if anything breaks through there. So is this a deployment or is it just a single? It's a deployment code word pressure. Might come out outside. Oh, there we go. I'm not so bad. It's just the elicit now before we Yeah. Before we forget. And so let's just change that tag, right? Like just take the Apache off and do

13:46 a 4.8. We'll just force that to pull a new image and see what happens. Yes, one second. Yeah, do it though. I'll do that repeatedly as well. So that could be pulling an image, we could describe it and see the event log just to confirm or we can just be patient which is definitely not one of my strong suits. No patience has already expired. I see what's going on. I forgot to play. I apologize. What? It's running. This is confusing. Okay. I wonder what he's done. I wonder what's actually broken here then. So let's confirm our WordPress is working. I

14:47 am gonna do the port forward now and browse to it. Make sure it's hitting my scale. It's getting post. It's happy. I'll let you drive them. Yeah. I'll just do that from, I guess I could do the put forward from Sorry. You know, I'll do it for my machine now. Okay. Get pods for forward. Eighty, eighty, 80. And then I'll jump over to my browser. Well, I haven't installed WordPress yet, but it is working. I'm sure I did that. There's no three. Well, sure. And if I remove this, it's speaking to the database. There's nothing

15:53 obvious. Yes. I shouldn't type here. I wonder what we're missing here then. I mean, we have a working API server, our workload is intact, there's no pods that aren't happy. Yeah. I will try and deploy... I'm gonna quickly deploy NGINX and expose it. Yeah. Go for it. See what happens. So this is from the k8s examples. So we have NGINX deployed, and I am now gonna expose this. You know what? I think we are in the wrong cluster. I bet you Jason's is cluster four and I'm just making an absolute

16:47 Realization: Cluster 3 is Healthy

16:57 arse of this. Well, I'd like to imagine we've done a very good job of debugging our working person profile. Although this hasn't exposed us. Oh, there we go. That might take a little while anyway as it gets an IP address. Okay. I don't know if you want to check whether 004 is the one that we should have been on. Yep. What I'm gonna do is, we are gonna have to jump to tmate for this one then. So I'll send you the link. I've just moved it off screen. We're gonna have to do this one the

17:31 hard way, but that's alright. Well, at least we're warmed up now. Well, yeah. We've shown everyone how to check a working cluster. Jason's probably going, I have no idea what they're doing. I shouldn't bark. So we'll use Teleport for Walid's cluster. So let me export the kubeconfig. You've got the command. Let me know when you're in. We can test it. We type it. Let's run get pods. Are we looking at something that's a bit more broken? We see a broken thing. One second. Okay, I'm in. Alright, we've got a control plane node which is very much unhappy.

18:25 Cluster 4 Initial State: Unready Nodes

18:29 A worker node which is very much unhappy. Our pods at least look okay in the workload namespace. I'll show the SSH command anyway. Oops. I trust y'all. That's a stupid idea, but... and now I need to remember how to do scrolling in tmux. There we go. Alright, so I think we just need to get some control plane nodes online. So how do you wanna start debugging this then? Here's our first symptom: two nodes offline. Do you wanna SSH onto one of them? Do you wanna run some kubectl commands? What do you wanna do first?

19:01 Debugging Node Issues: Checking Kubelet Status and Logs

19:15 I mean, we could do a describe nodes and see if there's any events on the node that is not ready. I think we probably can hop straight onto that node over SSH and see its current state. Alright. Let's take a look at JSC. Alright. I'll let you drive from now on since you've got a plan. Alright. Okay. So what would be the first thing you'd wanna look at when you see that a node is offline or not ready? So, I mean, normally, a node that is no longer

19:59 marked as ready has either been powered off, rebooted, or the kubelet has basically stopped responding. So my first step really would be to check that the kubelet is actually up and running, which it is. Oh, plot thickens. Okay. That's fine. But why is it not ready? I guess we maybe wanna take a look at the kubelet logs. Yes. So... Connection refused. I see a few of those scrolling past. Yeah. Okay. So we basically set that to follow so we can start to see everything that's going on. There are a lot of errors here. Well, the kubelet is

21:10 trying to... so what's the IP address of this box? So it's trying to loop back to itself to speak to the local API server component. So the kubelet's main role is to start a bunch of static pods from static manifests, of which the API server is one, and also to be given instruction by those API servers in order to start things. So let's see what is actually up and running from a Kubernetes perspective, I suppose. So we have the scheduler. We have the controller. We have the kubelet. We have kube-proxy.
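
A minimal sketch, not from the episode, of the first triage step described above: finding which nodes are not reporting Ready. It assumes node data shaped like the output of `kubectl get nodes -o json`; the node names and the `not_ready_nodes` helper are hypothetical.

```python
import json

# Hypothetical sample shaped like `kubectl get nodes -o json`; node names are made up.
SAMPLE = """
{"items": [
  {"metadata": {"name": "control-plane-1"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
  {"metadata": {"name": "control-plane-2"},
   "status": {"conditions": [{"type": "MemoryPressure", "status": "False"},
                             {"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "worker-1"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
"""


def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is missing or not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


if __name__ == "__main__":
    print(not_ready_nodes(json.loads(SAMPLE)))
```

A node with status "Unknown" usually means the kubelet has stopped reporting, which is why the kubelet is the first thing checked on the node itself.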

22:17 ETCD and API Server Issues on Node 1

22:18 The API server isn't starting, though. We could see that in the kubelet logs. And I guess the other component that we need to check is etcd, which also doesn't appear to be running. Now if etcd is not running, it looks to me like that's probably the issue. We kinda saw this before, in that etcd needs to bind to all of the other nodes in order for it to come up successfully, and the API server needs to be able to speak to the etcd server in order for that to work as well. Yep. So... So just my point of view, right, is

23:07 like... Yeah. We can see the kubelet running. We can see the connection refused. I think if we can get past that first, to try and identify the networking issue there... do you think, maybe assuming there's some sort of route misconfiguration or iptables rule blocking it, that would bring the other components online, or is that not a fair assessment? Yeah. That's fair to me. I think the problem here is that etcd can't speak to the other etcd members. But that wouldn't cause a connection refused on the kubelet, would it?

23:42 It will. So the connection you're seeing there is it trying to speak to the API server. The API server can't speak to etcd, so it's cascading failures. I guess, should we get some etcd logs? Yes. So there's etcd. Oh, these are symlinks, aren't they? Uh-oh. And if I'm reading that correctly, has he explicitly downgraded one of the members of the cluster? I think what I'm seeing here is he's actually possibly removed this member. It doesn't appear to have any idea of other members to join. But why... Sorry. On you go. There's also the new Raft

24:54 line, which is the second to last log entry. Okay. So is that line that says cluster cannot be downgraded, current version is lower than determined cluster version, a red herring? I think so, because it's not joining. There's no mention of it trying to join the rest of the cluster. Well, there's a restarting member in cluster with two unique IDs. Okay. So can we take a look at the etcd configuration and see if it is configured to speak to the other two control plane nodes?
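
The "cluster cannot be downgraded" message quoted here reflects a version check: a member refuses to start when its own version is lower than the version the cluster has settled on. A rough sketch of that comparison follows, assuming (as I understand etcd's behaviour) that only major.minor matters; the function names and the version strings in the usage note are illustrative, not taken from the episode.

```python
import re


def major_minor(version):
    """Extract (major, minor) from strings like "3.4.13", "v3.3.10-0" or "3.4"."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def would_be_downgrade(member_version, cluster_version):
    """True when the member binary is older (by major.minor) than the cluster version."""
    return major_minor(member_version) < major_minor(cluster_version)
```

For example, `would_be_downgrade("3.3.10", "3.4")` is true, while a patch-level difference such as `would_be_downgrade("3.4.9", "3.4.13")` is not, which is why an older image tag in one member's manifest is worth checking when this log line appears.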

25:39 Examining ETCD Manifest (Node 1)

25:49 Yes. I mean, we can. I'm already concerned. I've talked myself into a corner here about something that's possibly wrong. But... Well, I don't see any config there. So why don't we take a look at the systemd unit file for etcd? Or the manifest? Yes. Yeah. I keep forgetting the static manifest for etcd. So, red herring here: the date has changed on the etcd manifest compared to all the others. These were all created at the same time. Yeah. So what have you done? Alright. I'm gonna need you to explain every

26:46 argument passed to etcd here. Everybody hold on to your hats. So effectively, this config is generated by kubeadm; it will create these manifests and give it all the information it needs in order for this to come up. Given that we appear to be having some sort of networking or binding issue with things coming up, I presume that something in here is not the way it's meant to be. So what have you done? People are enjoying watching us suffer here, I've gotta tell you that. I think what we could do is possibly look at one

27:40 of the other manifests on one of the other control plane nodes, see what that looks like, and compare the two. Yeah. Do a side by side comparison. Definitely. Yep. Alright. Tell you what, why don't we... Wrong button. I'll just SSH on to the other node. I'll just grab the... oh, that closed that. Okay. Four. Okay. Four. Get nodes, wide. I'll just grab them all. But no. We just want control plane. And we'll just grab one of those working ones. So... How are we doing for time? Plenty. Yeah. Okay. That's fine then.
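
Scrolling two terminals in step is error-prone; a diff does the same comparison mechanically. The following is a sketch only, using Python's standard `difflib`. The two manifest excerpts, the registry name and the image tags are hypothetical stand-ins for the etcd static pod manifests on a broken and a healthy control plane node.

```python
import difflib

# Hypothetical excerpts of two etcd static pod manifests; values are illustrative only.
BROKEN = """\
spec:
  containers:
  - command:
    - etcd
    - --snapshot-count=10000
    image: example.registry/etcd:3.3.0
"""

HEALTHY = """\
spec:
  containers:
  - command:
    - etcd
    - --snapshot-count=10000
    image: example.registry/etcd:3.4.0
"""


def changed_lines(left, right):
    """Return only the removed ("-") and added ("+") lines between two texts."""
    diff = list(difflib.unified_diff(
        left.splitlines(), right.splitlines(), lineterm="", n=0))
    # The first two entries are the "---"/"+++" file headers; "@@" lines are hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


if __name__ == "__main__":
    for line in changed_lines(BROKEN, HEALTHY):
        print(line)
```

On real manifests, per-node values such as advertise and peer URLs will also show up as differences, so the output still needs reading with that in mind.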

28:42 No rush. Talk amongst yourselves. Alright. Well, I mean, if I knew how to do a vertical split rather than a horizontal split, that might have helped us do a side by side comparison, but if you want to just scroll the top one, we'll try and do it roughly at the same time, like synchronized scrolling. Oh no, we can't do that because it's not two cursors, it's one. Alright, okay. So let's see what we've got. Well, the advertise client URLs should be different, right? That's each unique node just doing its thing. Server search

29:17 okay, it's off true. Direct The list of members looks one second. The initial advertised peer URLs, would that be its own URL as well? I guess so. Yes. I believe that's for when it starts up the first time. The initial cluster, the reason why those are different, I believe, is due to what nodes it is aware of when you actually do the kubeadm join. Okay. So the node on the bottom half will have only been aware of itself and whatever was there when it was being added. The one at the top will have been

30:02 the last node to join, so it will have been aware of the first two and itself being the third. That's why there's three there on the initial cluster flag. Keep going down. All ports are correct. Loop back a little bit further, please. Okay. Three dot three snapshot count, and the image is different. Yep. I told you it was the image. I'm just gonna sit here quietly, I think. Okay. Let's at least line them up. Right? So what I saw in the log with that downgraded version thing made me think that

30:41 Identified & Fixed: ETCD Version Mismatch (Node 1)

30:58 because there's obviously some sort of data format change on the disk, it couldn't load the data, so it couldn't join the cluster. That's kinda where... let's try it. Right? Reload... in fact, it's not reload. You shouldn't need to do anything. Oh, yeah. The pods have just come up. Right? Yeah. The kubelet will check the SHAs or the MD5s or whatever and realize that the static pod manifest has changed, at which point it will restart it. Okay. Let's make our lives a little bit simpler here. So get pods in kube-system, grab etcd. Oh, well, I'm on the node that's not

31:42 gonna work in etcd, so that was smart. It may be on its way back up again at this point. Yeah. Looks like it. Let's check the logs and find out if there's anything else wrong with etcd, I suppose. Alright. Those were in /var/log/containers. Right? /var/log/containers? Oh, that's... Yeah. You're right. And then we cat the etcd file. So it's established, blah blah blah, with all of its peers. I see a lot of infos. It looks happier. Yeah. It does look a little bit happier. Yep. Health is okay from the endpoint. It looks like our API server still hasn't

32:31 come up though. Kube API. I'll make sure I've got messed up that. Yeah. Okay. So now we need to work out the API server problem. So in here, it should be the API server logs. Yep. That is failing to connect to localhost, and when did that fail last? Yeah. That was four minutes ago. It's four minutes ago. So that's 12:29. We're at 12:33 at the moment. Okay. Why has that not restarted? Okay. Let's see if we can do a countless containers. Well, that didn't work last week either. Maybe that's just that command. I think that's yeah.

33:19 Checking API Server After ETCD Fix Attempt

33:42 So why is the API server not restarted? Let's make sure it's in the manifests. Yep. There it is. And in theory, that doesn't really look like it's been modified. I mean, he could have been really sneaky and changed the date on that, but you're right. Probably unlikely. Well, yeah. I mean, it could have been very sneaky. That would be very good. That would be very sneaky to do that. Now we're giving people hints for future episodes. But Yes. So if I drive for a few minutes Please please do. Yeah. Go for it. Around. I mean, I don't feel like there's gonna

34:28 be any major issues with this, but let's check that etcd is actually listening where we expect it to be listening. Well, it's listening on 2379, and something is looping back to it as well. Now the API server has come up. What we've been... Well, so, hold on. Maybe we were just a bit quick then. Like, how long would it take for a member in the etcd cluster to really join the quorum and get healthy again? Were we just impatient? Etcd can take a little while, but there's also the kubelet, which may... I don't know off the top

35:23 of my head, but maybe that has some level of crash back-off before it actually comes up. I think there's no doubt it's ready. I ran get nodes. Looks like we just wanna jump on to worker or TW seven. Okay. Let's see if we can get this one healthy then. So yeah. We confident this machine's okay? Yeah. Well, etcd has rejoined the quorum. The API server is looping back to the etcd instance and reporting itself as ready. Alright. So from my perspective, this node looks okay. We've yet to try and run workload on it. So I mean, we could be, we

36:04 could be back here looking through logs at any moment. We'll see in a minute. Let's get this one. Okay. So this node is unready. Now, I guess very much like the first node, what we wanna do is check that the core components are running. So we're gonna look for the API server. No. We're looking for... No. No. It's not API server. No. We have a kubelet. Alright. Okay. So, oh, no. We don't have a kubelet. We have a CSI node driver registrar. Okay. So this one has nothing. So the kubelet normally runs as a systemd unit

36:05 Debugging Worker Node Kubelet Issue

36:45 or a systemd service. So the first thing really would be to check whether that actually exists. Okay. Oh. It's a worker. There'll be no admin.conf on here. Of course, silly me. Alright. So you wanted to check if there was a systemd kubelet. So there we go. We have some errors. So there's something very wrong with the config for this kubelet, by the looks of things. There's about a terabyte of logs flying past. It looks like something has broken the flags on this, so that instead of it working, it's error

37:43 printing out all of the flags, as in: you've mistyped one of the flags, here are all of the flags, work out what you've done wrong. So I think what we could do with is being able to scroll around in these logs to work out where... Why don't we just try stopping the service and running the kubelet command manually to try and... Well, so at the top of all of this output should be the line that it's finding wrong, as in we passed the wrong flag. Okay. Let's take a look. Yeah. You're gonna need that --no-pager or... Here we go.
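
When a bad flag makes a process dump its whole usage text, the useful line is the one before the dump. A small sketch of pulling that line out programmatically, rather than scrolling for it, is below. The log excerpt is made up for illustration, and the exact "unknown flag:" wording is an assumption about the kubelet's output that may vary by version.

```python
import re

# Hypothetical log excerpt; the wording of real kubelet output may differ by version.
SAMPLE_LOG = """\
kubelet[1234]: Error: unknown flag: --bootstrap-kubeconfig
kubelet[1234]: Usage:
kubelet[1234]:   kubelet [flags]
kubelet[1234]:       --address ip   The IP address for the Kubelet to serve on
"""

# Assumed message shape: "unknown flag: --some-flag-name".
UNKNOWN_FLAG = re.compile(r"unknown flag: (--[\w-]+)")


def first_unknown_flag(log_text):
    """Return the first flag reported as unknown, or None if no such line exists."""
    for line in log_text.splitlines():
        match = UNKNOWN_FLAG.search(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(first_unknown_flag(SAMPLE_LOG))
```

The same idea works from a shell by searching the journal output for the error text instead of paging through the full flag listing.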

38:20 Identified: Kubelet Unknown Flag (`bootstrap-kubeconfig`)

38:39 Unknown flag: bootstrap-kubeconfig. So this is the issue. Oh, this appears to be the issue. We are passing a flag that it doesn't understand. Right. Well, I guess we should go take a look at that unit file. So this is either going to be in the kubelet config, since you can pass additional flags in some of the files that the kubelet config imports, or it will be in the systemd unit files, as you mentioned. So we will start with... it's in systemd. It is. So this is the config that systemd uses to start it to begin with. Hold

39:30 on a minute. I can't let that slide by. More? Who uses more? I do, because I want more. And yeah. No. I should use less. But... Sorry. I didn't mean to ruin your train of thought there. No. That's fine. So we can see here there is the --bootstrap-kubeconfig. The line at the top... oops. My crazy mouse has just gone a bit crazy there. It appears to be this line that's causing the problem. The KUBELET_KUBECONFIG_ARGS? Yes. Where we are passing that

39:52 Examining Kubelet Configuration Files

40:09 --bootstrap-kubeconfig. Now I don't even think there's anything in the bootstrap-kubelet.conf anyway. There normally isn't. Am I gonna now have to live with the fact that I use more? I'll be mocked for it. I haven't used more since, you know, 2002, 2003. So I mean, I'm all for just basically removing the --bootstrap-kubeconfig, because there's nothing even in that file to read. However... Let's try it. I mean, we do have a kubeconfig parameter after it. So the bootstrap config, I'm not really sure. You know? Yeah. I think you're right. Let's

40:35 Fixing Kubelet Configuration (Removing Flag)

40:52 let's remove that and then see where we are after that. I mean, again, we could go away and look at what is on the other nodes, just whilst we're in here as well. Okay. I mean, the bootstrap flag was the one that I had an issue with. And we can... oops. I mean, if that worked, systemd would have attempted to restart it by now. Okay. Do we need to do anything for systemd to pick those up, or should it just? No. We didn't modify the unit files, so we don't need to do a reload.

41:49 Okay. We can just... It would appear it's still a little unhappy. Yep. We're not getting a kubelet. Let's jump back into the pager. Invalid argument. My tmux-fu is particularly poor. So it seems I don't see an error. Pairs that describe resources reserved for non-Kubernetes components. That's just output from -h. We don't appear to be getting much. It might have been silly. We need to get to the top, because the problem is that when the kubelet fails with a wrong argument, it prints out a mega list of all the additional flags that it could be.
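Editor's sketch: the point being made here is that the one useful line (the flag the kubelet rejected) is buried under a full dump of usage text. A minimal way to pull it out of captured output might look like the following. The sample text is hypothetical, modelled only on the "unknown flag" message read aloud in the session, and the function name is an assumption for illustration.

```python
import re


def find_unknown_flags(log_text: str) -> list[str]:
    """Return every flag named in an 'unknown flag' message, in order of appearance."""
    # Matches messages such as "unknown flag: --bootstrap-kubeconfig".
    pattern = re.compile(r"unknown flag:\s*(--[\w-]+)")
    return pattern.findall(log_text)


# Hypothetical output: one error line followed by the start of the usage dump.
sample_output = """\
Error: unknown flag: --bootstrap-kubeconfig
Usage:
  kubelet [flags]
      --address ip   The IP address for the Kubelet to serve on
"""

print(find_unknown_flags(sample_output))
```

Filtering for the error line first avoids scrolling back through the whole help text in a pager.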

43:07 See, I thought that yellow line meant that's the most recent restart, and I thought that would just show me what I wanted. There we go. Okay. So now the shift key took it away. I was a little worried there. There we go. Oh, it skips out the bits we need to see, if that's helpful. Yeah. I'm gonna stop it. Right? So, I mean, I hate fighting with journald, so what I'm gonna suggest is we do this: we cat that, we run kubelet with kubeconfig equals /etc/kubernetes/kubelet.conf, config equals /var.

44:05 I can work with. I see. Okay. So it couldn't parse that kubelet config. I don't know what happened there either. And it cannot connect to the Docker daemon, and we aren't running Docker, so that makes sense. Yeah. That's where the error is. So if we look to the left, we can see there's a W for warning, I for info, and E for error, which is basically the reason why it's not running. I wonder whether someone has changed this config to use Docker instead of containerd. Alright. I guess we should take a look at that then.

44:49 So is that in this file here, kubelet.conf? No. That's for the cert. So it's the config here? config.yaml. Yeah. Normally in here you would find the... what was it in here? I'm assuming the lack of CRI configuration maybe defaults to Docker on this version of Kubernetes. So maybe we need to actually provide the configuration for that. Should we pull up another working worker node? Yeah. Start with the quickest and easiest way. I don't think there's gonna be much in here that's actually node specific. Alright. Grab one of these.

45:41 One four five. Put one to the broken one, David. And we just really wanna compare this to... Yep. This config. I'll just cat config.yaml. Yeah. Don't know what happened there. So I don't see anything that says containerd on it, socket or... Yeah. I'm not seeing any socket anywhere. Maybe it's not this file. In fact, we can use that wee trick that you've now taught everybody, which is: has this file even changed? No. No. Alright. So the kubelet can be modified with additional configuration by passing parameters in the unit file. Did we maybe miss something in there?

46:42 Yeah. I think we'll open up the unit file one more time and see what else it imports for config paths. Where was that for systemd? Alright. Not that one, so it's this one. There we go. So it could be the kubeadm flags or the default file. That's all we've got. Yep. Okay. Farnib. He's definitely giving us the runaround. Is that right? Remote, and is that socket path correct? Well, the easiest way is to quickly compare it with the other node. So if we pop over to the other pane and look at the same one.

47:55 They are the same. They are the same. So one other thing to check is whether he's actually changed the path of containerd as well. So whilst that's open, we should do an ls on the path, /run/containerd/containerd.sock. No. It's run. It's /run/containerd. containerd, containerd.sock. Okay. That's it. So there was that one other file, wasn't there, that could provide overrides? /etc/default. Doesn't exist. Okay. That's interesting then. Yes. So as far as we are concerned,

48:55 our kubelet is configured correctly to speak to containerd, but for some weird reason it is trying to speak to Docker. I'm gonna close this other one for now. Does the kubelet fall back to alternative runtimes if the configured runtime fails? Like, in my head I'm thinking: are we missing somewhere the config has been overridden to tell it to use Docker, or is it doing that as a fallback because it can't speak to containerd? Do we know? I'm not sure. I'm okay driving for a second. Should I not have one more quick look

49:29 at the kubelet logs. Please do. I'm ready to leave. Everyone, I'm using less. So, okay. So bootstrap config, we know that doesn't work. Let's set the timer. How do I get to the end here so I can go up? We've had a few comments. Nick suggests our kubelet version may have changed. Alex is saying, could it be permissions on the socket? Both things we should probably confirm. Yeah, definitely. One sec. Let me try to get to a recent log. There are so many logs in here. It's really hard to find them. Yeah. Why don't we just run it manually the

50:27 Hint: Check Kubelet Version

50:59 way we did earlier? Just Ctrl-R the kubelet command and then just get the output from that. That'll be a bit easier. One second. So I was looking for the bootstrap one. When did we change that? That was quite a while ago, wasn't it? Well, yeah. The file didn't exist, so we removed the flag. So let's go to five minutes ago. That isn't it. So, a good point someone's just made: couldn't we just rejoin the node? We could just use kubeadm to rejoin. We could. But it would be nice to dig in and

51:32 fix it. So it's still complaining. Let's move to... what's the time now? 12:51. So what are you looking for? I am just flipping forward through the logs to a more recent... oh, so this has not been restarted since 12:44. Oh, I stopped the kubelet. Stop. To run it manually. Sorry. Cool. No, that's fine. No. It wants a reload. Do we need to, with the new config? Daemon. Oh, there we go. Silly me. And start. Yeah. I wonder if we fixed it and then just forgot to start it there. I forgot to start it, as I look shifty.

52:45 No. We still have no... so we can grab the logs one more time. There is a last flag, isn't there? I keep forgetting. Scroll. Let's see if we can find the start this time. This is 2AM. These aren't the right logs. I think it's last. Shift-G, let it jump down. There we go. Okay. What is it complaining about? Config has been deprecated. Unknown cloud provider. Let's put on our no-pager. If anyone in the chat knows how to just get the last 50 lines... I have a bit of a brain blank at

53:48 the moment and I just can't remember. Oh, dear. I'm gonna Google it. journalctl last... yeah. There we go. Last 100 lines will do. -n, okay. Not super responsive to my Ctrl-C, but hopefully that's just a standard out flush. Alright. Okay. -n 100, please. There we go. Use pod manifest... that unknown cloud provider, external, worries me more than the config being deprecated. Let's check the version. Oh, 1.4.7. This is a 1.20 cluster. Yes. I mean, in the real world, I'm not entirely sure how many people would find a

54:50 Identified & Fixed: Old Kubelet Binary Version

55:08 version of kubelet that's, like, three years old. You happy with that? I'm very happy with it. We actually had a 1.19.4 available. Maybe just overwrite the binary. I'll put it back to the version that it thinks is installed. So I think he's gone out of his way to find a really old kubelet binary and just dropped it on the machine. Yeah. I mean, the error that you were getting there is, I guess, because cloud providers weren't even a thing back then. But, yeah, effectively, that was mean. I like that. I mean, congratulations. That was mean.
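Editor's sketch: the root cause here was a kubelet binary far older than the rest of the cluster, which nobody thought to check. A rough version-skew check might look like this. The version strings are approximate readings from the session, and the allowed skew of two minor versions is an assumption (the supported skew has varied between Kubernetes releases), so it is exposed as a parameter.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Parse 'v1.20.0' or '1.20.0' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def kubelet_skew_ok(kubelet: str, apiserver: str, max_minor_skew: int = 2) -> bool:
    """True if the kubelet is no newer than the API server and at most
    max_minor_skew minor versions behind it (assumed policy, see lead-in)."""
    k_major, k_minor = parse_major_minor(kubelet)
    a_major, a_minor = parse_major_minor(apiserver)
    if k_major != a_major:
        return False
    return 0 <= a_minor - k_minor <= max_minor_skew


# Approximate versions as heard in the session.
print(kubelet_skew_ok("v1.4.7", "v1.20.0"))   # False: the ancient binary
print(kubelet_skew_ok("v1.19.4", "v1.20.0"))  # True: the replacement binary
```

Comparing the version reported by each node's kubelet against the control plane is a cheap first check before digging through configuration files.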

56:00 That's just my alias that I got wrong. I don't actually know. I can't work out how to... You won't be able to; there's no admin.conf for anything. I need to go to one of the other ones. Alrighty. So we pop back to the control plane or your local terminal. Ta-da. Okay. That was a good one. I like that. That was good. Well done. That was interesting. I mean, yeah, it can happen. It's just, after watching last week, in my head I was like, don't get dragged into looking in

56:23 Cluster 5 Nodes Becoming Ready (After Kubelet Fix)

56:45 the wrong place. Don't get dragged down the rabbit holes. And that's exactly what occurred right here. So actually, I missed Nick's comment, but he called it earlier: I reckon it's an old kubelet binary. Well done, mate. Alright. Well, I mean, maybe I should just sit and read the YouTube comments and do this much quicker. Yeah. I never thought to check the version. You obviously never thought to check the version. I suppose, why would I though? You know, Cluster API installs all this. It all installs the same version. Why?

57:20 Kluster 005 - Broken by Walid Shaari (@walidshaari)

57:22 Okay. Nice work, Jason. Let's jump over to Walid's cluster. That was Jason's cluster, was it? Yes. Okay. We got there. We did. Doesn't matter. I feel like, yeah, maybe I should have checked the version correctly, but we did go slow and we talked and we thought about it. I guess it just never occurred to us, and there was the etcd issue as well, which we fixed. A couple of really smart issues there. Alright, well, Walid's cluster. Let's try Teleport this time so we don't

58:04 have to tmux and I don't have to remember all those weird Emacs-style shortcuts. Okay. Well, let's just jump on to our control plane node and you should be able to join the session. I will zoom in on the terminal. Yeah, I think that looks alright. Now, Walid told me up front there's potentially a readme available. He didn't tell me where. I know you're in the chat, Walid; if you wanna save me a few clicks, feel free to tell us where the readme is. What I'll do first is set up my alias. Have you joined the session yet? Okay.

58:30 Reading Cluster 5 Readme Hint

58:45 I am. I just need to get my authenticator access up and running. I'm now in. Well then, what have we got? Oh, I've got a broken cluster. Alright. Like, he says it's not on this control plane node. I'm gonna pop it open. There we go. We do have a readme. Hi, David. Thank you for giving me and the auditor a chance to check cluster 005. As others say, things are not always as they seem; the first appearance deceives many. Also, a user passed by and reported pods cannot access the API server. She was testing

59:29 from inside a pod to verify with this command. Best of luck. I love that you set a scene, thanks. I like that, set the mood. Yeah, let's get our alias on this machine as well. And I think the get nodes failed on that other machine, didn't it? So it's probably gonna fail from here too. So I guess our starting steps are just gonna be the same, right? We wanna check for the core components running on this machine and see if we can identify some really obvious things up front. So, kubelet. You didn't grep.

59:55 Cluster 5 Status After Initial Node Fixes

1:00:13 We do have a kubelet. We do have an API server, and I thought we'd really need to get it started. We have an etcd, so things look alright up front. But the error that we had was... well, we didn't get an error, but kubectl should speak to... you used the admin.conf from this one, didn't you? I did. And that will have its local address and port in there, right at the top. Oh yeah. I mean, we could change that to 127.0.0.1. Would that work? Yeah. That would be fine. And I believe it should be signed for

1:01:06 localhost. If not, then you'll get a TLS error. But, I mean, we can just look at netstat to see if it's actually up and running on that port. Alright. So first of all, that IP address should be the IP address of this machine. Right? It should. Yeah. So we've got two IP... no. One IPv4, 80 seven 30 eight. Yep. -ln. What would you use here? -an and grep for 6443. Yeah. It's there. We have an established connection to that, so I'm gonna assume it's alright. The get nodes didn't work.

1:02:05 `kubectl get nodes` Times Out (Investigating ETCD)

1:02:07 Is that alright? Maybe I'll let it actually time out. Give me an error this time. Timed out. etcd timed out, not the API server. Oh, okay. That's strange. I hate etcd problems. Is etcd running on this node? Well, I mean, it looks like it, of course. It certainly does. So is there a way for us to check the health? I mean, if etcd is running, to me that normally means that it's healthy, because it wouldn't be able to join the cluster otherwise. Yeah. I mean, the first thing probably would be to check the etcd logs, find out what

1:02:55 it's saying and how healthy it thinks it is. Okay. In fact, we can just go to tail. Follow the container. Oh, no. Is there bad data in our etcd? Failed to apply. There is no space. So it looks like... oh, no disk? No disk space. I'm not sure I'd have managed to log on if that was the case. No. This looks alright. And although it's running in a container, let's take a look at our manifest for this and see if there's any weird configuration. Oh, it's running read-only or something like

1:03:35 Identified: ETCD "No Space" Alarm

1:03:59 that. Possibly, before we start. We do have a volume, but I don't believe those are... no. That is the data directory, /var/lib/etcd. And that volume is a hostPath here. No, that's not the same one, is it? No. etcd-data. Where's etcd-data? Oh yeah. /var/lib/etcd. Not a special mount path or anything, is it? No. Yeah. So it's not the host, which leads me to think it must be something to do with the... I don't know. I'm not an etcd expert. I wanna go back to this. Let me go through a little bit slower.

1:04:09 Examining ETCD Manifest for Volume Configuration

1:05:20 So I'm not worried about the commands. I don't think there's anything there. You can't tell etcd not to write to this location. We've got /var/lib/etcd as the data dir. So let's confirm that's where we mounted it. Yeah. So /var/lib, that looks alright. I'm not fussed about the probes. Resources: there are no constraints or limits. I mean, there's nothing there that makes me think that it can't run. And etcd is running, but it's failing. Yes. And does etcd have a fixed size database? Is there too much in it? etcd does have a... yeah. Does

1:06:26 have a command line thing, doesn't it? It does, but I can never remember how to do it and I always fall back on one of these things. Well, let that be a lesson to you: Google is the answer. Always. Alright. We don't actually have the etcd command line tools. And if I could remember how to do an nsenter into that mount filesystem, I would. In fact, we could just do that. Right? Its PID is three nine eight zero seven nine. lsns. nsenter. Let's see if I can remember the syntax on this. ns. We want the mount namespace.

1:07:26 Don't know how you do that? Oh no. There's a number and the mount namespace and then I'll remember. I'll get there. Yeah. nsenter four zero two. This is the last time I'm trying. I know there's a way to do this. Screw you, nsenter. nsenter -m. If I need to install it, I'll download it. I know it should. 398079. I'm gonna try one more thing. Okay. Or PID... ah, okay. Maybe I just need that. No, right, fine, forget it. I'm gonna Google that for the next time, because that's twice I've felt that nsenter

1:08:25 would have been really useful. We just need the etcd client, don't we? Well, that keeps failing. We may just have to grab the etcd binary then. Okay. Tripped up by the silly things. I'm gonna grab the latest of those things. amd64. curl -L. Extract. That's more than I wanted. Oh my goodness. etcd. There we go. Okay. So etcd, etcdctl. Oh, what? Did you grab the macOS one? Why am I on my host? Oh, you're in the proc filesystem trying to download it there. I'm shooting myself in the foot far too much

1:10:11 here anyway. etcdctl. Alright. Where's that cheat sheet? I've exported all those things, which should get us started. Let's test this command works. Alright. We have the ability to query etcd. Wow. That's wonderful. Yeah. I think it's just binary keys inside of that, to be fair. Looking at, not the logs, the Kubernetes documentation for etcd, there are a number of things that we can actually start to look at to determine the current state of it. Things like the snapshot size and things like that. I need this pointer to just tell you the link

1:10:13 Using ETCD CTL to Query Status

1:11:03 in the docs: etcd op-guide, maintenance. I'm assuming that's gonna have some commands on it. Space quota. Yeah. That sounds... so, table endpoint status, I guess. Do you wanna type, if you know what you're doing? I do not know what I'm doing, but... oh, I'm gonna get kicked out again. One second. I'm just gonna type health and see if anything does anything. No. I should know more etcd stuff. Well, I mean, if you're lucky, you never really have to touch etcd. Which node are you on? M5. M5RZG. Yep. Okay.

1:11:55 ETCD CTL Status Shows Healthy, But Alarm Exists

1:12:12 I'm not sure what's happened. Yeah, since you joined, my screen seems to have gone a bit skew-whiff, but I'll just keep scrolling down manually. It says it's healthy. Yeah. Should we run that compaction? I just wanna get the... one second. There was a size command, which should... Yeah. I saw the size command for setting it. I never saw a way to query it. Let's see. Quota backend size. Woah. Scrolling's a bit weird sometimes. Yeah. I've had a bit of an issue there with it. Watch, version, blah blah blah. Snapshot status.

1:13:05 Getting ETCD DB Size & Attempting Compaction

1:13:40 Yeah. If we do that, write-out table, endpoint status. Did you run that? There's also an etcdctl alarm list, which I think will list out some problems. No space. So I think we're getting close to working out where we're looking here. Okay. Let me just drop in this command and then see if it works. It seems to list something. There we go. So we've got an endpoint status there. Looking at the table, we can see the DB size is 18 meg. So I guess we're gonna have to increase

1:14:19 that or do a compaction to free up some space. Yes. So compact will compact the event history, as we can see here. I'm guessing we have so many events that there's no space left. Well, we could just make it really big. Yeah. Alright. Well, I've got that command from the maintenance page. Let's set this to... we don't want 16 meg. I'm gonna go for a gig. Oh no, you start etcd with a quota-backend-bytes. Okay. So we're gonna have to modify the static manifest to change the quota back

1:15:02 end bytes. Do we have one? No. Okay. So one zero two four. You happy? Yeah. I'm not sure, if it's a static manifest, will that automatically restart? Do we need to... Yes. It will do. Any change to a static manifest will be detected by the kubelet. It keeps track of all those files. So it should restart etcd. Alright. Well, Jason has now joined us. Hey, man. We spent a solid hour on your cluster. Thanks for that. Alright. Let's see. So we should be able to do that write-out endpoint status again. etcd's been up. It's not been restarting the system. That's just

1:16:14 a kube-apiserver, isn't it? So I've now broken it. Tail etcd. Alright. We need more. I've made it worse. Invalid value, flag... if I missed that. Okay. It doesn't like the maths that you've added in. Alright. I guess I can just type a big number. Yeah. Of course, that shell expansion is not gonna work in a static manifest. There we go. So we had a rake. I loved that Ctrl-R just worked in that, and that's a browser window. That was awesome. And I changed the directory. Where did I... It's in your home directory, isn't it? Yeah.
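Editor's sketch: the flag value was rejected because a static pod manifest is not run through a shell, so arithmetic in the argument is passed to etcd as literal text. The value has to be a plain number worked out beforehand. The helper name below is an assumption for illustration; `--quota-backend-bytes` is the etcd flag under discussion.

```python
def quota_backend_bytes_flag(gib: int) -> str:
    """Build the etcd flag with the quota written out as a literal byte count."""
    # 1 GiB = 1024 * 1024 * 1024 bytes; the manifest needs the final number,
    # not an expression for a shell to evaluate.
    return f"--quota-backend-bytes={gib * 1024 ** 3}"


print(quota_backend_bytes_flag(1))  # --quota-backend-bytes=1073741824
```

The resulting string is what would go in the manifest's command list, in place of any expression.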

1:17:26 Thank you. Yeah. There we go. Okay. So write-out table. It's still 80 meg. What does the compact actually do then? Will that not help us yet? So there seems to be a defrag and a compact. Will we try both? I don't really know what defrag does, but I remember what it used to do on my old hard drives back in the day. Yeah. Compact? Well, it needs one argument. I should look at the docs, but... The revision number, I think, for compact. Do I pick a random number? It's version three, isn't it? etcd. I'm looking at the

1:18:28 etcd logs there. Is it just the three at once? No. I think it's getting the revision from that write-out table. Oh, that's the wrong terminal. Need to stop doing that. Write-out. Is there a revision here? Is it grabbing it from that version? 313. Alright. I'll just try three then. That would be a nope. Out of range, required revision has been compacted. What? Yeah. There is some sort of revision, and there's a write-out JSON. So sorry. I keep on doing the wrong thing. That write-out command, I guess the -o json is gonna give us

1:19:24 more. Write-out equals JSON. This is why you should read the documentation instead of... Do we have jq? I don't see the... oh, there's the revision there. Oh, there. 730154. 7 3 0 1 5 4. Compacted. Okay. Let's go back to the table output and see if it's still got an alarm, and it does. Although the DB size is now only five meg. So I have no idea what we're doing. Yeah. How's it still complaining now that we've reduced the... it appears to be no longer the leader either. Should we go to the leader, I guess?
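Editor's sketch: rather than reading the revision off the screen, it can be pulled out of the JSON form of the endpoint status. The JSON shape below (a list of entries with `Endpoint` and `Status.header.revision`) is an assumption about what `etcdctl endpoint status --write-out=json` returns, and the sample document is hypothetical apart from the revision number quoted in the session.

```python
import json


def revisions_by_endpoint(status_json: str) -> dict[str, int]:
    """Map each endpoint to the current revision reported in its status header."""
    entries = json.loads(status_json)
    return {
        entry["Endpoint"]: entry["Status"]["header"]["revision"]
        for entry in entries
    }


# Hypothetical sample in the assumed shape; 730154 is the revision read out above.
sample_status = json.dumps([
    {
        "Endpoint": "127.0.0.1:2379",
        "Status": {"header": {"revision": 730154}, "dbSize": 5242880},
    }
])

print(revisions_by_endpoint(sample_status))
```

The revision obtained this way is the argument the compact command is asking for.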

1:20:28 I don't know whether we need to replicate this across all of the members. Let's try it. We already have another machine here. We'll copy the etcd cheat sheet. Where's my cheat sheet? To set up etcd. And I don't have the binary again. Hello. Curl. Let's see, what was that write-out command? Endpoint status. Mhmm. --write-out table. See, one of the good things about this is, obviously we have no idea what we're doing with etcd, but I feel like I'm learning how to work with etcd. I do feel as though I'm learning

1:21:35 a little bit. I mean, I have written a bunch of code around Raft before and found that very weird, but a learning experience. But finding out how this database sits on top of Raft is kinda interesting. Yeah. I'm gonna run that defrag, and then, this is the leader, so I'm gonna do that write-out JSON again, get the revision ID, and then we're gonna run compact on 730154. Did we get the defrag already? Did that revision ID change? No. 11301. That worked on the other machine, didn't it? It did. Yeah. It's

1:22:21 it's Rawkode version. Oh, required revision has been compacted. Oh, okay. But it's still no space, still the same error. Yeah. The DB size is now only two meg. Well, we're saving disk space one way or another. Alright. Let's see. Walid just dropped us a hint. He's feeling sorry for us now. Check the etcd manifest in here. So I guess something's different on this node, perhaps. Yeah. I'll let you scroll through this. Looks like my screen hasn't updated with this particular one. Over on the... oh, I don't remember. BJLW4 node. Okay. Check etcd manifest in here. Okay.

1:22:47 Hint & Examining ETCD Manifest for Quota

1:23:21 quota-backend-bytes kind of stands out to me there. Oh, yeah. It has been changed here. Before we actually change that, let's... oh. The dates on a few things are different here, but definitely, etcd has been modified. Yep. I'm just gonna pop to the other node and see how that one looks. So I'm just in the other window at the moment. I'm on M5R. Yeah. I've got it. Okay. I'll use less so people... oh, I can't. Cool. You want me to type? Oh, you got it? Okay. So that line isn't even... oh, that's

1:24:16 the quota-backend-bytes that you put in, wasn't it? Yeah. I made up that number. Okay. We could probably just take that out, I suppose. Or maybe we should copy it from the other node so they're consistent across all the members of the cluster. Let me check the control plane node and just tell you if there's anything different. So, Kubernetes manifests: has anything here been changed? Right. No. This is fresh. Does it even have the thing in there? It does not. So I think we just remove the quota-backend-bytes from all

1:24:51 Fixing ETCD Quota Setting

1:24:55 of them and let it restart. So I've removed it from the one control plane node that begins with B. If you wanna just remove it from that node that you're on there. Oh, this is just a less, isn't it? Oh, yep. Let me try and vim my way through it. So, done. Alrighty. You'd removed it from the one I said? I have. Yeah. Okay. So there are no maximum sizes on things now. What was it, endpoint status? Oh, status endpoint. Yeah. Right. Oh, it's such a terrible name for that flag. Yeah.

1:25:49 I'm still getting the no space alarm. I don't know if this will take some time to... Yeah. If our etcd has even restarted yet, I guess. 13:25. Yeah. Potentially. Let's give it a minute. I mean, we can see whether kubectl is actually giving us a working cluster at this point. Yeah. Why not? Every time I click on that... I really should have used real service. Let's go to... this is five. Export. Okay. Get nodes. Hey. We're back in action. Are our pods all running? grep -v. etcd is coming up on one of them.

1:26:46 Mhmm. System. Just create pod. Ah, okay. We've got one more thing to fix, it looks like. What is that? So which node is this? I really should have checked. So, contain white. That was a stupid thing to grep on, wasn't it? No. Because it's not an inverse. There we go. We wanna jump onto M5RZG then. Okay. Which we are. Now what was that error again? Sorry. Wait a minute. Scatterbrained. Yeah. It can't open hosts. Let's see, hosts. So is it maybe using host networking or something? It is, but it should. Right? Does that

1:27:54 etcd use host network? Yes. Yeah. Because it binds to localhost. This looks like a file issue though. It's trying to access that /etc/hosts. Let's get proper logs from it. Just tail. Older and new. Older and new. I'm still getting them. Yeah. It's still the size thing. What's the time currently? So these are... yeah. These are from a minute ago. Yeah. Well, he says it's just three more simple things. Yes. Thanks, Ovid. Oh. I deeply regret upsetting you last week, but I think you've definitely taken your revenge here.

1:29:07 Alright. What's going on? What have you done? Blah blah blah. That looks fine. That'd be it. Nothing. Is there anything else in the config, in the flags that are being passed? So advertise, fine. Data, blah blah blah. All the peers are all there. What are we missing here? I mean, this all looks fine to me. It does. I wonder if that log is just, well, wrong. Let's just check. Error opening /var/lib/kubelet/pods, etc-hosts. I'm gonna just copy this file path actually. Why would that fail?

1:30:42 Is this an old error? Are we doing a really long back-off, a five minute wait before restarting it or something? You know, it's definitely possible, of course. /var/log/containers, tail etcd. No. It's still moaning. But I feel like we fixed this error. The no space one. Yes. No. Absolutely. Let's run... We have access to etcd. We were running endpoint health. Good. We run endpoint status. We're still getting that alarm. Yes. There's leader, there's learner, blah blah blah. The DB size is tiny compared to where it needs to be. Did we actually fix all those manifests'

1:31:11 ETCD Logs Still Show "No Space" Alarm

1:31:55 quota? Yes. There's none in the third one. So that is R8M5 and this is also R8. I'm sure I've got multiple windows in each thing though, but... What have you done now? There are no quotas on any of this. I think etcd is still restarting. It is. You're right. Why are you restarting? So, I mean, yeah, it restarted two minutes ago. Looks like etcd is popular. Jason also tried to do some etcd carnage on his cluster, but then broke it in the way he ended up.

1:33:44 Starting as a member. Yep. Added members as expected. Started streaming with peers. And then we got the same error. No space. Should we run a compaction on them all again? Just... Found invalid file/dir abc. That's one that you created, isn't it? So ignore that. You just made sure that you could actually write to the... I did. Yeah. Mhmm. So the error that we're still getting is no space, even though we removed the quota bytes and we've restarted etcd. Alright. Well, we're approaching the end of

1:34:37 our time, so if you wanna throw us a hint to get us past this, that would be appreciated. I'll do write-out JSON and see if there's anything. I wanna see the revision ID; I wanna know if it's changed. I feel like I should check it across the nodes. Yeah. The revisions are the same. So '70 o one... The version is the same. Yeah. Okay. Well, he has given us a command, which I will type in. Endpoint and paint. That's wishful thinking. Let's do this right. At this point, definitely. egrep. Wish I could copy this.

1:34:57 Hint: Check ETCD Alarms

1:35:20 Oh, nine star. The revision is the same one that we used. Yeah. Alright. Should it be a compaction we're running? I'll do a defrag and a compact again, I guess. 730154. Yeah, it's already been compacted. Jason, did we clear the alarms? They probably don't clear themselves. Clear alarms. No. Alarms clear. Let's see what we've got. Alarm list and disarm. Alarm list. Alarm disarm. Disarm. It's gone. Alright. Do we need to do that on all the nodes? I hope not. Let's go back into our container logs and see if it stops restarting now.

1:36:36 ETCD Status Healthy After Alarm Fix

1:36:36 Oh, hello. It's healthy. I think it's healthy. Clear alarms, people, apparently. That's my tip of the day. Get pods. Great. Are they all running? There we go. Almost. So I think we may have to clear the alarm on the other two nodes. It doesn't look like that is automatic. I know Jason does say that it's cluster wide. Maybe I'm just being very impatient, but we can check. We've got them all open. Right? So did I not download it on this machine? I'm sure I've got it somewhere on all machines. I'm going to the wrong directory. That's why.
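Editor's sketch: the lesson of this segment is that freeing space is not enough; the no-space alarm stays raised until it is explicitly disarmed. The sequence that eventually worked can be written down as an ordered list of etcdctl invocations. The endpoint address is an assumption, and the TLS flags a real cluster needs (CA, certificate, and key paths) are deliberately left out. The code only builds and prints the commands; it does not run them.

```python
import shlex


def nospace_recovery_commands(
    revision: int, endpoint: str = "https://127.0.0.1:2379"
) -> list[list[str]]:
    """Ordered etcdctl commands: compact history, defragment, disarm, then re-check."""
    base = ["etcdctl", f"--endpoints={endpoint}"]
    return [
        base + ["compact", str(revision)],  # drop history before this revision
        base + ["defrag"],                  # hand the freed space back
        base + ["alarm", "disarm"],         # the step that was missed in the session
        base + ["alarm", "list"],           # confirm nothing is still raised
    ]


for command in nospace_recovery_commands(730154):
    print(shlex.join(command))
```

Keeping the disarm step in the same list as the compaction makes it harder to forget that alarms do not clear themselves.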

1:36:50 Cluster 5 Overall Status Check

1:37:38 There we go. Okay. So... Are you in CD? I know. It's being particularly wonderful today, and I won't have all those exports, will I? Cheat sheet. This is now my favorite page in the world. What happened there? They're all running now. Oh, are they? Just to make sure. Yes. Alright. This took some time. Oh, my goodness. That was fun. Well, your definition of fun isn't mine. I mean, databases. Well, we need to check that the application's working. You mean the WordPress app? Yes. I actually don't think that ever broke. I think it was quite happy just working

1:38:49 away in the background. Maybe we never tested it. Oh, I mean, let's restart it. There's his readme, which said that they were trying to do curls internally. I wonder if he's broken internal DNS or something. Alright. Let's see if this pod restart gets it online and running. We can jump in and test pod networking quickly. I can tell you what's missing here from doing a quick get pods. It's always DNS. CoreDNS isn't running. Do a get pods -A and grep for DNS? Okay. So WordPress port-forward would work. It just couldn't... yeah. Okay. So

1:39:41 Identified: Missing CoreDNS and Control Plane Pods

1:40:03 there's no good service discovery. Sorry. I'll use this. I shall shared. So namespace get pods. Windows. I need to contain myself next time. Kube. I can't even spell kube. We have no kube-dns. Now how harsh has he been? Also, the API server and things like that are missing. Alright. There's a bit of a problem there. Two unready pods. Let's describe our deployment. I don't see a scheduling error, which is what I was kind of expecting. It's probably replica failure, replica create, minimum replicas, blah blah blah. Why have we not got a scheduling error

1:41:32 here? I mean, that's because all events are missing. They would have been in the database. Oh, we don't have an API server or scheduler. We don't have a control plane. Yes. I was wondering about that. And I wonder if you've done something funny with the manifest so that they're in a namespace that doesn't exist. Alright. I'll take a look at kube-apiserver. We've got a pod. Namespace kube-system. Yeah. That's all fine. We have an API server. Maybe it's just not come online yet. Should we just reboot them? I mean, taking etcd off is normally

1:42:19 quite catastrophic. This could just be a side effect. Side effect. Yeah. I'll just kill -9. I'm not gonna get my autocomplete. Copy the PID. Give it a second. Alright. Okay. It's back. Does that now show up? No. So our API server logs are what we need. Right? So why is it not coming up? Because there's no scheduler. Let's cat this. Do you see an error? I am. I know I'm scrolling fast. Yeah. If you pop out here, look at the scheduler logs. Scheduler. I can't reach API server. So the scheduler can't reach the... what have you done now? Honestly, this is

1:43:24 Troubleshooting Missing Control Plane Pods (Logs)

1:45:07 this is a good one. It's getting connection refused on port 6443 on 14540. Let's check the IP address. 14568143. Alright. So has the port changed? That looks alright to me. Secure port 6443. I'm pretty sure we should remove that insecure port. That's not even a thing anymore. Right? Has the style changed? Yes. Right. Yeah. I don't think insecure... that's not been a thing for a while. I'm just gonna look and see if I see anything. Should the liveness probe be going over the IP, or should it be using the pod IP?
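The "connection refused" symptom above can be separated from a plain timeout with a very small TCP check. This is a minimal sketch, not what was run on stream: the function name is made up for illustration, and the address to test is left to the caller because the node IP is not legible in the transcript.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: a refused connection, an unreachable host and a
    timeout all return False, so it only answers "is anything listening
    that I can reach from here?".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, comparing `can_connect("127.0.0.1", 6443)` on the control-plane node with the same call against the node's own address from another machine would show whether the API server is down or only unreachable over the network.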

1:46:26 Like 127.0.0.1. I mean, we haven't checked to see if this is restarting or if it's running but not working. Time out some immediate seconds like okay. Yeah. They don't worry me. This is stressful. Look. I can't see any gray hair underneath this Packet hat, so I'm good to go. Alright. API server's not there anymore. If you edited the file then it'll be restarted. Ah. Yeah. You're right. Okay. So this is PID 7417. Is that gonna change? I mean, I can curl localhost 6443 fine, so I can loop back to the API server. And secure. Yeah.

1:47:42 That's good. So there's nothing... This must be networking. But what they... So how do API servers elect the leader? Because... They do it through the... So Cilium's okay. Are you okay for time? Because I know we've went a bit over. Yeah. I'm okay. Okay. Jason is telling us to use the Hubble UI. We do have it available. I guess it wouldn't hurt to take a look with it. Okay. It's a good idea. Hubble is a really cool product, part of the Cilium CNI. That will show us if there's any

1:48:52 network policies or traffic being blocked, etcetera. I don't know what port it runs on. And now I feel like I need to change my command from here. Not configured. Getting myself in a pickle. Describe pod UI. Give me a port 80. I really should have just took a shot on that. Alright. So if we browse to this IP address on 8080. Nothing's gonna happen. Alright. Can't be bothered fighting with that. Okay. Well, it says think simple Kubernetes resources. Let's just stick to what we know. Are there any network policies? Yes. Describe netpol MVP one.

1:50:26 That just seems to deny everything, doesn't it? Yeah. Hello, created yesterday at 6PM. So I'm gonna delete it. It's supposed to allow any egress traffic, not with that, block any external. Actually, this was a block all. Yep. Well, you've just given a sense. It's still feeling pretty. What hides pods? Right? Why would something not show up on get pods? Let's take a look at the... Yeah. I'm starting to draw a blank here. Right. I only have a few minutes and then I really need to leave. So we'll go for three more minutes.
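What makes a NetworkPolicy a "block all" is an empty pod selector combined with policy types that carry no allow rules. A minimal sketch of that check, working on a policy already loaded as a dictionary (for example from `kubectl get networkpolicy -o json`), could look like this. The function name and the sample policy are assumptions for illustration; the policy deleted on stream was not shown in full.

```python
from typing import Any, Dict, Set


def denied_directions(policy: Dict[str, Any]) -> Set[str]:
    """Return the directions ("Ingress"/"Egress") this policy denies for
    every pod in its namespace. Empty set means it is not a blanket deny."""
    spec = policy.get("spec", {})
    selector = spec.get("podSelector") or {}
    # Any label constraint means the policy only targets some pods.
    if selector.get("matchLabels") or selector.get("matchExpressions"):
        return set()
    types = spec.get("policyTypes")
    if types is None:
        # Kubernetes default: Ingress always, Egress only with egress rules.
        types = ["Ingress"] + (["Egress"] if spec.get("egress") else [])
    denied = set()
    for direction in types:
        # A listed direction with no rules allows nothing in that direction.
        if not spec.get(direction.lower()):
            denied.add(direction)
    return denied


if __name__ == "__main__":
    # Hypothetical example policy, not the one from the cluster.
    block_all = {
        "kind": "NetworkPolicy",
        "metadata": {"name": "example-deny-all"},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    print(sorted(denied_directions(block_all)))
```

A policy that comes back with both directions denied would also block DNS lookups and calls to the API server from pods in that namespace, which fits the symptoms seen here.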

1:51:35 Identified & Deleted: "Block All" Network Policy

1:51:48 If we don't get it, we don't get it. That's fine. We will get the post mortem into the README, where Walid will add all of his notes to tell us what he did. We can all learn from that. I can imagine his notes are gonna be a book at this. He must have spent a week doing this. Better run a get all, see if I can see anything. I don't wanna give up. Alright. He said check pod security policies. So the speaker will be met They'll be the Rooker's Rook. That no privilege. I'm gonna describe

1:52:31 Hint: Check Pod Security Policies

1:52:56 What's this thing? Allow privilege escalation true. Run is only run is only running. I don't think that's... am I misreading that? It's called no privilege but it doesn't appear to be blocking anything. Let's just delete it. Well, there you go. He's dropping some knowledge on us now. No privilege hides static pods. I had no idea. I mean, I've hidden static pods by... if you put a different namespace that doesn't exist in the static pod, the kubelet will start it, but the API server doesn't know what to do with it. Okay. So with that gone, we should see

1:53:15 Identified & Deleted: PSP Affecting Static Pods

1:53:56 things like the API server. Yeah. I'm wondering if it's only the wee cheeky push, though. I shouldn't think so. I've given up just assuming I know anything anymore so. Alright. What machine are we on? That is m five r eight and I'll give this last one a push because we're wrapping up. See if we even have a working API server yet. It's so weird, I'm querying the API server and I can't see it on a pod list. It is quite confusing, isn't it? Yep. Still cast there. He says there's one last issue, something to do with the kernel settings for forwarding and

1:54:38 Final Hint: Check Kernel Settings (Sysctl)

1:54:44 I mean, I'm definitely out of my depth there. So that's /etc/sysctl. Alright. What have we got? What's the date on these? There's some Cilium configurations here. So he's disabled something, I guess, in /etc/sysctl.conf. Always done it manually, I guess. Yeah. I think it's been done manually. Those files look fine to me. I'm gonna need to lie down after this. Man alive. I've never been so stressed in my life. This is insane. I'm gonna concede. I'm sorry. I don't think we're gonna get it, are we? It certainly doesn't look like it. The API server is still hidden

1:56:01 Conceding Failure on Final Issue

1:56:09 inside the API, which is kind of amusing. Alright. So let's call it. Like, we couldn't fix Walid's cluster. I'm looking forward to reading what you actually done and learning. I love that trick with the no privilege. I mean, even looking at that pod security policy, I was like, it's not doing anything. Definitely need to research that more and understand why that hides a static pod. Right now I don't understand that at all. Alright, thank you Jason, thank you Walid, those were both extremely tough. Thank you. That

1:56:18 Conclusion and Wrap-up

1:56:53 was brutal. I will take... at least we got one cluster up and we got kind of partially towards this one. Yeah. But that was pretty brutal. Thank you. Yeah. Alright. Dan, thank you for working through that with me. It is stressful. It is fun. I'm learning stuff. I hope others are learning stuff too. Thank you for everyone that joined us for that massive two hour session. Alright. I'm gonna get for a long time. Yeah. Let's take a nice break, well deserved. Alright. I'll speak to you soon. Everyone else, thanks a lot. I'll see you soon. Bye,

1:57:30 everyone.
