About this video
What You'll Learn
- Diagnose Kubernetes API server TLS handshake failures by checking static pod config, ports, and control-plane logs.
- Remove broken Cilium host network policies and verify node connectivity when nodes stay NotReady.
- Fix Cilium tunnel-mode and CIDR settings that can break cluster networking and kubelet connectivity.
Noel Georgi joins David to debug two broken Kubernetes clusters: a TLS 1.3 handshake mismatch blocking the API server, and rogue Cilium host network policies plus a misconfigured native-routing-cidr breaking node readiness.
Jump to a chapter
- 0:00 Viewers Comments
- 1:30 Introductions
- 1:37 Introduction to Klustered Part VIII
- 1:59 Housekeeping & Community Shout-outs
- 2:35 Sponsor Thank You (Equinix Metal)
- 2:56 Introducing Guest Noel Georgi
- 3:42 Meeting the Breakers (Tim Hockin & Matt Anderson)
- 4:30 Kluster 015
- 4:35 Starting Debugging: Cluster 15 (Tim Hockin)
- 5:36 Initial Cluster Check (API Server Unreachable)
- 5:58 Checking Kubeconfig and API Server Address
- 6:25 Verifying API Server Process is Running
- 6:50 API Server Port Check (Netcat/OpenSSL)
- 7:47 Examining API Server Manifest
- 8:58 Installing etcdctl for Debugging
- 9:32 Attempting etcd Health Check
- 11:10 Investigating Suspicious API Server Flags
- 12:10 Removing Suspicious API Server Flags & Restart
- 13:15 API Server Still Unreachable
- 13:43 Adding Verbose Logging for Debugging
- 15:29 Checking API Server Logs
- 20:46 Finding Authentication Handshake Failed Error
- 25:47 Checking API Server Certificates
- 27:00 Examining etcd Certificates
- 27:38 Checking etcd Logs
- 28:10 Finding Errors in etcd Logs
- 31:21 Re-testing API Server Port Connectivity
- 34:12 Checking Certificate Expiry Dates
- 35:43 Checking Running Processes (ps)
- 36:26 Checking Firewall Status (UFW)
- 36:57 Reviewing iptables Rules
- 41:09 Re-examining API Server Static Pod Manifest
- 42:55 Testing API Server with Curl
- 43:10 Identifying TLS 1.3 Handshake
- 44:08 Discussing Potential TLS Version Mismatch
- 49:30 Kluster 016
- 49:34 Moving to Cluster 16 (Matt Anderson)
- 49:45 Connecting to Cluster 16
- 50:10 Initial Cluster Check (Nodes Not Ready)
- 50:50 Attempting Teleport Connection to Workers
- 51:59 Manual SSH Success, Teleport Issue
- 52:05 Restarting Teleport Agents
- 53:08 Suspecting Network Issues (Teleport Affected)
- 54:27 Reviewing iptables on Worker Node
- 54:50 Working from Control Plane Node (Cluster 16)
- 54:58 Reviewing iptables on Control Plane
- 57:30 Network Issues Confirmed (Private IP Ping Fail)
- 59:03 Discovering Cilium Host Network Policies (CHNP)
- 1:00:36 Examining Cilium Network Policies
- 1:02:09 Deleting Suspicious Cilium Policies
- 1:03:08 Teleport Agent Returns
- 1:03:36 Nodes Starting to Become Ready
- 1:04:03 Checking Cilium Pod Status
- 1:04:28 Identifying Unready Node
- 1:04:43 Restarting Teleport & Kubelet on Unready Node
- 1:06:00 Verifying Node IP Addresses
- 1:06:43 Checking Kubelet Logs on Unready Node
- 1:07:00 Kubelet Connection Timeout to API Server
- 1:07:59 Focusing on Kubernetes Resources (per Breaker Hint)
- 1:08:09 Re-checking Cilium Network Policies
- 1:08:50 Deleting Node-Specific Cilium Policies
- 1:11:06 Nodes Getting Ready (One Still Unready)
- 1:12:10 Deleting All Remaining Cilium Policies
- 1:12:27 Restarting Kubelet (Attempt 3)
- 1:13:21 Kubelet Logs Still Show Timeout
- 1:14:32 Checking Admission Controllers
- 1:15:03 Checking Static Admission Controllers
- 1:16:31 Checking the Node Resource
- 1:16:49 Verifying API Server IP in Kubelet Logs
- 1:17:31 Hint: IPv4 Address Issue
- 1:17:50 Checking Cilium Endpoint for Node
- 1:18:12 Suspecting Pod CIDR in Cilium Config
- 1:18:51 Dumping Node Configs for Diff
- 1:20:16 Diffing Node Configurations
- 1:20:32 Identifying Cilium Host IP Discrepancy
- 1:21:36 Checking Cilium ConfigMap
- 1:21:53 Finding tunnel-mode: veth
- 1:22:15 Fixing tunnel-mode
- 1:22:17 Confirming Pod CIDR Hint
- 1:23:40 Reviewing Cilium ConfigMap CIDRs
- 1:24:19 Checking Local Cilium ConfigMap
- 1:25:40 Fixing native-routing-cidr in Cilium ConfigMap
- 1:26:01 Fixing cluster-pool-ipv4-cidr
- 1:26:31 Rolling Out Cilium Agent Restart
- 1:26:57 All Cilium Pods Running
- 1:27:00 Checking Node Status (Again)
- 1:27:10 Restarting Kubelet (Attempt 4)
- 1:28:04 Node Still Unready, Pods Terminating
- 1:30:06 Monitoring Kubelet Logs
- 1:30:54 Hint: Check Kubernetes Service Endpoint
- 1:31:08 Checking Kubernetes Service Endpoints
- 1:31:32 Identifying Incorrect Endpoint Target IP
- 1:32:28 Re-testing API Server from Unready Node
- 1:33:40 Ping to Private IP Fails
- 1:35:12 Confirming Network Issue to Specific Node
- 1:35:40 Focusing on Cluster Health Despite Ping Issue
- 1:35:58 Re-checking API Server & Pods
- 1:36:08 Postgres Pod Terminating on Bad Node
- 1:36:11 Force Deleting Postgres Pod
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:37 Introduction to Klustered Part VIII
1:37 Hello, and welcome to today's episode of Rawkode Live. This is part eight of the Klustered series, a series in which myself and a guest will attempt to fix broken Kubernetes clusters. The catch is these Kubernetes clusters are broken by members of the Kubernetes, cloud native, and Rawkode communities. Now before we begin today, we're gonna do a little bit of housekeeping. First and foremost, if you haven't clicked subscribe and clicked that little bell on YouTube, please do it now. It means you will get notifications for all future episodes of Klustered and all the other cloud native content that
1:59 Housekeeping & Community Shout-outs
2:12 I'm doing. It also helps other people find the content, which we can all agree is a good thing. Right? Also, there is an active Discord server. Feel free to pop in there. There's currently a Discord stage happening just now. Leo and Kevin are there. Hey, Leo and Kevin. Feel free to go chat and laugh at us as we attempt to fix these clusters. It's always funny, at least for you. Lastly, I'd love to thank my employer Equinix Metal. They provide the time, compute power, and resources for me to put this show together and us all to have a bit of fun.
2:35 Sponsor Thank You (Equinix Metal)
2:45 So thank you Equinix Metal. You can check out Equinix Metal with Rawkode Live as a code. This will get you $50 in credit, which is around one hundred hours of compute. Alright. Now today's guest, Noel Georgi, is joining us today, an active Kubernetes community member. Hey there, Noel. How are you? Good. Excited and terrified too. Excited and terrified is exactly where we need to be so I'm very happy with that. Why don't we start with a quick introduction? Do you want to give us a little bit about you and then we'll talk about today's episode?
2:56 Introducing Guest Noel Georgi
3:23 Nothing. Thank you. What we yep. I work for an employer in a company called Railsoft. It's a startup and we do kind of like mostly for some of the government contracts, mostly in the defense department. Alright. Awesome. Thank you very much. We have our first question today, which I think is very apt: who are the breakers? I haven't. You guys want to know. Yeah. I intentionally, well, not intentionally, I didn't really post a lot of information about it but we do have some really good breakers today. We've got a cluster broken by Tim
3:42 Meeting the Breakers (Tim Hockin & Matt Anderson)
4:01 Hockin, one of the original or at least first couple of contributors to the Kubernetes project and Kubernetes developer at Google Cloud. The second cluster is broken by a colleague of mine, Matt Anderson, who works at Equinix Metal on our delivery and infrastructure team spending all day with Kubernetes. And as he assures me, he can break Kubernetes with his eyes shut. So I'm curious to see what evil things they have done to our clusters. Yep. I guess it's that time where we just start taking a look. We will get my screen shared now. Alright. We're both on screen. We have Teleport. I
4:35 Starting Debugging: Cluster 15 (Tim Hockin)
4:43 believe you have access to it. I am going to start our first session. I will zoom in a little bit. If you just want to jump into active sessions, click join and type echo hello so we know that we're both sharing the same terminal. Yep. Got it. Awesome. That's a hard one though. And in fact, there we go. We've got a wave there from Matt who is watching us. Thanks, Matt. I'm looking forward to this. Alright. So we're starting with cluster 15. We did them in numerical order. This is Tim Hockin's cluster. We will first
5:24 I'm gonna set as an alias so that we don't have to keep messing that up. I will export our kubeconfig. We're using the admin token and I'll give you the honors. You wanna type, get nodes, get pods, whatever you want. Let's see if we have an API server. Why does nobody ever leave us an API server? Yeah. That's Okay. Let's check that file. Alright. Lead the way, and I'll Just want to see which API server it's okay. It's I guess it's IP of this machine. Yep. It's this one. Right? 180 something. Yeah. I think that's alright. It looks to
5:58 Checking Kubeconfig and API Server Address
6:18 be what I would expect. I'm assuming our API server potentially isn't running. Happy. Come on. It is running. I hate it when that happens to it. So let's check. Okay. It says the secure port is 6443. Shouldn't okay. The insecure port should be zero. This is fine, but it's not responding. Okay. So, yeah, we definitely have Do we have Netcat here? Yep. And I just want to see. Okay. So it's listening. At least I can connect to that. I'm curious if we run the get pods again and let it time out, if it's gonna tell us etcd connection timed
6:50 API Server Port Check (Netcat/OpenSSL)
7:09 out or if it's gonna tell us something else. I suspect I mean, who would be harsh enough to mess with etcd? I made it very clear my thoughts on people messing with etcd. I don't like people messing with Yeah. Alright. What's the timeout gonna be, thirty seconds? Gonna be the longest thirty seconds of our lives. Yeah. And Waleed jumped in with a comment there saying, wait. Wait. Wait. It might time out with a message. Yes. I suspect we're gonna see etcd. Let's see. Just go through it, the arguments to see if
7:47 Examining API Server Manifest
7:51 are these username headers normal? I don't remember seeing them. What do you not remember? Oh, TLS handshake. No etcd. I guess that's good. Alright. Should we do some basic control plane checks? Do you wanna see if we have etcd running in the process table? Yeah. Okay. Yep. It's running. And So Let's try that etcd cheat sheet first and see if it actually works. Yes. Hold on. I really need to book oh, Oh. I've found a new shortcut in my streaming software that shows comments. That's useful but not what I want right now. Okay. I really need to start bookmarking this
8:44 because I use it all the time. Great. I'm just trying. Yeah. I don't remember these etcd things. Oh, what happened? We need apt install. Let's see. Let's search. Let's see. It's etcd-client. Okay. Of course. And we can just run. Yeah. That looks good. Yeah. That's good. Yeah. So TLS handshake timeout. That's gonna be one of these weird error messages that just phone itself. Oh, now you're getting scary. You know these commands off the top of your head? Yeah. I just I kind of use them frequently. And connect. Localhost.
9:32 Attempting etcd Health Check
9:50 6443. Oh, right. Oh, I guess it's really not listening. So Good thing. It's listening on port 6443, but it's not actually running TLS. Maybe that's what's happening here. Should we go take a look at the API server configuration? Yep. Before that, let me check. Okay. It's kube-apiserver. I just want to know if it's any other process that's matched with it. Yeah. That would be sneaky. Alright. Let's take a look at the config. Kubernetes. Ah, so I don't get a full screen on direct pro what what do it is?
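The distinction drawn here, a port that accepts TCP connections but never completes a TLS handshake, can be checked in one go. The following is a minimal sketch rather than what was typed on stream: the function name `probe` and its return labels are made up for illustration, and certificate verification is deliberately disabled because the only question is whether the server answers at all.

```python
import socket
import ssl


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a port as 'closed', 'tcp-only', or 'tls-ok (<version>)'."""
    try:
        # Step 1: plain TCP connect, roughly what a netcat check tells you.
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "closed"

    # Step 2: attempt a TLS handshake. Verification is switched off on
    # purpose: we only care whether the server responds to the handshake.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return f"tls-ok ({tls.version()})"
    except (ssl.SSLError, OSError):
        # Handshake timed out or was reset: TCP accepted, TLS never completed.
        raw.close()
        return "tcp-only"
```

Calling `probe("127.0.0.1", 6443)` on the control plane node would return `tcp-only` for the symptom described in this episode: the listener accepts the connection but the handshake hangs.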
10:44 Oh, yeah. We can just scroll up. There we go. Okay. I don't know. That's interesting. We opened them and it started just halfway down the file. It's as if maybe somebody was sneakily in this file and didn't close it properly. So maybe that's a good indicator. But let let's see what flags do we have here in the API server. Do you see the request extra headers? That's something I had never seen them or I might have missed it previously, but You you would think I'd know most of these arguments by now, but I must say
11:10 Investigating Suspicious API Server Flags
11:23 I just scroll over them looking for the typos and swear words. Header, it means, like, I think if you are to authenticate, do you want to, like, pass in those headers? So you think these are suspect? What is it? Like, friend, client? I guess so. Let's try commenting it out. I had never seen them. I don't remember seeing them previous cluster. So Let's see if there's anything else here. Look at that. Your media's a proper. It's stroke view from spine. Yeah. Those look alright. Probes, I'm gonna just skip over. Volume, spine. You should take the time stamp. Yeah. I
12:09 I think I think you're right. I'm not familiar with these request header parameters. So let's why don't we just get this saved and wait for the API server to restart and we'll see what happens. Let's see. Sixteen o eight. Two minutes ago? No. That's not restarted yet. Right? Yeah. When did I sixteen ten. Yeah. They've got it's not oh, there we go. Okay. Restarting now. Okay. Took its time. You know, always the telltale tale, but we can actually see that all three of these files have been modified in the last oh, no. Wait. No. It's it's it's the APIs ever have.
12:10 Removing Suspicious API Server Flags & Restart
13:10 Okay. Alright. So you wanna try? Yeah. Get pods, get nodes. No. So we don't have a comment from Leo. Okay. This okay. Yeah. Let's look at the comments. So now we're probably Oh, so the word was locked. Right? Sorry. Can you say that again? Oh, is it the comment about the verbose logs? Yeah. Yeah. Leo suggested that we pass in a couple of -v flags into this and get some verbose logging. So why don't we do that with the get nodes and we'll see if we get any extra output. I suspect now that it
13:43 Adding Verbose Logging for Debugging
13:51 affects the weird request header thing that we're gonna see an etcd timeout. Oops. Give me minute. I just like it got locked out. Alright. I'll take over typing. Feel free just to jump back in. Yep. Thank you. I just pressed control w. Oh, control w should work. Oh, it just closed my tab, so maybe it's on Linux. Oh, oh, yeah. That that will be it. Yep. I'm just so used to that shortcut that I always accidentally Yeah. I've I've done that a few times, especially when I was trying to work from a Chromebook for a while.
14:29 That's painful. Alright. We're still getting the TLS handshake timeout on our Do you remember what's the username? No. I one. No one. Okay. Okay. Guess I'm gonna just have to work at what we do next year. So enter loading discovery information. TLS handshake timeout. Skip cache discovery. At this point, I was kinda hoping to see an etcd problem. Oh, you've just joined the session because it did that weird reload thing. I'm back. There we go. Okay. So you're because everything taking too long to respond? Yeah. But your openssl command confirmed that it was now listening on
15:25 TLS. Right? Listening on 6443. But, normally, I would see, like, it returning certificates, but I had never actually hit a Kubernetes API server. So I could be like, it could be just that Kubernetes API server is different. I never hit a Kubernetes API server with it. I mean, like, it should return me the certificate and everything so I can, like, go and figure out stuff anyway. What did we get, please? We got nothing. Leo said we should try -v=9 because Waleed suggested that earlier, but I can't find that format. So
15:29 Checking API Server Logs
16:02 so what's happening? Oh, we got a thirty two second time out twice. Okay. So by default, it only has the user agent headers. I have never seen this ever before in my life. So while he suggested that maybe there's something in between us and the API server, I think that's a good call but we we did run a netstat and we did see that the API server process was responsible for that port. Okay. So So it could be something network wise on the machine. Is that what you're checking now? Is there any oops. I just closed it
17:07 again. Give me that. Okay. I'm back. I will learn, but not to use that. I'm wondering if there is any, like, iptables. Someone messed up with iptables. Yeah. Let's take a look. Dash L, I guess? Yeah. Just doing L. Let's get everything. Maybe we could grep it for 6443. Or maybe it might be named or something or alias. Yeah. Not really good with, like, people. Oh, we're screwed then, aren't we? Alright. Okay. So selling stuff I'm gonna mostly ignore. Don't. I don't think there's an effort on toward there. Yep. That is my rather naive iptables knowledge
18:13 shining through. But, yeah, I think that's okay. Let's take a minute. Let's take a this that error message is bugging me. It says it's looking up discovery information as if it's maybe trying to verify the CA somehow, but I'm not sure what that means. Let's see if they are left out as any hidden files or something. Hence Do you wanna jump into the PKI folder, and we'll see if any of the time stamps look a little bit? Sure. The Kubernetes. Yeah. Let's check that etcd one. Yep. That seems like a completely different time stamp.
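The idea floated a moment earlier, searching the firewall rules for anything that touches the API server port, is easy to get wrong by eye. As a rough sketch only (the helper name and the sample rules are hypothetical, not taken from the cluster), a filter over `iptables -S` or `iptables -L -n` output might look like this. It only recognises single-port matches and ignores multiport lists and port ranges.

```python
import re


def rules_mentioning_port(rules_text: str, port: int) -> list[str]:
    """Return rule lines that match on exactly the given source/destination port."""
    # `iptables -S` prints "--dport 6443"; `iptables -L -n` prints "dpt:6443".
    pattern = re.compile(rf"(?:--(?:dport|sport)\s+|\b(?:dpt|spt):){port}\b")
    return [line for line in rules_text.splitlines() if pattern.search(line)]


# Hypothetical rules, for illustration only.
SAMPLE = """\
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 6443 -j DROP
-A OUTPUT -p tcp -m tcp --sport 64430 -j ACCEPT
"""
```

With the sample above, `rules_mentioning_port(SAMPLE, 6443)` keeps only the second line; port 64430 is not a false positive because of the trailing word boundary.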
19:11 Oops. What did I do? Peer cert. That looks different. Right? Is it like the etcd peering is not working? And, like, it's not Well, I think the etcd peering must be working. We've been able to well, we never checked the health, but we did created a the API. Yeah. Let's try the etcd client health thing. Yeah. Let's try it. We could probably just do a control r, pull up that last etcd command. Health. Do we do you remember what this is? Nope. Alright. Let's Okay. Endpoint health. Yeah. That seems healthy.
20:19 Yeah. I'm okay with that. And I I like that requesting something from the Kubernetes in a space seems to return okay. I I think I'd see to yourself. I think we'd we'd see error messages. Why don't we check the API server logs? Maybe we should have done that five minutes ago. Less cube. Oh, come on. It is over dash. Let's go to the end. Don't see that end. I'm just gonna do a quick scan, but. I think, like, that points in the end, but I don't know if it's actually an issue. Which one? Let me search through.
20:46 Finding Authentication Handshake Failed Error
21:47 Endpoint. Yeah. So something yeah. It's over here. Loaded. I know. Let's see. Like, crap. Nothing. Yep. This is something I was unable to register. I think this is fine. Revolve that points from Kubernetes service. Do you see that? Yeah. So do we I think that's about, like, the node registration. Right? I think it's about node registration. Can you ignore it for now? Okay. I'll trust you. Yep. From the previous I don't I don't think this is anything useful here. Right? So annoying. Yeah. What could we try? Like, we should tail the logs while hitting the API server.
23:15 Alright. We'll tell you what. Why don't you tail the logs and I will pop open another session. Oh, there's an error message I can't error now. Yeah. Okay. So what have we got here? We've got Yep. Unable to perform initial IO. Yeah. There's an error. Yep. Probably we looked at the logs of the old API server. It keeps scrolling to oh, so annoying. What's that? Is that just one error? I'm gonna open I'm gonna open this and then because I can't. Okay. Yeah. Okay. Where's the start of Yeah. That's that's beefy. Just keeps coming.
24:55 Okay. Okay. Loaded. Loaded. Loaded. Normal. Normal. Normal. Normal. Oh, just so these are all standard or, like, see exactly STRs. So it's probably is our results. It says about endpoint.go. Oh, there's an etcd connection timeout. Just after that. Some Okay. There we go. Okay. Ah, authentication handshake failed. Yeah. We gotta contact that thing. It's speaking to etcd. Okay. So we can speak to etcd, but the API server cannot speak to etcd. Fine. So it's probably the certificate that is mentioned in the manifest might be wrong or messed up. Okay. So we check the API server configuration
25:47 Checking API Server Certificates
26:01 to make sure it's using the right certificates. Oh, we got an etcd CA file. But the server, that's fine. I got the cert and the key file. That looks right. That's this thing that I show and I was screwed through your head. It does look okay. And we It does look okay? Kubelet preferred address types, internal, external, host name. Yeah. I think so. I mostly run k3s, so I really don't want to, like, look into all this stuff. Let's look into that folder and see. It is PKI. I need to add a new rule. Nobody's
27:00 Examining etcd Certificates
27:07 allowed to touch etcd. The one that was modified was peer.crt. Let's open up and see if it's Yeah. But the peer cert should really only be used with other etcds to Yep. But that was the only one that seems like has a different time stamp. Oh, it might be that the service no. But then we should see those errors in the etcd logs. Yeah. Let's take a look at the etcd logs and make sure just even though we ran an endpoint health, it looked okay. I guess we should have run an endpoints list to make sure. In fact, there's
27:38 Checking etcd Logs
27:46 only one etcd. This is a single control plane node. Yep. Let's maybe I should operate in there. Hey, Rich. We do have some errors there. Right? Yep. Lease is not fine. Okay. So it became leader. Oh, read only range request key. Okay. It's an error. It seems like a read only. But someone did actually see the only. And if it's read only, actually, this should return data. Okay. It seems like end of file server name. Yeah. We're getting a rejected connection from addr, EOF, server name blank. Right. Oh, let's check that. Let's check that admin.conf
28:10 Finding Errors in etcd Logs
29:10 if it was modified. I mean, it was not. No. What else I could add any ex like, add or remove any extra information. Right? Authenticates to no. That should only be the API server. Maybe we should like Maybe we should just set this cluster on fire. I don't know. Hey. That's a good tip from Walid there in the chat. Wally says, why don't we get the request headers back because those have been confirmed as unrelated. Yep. So I think that's probably a good start. I'm really annoyed that we can query etcd. I'm trying to think what could be
30:00 blocking the API server and etcd. Yeah. Let's just double check these certs a little bit in here. Like We got Discord chats in the bottom of my screen is hidden. Hit that. I have fixed it. Okay. Okay. So we're advertising on our own IP address, the 1454. That looks fine to me. Authorization mode, node and RBAC. Yeah. Sure. Yeah. That's fine. Client CA, we're using the PKI directory. I'm gonna assume that's fine. They hadn't touched anything there. Yeah. We've got the etcd CA file and cert and key file. Yeah. Let's look at the Maybe we need to look up this kubelet
31:13 preferred address types. Could that be but then it's not the kubelet. Right? I don't know. I hate this cluster at all really. Silly. Can we restart? Not yet. No. Few minutes behind. Yeah. It's taking forever. Yeah. I don't know why. We can always kill it, I guess, but we'll give another moment. Usually, it's fast. Okay. It's dead. Okay. It's back. Okay. So can you try that openssl command again and see if we get a response? Oh, it's connected. Okay. So we were chasing our tails on that the first time. That was yeah. That definitely was unrelated.
31:21 Re-testing API Server Port Connectivity
32:17 And then you said you would normally see a certificate coming back. Is that right? Yep. I don't know if I find disconnect. Yeah. That's it. And now since we hit it, let's go check the logs. L. Which one is latest? A l t. Right. Copy. Code. You see anything? It seems like c setting the address to CC. Right? It's just it's an STDR, so that's what kind of follows me. And are these actual studios? Alright. Who in the chat's got an idea? So there was some ideas that potentially the handshake timeout could be expired certificates.
34:12 Checking Certificate Expiry Dates
34:19 Yep. We can check that. I mean, I don't think the certificates will be expired and the timestamps haven't changed on any of them. Alright. So that we are not set. I can check the expiry of that cert. Yeah. That's it. I know. I'm glad you know how to check it. I'd be on Google looking up the openssl command again. No. It's a little etcd. That was the peer.crt. Yeah. Dot crt dash text. That's fine. And I can actually type dash enddate. Right. Yeah. Oh, I don't need text to that.
35:04 Okay. It's April 6. Yeah. That looks good. And it's created when I expected it. That's, you know, that's when I spun up the cluster. That tells me that I mean, it has been modified, obviously, but not the expiry date. Yeah. It's not expired. Yeah. I actually should have not expired. Would be just not just bring it out after a water. Yeah. That's fine. Wow. A hard one indeed. Let's oops. Sorry. Let's run a ps. I wanna see if there's anything running on this that I don't expect to be running. Yep. All normal. NTP, SSH, etcd.
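The expiry check done by hand above can be turned into a number. This is a small sketch, assuming the input is the single line that `openssl x509 -noout -enddate` prints (for example `notAfter=Apr  6 12:00:00 2022 GMT`, a made-up date); the function name is hypothetical. It relies on `ssl.cert_time_to_seconds` from the Python standard library to parse the timestamp.

```python
import ssl
import time
from typing import Optional


def seconds_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Seconds left before the certificate expires; negative means already expired."""
    # Drop the "notAfter=" prefix if present, keeping only the timestamp.
    not_after = enddate_line.strip().split("=", 1)[-1]
    expires_at = ssl.cert_time_to_seconds(not_after)
    return expires_at - (time.time() if now is None else now)
```

A positive result rules out expiry as the cause of a handshake failure, which is the conclusion reached here for the etcd peer certificate.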
35:43 Checking Running Processes (ps)
36:07 Speakers from MetalLB, filter. Alright. You see anything suspicious? Nope. I don't even see a firewall running. Let's just double check. Status firewalld. And that won't that won't be fire. Yeah. UFW. Yeah. That's the one that's out there. It all looks fine. We went through the iptables. One of the ways to prevent a service from responding is, like, they freeze the cgroups. But, like, since we restarted, it should not bother it. So this must be something else. Yeah. I hate this. I hate that I'm relying now on modified dates. Just The bold actually, the folder time
36:57 Reviewing iptables Rules
37:25 stamp is shady. Sorry. Say that again. It's easy. Actually folders timestamp. Right? It says, like, ten sixteen twenty six. That's the only thing that's come kind of different. Make sure what we're looking at is legit as well. I'm just gonna make sure there's no crazy amount of business going on. Yeah. I don't think they're hiding anything from us here. Yeah. Okay. So Sorry. Were you gonna type? Off you go. I'm gonna read there again. I think what we'll do is we'll give this one another four minutes and then we'll move over to the other cluster and then if we have
38:16 time we'll look back but I think Tim may have actually Klustered us. Yep. I just don't under it's gotta be something so silly. Right? Because we can run etcd. Yep. And the edit might be something so silly. Endpoint. Let's hit endpoint status too. Let's do this. etcd is fine. Like Fine. We exported a whole bunch of things. In fact, let's just look at the cheat sheet. Right? So the certs that we're using are the CA oh, that's just the health check ones. I wonder if we point it to the other certs if it works or fails. Yep.
39:04 That's a good point. Okay. So let's the CA doesn't change, but these two will. So that is underscore cert and underscore key. Let's see what keys we've got. So let's try this one and go with which one's the API server you're using? Server one? Server dot server. Right? Yep. Let me open up a new session. Can I join? You can create a new session if you want yet. Fire away. Alright. Let's try endpoint status. Yeah. It's still alright. That's fine. Let's check if it's actually using this file. We didn't really check which file it was
39:50 using. Oh, sorry. I need to go. I'm gonna type. So Leo's asking, is the breaker around for a hint? No. I think Tim is in meetings and cannot join us live. Well, I'm sure he will take great pleasure in laughing at us later. Okay. It was the wrong cert we used. It's apiserver-etcd-client, actually. Yeah. Okay. Let's hook it up to that. Give it one more spin before we move on to Matt Anderson's cluster. apiserver-etcd-client.crt. And let's see if it's a key, I guess.
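Re-running the etcd check with the same client certificate the API server uses is the point of this step. A hedged sketch of how that command line could be assembled follows; the helper name is hypothetical, the paths are the kubeadm default locations (an assumption about how this cluster was built), and the endpoint is assumed to be the local etcd on port 2379.

```python
def etcdctl_argv(subcommand, pki_dir="/etc/kubernetes/pki",
                 endpoint="https://127.0.0.1:2379"):
    """Build an etcdctl command that authenticates as the API server would."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        # CA that signed etcd's serving certificate.
        f"--cacert={pki_dir}/etcd/ca.crt",
        # Client certificate and key the kube-apiserver presents to etcd.
        f"--cert={pki_dir}/apiserver-etcd-client.crt",
        f"--key={pki_dir}/apiserver-etcd-client.key",
        *subcommand,
    ]
```

Passing the result to `subprocess.run(etcdctl_argv(["endpoint", "status"]))` on the control plane node would mirror the manual check; older etcdctl builds may also need `ETCDCTL_API=3` in the environment. If this succeeds while the API server still logs handshake failures, the certificates themselves are unlikely to be the problem.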
40:41 Yep. K. K. So let's see client.key. Run a status as well. Just okay. Yep. It's working fine. So nothing Let's easy there. Let's take us through it. Everything we're doing manually is working. What is standing between our API server and etcd? The API server is started from a static manifest running in a container. We need to look at that API manifest again. Do you agree? Sure. Yep. Oops. Sorry. Alright. That's always the fun part. Okay. So let's check the image. I'm gonna I wonder if he's brought in a sneaky image. Like I guess he could have replaced the
41:09 Re-examining API Server Static Pod Manifest
41:53 image with anything with it at present. I don't even know. Is there a way to know if that image pulled at all? I don't know. Four minutes ago, so it's not restarted yet. Let's give that a minute. If that doesn't help Yep. We'll jump onto cluster two and then if we have time, we'll jump back. But I'm at a loss. So impatient. I'm gonna restart. I thought there must be something in between these two, some component that we are not aware of or, like, something how it communicates with etcd, or it must be some flag that's causing
42:38 this. Okay. And let me I think the cache discovery is fine. Okay. But it says TLS handshake timeout. Let's take that off. That curl in verbose mode. Wait. Oh, but it's really was responding and So success Oh, do you see that? It's saying TLS 1.3. Is that bad? Probably. Let's check that etcd manifest and, like, see if it's actually only listening on TLS 1.3. Okay. I we could have tested it. But if I said, yeah, whatever. No. The etcd manifest. Oh, the etcd. And probably the etcdctl might just work fine with TLS 1.3.
43:10 Identifying TLS 1.3 Handshake
43:42 I know. It might be that just curl used it by default. Like, it should have been, like, tried using the fallback mechanisms, whatever. I don't believe it should appear. I think that's okay. So why would TLS 1.3 be a problem? As the API server doesn't like that, I can just quickly verify that. I might be sometimes we might be just oops. Not this one. You think it's maybe restricted to 1.2? Yep. Okay. I think it's just TLS 1.2. Oh, come on. Is that sorry. Day on beach. It's after connect. I don't know, but I know, like, we can pause and maybe,
44:08 Discussing Potential TLS Version Mismatch
44:54 like I don't know. That's what I usually do. I don't have it. Yeah. I don't think we have it. Oh, it's -tls1_2. Of course, it is. Why? It's connecting, but I don't know why it doesn't return the certificates. Like, it's viewed. Alright. You Google and, like right. Like, it usually returns. Did I type it wrong? See, it just returns all this bunch of data, which makes me, like, like, verify that actually kind of thing. But this is, like right. This is what's kind of bugs me off. Like, it doesn't show anything. And if I do TLS 1.3,
46:03 let's see if it does something just to verify. Ah, nothing. This is what's kind of confusing. It should return those. Okay. Let's I don't wanna spend too much time on it when we're not sure, but let's try the IP address instead of localhost just to rule out a couple of things. So Oh, yeah. Take off that protocol. Yeah. It's connecting, but it's not responding. I think it respect to So what would stop it? Certificate. What could stop it responding? I let's take the manifest again. It might be that the certificates are not mentioned for
46:53 the API server. We could always bust out tcpdump. Wait. Where is the certificates for the Kube API server itself? Do you think we're missing something from here? I guess so. Oh, there is TLS cert file. So /etc/kubernetes/pki/apiserver.crt. Let's check if that's expired. Mean, the timestamp on them hadn't changed. Yeah. But I think it's Yeah. Maybe maybe on the, like, maybe unlikely. I didn't pass the fly. Let me find the what were oops. Dash enddate. Yep. It's eight or six. That's fine. Beautiful. It's fine. It's not even touching. Permissions looks fine. Why? Something funny that locking it on. Like the
48:45 API server is obviously getting a request. It's getting accepted and connected but it's unable to respond to anything that we send to it. Got my c dash view. It's not responding. That's a problem. So it's Well, it might be responding. It's just not getting to us. That's the thing. Right? It's like Yeah. Probably. But, like, if you see, right, it sends a hello, but the server never responded with the what, like, I don't know what to write. It does something with TCP. Yeah. Like so it's just like client is sending the hello, but, like, the server is not responding.
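The check being attempted here, pinning one TLS version at a time and seeing whether the server ever answers the ClientHello, can be sketched in Python with the standard `ssl` module. This is only an illustration, not what was run on stream: `make_context` and `probe` are made-up names, certificate verification is switched off because only the handshake is being tested, and the short timeout is an assumption.

```python
import socket
import ssl


def make_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Build a client context that offers exactly one TLS version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Verification is disabled on purpose: this only probes the handshake.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def probe(host: str, port: int, version: ssl.TLSVersion, timeout: float = 5.0) -> str:
    """Try a handshake and report either the negotiated version or the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with make_context(version).wrap_socket(raw, server_hostname=host) as tls:
                return f"ok: {tls.version()}"
    except OSError as exc:  # ssl.SSLError and socket timeouts are OSError subclasses
        return f"failed: {exc.__class__.__name__}: {exc}"
```

Calling `probe("127.0.0.1", 6443, ssl.TLSVersion.TLSv1_2)` and then the same with `TLSv1_3` would show whether one version completes and the other does not. A TCP connection that opens but then times out during the handshake matches the symptom described above, where the client sends its hello and nothing comes back.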
49:30 Kluster 016
49:32 Alright. Let's call it. Like, we can't spend the full episode on this. We'll come back to it. Yeah. I'm at a loss right now. Let's see what we can do over on 16. So this cluster, I've just joined and created a session. Please just feel free to join that type echo hello. Login. This cluster is broken by Matt Anderson. He leads our Kubernetes team at Equinix Metal. I will set up our alias. I will set up our kubeconfig. And I will run a get pods. We actually have an API server. Fake.
50:10 Initial Cluster Check (Nodes Not Ready)
50:18 The grep be running. Rawkode and Seth is Lots of things are unhappy. Maybe get pods was the wrong idea. What about nodes? Okay. We have three not ready Oh, it's not ready. Unhappy nodes. So I guess our work in the control plane is done. We're gonna need to jump over to the workers. So let's jump onto the first one. Yeah. That might be easier. Are sessions created? I can't connect to the workers. Wow. If I can't connect to the workers, we can't debug. Okay. Not matching UUID like target. What's that? I will attempt to SSH manually,
50:50 Attempting Teleport Connection to Workers
51:28 which is not good because we can't share screen. At least we'll know if anything's running on those machines. Okay. So I was able to SSH onto the worker, but that means Teleport is maybe just broken. So I'm gonna just try and restart the agents. We'll see if those come online. Let's try restarting it. Yep. Not looking good. There are very few rules on Klustered, but breaking Teleport is one of them. Yep. You should not. Alright. That's not coming back even though we have Teleport running, which tells me there's potentially network issues. But we can't share a terminal. Let's
53:08 Suspecting Network Issues (Teleport Affected)
53:21 Yep. And I get it angry. I get it flustered. Let's do that. Alright. We're gonna have to try and fix the networking rules then. Let's pull up the Teleport network ports. That's not gonna work. Oh, wait. There was the admin thing there. Not very useful. Let's see. Just need some port numbers. Alright. There we go. So we've got pretty much the three zero two range as used by Teleport. Let's take a look at our iptables rules on this worker. Those look fine. Okay. Let's just use the control plane node right now and see at least if the workers are okay, we
54:27 Reviewing iptables on Worker Node
54:45 SSH down on my local machine. I restarted Teleport. It actually seems to be okay. So let's just try and work from the control plane node and see if we can work this out. So iptables -L. Let's see. Oh, I can join. Right? Yeah. Yeah. You can join the control plane node on cluster 16. Let me know. I've set I'll keep an eye on it, but you should see when you pop in here. Okay. Must be on the wrong session. Okay. I see it. Yep. I just reload. It's got a weird bug when you join that it just, you
54:58 Reviewing iptables on Control Plane
55:28 know, freezes my screen. Okay. So let's run that iptables again. I'm gonna ignore all the Cilium stuff because if Teleport can't reach over the back end network, then I think our iptables are being messed with. We can ignore WP running. I'm gonna check the firewall. Yep. I think it's just UFW. Right? I think so. Yep. It's just like UFW enabled. I don't know if it's a systemd service. Whatever. Yeah. It's not running. You see anything? We do have Matt watching, I think. He did leave us a comment earlier. So at least if we Yeah. I have
56:34 a coworker who is, like, very expert in networking. So whenever I hit a networking issue, I just, like, ask him. He sends me some commands. I just run them. Send him the output. Tells me what to do. These look alright. I'm not accept all. Alright. So let's just let's try let's try in them. We're just the noise. Actually, -L doesn't list out everything. There's something else that actually shows you everything else. I don't know what does that like. One like, I don't go running that once. So this is a drop policy, but it says it's for marked packets.
57:17 This is only doing localhost that's invalid. Okay. So whatever is affecting Teleport is probably the reason the nodes can't speak to the API server. So let's run a test. K. One four seven. Nope. 14. 1 4 5. Okay. So 145. Yeah. This is the IP address. So while it is asking from the Yeah. Why we're not checking the network policies. The reason I'm not going straight to that is because Teleport is affected and Teleport runs directly in the operating system. So I'm not fussed about checking the Kubernetes policies yet. Maybe that's naive of me. What do
57:30 Network Issues Confirmed (Private IP Ping Fail)
58:14 you think? But I think if we could if we fix Teleport yeah. I don't think that Kubernetes policies affect nodes. It can't, right? It's only affecting the overlay that, okay, Kubernetes manages. So Yeah. I don't know. I'm gonna try it. But my overlay yeah. Curling 6443. Let's take for. Yeah. There's something in between us and that machine. Let's make sure I can even ping it. Let's use the private IP address too. Yeah. Oh, they do. The host network policy. Yep. Let's see if those things exist yet. CRDs. Alright. There's something we learned today. Thanks, Matt.
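The "is there something in between us and that machine" question can be made a little more precise by separating a timeout from a refusal. As a rough sketch (the function name `tcp_check` and the three-second timeout are assumptions, not anything used on stream): a timeout usually means packets are being silently dropped somewhere, which is what a firewall or a host-level network policy tends to look like, while a refusal usually means the host is reachable but nothing is listening on that port.

```python
import errno
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a plain TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typical of dropped packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but no process accepts on this port.
        return "refused"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Running `tcp_check(api_server_ip, 6443)` from the unready worker and from a healthy one would give a quick side-by-side comparison; `api_server_ip` stands in for whichever control plane address the kubelet is configured to use.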
59:03 Discovering Cilium Host Network Policies (CHNP)
59:17 Host network policies. Do get API resources? From API. Yeah. Yeah. API resource. What did I do? It's not get. I think it's just kubectl api-resources. Oh, yes. Just, yeah, api-resources. Did you mean yeah. Just do it. Both. You may wanna do a grep -i maybe. Yes. Try a Cilium clusterwide. Oh, okay. It's probably Cilium clusterwide network policies. It says clusterwide, and then the other is, like, Cilium, like, policy. So let's try get CCNP. Wait. It's very hard for me to. Ah, see? Wow. I didn't know about this. This is
1:00:36 Examining Cilium Network Policies
1:00:38 cool. Yeah. New to me as well. Let's describe them, take a look at them before we go sledgehammer deleting them. Oh. Yep. I had no idea that Kubernetes could modify the That's cool. Let's do it egress. And this is something I learned from Duffy. Alright. Let's see what we have here. Oh, it doesn't really. We probably need to get the YAML. Yeah. We can take a look at do get -o yaml too on it. Why not? But it looks like we've got some sort of egress to the world. I'm assuming world is special for everything.
1:01:22 Spec kind. Okay. toEntities, world. What does it say? It's, blocking. It shouldn't be a stack. It shouldn't be. Egress, toEntities, world, and what is the kind? Is it, like, clusterwide network policies? And this one was egress. Is it, a default thing? Well, yeah. Potentially, why don't we do a kubectl explain CCNP spec? What's the is it dot spec? Yeah. That's not They don't really have a description. I say we delete them. Valid. Yes. Yes. Yes. Thank you, Valid. Yeah. World not equal to cluster. Wow. World does not equal cluster. Come on.
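For readers who have not met these resources before: a Cilium clusterwide policy that carries a `nodeSelector` applies to the nodes themselves rather than to pods, which is why it can cut off the kubelet and Teleport. As far as I understand Cilium's semantics, once an allow rule selects a node, traffic in that direction that the rule does not match is dropped, so an egress rule that only allows `world` would leave in-cluster destinations blocked. A small sketch for picking such policies out of `kubectl get ccnp -o json` output follows; the function name and the sample policy names are illustrative, not taken from the cluster on stream.

```python
import json


def find_host_policies(ccnp_list: dict) -> list:
    """Return names of clusterwide policies that select nodes instead of pods."""
    names = []
    for item in ccnp_list.get("items", []):
        # A policy may carry a single `spec` or a list of `specs`.
        rules = []
        if "spec" in item:
            rules.append(item["spec"])
        rules.extend(item.get("specs", []))
        if any("nodeSelector" in rule for rule in rules):
            names.append(item["metadata"]["name"])
    return names


# Hypothetical input shaped like `kubectl get ccnp -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "example-host-egress"},
   "spec": {"nodeSelector": {}, "egress": [{"toEntities": ["world"]}]}},
  {"metadata": {"name": "example-pod-policy"},
   "spec": {"endpointSelector": {}, "ingress": [{"fromEntities": ["cluster"]}]}}
]}
""")

print(find_host_policies(SAMPLE))
```

Listing the node-selecting policies first makes it easier to review them (and save their YAML) before deleting anything.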
1:02:09 Deleting Suspicious Cilium Policies
1:02:25 Okay. Let's do CCNP. Are we happy just to delete these? Nothing good ever came of these. Right? Anyway, it's not clustered, so it can't be like the default one. Right? Well, I will delete the egress, and then I'm gonna describe CCNP. I just wanna see what that pretty much is. This one, ingress from the world. I'm assuming these are just blocking all ingress and egress traffic. Yep. Clust oh, Klustered ingress. See, I love it when we get a break like that where we learn something. That's the best thing. So do we think
1:03:06 that That's tech telephone. Look at that. We got one back there. Yep. I'm assuming the other two are gonna come back on momentarily depending on the retry pause. Bullshit. Come back soon. Yep. It's getting ready. I hate it when I get taunted. Matt is saying it's not gonna be that easy. So let's see. Did the yeah. The nodes are coming back. So let's check Cilium. Do you wanna do a namespace Cilium get pods and see if our networking is slowly starting to come back to life? Ah, so one of the nodes is still
1:03:36 Nodes Starting to Become Ready
1:03:47 not ready. I assume it would eventually be ready. Yeah. Let's give that a few minutes, but let's check out the Cilium namespace. What is pending? Getting there. Yes. We've got two agents online, which means two of the worker Yep. I think that's the one that's scheduled on, but, oh, that's not ready. Yeah. We could just do -o wide. And oh, come on. It's a little bit hot read. DP4. Yeah. It's the one that's not ready. So it's like maybe that's a breaker. Yeah. I think we may have to get Yeah. That's I'm gonna jump on that node.
1:04:28 Identifying Unready Node
1:04:32 Let's take a look. So t t j. And then we don't have Teleport access. I'll just need to grab the IP address and jump to my terminal for a second. We'll do some Okay. Teleport is still on. Is it back? No. Okay. Let's see if we can just encourage it to pop back so we can use our shared terminal. So BSSH. And I will just do a restart Teleport and while I'm here, I'll try a kubelet. Why not? Like, who knows? It could just be that simple. Is Teleport happy? Let's try hitting the API server using curl and see if it pops
1:04:43 Restarting Teleport & Kubelet on Unready Node
1:05:16 up. Alright. Yep. I'll do that. Teleport's back on that node. How's the get the kubelet start back yet? My flakes still got go. So Matt told us to note the IP address in the node list. So let's do a get nodes -o wide and see if there was some issue there. What do we got? The one that is not ready has, like, 25145. Is that the one you're on? That's okay. Yep. I'm on that machine. Right? So yeah. Teleport came back. IP address. Excuse me. Yeah. 145. Let's check the kubelet logs. So journal
1:06:43 Checking Kubelet Logs on Unready Node
1:06:47 kubelet. Oh, okay. We got errors. Dial. Oh, it's just failing to connect to API server. Right? Yep. Connection timed out. Okay. Why is it to do the connection timing or or everything today? Yeah. No. Everyone just seems to be targeting our networking knowledge. Alright. Let's see what we got in iptables. Although my scanning of this previously did not do me any good whatsoever. So Maybe there's like something hidden again. I should get better at this networking nonsense. Yeah. Ephelia. But maybe it's yeah. Let's assume that the control plane or at least something and cut
1:07:00 Kubelet Connection Timeout to API Server
1:07:54 let's assume Matt's doing something with Cilium and other cluster resources. Maybe we should just do a kubectl. What do you see? See there are, like, a lot of Cilium network policies. Oh, yeah. And it's blocking in the namespace kube-system, so I guess it might be blocking it for that node. Thanks for the confirmation there, Matt. He has confirmed that everything was done using Kubernetes. Awesome. Means we can stick to Teleport. Oh, thank you. So you've pulled up the Cilium network policies. We've definitely got those Klustered egress and ingress things again. So you wanna
1:08:09 Re-checking Cilium Network Policies
1:08:39 yeah. Good call. Right. Node specific stuff Yep. It has some selected. Let's delete it. I love this sledgehammer approach. Just just delete anything. In fact, I should just write a script that looks for any changes to the system within the last three days and just delete them all. Although, maybe that will get me in trouble. Like some of the monitoring tools should do like once you deploy cluster, it should monitor the state and if anything changes, it should just go back. I think it's actually I should run something on the cluster that audits every API server
1:08:50 Deleting Node-Specific Cilium Policies
1:09:18 request so that I can replay it later. Let's let's for now let's go with the sledgehammer, delete anything we don't know and see if we can get this cluster working again. So that was one of those really cool CNI plug ins and technologies that there's just so much of it that I've not really even started to play with enough of the cool features in network policies. Although we do have Hubble on our cluster. Wonder if we should port forward to it and show Hubble. Because that would probably that would actually show us I think all the policies right in
1:09:46 front of us. Hold on. Let's try that. DLB cluster. This is 16. Export. Okay. Should I delete things or wait for Yeah. Just give me two seconds. Let's Yeah. Let's yeah. Let's check the Hubble. I had always checked that. I wanted to deep dive in my cluster, but, like, never got around to it. Oh, Hubble was here. Oh, what's the debug pod? Where's Hubble? Let me check if I just failed. Did I not enable Hubble? Hubble enabled. Okay. I think Hubble was taken away from us. I see. Either that or it's not. It could be on that node or no.
1:10:50 Could we see it as pending? Alright. Let's just delete. Let's go. Alright. And get nodes. I can gently encourage that kubelet to restart too? So t t j or was it another one? No. T Alright. Restart. Yep. It's Yep. Come on. K. Kubelet is restarted. I guess we should delete all of them because if you look at each, right, it seems like hundred minutes. So probably maybe some time back. Right? Sorry. What was that? So if you look at each of all these policies, right, it's just created, like, a hundred minutes back, but, like, the node is,
1:11:06 Nodes Getting Ready (One Still Unready)
1:11:57 like, two days old. So probably all these probably passing stuff. Yeah. They're all two hours old. Yep. Should I go and, like, delete everything? I mean, it may get us in more trouble, but I fully support your approach. Let's do it. Yep. And this is my favorite thing. Nice. Alright. I will restart the kubelet just in case it needs a little push. Yep. I'm gonna try to familiar with, you would know, like, which one are the default. That's, like, kind of a drainage. Damn you, Where are you? Alright. Let's get the logs for it again.
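The reasoning here, that anything created roughly two hours ago on a two-day-old cluster is a likely suspect, can be turned into a small filter over `kubectl get <resource> -o json` output. This is a sketch under assumptions: `created_within` is a made-up helper, and it relies on `metadata.creationTimestamp` using the `YYYY-MM-DDTHH:MM:SSZ` form that Kubernetes normally emits.

```python
from datetime import datetime, timedelta, timezone


def created_within(items: list, window: timedelta, now: datetime = None) -> list:
    """Return (name, age) pairs for objects created inside the given window."""
    now = now or datetime.now(timezone.utc)
    recent = []
    for item in items:
        meta = item["metadata"]
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        age = now - created
        if age <= window:
            recent.append((meta["name"], age))
    return recent
```

For example, `created_within(policies["items"], timedelta(hours=3))` would list only the policies younger than three hours, which is a gentler starting point than deleting everything unfamiliar.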
1:12:27 Restarting Kubelet (Attempt 3)
1:12:53 Yeah. Let's check the logs or, like, try calling log. It's telling me we're still getting a connection timeout on it. Okay. So those are okay. Those are, like, root log. We could ignore. I love Matt's comments. I was procrastinating, and I only broke it five minutes before the stream. And then five minutes, he's still got us chasing our tail. So thanks a lot. Right. Let's see. Is there anything else with Cilium? Yeah. I would do a Oh, Cilium. Resources and grep to start. They've got all complete in our mind. Yeah. Like, this is, like, why it's, like, autocomplete.
1:13:21 Kubelet Logs Still Show Timeout
1:13:36 Okay. So let's Oh, okay. There are some endpoints which are named Klustered. Oh, no. It must be from the pods. These are from the pods in the default namespace that I was given an IP address too. So that looks good. Alright. We checked network policies and clusterwide network policies, endpoints. Let's check the Cilium nodes. That looks yeah. I think that's standard. Just try this. Matt says, if we burn the policies, we're on to our next fun adventure. So we got something else to do. Yep. And I assume it's probably based on Cilium. Let's check for the old classics. Right? Let's
1:14:32 Checking Admission Controllers
1:14:34 check for a mutating webhook controller, any validating webhook controllers. Let's see if there's anything interfering with node registration. Nothing. I don't see this card. No. You're dating that for configuration. Damn it. Okay. Let's check the API server configuration and see if there's any static admission controllers enabled. Oh, yeah. Oh, wait. Oh, I thought you did a control w there, but you said alright. Yeah. Those haven't changed. Yeah. I was going to, but I don't know. But I saves all the files none of the files that I did. Alright. So Matt has definitely used only cube
1:15:03 Checking Static Admission Controllers
1:15:30 control, which means we're dealing with resources. Let's Yeah. Let's go down this rabbit hole. Alright. Let's do a get pods all and just make let's see what it let's grep for -v Running and see what we've got still pending. Wow. Alright. Rawkode is pending, which I'm not really worried about. Oh, Cognitive is terminating. Okay. So But it shouldn't affect the node registration. Must be missing something. If you type that Oh, we haven't actually checked the node resource. I wonder if he's just modified it. Could that stop it registering? No. Like, if you look at the logs,
1:16:31 Checking the Node Resource
1:16:42 right, it's still a connection issue. Can we check the latest logs? From the Kubernetes? From the Kubernetes? Yeah. Yep. I've got them here. We're just seeing the connection timeout. I use the connection timeout. Kubelet node's not synced. Failed to watch the node. Is that IP the right one? 90 three one one eight. I think so. Yeah. The IP address is fine. Yep. 518. Alright. Matt's given us all the hints here, so that's good. That's something to do with IPv4 address. So let's go back into that. I've got the machine here. Right? What are
1:17:31 Hint: IPv4 Address Issue
1:17:46 we expecting? It's useless. 14Dot5. 14 Dot 5. I feel like Yep. The CIDR? Wait. This is wrong. The Cilium host. Right? This IP address is just flat out wrong because that's the pod network. I wonder if we drop in the 10 dot address here. That alright? Think I got But let's be fine. Before I go change it, I was No. That's like a working node. Yeah. The amount of times I've made this worse for myself by modifying something that I shouldn't touch. Oh, no. That's got the same. Okay. Pod CIDR, pod CIDR. I've got a status here with the address.
1:18:12 Suspecting Pod CIDR in Cilium Config
1:18:46 So let's confirm that. Maybe we should just print these to file and diff them actually. Let's do that. Okay. So this is our broken. Yep. Yep. That's what I usually do. Yeah. I've definitely made I just like always save the YAML before usually I always save the YAML before deleting my stuff just to be safe. I can't type. I'm gonna reload. Hold on. Okay. Oh, just type in reset. Like, whenever this happens, like, if you type in reset, it, like, fixes the prompt. Oh, nice. Yep. Just type in reset. Like, sometimes some applications mess up the output. Just type in
1:18:51 Dumping Node Configs for Diff
1:19:36 reset. Like, you won't see that, like, you're typing it in, but it should fix it. Okay. So let's get nodes. K. Get nodes. Drop in a working one. -o yaml. Working. And then our broken one. I miss the plugin. Like, I have a kubectl plugin, like, called kubectl neat. So I always pipe output of -o yaml to kubectl neat, which should remove all the unwanted stuff. So it's easier to test. It's like a nice tool. Nice. I'll check that one out. Okay. Let's see what we've got. Besides a really crappy diff, but
1:20:16 Diffing Node Configurations
1:20:21 not worried about this. No. Let's not worry about all of these. This is not I think you're really want. Those are, like, I just go go to the top. All the data. Okay. The pod CIDR. Oh, what is the pod CIDR? Was there, like, a definite? 192. Yeah. It's more that each node has its own pod CIDR. Yeah. I would expect that. Yep. That's right. Alright. And the only difference was with the annotations. Right? Those are still annotations. Sorry? Can you diff it again? Oh, yeah. Sure. Alright. So it says Cilium host 921683653171. Oh, wait. That all seems like a little bit weird.
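The "dump a working node and a broken node, then diff them" step tends to drown in fields that differ on every object. A rough Python equivalent that drops a few noisy metadata fields before diffing is sketched below. It is not the kubectl neat plugin mentioned above, just the same idea in miniature; the list of ignored fields, the annotation key, and the sample values are assumptions chosen for illustration.

```python
import difflib
import json

# Metadata fields that differ on every object and only add noise to a diff.
NOISY_METADATA = ("managedFields", "resourceVersion", "uid", "creationTimestamp")


def strip_noise(node: dict) -> dict:
    """Return a deep copy of the node object without the noisy metadata fields."""
    cleaned = json.loads(json.dumps(node))
    for key in NOISY_METADATA:
        cleaned.get("metadata", {}).pop(key, None)
    return cleaned


def diff_nodes(working: dict, broken: dict) -> list:
    """Unified diff of two node objects, rendered as sorted, indented JSON."""
    a = json.dumps(strip_noise(working), indent=2, sort_keys=True).splitlines()
    b = json.dumps(strip_noise(broken), indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(a, b, "working", "broken", lineterm=""))


# Hypothetical node objects; real ones come from `kubectl get node <name> -o json`.
WORKING = {"metadata": {"name": "node-a", "uid": "1",
                        "annotations": {"io.cilium.network.ipv4-cilium-host": "192.168.3.65"}},
           "spec": {"podCIDR": "192.168.3.0/24"}}
BROKEN = {"metadata": {"name": "node-b", "uid": "2",
                       "annotations": {"io.cilium.network.ipv4-cilium-host": "192.168.7.1"}},
          "spec": {"podCIDR": "192.168.7.0/24"}}

print("\n".join(diff_nodes(WORKING, BROKEN)))
```

As the discussion that follows points out, per-node differences in the pod CIDR and the Cilium host address are expected, so a diff like this mostly helps rule things out.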
1:20:32 Identifying Cilium Host IP Discrepancy
1:21:08 Like, doesn't match up. Right? Like, one is, like, dot three and another is dot seven. That's like it goes like I'm not sure if it's like a CNI thing. Yeah. I think it's allocating its own network to each node and each pod gets its own address space. I think that's alright. Okay. Let's take that Cilium config a second. Yeah. Cilium. Cilium config. I'm curious, Matt. Did you kill Hubble? It's really annoying me that we don't have Hubble. Oh, look at that. Wow. The classic table approach. Yep. See if there's anything else in here. It should like once the series is over,
1:21:53 Finding tunnel-mode: veth
1:22:07 like, you should probably write a book, like, 100 ways to break Kubernetes clusters. Alright. Let's fix both of those. I think 10 I think it should be 10 slash 16. Not a hundred. No. Probably, like, try restarting the kubelet yeah. That started. Oh, no. It's not. Oh, wait. Is it? Yeah. No. Direct it? Yeah. That's not starting. Damn it. Disabled. It's yeah. That's. Matt says our CIDRs are wrong. I think we just fixed that. Right? Or is it, like, a wrong one? This I assume this is a cluster CIDR. Yes. Thanks, Pop. I hope you're enjoying our
1:22:17 Confirming Pod CIDR Hint
1:23:37 suffering here. No. I don't want Pop mentioned he needs to go for me. Right. Okay. Let's go through this. There's gonna be something here. So the CIDRs were 1000Slash16. We think that was wrong. We set it back to 1000016. If we could verify from the working cluster oh, I know what that was. We don't have a working cluster. Tim Hawken's cluster is permanently destroyed. In fact, I have the Cilium config map. Do I? Yeah. Let's take the local one. Yeah. I might be able to. Yeah. Probably, like it'd be, a different one. Anyway I know we use the
1:24:19 Checking Local Cilium ConfigMap
1:24:33 let's see. Oh, should we restart the Cilium pods after Oh, yeah. Yeah. If we haven't reset yeah. Of course. We have to restart the kubectl rollout restart. Just being nice. You have people with your rollout restarts. I generally just go for the delete approach and let Kubernetes do it. I used to do that, but after watching, like, Jason doing it, I couldn't, like this seems like more Wait. Okay. It must be like a StatefulSet. Right? Honestly, k -n cilium get pods, delete all, like. I don't know. It must be. Cilium is a DaemonSet. Okay. Yeah. I should be.
1:25:23 The deployment is oh, yeah. So the operators are deployed, but we need to restart the agent. So restart the DaemonSet Cilium. Yes. Cilium. Yep. Matt has thrown out some more tips. He's taken absolute pity on us and saying if I look at my helm values, my pod CIDR starts with 192. So let me pop that open again. So has that been changed in the config map? Native. Oh, of course. Yeah. Native routing CIDR. I've set it to 1000, which is stupid. 192.168. But we need to modify our config map again. Thanks, Matt. So our cluster pool IPv4
1:26:01 Fixing cluster-pool-ipv4-cidr
1:26:05 should be 192.168 as well. It'll need to be a yes. .0.0/16. Okay. You're probably yeah. Other one should be where are you? And then we're ready to roll out. Roll out. Yep. And there was one more roll out. That's three. It's just the So I The Equinix Metal private IPv4 CIDR is actually ten dot. If we tell Cilium to work on that space, I'm surprised more bad things didn't happen. Look at that. We've got them all. Do you wanna run get nodes? Yep. Hold on. I'm gonna restart that kubelet that
1:27:00 Checking Node Status (Again)
1:27:09 kubelets. What is the debug pod doing there? Let's delete that. I don't know. People like to leave me little presents on cluster, I've noticed. Delete pod. Alright. I restarted that kubelet. And it deleted the debug pod. But if it was an actual cluster, like, if I see something like that, I would always, like, find like, describe, see the YAML, find the image, get the old file system, save it somewhere, and, like, inspect it later on. It's fun to do such things. Right here, just the update. Okay. Must be yep. Let me see if it's a deep pipeline.
1:27:10 Restarting Kubelet (Attempt 4)
1:28:00 Nope. I think my thing is that weird. Let me refresh. Did we have all the nodes? Then we will go, oh, we don't have all that. Not yet. That's it. We will have lost the game. And meantime, I just want to see why those pods are not terminating. It's kind of stuck. Serial f w. So we're still getting, oh, we're getting a lease adder. No. It's still a timeout. Right? Oh, yeah. I missed that. Okay. I was too busy reading this. Okay. So we're still getting a timeout here. It says terminating, but it's not terminated.
1:28:04 Node Still Unready, Pods Terminating
1:28:52 And why is its priority, like, too high? Where? I did. Priority. No. That's fine, I guess. Like, let's just get the pods. But it's, like, still terminating for, like, some Yeah. I'm gonna gently encourage that. Yeah. Grace period. Period. I can't. Although, it's actually this is the one that's on the kubelet node that can't connect. So that it's not even gonna terminate anyway. Oh, yeah. K. Got it. Just. Yeah. So that's pending. Yep. Right? And because we don't have a kubelet, it's stuck there. Okay. Let's describe the node again. Wow. That is cool. I'm just gonna keep looking at the kubelet logs
1:30:06 Monitoring Kubelet Logs
1:30:07 and hope that something pops up. In fact you know I'm gonna stop it through systemd and I'm gonna maybe I'll start it. Matt says there there we go, but I don't know. Oh wait. The Wally doesn't matter to us to check the Kubernetes service. But the service wouldn't explain why our kubelet can't connect. Right? Or is there kubelet connect that something's broken even though it's speaking to the server? Hold on. Let me see if I can get an actual error. Has an endpoint said it's oh, wait. Its endpoint is wrong. Right? The endpoint should be
1:31:08 Checking Kubernetes Service Endpoints
1:31:17 93118. Yep. And it's 127. Well, that's the control plane's other IP address. That should also work. Oh, okay. Okay. So we have a Oh, the target one is wrong. Right? It says 123. Oh, yeah. But I still don't think that explains why our kubelet can't speak to the API server. But what? Yeah. I don't understand, like, why the service should affect it. Right? Like, it just directly hits the and the target port should be 6443. Matt, same. We're looking right at it. Yeah. Matt, you'll find out soon enough that when you're on the stream, you're looking at
1:31:32 Identifying Incorrect Endpoint Target IP
1:32:07 a lot of things and it's difficult. I guess it's 6443. Right? Yeah. I Yeah. It is. Alright. It says 6443. Alright. Like, we could, like, do kubectl port-forward and, like, see. And there's the node. I hate this queue. No. We're still getting a timeout. And I don't think that's really related to the service. This is something else. Yeah. But it must be something different. Yeah. Wait. I'm gonna restart it one more time because, you know, I feel like I'm gonna another circle. So starting speaking to containerd. What was the very first error message?
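Checking the `kubernetes` service endpoints by eye, as is being done here, amounts to comparing the listed addresses and ports against what the API server should be advertising. A minimal sketch of that comparison is below. It assumes the standard Endpoints layout (`subsets[].addresses[].ip` and `subsets[].ports[].port`); the function name and the sample addresses, which come from the documentation-only 192.0.2.0/24 range, are placeholders rather than the values on this cluster.

```python
def endpoint_mismatches(endpoints: dict, expected_ips: set, expected_port: int = 6443) -> list:
    """List addresses or ports in an Endpoints object that differ from expectations."""
    problems = []
    for subset in endpoints.get("subsets", []):
        for addr in subset.get("addresses", []):
            if addr["ip"] not in expected_ips:
                problems.append(f"unexpected address {addr['ip']}")
        for port in subset.get("ports", []):
            if port["port"] != expected_port:
                problems.append(f"unexpected port {port['port']}")
    return problems


# Hypothetical object shaped like `kubectl get endpoints kubernetes -o json`.
SAMPLE_ENDPOINTS = {
    "subsets": [
        {"addresses": [{"ip": "192.0.2.99"}],
         "ports": [{"name": "https", "port": 6443, "protocol": "TCP"}]}
    ]
}

print(endpoint_mismatches(SAMPLE_ENDPOINTS, expected_ips={"192.0.2.10"}))
```

As the speakers note, a wrong endpoint here would affect in-cluster clients that go through the service, but it would not by itself explain a kubelet that talks to the API server address directly.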
1:32:28 Re-testing API Server from Unready Node
1:33:18 Yeah. It's still a connection timeout on the right IP address. I'm just gonna try this with curl again. Yeah. I just I can't get to that node. All the same stuff we're going. But I don't yeah. It can't be the service. There must be I'm gonna try and just make sure. Let's see. Just I get it from my local one. Point 40 five. Oh, alright. No. On you go. Oh, it's not even pinging. I was typing to yeah. It's I can't even ping the private address. I don't know if this is part of
1:33:40 Ping to Private IP Fails
1:34:24 Matt's break or just completely unrelated. Can you ping that IP address? Do you wanna try a ping 1025141? So 1025. 40 1. Right? 14. No. Wait. That's that's not our computer. Let me read out from here. Try pinging ten twenty five fourteen dot five. Nope. Alright. And just for clarity, let's try and ping one of the other machines. So let's try pinging 4 143. Right. Yeah. It does. I think this is unrelated. The I think, like, traffic both of these is blocked. Right? Yeah. But Matt said he only used kubectl. And if I can't even ping the private
1:35:12 Confirming Network Issue to Specific Node
1:35:32 IPv4 address, that's a problem and not the one I think Matt's given us. Oh, maybe we should just ignore that for now. Have we gotten the cluster healthy besides that machine? Oh, yeah. We should. If did, let me just Yep. It does connect. Yeah. That's right. So oh, it's terminating. It's prop okay. The other one is running. So and the Postgres, it probably like, if it delete and it gets rescheduled on the Yeah. If we force delete Postgres, I reckon that'll restart on a node that works. I think we just got unlucky
1:36:11 Force Deleting Postgres Pod
1:36:16 and had some other random networking problem. And remember I said when we set the native pod CIDR to 10 dot, I was surprised that more things didn't break. Well, I think maybe that's what's happened. I just pasted my password from my clipboard. Oops. I have done that before. Don't worry. It's fine. It's just for this episode, so it's, like, a temporary thing. Oh, but it okay. I had to forcefully. Yeah. Force grace, you know, grace period, you know.
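The concern raised here is that a Cilium CIDR setting pointed at the 10.x space can collide with the nodes' own private addresses. One way to sanity-check a candidate value before applying it is to see whether any node IP falls inside it, using Python's `ipaddress` module. The function name and the addresses below are hypothetical; substitute the real node IPs from `kubectl get nodes -o wide` and the value from the Cilium ConfigMap.

```python
import ipaddress


def node_ips_inside(cidr: str, node_ips: list) -> list:
    """Return the node IPs that fall inside the given CIDR."""
    network = ipaddress.ip_network(cidr)
    return [ip for ip in node_ips if ipaddress.ip_address(ip) in network]


# Hypothetical node addresses on a 10.x private network.
NODE_IPS = ["10.25.14.3", "10.25.14.5"]

# A pod range in 192.168.0.0/16 does not contain any of these node addresses...
print(node_ips_inside("192.168.0.0/16", NODE_IPS))
# ...whereas a range covering the whole 10.x space contains all of them.
print(node_ips_inside("10.0.0.0/8", NODE_IPS))
```

An empty result for the pod and native-routing ranges is what you would want to see: it means Cilium's address space and the host network do not overlap.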