Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Repair cluster node readiness by fixing containerd socket path issues and validating kubelet and API-server connectivity.
  2. Diagnose and remove a rogue eBPF port blocker, then confirm firewall, iptables, and network policy behavior.
  3. Investigate unschedulable nodes, disk pressure, kubeconfig misconfigurations, and blocked Postgres egress with in-pod DNS checks.

Marek Counts and Abdel Sghiouar join David to debug two broken Kubernetes clusters, fixing containerd socket paths, kubelet misconfigurations, a rogue eBPF port blocker, a kubeconfig issue, and a Cilium NetworkPolicy blocking Postgres egress.

Chapters

  1. 0:00 Holding screen
  2. 1:00 Introductions
  3. 2:04 Guest Introductions (Marek & Abdel)
  4. 3:30 Cluster 33 by Marek Counts (@TheNullChannel)
  5. 3:41 Debugging Cluster 33 (Marek's) - Initial Cluster Check
  6. 5:35 Cluster 33 Problem Description (Reading the README)
  7. 6:50 Debugging Node 58ghnk (Easy)
  8. 7:44 Kubelet & Containerd Issues on 58ghnk
  9. 9:25 Locating & Fixing Containerd Socket Path
  10. 13:05 Verifying Node Status (58ghnk Ready)
  11. 14:21 Debugging Node vx4 (Medium)
  12. 14:56 Fixing Containerd Config Again on vx4
  13. 17:21 Debugging vx4 Node Status (Still Not Ready)
  14. 26:34 Identifying Kubelet Config Problems (Memory, RunOnce)
  15. 30:34 Fixing Kubelet Config on vx4
  16. 31:27 Verifying Node Status (vx4 Ready)
  17. 33:32 Identifying Pod Scheduling & Admission Errors on vx4
  18. 36:21 Debugging Node zedpr (Hard)
  19. 37:48 Checking Kubelet Logs & Connectivity to API Server
  20. 40:07 Checking Firewalls & Network Policies on zedpr
  21. 44:40 Hint about Cilium & eBPF
  22. 45:00 Finding Rogue eBPF Process on zedpr
  23. 48:40 Identifying & Fixing eBPF Port Blocker
  24. 50:51 Cluster 33 Recap & Intro Cluster 34 (Abdel's)
  25. 51:00 Cluster 34 by Abdel Sghiouar (@boredabdel)
  26. 52:47 Debugging Node 4wglm (Easy)
  27. 54:20 Identifying & Fixing Disk Pressure Issue
  28. 56:55 Verifying Node Status (4wglm Ready)
  29. 57:56 Debugging Node ghp7k (Medium)
  30. 1:00:52 Identifying & Fixing Unschedulable Taint
  31. 1:01:11 Verifying Node Status (ghp7k Ready)
  32. 1:02:44 Debugging Node zxr6q (Hard)
  33. 1:04:16 Checking Kubelet Logs & Network Unavailable Condition
  34. 1:06:57 Checking IP Tables & Kubelet Config Check
  35. 1:11:10 Finding Misconfigured Kubelet Kubeconfig
  36. 1:21:17 Checking Application Connectivity Issue (Clustered to Postgres)
  37. 1:23:50 Debugging Database Connection
  38. 1:27:31 Checking In-Pod DNS Resolution
  39. 1:30:30 Finding & Fixing NetworkPolicy Blocking Egress
  40. 1:34:30 NetworkPolicy Explanation
  41. 1:35:32 Wrap Up & Conclusion
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:00 Introductions

1:00 Hello and welcome to Rawkode Live. Today is Klustered. I am going to, with the help of some guests, fix some broken Kubernetes clusters today. First, we've got a little bit of housekeeping. If you are not subscribed to the YouTube channel, now would be a really good time to do that. Click subscribe, click the bell, and you will get notifications for all new episodes of Rawkode Live. I will do my best to bring you the best cloud native learning materials on the Internet. All you have to do is watch and enjoy, hopefully. Also, if you wanna chat, there is a

1:32 Discord server. There are over 400 people in there now talking all things cloud native, Kubernetes, and everything in between. So you can check that out at rawkode.chat. And also starting today, this is July 1, we are taking on a new sponsor for Klustered, and that is Teleport. This was the easiest decision I had to make because I love Teleport. We have been using Teleport on Klustered since the very beginning as a great tool and I'm excited to share it with you all today. So if you wanna check that out and support the show, go to rawkode.live/teleport.

2:02 Now, let's fix some broken clusters. Today, I am joined by Marek and Abdel. Hi there both. How are you doing today? Good. I always say that and then throw it out and forget I've got two people and then they've got to say who's gonna talk first. But, Marek, can you please do the honors, introduce yourself and tell us a little bit about you? Yeah. So my name is Marek Counts. I've been doing cloud native Kubernetes things for a couple years now. I do a little bit of YouTube myself, not nearly as good as Rawkode, but

2:04 Guest Introductions (Marek & Abdel)

2:36 all about learning about computer science and Kubernetes. So I'm really excited to learn more on this show. So thanks for having me. Awesome. Thank you for sharing. And, Abdel, please introduce yourself and tell us a little bit about you. Yeah. Sure. Hi, everyone. Abdel. I go by boredabdel on Twitter. I work for Google. I specialize in everything Kubernetes and serverless and all these fancy buzzwords. I am also co-leading the cloud native community back in Morocco. I'm actually based in Stockholm, Sweden, and I have a podcast. It's Iran. You could just follow me on Twitter to

3:12 find all of that. And when David invited me to actually come fix some clusters, I was like, this is exciting. Because in Google, we don't fix clusters. We just delete them and create them again. So so that's gonna be interesting. Yes. It's gonna be a whole lot of fun, and I will make sure to include links in the show notes to both your YouTube channel and your podcast, both of you. So I really appreciate you being here, and I'm excited to see what nasty things you have got in store for us. Okay. Let's get started.

3:41 Debugging Cluster 33 (Marek's) - Initial Cluster Check

3:42 We're starting on cluster 33. I can't believe there's 33 of these already and this is Marek's cluster. So I am going to join the control plane node. This should start a session. Abdel, you should be able to join through the Teleport interface. If you can just give me an echo hello to let me know that you're here, I'll type echo hi first. Let me know when you're in and then we'll get set up. Oh, you're already in. That is super awesome. Yep. Alright. So let's set up our kubeconfig. This is a kubeadm cluster.

4:18 We're gonna use the admin.conf and of course we always need our alias. Alright, Abdel. The tradition on this show is you get to run any command you want to see if we have an API server. Fire away. Alright. Let's do that. Let's check pods and nodes at the same time. Okay. So we have some pending NGINX stuff. Postgres is pretty much dead. Your app is pretty much dead as well, and we have three nodes not ready. Lovely. Alright. Let's have a look, I guess. Yeah. What do you want to start fixing? Well, I'm curious about the NGINX nodes. I don't

5:03 think I left them there. So but I'm gonna say they're not important for now. We've got stuff stuck in terminating. Yeah. I don't know. I guess it's because I mean, I guess the Postgres is terminating because the nodes are not Yeah. We don't have any worker nodes. Right? Everything there is not ready. Correct. Yeah. So the Postgres is I in but it's not running. I did leave a README on there with just, like, a little bit of info, but feel free to debug it as you like. Yeah. We'll take a look at the README. I'm curious why you've added control plane

5:35 Cluster 33 Problem Description (Reading the README)

5:38 roles to my worker nodes, Marek. Okay. I had to have some fun. Okay. Most control plane components are running. API server is running. Kube proxies are good. DNS is down. Yeah. That's gone. Okay. And then the packet stuff is also down. Okay. Let's start with Postgres. What do you think? Well, do you wanna will we take a look at this README? I think that we shouldn't look at the list of things you did, Marek. Sure. Go for it. Go for it. My mistake. Was that supposed to be there? My mistake. Go look at it.

6:25 Okay. Let's look at it. List of things I did, the read okay. Do you want to read the README? Yeah. Let's start with the README. Just Alright. My broken cluster. I hope you enjoy. Blah blah blah. Okay. So we don't have any nodes which allow us to schedule workloads. I think we kinda worked that one out. And we need to get that working on the default namespace. You have broken each node in a unique way. Well, thanks for that. That's always fun. Alright. We even got little difficulty levels. So Yeah. Will we start with worker easy and then

6:50 Debugging Node 58ghnk (Easy)

7:03 see if we can work our way up? Yeah. Let's do that. Sure. Alright. Okay. So I am gonna jump out of this effect. So node 58ghnk? Yeah. I'll jump off of that and I'm gonna do tsh ssh root at this node, or even the at symbol. And you should be able to join us on that machine. Alright. I've done echo hi. Let me know when you're there, and we'll see what we can do. Yeah. You're so quick at this. Alright. Should we check for a working kubelet?

7:44 Kubelet & Containerd Issues on 58ghnk

7:45 Yep. Let's do that. Do you want to do it? Yeah. Sure. Alright. It does appear to be running, but we definitely have some errors here. Yeah. Let's check the logs. I'm sorry. Don't know. Let's check the log, I said. Yep. journalctl. I should have added just the tail. Alright. This is not very helpful. Stop it for now. Connection refused. Ah, it's not We've no containerd, potentially, there. Okay. Let's check that. systemctl status containerd. containerd looks like it's good. That looks suspicious. The dockerd.sock. Where is it?

9:07 The third log line. It says address /run/docker/dockerd.sock. Uh-huh. Yeah. Okay. That sounds good. Yeah. I wouldn't expect to see that. So perhaps it's being misconfigured. Yep. Where is the config file for containerd? That's a good question. So, I mean, we can definitely look here, but I don't think we'll see anything in /etc, which means we're probably gonna have to do a cat containerd and start working our way through any drop-ins that we see here, of which I don't see any. So let's do just a find. I

9:25 Locating & Fixing Containerd Socket Path

9:53 mean so I think that it's at /etc/containerd/config.toml. I need to wait. I'm looking for a config file. Okay. Come on. There's also a very cool containerd config dump command as well, which I've used in the past. Uh-huh. containerd. Really? That's cool. D dump? Yeah. Pretty verbose. People keep doing bad things with containerd and alias and images, and then making my day really bad. I think this is our problem. It's here. We have to fix this. Yeah. That's not right. I think if we fix that path, that should be our

10:45 easy one fix, and hopefully, our node gets back online. Alright. So I'm gonna go into the file and just look for run, and then we should be able to fix this. Right? What? What's going on in the session? I'm not sure. I'm gonna suggest that we just escape and quit out of this. Mhmm. Oh. Hello? That sounds good. I didn't do that. Just so everyone knows. That was not This is this is either Teleport or the streaming software. Right? It'll be maybe the command line interface for Teleport. So for modifying this file, I am going to grab

11:42 a web browser and see if it helps just in case. Or we can just edit it and we show people what we changed. I guess you were seeing the same as me though. Right? No. On my side, I will see it properly. Ah. Yeah. Okay. If you just fix it, and then we can cat the file. Yeah. Yeah. Yeah. So That's funny. It is. Yeah. Sorry, people. It's coming. Just give me one second. I am not good with Vim, so I have to and it's /run/containerd/containerd.

12:27 /run/containerd/containerd.sock. Dot sock. Right. Yeah. Okay. Cool. So You didn't have to tell us you were bad at Vim. We wouldn't have known. Okay. So this is what we changed, the address of containerd. If you so when I scroll, it doesn't scroll on your side? No. No. You can't scroll on my terminal. But I'll try and keep up. Don't worry. That one was set to /run/docker/dockerd.sock. So I just fixed it. So you should be able to if you can restart containerd.
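
The check performed by hand above can be sketched in Python. This is a minimal illustration rather than the hosts' method: it assumes the socket path lives in an `address` key under a `[grpc]` table, and the `BROKEN` sample is a reduced, hypothetical config, not the file from the episode. The function and constant names are invented for this sketch.

```python
import re

# Socket path the hosts restore in the episode.
EXPECTED_SOCKET = "/run/containerd/containerd.sock"


def find_grpc_address(config_text):
    """Return the `address` value inside the [grpc] table, or None if absent."""
    in_grpc = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Entering a new table; only [grpc] is of interest here.
            in_grpc = line == "[grpc]"
            continue
        if in_grpc:
            match = re.match(r'address\s*=\s*"([^"]*)"', line)
            if match:
                return match.group(1)
    return None


def socket_is_expected(config_text, expected=EXPECTED_SOCKET):
    """True when the configured gRPC socket matches the expected path."""
    return find_grpc_address(config_text) == expected


# Hypothetical, reduced sample showing the misconfiguration described above.
BROKEN = """
version = 2

[grpc]
  address = "/run/docker/dockerd.sock"
"""

if __name__ == "__main__":
    print(find_grpc_address(BROKEN))
    print(socket_is_expected(BROKEN))
```

A line scan is used instead of a TOML parser only to keep the sketch dependency-free; a real tool would parse the file properly.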

13:05 Verifying Node Status (58ghnk Ready)

13:11 And then we'll need to restart the kubelet. True. Maybe. We'll see. And then let's pull up your journal command from earlier. That looks good. That looks good. Yes. It does. Let's look yeah. So that was the kubelet, the logs of kubelet. Right? Yeah. Yeah. That's coming back on later. Okay. Yeah. And it's able to talk to containerd, so it should be fine. Do you want to exit oh, no. You are on the control plane node. Maybe you want to do get nodes and see if it was fixed. No. Sorry. It's a different node.

13:51 Yeah. I dropped off, but let's jump back on the control plane because I think we should see that healthy. Yeah. I'm gonna have to start adding this to the profile when I spin up the clusters. Alright. Let's see. Easy mode win. There we go. Alright. That's fixed. Okay. Do you want to check the README file again? Let's see what's next in the difficulty level. It is vx4. Let's go for it. Let's go. Alright. Alright. So tsh ssh root at. Go. I'm gonna apologize here because this one also has the

14:21 Debugging Node vx4 (Medium)

14:37 the kubelet problem. It just has stuff on top of that. We've already seen the kubelet problem. You guys fixed it or the containerd problem. You did that on every node? Come on. Not on every node. Just on this node as well. And that was my bad because that's not fun. So that's why I'm telling you. So you can just fix it real quickly. Alright. So it's the same problem as the previous one. Right? Yeah. I'll let you if you wanna jump in and fix that and There's more. It's just that is one of It is

14:56 Fixing Containerd Config Again on vx4

15:06 the same problem and then other stuff. Yes. And this was my apology. I shouldn't have done the same thing. It was late when I was doing this. Was there with a config dump too? Yeah. That is a great command. We love that. Oh. People are enjoying listening to Marek laugh. And just install Nano. Come on, Khalid. We're not gonna install Nano. Somebody was suggesting to use sed. If you can give us the sed command, it will make our life easier for the next node because he broke it the same way No. No. The third node

15:43 does not have any of this. Just these two. I accidentally did it. I thought that I hadn't done it, so I did it again, but I was doing it to a different node, and that was my problem. Okay. Okay. That could look good. So The kubelet is especially not verbose. No? Or maybe just me? Yeah. There were no logs there, I guess. Let's check the logs then. I am a big fan of journalctl. So I just love seeing the different parameters that everyone uses with journalctl. Alright. Okay. Uh-huh. Container. Uh-huh. Don't see anything there that gives me cause

16:32 for concern yet. Okay. Then if you think it's fine, I'll just do a quick f to see what are the latest. I don't know. journalctl, I forgot with journalctl, how do you do the tail? How do you get the tail? Dash dash last and then a number, I think. Uh-huh. Okay. Well, it looks good. I usually just do f u. Oh, yeah. You do f u l. Inverse. Yeah. Yeah. Okay. Cool. That looks good. Yeah. But it couldn't be this simple. Right? Perhaps. I don't know. Try scheduling something on it. Well, we don't have to schedule something on

17:18 it. We just need to get our workload scheduled on the cluster. The existing pods that are in pending states. Do you want to quickly jump on the control plane node and check if it's good? I bet it's not. Yeah. So tsh ssh. I'm just gonna leave this one open because we're gonna need access to it. And I'm fed up typing export already. So that's definitely gonna add you to my global profile on every future cluster. I apologize for making you switch back and forth so often. No. It's part of the fun. So if I was using the web interface, I have

17:21 Debugging vx4 Node Status (Still Not Ready)

18:00 multiple tabs. I was just trying to avoid doing that on the browser. Maybe I'll just start running both. Alright. So that node is not ready yet. So if there's kubelet, there's definitely something wrong. What could this be if it's not the kubelet? Can you rerun that command with the -o wide? Let's check the network configuration. Maybe he changed the internal IP or something? I'm just speculating here, by the way. Well, we would see a message from the kubelet that it couldn't reach the API server. Okay. Yeah. That makes sense. Yeah. Let's double check these logs.

18:49 Think about Okay. What needs to be running for it to be healthy. There's some things that should be running, generally, I think. We need the kube-proxy as well. Yeah. It looks can you check the status of the kube-proxy on the because I think it was unscheduled, or was it running? This is vx4. Okay. Let's see. One of one. We have Cilium, we have Cilium Operator, we have Hubble. Postgres is actually oh, no. That's terminating. Oh, no. We did these are all terminating. Cilium's okay. Hubble I don't care about. Postgres, yeah,

19:42 we'll fix it eventually. kube-proxy is running and healthy. Okay. The kube-proxy is good. Then and the Cilium components are good? Yeah. Only Hubble appears to be broken, which we can ignore. But yes, we have the Cilium and the Cilium operator running okay on this machine. So I'm gonna suggest that we try to describe the node and see if we have any events or warnings. Sure. Okay. So it looks like it can't speak to containerd. It was trying to pull images or do image garbage collection. You wanna double check our containerd configuration?

20:30 Yep. I can. Can I extract an iTerm tab split to tab? No? Alright. /run/containerd/containerd.sock. That looks good to me. But is that binary file where it needs to be? Yeah. It is there. It's suspicious. What's that suspicious color? Okay. It's a symbolic link. This isn't this is not a binary. Right? This is a symbolic link. What am I saying bullshit here? It doesn't look like a symbolic link. No. It's not. But it's not We get the same color on the control plane node, so I think it's okay.
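
If the path being inspected here is the containerd socket, which the captions suggest but do not confirm, the unusual `ls` color would simply mean "socket" rather than symlink or executable. A small Python sketch of how a path's type can be checked directly follows; `describe_path` is a hypothetical helper name, not something used in the episode.

```python
import os
import stat


def describe_path(path):
    """Classify a filesystem path without following symbolic links."""
    mode = os.lstat(path).st_mode
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


if __name__ == "__main__":
    # The path below is the one named in the episode; it only exists on a
    # host that is running containerd.
    candidate = "/run/containerd/containerd.sock"
    if os.path.lexists(candidate):
        print(candidate, "is a", describe_path(candidate))
    else:
        print(candidate, "does not exist on this machine")
```

A socket cannot be executed, which would be consistent with the "permission denied" the hosts see when they try to run it.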

21:43 And when I try to run it, it says permission denied. So let's try running a ctr image pull, NGINX, docker.io/nginx. Yeah. Container? Uh-oh. I was gonna say containerd is okay. Did I get that wrong? Or the top level repository. Right? Library. Yeah. I know. That looks good. Yeah. Okay. Yeah. containerd seems okay. So it's not containerd. So if we look at the describe again Oh, come on. Export alias describe. I guess this is maybe just an old error then. So let's look for something else. Let's see if there is any memory pressure

22:47 or disk pressure on the node. You have those under the conditions. Well, it's tainted with no schedule and no execute, but that might be because it's unhealthy. Yeah. I think so. Network unavailable. Yeah. But it's false. It says is running on this node. It's the conditions. Right? Cube so the Kubelet RPC error, that was eleven minutes ago. And then the newer one is Kubelet started Yeah. Kubelet started Kubelet started Kubelet. Okay. Okay. I was like, a Kubelet started the Kubelet? Like, what's what's funny? Okay. What else could it be? Normally, a node would not be ready if we had no

23:52 networking. But since the kubelet is able to reach the API server, it means it's working. Right? What are some of the other ways that a node goes unhealthy? So the component kubelet is not reachable. Maybe the CNI is broken, but Cilium seems to be okay. Will we try restarting Cilium and see if it actually reschedules? Maybe That's good. I know, but it's not Yeah. We'll need to do it on the control plane. So I can do pods -o wide. Correct. Was it vx4? And then delete pod. Dash n. Cilium. Bye bye. Oh, almost.

25:13 And I should have forced it. We don't need to wait on that. Let's see if it still shows up here. Oh, it's stuck in I am gonna force it. Okay. Grace period zero force. It's not starting was unexpected. Yeah. It's not coming back, though. Yeah. Because the node is stuck. Right? I I jumped with you into the session. Yeah. Just give me a I I just want to check something. So the node that we are talking about is this one. Because your output of describe looks translated on my side, so I wanted to see it

26:23 on the terminal directly. Actually, we do have a problem. What do you see? So I don't know why, but here. See? Memory pressure, disk pressure, kubelet stopped posting node status, node status unknown. Right? So if these conditions are not met, then the node well, the API server will not flag the node as ready. And then it won't schedule anything on it. Yeah. Correct. So the kubelet is not posting something to the node, which is making me think that you have screwed up the kubelet configuration, which we can check in the /etc/kubernetes no. /var/lib

26:34 Identifying Kubelet Config Problems (Memory, RunOnce)

27:29 kubelet config. Did you if you modified in your file, did you keep any backup at least? Nobody ever leave. I did, but not of this one. Not of this one. Okay. And we can fix this one. I can help you with this one. Hold on. Hold on. You just can't done, mate. Hold on. Hold on. Why do you have a memory available 500 gigabytes here? Is this port correct? 10248? That doesn't ring a bell for me. Hold on. So that 500 gigabytes will say that it hard evicts any pod if you

28:19 don't have 500 gigabytes. You're on the right track with that one. Okay. So that's not good. Yeah. You can delete that one. Oh, thanks, man. Actually, I cheated a little bit because while you guys were talking, I went to the kubelet config on the control plane node to see what's going on there. That's not cheating, that's good debugging. We like that. Okay. Let's see. I restarted the kubelet now on this node, and it should be a journalctl. It looks like it's good. Okay. But let's go back and check this node status. Yeah. It's still not ready.
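
To see why a 500-gigabyte hard eviction threshold is a problem, it helps to compare it with the memory a node actually has. The Python sketch below does that comparison under loud assumptions: the value is written as `500Gi` (the exact string in the file is not shown in the captions), the node has 8 GiB of memory (a made-up figure), and only plain byte counts and binary suffixes are parsed, which is a small subset of Kubernetes quantity syntax.

```python
# Binary suffixes only; decimal suffixes and percentages are not handled
# in this simplified sketch.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(quantity):
    """Convert a quantity such as '200Mi' or '500Gi' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def threshold_is_unsatisfiable(threshold, node_memory_bytes):
    """True if available memory can never rise above the eviction threshold.

    A hard eviction signal on available memory fires when available memory
    drops below the threshold, so a threshold at or above the node's total
    memory is always in the eviction zone.
    """
    return parse_quantity(threshold) >= node_memory_bytes


if __name__ == "__main__":
    node_memory = 8 * 1024**3  # hypothetical 8 GiB worker node
    for value in ("500Gi", "200Mi"):
        print(value, threshold_is_unsatisfiable(value, node_memory))
```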

29:16 There might have been something else in that file. Yeah. That's that's what's going to be my last my next one. So let's check this file, what's going on there. I'm sorry. We cannot do from history. I'm just comparing to the file, but I don't want to compare. I want to see by myself if I can spot something. The health check port is good. No suggestion that might just be iTerm that is struggling there, so I'm gonna find out. I'm literally now just comparing the two files, just to be very honest. No. Good. Keep diffing them.
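
Comparing a known-good config against a suspect one, as Abdel does by eye here, can also be done with Python's standard `difflib`. The two samples below are hypothetical and heavily reduced; they only borrow the values mentioned in the episode (the 10248 health port, the 500 gigabyte threshold, and the run-once setting), and the `100Mi` reference value is invented for illustration.

```python
import difflib


def diff_configs(reference, suspect):
    """Return only the added and removed lines between two config texts."""
    lines = list(
        difflib.unified_diff(
            reference.splitlines(),
            suspect.splitlines(),
            fromfile="control-plane",
            tofile="worker",
            lineterm="",
            n=0,  # no context lines, only the differences
        )
    )
    # Skip the two file header lines and the '@@' hunk markers.
    return [line for line in lines[2:] if not line.startswith("@@")]


# Hypothetical reduced samples, not the files from the episode.
REFERENCE = """\
healthzPort: 10248
evictionHard:
  memory.available: "100Mi"
"""

SUSPECT = """\
healthzPort: 10248
evictionHard:
  memory.available: "500Gi"
runOnce: true
"""

if __name__ == "__main__":
    for line in diff_configs(REFERENCE, SUSPECT):
        print(line)
```

Lines prefixed with `-` exist only in the reference file and lines prefixed with `+` only in the suspect file, which makes settings injected into one node stand out quickly.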

30:16 CPU blah blah blah. That's good. Okay. Yeah. Make this better term. Hold on. Hold on. What is that runOnce? That's not good. Sorry about that. Yeah. runOnce is a not very well known configuration that makes the kubelet just run one time and then end. That doesn't look like a very useful configuration. Where's the runOnce? Where did you see that? It's you can see it here. runOnce: true. Do you see it there in your screen? Oh, got it. Yeah. Like, I commented it out and restarted the kubelet. Alright. So And now I believe we should be good.

31:27 Verifying Node Status (vx4 Ready)

31:27 Okay. It's ready. Good. So who don't? What's the runOnce flag for? runOnce allows you to run the kubelet once. It will do its thing and then shut down, basically. So it goes to not ready because it won't accept anything more. That's what it does. What's that for? Sounds like or not Like, four year or no. Not four years. Maybe two years ago, there was actually a person that had gotten this in their configuration, and it caused me no amount of grief before I found it. So you guys are awesome because I would not have found it

32:12 that fast. So look at if you open the terminal where we have the shared session on the worker node, do you still have that one? Mhmm. So we can show the people on the livestream what's happening. So runOnce: true. I commented that out. Right? So this is pretty much what we were talking about, what Marek explained. And then there were these two lines a little bit up that say memory available 500 gigabytes. It's literally the third line from the top as we see it on the livestream. We never actually got to see this in

32:45 action because the runOnce was working. But with the evictionHard, it's going to actually kick out pods if it doesn't have at least 500 free gigs of memory. Usually, this is a very small number. You want a little overhead, a little wiggle room, like 200 megs. I see. So you wanted us to find the runOnce, fix it, and then stumble across the second one. I thought the run well, there's actually a third one that you guys haven't found, but the kubelet will be working. But we can come back to that when you

33:18 find that things aren't being scheduled on it. Uh-huh. There's more surprises. Yes. But for the most part, you fix this node. So what's happening? What's the what's the node name? V x four. Yep. So we have a bunch of stuff that are supposed to be running on that node that are We've got an admission controller, potentially. Unexpected admission error. Okay. Okay. Sneaky. So what do you want? Do you want to continue working on this one, or do we move to the next one? Well, let's do a quick check for that admission error. So let's try validating

33:32 Identifying Pod Scheduling & Admission Errors on vx4

34:21 webhook configurations. Validating. Okay. I'm mutating. Failing that, it could be in a kubect config again as a static. Yeah. Can you can you maybe describe one of the pods that says eviction error? Like, that Cilium operator pod maybe? Mhmm. Executed error while a computer recover from admission fee. Preemption. Not very helpful. So we we can take a look at the Kubernetes manifest and the Kube API server. My Vim is not gonna work, is it? Oh, no, it seems okay. Oh, no. I'll do it in key. I don't know how to list this. Okay. I'll just join a new one because I

35:39 don't know how to join from the command line yet. So what do you want to check? I wanna see if the admission controllers on the Kube API server are set and the file here. So we can do manifests then cube. Wouldn't those affect other nodes though? We're looking at something that's just affecting this node. Yeah. That's what I'm thinking. If if the admission controllers were misconfigured, unless if they were, like, a custom programmed one, which they are not, I didn't do that much, they would treat all the nodes equally. Okay. Let's jump onto the hard node then

36:21 Debugging Node zedpr (Hard)

36:23 and see what we can work out before we Alright. We move on. Okay. So LS. I'm gonna do SSH onto root at and this is zedpr. Yep. It is kubelet. What? I don't know. There's no telling me Kitty isn't a good terminal, but I'm just maybe I think I'll just use the web interface. You know, fun experiment, no using the command line. Okay. So what was that error again? No. Even from the web interface. Okay. I thought it was my terminal, but, okay, it's not. Right. Cool. Did you mess up the terminal? No. It's my

37:29 Kitty. Alright. Let me pull up the web interface. Alright. Okay. You wanna join this new session, please? I'm waiting for it. Okay. So if we run the status on the kubelet, we yeah. We've got messages here. So Request cancel while I is this the IP address of the, sorry, of the API server. Correct? Let's find out. 1477549. 6 7? Yeah. Yep. Failed to ensure, and it's getting a timeout. So maybe it's some sort of, like, something that blocks this node from existing out to the API server because it's getting a timeout issue.
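
A timeout like this can be reproduced outside the kubelet with a plain TCP connection attempt. The Python sketch below is one way to do that; `can_connect` is a hypothetical helper, and the host in the usage section is a placeholder because the real address is garbled in the captions. Port 6443 is the API server port mentioned in the episode.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder host; substitute the control plane address for a real check.
    api_server_host = "127.0.0.1"
    print("reachable:", can_connect(api_server_host, 6443))
```

This tests TCP rather than ICMP, so it can disagree with `ping`; a refused connection returns quickly, while silently dropped packets show up as the full timeout elapsing.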

37:48 Checking Kubelet Logs & Connectivity to API Server

38:53 Yeah. So maybe a simple thing would be just to ping that IP and see if it will respond. Oh, sorry. I'll let you type. Oh, oh, oh, just make it sure. It should be one four seven dot I know that I'm copying it. Well, I should just copy-paste it. Just bear with me. 67. Yeah. You're right. 6443. Okay. I'm expecting to see an okay here. This is the healthz port of the API server or maybe API. No. So the node is not this node is not able to reach out to the API server on that IP address. So maybe

39:43 Can we ping it? No. And, no, we can't. Can we ping the Internet? But you SSH over public IP. So, yeah, the Internet actually should work. So I guess for IP IP or something? Exactly. Yeah. I'm really bad at reading active rules, by the way. So I'm just looking for, like, a very quick hint. Yeah. Well I should have put four or five stars on this one. Just I apologize. For dropping mark block in terminal local net. Okay. No. There are nothing. Right? I don't see anything. So there's a this is a Cilium cluster. I wonder if we'll get any Cilium host

40:07 Checking Firewalls & Network Policies on zedpr

40:43 networking policies. You think so? I mean, we can definitely check. So Cilium has the ability to use eBPF to modify the networking, but we can do a separate Kubernetes. Let me just set this up again. K k control. And then we can do k get host net poll. Is that right? Nope. I don't know how the Cilium host network policies are. Yeah. So we wanna get Cilium. I kinda wanna just Cluster-wide network policies, but that would make sense. None. And then there's Cilium network. Oh, okay. So it's not that. You can you guys are

41:42 for educational purposes, you guys are, like, right on the right path. This is why I said four or five stars because I kinda went off the beaten path of what the right this is where you should look. If it's something that was happening. At Cilium. At your container run at your container networking solution, in our case, Cilium, hundred percent, this is probably where it would be coming from or DNS. Uh-huh. Well, DNS, no, because I tried to do a ping from the worker node on www.google.com, and it worked. Right? You already proved DNS was working. Yep.

42:22 So it's not that. So I'm looking at the Cilium configuration on the control plane node to see if you messed with that. Let's see. Dash dash o YAML. Okay. Yeah. But if the Cilium configuration was broken on the control plane node, it would affect everything. Right? That's probably not it. People are suggesting it's always DNS. I don't think it is DNS. It's not DNS this time, but most people would be right to question DNS. I don't know how often that breaks. So I did a traceroute from the worker node toward the IP address of the

43:22 control plane node, and I could see that the node actually is able to get out. So it's able to get to the gateway, and I believe the problem is on the receiving end, so it will be on the control plane side. Right? Sorry. I have a background in networking, so I'm reasoning. I do not. So I'm loving this. I'll let you do this. Okay. So Well, I don't know what to do. It's gonna be obvious, to be very honest. I just flexed a little bit of my networking knowledge. Okay.

43:53 No iptables. I'm gonna just make sure we have no firewall. I mean, I run a status. I'm assuming it's not running though. Yeah. No. I thought that would be too easy. You guys are better than that. I like easy. Just Easy's good. That was the other two nodes. You guys blew through those. So what else could this be? Is uh-oh. I'm trying to think of a good hint for you guys. So you were really, really, really, really hot when you were talking about Cilium and eBPF. Like, you were on fire. I hope there's not an eBPF probe running

44:40 Hint about Cilium & eBPF

44:53 on this machine. Hold on. So if we look at all the pods and we do a dash o wide here, And then we grep on the node that we have. So the z p blah blah, is Cilium running here. The Cilium operator is running on both nodes, so that's good. Let's check. I want to check the logs of this thing. Post the this. And Which machine are you on? Sorry. I'm on the control plane node. So this one is not outputting any oh, come there on the logs. But so the two Cilium things that run

45:00 Finding Rogue eBPF Process on zedpr

45:59 on does not have any logs? So I see a binary running. I kinda highlighted that, but I don't know if you've seen. But there's this BPF filter running, which to me looks rather suspicious. What is BPF filter? I mean, I'm gonna suggest we just kill it and see what happens. Go ahead. Oh, three six. Followed by taping. Do you wanna try that ping again? Yep. Sorry. Give me a second. Still not pinging. It's six seventy six. Right? What is that IP address again? Just to make sure that I'm not pinging the wrong one.

47:13 Killing the program may have made things worse, to be fair, if that was the malicious one. No. In fact, it's running again. Okay. Let's find this rogue systemd service or Kubernetes pod. Let's see what we've got running in the cluster. So, in the interest of education, there is a thing called bpftool, which you can use to, like, examine BPF apps if that's what you're digging into. I'm not sure that it's installed on this machine. That was just a pointer for everybody out there watching if you're digging into these things. But this was running on the control plane.

48:09 Right? It was. I think there's something on the control plane that's blocking traffic from that node. Am I right, Marek? You're close. It's actually on the worker node. It's on the worker node. Yeah. So I think we're close to time here. Yeah. So what I did is... There's a Python port blocker. Yes, sir. Actually, just deleting that doesn't fix the problem because it actually loads the BPF module that Python does, which is that filter.c. So you could actually go and check that out if you like. It's just an eBPF program. All it

48:40 Identifying & Fixing eBPF Port Blocker

49:14 does is it goes in and it reads each line there. Yep. That was it. So it takes every single packet, and if the source was the API server, it rejects it and throws it out. It drops the packet. Which is why when we did the traceroute, we got out, but we then never got anything back. Okay. It could not get back in. You never get the request because the request, or rather the response, is then dropped by this. This is never going to happen in production and was just a fun eBPF

49:52 demo. Okay. Jabbarred. See you later. Okay. Sorry, guys. In the interest of time, and I do wanna do the other cluster. Did we get all the issues there, Marek? Are you happy? You didn't fix the deployments, but that's fine. We didn't have time. Yeah. I wanna make sure we have enough time for both clusters. So I deleted the scheduler, and so there is no scheduler in the static pod manifests. So how kubeadm stands this up is all those are static pod manifests, and those manifests exist in a static

50:33 folder. And so I just deleted the scheduler. Now I did back it up, so it was there. If you guys found it, you could have just restored it. But, anyways... Alright. Awesome. Alright. So no pod would get scheduled right now because there's no scheduler. Alright. That was the last thing. And a nice assortment of problems there for us to tackle. Fun, kinda. It was so fun. You guys were fantastic. Alright. Thank you very much for that. Let's see if Abdel has been nicer to us. I'm not very sure. Alright. I'm gonna open a session on the control

51:00 Cluster 34 by Abdel Sghiouar (@boredabdel)

51:23 plane node. Please join me and echo something when you are ready. I will export a kubeconfig, alias kubectl. And I will give you the honors of running whatever command you wish to check that we have an API server. Yeah. I agree with the comment there from Navin in the chat. Abdel, you were fantastic debugging that. Very well done indeed. Thank you. Yeah. 100%, you guys rocked it. Although I am adding no eBPF to the rules in future. So that's no UTF characters and no eBPF. I do have a tendency of people making

52:15 new rules when I join challenges. So this is alright. Not entirely. DevOps Directive did a challenge, then he made a new rule after I did that. So I apologize, guys. I do like to have fun, and, hopefully, that was fine. I didn't mean for it to be horrible. Alright. Let's see. We have a control plane. This is looking good so far. Yeah. We have one working worker. One has scheduling disabled, and something's broken. Oh, so the theme of my challenge is that even if things seem to be working, they sometimes don't. Those are the worst. So let's

52:47 Debugging Node 4wglm (Easy)

52:54 yeah. Yeah. Let's actually just describe this worker. Actually, let's look at the pods. Let's see. Yeah. So we have some evicted. What ones are those running on? So I'm... okay. Let's see. So there are a few things. There are some nodes that are broken. Oh, actually, all of them are broken. They have to be fixed. If I could rate them from the easiest to the more difficult, it would be 4wglm followed by... well, maybe ghp7k is the first one, 4wglm the second one, and then the last one

53:38 would be the more difficult. But, like, I think you will fix the nodes very quickly. What I would like you to do is actually try to get the workloads up and running. Okay. Okay. So you're more interested in that. Okay. Let's fix 4wglm then, and then we'll take it from there. I've created a session. Feel free to please join, and then let's start some debugging. Oh, okay. Sorry. I was still on the control plane. Let me pull up the new session. Alright. I'm there. Alright. So if we're looking here,

54:20 Identifying & Fixing Disk Pressure Issue

54:24 is up. Memory pressure false. So we're seeing disk pressure on this node. Okay. So you did that describe on a node. Yep. And we're seeing some disk pressure. So something's causing disk pressure. The terminal went all funky on me. Yeah. So the failed disk... something is causing failed disk. So let's try... oh, okay. So the limits are zero. Maybe it's not necessarily failed disk, and they just configured the kubelet funkily, in a weird way. It's not really a word. Well, looking at the disks on our worker, our root partition is at 99% saturation. Although, there's still 2.5 gig left,

55:17 but even just looking at this... So how would you figure out... Okay. How do you do the inode thing? So if you want to see which folders are full, you can do a du -h --max-depth=1, or something like that. Yeah. And it should be able to tell you where you have a lot of stuff. Alright. And our directory. We got some chunky looking files in here. Yeah. Those hundred gig files. You wanna just... well, do we just delete? I think that I don't think that

56:20 those are useful for anything. Not that I know of. Yeah. If we delete those, that should hopefully clear that up. I'm looking over the describe here to see if I see anything else that was... yeah. I think that should clear it up, and then I'll just run a describe again and see if that... Yeah. That's looking much better. Hopefully, that removes the disk pressure warning then. Yeah. It looks like that made it... You guys pretty much got it. You fixed it. So the goal actually of this exercise was to... because I get a lot of

56:55 Verifying Node Status (4wglm Ready)

57:04 times, we get people asking us, can we just install random stuff on the clusters? Like, can we actually use the nodes of Kubernetes to do other shit? Right? And, obviously, the problem there is that if you install things on a Kubernetes node that is out of the control of Kubernetes itself, it wouldn't know about it. Right? So the kubelet has a configuration to clear up the logs, to rotate the images, etcetera, etcetera, on a periodic schedule, but it does this only in a few parts of the file system. So if you would deploy something in /root or in

57:33 /opt, it wouldn't know about it. Right? It's like if you're running a VM on a node, which I thought... Pretty much. Yeah. Yeah. A VM outside of Kubernetes. Cool. Is there anything else that Kubernetes is controlled? Anything that's having to be running through Kubernetes itself. Alright. So we have this next one that has scheduling disabled. Check out why the... Yeah. Go for it. One second. My session's got a little bit weird. If you just resize the window a little bit, it tweaks it. Okay. Just drag the corner away and put it back in.
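
Editor's sketch: the disk pressure hunt in this segment used du with a max depth of one to see which top-level directories hold the most data. The following is a minimal Python model of that idea, not what was run on the stream. The function names `dir_sizes` and `largest` are hypothetical, and unlike du it sums apparent file sizes rather than allocated disk blocks.

```python
import os
import tempfile

def dir_sizes(root):
    """Return {immediate child name: total bytes}, similar in spirit to du --max-depth=1."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            # Walk the whole subtree and add up regular file sizes.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
            sizes[entry.name] = total
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes

def largest(root, n=3):
    """Return the n biggest immediate children of root as (name, bytes) pairs."""
    return sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)[:n]

if __name__ == "__main__":
    # Demo on a throwaway directory so nothing on the real node is touched.
    with tempfile.TemporaryDirectory() as demo:
        os.makedirs(os.path.join(demo, "opt"))
        with open(os.path.join(demo, "opt", "chunky-file"), "wb") as fh:
            fh.write(b"\0" * 4096)
        print(largest(demo))
```

Pointing `largest` at a directory such as /root or /opt would surface files that sit outside the paths the kubelet cleans up, which is the scenario described above.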

57:56 Debugging Node ghp7k (Medium)

58:21 That does fix it. Perfect. Thank you. Alright. So we want to describe this. I think describe is a generally good way of... Sorry. Yeah. Sid is asking where should I install my crypto miners if not on the control plane? That's a good one. No crypto miners here. Thank you. Okay. So you've described. Do we see a reason for being unschedulable? It could just be tainted that way. One second. Let's see. Yeah. It's definitely got a taint. I mean, we can try removing it and see if it gets added back or we can

59:17 Oh, one second. We have AppArmor enabled. Who enabled AppArmor? Figure properly. What? Nobody's allowed to enable AppArmor. Come on. AppArmor is enabled. I did not enable AppArmor. It is there. Alright. Let's see what we'll choose. Let's check out its labels. This looks fine. I guess we could also check the kubelet logs here. Which machine are you on? Oh, the actual node. Right? No, I'm on the control plane right now. I might jump to the actual node and check, or if you wanna... I'm just looking through the describe right now. CPU looks good. Memory

1:00:18 pod limit. Version. The version is right. Same as the rest of the nodes. No, Kevin. Never gonna happen. We'll never be doing an episode with Red Hat Linux and SELinux. Okay. Alright. Did you get the logs of the kubelet? Nope. Okay. Let me... I can jump over there and get them. That's alright. What machine was it? I don't know. ghp. Yep. Alright. I have opened a session on ghp. Come say hello. We'll do a journalctl. I mean, I guess maybe we should just untaint it or remove the unschedulable and then go from

1:01:11 Verifying Node Status (ghp7k Ready)

1:01:32 there. Yeah. I'd say we remove it and see if it comes back. I think that's a pretty good shot. Yeah. You can just click here at that node. Really should copy the name. I know. Taints. Remove the unschedulable as well. Fixed. Yeah. So that was the last node they wanted to screw up. I didn't know what to do. It was just cordoned then. So, I mean, you did what should happen. You could just do a kubectl uncordon, and it would just go

1:02:29 back. Cordon. Okay. Yeah. I didn't think it was a cordon because I never saw anything in the events to suggest that it was cordoned. I thought an event would show up there, but I guess not. Alright. So let's tackle this last one. I'm going to do a... let's do a kubectl describe on the control plane there. Yeah. Go for it. Which ones? One second. I gotta go reconnect to the control plane, I think. Remember to join so we can see it on the... Yeah. Screen. I'm not nearly as fast as Abdel at rejoining

1:02:44 Debugging Node zxr6q (Hard)

1:03:14 these. Alright. So we have our nodes here. Let's go ahead and describe this one. Alright. So let's see. It's not ready. We have unknown. Networking is not available on the node. This is a big red flag. It does say Cilium is up, though. So what pods are on this node? That output is extremely confusing, by the way, because when it says network unavailable false, it actually means it's true. Oh, that's right. Because it's a double negative. Unavailable. Yeah. I get it. Yeah. Okay. Whoever does the double negatives... you're right. Okay. But we're unknown on all of

1:04:06 our memory. So this could be a bad configuration of the kubelet like I had done, or the health endpoint could also be busted. Let me look through here. It looks like it's reporting how much CPU and memory it has. Non-terminated pods, it has Cilium up, default cluster, kube-system. I've opened a session on that node if you wanna... Okay. Poke around. Yeah. Which node? zxr6q. So I guess let's go ahead and check the kubelet logs so we can do... Nope. There's another. Alright. So Kubelet nodes not sync.
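
Editor's sketch: the double-negative confusion above comes from node conditions having opposite healthy values. Ready is healthy when True, while the conditions that name a problem (such as NetworkUnavailable) are healthy when False, and Unknown is never healthy. A minimal Python model of that reading, with hypothetical function names and made-up sample data:

```python
# Condition types that describe a problem: the healthy status is "False".
PROBLEM_CONDITIONS = {
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
}

def condition_is_healthy(cond_type, status):
    """Interpret one node condition as healthy (True) or not (False)."""
    if status == "Unknown":
        # The control plane has not heard from the kubelet recently enough.
        return False
    if cond_type == "Ready":
        return status == "True"
    if cond_type in PROBLEM_CONDITIONS:
        return status == "False"
    raise ValueError(f"unrecognised condition type: {cond_type}")

def unhealthy_conditions(conditions):
    """Return the types of all conditions that need attention."""
    return [c["type"] for c in conditions
            if not condition_is_healthy(c["type"], c["status"])]

if __name__ == "__main__":
    # Hypothetical sample resembling a node that stopped reporting.
    sample = [
        {"type": "NetworkUnavailable", "status": "False"},
        {"type": "MemoryPressure", "status": "Unknown"},
        {"type": "DiskPressure", "status": "Unknown"},
        {"type": "Ready", "status": "Unknown"},
    ]
    print(unhealthy_conditions(sample))
```

In this model a NetworkUnavailable status of False is the good case, which matches the point made on the stream.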

1:04:16 Checking Kubelet Logs & Network Unavailable Condition

1:05:27 Unable to write. So when you see these unable to write, it could also be like an etcd error when you try to write a resource. I don't think this is, because etcd usually is a cluster-wide event. But sometimes when you get unable to write things, it's etcd. But I don't think it is in this case because it's this node specifically. Yeah. I think you might be right. Yeah. We also got another tell. The error message just above it, which says eviction manager failed to get memory. It got failed to get

1:06:02 summary stats, failed to get node info, nodes have not been ready. So, yeah, I don't think our kubelet is actually speaking to our API server, potentially. In fact, at the very end... Yeah. That big wall of text, we have a timeout as well. So let's just check the firewall. What is it? Ubuntu has the super easy firewall that I can never remember. It's not firewalld. It's ufw. WF? Yeah. ufw. I don't know. Yeah. Don't think there's... Oh, are root. So... And you can try firewalld as a

1:06:48 usually work. That's usually there as an alias. Yeah. So I don't think there's a firewall. Okay. So if there's not a firewall, they could have messed with iptables. That I'm not very good with. Let's see. Dash capital L will list all the rules, then we can look for drops. That's about as much iptables knowledge as I've got. But Abdel may have given himself away earlier by telling us of his networking background, so I think we might be able to... It's more simple than that. You're on the right path, but it's more simple than that. Alright. So we have rejects here.
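
Editor's sketch: the approach described above, list every rule and then look for drops and rejects, can be modelled in a few lines of Python. This is only an illustration under stated assumptions: the listing in `SAMPLE_LISTING` is made up for the example (it is not the output seen on the stream), and `find_blocking_rules` is a hypothetical helper that assumes the default column layout of an iptables -L listing.

```python
# Made-up listing in the default column layout: target, prot, opt, source, destination.
SAMPLE_LISTING = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
REJECT     tcp  --  anywhere             172.26.0.0/16        reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
"""

def find_blocking_rules(listing):
    """Return every DROP or REJECT rule together with the chain it belongs to."""
    rules = []
    chain = None
    for line in listing.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Chain":
            chain = parts[1]          # e.g. "INPUT"
            continue
        if parts[0] == "target":
            continue                  # column header row
        if parts[0] in ("DROP", "REJECT") and len(parts) >= 5:
            rules.append({
                "chain": chain,
                "target": parts[0],
                "prot": parts[1],
                "source": parts[3],
                "destination": parts[4],
            })
    return rules

if __name__ == "__main__":
    for rule in find_blocking_rules(SAMPLE_LISTING):
        print(rule)
```

A rule found this way is only a lead. As the rest of this segment shows, rules scoped to internal ranges may be managed by cluster components rather than planted by whoever broke the node.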

1:06:57 Checking IP Tables & Kubelet Config Check

1:07:33 Yeah. We're rejecting TCP anywhere. So, I mean, if we ping google.com, this should fail. Right? No. It works. I mean, those IP addresses you are looking at are 172.26. Those are internal IP addresses. So RFC 1918. Oh, okay. Right. They're only rejecting to these internal ones. Alright. So we just need to remove these, which if I knew the command for that, I would do that right now. We can try a flush and see if they go away. I think if you do that... And then restart kube-proxy. If you do it, kube-proxy will just

1:08:15 recreate those, by the way. It's a controller. Right? So it does a reconciliation loop. Well, or maybe after you. Okay. So did he put something? I just did an all... let me look through the namespaces here, make sure he didn't hide anything in the namespace. Okay. So we have the Cilium not connecting. Outlook. Middle will be So he's using Cilium or a controller to do it. Don't see any Cilium network policies. I also don't see anything that's running here that would do that. I'm gonna restart kubelet and containerd and then

1:09:33 check the logs of the kubelet. Let's see what happens. I'm curious to see if we're still gonna see that timeout. Yeah. It looks maybe. Alright. One second. Let's check. One second. Let me join the session over here. I am enjoying this very much. What did you say? I apologize. You broke up a little. I said, as a networking nerd, I am enjoying this. Oh, that's good. That's good. Alright. So we have the idea that our iptables are getting screwed up. And let's check our services actually. Yeah. We got this drop all anywhere anywhere on this.
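
Editor's sketch: the point made a little earlier, that the rejected destinations were internal RFC 1918 addresses and so a ping to a public host still worked, can be checked with Python's standard ipaddress module. The three networks below are the private ranges defined by RFC 1918. The helper name `is_rfc1918` and the sample addresses are hypothetical.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(addr):
    """Return True if addr falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918_NETWORKS)

if __name__ == "__main__":
    # Illustrative addresses only; 172.26.x.x sits inside 172.16.0.0/12.
    for addr in ("172.26.0.1", "192.168.1.10", "8.8.8.8", "172.32.0.1"):
        print(addr, is_rfc1918(addr))
```

Note that 172.32.0.1 is outside the /12 block, so only addresses from 172.16 through 172.31 count as private in that range.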

1:10:44 I mean, that looks bad to me. So something is writing... It's not, it's not iptables. I'll just help you. It's not iptables. It's more simple than that. It's more simple than that. Yeah. More simple. I don't want you to go into IP rules, firewall rules, loophole. No. Wait. I mean, is it just the IP address is wrong? Let's see. Seems fine. Wait. We could check the... Someone's left us a backup. Oh. Very nice. Nice backup. I didn't want to leave you hanging there. It just has this configured endpoint

1:11:10 Finding Misconfigured Kubelet Kubeconfig

1:11:48 for the API server. So if you edit it and remove the one in one sixty eight, it should work. Well, you were right. It was simpler than that. You want the honors of fixing our IP address? Sure. I want you to get to the next one. Alright. An interesting note here: if you ever type reboot, so systemctl reboot, it will actually just reboot your machine, not the service that you type after that, because it takes it as a bad argument, which would then say now. Anyways, fun story. I've done that too many times. Alright. So now you want us to get

1:12:45 on to the next one. Yep. Well, he just said I did this before, and he's right. I actually just removed the s from HTTPS and nobody noticed it. Oh, do I need to add that back in then? No. No. No. No. No. No. No. That was a different episode. It's just a flashback. Oh, okay. Maybe I think you want to check if your nodes are up. I was gonna say, I don't think we're scheduling stuff though still. So... Alright. Let me check. Confirm that. Yeah. So we still have evicted pods here. Let's just go ahead and describe one of

1:13:26 these pods. But that's just an old replica set, and it's evicted. I think it's just not been cleaned up. Like, it does look like clustered is running and Postgres is running. Alright. Let me do this. Let's see. Yeah. Scheduled. So here is the challenge for you guys. So when you deploy an app into Kubernetes, if you want to expose it to be able to reach it, you have to create a service. Right? Service. Alright. So you wanna create a service to expose memory? Yeah. Yep. Well, we could always check that our current

1:14:26 services have endpoints. Okay. So we should have a postgres and a clustered service. Do you wanna see if they are looking good? So we have none. So let's go ahead and look at our Postgres here. That's good. Alright. So we have app selectors, Postgres. Let's see. Here's the IP. We don't have a type on here, do we? It's ClusterIP. But I think the important thing here is the endpoints has a value. So I'm curious to see... do you wanna do the same on clustered? Sure. Or yeah. Do the same on clustered and try to scale it up and

1:15:29 see if that's actually what's broken. So you kinda got into what could be broken. So... Right. So that IP corresponds to the pod which already exists of clustered. Right? Right. And you already are putting a port on there. Yeah. So scale clustered up. Let's say two replicas, for example. Well, let's try, like, let's try upgrading the image rather than scaling it. That could also work. Of course. Yeah. Because, like, you know, one of the things we wanna do here is modify our deployment to get v two. So... If it's just hitting... okay. Sweet. You have

1:16:11 a different... I was gonna say if it's just latest, you can't actually do that. But this is gonna still point to the same endpoint. It should be a new hardcoder. But there is a new pod. You go ahead and do get... Alright. I'll do as Abdel said. Let's scale up. Because that looks okay to me. Right? But... Can you... But this is the problem here. One second. Can you rerun the... What were you saying, Abdel? Wide. Yeah. And see... So this is the IP addresses. Yeah. Yeah. This is what the problem is.
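
Editor's sketch: the check being done by eye here, comparing pod IPs from a wide listing against the addresses in a service's endpoints, amounts to a set difference. A minimal Python model follows. All names, labels and IP addresses are hypothetical, and `endpoint_drift` is an illustration rather than anything Kubernetes itself provides.

```python
def matches(selector, labels):
    """True if every key/value pair in the service selector appears in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())

def endpoint_drift(selector, pods, endpoint_ips):
    """Compare the pod IPs a service should route to with what its endpoints list."""
    expected = {p["ip"] for p in pods
                if p["ready"] and matches(selector, p["labels"])}
    actual = set(endpoint_ips)
    return {
        "missing": sorted(expected - actual),  # ready pods the endpoints do not list
        "stale": sorted(actual - expected),    # listed addresses with no matching pod
    }

if __name__ == "__main__":
    # Made-up data: two fresh pods after a rollout, endpoints still on the old IP.
    pods = [
        {"ip": "10.0.0.5", "ready": True, "labels": {"app": "clustered"}},
        {"ip": "10.0.0.9", "ready": True, "labels": {"app": "clustered"}},
        {"ip": "10.0.0.7", "ready": True, "labels": {"app": "postgres"}},
    ]
    print(endpoint_drift({"app": "clustered"}, pods, ["10.0.0.4"]))
```

When both lists are non-empty after a rollout, as in the demo data, the endpoints are not being reconciled, which is the symptom discussed next.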

1:17:06 Been updated. His endpoint's not being updated, and so it's not going to do this. So we need to go in and edit the service. Or fix the controller manager, maybe. Do you think it could be disabled? I've actually never hard-coded endpoints into a service. Usually, I use a selector label. There should be a selector label, though. Right? But there should be a selector label. So let me look for that. I don't remember if it was there. Yeah. I'm assuming if their service hasn't been updated, then Abdel has been a bit of a

1:17:54 meanie and disabled one of our controllers. And I'm looking at him to see if he's got a tell. Like... No idea. Alright. So we do have an app selector. The endpoint's not being updated. This is an old update. This isn't either of the new ones. So I'm assuming if we come in here and look at the manifests... I wonder if he's been kind enough to leave us a backup. No. Not this time. But look, the kube-controller-manager here has a list of controllers. You see that? Yeah. The endpoints.

1:18:47 Sneaky. So if you do a minus on the controller, it means remove one of the controllers. Right? Yes. So the star means all the controllers except the nondefault ones. The nondefault ones are the token cleaner and the bootstrap validator, or something like that. And then if you do a minus on one of the controllers, it just basically removes that controller. It doesn't run it. So I removed the endpoint controller, which is what... You can do the same thing. Mhmm. Yeah. No. You can do the same thing with labels, with the minus, and just

1:19:22 removing the label. Yes. So the reason I didn't leave a backup there is because I struggled with it, because I didn't know. If you just get that file copied to something called dot old or dot backup, the kubelet will still pick it up. It wouldn't pick up files that start with a dot, though. So you could have... Yeah. Exactly. Yeah. Dot back. Oh, yeah. I could have done that. Yes. Alright. So now we need to restart. Yeah. You have to do a rollout on the... We have to restart the kubelet, though, first. Right?
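
Editor's sketch: the star and minus semantics of the controller list described above can be modelled as a small lookup. This is a simplified Python model, not the kube-controller-manager source. The controller names in the example lists are illustrative assumptions (names differ between Kubernetes versions), and `is_controller_enabled` is a hypothetical helper.

```python
def is_controller_enabled(name, flag_value, off_by_default):
    """Model of a controller list such as '*,-endpoint'.

    An explicit 'name' enables it, an explicit '-name' disables it, and '*'
    enables everything that is not off by default.
    """
    tokens = [t.strip() for t in flag_value.split(",") if t.strip()]
    has_star = False
    for token in tokens:
        if token == name:
            return True
        if token == "-" + name:
            return False
        if token == "*":
            has_star = True
    if not has_star:
        return False
    return name not in off_by_default

if __name__ == "__main__":
    # Illustrative names only; check the real list for your Kubernetes version.
    off_by_default = {"tokencleaner", "bootstrapsigner"}
    for controller in ("endpoint", "deployment", "tokencleaner"):
        print(controller, is_controller_enabled(controller, "*,-endpoint", off_by_default))
```

With a value like `*,-endpoint`, everything on by default keeps running except the one controller prefixed with a minus, which is why the service endpoints stopped being updated.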

1:20:01 No. The kubelet will detect that file change and roll out the new controller manager. I think we'll just need to modify our deployment, and then the controller rollout. Yeah. Okay. So we just need to restart this. Yeah. You can use whatever method you want. You're too nice. I normally do kubectl delete pods --all. Which I know is a horrible, horrible way of doing it, but it's fast. Oh, wait. Did I... Yeah. You just missed the l in clustered. Yeah. Sweet. Now we should be able to describe the service, and we have all the

1:21:04 endpoints listing. Very nice. That was a good one. And, yeah, I actually really like that one because of the educational bit with the minus there. Let's see if there's anything else here that's broken. Yeah. I'm going... There is one more thing. I am going to just, on my local terminal, do a port-forward and see if we can hit our application. That's fine. Let's grab this one. Looks like we're gonna get a timeout. So I suspect our clustered application cannot speak to Postgres. Well, we didn't scale Postgres or anything, so anything that had a bad service,

1:21:17 Checking Application Connectivity Issue (Clustered to Postgres)

1:22:01 right, would need to be updated anyways. Right? Failed to connect to the database, and we've got an angry face. Alright. Go for it. How do we fix this? Alright. So I was just looking at everything that was up and running, which is quite a bit. Let's actually get the logs from the database, see if there's anything wrong with the database. I'm guessing that it's the database service. So Postgres here, it says it's running. I always use get with logs. Yeah. We all do it. They might as well make get and logs, like, just an alias.

1:23:02 Yeah. Do you wanna check the service and make sure the IP addresses match? I wonder if we need to do that, you know, what we did with the clustered one. Okay. Yeah. k get pods -A -o wide. So its IP address is this one right here, and then we're going to do 4156. It's just Postgres, I think. Alright. Why did it... oh, I misspelled it. There we go. There we go. So external IP, cluster IP address. Yeah. Let's describe this service. Oh, if nothing else this is correct.

1:23:50 Debugging Database Connection

1:24:18 192 yeah. What's the IP address, that he could've messed with the IP address? No. It's always DNS. I was hoping you don't see it. It's always DNS. Okay. Yeah. Probably. That's because earlier when me and David were stuck at one of your problems, everybody in chat was saying it's always DNS. So you screwed with DNS. You wanna check what's running in the kube-system namespace? I'm now starting to doubt whether that service was called Postgres or PostgreSQL. Like, I wonder if he's just renamed it. Oh, no. Or we just had no

1:25:16 CoreDNS. Look at that. Or no CoreDNS. Yeah. That would be a problem. Wait. Wait. Did you leave us a backup? That's a lesson for you folks, how to easily screw up DNS in Kubernetes. I'm assuming by a desired zero, it may just be scaled down, you think? We could try scaling it. Let's do that. I'm good with that. Okay. Let's do three since we have a three node. And you'll need the namespace. Yeah. Let's see. Well, they're not running yet, but it says three of three ready. Hit the refresh. See if your database works.

1:26:38 I don't think so. Alright. Well, we do have CoreDNS running. We have three pods running. Let's check some of the logs from CoreDNS. Did you leave the kube-proxy command running while you were scaling up and down... the kubectl port-forward command, while you were scaling up and down clustered? That might be your problem. It definitely still seems to be timing out. Do you wanna exec into one of our pods and try cluster DNS and see if it's working? Yeah. I was just looking at the logs. I don't see... Yeah. CoreDNS never logs anything.

1:27:31 Checking In-Pod DNS Resolution

1:27:47 Yeah. Yeah. Well, it didn't fail. But yeah. Okay. Sure. I guess we could exec into the clustered. Does it have bash in it? I don't know what's installed in these. It's shit. What is it? It may be Alpine or maybe Scratch. We're about to find out. We could just run a... You ran NGINX earlier, right, I think? Alright. I did. Let's... Yeah. There we go. Let's jump into there. I'm pretty sure that's got bash available. Go for it. Don't know what complete. Curl. Oh, let's try normal DNS.

1:28:38 Yep. And Kubernetes. It looks to be working. Do we have nc? No. Can you curl Postgres? We're getting into the end of this, and there is only one last thing that you have to figure out. Yeah. The name of the DNS package across different operating systems. That's the one thing I need to figure out because it changes every time. DNS tools, DNS tool. It changes per version. It'll take... come on. Maybe run an apt-get update. Did you do an update? I did. I just never remember the name of the package across Ubuntu, Debian. dnsutils. dns-utils.

1:29:35 I did that. I thought I tried that. Yeah. Maybe it's not dash. Yeah. Alright. Let's dig Postgres. No response. Have to do default. No. Dot default. Because otherwise, it would be... It's in the default namespace. Oh, by default, it should find it. Yeah. Okay. That resolved. DNS is working. Oh, no. It didn't. Because this is... oh, no. It did. It did. No. It did. That's your IP address of the DNS or the Kubernetes API. So he said there's one last thing for us to find out. We gotta figure out something else that's broken.
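
Editor's sketch: the back and forth above about whether the short name needs ".default" comes down to how a pod's resolver expands names using its search domains. The Python model below shows the order in which candidate names would be tried. It assumes the common defaults for a pod in the default namespace (search domains under cluster.local and ndots set to 5), which may differ on a given cluster, and `candidate_names` is a hypothetical helper rather than a real resolver.

```python
# Assumed resolv.conf defaults for a pod in the "default" namespace.
SEARCH_DOMAINS = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

def candidate_names(name, search_domains=SEARCH_DOMAINS, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]                      # already absolute, no search expansion
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded       # enough dots: try the name as given first
    return expanded + [absolute]           # few dots: walk the search list first

if __name__ == "__main__":
    for query in ("postgres", "postgres.default"):
        print(query, "->", candidate_names(query))
```

Under these assumptions both the bare service name and the name with the namespace appended end up resolving through a search domain, which fits the observation that the lookup worked either way. Tools such as dig skip the search list unless asked to use it, which may explain the initial empty answer.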

1:30:27 Yeah. Okay. So just to help you out, it's something that shouldn't be there. Something that shouldn't be there. Yes. Is it this file right here? Oh, wait. Where am I? We're still in the container. Yeah. Oh, okay. Yeah. Thank you. I was like, what? Nice tip from Waleed using +short with dig. Oh, that's just the backup of the kube-controller-manager. So something that shouldn't be there. There's MetalLB, kube-system. Let's see. Packet. Wait. What's this packet cloud, man? Is this... I don't know what's supposed

1:30:30 Finding & Fixing NetworkPolicy Blocking Egress

1:31:19 to be in this system. I could check... Yeah. That's just the Equinix Metal CCM, but it's not been renamed. So it's okay. We trust it. It's hard for me to know something that's not supposed to be there when I'm not necessarily sure what's... What's confusing me is that our application, which we can browse to, is still showing a DNS error, which we know not to be true. So either that pod isn't getting DNS or it's just some... Put differently. So your clustered application is not able to reach your database, but your NGINX application was

1:32:06 able to resolve DNS. Right? So... Okay. So we need to take a look at our deployment for clustered because he may have modified it. The deployment or the service could have the wrong ports exposed. But no. Clustered can't even resolve the DNS, but our NGINX pod did. Even if the DNS... Are you sure that it cut... okay. Okay. Let's put it a different way. Let's say that you are a security-conscious customer. Nope. You're trying to block the hell out of everything that you don't need, and then you fucked up when you were trying to do that.

1:32:57 So are we talking about, like, did you go in and modify the capabilities or the permissions for this pod? Oh, depends how... No cluster role binding for it? No. I'm saying, like, did you go and, like, give it a role with no permissions? No. Something is there, but it's just misconfigured. Okay. So something... Good catch. Good catch. Cheeky. I like David. He just deletes stuff. He doesn't care. Very nice. You can't delete. You deserve that one. Yeah. I put a network policy to block traffic for the clustered app on egress. So even if you can

1:34:16 talk to it, it's not able to talk out, so it could not reach Postgres. Very nice. I checked the Cilium network policies, and I didn't check the standard cluster policies. So for the sake of knowledge sharing, there are two things you have to keep in mind. Kubernetes itself has a network policy, an upstream API, part of the Kubernetes code. Right? That network policy allows you to implement network policies on a cluster, but it doesn't come with a controller. So basically, when you install Kubernetes, you have to install something that will make those network policies

1:34:30 NetworkPolicy Explanation

1:34:52 happen. So Calico or Cilium or anything like that. And most CNIs that exist on the market have some sort of integration, and they allow you to either use their own CRDs, like in the case of Cilium, which has CiliumNetworkPolicy, which is an object. You can either use that to define the network policy, or you can use the upstream network policy API, and then Cilium should be able to read that and enforce it, or whatever CNI runs on the cluster. Right? And that's done for portability. If you have a cluster that runs on one cloud provider with network policies, you should

1:35:24 be able to take it out and put it in a different cloud provider. As long as they have a network policy plugin, it should just work. There you go. Alright. Thank you both for ruining my day. Some evil, evil breakages of both clusters there, and thank you for also spending a little bit of your day fixing these clusters with me. It's always super knowledgeable seeing what people do to these clusters and the way that they think and the way that they debug, and it's just an absolute pleasure for you both to join me today. So thank

1:35:32 Wrap Up & Conclusion

1:35:56 you very much for that. Thank you for having us. I hate both those clusters. I gotta say that those are just cruel. Network policies, DNS. You're just bringing out the worst in everything here. Hey. Know that's what I'm doing with DNS. Alright. Thanks to everyone that watched as well. Thank you to the sponsors, Teleport. We used Teleport today to debug this. Make sure you check it out using the link in the video. Do either of you have any last words before we finish up for today? Thank you, dude, for that cluster. Dude, that was a fantastic cluster.

1:36:36 I really appreciate it. We should do this again sometime. Yes. We should come up with more interesting ways to break things. Well, it's funny that you both mentioned that because next week is the start of Klustered Teams, and I would love for you both to bring a team for the team edition if you wish and get involved that way as well. My team versus your team? It's a rematch. I can't promise you that's how the schedule will go, but if you both wanna submit teams, we'll catch up afterwards anyway. I know we're a little bit over time now,

1:37:09 so I'm gonna let you both go. But again, seriously, absolute pleasure. Thank you for joining me, and I hope you both have a wonderful day. Thanks. Thank you, folks.
