Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Diagnose a kubectl "permission denied" error by exporting kubeconfig, checking file permissions, ownership, and timestamps, and tracing it to a recompiled binary.
  2. Track a disappearing deployment to a rogue process, kill the offender, and confirm stable re-creation behavior.
  3. Patch CoreDNS config, then restore kubelet and scheduler components to resolve image swaps, timeouts, and node readiness.

Russell and Marek return to break two clusters. Russell hides a recompiled kubectl, a rogue process killing deployments, a CoreDNS ConfigMap edit, and a kubelet that swaps images. Marek drops the kubelet port with UFW and removes the scheduler manifest.

Chapters


  1. 0:00 <Untitled Chapter 1>
  2. 1:46 Start of Klustered 20
  3. 1:55 Welcome and Introduction
  4. 2:23 Sponsor Thanks (Teleport, Equinix Metal)
  5. 3:11 Guest Introductions (Russell & Marek)
  6. 3:33 Introduction
  7. 4:33 Starting the First Cluster (Russell's)
  8. 4:52 Initial Cluster Access and KubeConfig Setup
  9. 5:53 Debugging Kubectl Permission Denied
  10. 6:30 Export Our Kubeconfig
  11. 14:41 Russell Confirms Recompiled Kubectl
  12. 15:30 Investigating Missing Application Deployment
  13. 16:49 Deployment Disappears After Creation
  14. 19:14 Searching for Rogue Process
  15. 22:12 Checking Worker Nodes for Rogue Process
  16. 26:16 Identifying and Killing "Rawk" Process
  17. 28:41 First Cluster Deployment Fixed
  18. 29:02 Spotting CoreDNS Issue
  19. 30:03 Debugging CoreDNS Logs and Config
  20. 31:07 Fixing CoreDNS ConfigMap
  21. 37:23 Testing the Application (v1 Working)
  22. 37:50 Attempting Application Upgrade (v2)
  23. 38:23 Edit Deployment
  24. 39:10 Debugging V2 Application DNS Issue
  25. 40:33 Postgres Service
  26. 41:13 Discussion on Image Injection and Kubelet
  27. 49:47 Russell Reveals Recompiled Kubelet
  28. 51:32 First Cluster Resolved Recap
  29. 53:34 Starting the Second Cluster (Marek's)
  30. 55:22 Debugging Not Ready Worker Node 1
  31. 57:25 Kubelet Status and API Server Timeout
  32. 1:03:31 Unintentional Kubevip/Kubelet Communication Issue
  33. 1:08:00 Debugging Down Scheduler
  34. 1:11:43 Finding and Restoring Scheduler Manifest
  35. 1:12:22 The Static Manifest
  36. 1:15:09 Waiting for Scheduler and Nodes to Become Ready
  37. 1:16:10 Worker Node 1 Still Not Ready - Debugging Kubelet Again
  38. 1:19:57 Debugging Worker Node 1 CNI Issue (Cilium)
  39. 1:30:00 Checking IP Tables on Worker Node 1
  40. 1:32:01 Rotating Cilium Pods
  41. 1:34:25 Marek Reveals Worker 1 Break (UFW)
  42. 1:39:13 Walkthrough: Marek's Breaks (UFW, Kubelet Rate Limit, Postgres Capabilities)
  43. 1:40:47 Reaction and Wrap Up
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.


1:46 Start of Klustered 20

1:46 No audio. Thank you, Marcos. After a flying start. Alright. Let me start that again. So welcome to the Rawkode Academy. I'm David Flanagan, blah blah blah, also known as Rawkode. This is the first Klustered that we've had in quite a while. So I'm really excited to be back. We've got lots of Klustereds planned over the next twelve weeks. And today is a good special one to kick us back off because we've got two people who have previously participated in Klustered, and it's gonna be a whole lot of fun. So lots of hellos in the chat,

1:55 Welcome and Introduction

2:20 I'll get to them in just a minute. But before we do that, I wanna just say thank you to our sponsors that help make Klustered happen. So first and foremost, Teleport. Teleport is a product that we have been using on Klustered since the very first episode, and they've been supporting this channel for the last few months, and I can't thank them enough. Teleport is a great product. We use it every single week to fix these broken clusters, and I recommend that you all go check it out at the link below. There. I've never done that before.

2:23 Sponsor Thanks (Teleport, Equinix Metal)

2:49 I also wanna thank Equinix Metal. They were my previous employer and, you know, it was great having access to bare metal, but Equinix have very generously decided to continue supporting Klustered by providing all the bare metal clusters, all the bare metal machines for me to install a cluster onto. So thank you very much Equinix Metal, always a pleasure. Alrighty. Let's say hello to our wonderful guests for today. Hello, Russell. Hello, Marek. How are you both? Hey. How are you doing? Good. Just just got my break in on time. Well, I've already broken the show by muting myself at

3:11 Guest Introductions (Russell & Marek)

3:25 the start. So, you know, at least the first thing that went wrong was me, as always. Can we start with a little bit of an introduction? We'll start with you, Russell. Hi. I'm Russell. I'm in Rawkode's Discord quite a lot. I I don't work in tech anymore, so I'm not gonna promote anything or try to recruit anybody. Just just have fun. Alright. Thank you very much. Marek? My name is Marek. I've worked with Kubernetes for a little while. Did a lot with actually, like, the control plane stuff. And, I've been trying to share my knowledge, with some educational

3:33 Introduction

4:10 content here on YouTube. And, yeah, I'm excited to get into this cluster and see how it's broken. I'm sure they're not broken too badly. Right? You've both been very kind to each other as those smirks fill up your faces. You can always tell how devious someone's been by that smirk, that smile that comes on their faces when we're about to get started. All right, lots of hellos. Hello everybody. Let's get onto our first cluster. So before we get started, Marek, you offered to be the first fixer. So I'm gonna pop up my screen share.

4:33 Starting the First Cluster (Russell's)

4:47 I've set the timer to forty six minutes, so that's one minute for us to get things going. So I'm gonna open a session on the Russell control plane. And if you can join a session, please remember not to open a new one and start exporting your kubeconfig, and then we'll take it from there. So best of luck. Okay. One second. How do I join that session? It's not letting me. It's not letting you? So I don't I was gonna flash my IP address, but I guess it doesn't really matter, does it? If you go to

4:52 Initial Cluster Access and KubeConfig Setup

5:30 activity and then click on active sessions, there should be a join button next to my IP address and stuff. Yep. Yep. Sorry. My bad. I forgot. Alright. Hey. You're in. You're in. Let's let's get started with setting up our kubectl here. Let's see. So oh, we have a story. I have heard that the stories are legendary, so we're gonna check that out. Let's see. We have oh, chapters. My goodness. Are these hints or are these the story we should be following along, Russell? What what do want us to do? I'm surprised. Welcome to the Shire still. Think that

5:53 Debugging Kubectl Permission Denied

6:15 might be empty, but yeah. So do do what you want, but as soon as you find a problem, maybe go into chapter one and have a look. Alright. So let let's let's check for our control plane first, Marek. Right? Yeah. Okay. So the let's get the let's export our kubeconfig, which I think our kubeconfig is stored at admin.conf. Okay. So we'll just do this. This is where, as I was reminded, kubeadm likes to deposit its kubeconfig. So Alright. And what command are you gonna use to check for this control plane then? What's your choice?

6:30 Export Our Kubeconfig

7:21 I just kinda start with nodes. One second. kubectl. I do love a good permission denied error. I always Yeah. Right off the bat. Okay. Was that I don't have permission to kubectl or to the get nodes? That is the binary. Linux error. Right? Yeah. Which would just Kevin is watching with snacks. Hopefully, you've got a beer there too, Kevin. If not Where? Maybe next time. Yeah. W H I C H. Oh. So I have root here. One second. Let me just do oh. I'll go to that. Oh, yeah. That was actually really that's not what I meant.

8:51 That is that is not what I meant. And if you control c that, we're probably gonna be stuck with a rather interesting terminal, I think. Right? Oh. Oh, was I not supposed to control c? No. If you control c oh, no. You're okay. You're okay. Normally, you get stuck in, like, some sort of weird Unicode binary blurb and you have to reset the session, but I think you're alright. How do I stop it? It has stopped. You know, c l s? I just typed it. No. I'm gonna reconnect. I just refreshed the page. It'll be fine.

9:29 Alright. Don't do that. That was dumb. Moving on. Alright. So why was I getting permission denied? I was going to actually check its permission, its file permission. That's what I wanted to do. Not Yeah. Cat may not be the best command for that. Maybe try something else. Yep. I haven't rejoined yet. So Alright. I'll check the permissions while you're rejoining. Make myself useful. Those look alright, actually. So was it binary? Was it a Linux file permission? Like, it seems weird, especially being root that I mean, unless it was owned by somebody else, which it's owned by root,

10:26 and it has read, write and execute. Actually, could you try running kubectl with, like, the full path and not like, did he do some weird alias on kubectl, which would hurt my feelings? Nope. You're definitely getting permission denied. But what could be causing that? Any thoughts? Keen observation from James in the the chat there, I've gotta say. That time stamp looks a little suspicious. It does, doesn't it? I've actually never run into this. Alright. Let's let's talk about this. So there I think there are few ways that this could be achieved, and then maybe we'll read chapter one in

11:21 a minute if I'm wrong. But we've checked out the base permissions. Right? The octal permissions, and those look fine. We have the executable bit. So the other thing that it could be is potentially an attribute on the file that doesn't allow it to be executed. Although, I'm not sure what attribute that would be. No. That does not appear to be the case. So what else could cause a permission denied would be potentially that Russell has just compiled his own kubectl, which I think would be particularly devious. But I don't put it past him, especially

12:04 considering our home directory here has a Go directory, doesn't it? I'm I think my Internet's having trouble. One second. Let me let me check. You're coming through just fine. I think that okay. I don't know. I'm, I'm breaking up. You're okay on this side? I can't hear anything that you're saying if you guys are saying something. Russell, can you hear me okay? Yeah. I can hear you fine. So can you hear me, Marek? Okay. Yeah. So I heard you, Russell. But you didn't hear me? No? I I cannot One o'clock, two o'clock, three o'clock, rock. No? You're not Oh, I got three o'clock.

13:04 Yeah. I'm not sure what that could be. This is a part of my break. Alright. Well, while Marek works, why don't we entertain ourselves with the story? So we go to chapter one. Alright. So the I I couldn't hear you guys, but I don't know if you can hear me. But he could have I think what you guys were saying is his own kubectl. So I I think that we could look at at the history on that kubectl and if it's the actual right one, or we could read his story. Well, if I mean, you've got a hypothesis.

13:52 I I'm assuming you can hear me, but I don't know. We've got a hypothesis. Right? We think potentially he swapped out the kubectl binary. I'm kind of inclined to believe that he has. So I would suggest that why don't we just reinstall the kubectl binary with that? Although Russell has said in the chat that he only attacked the cluster. That was about the dossing somebody. Not about. Don't trip me up, Russell. This is hard on him. Right. Marek, you there? Nope. Alright. Oh, he's away. Hopefully, hopefully he comes back. Otherwise, I'm I'm fixing that cluster.
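
The manual checks described above (octal mode, owner, executable bit, modification time) can be sketched in Python. This is a minimal illustration, not what was typed on stream: `describe_executable` is a hypothetical helper, and the demo runs against a temporary stand-in file rather than the real kubectl binary.

```python
import os
import stat
import tempfile


def describe_executable(path):
    """Summarise the file metadata that was inspected by hand in the session."""
    info = os.stat(path)
    return {
        "mode": oct(stat.S_IMODE(info.st_mode)),               # permission bits, e.g. '0o755'
        "owner_uid": info.st_uid,                               # 0 means root
        "owner_can_execute": bool(info.st_mode & stat.S_IXUSR),
        "modified": info.st_mtime,                              # epoch seconds; compare with install time
    }


if __name__ == "__main__":
    # Hypothetical stand-in file; not the real kubectl binary.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        demo_path = handle.name
    os.chmod(demo_path, 0o755)
    print(describe_executable(demo_path))
    os.remove(demo_path)
```

As the session shows, clean metadata does not prove the binary is genuine: a replaced executable can carry perfectly normal permissions, which is why the suspicious timestamp ended up being the more useful clue.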

14:41 Russell Confirms Recompiled Kubectl

14:44 Let's see if we got a control plane. So go on. Tell us, Russell, did you recompile your own kubectl? I did indeed. I compile I compiled it. I took a look at it, did the modifications myself, looked at the error messages. I was like, I could just, yeah, I could just echo that out. And then I've to to make it look more realistic, what what I didn't do is change the date. I should change the date. To make it look more realistic, I I embedded the actual kubectl in it as a as a

15:11 resource, as in the binary sort of a binary. However, if you look at the the hints, I kind of show you where it is. So I actually copied it before I put my owner. However, you you install the new ones up. That's how easy it is. Alright. So I'm not gonna read this out loud for everybody. I'll just leave it there for a moment. Yeah. I have to apologize to anybody that likes Lord of the Rings. Yeah. I might have I might have desecrated it. Sorry. Alright. Okay. So we got a plot. We got the hint, which I'm assuming is just

15:30 Investigating Missing Application Deployment

15:52 gonna say. Yeah. Okay. Cool. So when we run k get pods, we have a Postgres, but we don't have our Klustered application. So I'm gonna check for my deployment. Oh, that's mean. Now I have to get the YAML. Alright. So all the automation for this looks very clustered. That's what I forgot to do. I forgot to copy that one for you. That's it. Regular Kubernetes deployment. Raw. Copy. Copy. Copy. Oh, that's pretty much same thing. Oh, wrong one. kubectl apply -f. Oh, sneaky, sneaky, sneaky. Alright. Let's check for the obvious things here. Mutating webhook configurations.

16:49 Deployment Disappears After Creation

17:12 Alright. Yeah. Marek has just reached the it says his internet is having problems. So I'm just gonna keep debugging the cluster for now and hopefully Marek can join us again. Okay. So cool effect. Like, I applied the deployment YAML. It appears to work. Oh, that is a very restricted view of my cluster. Maybe not actually because oh, yeah. Okay. Because this is the rail cluster. Yeah. That's deployment, isn't it? So, yeah, that's all it should have. Yeah. I was just having a moment. Okay. So where's my deployment? So it gets accepted. Okay. Deployment. Deployments. Let's say I pop over one more of

18:31 these. I'm gonna have to get a URL again or copy it from here. I'm not gonna have my export. This is what I get for trying to watch it. Okay. Wanna see if it flashes in for a second. Or if it's just not there at all. Let's go back to our watch. There we go. Okay. So something it's getting created, but it's getting deleted pretty quickly. So I suspect that this could be the story of the rogue process. I'll do a c m f in here. That has Russell's signature on it. You could be hiding something

19:14 Searching for Rogue Process

19:49 on another node as well. Looks alright. That's that w w. Okay. No weird triplet params. Don't think so. Cilium Teleport. I assume that's okay. I'll put it in the suspect list. Control plane, etcd, and then kernel processes. Alright. Do you mind if I ask you one question before I go down some rabbit hole here? Fire away. No. Is there is there a process running on one of the worker nodes that I should go be looking for, or am I just going down the wrong path? Yes. Yes. Yes. Yes. There is a process.

21:15 Alright. Guys, hear me? We can't hear you. Welcome back. I'm sorry. I think we're having pretty bad snowstorms in my area of the world, so I don't know. Hopefully, that doesn't happen again. Alright. I'm logging back in. You guys can continue what you are, and I'll I'll catch up. So Okay. So I'll tell you what I I did is I ran a watch on the deployment and recreated it because our deployment was missing and it got deleted pretty quickly. So I'm assuming that there was a rogue process running on a control plane, but I couldn't

21:49 see anything. And I asked Russell if I should be checking the other nodes, and he said, yes. So I Is there like a rogue container running? Well, if I can I'm gonna just see if I can find something in ps aux at the moment. So you can join the session on the worker one if you want and start looking at that process list, or you could jump on to worker two and see if you could see anything. And then I'll jump on to worker two. And then I'll hop over once we find that. It would be really speed it up.

22:12 Checking Worker Nodes for Rogue Process

22:22 You've already guessed it, jump speed it up. You probably wanna jump on a worker two. Alright. Well, I'm gonna open a new session because I don't wanna flash your IP address to the world, Marek. So if you Alright. Wanna join this one. I'll join I'll join your session. Alright. Alright. What looks suspicious? There's a cron process running. I wonder if we run crontab -e. No. I should really put this to a pager so we can both kinda follow along. Right? So I'm gonna assume he's not been sneaky enough to name the binary starting with a a

23:38 left square bracket. Although, I think we'd still see a forward slash at the start. So I'm gonna trust the all left square brackets are actual kernel processes. I'm pretty sure I've tried that before. So we got a comment there from Kevin. There's an SSH process. There is it. Is there? That correct? Yeah. Is that not what a normal SSHD command line looks like? I mean, when is that? I I don't think that's suspicious. What what do you think? No? Absolutely not. I don't know. I've never grabbed one, so I wanna know. Yeah. I I I think

24:54 it's okay. I'm not I'm not terribly worried about that. We do have the cron daemon and the at daemon, so I'm not ruling those out. I'm not sure if I'd normally yeah. You would see those normally, would you? I've just got my suspicious head on that. Oh, wait. Wait. Wait. If it's if question. When you were looking at it, was it the control plane that was killing the pod, or was the pod just dying and then the control plane restarting it over and over and over again? Control plane. So if it was the control plane, he'd

25:29 have to have an active connection to the control plane. Right? I I did consider changing the bind address on the control plane, but then, you know, we still wanna deploy workloads to the worker. I mean, we I mean, we could we could just get rid of worker two and have a one worker set up. That's a good idea if that's what the path you're thinking. But Well, I was just trying to limit try to understand how it might be working. Because if it was, it would have to be connecting through via some some some connection. So then we could do, like,

26:00 a port scan. Yeah. But I don't wanna let this sneaky guy away with whatever this is. It's your call, Marek. What do you wanna do? I'm just looking for a second. Apparently, the chat are finding something called Rawk. Did I met oh, there we go. I did I don't see that. Where is it? Where is it? I don't see it either. I'm looking. Can we just grep for Rawk? Yeah. Alright. Do you wanna take what's happening? Suspicious. Yeah. Alright. So let's see. Alright. Good catch, audience. Thank you. Can I just add? I the one part of my break that didn't

26:16 Identifying and Killing "Rawk" Process

26:54 work because moving between kernels was I was recompiling ps. So if that if I if they got that working, you'd never have seen it. So be thankful you're on Ubuntu. It's like a wolf in ps clothing. Alright. So we found this. What do we wanna do? Do we just wanna kill it? And then let's see. How did he I'm not sure if I'm connected. I'm I'm trying to type. It's not typing. Oh, there you go. Oh, there we go. Sorry. You think he's created a unit file for it? I don't know. I I thought I'd check.

27:56 Oh, I guess not. Yeah. I say we just go We could just kill it. Nine sucker that. Yeah. It not typing again? Let me do it. I was trying to do just kill -9 on the ID. It is not coming back, so that's a good sign. That's why I was checking the like, if you made a systemd file for it to make it just come back. Alright. Let's jump back onto our control plane. I've got a curl command in history that applies my deployment, and it does appear to persist this time. So I'll let you take over from here.
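
The process hunt above (scan the process list, skip the square-bracketed kernel threads, then search for a name) can be sketched as a small filter. This is an illustration only: `suspicious_processes` is a hypothetical helper, and the sample text, including the `/usr/local/bin/rawk --watch` command line, is invented to mimic the shape of `ps aux` output rather than captured from the stream.

```python
def suspicious_processes(ps_output, needle):
    """Return (pid, command) pairs whose command contains `needle`.

    Commands wrapped in square brackets are skipped, because ps shows
    kernel threads that way. As noted in the session, a sneaky binary
    could imitate that naming, so this is a heuristic only.
    """
    matches = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # `ps aux` has 11 columns; the last is COMMAND
        if len(fields) < 11:
            continue
        pid, command = int(fields[1]), fields[10]
        if command.startswith("[") and command.endswith("]"):
            continue
        if needle.lower() in command.lower():
            matches.append((pid, command))
    return matches


# Hypothetical sample shaped like `ps aux` output; not captured from the stream.
SAMPLE = """\
USER  PID %CPU %MEM  VSZ  RSS TTY STAT START TIME COMMAND
root    2  0.0  0.0    0    0 ?   S    10:00 0:00 [kthreadd]
root  812  0.0  0.1 1234  567 ?   Ss   10:00 0:00 /usr/sbin/cron -f
root 4242  0.3  0.2 9876 5432 ?   Sl   10:05 0:07 /usr/local/bin/rawk --watch
"""

print(suspicious_processes(SAMPLE, "rawk"))
```

A filter like this only helps if `ps` itself can be trusted, which is exactly the assumption Russell says he tried, and failed, to break.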

28:41 First Cluster Deployment Fixed

28:55 Alright. We have nodes that look like they're running. We have let's check our deployment. So we can check the this is just the actually, let's check all of it just in case there's other things going on in so we have an ambassador failure. I think that's actually normal, though. That is normal because they haven't fixed it yet. So yeah. Just ignore just ignore We do have a CoreDNS issue. This also, I think, is normal, this VIP, but the CoreDNS is not. So that's probably going to be an issue. Let's see. Is there anything else at this

29:02 Spotting CoreDNS Issue

29:45 point? Everything else looks let's see. So let's check out CoreDNS. Yeah. It's running in kube-system. Alright. And we want we'll just get the logs of this one. Created seventy minutes ago. Yeah. That's that's suspicious. He probably changed the image on it. So k. Logs. Let's see. I don't need the pod in there. I Yeah. You could need that. Invalid duration five. Alright. Let's go ahead and check out the CoreDNS deployment, see what if if there's anything that looks weird in that. We're just going to call maybe a dis describe. Actually, let's do get

31:07 Fixing CoreDNS ConfigMap

31:22 the alignment. I mean, I would just jump straight in there with kubectl get edit, but then I'm a bit of a wildcard. Alright. Let's do it. Let's edit. It's not like we can make it any more broken. Right? Oh, no. Don't don't say that because I can. Absolutely. I can't type. Typing on a livestream is always the the funniest thing ever. I also I've just started like, in the last month, I switched to the Moonlander keyboard, and that's been an experience. Yeah. You'll need to remove deployment or deployment dot apps. Yeah. I gave away my Moonlander. I gave

32:17 up trying to type with it. Oh, did you? Yeah. Alright. Let's see. What does this deployment look like? We have app, kube-dns, name, namespace. Where is the image file I'm interested in this image? Oh, wait. Doesn't there have to be an image file? Does it Yes. Go ahead. Normal? Yeah. You need to scroll down a wee bit further to the spec template spec. Oh, okay. Sorry. I thought I it cuts off halfway on my screen. I Just resize the window, like, a millimeter and it fixes the scroll. It's a Teleport bug. Oh. Oh, okay. There we go.

33:02 It was all just okay. I got it. Thanks. Alright. So let's see. Where is this image? So the image looks okay. Yeah. It does. What's failure threshold five? I need to pull up maybe an a standard one. I'm not sure if I see anything that looks suspicious. So I'll give you a better context here. So this file might be tampered with. I think the age of it definitely means that it could be. However, the there's a ConfigMap in kube-system there. This one? Oh, you wanna just check out the ConfigMap? Although, it does also look like he's passing

33:44 in his own Corefile. Good catch. I never saw that. So where's that mount point? Do you wanna search for that? It should be Okay. Towards the bottom of this file. So let's see. The mount point is right here at /etc/coredns. Okay. So if we go So Then my Teleport is doing that thing. Okay. Oh, when you resize, it does it to me. Alright. Okay. I'll just leave it. I just will realize that it's okay. So yeah. Oh, sorry. I won't resize it again. You resize for the stream, and then I won't touch it.

34:26 Alright. How's your scroll right now? Is that broken? It's broken, but I can deal with it if I just understand that I can scroll down more. So Right. Okay. So we've got a volume mount at /etc/coredns. It's coming from the config-volume. So if we search for config-volume And that's the Corefile. This is coming from a ConfigMap called coredns. So let's do I'm just gonna type since you've kinda got half broken screen, but just No. No. That's good. Just edit the yeah. This looks like your error message. Right? Yeah.
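
The "invalid duration" message from the CoreDNS logs points at a value missing its unit. Below is a minimal sketch of that kind of check, assuming the value follows Go-style duration syntax (CoreDNS is written in Go). The regex is a simplified approximation, not a port of Go's parser, and `is_valid_duration` is a hypothetical helper.

```python
import re

# Simplified approximation of Go-style durations: one or more <number><unit>
# pairs. Assumption: real parsers accept more (signs, a bare "0", ".5s");
# those cases are deliberately left out of this sketch.
_DURATION = re.compile(r"^(\d+(\.\d+)?(ns|us|µs|ms|s|m|h))+$")


def is_valid_duration(value):
    """Return True when `value` looks like a duration with an explicit unit."""
    return bool(_DURATION.match(value))


for candidate in ("5", "5s", "1h30m"):
    print(candidate, is_valid_duration(candidate))
```

Under this assumption a bare `5` is rejected while `5s` passes, which matches the direction of the fix discussed in the session.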

35:04 The cancel five. Maybe we just delete that line. I think it should be a duration, so five s. Oh. Okay. Should that fix it? We could we can check. I don't know if you'll need to restart it. I think it was was it crash loop back off, so it should restart anyways? So we shouldn't have to delete it. I think it's still called kube-dns. Right? Yeah. Damn it. Well, we can get the logs and see if that changed it. I don't know the CoreDNS configuration that well. Those those things are just all broken,

35:57 Russell. He just rolled his face on the keyboard in the So if he's added these just to make me think, oh, Adnan s will be okay, but really it does nothing. I was gonna say, I think we just delete them. I can pull up a standard CoreDNS file. Let's go log again. What's it complain about now? I mean, should this be a string? I don't know. I don't configure CoreDNS ever. I I don't either. So, like It doesn't wanna put the health lameduck one. Right? So although it says string to I. Right? So

36:49 s t o I is string to integer. So that probably should be a 30 in that case. Right. A time to live. Makes sense. I'm gonna delete it one more time, and if I if I see crash loop back off or or failing, I'm gonna I'm gonna be annoyed. Oh, there we go. Okay. So we found a we found a logline in CoreDNS. Someone messed with the config. We have that fixed. So just get all in the default system here and see what it looks like. Oh, let's see. Everything looks to be running. Can we

37:23 Testing the Application (v1 Working)

37:28 access this now? We can. So we can use the application tab on Teleport where I have exposed Klustered. And let's see. Alright. We have the watch tab which is version one. So now we have to try and upgrade the application to v two. Alright. So let's just run an update on the let's just edit the deployment. I don't know. Good call. Fine. I've had comments on Klustered where people call me careless for the way that I work with the cluster. So I mean, let's be clear. I probably wouldn't do this if this was a production cluster.

37:50 Attempting Application Upgrade (v2)

38:18 But, yeah, that's a little bit different. So we're gonna edit the deployment of Klustered. A moment of truth for Russell. Is this break gonna work? Sorry. I'm working on a little tiny screen here. Okay. Let me do the version for you. Oh, sorry. I'm I'm just making it worse now. I really need to think of a way to search for that. There we go. Yeah. That's yeah. Thank you. Alright. Yeah. It looks like it's good. Moment of truth processing. Try accessing it again. Let's let's access it. We have a DNS issue. Success. Alright. What do you wanna look at?

39:10 Debugging V2 Application DNS Issue

39:31 So we have a DNS issue with possibly our database. Upgrading it to version two is what caused that. One second. Let's see if we have anything that looks wrong here. So you know what? DNS is not my greatest strong suit. So let's see. We could create a sidecar pod and try to start pinging around, but that seems like a lot of effort. Yeah. Why don't we try to scrape why don't you do get service and describe it and see if we have endpoint as a starting point, and then maybe we'll get Sounds good. So we have the postgres service there.

40:33 Postgres Service

40:38 So you'll want to describe it and see if that is even working. Although, I suspect it is from the first image. Well, I was gonna say it worked the first time which is the weird thing is I would expect the upgrade to the version number to not have changed the Postgres back end. Right? Because it's its own deployment as a stateful set. Sure. Shall we exec into the Klustered pod and do some debugging? If you got I don't know of anything else to like, I don't know of another way. So I mean, he could've

41:13 Discussion on Image Injection and Kubelet

41:13 forked the Klustered repository, rebuilt his own Klustered image, and injected that into the cluster. But He went through I I don't think Russell would do that. That's amazing if you did that. Hats off. So, yeah, there's a few things I'm curious about. I I think we should get inside that pod. Which one? The the Klustered? Yeah. Yeah. I I wanna run some DNS commands and and see if we can actually Yeah. If we can get an IP address. Go for it. Alright. Okay. It's not typing for me again. I'm trying to just do the exec.

41:53 I think sometimes when I highlight it, it gets weird. Alright. So we got did I not leave myself? That's I I I can't leave a shell on these images. Maybe these aren't my images after all. Okay. But he would've had to upload it to your like, he would've it it's pulling from a public repository. Right? He can't have screwed with that image. Wait. There are few ways to screw with that image. Describe the the the deployment again. It's coming from your repository unless we hacked your repository. Look at the image ID. Oh, okay. So what we need to do

42:47 is just nuke the local image. Right? So what's the pull policy on this? Just set it to always. Yeah. Let's do edit deployment Klustered. Pull policy. That is on always. Okay. He's he's injected the image another way. I can only think of one way. I don't wanna jump to conclusions just yet. We've got seven Okay. Let's see. How would you inject your image? I mean, like, you can hijack your DNS so that it's not going there. That would be one way. I would hope it failed the TLS negotiation. Yeah. It could've So that's fucked with containerd.

43:39 To pull from, like, a local repository first? Yeah. containerd has, like, this alias support where you can tell it to you know, instead of going to Docker Hub, go to the public GCR or something like that. I've never updated that. Let's check that out. I don't know where that is. So So what worker is this on? This is Worker two. Alright. Yeah. There we go. Worker two. So I'm pretty sure there's a dump config on containerd. Oh, I just ran another containerd. It's not respecting my signal. Alright. Let's get another one.

44:37 Alright. New session on worker two? Yeah. Alright. So let's run ps. grep containerd. So that's my rogue process. So I guess I can kill that. So if let's see. It's not showing me. I'm sure I told it to show me everything. Oh, wait. I put one too many w's. That doesn't help. Okay. containerd help. Oh, now I've got a dodgy session. Config help dump. Okay. So there is he's not doing this to the containerd. I don't think. There was no admission controllers, at least dynamic ones. We then checked the control plane for static

46:06 admission controllers, but then I'm not even sure if there is one that substitutes the images. Oh, yeah. Just it would have to also be pretty sophisticated because it had to pick up that it was version two that it needed to substitute for because version one worked. I guess maybe he just left version one working, and we were on a bad container anyways. Never mind. That falls apart. Maybe he can pound containerd? Wait. Wait. Let's go back to his story. Yeah. Chapter should we be looking at? Chapter four. Oh, this is this is the the final

46:43 break. Take the And so there's that trust issue, which is so I think we are. Oh. Maybe he has done some DNS trickery, but he's put it on an unsecure mode of some sort. Right? Wait. What about what about network policy? To redirect traffic? Yeah. Maybe. I I I I've not heard of it used in that way, but I'm not gonna write it off. I I don't know. I was just thinking possibly on I I I'm not Okay. Seeing anything. I don't think that I mean, could this be just be as hacky as post? No.

47:56 Alright. We've got two and a half minutes. Do we wanna look at the hint? Yeah. Let's look at the hint. I don't I think that's just telling us that we can't trust the image, which we know. Yeah. So there's a good comment there from our loop in the chat saying that we should check for the DNS policy on our deployment, which is a really good idea. Yeah. Let's do that. Edit deployment. There's too many ways to break a Kubernetes. I will say this is incredibly mean what I've done. Alright. This is a ClusterFirst policy, which

48:55 I think is all right. Now if he was gonna do it with the cluster DNS, we would be able to see it in here. And there's nothing there. Oh, I hate you so much, Russell. Yeah. That's the only thing I wanted from this. Because it is so mean. Do you wanna just talk through how it could be done, instead of going and having a look at things and wasting time? I don't waste time. I'm fixing a cluster. Okay. Are we close? Are we close? You've obviously found the fact that I'm changing the image. It's just, how does

49:40 an image get onto a pod? Not a pod. A node, I guess. Oh, can we check the kubeconfig or the kubelet config? Or am I completely off in checking that? Okay. Super hot. Super hot. Super warm? Okay. Because the kubelet is the one that's going to be responsible for telling the OCI... I think it tells containerd to pull the image. He's not modified the cluster DNS, which I thought... So if you were there, I would have said hot. But we're at time, and what I've done is, like, really, really

49:47 Russell Reveals Recompiled Kubelet

50:30 mean. You know, it's that... Uh-huh. We noticed Go. Yeah. It wasn't just kubectl that I built. Did you actually compile the kubelet? Yes. Specifically targeting Klustered v2. You found the trick, like, really, really quickly. So if you look at your actual page, it'll give you a hint in a meta tag. I was hoping you would go have a look at the... oh, it's broke DNS again somehow. So if you view the source... obviously, you know it's not actually your container.

51:12 You compiled half of Kubernetes for this. You realize? Pretty much. And on the Ubuntu system, because I did it on my laptop, copied it over, and it didn't work. It was like, I'm on a different kernel version. Yeah. And, again, I did a backup, but, as I said, it was incredibly mean what I did. Close all tabs. Russell is gone. Yep. Bye bye, Russell. Bye. Russell, I don't even want you to touch my cluster because, like, I didn't do anything half as good. That was fantastic. No. That

51:32 First Cluster Resolved Recap

51:54 was mean. No. No. I think... yeah. I agree with David. That was mean. Yeah. That was mean. More for David than for you. That was good. I will never... you know what? Honestly, it's probably a good moral of the story: don't trust your binaries. Well, that's one of those things. You know, I was talking on the Discord about how I'd really love, like, a cluster check command. And one of the things we would have to do is just do those checks on all of the compiled binaries for the

52:24 first. Because, you know, after the first binary, we should have just replaced all the binaries. Just been like, no. We're done. Well, yeah, we had the hint. We'd seen the Go folder. Like, we should have known. I thought he would've stopped at kubectl. Alright. Yeah. Good break, Russell. That was good fun. Hopefully, yeah. I learned how, sort of, like, pods get deployed. I was like, I kind of think I know what is going on: it goes via the API server into the kubelet. Kubelet tells containerd, and

52:59 I was like, kubelet, I can attack it there and change it before it gets to containerd. So I learned a little bit about how the overall ecosystem works from it. Yeah. I knew you wouldn't go down a containerd route, in my head, because I'm like, we've seen that a couple of times now in Klustered. I didn't think you would go down that because you'd know how it would look. But to compile the kubelet was particularly harsh. Good job. It's easier than you think. You can get the source code from GitHub for the actual release.
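The "don't trust your binaries" moral from this exchange can be checked mechanically: hash each installed binary and compare it with the digest published alongside the official release. What follows is only a minimal sketch of that idea, not the "cluster check command" mentioned on stream. The function names are hypothetical, and the path and digest in the usage note are placeholders.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path: Path, expected_hex: str) -> bool:
    """True when the file's digest equals the published digest."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)
```

A call might look like `matches_release(Path("/usr/bin/kubelet"), published_digest)`, where `published_digest` is whatever value the release you installed actually ships with. A mismatch does not say what was changed, only that the file is not the released one.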

53:27 And it's got all the makefiles set, and everything's good for you. So you can get it done within, like, a few minutes. Well, now the boot is on the other foot. Yep. Now I'm gonna warn you, Russell. One of these breaks is unintentional. I accidentally broke Teleport, and one of the nodes just never came back. And I never touched that node. So we're gonna have to debug that one together. So fair warning on that one. Alright. I have opened the session on control plane one for the cluster. Russell, if you could please join and start exporting

53:34 Starting the Second Cluster (Marek's)

54:08 your kubeconfig. Your forty-five minutes start in forty-five minutes. Hang on a minute. So where are the sessions at? I can't see it on the screen. And then active sessions. Activity. Ah, sorry. Oh, nervous now. Oh, you're in. Awesome. Yep. Zero trust. Yep. That's the only way this is. Right. So, oh, yeah. Russell, just do me a favor and resize your screen a bunch to... yeah. I've heard that he likes that. Apparently, it's just because... Oh, it's okay. xterm.js doesn't like it when the two terminals are different sizes, and it tries to accommodate it.

55:12 I'm gonna have to, like, set out a size for my window and then ask everybody to resize their window to be exactly the same number of pixels. Or we could always use the terminal CLI, I guess. If I do this again, I think I'll try to set up the terminal CLI. I think that would just be a better experience. Yeah. The CLI does work well. Alright. We have a working control plane and two not-ready worker nodes. So one of them is unintentionally broken, and one of them is actually broken. Is that what we're saying? Yes. Yes.

55:22 Debugging Not Ready Worker Node 1

55:50 And he's not telling you which. Nope. That's fair enough. Actually, I didn't tell you this. There's a setup file that would have aliased that out for you. So the scheduler is a real break. Postgres is down. Ambassador, I'm assuming, is just... Yep. Some collateral damage. Yep. Okay. Cool. Same with kube-vip. Right. Okay. Right. So why would the nodes be offline? What is it called? K twenty dash clavin dash worker dash one. I can copy and paste it if you want. Alright. Nope. It's not on it. Whoops. Right there, brother.

57:07 Did you do it? I did do it. As always, there's a rogue key, I think, from us typing over each other. So just knock that off the end. CPU. Kubelet stopped posting node status. Oh, where'd you see that? If you scroll up to the event tags. Ah, yeah. Okay. That makes... right. My screen's only partially full. So if I re... no. Even the browser screen's just, like, squished. It's, like, weird. Yeah. That's that. Sorry. One of those weird Teleport bugs? I didn't understand what you were talking about before, when you were saying, like, resize the screen and all that.

57:25 Kubelet Status and API Server Timeout

57:56 What size is your monitor? 4K. Oh, yeah. Okay. Oh, I think we're gonna be too far apart to get it close. No. It's alright. I can deal with this. I'll just resize my window. Oh, no. I can't. I'll just leave it as it is. You can resize your window, and then I'll resize mine, and maybe it'll magically work. There is, like, a certain number of times you can do it, and it will start working. Right. You just fix yours now. I'll leave mine as it is. Are you good? Yeah. I think so. Right. Alright. Don't touch

58:29 anything except maybe type in a kubectl command. Alright. So do you wanna jump on to the workers and check the kubelet out? I'm still trying to see the message that says the kubelet is down. I can't see that in my... It's underneath lease, and then there's conditions. I don't know if you've got it on your screen. Uh-huh. These ones. Ah. Oh, yeah. Okay. I'm blind. Right. Okay. So we need to go on to... Yes. Go to worker one first. Alright. Session started. Give my Internet a minute. Alright. Alright. There's definitely some errors there. Error getting node,

59:50 worker one not found. Yet the kubelet is up and running. What's the command to find out the... is it? So it's not... What do you see? No. I think I remember from one of your shows: if you change the name of the host and then restart the kubelet and then change it back, it messes up the kubelet. I wonder if he's done that. So you want to restart the kubelet again? No. I'll look first. Oh. Yeah. Have you resized again? No. I've lost a lot of text. I'm now

1:01:07 down to, like, a quarter of what I was before. Welcome to my pain. And then, like, it will stop letting me type for a little while because I think it puts the cursor somewhere. I don't know. I was having troubles. I'm using the terminal next time I do this. So it is timing out, waiting for something. So is that the IP of the API server, or the machine for the API server? That 91. Let's have a look. Is it 91? Yeah. That is the BGP address of our control plane. So it looks like he's blocking. So I'm gonna ask Marek a question because

1:02:12 I don't want us to go down a rabbit hole here. Is this intentional, Marek, or is this not intentional? You're muted. You're still muted. The guest has muted himself. Nope. He's just hit the hang up button. Well, I deserve that for the kubelet, so fair enough. Alright. So... See, because if this is not intentional, I would get us out of it very quickly, because it could just be that kube-vip is acting up and not advertising our address anymore. But if it is intentional, then it's gonna be some networking thing that we have to start debugging. So you could try curling HTTPS

1:03:02 on the IP address and see what you get back. I'd say that's a good... or a ping. Yeah. The space at the start. It's a double top. Double top. Yeah. There's no ICMP ping here, so maybe it isn't functional. Can I Control-C? Will that just break everything? No. You Control-C. Hey. Welcome back, Marek. I said I'm gonna ask Marek a question, and then everything died. Okay. Is this an intentional break, or should I quickly get around this BGP issue? Wait. What's the issue? Worker one can't reach the control plane on this BGP IP. Yeah. This is the unintentional break. Okay. So

1:03:31 Unintentional Kubevip/Kubelet Communication Issue

1:03:56 that's alright. I think that's because kube-vip is broken. So I am gonna quickly fix this. We have the other IP address here. That's why I warned you that it was unintentional. Yeah. That's alright. Vim, kubelet. What IP address is this using? Yeah. Okay. Now let's just swap this over. And this might actually be affecting the other node as well. Yeah. There's other breaks to it though. So yeah. Yeah. I think kube-vip has had a little bit of a moment itself. Sorry about that. I think I helped it along in its issue. Yeah.

1:04:46 I gave it a little kick in the pants. So if you guys wanna know the lesson, the moral of the story here is: do not restart your Klustered control plane an hour before Klustered goes live, because... Annoyingly, I'm gonna have to fix kube-vip, because we don't have the IP address at our control plane. So I'm super sorry, Russell. That's alright. Alright. Let's get this fixed. After what I did with that kubelet... yeah. Just don't apologize to me. Alright. Come here, kube-vip. What is your problem? After what you did with the kubelet, I

1:05:45 have a ton of respect for you. So that was awesome. I'm not sure David thinks that. Thank you. I mean, I think he made his feelings very clear. Am I disconnected? Are you... I'm still here. He's trying not to swear. I'm not trying not to swear. I'm pretty sure this is a bug that kube-vip introduced recently. Yeah. The bind address. And I did look for 221 in here and then find it. So interestingly enough, this bug actually was always here. It just caused it not to come back okay. kube-vip was always complaining about this

1:06:39 bind address, because I actually wasted about an hour and a half of my time digging into this. Yeah. There's a file you can pass. Yeah. I've never used kube-vip before, though. There we go. What we need is this. And then we modify our DaemonSet, and I'll just add this flag very, very quickly. So why does this not happen on the other cluster? Did you restart the control plane? No. So I bet you his kube-vip is in a bad state. The cluster just isn't broken until you restart it. Right. Alright. That should fix it.

1:07:44 We should see pending. Oh, yeah. Okay. So you're gonna have to fix the scheduler now, Russell. So can I go back onto the nodes? Or I'll look at the scheduler first, but then are the nodes gonna be working now? Not until we get... oh, yes. Okay. So I can just cheat and schedule the kube-vip DaemonSet if you want. Although, that shouldn't actually go through a scheduler. Right? I thought you were asking me. I don't know. I don't use kube-vip. Well, wait. How is it being run? Is it run as a deployment, or is it

1:08:00 Debugging Down Scheduler

1:08:29 run as a... It's a DaemonSet. If it's a DaemonSet, then yeah. But it has no node. And we need a... That's interesting. Alright. Let's cheat. Oh, we can't cheat because he was there. Alright. Let's fix the scheduler, Russell. Okay. Back to me. Alright. So I can't use my shot because it's gonna... Namespace. Thank you. See, I'm not that good at Kubernetes, really. I'm just good at breaking things. I'm not good at anything. You're also good at breaking things. Oh, okay. But we've a failed container. Yeah. I think you're gonna have to get the logs for that pod.

1:09:54 Okay. This is my greatest letdown. This break was supposed to rival Russell's, but didn't. Namespace. Yeah. Have you tried the tool kubie? It's really cool for switching context and namespace. Oh, look. I have a scheduler. I have a scheduler. I have a scheduler. I don't think that's what it normally puts out, so I'm assuming that somebody has gone and changed the static manifest. So I have a trick, if he's done this on the... if it works. Let me try and find it. Well, it might have ripped. So... We are always ripped on Klustered. You're gonna use the Vim backup?

1:11:07 Yeah. I left the backup. I'm never that mean. If I change a file, it's backed up in here. Save that. It's all good. Thank you. So he's changed the scheduler. He has edited some things, which I assume... I'm not sure if we can get back to that. Right. So... Yeah. Three and four look pretty ominous. Yeah. So the real one is in manifests. The one in just Kubernetes, I'm assuming, is the backup. So we're gonna have a look at the manifest first. That's its authentication to the cluster. That is a kubeconfig.
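Since a backup copy of the manifest was left next to the live one, the two can be compared directly to see exactly which lines were tampered with. The sketch below shows one way to do that with the standard library. The function names are hypothetical, and any manifest text passed in would come from the files on the node, which the transcript does not show.

```python
import difflib


def manifest_diff(live: str, backup: str) -> list[str]:
    """Return unified-diff lines going from the backup to the live manifest."""
    return list(
        difflib.unified_diff(
            backup.splitlines(),
            live.splitlines(),
            fromfile="backup",
            tofile="live",
            lineterm="",
        )
    )


def changed_lines(live: str, backup: str) -> list[str]:
    """Only the added and removed lines, without headers or context."""
    body = manifest_diff(live, backup)[2:]  # skip the two file header lines
    return [line for line in body if line.startswith(("+", "-"))]
```

Lines prefixed with `-` come from the backup and lines prefixed with `+` from the live file, so a changed `image:` or `command:` entry shows up as an adjacent pair.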

1:11:43 Finding and Restoring Scheduler Manifest

1:12:09 Yep. Scheduler. Oh, scheduler.conf. Yep. Not YAML. Yeah. Remember, you're only seeing what's in Vim. Alright. Let's go look at the static manifest, and we'll take it from there. Alright. So, hopefully, it's going straight to that. Also, I believe if you use g and semicolon or g and comma, you can go to all of the edits. So let's have a look at that. Well, there's something there that's interesting. Yeah. Yep. So the command and the image have been changed. So I am not sure. So I don't know what the actual real ones are gonna be. I think the

1:12:22 The Static Manifest

1:13:20 image is GCR. Is it Google? Yeah. I think I can take that off the top of my head. No. Fair enough. Thank you. I can't spell scheduler, but scheduler one point... So there's also a backup file in dot backup if you just want that one. You guys found both things. So... Alright. Okay. Right. Okay. So the actual real shame of this, Russell, is that I wrote my own scheduler, and it was supposed to run. Why is everybody being so harsh? It actually runs on my kind cluster, but I didn't get its credentials correct. And

1:14:11 then I kinda broke my cluster, so I didn't have enough time to debug it. But all it did was print out "one does not simply run a pod" whenever you tried to schedule something. Nice. So... Well, you've not been that nice with your backup. Because you forgot to... no. That is right. Oh, if I got that right, cool. Okay. Cool. I was gonna say, I'm kind with my backups. Anytime I change a manifest file, I make a backup so that... Well, it's because the diff didn't show the image, but it's because I forgot I fixed the image and

1:14:45 then did the... alright. Even got the version right? That's impressive. Hey. These are hand-rolled... well, fully automated clusters, but I put a lot of stuff into these clusters. And then kube-vip comes along and makes my list of all the visitors. Can't see the scheduler. That just means the kubelet's restarted. It takes ten, twenty seconds. Okay. Probably saying that... maybe pulling an image right now, because the pull policy was Always, I believe. Fair enough. So, yeah, sounds like you've run into the same problems I did, Marek, because I was developing my evil on a kind cluster as well. And when

1:15:09 Waiting for Scheduler and Nodes to Become Ready

1:15:29 I moved it over, nothing worked properly. Yeah. I just made the grave mistake of restarting my control plane, and then nothing worked. And so I didn't have time to debug it. So sorry about that. It was supposed to be a little bit more devious, just a little bit. Come on, scheduler. You know what? I don't know why, but I found the scheduler doesn't come back until you restart the kubelet. I'll kick it along a wee bit. There we go. Cool. It should. Anytime you update the static pod manifest, it should. But for some reason,

1:16:10 Worker Node 1 Still Not Ready - Debugging Kubelet Again

1:16:15 it wasn't. Okay. It's now rescheduling everything that needs to reschedule. So that's good. Although my kube-vip fix doesn't work. So we just give up on... That's fine. Oh, you're gonna try and fix it? There's also an environment variable we can pass in to change that port. Okay. There. No. That's the flag. I've tried the flag again. I'm sure there's an environment variable though. I'll make one more change to it and we'll see if we can get by with that. Okay. What's wrong with that? Just because that's what it shows as the

1:17:39 default, minus the port change that we need to make. I'll call on two. Alright. We'll see if that helps. Off you go. Right. So you've, sorry, fixed the worker nodes? So if I... why is it not typing? Okay. Yeah. I had the same problem. Why isn't it typing? Sometimes it doesn't. Oh, both of the workers... I'm gonna ask this question. Uh-huh. Are both of the worker nodes now intentionally broken, or is one of them still unintentionally broken? I mean, there are minor breaks on the nodes. Yes. I don't think I'm gonna see anything from

1:19:01 oh, let's just do it anyway. I'll describe the pod. That's not how you do that. Copy and paste isn't very friendly for... I was just using the mouse. The mouse worked pretty well, which, I mean, is a context switch from your keyboard. But... So what happened? Oh. I resized. I know we're gonna get used to that. I've gotta open another session quickly and fix... I'll pop that over here so it doesn't... but I'll get kube-vip working while you debug. Okay. Really sorry about kube-vip. That's not your fault. That is Dan Finneran's

1:19:57 Debugging Worker Node 1 CNI Issue (Cilium)

1:20:02 fault. Dan. I do believe you warned us that restarting the nodes would take a long time, and you meant they take a long time. They do. Yeah. So those tolerations, they look a bit dodgy, but I'm not sure if that's always there or not. Not-ready, NoExecute, and unreachable, NoExecute. Is that just the norm? Okay. Sorry. I'm back with you now. So we're looking at... Can you see the highlight? No. I don't see it. That would be a cool feature though, but sadly not. Yeah. Okay. So we got

1:20:53 zero nodes are available. One node has a taint. So just under the tolerations, are they normal? It's pretty normal for a pod to have a toleration that the node has to be ready to run on it. Right. So what that's saying is it's not ready. If it's not ready or unreachable, yeah, don't try to execute, don't try to schedule to it. Okay. Right. And it's NoExecute on a not-ready or unreachable node. Mhmm. kube-vip is happy again. I would expect our nodes to be happy again very, very soon. I would expect them to be happier.
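The taint and toleration exchange above can be made concrete with a small matcher. This is a simplified sketch of the matching rule, not the scheduler's code: a toleration covers a taint when the effects agree (an empty effect covers every effect), the keys agree (an empty key covers every key), and the operator is either `Exists` or `Equal` with the same value. The taint keys `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` are the standard ones; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""  # empty key covers every key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""  # empty effect covers every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Apply the effect, key and operator checks in turn."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


def untolerated(taints: list[Taint], tolerations: list[Toleration]) -> list[Taint]:
    """Taints that block scheduling and that no toleration covers."""
    blocking = ("NoSchedule", "NoExecute")
    return [
        taint
        for taint in taints
        if taint.effect in blocking
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]
```

Under this rule, the two `NoExecute` tolerations read out above only cover `NoExecute` taints. A not-ready node typically also carries a `not-ready` taint with the `NoSchedule` effect, which those tolerations do not cover, so the "zero nodes are available" message is the expected result while the node is down.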

1:21:33 But the BGP issue is not a thing anymore. So if you can... So I can go on to worker one, can I? We have a session open on worker one. Yep. You can pop over there. K. Go. Hello. There we go. Oh, that's because you were typing as well. Oh, okay. What was I doing last? Yeah. Apparently, I spelled Prometheus wrong in the flag. Thanks to James for pointing that out. I got it fixed. So if you restart that, I think that'll be okay. Okay. Assuming I set the IP address back in their config, which I'm pretty confident I did.

1:22:25 Yeah. That's a good kubelet. That's a bad kubelet. Did I not change the IP address back? Well, I'm sorry. That just makes me a terrible, terrible person. Should be 8191. Is it using 8191? No. Alright. I thought I was confident I changed that back. Oops. Sorry. I was just gonna run a journal. Alright. I'll step back and let you. I just wanna make sure that I'm not tripping you up any more than I already am. Okay. So this isn't the BGP. This isn't TLS. This is now just a broken node from, I believe... So... Right. Felt like a node node in four.

1:23:47 Kubelet: container runtime network not ready. Not ready. Network plugin not ready. Network plugin returns error. CNI plugin not initialized. So what has he done to the network? Do you know how the CNI works? No. I've never touched that to try and break it for you yet. Alright. And so what happens... I was just checking on the control plane there to see if we had our Cilium pods on all the nodes, which it appears that we do. They have an init container that actually downloads binaries to each node, which I suspect could have been tampered with

1:24:37 and maybe the binaries removed, which would cause that error. Say that again, sorry. Missed that. So we have an error in our kubelet logs. Container runtime network not ready, network ready false, network plugin returned an error, CNI plugin not initialized. I can't remember where they download to. I thought it was maybe slash CNI. But if we jump to the control plane node for a moment and if we edit the DaemonSet. Right. I'm on that. Yeah. Alright. So, Cilium. There's init containers. Not what I expected to see, but doesn't mean it's wrong.

1:25:34 Is that they download the binaries to the machine, /opt/cni, okay. So if we take a look in here, those all look okay. So... Okay. That idea is dumped. So can I have a look... which one am I on? Why can I see inside what's to do Vim in my... Alright? In the control... I'm on the control plane. I'm inside Vim. I've kinda stepped out of that. Yep. Alright. So what are the... And there's a comment in the chat there from Cook Jan saying, what do the logs from Cilium say? Yeah. I'd be curious about that too.
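The manual look through the CNI directory above can be sketched as a small check. The assumptions are the conventional locations `/opt/cni/bin` for plugin binaries and `/etc/cni/net.d` for network configuration, plus an expected plugin name such as `cilium-cni`; none of these were read out on stream, and the function names are hypothetical.

```python
import os
from pathlib import Path

# File extensions a CNI config loader conventionally accepts.
CONFIG_SUFFIXES = {".conf", ".conflist", ".json"}


def missing_cni_plugins(bin_dir: Path, expected: list[str]) -> list[str]:
    """Names from `expected` that are absent or not executable in bin_dir."""
    missing = []
    for name in expected:
        candidate = bin_dir / name
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing


def cni_configs(conf_dir: Path) -> list[str]:
    """Sorted names of CNI config files found in conf_dir."""
    if not conf_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in conf_dir.iterdir()
        if entry.is_file() and entry.suffix in CONFIG_SUFFIXES
    )
```

On a node this might be called as `missing_cni_plugins(Path("/opt/cni/bin"), ["cilium-cni"])` and `cni_configs(Path("/etc/cni/net.d"))`. If the binaries are present but no config file has been written yet, the fault is more likely that the CNI agent pod never started properly on the node, which matches where the debugging goes next.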

1:26:52 That's not the... Just pick any of the non-operator Cilium pods. Okay. Useful. Okay. Will that be the same for all of them? Yeah. I feel it will be too. Just have a look at this one. I feel like we wanna look at that DaemonSet again, potentially. Oh, what's the... So this was another of the Cilium pods. So let's just check that I'm not going crazy. So I looked at this Cilium pod, which had a restart, and that's what the logs are from this one. Yeah. Shall we take a look? Chip in if you want. Well, we've got hints in the

1:28:35 story, do we? Are they on the root file system? No. You'd be... No. There is your setup script. Yep. But I don't wanna be... I'll just say there is no more image shenanigans. You can trust every other image that you find, so that we don't have to go down that rabbit hole with literally every image. You don't have to think that I hijacked the Cilium image and compiled it. But you have broken Cilium. Right? I have broken something that has broken Cilium. Yes. Alright. On the worker nodes.

1:29:16 Possibly. Yeah. Cilium won't be working, possibly. Yes. Yeah. Okay. So I don't know. So we might have to know whether or not it would help us. Neither. What the problem is. Do I? Alright. So what would break Cilium? Right? So we don't get any logs from the pod. So we need to just work this out. The binaries do exist in... So I'm thinking it's probably something, I think, outside of Kubernetes that's network based, where it can't be taking the network interface down, because we wouldn't be able to talk to it. I believe that's right, isn't

1:30:00 Checking IP Tables on Worker Node 1

1:30:16 it? Mhmm. Even so, it shares the same IP. So it could either be a firewall or iptables. Let's take a look. Alright. So are we on worker one? We are. I've run a list on iptables. I mean, we could just YOLO this right into a flush and see if it gets things ticking over again. So, gone. If there was an iptables rule, it no longer exists. What else can we... Will that break Cilium though, or will Cilium inject its rules back in? Cilium will just add the rules back in. Okay. Now we're getting a connection timeout and our

1:31:20 IP here. No. Because it is an IP, it can't be a hosts file entry. I don't think you can override IPs to IPs. I really hope this isn't the BGP thing again. And that's just something intentional, Marek? Just wanna check. Because the BGP looks okay. In fact, you can't ping a BGP address. I'm just being really silly. Okay. Networking thing. Has it added back any rules? No. Should we rotate the pods? Yeah. I guess so. Let's... Delete everything in the world. Bye bye. Bye bye. Bye bye. Bye bye.
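Because the virtual address may not answer ICMP at all, a TCP connection attempt to the API server port is a more telling probe than ping. The sketch below classifies the outcome. A timeout suggests packets are being dropped or the address is not being advertised, while a refusal suggests the host is reachable but nothing is listening. The function name is hypothetical, and port 6443 in the usage note is only the common kubeadm default, not a value confirmed in the session.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused' or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "no route to host" or "network unreachable".
        return f"error: {exc}"
```

Run from the worker as `tcp_check(control_plane_ip, 6443)`, this separates "the firewall is eating my packets" from "the service is down" without depending on ping.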

1:32:01 Rotating Cilium Pods

1:32:17 Alright. I get a bit happy when I delete things. Alright. Those are terminating. We've got one up, but not quite ready. Alright. One ready, one up. No rules yet. Makes that we win for that node. Let's check. Grep. This one's still terminating. You want, well you want a control plug, mate. Yeah. I mean, we could maybe try and speed that along, but perhaps that's part of the break. I'm not seeing it running. So currently, we don't have Cilium working on our node, and it's not terminating. I don't see a process.

1:33:29 So this must be a Kubernetes thing. Right? So if the node was down, Kubernetes wouldn't be able to reschedule anything, would it? I can't delete it because... yeah. Of course. I forgot about that. So it's never gonna terminate, because I can't reach the kubelet. We could try restarting the kubelet now that we've flushed all those rules. Let's see if we've got a happier kubelet. But I think there's an underlying problem that's gonna stop the kubelet from talking to the API server. I think you're right. I just don't know enough now.

1:34:15 Like, this is where we start to reach my limit of Linux. I don't know anything about firewalls and that lot. So... Alright. Do you want me to give a hint? Yes, please. Now, I was kind of expecting the flush of the iptables to fix it, because I just did a UFW rule to drop all communication from the server. No. It's UFW, but... But UFW just sets up iptables rules. Right? Now I don't know if it just reset it, but it didn't look like it re-updated the iptables. So I don't know.
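Since UFW expresses its rules through iptables, a dropped port can be looked for in the rule listing itself rather than by eye. The sketch below scans text in the format printed by `iptables -S` for rules that drop or reject a given port. The function name is hypothetical. Port 10250 in the usage note is the kubelet's default port and is an assumption, since the transcript never names the port that was dropped. Port ranges and multiport matches are not handled.

```python
import re


def drop_rules_for_port(rules_text: str, port: int) -> list[str]:
    """Return rule lines that DROP or REJECT traffic matching the given port."""
    port_pattern = re.compile(rf"--[ds]port {port}(?:\s|$)")
    verdict_pattern = re.compile(r"-j (DROP|REJECT)\b")
    hits = []
    for line in rules_text.splitlines():
        if not line.startswith("-A "):
            continue  # skip policy (-P) and chain creation (-N) lines
        if port_pattern.search(line) and verdict_pattern.search(line):
            hits.append(line)
    return hits
```

Fed the saved output of `iptables -S`, a call such as `drop_rules_for_port(listing, 10250)` returns any matching rule lines, including ones sitting in UFW's own chains. An empty result after a flush would back up the later conclusion that the remaining timeout had a different cause.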

1:34:25 Marek Reveals Worker 1 Break (UFW)

1:35:05 Is the iptables flush... Yeah. Yeah. The iptables flush should have fixed it because, again, UFW is just a front end for iptables. So this is probably my BGP thing again. This is probably your... I was gonna say, you actually did fix the thing I broke, because I just dropped the port that the kubelet communicates on. Okay. So that was... Has that been done on worker two as well? No. I didn't repeat. I don't do repeats this time. Alright. So is worker two down

1:35:46 intentionally, or is that a BGP problem as well? Probably both. Ah, okay. Right. So there is a definite break on worker two as well. The break on worker two is more subtle. I would actually expect the node to show available, just not function properly. Okay. I am gonna very quickly delete kube-vip on the control plane. Bare metal is hard. You know? It's hard. It's very, very big. I should save that curl command. So I'm gonna try restarting the kubelet one more time. Okay. Why? I don't think kube-vip is doing what it's supposed to be doing, which is

1:37:15 really frustrating, because that's kind of ruined any chances we have of getting this to work. Yeah. I'm sorry, guys. That is my BGP address. Right? The 8191. Yeah. 178... and I've not messed up, have I? No. Damn you, kube-vip. So with this, it's going to be really hard to actually debug the other thing. So I can either just step you through them, since we're low on time... Would any of your breaks have actually stopped kube-vip from being able to advertise? No. No. I specifically... I did very targeted. So, like, with my firewall,

1:38:10 I dropped the kubelet port, not anything else. I didn't drop all communication ports because that would, like, break SSH and things like that. So they were very targeted towards the kubelet. So, no, it wouldn't have broken it. So should we try... oh, we've only got a minute left. I was gonna say, should we go on to worker two? Because that hasn't had its iptables flushed yet, or... This Cilium stopped. Yeah. This is where it's gonna be very hard to solve the other issues. They're not hard. It's just, if this isn't working, then the other issues

1:38:45 aren't seen. Okay. I'm gonna run out of time then, I guess, but I don't mind going on a little bit further if you want, but if you wanna explain it then. Yeah. Why don't you walk us through it, Marek? Okay. Yeah. Right now, I don't think we can curl and speak to that BGP address, which makes me think that it's not being advertised, and we're dead in the water at that rate. So on node one, I just dropped all communication on the kubelet port so that the kubelet wouldn't be able to communicate

1:39:13 Walkthrough: Marek's Breaks (UFW, Kubelet Rate Limit, Postgres Capabilities)

1:39:19 anymore. That's all I did there. And then on the second node, the only thing I did was I took the kubeconfig and turned its quality of service down so it could only make one request per second, which means that when you pull pods and stuff, it actually communicates more than that. And it makes a funny error because, like, the kubelet's working and communicating, it's just doing it really, really slowly, and it causes weird errors. I thought that would be fun, but we never got to that. So... But unless it said in the logs rate

1:39:56 limited or something, I would never have found that. Did you use... Was it traffic control you used, or something else, to change the quality of service? I used Go's built-in... Go has a built-in rate limiter, so I used that. That will do it. So anyway, I was pretty happy with that one. And then the very last thing that I did is I went into the deployment for your Postgres database, and I dropped all capabilities on the container, which it needs to access the file system and the network and stuff.
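The effect of a one-request-per-second client limit can be illustrated with a token bucket, the usual mechanism behind client-side rate limiters. This is a sketch of the general technique in Python, not the Go limiter Marek describes using, and the class, function and parameter names are hypothetical. The transcript does not say which setting was changed; the kubelet's configuration does expose client rate fields (`kubeAPIQPS` and `kubeAPIBurst`), which is one place such a limit could live.

```python
class TokenBucket:
    """Token bucket: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def reserve(self, now: float) -> float:
        """Take one token at time `now`; return how long the caller must wait."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        self.tokens -= 1.0  # may go negative: the request is queued
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


def last_request_delay(requests: int, rate: float, burst: int) -> float:
    """Delay seen by the last of `requests` calls all issued at t=0."""
    bucket = TokenBucket(rate, burst)
    delay = 0.0
    for _ in range(requests):
        delay = bucket.reserve(0.0)
    return delay
```

With `rate=1, burst=1`, ten calls issued together leave the last one waiting nine seconds, while a more generous `rate=5, burst=10` lets all ten through immediately. A kubelet that makes many API calls per sync loop would therefore look alive but fall steadily behind, which fits the "working, just really slowly" symptom described above.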

1:40:36 So it had no capabilities, so the database would never run because it didn't have the permissions to. So that was the last thing. Well, damn, we both... I mean, I know we got a recompiled kubelet and a recompiled kubectl, but messing with networking is just as evil. Well, Russell, I appreciate your cluster. That was good. Thank you very much. I was absolutely flummoxed by yours. Yeah. As I said, I would not have got that quality of service thing. Well, I set you up for failure with this broken BGP, looks like. So... Oh, yeah. Yeah. I'm definitely

1:40:47 Reaction and Wrap Up

1:41:15 gonna blame you. I'm definitely blaming you. Alright. Well, thank you both for joining me. Those are some really good breaks on both sides there. Really cool. And I hope that we were able to share some knowledge with the audience of how to potentially debug some stuff. And I will make sure that we have some properly working clusters before our next session. All right. I wanna say thank you to Teleport and Equinix Metal again. Thank you to Russell. Thank you to Marek. I hope you all have a wonderful evening. Klustered is not back next week, unfortunately, but we are back

1:41:47 the following week. So I will see you all then. Yes. Alright. Have a great day. I'll speak to you all soon. Thanks. Bye. Bye.
