About this video
What You'll Learn
- Diagnose pod startup failures from Cilium and scheduler corruption in Rachel's cluster, then restore pod scheduling.
- Identify a 1m CPU limit and broken service definition, then fix and recreate it so apps recover.
- Recover Andy's cluster by replacing fake kubectl and API server image, then repairing Cilium masquerade and DNS breakage.
Rachel and Andy bring sabotaged Kubernetes clusters. Rachel's hides disabled schedulers, a 1m CPU limit, and a mangled Service. Andy's stacks a fake kubectl, swapped API server image, Cilium IPv4 masquerade flip, hosts file DNS hijack, and read-only worker.
Jump to a chapter
- 0:00 Holding screen
- 1:00 Introductions
- 1:08 Welcome and Housekeeping
- 2:10 Guest Introductions: Andy & Rachel
- 4:00 Rachel's Cluster
- 4:08 Starting with Rachel's Cluster
- 4:47 Initial Cluster Assessment
- 5:34 Identifying Pod Network Issues
- 6:07 Debugging Cilium Logs
- 8:13 Investigating Scheduler and Pods
- 9:44 Node Scheduling Disabled
- 10:03 Pods Creating/Pulling Images
- 11:55 Unexpected Extra Postgres Pods
- 13:10 Checking Application Access via Teleport
- 13:37 Internal Server Error
- 14:30 Identifying Resource Limit Issue
- 15:00 Fixing Resource Limits on Pods
- 16:22 Retesting Application Access
- 18:51 Debugging Service Definition
- 20:02 Editing the Service
- 21:30 Removing Cluster IPs from Service
- 23:29 Deleting and Recreating Service
- 24:37 Testing Application After Service Fix
- 26:15 Application Works (Rachel's Cluster Fixed)
- 26:44 Discussion on Rachel's Break
- 27:22 Transition to Andy's Cluster
- 27:30 Andy's Cluster
- 28:14 Initial Assessment (Andy's Cluster)
- 28:51 Discovering Broken Kubectl Binary ("Honk")
- 30:50 Investigating the Kubectl Binary
- 32:48 Finding the Real Kubectl Binary
- 34:40 Restoring Working Kubectl
- 35:37 Debugging Control Plane Issues
- 36:07 Checking Static Manifests Directory
- 37:46 Identifying Modified API Server Image
- 38:28 Correcting API Server Image in Manifest
- 39:45 Checking Kubelet Status
- 41:14 Kubelet Running, No API Server Logs
- 42:36 Restarting Kubelet
- 43:26 No API Server Logs After Restart
- 43:51 Reviewing Admin Config
- 44:50 Checking Containerd Status
- 45:22 Containerd Image Pull Errors (TLS)
- 46:47 Debugging Image Pull Issues
- 47:41 No Containerd Command/Config Dump
- 48:38 Pulling Image Locally for Comparison
- 49:20 TLS Errors Confirmed
- 50:18 Suspecting Certificate Issues
- 51:00 Discovering Networking Issues (Ping/iptables Fail)
- 52:00 BPF and Cilium Discussion
- 52:48 Theory: Cilium Policy and Broken Control Plane
- 53:06 Andy Reveals Break Got Away From Him
- 55:53 Considering Rebooting the Node
- 56:37 Disabling Kubelet and Rebooting
- 56:54 Andy Explains Intentional Break (Read-Only FS, Etcd)
- 1:00:07 Andy's Malicious API Server Code Revealed
- 1:01:00 Realization: Networking Broken During Setup
- 1:03:14 Checking Node Status Post-Reboot
- 1:04:03 Node SSH Accessible, Teleport Not Yet
- 1:04:33 Recalling the Fake Kubectl
- 1:05:07 Teleport Relies on kube-vip BGP
- 1:05:36 Cilium BGP Enabled
- 1:06:12 Starting Kubelet to Restore Teleport Access
- 1:08:15 Teleport Access Restored
- 1:10:27 Checking Hosts File for DNS Hijack
- 1:11:14 Identifying Hosts File DNS Hijack
- 1:11:50 Fixing Hosts File and Restarting Services
- 1:12:24 API Server Is Running
- 1:13:00 Application Still Broken
- 1:14:26 Debugging CoreDNS Issues
- 1:15:35 Hint: Check CNI (Cilium) Configuration
- 1:16:22 Locating and Reviewing Cilium ConfigMap
- 1:17:07 Debugging Cilium Config
- 1:18:41 Identifying IPv4 Masquerade Issue in Cilium Config
- 1:19:08 Fixing IPv4 Masquerade in Cilium Config
- 1:20:18 Checking Other Cilium Configs
- 1:20:43 Cilium Status and Rollout
- 1:22:18 Andy's Kubectl Namespace Easter Egg
- 1:22:39 Cilium Working, CoreDNS Still Broken
- 1:23:10 Reviewing CoreDNS ConfigMap
- 1:25:40 Checking Kubelet DNS Configuration
- 1:31:35 Exec into Application Pod for DNS Check
- 1:32:51 Checking CoreDNS Service IP
- 1:35:04 Switching to Worker Nodes to Debug DNS
- 1:35:59 Discovering Read-Only Filesystem on Worker
- 1:36:51 Andy Confirms Read-Only Issue
- 1:37:55 Attempting FS Check on Worker Node
- 1:38:30 Fixing Kubelet DNS Config on Worker
- 1:39:06 Rebooting Worker Node After FS Check Attempt
- 1:40:10 Cordoning Off Failed Worker Node
- 1:40:36 CoreDNS Pods Coming Up
- 1:41:00 Restarting Application Pod
- 1:42:02 Application Working (Andy's Cluster Fixed)
- 1:42:12 Discussion and Conclusion
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:08 Welcome and Housekeeping
1:08 Hello, and welcome to Clustered. Today is episode 16 of Clustered, and we will be fixing some rather broken Kubernetes clusters, broken by two fantastic guests. Before we get started, there's just a little bit of housekeeping. If you're not already a subscriber, you should do so now. Click that subscribe button, tick the bell, get notifications for all new episodes from the Rawkode Academy. My goal is to provide the best cloud native and Kubernetes learning materials for us all to learn this vast landscape together. So we'd like you to join us. Also, if you wanna chat with other cloud
1:42 native and Kubernetes people, you can join us on the Discord server available at Rawkode.chat. I also wanna thank Teleport, who started sponsoring Clustered a few weeks ago. We've been using Teleport since the very first episode of Clustered. It is a fantastic tool, and I really appreciate their support. I won't give you the pitch for the product. You'll see it for yourself as we fix a couple of clusters in just a moment. Now, today's clusters are broken by, like I said, two fantastic guests, and I'm gonna bring them in now. Here we are. Hello, Andy. Hello, Rachel.
2:10 Guest Introductions: Andy & Rachel
2:13 How are you both today? Doing well. Thank you. Good. Thanks. Awesome. Well, thank you so much for joining me. Can we just start with a quick round of introductions? I'll just say Andy first, since you're right next to me on the screen, and then Rachel. Please tell us a little bit about yourselves. Yeah. So I guess I started doing Kubernetes on bare metal, on prem, with Chef on 1.9, and then just kind of progressed the automation from Chef through Cluster API, changed gears, and am now doing it for Equinix Metal, lighting up dark data centers
2:53 with a global PXE boot and Flatcar. So. Nice. Thank you. And, Rachel? Yep. I'm Rachel Leakin. I'm a solution architect at VMware helping customers modernize their applications onto Kubernetes. I started in Kubernetes probably three years ago now, more on the architect side, and then started to get down into the weeds when I joined VMware, helping customers set up their clusters. Now I'm already helping them get their applications onto clusters. A side note from that: outside of work, I'm a big Chelsea fan, as you can see. You know, since we're in a slight competition, I had to, you know, represent
3:35 my team, champions of Europe. So, a little bit about me. Alright. Thank you both very much. I'm just gonna share something with the audience because I think it's really funny. Just before we went live, Rachel was informing us that she made two extra breaks to the cluster, which, of course, is wonderful. And I mentioned that a lot of breakers do that because they panic that it's not broken enough, or that they've taken it too far, and then Andy made the best face. So we've got two potentially disastrous clusters, and I think we should jump straight into
4:00 Rachel's Cluster
4:05 our first one and see what we can fix. So our first cluster today is Rachel's, which means, Andy, you and I are up first. Okay. I am gonna click connect. Oh, my computer decides to play. There we go. I'm gonna click connect on the control plane node of our first cluster. I'm gonna remove that teleport. Thank you. Andy, if you can do me a favor and join a session and just type echo hello or whatever you want so that I know you're here. Wow. Straight for Vim and the bashrc. Alright. There we go.
4:08 Starting with Rachel's Cluster
4:44 And the first thing we normally do here is export our kubeconfig and check for a control plane. I will let you do that by whichever means you wish. Yeah. Sorry. I'm used to being quiet while I sit on incident calls. Well, that's a good start. We got a server version. And the git tree state is clean. Alright. What do you wanna do next? I guess we should check our pods. We got some pending pods. Alright. What do you wanna do first? Let's look at kube-system. Looks like the pod network is a little borked. Well, yeah, I see two terminating CoreDNS,
5:34 Identifying Pod Network Issues
5:51 two pending, some broken Ciliums. Yeah. Quite a lot broken. Should we describe something and see if we can get some better messages? I think we should take a look at the not-ready Cilium logs. Okay. 10 it's a common port. I see it all the time. What is that? That is one of the API server ports. I'm assuming our node-to-node networking is working. Nice alias. I like that. Yeah. I believe I got that from mister Brent Buckley in the chat when we were at SendGrid together. Well, it looks like the scheduler was bounced
6:07 Debugging Cilium Logs
7:12 recently. Well, that's different. No. They've I think they probably have changed how describe works because I know that used to work. You spelled system wrong in your alias. Okay. At least that's an easy fix. Our colleague, our joint colleague, Zoe is laughing at your thinking face. Oh, would you look at that? Images. It looks like the official image. What were you checking for there? Well, I noticed that the scheduler was bounced fourteen hours ago, So I assume that that bounced because Rachel did something. I did something, but I didn't know that's what I was doing.
8:13 Investigating Scheduler and Pods
9:08 And then for people who don't know, caret, word, caret is sed on your last history entry. Okay. So queue, it seems back. Running. Yeah. Scheduling disabled on both the worker nodes. You think that's just a cordon? Yeah. I mean, to put that in the status is typically what that would mean. Okay. Creating container. Ambassador's coming up. Looks like it came up. I'm guessing we have to hop on to the other two nodes and enable kubelet. Why did you think that? Sorry. These are still stuck in creating. In my iTerm, having highlights is always fun.
10:03 Pods Creating/Pulling Images
10:59 Well, it's also in iTerm. When you highlight something, it automatically goes to your paste buffer. Ah, of course. So I mean, it looks like it's pulling the image. Yeah. Looks like an official Postgres image. Which means we do have a kubelet. Would you like me to open up a shell on one of those nodes? Oh, there they go. And you were just being impatient. I would never do such a thing. Are we supposed to have three Postgres? Nope. Although I don't think it matters, to be fair. They don't use any volumes. Sorry, Rachel? I was just trying to deflect, you
11:55 Unexpected Extra Postgres Pods
12:08 know, just, you know, because I was putting on my developer hat. I was like, why do we just have one? That doesn't, you know, make sense. So. And this is not a problem, though. Like I said, there are no volumes. They load everything via an init container into the database. Yeah. Except then the traffic's going to get split by Cilium to three different pods. And if they're not set up in a replica, master, whole shindig, it's potentially not going to render correctly. I assume Clustered talks to it. I actually didn't spend any time looking at your application.
12:52 Root? So clustered speaks to each Postgres. They're not clustered. Like I said, the data is also static, so you could have 50,000 Postgres. It would be alright. But, yeah, Brett, we can clean it up over here. No ingresses. There's no ingress, but we do use the teleport proxy feature. So if I jump along here and click applications, we should be able to click clustered. Give it a few seconds while it does the proxy magic, and we have an internal server error on our application. Oh, our cluster Yes. Sorry? I broke it correctly. I I will point out that
13:37 Internal Server Error
13:42 this is not my application reporting this internal server error, I don't think. So we've got some work to do. I was hoping you were being nice to us, Rachel, and it was just the kubelet and the scheduler. Is that not it? That was the basis for it. And then I was like, that's not enough. I gotta, like, break some more things, slight things. So that image is correct. ghcr, so it's GitHub Container Registry, Rawkode, clustered. Good one. Image, container port. Looks good. That's a problem. What's the problem?
14:30 Identifying Resource Limit Issue
14:41 The limit? Yeah. It's only getting one millisecond of CPU time per second. So, right, that would cause basically constant throttling, because I'm not even sure how many, I don't think one is an appropriate value for CPU. It would need to be, yeah. No. It can be a string or an integer. But one, would that be a core or one millicore? One core. So the suffix, oh, chat's telling me I missed something in memory. No. 128, I assume, would be fine. So what's the deal with, if you set CPU to one without the m suffix, it
15:00 Fixing Resource Limits on Pods
15:39 just defaults to? One's one vCPU. Yeah. Right. Okay. So that's something that the Kube community spent a lot of time centralizing. So if you're in vSphere, AWS, Azure, or on bare metal, it guarantees you roughly the same amount of compute time. And so the way to read an integer one would be, like, one second of compute time per one second of human time. Okay. Got it. See if that redeployed. Do you wanna try to hit the service again? I'd love to. Let's see. It could be cached, but let me, it is still broken.
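The unit discussion above can be made concrete with a small sketch. It is only an illustration, not anything run in the episode: the helper name `cpu_to_millicores` is made up here, and it handles just the two forms mentioned in the conversation (a plain number of CPUs and the `m` millicore suffix).

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "1", "0.5" or "250m" to millicores.

    Hypothetical helper for illustration; only plain numbers and the "m"
    suffix are handled.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # "1m" is one thousandth of a CPU.
        return int(quantity[:-1])
    # A bare number is a count of whole (or fractional) CPUs.
    return int(Decimal(quantity) * 1000)


for value in ("1m", "250m", "0.5", "1"):
    print(f"cpu: {value} -> {cpu_to_millicores(value)} millicores")
```

Read this way, a limit of `1m` leaves the container a thousandth of a CPU, while `1` is a full CPU, which matches the "one second of compute time per one second of human time" description.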
16:22 Retesting Application Access
16:39 Okay. Yeah. So the average load looks fine, so I don't think we're getting CPU pegged. Let's see. If she did that in one place, she may have done it in another. Let me go through my bash history to see if I can find a useful jq command. So, basically, what I wanna do is just go through all the containers and look for their resources and limits. Like, you can just do a quick kubectl edit pod, and it'll give you all of them. As a single document with them. Yes. Or, I'm learning some new commands over here. Nice.
18:08 Or like that. So let's see. One, one. Looks good. 200, probably fine. Unset. Unset. This is CoreDNS, because they are so particular about the amount of RAM they use. What? 170? Yeah. They are insistent that that is enough for every pod you could possibly ever need. There you go. Okay. So yep. So we've got our pod running. But when we try to browse to it through Teleport, we get an internal server error. Now how does Teleport work? So we expose the service as a NodePort service on the node, and Teleport is proxying through
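The idea of walking every container and checking its limits can be sketched roughly as below. This is not the jq command from the episode, which is not shown in the captions. It assumes input shaped like the JSON from `kubectl get pods -A -o json`; the function name, the 10-millicore threshold, and the sample pod and container names are all hypothetical.

```python
import json


def find_tight_cpu_limits(pod_list: dict, min_millicores: int = 10) -> list:
    """Return (namespace, pod, container, limit) for CPU limits below min_millicores."""
    results = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for container in pod.get("spec", {}).get("containers", []):
            limit = container.get("resources", {}).get("limits", {}).get("cpu")
            if limit is None:
                continue  # unset limits are not flagged by this sketch
            text = str(limit)
            millicores = int(text[:-1]) if text.endswith("m") else int(float(text) * 1000)
            if millicores < min_millicores:
                results.append((meta.get("namespace"), meta.get("name"), container["name"], text))
    return results


# Hypothetical sample data standing in for real cluster output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "default", "name": "clustered-example"},
   "spec": {"containers": [{"name": "clustered",
                            "resources": {"limits": {"cpu": "1m", "memory": "128Mi"}}}]}},
  {"metadata": {"namespace": "default", "name": "postgres-example"},
   "spec": {"containers": [{"name": "postgres",
                            "resources": {"limits": {"cpu": "200m"}}}]}}
]}
""")

print(find_tight_cpu_limits(SAMPLE))
```

With real data, the same function could be fed the parsed output of the `kubectl` command instead of `SAMPLE`.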
18:51 Debugging Service Definition
19:01 that NodePort service. So if you do that, yep. There we go. You should see we have a... Well, those don't have IPs. Yes. And it looks like our clustered service has been either modified, deleted, or something else created. In fact, it was my clustered service; it wasn't meant to be a load balancer. I think the other hint is the age on it. External traffic policy won't apply to node ports. How do I wanna do this? If this yells at me, I'm just deleting it. You could probably set that to NodePort. And I think the first five lines of that
20:02 Editing the Service
20:15 spec can be deleted and inferred on the next creation. The traffic policy should only matter if, like, Matt Anderson applied more cluster network policies. Speaking of, I might have done a lot of research, and that episode was brutal. So their network policies can trip you up. That's for sure. I have no idea about the Cilium host network policies. Those are dangerous. Okay. Does Ambassador need to be fixed? No. Ambassador does not need to be fixed. Okay. Do you wanna try to load the application? Yeah. I still don't think it'll work because of the cluster IPs in that service.
21:29 So do you mind if I type? Yeah. Go for it. You're, like, right there, Andy. You're right there. I think we can just remove all of these. I don't think that they're, okay, doing anything for us. And I don't even remember if that's the correct node port. So I may have to check our Teleport config. Is the node port complaining at you? Actually, I was going to just delete it and recreate it through kubectl expose. It's not letting me delete those anyway. So what was it complaining about? What was that? Okay. Yeah. Let's check the port. 31644.
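As a rough sketch of the "strip the assigned fields and recreate" idea discussed here: the function below removes the server-assigned parts of an exported Service so it can be applied fresh. It is an illustration only. The helper names, selector, and cluster IP value are hypothetical; the node port 31644 is the one read out above, and 30000-32767 is the default NodePort range.

```python
import copy

# Default kube-apiserver NodePort range (30000-32767 inclusive).
NODE_PORT_RANGE = range(30000, 32768)


def prepare_for_recreate(service: dict) -> dict:
    """Return a copy of a Service manifest without server-assigned fields."""
    cleaned = copy.deepcopy(service)
    for key in ("resourceVersion", "uid", "creationTimestamp"):
        cleaned.get("metadata", {}).pop(key, None)
    for key in ("clusterIP", "clusterIPs"):
        # Let the API server allocate these again on creation.
        cleaned.get("spec", {}).pop(key, None)
    cleaned.pop("status", None)
    return cleaned


def node_ports_in_default_range(service: dict) -> bool:
    """Check every declared nodePort against the default range."""
    ports = service.get("spec", {}).get("ports", [])
    return all(p["nodePort"] in NODE_PORT_RANGE for p in ports if "nodePort" in p)


# Hypothetical exported Service; only the nodePort value comes from the episode.
SERVICE = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "clustered", "uid": "example-uid", "resourceVersion": "12345"},
    "spec": {
        "type": "NodePort",
        "clusterIP": "10.96.0.50",
        "clusterIPs": ["10.96.0.50"],
        "selector": {"app": "clustered"},
        "ports": [{"port": 80, "targetPort": 8080, "nodePort": 31644}],
    },
    "status": {"loadBalancer": {}},
}

cleaned = prepare_for_recreate(SERVICE)
print(cleaned["spec"])
print(node_ports_in_default_range(cleaned))
```

The original dictionary is left untouched, so the two versions can be compared before anything is deleted from the cluster.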
21:30 Removing Cluster IPs from Service
22:23 As 30,000, there's a node port we expect. You guys are so logical with this. I was just, like, really breaking it like it's a. I don't know anything about Kubernetes, so I'm just gonna break everything. I I want my app to work or not work. I really put on my developer hat. I've never touched Kubernetes and just broke it. Alright. It's still not working. Alright. Click on it. So So one useful command is the explain command. It's kinda like having built in docs. So Do you want to do it on a shared buffer? Yeah. The explain command is cool.
23:15 Yeah. If people don't know about this, you can basically give it a path in the spec of the object, and it explains it. So that being said, Rachel did create the service, so I'm highly inclined to just delete the service. Okay. You can always reapply it from source, I guess. Yeah. Where's the source at? Is it here? Rawkode Cluster. Oh, hold on. What's that? Did you make that file? No. Oh, it's is that UTC? That's probably UTC, isn't it? Yeah. Yeah. That looks like a backup. It does look like a backup. Your magic key. That looks like their secret
23:29 Deleting and Recreating Service
24:30 Easter egg in there. Was trying to hide for you. Alright. We're getting the watch, which means we can try and upgrade our application from v1 to v2 and see if we get the happy then. How many BogoMIPS do these computers have? Oh, wow. That's a lot. These are our c1 smalls on Equinix Metal. Okay. Yeah. So I think they're maybe 32 cores, 16 gig of RAM, or 32 gig of RAM and 16 cores. Don't remember. There's eight CPUs. Why? Why? If I had to build a service, I might do it on Equinix Metal
24:37 Testing Application After Service Fix
26:03 myself. It is nice having access to the metal. I've gotta see. Let's test our application. I think I still have the watch, isn't it? Yeah. Let's refresh. One browser. Pull the new one. I don't know if that updated yet. Looks like it's up. Oh, there we go. Alright. We have the dance. Alright. Good job, Andy. Thank you, Rachel. Maybe it wasn't that bad. I gave you that hint. I wrote some other things, but they they were all more distractions than actually breaking it, as you can see. Well, other things were were hiding for us if we managed to accidentally pick them up.
26:44 Discussion on Rachel's Break
27:04 Yeah. Noel got one. I, you know, changed some port protocols and just little things, port values, just in other places, just to distract you guys. Alright. Awesome. Thank you very much. Cool. Cool. Alright. Let's jump over. Let's swap places. Clustered. Pull up Andy's. Maybe. Alright. Alright. I'll just type it in. Okay. I have Teleport running with Andy's cluster. I will click connect and open the control plane node. Rachel, if you can get logged in, join a session, give me an echo, let me know that you're here. Andy, feel free to have a laugh. There's no ever on the road. I'm
27:30 Andy's Cluster
28:06 gonna be laughing at myself. So. Awesome. Okay. So step one is typically to export a kubeconfig and check if we have a control plane. Feel free to do that however you wish. Alright. One. Everything saved. Let's see. What's the, I don't remember the stuff. admin.conf. Yeah. You got it. Alright. Let's see. I don't use any aliases, so I'm just lazy, and I never remember anything. So, oh, oh, man. Are you killing me already, Andy? I don't think that's kubectl. I think Andy's been cheeky. Andy, really? How could you do this to me already? I'll be honest. I don't even know. No.
28:51 Discovering Broken Kubectl Binary ("Honk")
29:11 David's right. That is cheeky. Okay. Okay. So Let me see. Let's just open it then and see what is left for us. Okay. That's just what are we we got anything on here? What's this? I just just see what this is. You're both just leaving random yeah. Oh, I left that random layout. Yeah. Was like, alright. Who do I run up after myself? What's where are we at Yeah. I think we I don't think that's gonna be a a binary. I don't know if you think the same, but maybe we should just open it and
29:57 and see. Okay. You wanna open the, which binary? The kubectl binary. Okay. Let's see that. I mean, do you expect this to be a binary, or do you think this is gonna be a bash script that just prints out a crappy sentence? I've never seen anything like this. That's so funny. I've never got that message before when I work, I think stories. So. Oh. Alright. Okay. So it only does it when we have a subcommand. I know. Oh gosh. Did you break the cluster or did you break kubectl? Which one did you break?
30:42 I don't, it's a, have you guys ever had to try to find, like, what messages are embedded in a binary? Uh-huh. What command would you use for that? So we can use strings on, no. No. It is strings. Right? Strings. Yeah. Oh, no. You might have to install it. Yeah. I think I need to install maybe elfutils or estrus. But I don't think we should, should we just remove kubectl and reinstall it? I'm trying to see if there's any hint online that I'm missing. What else? Oh, dependency warnings. Jeremy says honk. Need some help, guys. What's this?
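If the `strings` tool is not installed, its basic behavior (printing runs of printable characters from a binary) is easy to approximate. The sketch below is a simplified stand-in, not the real utility: the function name is invented, it only looks at ASCII, and the sample bytes, including the "honk" message, are made up for illustration. The minimum length of 4 mirrors the default of GNU `strings`.

```python
import re


def printable_strings(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII at least min_length long, like a minimal `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical bytes standing in for a suspicious binary.
sample = b"\x7fELF\x02\x01\x00\x00honk: command not supported\x00\x01\x02ab\x00kubectl\x00"
print(printable_strings(sample))


def strings_in_file(path: str, min_length: int = 4) -> list:
    """Apply the same scan to a file on disk."""
    with open(path, "rb") as handle:
        return printable_strings(handle.read(), min_length)
```

Short fragments such as `ELF` and `ab` are dropped because they are below the minimum length, which keeps the output focused on embedded messages.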
30:50 Investigating the Kubectl Binary
31:58 Okay. Should we look at the backup admin config file? Well, I was hoping we'd have a kubectl package file lying around, which I thought was in /var/cache/apt, but it does not appear to be there. So I'm probably just looking at the wrong location. So let's see. Yeah. It doesn't seem to be there. Okay. So. But we could always just curl, download kubectl from the GitHub release, and use that. Unless Andy's trying to tell us or take us on a journey with his randomly compiled binary. Well. Can we, will say. So sometimes the standard practice is to install stuff under slash
32:48 Finding the Real Kubectl Binary
32:51 opt, as optional software for according to video. Do not look here. Okay. Yeah. Drop in that directory, and we'll take a look. Please, please respect my privacy. Oh, I didn't change the path. I just did it there. Oh. Sorry. Alright. Wait. What's the, wait. And what is that? What does that green represent? I don't know. Green is executable. So. Oh. There's a Linux command we can run, type, which will tell you if it's a binary or script, or we could just open it in Vim. I generally just open, or file. Sorry. I would use file. Yeah.
33:59 You want to Vim? Yeah. I would just open Vim. Yeah. I feel like that's unsafe, but I feel like, alright. Well, we know that we have a, feel like a credibility, but I hope it was safe. So. So I suspect this is either the bad binary or maybe it's a backup that's decoyed as a bad binary and might be a working kubectl. I don't think it's a bad binary. Yeah. I think that's a real kubectl. Just protocol. Okay. Can I do, where was that command? Yeah. Can I export this instead? Is that
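The binary-or-script question above comes down to the first few bytes of the file, which is roughly what the `file` command inspects. The sketch below is a much-reduced illustration, not a replacement for `file`: both function names are invented, and it recognizes only ELF executables and shebang scripts.

```python
def classify_executable(header: bytes) -> str:
    """Classify a file from its leading bytes: ELF binary, shebang script, or unknown."""
    if header.startswith(b"\x7fELF"):
        return "ELF binary"
    if header.startswith(b"#!"):
        # The rest of the first line names the interpreter.
        first_line = header.split(b"\n", 1)[0]
        interpreter = first_line[2:].strip().decode("utf-8", "replace")
        return f"script ({interpreter})"
    return "unknown"


def classify_path(path: str) -> str:
    """Read just enough of a file to classify it."""
    with open(path, "rb") as handle:
        return classify_executable(handle.read(256))


print(classify_executable(b"\x7fELF\x02\x01\x01"))
print(classify_executable(b"#!/bin/bash\necho honk\n"))
```

Reading only the header also avoids opening an unknown executable in an editor, which is the safety concern raised in the conversation.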
34:40 Restoring Working Kubectl
34:49 what I would do? Guessing. Well, we could just alias kubectl to /opt, do-not-look-here, honk, or we can take the sledgehammer approach and move honk into /usr/bin/kubectl. Let's move it. Let's, yeah. I think that that was, if it's getting cut off under our own feet, I think. I'm gonna copy here. It's just /usr/bin on this machine. Oh, okay. Yeah. And, like, I don't like the word honk, though. Alright. Just overwrite kubectl. Like, whoever that other kubectl is, I don't care. Okay. Sledgehammer. Just smash that thing away.
35:30 So we should be able to run a kubectl get pods now. Oh, yeah. This unsafe directory here. Alright. Now we have a broken control plane, which I think we can start to debug. Okay. Should I look here? Let me see. Did you, did you have in these in here? Didn't break anything here. Where do we start? You say control? So, I mean, I think the good thing would be to do an ls -l and check the timestamps. Although they mean nothing there. People normally aren't sneaky enough to hide them. One of them stands out to me that
36:07 Checking Static Manifests Directory
36:23 it may be modified. Okay. Go. That was you dipped? Is that the one that you're suspecting? No. I think the API server. It was modified today. Oh, I'm looking at the wrong date. They all look like '23. Okay. Okay. Let's, take a look. And where's all my comments? Everyone was helping out Andy. Now everyone's quiet. I want I want that's just myself now. I want more because as you could see, I made mine pretty straightforward. Alright. I'll talk more. I was No. No. No. I want the audience to participate. They were like, comment the lore for Andy. I
37:16 don't feel the same love, so it's okay. It's okay. I'll figure it out. I mean, one thing is, how how are static manifests deployed? How are they deployed? And then is there any difference here, David and Teleport? Or it's like So what we've got here is what looks like a standard API server so far. Although, that's sneaky. Why are you deploying your own version of the API server, Andy? You're just testing out a new version, a new feature flag? Well, I mean, have you ever like, you wanna make a change, and then it's just really
37:46 Identifying Modified API Server Image
38:05 hard to deploy that locally on Docker for Desktop. So, oh, sorry. Let me, I think we should do a grep for image across all of these. And we'll see that the standard path or prefix for control plane components is k8s.gcr.io. So let's modify the API server not to use quay.io/aholzman. Okay. I was gonna look at mine to just take that one in case anything, it changed even that. I'm not sure anymore. That's the joy of Clustered. It plays tricks on your mind. Yeah. I'm like, let me just go to
38:28 Correcting API Server Image in Manifest
38:50 make sure. Let me take, I'm gonna just look at mine real quickly. So my eyes are not deceiving me. I'll just check if there are any other red herrings in this file before I replace the image. Oh, there's not. Oh, that was very kind of you to tell us. Yeah. Oh, wait a second. What is he using, 21.one? Is that what he has there? No. I thought it was 20 1 2. Was, I think it's, yeah. Yeah. It is. It's right. Okay. Just making sure that he didn't switch one number on this. Well, I thought about running the 20
39:29 the 1.22 RC. But. What? Just deprecate all the ingress and mutate and webhook namespaces there. That would be nice. Pull out the pods to get this back. Rachel, do you wanna try a kubectl command again, like, pods or something, and see if it's a simple thing? No. Yeah. It doesn't. Oh, we get it just one more time. It's, oh, no. That's, that is correct. You restored that file. So the kubelet is responsible for running this. So we should check the kubelet is running. Anywhere else in here before we, I don't think so. I
39:45 Checking Kubelet Status
40:21 would try systemctl status kubelet. Alright. Create system control. What's, what's it? Kubelet? It's systemctl status kubelet. I think Naveen is, sorry. Naveen is trying to be helpful. Oh. Yeah. We could check the admin config. He could have modified that, but we also do have kubelet errors here. Control play. Mhmm. K. We'll have service. Okay. So. Okay. So our kubelet is running. So we probably wanna check that the API server has been restarted. But I would just use ps with your favorite arguments if you have any. I don't really go
41:32 into this level of depth, so I just gotta, like, I have to look it up every single time, but okay. Sorry. Okay. So I like aux. Just don't ask me what they mean. They're just ingrained now in my memory. But it doesn't look like we have an API server running even though we do have a kubelet running. So the container logs live in /var/log/containers. And I suspect, oh, we're not getting anything here either. That's tricky. That is interesting. So that means one of two things in my head. Either he's modified where static manifests are supposed
42:14 to live and our change was ignored. Mhmm. Be. You gonna say something? Oh, I I was Go ahead, Andy. I'll be honest. I'm surprised as you are that there's no logs there. Alright. Okay. So you haven't changed the static manifest. Alright. I'm gonna restart the Kubelet, which has always caused me problems in the past. But you know what? Doesn't matter. Let's do it. Do we wanna check the admin config file as well afterwards? Yeah. I think that's a good idea too. I'm just there we go. That seems to be doing stuff now. So not I mean, not good
43:03 stuff because we've got error messages, but we have stuff. Do we have any APIs? No. We don't. And we don't have any logs from our API server. What have you done, Andy? Remember that face I made? I if you might might have felt like mind games, but we're at we're at the same point at this at right at this juncture. So this is unexpected behavior? Yeah. Yes. Oh, wonderful. Alright. Alright. We gotta get her thinking out, Sonia, Rachel. Okay. Well, if okay. Let's let's let's go back real quick and just look at the config files as well. Get through
43:57 those out. Yeah. Yeah. That's really, I don't know why I'm, I'm gonna refresh my tab as well. Go. We choose to. You want me to wait? There we go. No. It's okay. Go for it. Okay. So. But I think the only thing I'd really be interested in here is that server URL, which I think is okay. At least the, I think so too. Yeah. Yeah. Okay. Let's, I'm gonna think aloud because I'm not sure what our next step is. So the kubelet is responsible for starting all of our containers. We're not even seeing any log output from
44:41 the API server, which makes me think the kubelet either can't speak to containerd or containerd just doesn't do what we want it to do. So I think we should check the status of containerd using systemctl. That's kinda where my head's at. Okay. I'd also like to point out I already set the runtime socket for you because you have a slight kind even though I don't like it right now. containerd responded, and it's running. There was an error message there though. Oh, it couldn't pull the image. Did I mess that up? It looks
45:36 like the same one that I had. Oh, look at that. It's in the TLS header. I must have messed up the path. Right? I didn't. No. It looks the same as what I had. Are we really unfortunate and GCR is having problems? I don't think so. I'll check from my side. What's the containerd command? ctr image pull. Let's just try docker.io nginx first. Is that just ctr pull? I can't remember. Hey, chat, help us out. crictl pull. Oh, crictl. Okay. Yeah. That pulls an image.
46:46 Okay. So the containerd config file? Where is that? Maybe it's not pulling an image. I see nothing on my screen. Okay. So there's a really cool containerd command called config dump. We don't have a containerd command. That's really interesting. I don't think I've ever had it do that before. Okay. So we can take a look at the drop-ins for containerd. Have you used containerd's plugin alias feature to move the GCI repository, just before I start getting down the wrong path? No. That's already been done. I tried to avoid prior art.
47:59 Okay. Then I'm really confused about this. K. So. That should work. What's that? Is there something you broke with the Teleport? No. That only broke on the worker nodes, but I fixed it. Because I tried to follow the rules of engagement, which made out my own issues. We are in this together now. I'm just being quiet. I'm gonna try and pull this from my local machine before I start debugging the wrong thing because. Might I recommend not pulling the one I created? Yeah. I went to that. Alright. Here is my terminal. Oh, I don't even have Docker open.
48:59 Docker image. Oh, and give Docker a few minutes to start. Yeah. And we can't pull the API server image. We may have to just use Andy's API server. Is it fixable? Oh, I wouldn't do that. It's not going here. So okay. So we're getting TLS errors. Right? What are the things that can lead to that? Are those expected? I thought those were unexpected errors. Sort of, but they're related to other errors. Right. Okay. I had to stop my break when this broke too much. Okay. Well, as those TLS
49:44 errors are expected, then, okay, we can debug that. I thought it was an external factor. So should we hop on the worker node and check our certs or something on there? So if we can't pull the image on this control plane node, then, and I feel like I'm gonna get this wrong so someone can try and correct me, there is a certificates package from Ubuntu which installs all of the root CAs that we can trust. Yep. We can maybe just reinstall that package and see if that fixes it, and failing
50:41 that we may need to dig in a little bit deeper. So it's apt install ca-certificates. ca-cert, is that it? Okay. Yeah. I had to put it in the container I built. And does that work? Oh, that's interesting. Yeah. Ah, there we go. Ah, so you have broken more things. This was not an intentional break. Yeah. Alright. We have no networking. Oh, we have limited networking because Teleport's working. Mhmm. I don't suppose you're a networking expert, Rachel. Nope. Neither am I. Perfect. So Okay. Okay. No.
51:46 Noel in chat suggested that it could be DNS, but we couldn't ping 8.8.8.8. And in fact, we can't even do an iptables -L. But Cilium doesn't use iptables. You better not have any BPF probe on my machine. What did you do, Andy? Oh, man. So now I'm just scanning the process table to see if I can see anything that I don't expect to be on this machine that would be running the eBPF probe. If it's Cilium, we'd maybe have to kill Cilium. I couldn't tell you what that's gonna do.
52:44 Okay. So I think Andy has deployed some sort of Cilium traffic policy to the cluster, then broken the control plane, removing our ability to remove such dodgy policy. Is that close? No? So that's actually a really good theory about what probably happened. However, I will say that this break got away from me. And when you hit the apt install, we are now at the same point in trying to fix this cluster, which is to say, I'm not entirely sure this is recoverable without, like, a kubeadm reset. Well, it will be recoverable. Right? So what
53:39 can we do here? Russell has a good suggestion. I don't know what my from a command line is very, very pure. Is that a command or is that a process? Okay. Let's see what that does. So I think Cilium is responsible for the BPF. I don't think you compiled your own binary and ran it on the machine because I didn't see anything that looked particularly bad, which means I'm gonna kill the Cilium pod. Okay. Cilium was the part that I wanted to break. So that is a good
54:34 so I did set Cilium into a CrashLoopBackOff before I toyed with the API server. Okay. Now it's running. Okay. So Russell suggests, yeah, BPF filter is a running process, and those are normal. They're kernel processes, I think. I don't wanna go destroying it because I don't think removing the probe would block the traffic. I think Cilium would need to exit. One way to find out. Let's stop the kubelet first. Okay. Let's grep for Cilium. We don't have a Cilium. It was in a crash loop. And then the API server went down, so
55:33 there was nothing telling the kubelet to try to redeploy it. But I have a pod running. It doesn't mean there's any container. So okay. Alright. We actually don't have Cilium. Wow. This cluster is screwed. Okay. So we could reboot the machine with the kubelet disabled, which would get us networking back. That's the step I'd do. But as punishment for doing this, Andy would have to tell jokes and entertain the audience until the bare metal machine rebooted. Hey. Luckily, it's a c1.small. These things boot fast. Alright. Actually, I can tell the story of
56:17 my initial break, and maybe that can help us figure out what went wrong with what I did. Alright. So I've disabled the kubelet. I'm gonna reboot. You happy with that plan? Nope. If it was me, it would've just been destroyed already, the cluster, but that's just me. I would've been like, let's just spin up a new one. Yeah. You've got all that VMware power. It's easy to do that. Yep. Alright. I'm rebooting it. Let's go for it. So we've disabled the kubelet, which should give us an opportunity to make sure networking and
56:49 image pulls work before we start it back up again. But do you wanna tell us a story then? So I wanted to approach this with, like, a teachable moment and issues that I've come up against in production. And one of the most frustrating ones is when there's a disk error and the Linux kernel decides to remount the file system as read-only. Now inducing a disk error is actually really difficult, but there's a way to send a message to the kernel to tell it to make the file system read-only, except it's a sledgehammer. It should only be
57:23 used in kernel development, and then I did it. And then I tried to connect to Teleport, and Teleport broke. So I undid it because what I wanted to do was, without breaking all the processes, render etcd in a read-only state. However, that was only a part of it. The next thing I wanted to do was, when you fixed the etcd issue, Cilium would come back, and it would notice that I changed the pod IP spaces. So I would have reassigned the pod CIDR per node, and each node didn't read only.
58:03 So then it tries to recreate all the pods, they get stuck not being able to create the sandbox, and you have to go figure out why those nodes are stuck. But that wasn't going to work because it broke Teleport, and I wanted to follow the rules of engagement. So is it alright if I screen share? Yeah. Go for it if you want. Noel has kindly pointed out in the chat that this is the first ever reboot on Klustered, which is true. Right. Well, what happens when I click this button? Oh, I need to So he's, like, poor
58:51 cluster. What did you do to this poor cluster? What did you do to me, Andy? What did I I gave you something real I gave you some, like you gotta reboot the whole thing. Oh, no. I'm going to have to, will the link you sent me work to rejoin? Because Yes. But if you're on a Mac and there's permissions, you can just enable the permissions. You don't actually need to restart your browser. Okay. So later. Share. Is it working? Yep. Zoe, he he wait. Yes. It is. So what I did was I had an idea
59:41 for creating a read-only API server that, when you make a request to write something, would silently delete the key you just wrote and then tell you the transaction succeeded. Hold on. Hold on. Oh, no. I'm sorry. I just wanted to do this. And so Can you zoom in on the header? Yes. I think. There we go. Right. So what I did is I gave it some time to settle after the API server comes up, and then it would delete every key. So that way, when you try to change the deployment, it would delete the deployment.
1:00:42 And I had verified that working on a local Kubernetes cluster I spun up. And so when I went to move it over, what I needed to do was retag the image and push it to a local Docker registry running on the host. However, it took more than fifteen minutes to figure out how to do that with crictl because that's not a feature it has. So then I went to go install Docker, and Docker wouldn't install because it couldn't resolve the Ubuntu things. And that's when I realized how badly I have broken this cluster.
1:01:24 And so I have this concern that the kubelet triggered a delete of the master node, and it's struggling to reregister itself for some reason. Alright. So that's kind of my working theory. Because what I was going to do, and something I had verified was working, was with dockerd, I was able to change /etc/hosts for k8s.gcr.io and pull my malicious image and make it look like the real image. And the breadcrumb there would be if you did a version, it would have told you the git tree
1:02:19 was dirty. Or if you checked version on the binary itself in the container, or if you looked at the binary and you realized it was, like, over a gig. So there were a couple of breadcrumbs. Because what I wanted to highlight is that dockerd at least doesn't support HSTS, so man-in-the-middle DNS attacks are totally possible for injecting malicious images if you want. So kind of two things. One, a read-only server can create a bad day. Two, DNS attacks really suck. And so I'm not sure why HTTPS isn't working
1:02:58 on containerd because I don't think they support HSTS either. So they're not insisting on HTTPS. It's just not able to dial out, I think, is that error. So if networking comes back, you might be onto something. So far, we don't have a bare metal machine yet. Alright. I'm gonna go to my Equinix console and see where we are in the transcript. You may have clustered our first cluster as well. I was asked to break it, and I almost asked for a new cluster because I thought I broke it too much. Yeah. We said break it, not destroy
1:03:41 it. Get all the earth. You know? Yeah. So silly me for having a rule that the only thing you don't need to do is break teleport, which you didn't ish. K. I you're gonna look at the That's cool, Andy. I mean Okay. So I've got my screen share back. What we can see here is I I can actually SSH on. There's a starting point. Let me drag that up. I don't think we have teleport quite yet. I'm not sure if it's crawling. No. I can't I can't get in there yet. Okay. So let's save teleport.
1:04:24 What have I got running? Oh, Docker. This is why we don't need Docker running. Oh, that fake kubectl binary. All it did was exec the real kubectl binary and then randomly print honk or make it a rainbow printout. So it was just there as a little bit of mental warfare, because it literally just execs that binary with the args and then pipes the information back. So that was supposed to be fun, except I broke it too much, and you just saw the error message in rainbow print. Okay. So here's where we are. We have the machine.
1:05:09 Unfortunately, we can't get Teleport working because it relies on the BGP advertisement from kube-vip. And in order for kube-vip to run, I need to start the kubelet. If I start the kubelet, things are gonna break too. I didn't know kube-vip did BGP. Yes. I really need to make sure Teleport runs on the bare metal IP instead of the BGP one. That's just a rookie mistake by me. So to induce the crash loop for Cilium, since it's Cilium 1.10, I enabled BGP in Cilium. Okay. So I'm not sure what BPF rules that
1:05:53 might have installed. I don't know what to do. I don't either. I think, let's see. Well, I can't get back on to Teleport. Yeah. So, I mean, I can start the kubelet, which will get Teleport running. But the minute Cilium comes back up, it's potentially gonna remove all networking. Is that right? Oh. No. Let's just do it because we don't have any other options right now. Yeah. Because if Cilium's configured, we need etcd to run anyway to be able to fix that. Rachel, I sent you the public IP of
1:06:39 the host so you could SSH if you wanted. Ah, okay. That's sweet. Both your keys aren't on all devices. Okay. Yeah. I'm gonna avoid a kubeadm reset. I mean, I guess we could walk through a kubeadm reset, couldn't we? But then we'd have to bring all the other nodes online, redeploy the workload. That's what I'd have done by now. I'm just like, let's just redeploy everything. Forget fixing it. Yeah. Let's just run minikube. Screw it. That's easier. You know, this is why we got our, you know, the point I just gotta leave the
1:07:28 ping running. And see what happens here. And do you get paid, like, extra to, like, really break stuff? Like, is this something you really see with customers? So I tend to call it chaos engineering, and my manager has told me before that, in order for it to be chaos engineering, I have to know what I'm doing. Mhmm. Okay. Okay. It has been up for thirty-four seconds, so hopefully it has begun to advertise that address and will get us back onto Teleport. Make sure it's not crashing. Okay. Click that one.
1:08:28 According to the staff console, BGP looks up. That is the other one. Oh, yeah. It's the twenty first start log. Okay. That is the other one. Yeah. This is today's it started advertising. We should at the very oh, no. There we go. Restart teleport. Okay. Do we have teleport? Not yet. Okay. That's broken now. Technically, I wasn't the one who broke Teleport this time. I can't speak to you. Restart it. Okay. Teleport is up. Yeah. Alright. I'll have to wait for you to start the session again. Okay. We have a new session. Alright. Let me jump back in. And we
1:09:58 haven't lost networking yet. So we actually might be in the clear. K. It's actually a really good point. You're not allowed to SSH and run an info. So you might wanna check /etc/hosts. Hey, David. I'm not in that session. I don't see this active for me. Hold on. Oh, forgot. Never mind. Hold on. Need to sign in. Ah, okay. Cool. I'm, like, looking at my session. Like, you can continue since we're closing in on time here. Alright. Well, we are trying to pull down our image. It's failing to resolve it with the TLS error, but
1:11:09 we do still have networking. So do you remember when I said I wanted to man-in-the-middle DNS? Yes. How can I do that on the local host? You could just modify. Okay. So that should hopefully pull our API server image. Oh, good. I think you just joined the session there. Right? Yep. Have one. My screen looks a little wonky. Maybe you Yeah. Just give it a wee reset. So while you were joining, we opened the hosts file and saw that Andy was pointing k8s.gcr.io at our local machine. There we go. Yeah. Let me just k.
1:12:14 Restart the kubelet. Oh, no. Restart. Restart everything again. I don't know. Look at that. We finally have an API server. Yay. Alright. Do you wanna try and get pods, get nodes, and actually see if we get a response from this stupid cluster? There we go. Okay. Alright. Do you want to try updating our Klustered deployment to v2? I was gonna just see if it's actually running, the application. Do you wanna check it? Oh, okay. That's a good idea. Before I change anything there. Nope. Internal server error. Okay.
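The hosts-file check described here (a registry hostname pinned to the local machine) can be sketched in a few lines of Python. This is a minimal illustration, not something run in the session: the `REGISTRY_HOSTS` set, the function name, and the sample entries are assumptions made for the example; only the `k8s.gcr.io` hostname comes from the conversation.

```python
import ipaddress

# Hypothetical watch list of registry hostnames; adjust for your own cluster.
REGISTRY_HOSTS = {"k8s.gcr.io", "gcr.io", "docker.io", "registry-1.docker.io"}


def find_registry_overrides(hosts_text, watched=REGISTRY_HOSTS):
    """Return (ip, hostname) pairs where a watched registry name is pinned in hosts-file text."""
    findings = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        try:
            ipaddress.ip_address(ip)  # skip lines that do not start with a valid IP
        except ValueError:
            continue
        for name in names:
            if name.lower() in watched:
                findings.append((ip, name))
    return findings


# Illustrative hosts-file content, not the real file from the broken node.
SAMPLE = """127.0.0.1 localhost
127.0.0.1 k8s.gcr.io  # illustrative hijack entry
"""

if __name__ == "__main__":
    for ip, name in find_registry_overrides(SAMPLE):
        print(f"{name} is pinned to {ip}")
```

On a real node the text would come from reading `/etc/hosts`; any hit for a public registry name is worth a second look before blaming TLS or the registry itself.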
1:13:10 That's where I think I saw no. That's two days ago. Let's see. Yeah. Did you do anything else, Andy? Or can I just work in the manifests now? Yeah. I think we can just work, hopefully, with the Kubernetes API now. Hey, David, could you do me a favor? I put a copy of the kubectl back on there in the root home directory. Could you run that kubectl, so ./kubectl get pods -A, so people could see the fun I had put in there? Is this the broken one? Yes.
1:13:58 Nice. Very nice. And then you still got the honk in there? I see it. Okay. You felt like a Mary. It's hurting my eyes a little bit, but Yeah. That's sad. I'm like, what am I looking at? Okay. So we don't have any CoreDNS. I think that's why our application isn't working. Oh, unknown. There we go. Alright. Let's see. What do you have? Let's see. Yeah. I mean, wait. Did you do all pods in all the namespaces? Everything. Oh, wait. Get. Okay. Our DNS deployment, let's look at that.
1:15:10 I don't think I've ever seen status unknown before. Yeah. And there's our DaemonSet here. Got nothing there. DNS. Pods. Okay. Go ahead. You go? No. I'm just gonna take a look here really quickly. See what's going on. So a hint, as we're butting up closer. I don't know if you remember, but when I was looking at Rachel's cluster, I noticed pods weren't deploying, and I looked at the CNI first. Okay. So I guess you're telling me to look at the CNI first. Is that what you're telling me? No. Okay.
1:16:14 Okay. So I mean, we already know he enabled BGP in the Cilium config. K. Where is the Cilium config, though? Where is that? I think there's a ConfigMap in kube-system. Okay. This one here? Yeah. I think that's the one. We would probably wanna edit that and take a look for that BGP thing, disable it, and then do a rollout of the Cilium pods, I think. Where will it be? I can't see. Where is it disabled? You might just wanna search for BGP. I'm not entirely sure what lives in this project. That's the only one.
1:17:21 Yeah. So he said we need build rate. So we gotta put it in true? I forgot what you're saying. I don't think we wanna put that. Let's see what else we've got. Okay. Post your name. Okay. I'm gonna have to cheat and look at my cluster because I didn't touch these things. So I just wanna take a look at it really quickly, see what's the difference. Just taking a quick look here. So it looks like you added something to Oh, did you turn off IPv6 masquerade? Because you should. IP what'd you say? Sorry. Andy, what'd you
1:18:54 say? The last part? That was true. Right? No. Oh, whoops. Should IPv4 masquerade be true? Oh, yes. I'll just pretend I know what those settings are, but I have no idea. I'm pretty sure IPv6 shouldn't be enabled either. I don't think I put that enabled. In mine, it's oh, let me look at my screen. IPv6 is false on my untouched version. Let's see. Alright. I've copied that. What about IPv4 masquerade?
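The side-by-side comparison against an untouched cluster's Cilium ConfigMap can be sketched as a plain dictionary diff. This is only an illustration under stated assumptions: the key names and values below are placeholders for the example, not a dump of either cluster, and in practice the two dictionaries would come from the `data` section of each cluster's ConfigMap.

```python
def diff_config(healthy, suspect):
    """Return {key: (healthy_value, suspect_value)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(healthy) | set(suspect)):
        a, b = healthy.get(key), suspect.get(key)  # None marks a key missing on one side
        if a != b:
            changes[key] = (a, b)
    return changes


# Illustrative values only; ConfigMap data values are strings, so "true"/"false" are text.
HEALTHY = {"enable-ipv4-masquerade": "true", "enable-ipv6": "false"}
SUSPECT = {
    "enable-ipv4-masquerade": "false",
    "enable-ipv6": "false",
    "extra-setting": "true",  # hypothetical key added by the saboteur
}

if __name__ == "__main__":
    for key, (good, bad) in diff_config(HEALTHY, SUSPECT).items():
        print(f"{key}: healthy={good!r} suspect={bad!r}")
```

Sorting the keys keeps the output stable, which makes it easier to read two long ConfigMaps aloud on a call, as the hosts end up doing here.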
1:19:43 Is that true or false? That is true. And then IPv6 masquerade is true on mine. And let's see. Enable policy default remote. What identity oh, so those look good to me. Oh, wait a minute. Disabled TLS is where is it? Sorry. Oh, I typed it while you were at it. That is false on mine. Well, it's just Hubble stuff. I think we can ignore the Hubble. Okay. Everything else looks I mean, let's jump out of this for a second. Right? Because Okay. We have two running operators. We've got one unhealthy, which I'm assuming this
1:20:50 is this node. Let's pull a log. See if we can see anything. Too noisy. But we could just try a rollout restart. By label here? I would do ds/cilium. You can tell it to do the DaemonSet that's named cilium instead of by label. Let's see if that helps. I have no idea. Okay. But we definitely need one of those Cilium pods to be running. I mean, let's see. So while this comes up, could you run that with my binary? Because there was another Easter egg in there based off how you ran
1:22:05 that. It's not gonna actually do anything. Alright? No. Was that your Easter egg? Yeah. So that when you would try to look for stuff in kube-system, it wouldn't find it. But if you actually take the space out after the -n argument, it will work. Alright. It looks like we have some working Cilium, which I'm happy with. Next problem is we've got broken CoreDNS, which might just need a restart. Okay. Wrong argument, Karen, or unexpected lane. Thanks. Okay. Did you break CoreDNS too? What? Did I continue my DNS malicious activity?
1:23:11 I don't know. Did I? You're still not invited back. So this is our CoreDNS config, which, again, I have no idea what is actually supposed to be in here. But I don't see anything that gives me concern. Is it for me? Ready? Is that IPv6 supposed to be doing anything? Or I thought. Is it I'm guessing. I think this is okay because this is saying Kubernetes plugin. This is the cluster DNS name, cluster.local. Mhmm. So which means if he's not modified that, he's modified this. Noise. I think the image looks good. Yeah. It does.
1:24:37 Is it supposed to be UDP protocol for one of them? Yeah. I think that's right. Yeah. Okay. So the error message was complaining about oh, I'm scatterbrained today. It's complaining about the resolv.conf. But I don't see that mounted. I didn't change the Kubernetes deployment. Right. Well, why am I wasting my time with you? So I mean, it just uses the host's. Right? Does it? Does kube just put the host's in the container? I think it said check that the command are pretty messy. Unless it's using the systemd resolver. So on Ubuntu 20.04,
1:25:40 the systemd resolver is actually it's not exactly a symlink, but systemd is what writes the /etc/resolv.conf. I did not change that. So Noel is saying check the kubelet config file and see which DNS and resolv.conf file it's using? Yep. We can do that. Let's see. It was using the systemd resolver. I guess we should look in that file. My system be Yes. It is. The same as Same as. Special result. Yeah. But that's all host-level settings. How does the container have DNS?
1:26:38 By magic? I don't know. Oh. What a oh. I mean, can we get EsterPod? Let's try that. I don't think I think it's a, isn't it, though? Also, I would double check the CoreDNS ConfigMap. Okay. I'd compare it to another one on that forward line. Do you have a healthy one there, Rachel? Yeah. Let me look at mine. Kubernetes, that's the plugin. cluster.local? Is that I mean, what's this? Maybe it is that other than Okay. Let me look at yours. Let's see where we're at. Okay. So fall through
1:27:41 TLS. So I got minus core errors lambda. I gotta put it side by side here. Last time someone modified this ConfigMap, they gave me a Unicode lookalike character, yeah, that. And then the next line, one line. Yeah. Let me just I'll copy and paste it here. Well, let's see. I'd say that looks better. But it's just We should run. Right. So in DNS, that would say send everything through the resolv.conf. Right? Because of how the canonical naming works from subdomain to domain to TLD,
1:28:48 the dot indicates send everything through that. But that's just how CoreDNS handles the DNS requests out of the cluster for forwarding connections. So the max connection is an extra parameter you don't need. I would save that. Alright. Let's save our doc. Okay. Oh, I think we're both trying to save. Oh, I'm not touching it. Now you could do a rollout of the deployment. Oh, that's so fun. No, Russell. The Unicode break was not awesome. So not awesome. That guy, actually he's from Glasgow as well. He's like, is that really the right thing?
1:29:51 And I'm looking at it side by side with a working CoreDNS config and I'm like, yeah. Unicode. Yay. CoreDNS. Okay. You wanna just take one more look and see if anything works there? No? It's trying. Wait. Did we look at any wait. Were there others oh, come on. Oh, failed to connect to the database. But now our Postgres service may not have any endpoints? No. You have endpoints. Maybe we just have to roll out our Klustered deployment to get the new resolv.conf or something. Okay. Looks I mean, what did you let's see.
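A Unicode lookalike hidden in a config file is hard to spot by eye, which is why the side-by-side comparison was needed. A small scan for non-ASCII characters can surface it directly. The sketch below is an assumption-laden illustration: the sample Corefile fragment and the swapped character (a Cyrillic "o") are invented for the example, since the transcript does not say which character was used.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for each non-ASCII character in text."""
    results = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                results.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return results


# Illustrative Corefile fragment: the "o" in "forward" is U+043E, a Cyrillic lookalike.
SAMPLE = ".:53 {\n    f\u043erward . /etc/resolv.conf\n}\n"

if __name__ == "__main__":
    for line_no, col, ch, name in find_non_ascii(SAMPLE):
        print(f"line {line_no}, column {col}: {ch!r} is {name}")
```

A plain ASCII Corefile should produce no findings at all, so any hit in a ConfigMap like this one is worth inspecting.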
1:30:55 So let's oops. Oh, wait a minute. Is it frozen for me? Hold on. I think we can just do a Okay. No. It was just stuck. My screen got stuck. Okay. Come on. Give me one. We'll just ignore that one. It's been terminating for two days. I still haven't seen that movie, actually. You've not seen that? That's good. No. No. So what's the end of that error telling you? Temporary failure in name resolution. What controls name resolution in the container? Thanks. So can you go into the Docker container? Can
1:32:01 you exec into that pod? Maybe. So if you do deploy/ yeah. You don't have to anyway. Okay. The vein saying net pool. What's this? Oh, ping won't work. Other reasons. Yeah. Service IP. Because right. Which actually, during this week's hackathon, we wrote a DaemonSet to fix that. So I don't wanna have to remember the name of the package to install dig. Do we have DNS resolution? No. No? Well, what's the IP address for DNS in the cluster? 10096. What's the name server that that one's using?
1:33:09 Yeah. Let's just get services. And 8 to 60. Oh, yeah. 32. So what puts that 32 in there? Nope. Did you not modify the deployment? Not at all. Do we have a DaemonSet? What actually creates the container? containerd? And who provides the instructions for that? The kubelet. Have you modified the kubelet config to pass it a different DNS resolver? I don't know. Have I? Alright. Sorry, RJ. You could type. I have tomatoes for throwing, by the way. If this wasn't virtual, I'd be throwing them. The
1:34:12 kubelet config is not in /etc. Yes. It's in /var/lib/kubelet. Let's see. So yeah. var what is it? var what? lib/kubelet. Lip or lib? L I B. L I B. Oh, library. Okay. It's just my funny accent. Okay. And then what was it? kubelet. And then there should be a config.yaml. That's the DNS is fine. So he could be overriding it. There's two other hosts in the cluster, and CoreDNS doesn't run on the master. I always make that mistake when I'm debugging this and don't jump onto the workers. I get
1:35:08 too comfortable in the control plane. So we get it. Alright. Do you wanna pick one each? I'll grab worker one, you grab worker two. We'll check the kubelet config. Alright. Sounds like a good idea to me. Grab worker two. Alright. Hopefully, you have better luck than I do getting onto that. Let me see. What? Okay. That's not expected. Uh-huh. Enough of your games. Did I do something wrong? I don't think so. Can you try to touch a file? No. I can't touch a file. Oh, okay. I don't think I did that, which makes this really exciting
1:36:40 because this is what I wanted to do all along. You have given me a read-only root file system. This happened on its own, I swear. I'm so happy. I was forcing this to happen, but I recovered it every time because that broke Teleport. But it seems like the nonsense I pulled has actually caused a file error. How do I change the option? So I can't remember how to remount read-write. I thought it was -o equals. No. So -o remount,rw, and then you need to provide the device.
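Before reaching for the remount, it can help to confirm which file systems the kernel has actually flipped to read-only. The sketch below parses text in the `/proc/mounts` format and lists read-only mounts. It is a minimal illustration: the sample lines are made up for the example (the `md126` device name echoes the one read out on the call), and the function name is hypothetical.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose option list contains 'ro'.

    Expects text in /proc/mounts format: device, mount point, fs type, options, dump, pass.
    """
    ro = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        if "ro" in options.split(","):  # exact option match, so "errors=remount-ro" is ignored
            ro.append((device, mount_point))
    return ro


# Illustrative mount table, not captured from the worker node.
SAMPLE = """/dev/md126 / ext4 ro,relatime,errors=remount-ro 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    for device, mount_point in read_only_mounts(SAMPLE):
        print(f"{mount_point} on {device} is mounted read-only")
```

On a live node the input would be the contents of `/proc/mounts`. As the speakers describe, the remount itself takes the options first (`-o remount,rw`), then the device, then the mount point.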
1:37:23 So it goes device then mount point. md126. What are they? md126. Yeah. It's write-protected. I didn't do this, but I'm so happy because this induces the original error I wanted. I'm getting home. Do you know how to fix this, Andy? Let me try something. I know we've put in another machine. But it would explain why I couldn't Teleport onto the session. Right. Because Teleport does that audit log, which writes a file. -t ext4 LABEL=root. Hey, David. Real quickly. Worker two. Could you change the cluster DNS
1:38:37 I guess, deleted. You have to restart the kubelet also. Oh, I do not see what you changed this to. Was it 96 or 98? 96. 96. Oh, I know what to do. Yeah. 10.96.0.10. That should be okay there. Okay. Okay. On worker one, do fsck.ext4. Yep. It should be the device. Alright. Now try to reboot. Oh. Yes. Yes. Yes. K. That one's dead. But worker two is okay, so let's just get our workload running there. Yeah. You could cordon that node off. I could set it on fire. If
1:39:59 you actually look at dmesg, you can see where I forced it to mount as read-only, and then it did it on its own a couple thousand seconds later. Okay. Worker one is cordoned. Rachel fixed the kubelet configuration, restarted the kubelet. Yep. Alright. I think that's correct. Which means we should now be in a position Where oh, that might work. You need the pod to restart so that it picks up. Otherwise, the kubelet won't change the resolv.conf. Yeah. So if you try now because if those are in terminating state, they're not in the pool. Alright. We have v1.
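The last break came down to pods being handed a nameserver that did not match the cluster DNS service address. A quick way to confirm that mismatch is to compare the `nameserver` lines in a pod's `resolv.conf` against the expected service IP. The sketch below is illustrative: `10.96.0.10` is the address the hosts settle on, while the wrong address in the sample and the function names are assumptions for the example.

```python
def nameservers(resolv_conf_text):
    """Return the IPs listed on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def dns_mismatch(resolv_conf_text, cluster_dns_ip):
    """Return nameservers in the pod's resolv.conf that differ from the cluster DNS service IP."""
    return [s for s in nameservers(resolv_conf_text) if s != cluster_dns_ip]


# Illustrative pod resolv.conf; the wrong nameserver value is hypothetical.
POD_RESOLV_CONF = """search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.32
options ndots:5
"""

if __name__ == "__main__":
    wrong = dns_mismatch(POD_RESOLV_CONF, "10.96.0.10")
    if wrong:
        print(f"pod is using {wrong}, expected 10.96.0.10")
```

Because the kubelet writes this file when the pod sandbox is created, a fixed kubelet config only shows up after the pod restarts, which matches what the hosts observe.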
1:40:58 Okay. So Alright. Let's edit our deploy. Okay. There's only one deploy. And let's see. I'm gonna need a change in career. I think farming looks nice and simple. It's just v2. Right? David, something else in there? Or maybe a flower arranger. That would be quite peaceful. Okay. Nine seconds. Now it's over. I think I still have the old one, but I think that's just because of my browser cache, even though I'm telling it to do a hard refresh. There we go. There we go. Is that it? Well, that was painful. When networking stopped working, I stopped trying to,
1:42:17 like, write the steps to fix my cluster. So, yeah, well done. I wasn't sure if that was recoverable because what I was worried about is that when I restarted the API server, my malicious one deleted the node from etcd. I thought I had corrupted etcd, which is another thing I know you hate people doing. Can't stand it. Because whenever it goes to networking or etcd, like, all the tooling and experience I've got is now useless, and that's the challenge. But at the same time, it's episodes like this where, although it's a bit of a challenge,
1:42:56 you learn all these new little tools and techniques and other things. And that's what's important. Yeah. I mean, I don't think I'll be having a beer with Andy anytime in the next couple of years, but I'm just joking. That was a tough break. But Yeah. You made us work for that one. Thank you. Thanks, Andy. I learned a lot. That was cool. I might have to physically break your cluster next time I see you in person. I'm just gonna put the hammer, like but it was cool. Learned a lot on that
1:43:28 one. Yeah. Yeah. No. Your break I think a lot of people like myself look at the old episodes and go, oh, how can I be more creative? And I like the simplicity in yours because it's like, how well do you know your Kubernetes? Can you just go through, like, from pod creation to service creation to ingress creation? Make sure it's all wired up, because, like you said, it's a lot of things developers that I work with internally need help with just because, like, there are moving parts and Kubernetes forces you into
1:43:59 a certain deployment model. And so if you don't remember what that deployment model is, it's really easy to forget it. So I think I just have lots of experience kind of going through those primitives. Yeah. I think it's two breaks from different perspectives. Mine was a developer break. Yours is more like a platform engineer break. So we got, like, breaks from both perspectives. Yeah. Very well done, both of you. It was an absolute pleasure working through these clusters. So thank you for taking time out of your week for breaking them, joining me and sharing
1:44:35 your knowledge. I also wanna thank Teleport for sponsoring the show. Please check out Rawkode.liveteleport to support the show. I am gonna go and lie down for a bit, but again, thank you, everyone. Thank you for watching. Klustered will be back next week. Have a wonderful weekend. Bye.