About this video
What You'll Learn
- Troubleshoot Kubernetes control-plane failures by reviewing API server messages for webhook and scheduling-related breakage.
- Recover a broken cluster by checking replica set and node controller behavior after file or binary tampering.
- Diagnose and clear network faults by testing DNS, flushing tc traffic controls, and validating iptables plus NodePort access.
Alex Jones and Alistair Hay break each other's Kubernetes clusters. Alex hunts a disabled replica set controller, a blocking NetworkPolicy, and a tampered cluster DNS IP. Alistair unwinds tc traffic-control delays, a rogue kubectl alias, and dropping iptables rules.
Jump to a chapter
- 2:04 Introduction & Welcome
- 2:24 Sponsor Thank You (Equinix Metal, Teleport)
- 3:00 Guest Introductions (Alex Jones, Alistair Hay)
- 4:30 Alex Fixes Alistair's Cluster (Challenge 1 Starts)
- 5:12 Initial Host Checks (Files, Aliases)
- 6:44 Testing Kubectl
- 10:04 API Server Logs (Webhook Issue)
- 10:41 Fixing Admission Webhook
- 11:00 Scheduling Issues Begin
- 13:06 Debugging Scheduling (Simple Pod Test)
- 14:13 Checking Journal/System Logs
- 15:00 Debugging Controller Manager (Replica Set Disabled)
- 17:24 Fixing Replica Set Controller
- 17:47 Connectivity Issues (Curl Refused)
- 18:00 Debugging Service Connectivity (NodePort)
- 28:40 Testing Connectivity from Worker
- 32:21 Checking CNI & Network Policies
- 33:08 Identifying Network Policy Block
- 33:25 Fixing Network Policy
- 33:51 Connectivity Restored
- 41:11 Challenge 1 Complete & Debrief
- 43:00 Alistair Fixes Alex's Cluster (Challenge 2 Starts)
- 44:04 Initial Checks on Slow Machine
- 45:24 Debugging System Slowness (df, htop)
- 50:28 Checking Sysctl Parameters
- 55:01 Hint: Traffic Control
- 56:51 Identifying Traffic Control Delays
- 57:59 Fixing Traffic Control Delays
- 58:32 Slowness Fixed, Kubectl Issues
- 1:01:03 Debugging Kubectl Command
- 1:01:33 Identifying Kubectl Alias
- 1:02:07 Fixing Kubectl Alias
- 1:03:16 Pods Stuck Terminating / Worker Unhealthy
- 1:05:41 Debugging Node/Kubelet Health
- 1:07:06 App Pod Image Issue (v1 vs v2)
- 1:07:24 Fixing App Deployment Image
- 1:07:58 Application Fixed
- 1:08:02 Challenge 2 Complete & Debrief
- 1:12:19 Discussion & Reflections
- 1:14:31 Thank You & Wrap-up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:04 Introduction & Welcome
2:04 Hello, and welcome back to the Rawkode Academy. My name is David Flanagan, although you may know me from across the Internet as Rawkode. And today, I am your host for another episode of Klustered. Today, we are joined by Alex Jones and Alistair Hay, who will be doing their best to fix clusters broken by the other. So before we begin, I just wanna say thank you to Equinix Metal. They provide all of the hardware that we use for Klustered. I over-provision because it's more fun and I enjoy it, but we have tons of cores, tons of RAM, and two very broken
2:24 Sponsor Thank You (Equinix Metal, Teleport)
2:34 machines for each other to fix. And also thank you to Teleport who have been supporting Klustered for a long time now, almost since the very first episode. So, you know, appreciate it. There are links in the description below which will get you $200 of free credits with Equinix Metal and some more information on Teleport. So go and click away and enjoy. But don't click until after the episode because we've got some clusters to fix first. Now my guests are Alex and Alistair. You're now on screen. Hello, and welcome. Hey there. Pleasure to be here. My name is
3:00 Guest Introductions (Alex Jones, Alistair Hay)
3:09 Alex, as you know. I thought I'd introduce myself and just say that I'm super excited because I think I've got myself into the deep end, but I'm excited to to sort of compete in a friendly way against Alistair. So, yeah, looking forward to it. Awesome. Thank you for sharing. Alright. Yeah. Thanks thanks both. I'm Alistair. Yeah. Looking forward to, what would be the the way to say this, making a fool of myself, floundering around and mistyping all the commands like we do when everyone's watching. So looking forward to getting stuck in and seeing what's in
3:40 store for for us in the episode. Thanks. Well, that's half the fun. Right? Like, once it's there's something very weird about it, but the minute there's a camera on your face or an audience and sat in front of you on a bunch of chairs, it's like everything that you thought you knew just disappears. Some sort of, I don't know, live stage effect. However, you you were on briefly before. Right? You did Rawkode Yes. Rawkode where I set you to all of the film miserably. So if we've already seen you, you look stupid, that's fine. Yeah. That's true.
4:15 Alright. Awesome. Thank you everyone in the comments. Feel free to say hello and support. And remember, your comments are there. You're gonna see things that we don't. But whenever you see something that we miss, throw it straight into the comment section and I will relay that on to our two contestants today. We are gonna start with Alex fixing Alistair's cluster. So if you could please open a session on Alex Jones control plane one, I will share my screen and join, and we will kick things off. So I should join on my control plane one? Oh, sorry. Alistair Hay control plane one.
4:30 Alex Fixes Alistair's Cluster (Challenge 1 Starts)
4:48 I was worried there for a minute. I didn't break that one. Okay. Cool. Can you see it? I do indeed. So I'm hitting join. I'm just joining first so that I don't flash your IP address to the world. Cool. Alright. We can now see everything you type. Feel free to export your kubeconfig and take it away. Best of luck. Yeah. I think probably what I'm gonna do is reverse search on what files have been modified on the host OS before I do anything else. Sneaky sneaky. Oops. Sorry. It's dasher. That's it. K. So let's go from top to bottom.
5:12 Initial Host Checks (Files, Aliases)
5:28 Otherwise, put it into a file. There's quite a lot of files on this computer, and I'm going through every single one of them. The hard part is also the machines are spun up on demand, and everything has been modified in the last forty-eight hours. Yeah. But it'll be in order. True. Let's have a look what we get. So okay. So file paths, systemd. Yeah. Okay. Nothing interesting. K. Well, let's see what kubectl points to before we run it. Let's check if that's your binary. Looking like someone who has seen an episode
6:11 before. Yeah. I've worked. Yeah. Exactly. Okay. Great. Or since you've shipped before, either way, we'll fork. Yeah. And I think it's a combination of both. That's the rc file. Okay. Let's check the aliases. Okay. I don't think you trusted you, Alistair. Yeah. Yeah. So the oops. Saying nothing. Okay. Let's try and just see if we can connect it to the cluster. Oops. Sorry. It's keep it complete. Keep it complete. There we go. Okay. Let's have a look. admin.conf. K. So I can see the cluster. That's great. Let's export that like you did.
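The first host check described here, listing what has been modified recently before trusting anything else, can be sketched in a few lines. This is only an illustration: the function name `recently_modified` and the 48-hour default are assumptions made for the example, not the command that was typed on screen.

```python
import os
import time


def recently_modified(root, max_age_hours=48.0, now=None):
    """Return (mtime, path) pairs for files under root changed within the window, newest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that a dangling symlink does not abort the walk
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue
            if mtime >= cutoff:
                hits.append((mtime, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Hypothetical starting point; /etc is where shell rc files and Kubernetes manifests usually live.
    for mtime, path in recently_modified("/etc", max_age_hours=48)[:20]:
        print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)), path)
```

As noted in the conversation, on machines provisioned shortly before the stream almost everything falls inside the window, so the ordering matters more than the cutoff.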
6:44 Testing Kubectl
7:19 Yep. Finished. Just grab that. Put that in here. Okay. Okay. This is the thing here. This one. Right? Clustered in the default namespace. Let's make my life easier. I've only got so many minutes to live. So let's go on my other tab. I'm gonna grab webinstall. We're gonna webinstall k9s because I've got stuff to do. So I'm gonna do that. Grab that. There we go. Webinstall is a really great project. If you've used it, they got sorry, webi.sh, I think it is. The key works really hard to make a lot of stuff
7:55 compatible apart from when it doesn't run. You have to update the export path, it looks like. Yeah. So that should be fine. Right? Here we go. Okay. Let's jump into this and have a look. In fact, before we do that, let's just see if any of the nodes had scheduling turned off on them because that's kind of an easy thing to do. So there's NoSchedule on the worker plane. That one's got a taint on it. That one the worker also has NoSchedule slash oops. I don't think that comes with Kubernetes.
8:40 Let's remove that taint. Okay. Oops. Let's get rid of the case. So let's remove that taint off of there. I'm not sure I can leave a spec object. Oh, It's not got a role, but that should be fine. I'm not scheduling yet, so let's go and have a look at what else could be broken. CoreDNS running. Is there a Corefile? I'm being overly cautious, but it's sort of covering the. Is there a Corefile? Because sometimes people mess with DNS. Yep. That's the default Corefile. Okay. K. Gosh. It doesn't show up very well. Hope
9:39 you all can see it. Yeah. It looks okay. I'm just saved. And it's v two. Right? It's the image. Indeed. Alright. Let's see what goes wrong. Okay. So nothing's happening. Let's have a look at the API server. Okay. So failed calling webhook. Default gates. This is a warning on the dispatcher. Let's have a look. These are the case in part. Okay. Let's see. Admission webhooks we've got. Let's go back and have another look. So we should have a type. I'm validating. Yeah. Validating. That's what I want. Fault takes webhooks on validatingexample.com. I can just delete this.
10:41 Fixing Admission Webhook
10:41 I don't need a validating webhook. Oh, that hard work. Delete it. I don't understand. Yeah. So let's just try and delete this pod to see if it will recreate. So the pod's gone. Have a look what's going on in the deployment. I'm just gonna resize this because I can't quite see half of the screen. Okay. It's gonna have replicas. It's a replica set. So you've got the old one. Just purge that. Current replica set. K. So we have an issue. Oh, we don't have any issues. This is trace garbage. Okay. Let's let's
11:00 Scheduling Issues Begin
11:41 go deeper. So okay. So let's look at what's going on here. You've got a bunch of pods here. This is your crictl. I think we can also look at the current container drilling or current container trying to schedule. My server is running. CTL steps. In a. Alright. Let's go back and figure out what is going on. So our pod is in scheduling. Is there a placement issue? Let's have a look. Server. Okay. I need to get a better view of this log. It's not very useful looking at it in k9s. It's actually like that.
12:44 There we go. I got some stuff coming out here. That's a while ago. Request too large. This is the entire API server log, so I regret doing that now. We'll be here a little while. So there we go. So the last stuff we have is the webhooks, and then we can see now that we can't schedule. So if I was to debug this, first principles dictate that you probably want to try and replicate it in a simpler example. So what we're gonna do is we are going to grab the simplest pod on the web, which is the
13:06 Debugging Scheduling (Simple Pod Test)
13:19 NGINX example. We're gonna grab that. Paste that in. Oops. No. Not because click. Okay. Okay. So the deployment creates so we know that is working. Do we? Yes. We do. So is working. So the actual deployment's being created. It's admitting it. We know that the replica set is being created, but we don't have the pods being scheduled. Typically, I would expect to see an issue with the pods in a pending phase. So let's figure out what we've done wrong or what's not working here. Sometimes what can be useful as well is
14:13 Checking Journal/System Logs
14:15 to look directly at the journal. Not this time, perhaps. Let's have a look at x e f f. Is that right? Okay. So we've got our webhook issue here, and now we're seeing a direct response from logging.go on the go restful. Okay. That's not particularly useful. So what we can try doing is to see if we're getting anything in directly to the syslog. Okay. So this is from Teleport, I think. Yep. What else we got? We've got against teleport teleport. Yeah. Nothing really going to syslog. I've got here we go. Fail creating default webhook. Yeah. That's not what
15:00 Debugging Controller Manager (Replica Set Disabled)
15:15 I want. Quite a bind. Okay. Let's get we I mean, we can do I'm going to and see what's going on, but I don't really want to have to do that just yet. So what else we've got running here? Maybe we've got something deleting things. Got the controller manager. So what's responsible for creating a pod after we have a replica set? So I think that we have the controller manager to start scheduling placement. What would your inclination be here? My inclination is that he has changed the parameters on the API server and disabled the pod controller.
16:01 Okay. Let's have a look. Oops. On the controller manager, not the API server. Sorry. Yeah. Sorry. What am I not seeing? Okay. So controllers, you've got bootstrapsigner, tokencleaner, minus replicaset. Can you put a minus in front of it? Does that actually omit it? It disables the replica set controller, which won't create pods. Yeah. Yeah. I thought let's have a look. There is a doc for this, isn't there? I just wanna check for it. I thought that it's a sequence so that the asterisk wildcard will be overwritten by the last argument in
16:57 the line. I think that might be how it works. It's like gets expanded and then everybody Yeah. In order. Yeah. Okay. God. What's it called? Kube. It's not kube, but it keeps the controller manager. It should this shows how It's the kubelet. The kubelet will start the controller manager as a container. Yeah. It just shows off now if you do this. Okay. It's not that It will probably pick it up, that change, because it's in the manifest file. It can take up to four But if you restart it, it speeds it up. Hey. We have pods.
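The question raised here, whether a minus in front of a controller name really disables it and whether the position of the asterisk matters, can be modelled with a small sketch. It follows the documented meaning of the kube-controller-manager `--controllers` flag as far as I understand it (`*` enables all on-by-default controllers, `foo` enables one, `-foo` disables one); the function name `controller_enabled` is made up for the example and this is not the upstream Go code.

```python
def controller_enabled(name, controllers, disabled_by_default=frozenset()):
    """Decide whether one controller runs, given the parsed --controllers list."""
    has_star = False
    for entry in controllers:
        if entry == name:
            return True          # explicitly enabled
        if entry == "-" + name:
            return False         # explicitly disabled
        if entry == "*":
            has_star = True      # wildcard seen, but explicit entries still win
    if not has_star:
        return False             # no wildcard: nothing is on unless named
    return name not in disabled_by_default


# The flag value seen on the broken cluster, as read out in the session.
broken = "*,bootstrapsigner,tokencleaner,-replicaset".split(",")
print(controller_enabled("replicaset", broken))   # False
print(controller_enabled("deployment", broken))   # True
```

In this model an explicit `-replicaset` wins over `*` wherever it appears in the list, which matches what happened on screen: deployments and replica sets were admitted, but no pods were created until the entry was removed from the static pod manifest.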
17:47 Connectivity Issues (Curl Refused)
17:48 Yeah. And then we have NGINX, and then we have the new version deployed. I think that's it. We're good? I can't test it. I mean, I could download the on localhost on port 30000. Okay. Okay. Let's do that. So let's go to kubectl get pods. kubectl port-forward. Oh, you don't have to port forward. That's a node port service configured. Oh, okay. Okay. So you can probably get service curl localhost 30000. Oops. That's yeah. We have more breaks, I think. Okay. Is the image correct? Seems so. Yeah. It looks correct.
18:00 Debugging Service Connectivity (NodePort)
18:36 Maybe the node port's not working. Let's see where we are. Actually, before we do that, let's just look at the okay. Let's just have to write down. Let's just do an iptables flush. That's any Not even looking at the iptables. It's just flushing them to hell. Yeah. I do this in production. It always works. No implications of deleting all the iptables chains. Okay. Let's flush the iptables rules. Let's check the service. So node port 30000. Okay. So cluster IP 30000 binds to 666. That looks kind of malicious. That is correct. I'm sorry. That is the
19:28 real Okay. Okay. Okay. Sometimes my key binds don't seem to work. There is no there's no port on the on the deployment. Is that the is that the issue? Okay. I I I do some Kubernetes training, and I always talk about the ports on the the pod spec, and there is the it doesn't do anything. That allows you to name a port. Everything other than that, it's documentation. There's no We're not need to we're not need to define the container ports. Nope. It doesn't do it anything at all. Anyway. It's perchless. YOLO. Okay. Anyway. Yeah. So at the same time, it's not
20:17 You need to fix your dash, because the dash was stolen on image. Okay. Okay. Well, I'll just leave it then. Everyone's gonna do this. Okay. Okay. So that's not it. Okay. Let's try that again. Okay. So it's still blocked. So there's two reasons for the it may not be that it's blocked. It could be that it's timing out, and it could be the cluster of the application speaks to a Postgres service. It could be that Postgres doesn't happen. Okay. Okay. I don't know enough about the application. That's true. It's Postgres running as a stateful set.
20:46 I can sort of see it. You're managing, like, running PG bouncer on this, are you? No. App info. No. App info. I should know this. I thought I told you it's installed. I thought that's I considered doing this, but I thought it was too evil. Okay. Okay. So not PG bouncer. So we could try and figure out what is going on. Now, firstly, let's just check that everything's good with curl. Yeah. So we've got we've got curl working. We could then also then try to see if it's an issue with the node port. So if we do get the service and
21:31 we do a direct port forward and we put that in background actually, I might live to regret that. I can't open multiple terminals up here, can I, very easily? Is that okay if I connect you in another session? Or Yeah. Go. Feel free. Okay. I'll just do that so that I'll open up the session briefly, and then I'll close it again. I'm just gonna validate what's going on here. Yeah. Okay. Okay. So I've been mocked in the comments anyway from Russell who's saying, you could check the application logs. But, unfortunately, I've been meaning to add logging
22:04 for the application for many, many Yeah. I didn't see any logs. I should think that. I should Let's go and have a look. I don't see. So this port for that locate 666. In another tab, I'm just gonna complete curl that. Yes. It's still not working. Okay. So let's think about what's going on. We can't see anything coming out of the application. Well, if you do a curl with verbose logging, does it show connection accepted or connection pending? Let's have a look. K. Looks like A curl -v should show
22:55 isn't it? That's just trace? You can just trace it. Yeah. Yeah. It's it looks like it's hanging. What's going on here? Hang on. What's what is this garbage? Okay. Cluster. Not garbage. No. No. No. No. No. I I just see I'm I'm just I'll show you what I was looking at, actually. It's going local host. I think it's 30,000 there. So we're seeing that it's connection refused. Yeah. Okay. Yeah. 100% refused. So we you know, I again, I I I kinda work from what I know, and I think that's kind of a good strategy in
23:41 life, generally. And so I would probably validate this by, on my NGINX deployment, opening up that node port because what we can determine from that is, has it been broken on the entire machine, or is it specific to this deployment? So where did our NGINX go? I didn't roll out a service with that. I'm gonna have to shamelessly steal one. So glamorously on the other side of the page, what I'm doing is going to kubernetes.io and just grabbing the default deployment and service. You could just use kubectl expose as a command as well.
24:14 Do you know what? I have probably used that four times in my life. Let's do it. I honestly never ever use it. Let's have a look. So let's see if the API is in the same. So kubectl expose has an example. Expose, replica set. Right? What's RC there? Okay. Port eight. Alright. Let's try that. Assuming that RC is oh, yeah. Replication controller. Okay. So let's do deployment. And the name might think for us. Let's just double check. Yep. Was nginx-deployment. Okay. nginx-deployment. And I think target port is 666, and the
25:02 port is I don't know. I was just checking my NGINX example so I can validate whether it's another issue. Hold on. I'm just checking what port it comes on. Think it's 80, isn't it? Port 80. Oh, it's gonna be a screen. Come on. Yeah. Target port. Your screen's wider than mine. Okay. Let's go with that. Okay. Let's now curl it. Little v. Okay. So we've got connection refused to that. You just target port on your expose. That should be 80. Oh, I'm sorry. Yes. You're right. And yeah. Yeah. Chicken and egg problem.
25:43 K. I prefer it locally if it wasn't bound to the HTTP port. Okay. Interesting. So if I have that right, what that tells me is that node ports aren't working in the cluster. What I can do to verify that, again, is to do a port forward and check if the so then I'll get I've seen the service type. It's just ClusterIP right now. Wanted to get a NodePort, didn't I? Okay. I've done that slightly incorrectly. Okay. Oh, now your node port is 32,000 something. Oh, yeah. Yeah. Into it. Oh, yeah. And I'm also
26:54 on the wrong it's targeting the wrong port on the deployment. So you get when you do something in a rush. I don't think the target port's necessary anymore, is it? I could delete that. Well, the port is a service port for cluster IP. The target port defaults to the normal port, and the target port could be different. So services are fun. Yeah. There we go. That should be correct. Can you run kubectl get svc,ep? I like to call that there. The Which sorry. Which one? Delete the
27:34 slash into comma ep. ep. Uh-huh. For endpoint. Yeah. There we go. It looks good. Yeah. It should be working. I like that. Yeah. Okay. So that's been blocked as well. Okay. No. You used the wrong port again. Remember, it's 32428. 3 2 4 2 8. But still blocked. Hanging as well. Okay. We'd already looks I've already dumped the table. It's not gonna be that. Could be the CNI. Let's think about that maybe, sort of. It could be let's have a look at host name. Let's look at /etc/hosts. Local host is set.
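The exchange about `port`, `targetPort`, and the node port is easier to follow with the three values side by side. The sketch below models one entry of a Service's `ports` list as a plain dictionary; `resolve_ports` is a hypothetical helper, and the 30000 to 32767 range is the Kubernetes default for node ports, which a cluster may have reconfigured.

```python
NODE_PORT_RANGE = range(30000, 32768)  # default NodePort range; clusters can override it


def resolve_ports(port_spec):
    """Fill in the defaults for one Service port entry given as a dict."""
    port = port_spec["port"]
    resolved = {
        "port": port,
        # targetPort falls back to the service port when it is not set
        "targetPort": port_spec.get("targetPort", port),
    }
    node_port = port_spec.get("nodePort")
    if node_port is not None:
        if node_port not in NODE_PORT_RANGE:
            raise ValueError(f"nodePort {node_port} is outside the default range")
        resolved["nodePort"] = node_port
    return resolved


# The NGINX test service once corrected: traffic to node port 32428 ends up on container port 80.
print(resolve_ports({"port": 80, "nodePort": 32428}))
# The first attempt pointed targetPort at 666, where NGINX is not listening.
print(resolve_ports({"port": 80, "targetPort": 666, "nodePort": 32428}))
```

This is why the first `curl` against the test service was refused: the node port was fine, but the target port did not match anything NGINX listens on.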
28:38 You wanna try it from a worker node? We could do that. Yep. Let's so shall I do I'll open a worker node session, and you can join that one. Is that what you Yeah. Want you to Okay. So let me just open one. Okay. Okay. So Just try curl localhost 32428. Hey. How did you configure kubectl in that? On this node? Yeah. We just exported it very quickly. But there shouldn't be an admin.conf from that machine. Oh, you're using a kubelet service account. Right. Okay. That's okay, isn't it? Yeah. Sure. It's not
28:40 Testing Connectivity from Worker
29:40 Oh, so oh, that sounds like something rules, is it? I mean, I can't You could create these things for sure, so why not? So, yeah, try the curl localhost 32428. Just because I need to know. 3248, was it? 428. Yeah. There you go. Okay. It works there. So try 30000? It's been set on that machine. No. There's still something broken with that one. But your NGINX one did work. Interesting. Okay. So I'm just thinking if we go let's have a look what's actually Got about fifteen minutes left. Okay. Time goes quick, doesn't it?
30:31 Alex has dropped a lot of truth bombs on us. A lot of things I'd never thought to look at. Yeah. Your camera's gone off. Haven't fixed it yet. Haven't fixed it yet. So we'll see. I mean, it could be a proxy configuration. It could be IPv6 being forced. I don't know. It could be a lot of different things. Can you prod your camera, Alex? I think it's Sorry. Yeah. This is what I'm talking about, my camera. Sorry. Let's try echo. Let's do an env. Is it an HTTP proxy? Mean, probably not.
30:59 Nope. That's not an echo now. Let's just do Could be That would be mean. Yeah. It could be. Then I'd be really, really stuck. So just looking. Yeah. I mean, I don't see anything that is currently active on those. What else can I do? I'm trying to think proactively. Rogue processes under ps aux would be what I started to do. I could oh, that is htop, isn't it? What we got here? The Chrome process. Macron Taproot. What's we got? Wonder if it's a what's that? That's picking up on two thing. Okay. I don't see anything that stands out.
32:05 That is a beefy machine. Yeah. 48 cores of power It's 62 gig of RAM. That's the rule. It's what you need for your bad code application, isn't it? Yeah. For a Rust application that takes about 36 bytes of RAM. I didn't really think about this, but it could be done in Cilium, couldn't it? It could be a it could be a Yeah. It could a network policy. It could be a yeah. Could be a network policy. Let me just switch. Are you can you switch back to other computers just because I can change it? Okay. Yep. It's gone. K.
32:21 Checking CNI & Network Policies
32:35 Nice. Off you go. No cluster-wide network policies. Okay. Damn. Thought that was gonna be it. Well, there are three policies. There's cluster-wide network policies, there's Cilium network policies, and then there's networking v1 network policy. Okay. No network policies. And what about networking v1 network policy? You see that? I don't see that here. Egress NAT policies endpoints. Well, just run kubectl get netpol. Yeah. They're not custom resources. So Yeah. There we go. Okay. I don't think it allows all, like, at the scrape. I was just gonna have a look. I
33:08 Identifying Network Policy Block
33:22 think I could just delete that. Yeah. There we go. Allowing this traffic none. Yeah. Yeah. It doesn't allow any ingress to the cluster. Yeah. It deletes. K. And I was gonna have you looking for eBPF probes as well. Oh, that's not work. How mean do you think I am? Yeah. Okay. That affects your curl localhost 30000? I'm just gonna have a look. What was it again? It was thirty thousand. No. It's still sitting. Okay. But that oh, no. Look. That's Postgres now waiting because we got headers. Do have any logs from this?
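Why a policy that looks harmless blocks everything is worth spelling out. The sketch below is a deliberately simplified model of NetworkPolicy ingress evaluation: policies and rules are plain dictionaries, every policy is assumed to be an ingress policy, and each rule is reduced to a label selector on the source. Real policies also carry namespace selectors, IP blocks, and ports, none of which are modelled here.

```python
def selects(selector, labels):
    """An empty selector matches everything; otherwise every key/value pair must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(pod_labels, source_labels, policies):
    """Return True if traffic from source_labels may reach a pod with pod_labels."""
    selecting = [p for p in policies if selects(p["podSelector"], pod_labels)]
    if not selecting:
        return True  # no policy selects the pod, so it is not isolated
    # Once any policy selects the pod, only explicitly allowed sources get through.
    for policy in selecting:
        for allowed_source in policy.get("ingress", []):
            if selects(allowed_source, source_labels):
                return True
    return False


app = {"app": "clustered"}                            # hypothetical pod labels
deny_all = {"podSelector": {}, "ingress": []}         # selects every pod, allows nothing
allow_all = {"podSelector": app, "ingress": [{}]}     # an empty rule matches any source

print(ingress_allowed(app, {}, []))                       # True
print(ingress_allowed(app, {}, [deny_all]))               # False
print(ingress_allowed(app, {}, [deny_all, allow_all]))    # True
```

The second case is the break found here: a policy with an empty pod selector and no ingress rules selects every pod in the namespace and permits nothing. Policies are additive, so either deleting it or adding an allowing policy restores traffic.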
33:51 Connectivity Restored
34:17 Okay. No logs. Postgres will have logs. But, yeah, cluster point. Not in the Postgres cluster. Let's see logs there. Okay. Let's see if the service has been deleted. It points to the The ports look funny, though. Right? Ports. Can you see that? Zero. Data is zero. Oh, no. That's okay. Yeah. It's okay. And CA will just throw me off on the k nine. Yeah. I think it should show zero when it's going to the same port. Why not just Yeah. Why not just show nothing on? Or sort of yeah. I don't know. I've spotted that
35:02 in the past. I don't know. I don't know. Maybe it's not that. I mean, I don't know what the passwords are, so I can't tell if you've done if you changed that. I'll assume you weren't enough because that would be unsolved. It says the password was there, Postgres password. Yeah. It's hard to put it on the app. I I I think that's correct. Okay. Okay. Let's have a look. So it got a little it's it's running. It's running on it's bound to an interface that is bound to that doesn't matter there. That's inside the pod.
35:34 That's just the library to check. What else we got? Yeah. I think Postgres is running. I can't really tell. I mean, I can install psql, and we can start playing with it, I guess, with the next thing. We've only got a few minutes. There's no hints in the hints directory. So you can ask for a hint. Okay. What's the hint? I'll ask for a first hint. Can I get a lifeline? It's DNS. Oh, it's DNS. Oh, I thought I looked at the Corefile. Back to it. We go oversite. There's an
36:13 oversight. Yeah. It's not that. Yeah. That looks okay. Yeah. That's fine. Let's have a look at kube-proxy. It looks like there is a kube-proxy because it's a Cilium cluster with kube-proxy. Yes. Proxy running. I kinda haven't used I haven't really used how to say the inverse DNS. Let's have a look. Is there a resource that you configure it as a file? Let's have a look. We can figure out how isn't responsible for DNS. Let's have a look. So disable DNS. Let's have a look at host DNS. There were no Cilium network policies. We still
36:56 have L7 DNS awareness, but we didn't I thought we could do local DNS with Cilium or something like that or, you know, DNS based rules. You can if you use that Cilium local redirect policy, but I don't think we've seen any when we looked at k9s earlier. Okay. Well, let's try something else. So if I spin up a quick pod in the cluster, and I could try and use some network tools to do an s already. Should give you the to app. That's do they have apps in it? Yeah. It's
37:29 a point to I mean, by default. It's had I mean, well, unlike Ubuntu, this has CVEs in. Ubuntu is the only distribution of Linux on zero c major CVEs, criticals, apparently. Whereas NGINX is the one that always comes up. I should just check this installed. Maybe Okay. apt update first. Oh. Okay. So the so this part itself can't yes. It's Yeah. They're doing it. That's the cost of a ton. Yeah. Okay. Okay. K. Let's look at what's going on with CoreDNS. Okay. Let's go to the logs, settings. I don't think there's anything you can really
38:29 tweak in there. I think it's just the Corefile. Abilities Let me ask you a question, Alistair. Should we be looking on the worker? Yes. Yeah. I think I know what this is, but I'm gonna give Alex Let's go and have a look at the worker then and see what's going on. Okay. Everyone's the worker. Okay. Let's see if we can do any kind of nslookup. I'm just thinking whether it's gonna be something you've done at the host level, you know, like no. We've still got local DNS on the host.
39:20 Okay. Let's try I'm trying to think here. Coming from the one of the pods running from this particular node. So there's a template setting for DNS configuration, which might be Oh, right. Okay. And the kubelet config in the manifests. You need to go to /var/lib/kubelet, probably. And then config.yaml. Okay. Oh, okay. So the cluster name is fine. Cluster DNS, is that actually set to the IP address of the static pods for CoreDNS? I'm not sure. I think that's been modified. I don't think it should be 10.10. I don't need to restart this, do I?
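The check being made by eye here is whether the kubelet's `clusterDNS` setting still matches the ClusterIP of the cluster DNS service. A minimal sketch of that comparison follows, with the kubelet configuration already parsed into a dictionary. The function name and both IP addresses are illustrative assumptions: `10.96.0.10` is the usual kubeadm default, and `10.10.0.10` is a stand-in for the tampered value, which is not fully legible in the captions.

```python
import ipaddress


def mismatched_cluster_dns(kubelet_config, dns_service_ip):
    """Return the clusterDNS entries that differ from the DNS service's ClusterIP."""
    expected = ipaddress.ip_address(dns_service_ip)
    return [
        entry
        for entry in kubelet_config.get("clusterDNS", [])
        if ipaddress.ip_address(entry) != expected
    ]


# Assumed values: the ClusterIP would come from the kube-dns service in kube-system.
healthy = {"clusterDNS": ["10.96.0.10"], "clusterDomain": "cluster.local"}
tampered = {"clusterDNS": ["10.10.0.10"], "clusterDomain": "cluster.local"}

print(mismatched_cluster_dns(healthy, "10.96.0.10"))    # []
print(mismatched_cluster_dns(tampered, "10.96.0.10"))   # ['10.10.0.10']
```

As the discussion that follows points out, fixing the file is not enough on its own: the resolver address is injected when a pod sandbox is created, so existing pods need to be recreated after the kubelet picks up the corrected value.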
40:37 So just auto restart? I think you're gonna have to kick some pods, I think. Right? Because the DNS configuration is injected into the sandbox, so you'll probably have to delete the cluster pod for the DNS to do it. Okay. And that's assuming he's not changed the default cluster DNS policy. I couldn't see in the config. I feel like I didn't put enough breaks in and mind that. I think I need to things. Oh, well, we got through quite a few. Let's see whether that's worked. Nice. Job done. We have v two video in there. Well,
41:11 Challenge 1 Complete & Debrief
41:17 lots of help from you, David. So thank you. Well done. That was fun. Well done. Those those were all things that I've fixed in the past on the cluster. So I tried to keep them real. Awesome. Interesting. My favorite one is, I like the I like the net policy. I think that was quite funny because you can just look at that and, oh, it's fine. You know? I think that's the the thing about network policies. Right? It's like there's so much defaulting going on that they passed the eyeball test when they do literally nothing. And then what you have to really you
41:54 gotta really understand that actually they they block all by default. You need to start opening things up. Yeah. And then the DNS change with the eyeball. It's like 10 dot 10 passes the eyeball test. I've just seen enough shit that I know that that's not right. That's a magic IP as well because it's set to the same as the service is set to. And if it's not right or one of them has been changed or something, then things get weird. There we go. I have to I have to prefix this as if we go into it. Mine
42:24 was way too hard to begin with. Right? Way too hard. It was causing David some headaches, like because I kept asking to restart the computer server. So I think I've gone one away now. So I think you might end up solving it in about thirty seconds. So if that's the case, I apologize. Sorry. You could just add more breaks while he's fixing. Like okay. Yeah. That is true. I'm just going in. Live action Klustered breaker. Red team, blue team. Yeah. That'd be fun, wouldn't it? Alright. Well, awesome work, Alex. You finished under time. You got everything working,
42:56 and we've seen some cool breaks along the way. So thank you, Alistair. Alistair, do you wanna open up a session on Alex Jones control plane one? Let me know when you've done it. I'll join it, and we'll check over. Okay. Yes. Control plane one. Join that session. Okay. Give me a sec. It's just hanging, I think. Yeah. I said on the chat saying that was great. Alex is still calm while debugging. Not only were you calm, like, were verbalizing your thought process. Like, we I I just I was just sitting back and joined. Like, I didn't have to do anything. Thank
43:00 Alistair Fixes Alex's Cluster (Challenge 2 Starts)
43:35 you. If as a word of advice, when people do that in interviews, it's really helpful for interviewers. You know? If you're doing a coding interview and you don't feel you feel really stressed or nervous, just vocalize it because I actually I find it helps me stay calm. Can you not connect, Alistair? Yes. I'm in now. It took a long time. I'm sure that's a good sign. I'm sure that is Everything's very slow. Oh, no. You should have seen the sadistic people I work with. You should have seen the things they were suggesting I do. I have held
44:04 Initial Checks on Slow Machine
44:11 back. Well, we had a nice welcome message as we approached. So Well, I'm doing the same, Alex. Sorry. Sorry. You're doing that. No. k9s first. But it's being very slow, which is filling me with dread. So we have some host breaks, I assume. I'm just gonna grab a glass of water. Alright. It is doing it, but it is slow. So Give that a second. Running htop after this and see if we Yeah. Maybe I give you one of the machines with one vCPU. Okay. What does that say? It's telling you
44:58 to source that mail. Yeah. Okay. There's nothing more annoying than you're on a really slow machine. Gosh. No. You got a nice machine too. But it's likely that any performance degradation you're seeing is intentional. Good luck. Okay. At least you could taste. Thing. Which one is it? df -h. Have we got a full disk? No. Okay. Can I ask a question? I is it another part which is why you're doing so we'd throw them off? Sorry. Say that again? Say again. I was wondering if you're okay if I ask questions, or is it gonna throw you off whilst
45:24 Debugging System Slowness (df, htop)
45:45 you're doing this? I just was wondering why would that be your default go to? Why would the disk being full, why do you think that will slow down the CPU? Well, it just slows everything down. From from from his from my past experience, I don't I don't really know the ins and outs of it. But when I've when I've SSHed onto a machine, which is full, it's usually pretty slow. Normally, because all the processes are trying to write logs, and they can't and you get lockups. But I think they're okay. There are I mean, there is some rules because there's
46:20 one of them has no saturation attack. So We should it's largely through swapping as well. Yeah. Just a problem. So I'll send them I think it's not there now. Is it gone? I don't know. What what now? It's there. I know. It's it's oh, it's It's slow. Guys, very slow. Okay. Why is it slow? What's happening? Nothing showing as using too much CPU. In fact, this command is the slowest one. Oh my god. I mean, all you need to do is update v one to v two. It doesn't matter if it's slow, really. Give oh
47:34 my god. Mistype. And it's admin.conf, isn't it? It is indeed. I didn't do my export. He doesn't source the yeah. I've gotta fill this time with talk, and I have no chat. So okay. Alright. What do we have? Unable to connect. Of course. We have no context, no cluster, no API server potentially. Okay. So Not the best way to test the API server, I've gotta say. Quit. k9s is quite heavy, isn't it? There's quite a few calls. Yeah. I feel like you're trying to give me a hint, but I'm too stupid to know what it means.
48:41 I can give you a hint if you would like. Because this might if you don't know this, it might be quite difficult. Yeah. Go for it. Oh, come on. It's too early for that. Let me just say this. The speed slowdown is not gonna impede you, like David said. That's not gonna stop you from doing what you're trying to do. If you're worried about slowdown, like, sysctl would be something I'd start exploring. Yeah. We don't need that. We can do this. Come on. Yeah. Let's have a look. containerd is not running.
49:34 We don't need containerd. So the kubelet's in a crash loop. Dynamic file. Dynamic. It suddenly got really hot in here. I'm just gonna turn the heating off. Alexa, turn off the office floor. Oh, actually hot. Alright. I thought you were just nervous and sweating. Oh, no. No. The temperature is it? It's only 22. But I only updated your Alexa, so it's increasing the temperature in your office. Next-level Clustered. Yeah. Would that be causing it to be super slow? Probably not. I mean, if we're seeing a slowdown like this without CPUs and RAM being hammered,
50:26 my immediate thing is I mean, I would wanna start looking at sysctl and see maybe modify the table to see if it is running processes under nice, like, control groups for random system processes. That for CPU. Is it That guy. Mean, it should all be lowercase, but just okay. Yeah. Mine is alright. Yeah. Max percent 25. Or CPU. So it could be it depends how it's been. Right? So you could have persisted these changes to some sort of file. In which case, you could try just running you could try resetting all these things.
50:28 Checking Sysctl Parameters
51:17 But, yeah, that perf_cpu_time_max_percent. I don't know if that's just normal or we're just looking at it under the context of things being broken, but I would maybe change it. How do you set one? MSW? No. Well, just do sysctl --system. Does that reset them? I think so. So no graphic for CPU again? Still very slow. Alright. Is that something we should be looking at, Alex, that perf_cpu_time_max_percent, or is that a red herring? Yeah. That has not been modified. Okay. That's not because
52:12 of the speed. What command was that? dmesg. dmesg. Alright. Okay. I mean, that's going very low level. Yeah. Yeah. I have literally no idea. Alright. I would just kill this machine from the auto scaling group, and it would come back to life. Alright. Let's do some searching. Alright? So alright. Or you can just keep typing dmesg. Sorry. I didn't know what was happening. That was fun. I don't know. I alright. Maybe not then. Mhmm. What have you done? Let me see. Well, processes look alright. The slowness I mean, yeah, the typing
53:52 speed is a little slow. Right? Mhmm. What if we oh, I wanna see if we can actually cause the course to actually do some work because yeah. Alex has been mean, hasn't he? I've done far fewer fit things than you have. You were up over, I think, five or six different things. So you had me running through hoops, jumping through hoops. Alright. We'll take a hand then. I feel like you've disabled some of the CPUs, maybe constrained the IOPS over this. But I to be honest, I'm not sure how to check. If you were physically in front of the
54:44 machine typing on the terminal, you wouldn't see this latency. Okay. So some may Just to slightly speed this up, you're still gonna have to work out the answer, but it's using traffic control. The tc command. Which is a program. We've only ever seen traffic control once on Clustered. Yeah. It's pretty nice, and not a lot of people use it. Where's its config? It doesn't have one. Oh, no. It's everywhere. You run tc commands and you apply constraints. We've seen that in the past where one person applied a 50% network packet drop on all is OTCP
55:01 Hint: Traffic Control
55:40 traffic. Wow. That's mean. Also breaks Teleport, apparently. Less constraint. I'm gonna Google it. Me too. Traffic control. Alright. So we have some sort of queuing discipline. We have filters for going to just how do we list all those traffic control modifications? That's what I wanna know. If that's even possible. qdisc seems to be the command they mention more. There you go. So we have a limit 1000 delay 1.5s and a limit 1000 delay 500ms applied to eth0 and eth1. Okay. I don't know how to delete them. You can just live with it.
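The listing step described here is what `tc qdisc show` does: it prints one line per queuing discipline, and a netem entry carries the injected delay. As a rough sketch only, the Python below scans that kind of output for netem delays. The sample text, the handles `8001:` and `8002:`, and the helper name `find_netem_delays` are illustrative assumptions, not output captured from the stream, and the exact format varies between iproute2 versions.

```python
import re

# Illustrative sample only: real `tc qdisc show` output differs between
# iproute2 versions. The delays match the values read out on the stream.
SAMPLE_TC_OUTPUT = """\
qdisc noqueue 0: dev lo root refcnt 2
qdisc netem 8001: dev eth0 root refcnt 2 limit 1000 delay 1.5s
qdisc netem 8002: dev eth1 root refcnt 2 limit 1000 delay 500ms
"""

# Matches lines such as "qdisc netem 8001: dev eth0 root ... delay 1.5s".
NETEM_LINE = re.compile(
    r"^qdisc netem \S+ dev (?P<dev>\S+) .*?\bdelay (?P<delay>\S+)"
)


def find_netem_delays(tc_output):
    """Return a dict mapping interface name to its netem delay string."""
    delays = {}
    for line in tc_output.splitlines():
        match = NETEM_LINE.match(line.strip())
        if match:
            delays[match.group("dev")] = match.group("delay")
    return delays


if __name__ == "__main__":
    for dev, delay in find_netem_delays(SAMPLE_TC_OUTPUT).items():
        print(f"{dev}: netem delay {delay}")
```

Interfaces without a netem qdisc, like `lo` in the sample, are simply left out of the result.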
56:51 Identifying Traffic Control Delays
57:19 You could just live with it. I mean Yeah. Yeah. Yeah. You're getting, like, the experience of running a Pentium four six five or, like, you know, early nineties computers. Yes. From half the country away before my time. I mean, I don't know if I should be allowed to make up commands. I'm not feeling brave. Okay. Let's just Google it. I guess it's what we do when we talk to the answer to something. So how do we remove What was it called? And So you can do a tc qdisc del dev DEV root. Right? So let's modify that
57:59 Fixing Traffic Control Delays
58:04 to try and match those two rules that we have there. So qdisc del dev probably stays dev, but the DEV in uppercase is probably eth0 and eth1. eth0. Wanna run it for those two, I think. Okay. And if we lose access to Teleport, we'll just go to the pub and have a beer. Well, I set it to a thousand seconds then. Alright. Try the Oh my god. It's so quick. Yay. Oh, I feel like a running arch there. No? There we go. It's so quick. I've never seen somebody so happy. Oh, it's so good. Right. There you go, DJ.
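The fix above amounts to running `tc qdisc del dev DEV root` once per affected interface. A minimal sketch of that step follows, assuming the interface names `eth0` and `eth1` from the stream. The helper names are hypothetical, and the function only prints the commands by default, since deleting a qdisc needs root and changes live networking.

```python
import shlex
import subprocess


def root_qdisc_delete_argv(interface):
    """Build the argv for `tc qdisc del dev <interface> root`."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def remove_root_qdiscs(interfaces, execute=False):
    """Return the commands as strings; run them only when execute=True.

    Running for real needs root and the `tc` binary, so the default is a
    dry run that just reports what would be executed.
    """
    rendered = []
    for interface in interfaces:
        argv = root_qdisc_delete_argv(interface)
        rendered.append(shlex.join(argv))
        if execute:
            subprocess.run(argv, check=True)
    return rendered


if __name__ == "__main__":
    # Interface names are assumed from the transcript.
    for command in remove_root_qdiscs(["eth0", "eth1"]):
        print(command)
```

Deleting the root qdisc returns the interface to its default queuing discipline, which is why the session became responsive straight away.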
58:32 Slowness Fixed, Kubectl Issues
58:58 Right. Yes. That was fun. Okay. These are the kubelet logs. What happens if you try kubectl get nodes? Do we just have no API server? The kubelet is crashing. You don't need a kubelet for an API server. Alright. Probably. Yeah. This one, CTL. No. Status. Okay. No. Inactive and dead. Yes. Do enable that session error, and that will start with the. Yeah. Look. Hey. He's doing stuff. Something not found. Versatile part. Let's see. Okay. You didn't trust kubectl, did you? Oh. Fine. So that is a binary. It's a Go binary. Try kubectl
1:01:03 Debugging Kubectl Command
1:01:10 version. I'll leave it all that. Yeah. I was gonna say I should It's just the one. So you did a file on the actual binary, but try to type kubectl without the which. You may not be executing the binary. Yeah. Oh, there you go. That was that was last year originally. That's been turned out. That's Control delete all namespaces. Shift code tag. Shift all alpha echo. Terminal echo. Okay. Think you have a pretty broken cluster right now. Yep. In fact, I think it's in worse shape than it was when you started.
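The trap here is that `which kubectl` only looks up the binary on PATH, so it says nothing about a shell alias that runs instead. In bash, `type kubectl` or a bare `alias` lists the alias, and `unalias kubectl` removes it. As a hedged sketch, the Python below parses the text that bash's `alias` builtin prints and reports whether a command name is shadowed. The sample aliases, including the `echo meow` value, are invented for illustration and are not the alias used on the stream.

```python
import re

# Illustrative sample of what the bash `alias` builtin prints.
# The alias values are made up for this example.
SAMPLE_ALIAS_OUTPUT = """\
alias k='kubectl'
alias kubectl='echo meow'
alias ll='ls -alF'
"""

# bash prints each alias as: alias name='value'
ALIAS_LINE = re.compile(r"^alias (?P<name>[^=]+)='(?P<value>.*)'$")


def parse_aliases(alias_output):
    """Return a dict of alias name -> replacement text."""
    aliases = {}
    for line in alias_output.splitlines():
        match = ALIAS_LINE.match(line.strip())
        if match:
            aliases[match.group("name")] = match.group("value")
    return aliases


def shadowing_alias(command, alias_output):
    """Return the alias text that hides `command`, or None if there is none."""
    return parse_aliases(alias_output).get(command)


if __name__ == "__main__":
    hidden_by = shadowing_alias("kubectl", SAMPLE_ALIAS_OUTPUT)
    if hidden_by is not None:
        print(f"kubectl is aliased to: {hidden_by}")
```

A check like this belongs early in a host inspection, before any command whose output you are about to trust.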
1:01:33 Identifying Kubectl Alias
1:02:00 Oh, yeah. But you can unalias. You can re-alias kubectl. Is it unalias? I will just say I will just alias kubectl back to /bin/kubectl, then it will work properly. I'm alias. Sorry. Need to give you a just to give you a hint because we're kinda Yeah. There you go. Meow. Really, it's a cat meow. Thanks. I like that one. That was nice. Yeah. I mean, we don't need the emissary, so I'm not terribly fast for that. So you definitely use kubectl lease. Kube node leases. We've made we should
1:02:07 Fixing Kubectl Alias
1:03:00 be alright with that. As long as you don't need to get a new one. Alright. Just gonna check the event. We have lost the worker node as well. You wanna open a session on the worker? Or Yeah. Yeah. You don't have to fix the worker. I'm just throwing it out there. Has he done the same on this one? Yeah. Yeah. You can just schedule on a control plan if you want. I'm still on the same frame. It's just so fun. I will leave that. If I just remove the no schedule on the control on the
1:03:16 Pods Stuck Terminating / Worker Unhealthy
1:03:59 Yep. Control plane, does that do it? It doesn't do. So edit node. Yep. No. Shit. Capital N, capital S on the NoSchedule. Just get rid of them. Yeah. Yep. Oh, yeah. They ate the wrong node. Delete one of your nodes. You have node nodes. You can either use node space or slash, but you can't use. Node nodes. Node nodes. Alrighty. That was aggressive. I wanna move him off that node. So I mean, a card that would have been fine, but, I mean, that's Yeah. You can control c that. You don't need to wait for the second half of all
1:05:33 that crap. Cluster is not scheduled. Add a -o wide just to confirm where. Yeah. Control plane. There we go. Yeah. And because the Postgres is a StatefulSet, might wanna delete whatever force and just because in fact, people start running, it will never do a status check. I dashed the I don't think you need the grace period anymore, but No. I don't think so. Conditioned. So Okay. There we go. Yeah. So that kubelet is not healthy, which is why we've got stuck in terminating. Because the kubelet would never check in to see it.
1:05:41 Debugging Node/Kubelet Health
1:06:27 It's actually gone. Yeah. It's healthy, funny. WD Fortnite chair. I know. Can't stand the one. It is very old. The canine's. That's why he's got the loud keyboard, so you can't hear the chair. You can't hear the chair. Doesn't have any logs. It has no logs. Is it running v one or v two? Did you update the v one? I did update it. It's still v one. Not changed. Okay. It's not to create a replica set. Just check I changed that. No. Oh. Something. Yeah. I didn't think I changed it. I forgot. Maybe I did.
1:07:24 Fixing App Deployment Image
1:07:24 Did I edit the pod? I did the pod. That's why. Pod specs are immutable except for the image. Yeah. I guess, every bit almost is the point. Mean, that is v two. Alright. So that is done? What was it? Three Sorry. No. Of course. So you're getting headers. That's Yep. Yeah. So network oh, just put Postgres is running. Yeah. May not have initialized, actually. Okay. Yeah. It has. Run get pod service endpoint. I don't think I don't know why people don't run that more. I don't know why I don't get the AI inside. I just You don't like it? No.
1:08:02 Challenge 2 Complete & Debrief
1:08:29 I hate too much. I don't think this is something that I've done. This is exciting. I think you just nervous. Well, actually, there's there's a weird thing that I've never really gotten to the bottom of where the node ports don't work on the control plane sometimes, and you have to do it on the worker node. But here, it should definitely work. We'll try again. Well Yeah. I know it's it's simple to know to do do all that stuff. Well, I'll just check these as well. No. You press not policies. Endpoints. What was the other one? Let's see the
1:09:23 network policy now. Okay. So we're good. Control c. Okay. So pod's running. Postgres is running, but networking iptables? Yeah. No rejections. I don't see anything. I don't really know what I'm looking at. Some firewall rules. Alright. Sorry. That will keeps Kubernetes running. Yeah. That looks suspicious. Yeah. You see the drops? The bottom. Yeah. We're blocking He's commented them for us. So kind. Drop all anywhere anywhere. Yeah. It's dropping all packets where the source hasn't been marked. As local host. Yeah. I think. Sometimes I read these wrong, but I would flush them out. Just go
1:11:01 there. What is it? -F. And then run -L again just to yeah. That's a. Nice. Well done. That was pretty mean. I've gotta see what the tweet is. What's the tweet? Productivity does not determine your value. You have value in just being you. Yeah. I mean, we don't fill up the website much anymore because we just rely on curls. You know what I mean? The downside is people don't get to see me back. But yeah. Those are pretty harsh breaks, I think. The traffic control one was particularly mean. Like, I don't think we'd ever have
1:11:57 got that. Like, if you had I had just slipped with it. Yeah. But We as we said, I I I scaled it back because it you know, as you said before, like, sometimes you don't know what a good level is to do. Like, it was actually doing a seg fault every five minutes, and that just felt, like, really mean. So we're not doing that anymore. I mean, it's tough. Right? Because he he never it's hard to understand the experience the other person brings like there are you could do two breaks and someone never finds them.
1:12:19 Discussion & Reflections
1:12:28 But if you get the right person that knows what they're looking for, then you need 20 breaks. Like it's just it's a complete game of cat and mouse and as again, it's not about the fixing and it's not about how quickly you do it or how sneaky you are with the breaks. It's just about getting two smart people who know what they're doing and seeing how they handle and debug a situation. I think, to me, that's always the most valuable thing about what we're doing here. And I really liked you know, we both approached the the the
1:12:55 breaks at different layers, which was quite cool. You know, you were doing them more in the kind of K8s config, and I was just using some more of the Linux capabilities. I think it's quite a nice overlap to explore. Yeah. I pretty much never run Kubernetes in production. Actually, the control plane because I've always used managed. So for me, it was like, I'm just gonna break it in, like, user space almost as much as I can and things that I actually have control over. I mean, saying that, you have disabled the replica set controller.
1:13:25 Yeah. Okay. That one's just because I hate the. I always fail to look there. And I'm like, nothing's happening. And it's usually an admission, a validating webhook or a mutating webhook failure or something. And it just, like, stops in that layer and disappears. Nobody ever looks in that file or at the controllers. I don't think most people know that you can disable individual controllers with the controller manager. I think it's always me that kinda really points that one out. And, it's just because I've seen stuff on Clustered. We've seen a lot of these things.
1:13:55 The other one, which I found hilarious. I can't remember when it was, but maybe about a year ago. Someone just set the deployment to pause and it just doesn't roll out any new pods. But if you don't know that that's there, you just never find it. It's just the Nothing in the logs. Why it's strange to do this? Yeah. Yeah. It's like, oh, you can pause the deployment and then you have to explain again what's actually this whole rollout command and you can finish this stuff. It's just there's so many different parts of
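The mechanism being described is the kube-controller-manager `--controllers` flag: `*` enables the on-by-default controllers, a bare name enables that controller, and a name prefixed with `-` disables it. The sketch below, under stated assumptions, pulls the disabled names out of a command line. The sample arguments and the controller name `replicaset` are illustrative guesses at what the break looked like, not the flag value shown on the stream, and controller names can differ between Kubernetes versions.

```python
# Hypothetical kube-controller-manager arguments, for illustration only.
SAMPLE_ARGS = [
    "kube-controller-manager",
    "--controllers=*,bootstrapsigner,tokencleaner,-replicaset",
    "--leader-elect=true",
]

FLAG_PREFIX = "--controllers="


def controllers_flag_value(args):
    """Return the value of --controllers, or None when the flag is absent."""
    for arg in args:
        if arg.startswith(FLAG_PREFIX):
            return arg[len(FLAG_PREFIX):]
    return None


def disabled_controllers(args):
    """List controller names that the flag turns off with a leading '-'."""
    value = controllers_flag_value(args)
    if value is None:
        return []
    names = [item.strip() for item in value.split(",")]
    return [name[1:] for name in names if name.startswith("-")]


if __name__ == "__main__":
    for name in disabled_controllers(SAMPLE_ARGS):
        print(f"disabled controller: {name}")
```

An empty result means nothing is explicitly disabled, although a flag value that omits `*` can still leave controllers off, so the full value is worth reading by eye as well.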
1:14:23 the Kubernetes API. Nobody could very unlikely nobody could know them all. And again, it's just to get people with different backgrounds as you know. This was awesome. Thank you both for for taking time. I know this could be quite stressful, so I really appreciate you both breaking clusters and fixing them live on the stream. Oh, thank you for having me. Any last words before we we say goodbye to our audience for today? I think just to say that this is awesome. We've I've certainly enjoyed it. I I I I think as well. It was very
1:14:31 Thank You & Wrap-up
1:14:54 educational. I think I learned a lot as well, and, you know, it's great to have you as the host, David. So thank you, and please keep doing it because I know I'll be watching. Thank you very much. Yeah. Shout out to everyone watching. Get involved if you can. Reach out to David, I guess. Yeah. Are you planning to carry on these? I know at one point you were struggling to find people to come on. It's hard to find victims. I mean, contestants. So It it was less bad than I thought. I'll just leave that with everyone. Well, this
1:15:24 is the thing. Right? It's like there. I know both of you. I know you're both incredibly smart. You're both very, very aware with Kubernetes, but it's intimidating. Right? You have to be very vulnerable to sit on a livestream and do something like this. And it's like, you know, so it's difficult to find people that are willing to make themselves that vulnerable and do this. And I would love to be able to encourage more people to do it, but I don't want people to be uncomfortable either. So it's like, you know, I will continue to do Clustered
1:15:51 for as long as I can convince you two and others to come on and join me into it. And, you know, when I can, I'm happy to be the one that looks silly. So, you know, if you just wanna break a cluster and watch me laugh and laugh at me try to fix it, I'm I'm happy to do that too. Let's let's we'll do that all day. Anyway, thank you both again. Absolute pleasure too. Very sneaky breaks, and best of luck for that. Well done for fixing them, and I'll see you both again next time. Thank you. Thank
1:16:16 you. Bye bye.