Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Troubleshoot a Kubernetes CrashLoopBackOff by auditing pod image updates, probes, and restart patterns.
  2. Verify API server startup failures by checking kubelet logs, static pod manifests, and network settings.
  3. Detect tampered control plane components by inspecting pause images, replaced kubectl binaries, and API manifest edits.

Teams from Giant Swarm and Wildlife Studios swap broken Kubernetes clusters and debug live, working through Cilium CNI sabotage, tampered pause images, replaced kubectl binaries, broken DNS, etcd failures, and iptables policy traps.
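The iptables trap named in the summary (and in the chapters from 1:24:33 onward) comes down to one fact: flushing rules with `iptables -F` does not reset a chain's default policy, so a chain left at DROP will cut off SSH once its ACCEPT rules are gone. The sketch below is one way to check for that before flushing. It parses text in the `iptables -S` format; the sample rules and the function name are assumptions made for illustration, not output captured from the episode's cluster.

```python
# Minimal sketch: find built-in chains whose default policy is not ACCEPT.
# SAMPLE mimics the `iptables -S` text format; its contents are made up.

def risky_default_policies(iptables_s_output):
    """Return {chain: policy} for chains whose default policy is not ACCEPT."""
    risky = {}
    for line in iptables_s_output.splitlines():
        parts = line.split()
        # Policy lines have exactly three fields, e.g. "-P INPUT DROP".
        if len(parts) == 3 and parts[0] == "-P" and parts[2] != "ACCEPT":
            risky[parts[1]] = parts[2]
    return risky


SAMPLE = """\
-P INPUT DROP
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
"""

if __name__ == "__main__":
    # Any chain listed here would drop traffic after a flush.
    print(risky_default_policies(SAMPLE))
```

If the result is non-empty, setting those chains' policies to ACCEPT before flushing avoids losing the session.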

Chapters

  1. 0:00 Holding screen
  2. 1:45 Welcome and Intro to Klustered
  3. 2:21 Sponsor Shout-outs (Teleport, Equinix Metal)
  4. 3:11 Introducing Team Giant Swarm
  5. 4:16 Starting Debugging Session 1 (Wildlife Studios' cluster)
  6. 5:34 Initial Cluster Health Check (kubectl)
  7. 6:48 Identifying Crashing App Pod
  8. 7:47 Attempting Image Update (v1 to v2)
  9. 10:30 Debugging Pod Probes
  10. 11:13 Removing Probes and Resource Limits
  11. 19:39 App Pod Still Crashlooping (Observing 30-second loop)
  12. 22:00 Considering Looking Beyond Kubernetes
  13. 24:08 Hint: Network Timeout Suspected (from chat)
  14. 25:51 Checking Kubelet Logs (Failed to Start Container)
  15. 28:51 Investigating Node Processes (Suspicious Sleep)
  16. 30:54 Attempting Pod Reschedule (to Control Plane)
  17. 35:35 Requesting First Hint (Pause Container)
  18. 40:58 Debugging Pause Container Image
  19. 42:43 Removing Tampered Pause Image (crictl)
  20. 43:58 Debugging Service/Network Connectivity (Curl Timeout)
  21. 45:19 Checking Network Policies (Cilium)
  22. 46:20 Checking Cilium ConfigMap (`enable-policy`)
  23. 47:47 Attempting to Fix Cilium Configuration
  24. 51:17 Time Runs Out for Team Giant Swarm
  25. 52:05 Introducing Team Wildlife Studios
  26. 54:08 Wildlife Studios Explains Cluster 1 Problems (Cilium & CNI Spoofing)
  27. 57:11 Starting Debugging Session 2 (Giant Swarm's cluster)
  28. 58:05 Debugging kubectl Binary (Found replaced binary)
  29. 1:00:14 API Server Not Running
  30. 1:01:29 Checking Kubelet Status & Logs
  31. 1:03:10 Checking Kubelet Config Files
  32. 1:05:22 Kubelet Failure: Image Pull Error (DNS Issue)
  33. 1:06:17 Debugging DNS Resolution (/etc/resolv.conf, dig)
  34. 1:09:12 Checking /etc/hosts File (K8s Registry Mapping)
  35. 1:10:41 Fixing /etc/hosts
  36. 1:11:47 Image Pull Successful (DNS Fixed)
  37. 1:12:17 API Server Still Not Running (Cluster IP Range Mismatch)
  38. 1:13:28 Checking API Server Manifest
  39. 1:14:06 Fixing Service Cluster IP Range
  40. 1:16:16 Forcing API Server Restart
  41. 1:18:25 API Server Starting, Checking Logs (Etcd Connect Failed)
  42. 1:20:32 Etcd Not Running
  43. 1:20:41 Checking Etcd Manifest
  44. 1:23:59 Requesting Hints for Cluster 2
  45. 1:24:33 Applying Hint: Flush IPtables
  46. 1:25:13 Lost SSH Access (IPtables Default Policy Suspected)
  47. 1:30:32 Testing IPtables Flush (Confirms Default Policy Issue)
  48. 1:31:55 Explaining IPtables Default Policy Problem
  49. 1:34:47 Ending Debugging Session 2
  50. 1:35:00 Wildlife Studios Explains Cluster 2 Problems (Giant Swarm's cluster)
  51. 1:41:06 Giant Swarm Explains Cluster 2 Problems (Wildlife Studios' cluster)
  52. 1:45:33 Conclusion and Farewell
Transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:45 Welcome and Intro to Klustered

1:45 Hello, and welcome back to the Rawkode Academy. My name is David Flanagan, although you may know me from across the internet as Rawkode. And today we have an episode of Klustered. Klustered is a, I don't know, actually. Klustered is an experience where we have two members or two teams join us with some broken clusters, and they will attempt to share their Kubernetes knowledge with us as they debug and fix, hopefully fix, those clusters. Before we begin today, there's a little bit of housekeeping and thank yous. First, I need to thank Teleport. They've been sponsoring and supporting Klustered

2:21 Sponsor Shout-outs (Teleport, Equinix Metal)

2:25 since the very beginning, and we've been using Teleport in every single episode to help share, pair and fix clusters. So if you wanna say hello to Teleport and support the show, please check out rawkode.live/teleport. It keeps them happy, it keeps me happy, and it keeps things ticking along. Also, Equinix Metal have been sponsoring by providing all of the hardware. Every cluster on Klustered is a multi-node bare metal Kubernetes cluster, hand rolled by my own artisanal bash code and some Pulumi sprinkled in too. So thank you to Equinix Metal. It's great hardware. I love using it every week, and I really

3:03 appreciate it. Alright. Lots of chats, lots of hellos in the chats, and I'll get to them in just a minute. But first, I wanna introduce today's first team. Hello, team at Giant Swarm. How's it going? Hello. Hey. How are you? Hello, everybody. He's all excited? Oh, yes. Yes. Yeah. Of course, you are. Alright. Alright. Let's start with you, Marcus. We'll go around clockwise. Can you all introduce yourselves, say hello, and then we'll take it from there. Hey, everyone. I'm Marcus. I'm a platform engineer here at Giant Swarm, and I've been with company about six months now. So fairly new. I'm

3:11 Introducing Team Giant Swarm

3:46 quite looking forward to this. Yeah. Hey. I'm Fuya. I'm a virtual product here at Giant Swarm. A bit longer than six months, or a bit more than seven years now with Giant Swarm. And, yeah, we're looking forward to how to fix a really meanly broken cluster. And hello, everybody. I'm Christian, SRE at Giant Swarm. Been here for a couple of years now. Let's do this. Alright. Let's get my screen share up. So I have already given you all access to the other cluster, broken by the team at Wildlife Studios, who will be joining us in roughly forty-five minutes' time.

4:16 Starting Debugging Session 1 (Wildlife Studios' cluster)

4:29 You've got forty-five minutes. I set it to forty-six. That gives us one minute to make sure things are alright, but we do have a timer to keep track of progress. We have access to all the nodes. I'm bravely going to use the SSH client for today. And I will open a session on Wildlife Studios Control Plane one. If you could all please join the session, type echo hello, anything that you want, just to let me know you're there and then we'll get things underway. But remember to find the session ID and

4:58 join. Please don't open your own one. If we can't watch you type, it's not very fun. Alright. There's a first of all. Thank you. Alright. You're all in. Perfect. Well, good luck. Off you go. Cool. Marcus, you're driving. Right? I am. Yes. So before we actually get going, we're gonna pull in a few tools to help us debug. kubectl? You're bringing in your own kubectl? Yeah. You don't trust anything, do you? Nope. Nope. Alright. kubectl, jq, k9s, etcdctl. Cool. Nice. I've never seen this subscript like that before, but I like it.

5:34 Initial Cluster Health Check (kubectl)

6:01 We can share it afterwards. Cool. Okay. So We have a control plane. Access, so that's that's a good That's nice. We do. I always worry when there's a control plane in case people forgot to break a cluster. Oh, yeah. k9s doesn't work very well on Cool. So Not running. CrashLoopBackOff. Let's have a look. When running commands with long output, I would encourage you to use less so that we have the same scroll settings instead of scrolling in your own terminal. Just FYI. Sure. Sure. Cool. So I also just wanna point out, this is

6:48 Identifying Crashing App Pod

6:50 the first cluster in three months where everything has been running when I gave the cluster to you. You know? I've been improving my process. Very worrisome to me. But yeah. The app is not running. It's crash looping. There is Emissary ingress. There was an Emissary ingress in ours. Right? Yeah. There was. Right? Was it? Yeah. There was. Yeah. Yes. There was. Did you did you remove it? So that doesn't work right. It exits with status code zero, so maybe the image is not what we expect it to be.

7:47 Attempting Image Update (v1 to v2)

7:47 Anyway, we have to bump to v two, so let's do it straight away. Right? Cool. Yeah. Why not? Let's do that. Clean the password. Oh. Oh, wow. It does not like that. Yeah. There have been issues with Vim before. That's not nice. Okay. So we're doing this old school one. I mean, you can try installing Nano. I can switch to the web interface. That may help. Whatever. What do you what do you all think? It looked okay on the web interface. But Yeah. It's it's just the CLI that gets the weird issue with Vim. Okay. Now I'll tell you what. I'll I'll

8:50 switch over to the You can use kubectl set image if you want. Image deployment slash deployment name. Yeah. We might need to remove the the ID fields there. There's a container ID and an image ID. Right? Oh, no. That they they'll be updated. Right? They'll be updated. Yeah. If you if you change the image, the image ID will be updated. Yeah. Right. Oh, sorry. Yeah. You also need the container name, which is cluster d. Clustered. Yeah. I never remember the set image syntax. Mhmm. According to the documentation, you need container name equals image name. As in

10:00 No. No. No. No. It was already Clusters equal. Yeah. Yeah. Yeah. Sorry. Yeah. Thank you. Okay. Let's see what happens now. Hopefully, it's pulling an image. Let me see. Let me describe it. Yeah. See where it's hanging. Liveness probe and readiness probe. Is my timer broken? Okay. Surely, that's been more than one minute. Yes. Let's check the probes on the on the. Yeah. I'm gonna switch over to the web, though, so I can use Vim because otherwise, this is gonna get quite messy. Yeah. Oh, okay. Okay. The liveness probe is clearly wrong. Also, the readiness probe is clearly

11:13 Removing Probes and Resource Limits

11:16 wrong. Mhmm. Cool. That's not is that showing alright for you? It's alright. I can Yeah. I'll I'll open another session. Yeah. I'll join the session too. That way, when we do use Vim, we can at least see it on the web, and then I'll pop back. Yeah. Yeah. Cool. So I'm just gonna get rid of the probes. Completely get behind them. I worked hard on the probes. But before hold on before you do that because it I saw something weird in the describe in the describe. Okay. Can you describe the pod so you can so we can see it

12:03 together? If you scroll up a little bit to the probes definition. Maybe it's normal. It's just me. You need to pipe to less to show that on the screen. Oh, Chat when you see him. Where is it? There we go. Yeah. So they're missing IP address, but it defaults to Rawkode. Right? So it should work. Yeah. Yeah. That should be fine. So any objections when you just get rid of them? No. Just get rid of them. It's fine. I mean, that's what I do on prod. Yeah. Why not? Exactly. Yeah. We need that. We know this pro.

13:09 It is running. It's running. You want me to test it? It doesn't no. I've just tried. Wait. Oh, what was that? It was me. Sorry. No. We haven't we haven't got access yet. What's the node port for it? 30000. You can curl it on that machine, I guess. 30000. Right? Yeah. It's still crashing. Oh, it's still crashing. Yes. Oh, It's restarting. Okay. Any jobs? Pods and bugs changed. They will be killed and recreated. I just turned on my timer off. I have no idea what's going on with that timer. It went back up again. Alright. Let's let's start having a look what we've

14:23 got. And so I'm gonna check if we've got any webhooks in play to start with. Mhmm. So there's none. That's that's good. What if we check the kubelet logs on the node where the pod is running? So which one is that then? There's a worker two. So if we we switch over to the worker two, David? Of course. Session opened. K. In the meantime, I can check on the control plane if there are network policies or any cron jobs. Thank you. Cool. Cron jobs. Which one was it? JF6Q2. There's another one there. So do wanna give this or the bad

16:17 news? Both? No. Well, there isn't any good news. I never actually added any logging to the clusters application. So it could be working. It could not be working. Cool. Alright. I mean Alright. Let's switch back. It doesn't look like it's working. Right? Let's let's just I'll try I'll load it up here because we're not we're not done that yet. So thirty after three. It's it's not returned as a thing. Well, join the control plane or worker two? I'm back on the control plane. Alright. This is a here. So I believe the port's correct from what I remember. Is is was it

17:29 8080? In the Yeah? Yeah. In a whale, but yeah. Resources look fine. There's not really anything else going on here. Mark Ross is in the chat saying, remember, there are hints in home if required, but I I I don't think you're there yet. It's too early for that. I mean, there's other stuff from this deployment spec you could remove that could potentially cause a CrashLoopBackOff. Right? I mean, you've deleted the probes, I would delete one other thing just to be sure. So get rid of the resource limits. Good call. Because my code is perfect, so we don't need

18:51 resource limits. We don't need probes. It just works. Sure. I mean, it's running the Rust, so it can't crash. Yeah. That looks healthier. Access. Check services, Mary. No. It's still crashing. Is it? Yeah. It is. Alright. It is crashing. So it took about thirty seconds before the first crash. Does that give you a hint? Rest period is at thirty seconds. Right? So maybe there's no pressure. So that last one's been modified. I think that was me. Yeah. Yeah. I don't want you chasing a red herring. That that was me fixing KubePit. Okay. My my favorite bug that I have

19:39 App Pod Still Crashlooping (Observing 30-second loop)

20:44 reported this week, though, if you open the KubePit manifest again, this will just take, like, one second. The spelling of Prometheus. Oh, wow. Prometheus. So I I got this a a pull request to. Hopefully, that gets there. But there you go. That's why I had to modify that file because I spelled it correctly, and it was crashing. Nice. Okay. Alright. So we have a pod that continues to crash. We do. At least Postgres is happy. It is. Check service for post custom. Alright. What have we got here? So absolute. You seem to be confined in your search

22:00 Considering Looking Beyond Kubernetes

22:00 to the Kubernetes API. Maybe maybe look on the nodes if there's anything else. Yeah. I mean, I I don't know. I I'm I'm speculating as much as much as you all are. But at this point, I I think I'd start looking beyond Kubernetes, potentially. Okay. Back to worker two. Yes. Let me just move it to another worker. Could do. Yeah. What do you think there's something specifically wrong with worker two? It's still in worker two, so let's stick to worker two. Yes. Yeah. Let's switch, please, Yeah. Ah, okay. You move. Oh, you're you're going to try and reschedule that pod then.

23:07 Yes, Stefan. You're right that a pod without logs should not run in any cluster. I do apologize. I I will add logging for the next episode. You are as welcome. Right? Let's wait for thirty seconds. There we go. Yeah. There you go. I'll go. Good. Alright. Do we do we wanna switch over to that node then and and have a look, see what's going on there? Yep. Alright. So worker one? Yes, please. So it does the same on worker one and worker two. I looked at emptying the crown and didn't find anything. Alright. One of your colleagues is in the chat.

24:08 Hint: Network Timeout Suspected (from chat)

24:08 Hey, Joe. Thirty seconds feels like a network timeout. Oh, it's okay. Do we think it's Cilium then? Well, why don't we spin up another container and we try to ping, I don't know, the API server from the internal service IP and stuff like that to check if networking works? Mhmm. But would networking cause a CrashLoopBackOff? It shouldn't. Right? If your image is trying to reach something else and can't, maybe it crashes. Oh, my application does nothing until the first request comes in. Okay. K. So something something requesting something from the from the application?
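The "thirty seconds feels like a timeout" hunch can be made concrete by measuring how long each container instance actually ran, since CrashLoopBackOff adds a growing delay between restarts and start-to-start gaps would mislead. The following is a minimal sketch under stated assumptions: the timestamps are invented for illustration (in practice they could be read from the pod's container state), and the function names and the three-second tolerance are hypothetical choices.

```python
# Minimal sketch: check whether every container run lasted about the same
# fixed time, which hints at a timer or timeout rather than a random crash.
from datetime import datetime


def run_durations(runs):
    """Return seconds between start and finish for each (start, finish) pair."""
    durations = []
    for started, finished in runs:
        delta = datetime.fromisoformat(finished) - datetime.fromisoformat(started)
        durations.append(delta.total_seconds())
    return durations


def looks_like_fixed_timeout(durations, expected=30.0, tolerance=3.0):
    """True if there is at least one run and all runs are near `expected` seconds."""
    return bool(durations) and all(abs(d - expected) <= tolerance for d in durations)


# Made-up (startedAt, finishedAt) pairs for three restarts of one container.
RUNS = [
    ("2022-01-01T10:00:00", "2022-01-01T10:00:30"),
    ("2022-01-01T10:00:40", "2022-01-01T10:01:11"),
    ("2022-01-01T10:01:31", "2022-01-01T10:02:00"),
]

if __name__ == "__main__":
    durations = run_durations(RUNS)
    print(durations, looks_like_fixed_timeout(durations))
```

A consistent run length only narrows the search; it does not say whether the timer lives in the application, the network path, or something running beside the container.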

25:13 I mean, you could wait for it to come back online and then nobody touch anything and see what happens. The do nothing approach to debugging. So oh, you can't see when I highlight things, can you? So there's a there's a line in here. Hold on. Let me anyway, that line. Does anybody know what what that is? Those oh. Error syncing pod skipping failed to so it's failed to start the container. Check containerd configuration on the one. Yeah. Hold on. Where are we for containerd? Right. You have twenty minutes left. Okay. So we've got it. It's showing that

25:51 Checking Kubelet Logs (Failed to Start Container)

27:01 that's that's good at least. Oh, that's interesting. Emissary showing is not ready. I mean, just ignore Emissary for now just in case. I'm sure I fixed it in all the clusters, but but but I just I don't want you to chase that one. I feel weird the cluster says ready because the Kubelet logs that you pulled out suggested that I wasn't able to actually start the container. So I'm a little bit confused. But that does appear to be a running custom. Otherwise, there Yeah. Wait. The deployment said something about progress deadline seconds having right? That usually just happens

28:14 if if the resource set like, if the deployment can't be updated. So there are hints. Can we Remember if we need to look at the first hint. What do we what do we think? There's a comment from Russell in the chat who spotted a sleep 20 in the process output. There's a nice thread for that. There is. Yeah. It's a pause shim. What where are you looking? Oh, you got it? Okay. Yeah. Hold on. Yeah. Maybe let's process the information as a tree and see if you can find a hierarchy. Yes. Somebody remind me what the Maybe e f

28:51 Investigating Node Processes (Suspicious Sleep)

29:48 might be a good one. A f. Oh, Sorry. That's That's not enough. ps auef. I can never remember. auxef. Yeah. I know. Keep adding letters. Oh, there we go. There we go. Right. Okay. So so Yeah. That looks suspicious. It does. But it looks like it's just in the root directory. Assuming it's not in that container, of course. But I would I would check the root file system. What if we try to schedule the pod on the control plane? Yeah. Could try that. Can we switch back to the the control

30:54 Attempting Pod Reschedule (to Control Plane)

31:06 plane node? Yep. Done. Cool. Copy and paste, I think. So what is it you're trying to do just now? Move it to the master node, to the control plane. Yes. Yeah. That was the plan. That's the hard way of scheduling it there. Just drop a node name in there. Come on. Yeah. Yeah. Let's do that. Let's do that. Where is And it's like isn't it? No. Node name. Was it node name? Yeah. You can just do node name and then do wildlife studios dash control plane one. Whatever the name is. I can't remember. It's wildlife studios dash control dash plane dot

32:48 dash one. Yes. You may have to put it as not under spec template spec node name. It would be spec template node name. No. It would be on the pod. What'd I do there? Yeah. Try there. I can't remember. Just keep pacing the place, please, to the work. No. I I I I guess that's the ten minute. Schedule. Because we're not scheduling the deployment. We're scheduling the pods. Yes. It's in in there. Yes. Not there, but inside spec. It was inside spec. That was right, actually. That's what documentation says. Hey. Come on, chat. Tell us what I'm

33:50 doing. Where where do we put node name? They they always fix it for me. The Rawkode scheduler. Yeah. Russell's seen this this before. Where's your customer? There we go. You may just have to delete the pod to force the reschedule. No. It didn't update. Run get pods with dash o wide. Where is it? The the the new name's gone. There we go. Yes. There now. Cool. Running. It restarted already. Oh, that was That was fast. Yeah. I would maybe check a hint. You got somewhere between ten to fifteen minutes left. Okay. Yeah. So oh, wow. It's a it's

35:35 Requesting First Hint (Pause Container)

35:43 a lot. That that does not fill me with confidence. It lives next to your application, and it's usually unnoticed. Is it the sidecar? I guess it's the pause container based on the things we've seen so far. Right? But you won't see that in the deployment spec. I've I've just checked the pod now. There's no there's no other container. It's just it's just the one. Did they replace the pause container in in the API server config? Yeah. Good call. Check if you see it. Not. Wrote it. No mention of pause in the API server. It's not a it's not a kubelet setting,

37:09 is it? It might be. It might be. kubelet or containerd. Pretty sure it's not in the CLI. It might be kubelet. But containerd could be configured to pull a different image. I don't know how nasty Wildlife Studios would be. Oh, there's no containerd. So if you run systemctl cat kubelet, it'll show you where all the configs are. Like look at that. Oh, wow. Okay. Cool. So there's the var lib kubelet config. There's the flags dot env and the default kubelet config. There's three things there you probably wanna look at. Yeah. So when I look at the

38:12 check this one out first, I think. So nothing in there is jumping out at me. What's the next one? Let's have a look at the environment flags. Mhmm. Yep. So there's the there's that pause image. Yeah. But that looks Seems that snow. Right? How long we got left? Around ten minutes. Let's pop another hint. Yeah. That's That's it. I'm stumped. I I have I I don't know what we should be looking at just yet. Yeah. We know that. Right? So we're definitely it's something that there was a pause container. It's definitely something Run a container,

39:35 the dump config. Oh, hold on. There we go. Oh, yeah. Close that. Oh. That's that's not what I wanted. Oh, come on. Config dump. Yeah. Okay. There's none. They could have pre-pulled a pause container to the machine that's been painted. Marcus says there's three things broken on this. Let's just pop the hints on part one and see if we can get into part two. Okay. So check part two. Hint one dot two. Right? We've got to that yet. Two. Yeah. Okay. Oh, my gosh. Come on. We know. Yeah. Mark was just saying Yeah. You have

40:58 Debugging Pause Container Image

41:08 tampered with the pause image. So if you remove that from the Move it from the nodes. Oh, god. Does anybody know how to do that with containerd? crictl image rm. Try crictl images ls first. Yep. And then image rm and yeah. There you go. Two of them. There's a two meg and a 300 kilobyte and a 200 kilobyte. I would just start deleting all those image IDs. So they're still there. That's not the right plan. Oh, rmi. rmi. Cool. That's that one gone. Oops. Well, that's the only one you should

42:43 Removing Tampered Pause Image (crictl)

42:47 have to remove. So it should now pull a tagged version. So try deleting your cluster pod. Maybe? Hey. Let's see if it works. Doesn't work. Nope. Nope. Doesn't work. Alright. Next problem. Check endpoints. Okay. The pod is working now because if you curl in the pod itself, it's responding. So that's something. Cool. So it's now just a case of the ingress not working. There's no ingress? Is it No. It's not ingress. Sorry. It's it's no. Yep. I ran a curl on worker one, and it's timing out. So Add dash capital L? Yeah. I can't I can't type, it seems.
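The pause image hunt above amounts to listing the node's images, picking out everything named pause, and removing the suspicious entries by image ID with `crictl rmi`. As a rough illustration, the sketch below filters text laid out like the `crictl images` table (IMAGE, TAG, IMAGE ID, SIZE). The rows, IDs, and sizes are invented for the example and are not the ones seen on screen, and the helper name is hypothetical.

```python
# Minimal sketch: pick out rows for a given image name from text shaped
# like `crictl images` output. SAMPLE is invented data for illustration.

def find_images(listing, needle="pause"):
    """Return (image, tag, image_id, size) tuples whose image name contains needle."""
    matches = []
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Data rows have four whitespace-separated fields.
        if len(fields) == 4 and needle in fields[0]:
            matches.append(tuple(fields))
    return matches


SAMPLE = """\
IMAGE                        TAG      IMAGE ID        SIZE
k8s.gcr.io/pause             3.5      aaaaaaaaaaaa1   301kB
k8s.gcr.io/pause             <none>   bbbbbbbbbbbb2   2.1MB
docker.io/library/postgres   14       cccccccccccc3   130MB
"""

if __name__ == "__main__":
    for image, tag, image_id, size in find_images(SAMPLE):
        # An untagged or oddly sized pause image is a candidate for removal.
        print(image, tag, image_id, size)
```

Comparing the sizes and tags of the matches against a known-good node is what makes a tampered copy stand out; the filter alone cannot tell which row is the bad one.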

43:58 Debugging Service/Network Connectivity (Curl Timeout)

44:41 Node to pod also works. So maybe it's a service problem. Mhmm. Mark in the chat says no iptables has been used on this cluster. You got six more minutes. Are there any Kubernetes ways to restrict network traffic? Not a bunch There no there was no network policies in there. This is a Cilium cluster. Check the Cilium ones. I checked the check the Cilium ones too. Yeah. The cluster wide and the Yep. PCMP. Yep. And that goes the idea. Yeah. Is calling the service working? Oh. Cilium has a ConfigMap. Right? Is it? Let's look at that.

46:20 Checking Cilium ConfigMap (`enable-policy`)

46:23 Let's check how it's configured. Maybe the kube-proxy or the the the node port implementation. Thing is, I don't know enough about Cilium to know whether this is default or not. So if I remember correctly, there's this enable policy default line, which I think may block everything by default. We can set that to never. You can set that to yep. Which might be a good first step. What's it called? Because this is Kubernetes, we have to rotate the pods. When the Cilium pods come back up, they do provide a CLI. You can do Cilium status inside one of those pods.
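For context on the `enable-policy` key discussed here: Cilium documents three enforcement modes, `default`, `always`, and `never`. In `default` mode an endpoint is only restricted once a policy selects it, `always` enforces policy on every endpoint even when no rules exist, and `never` disables enforcement. The sketch below encodes that as a small lookup over a ConfigMap's data. The dictionary passed in is a made-up example, not the cluster's actual ConfigMap, and the function name is hypothetical.

```python
# Minimal sketch: explain what a Cilium `enable-policy` value implies.
# The example data dicts below are invented, not taken from the episode.

POLICY_MODES = {
    "default": "endpoints are unrestricted until a policy selects them",
    "always": "policy is enforced on all endpoints, even with no rules",
    "never": "policy enforcement is disabled for all endpoints",
}


def explain_enable_policy(config_data):
    """Return (mode, description, denies_without_rules) for a ConfigMap data dict."""
    mode = config_data.get("enable-policy", "default")
    description = POLICY_MODES.get(mode, "unknown mode")
    # Only `always` blocks traffic when no policy has been written.
    return mode, description, mode == "always"


if __name__ == "__main__":
    print(explain_enable_policy({"enable-policy": "always"}))
    print(explain_enable_policy({}))
```

As noted in the transcript, the agent pods have to be restarted before a changed value takes effect.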

47:47 Attempting to Fix Cilium Configuration

48:08 Maybe worth doing. Oh, yeah. Cool. Yeah. Shouldn't there be three Cilium pods? I have no idea. Alright. It's a. Oh, is it? And so, yeah, that probably should be then. Did you exec into the Cilium pod? No. Checking the the DaemonSet first. Not seeing anything. Last minute. It's not running on the on the the control plane node. What's the container called? Cilium agent. You can just do Cilium status, and it'll run it in a default container. Okay. But delete dash agent and then just space status. Oh, I'm it was called Cilium. We just run bash again, so you don't

50:46 Yeah. We're gonna have to add a dash IT, though. Well, like, it's got bash. Alright. Well, we are all out of time. That was a tough one. Well Alright. I'll let you jump off the call. I'll get the way I'll let you just people in. They can tell us what they did, and then we'll take a look at your cluster. But thank you. I mean, that that that's stumped me. And Marcus has been on that before, and he's always lethal. So but good job. I'll just cut this on a bit. Yeah. Thanks. Alright. Let's get Wildlife Studios involved.

52:05 Introducing Team Wildlife Studios

52:05 Hello. How are you? Hey. What's up? That was tough. Can you hear me? Hey. Yeah. Yep. Yep. That was Oh, yeah. Yeah. Sorry for that, man. Sorry for that. It was fun. It was fun to watch, though. We we thought we left some crumbles along the way. You know? You know, the the sleep process and and all those things. But, I mean, these guys made an amazing work. It wasn't easy. It wasn't easy at all to understand what was happening. Alright. Well, why don't you well, back before you tell us what happened in that cluster,

52:38 just wanna both introduce yourselves, say hello, and then we'll get started. Cool. We I can't hear Leonardo. Oh. Yeah. We can't hear you, my friend. Maybe he's muted, but I can introduce myself in the meantime. Hi, everyone. My name is Marcos Niels. I've been here before, so we are, like, Internet friends with David. Hopefully, we can meet each other in person, David, at some conference or something soon. And now can you hear me? At Wildlife's We can hear you now. Yep. That's better. Nice. So I work at Wildlife Studios with Leo. We are part of the engineering team. We do

53:16 a little bit of Kubernetes gaming stuff as well. Of course, we are hiring people if you want to join us. It's an amazing company. And we are here to have fun, learn Kubernetes, and share our knowledge with the the rest of the world. Thank you. Leonardo, do you wanna introduce yourself? I'm I'm Leonardo. Everybody calls me here. It's my surname. And I've been at Wildlife for two years now, I think. Went to to two years on March, if I'm not mistaken. And yeah. I've always liked it to be an SRE, and that's it. Awesome. Alright. So quite a bit the lag, I

53:57 think. Let me move my screens here to the side. Yeah. It's a little bit. Oh, there we go. That seems to be better. Much better now. So tell us what happened to that cluster that pure cluster. Nice. So so I Go ahead, Daniel. Started to we started to to mess with Cilium because we use Cilium here at Wildlife. And I know I think it's more of our expertise. But at that at some point, Marco said, like, we have a lot of problems with network already. So let's move to another another let's break another thing. And he thought of the pause containers

54:08 Wildlife Studios Explains Cluster 1 Problems (Cilium & CNI Spoofing)

54:36 itself. But two problems were related to containerd and CNI, and the other one was the pause container. Yeah. Tainting the pause container was pretty nasty. But then messing with Cilium. I mean, nobody knows how to configure Cilium. I just always trust it works. Exactly. So to give them, like, more concrete things about what we did is, like, initial first, we disabled Cilium. Right? We downscaled the DaemonSet. Right? So Yeah. And we renamed there's another deployment that is called cilium-operator, which is the operator, and we renamed that deployment to Cilium.

55:14 So they think that the DaemonSet was up, but instead the operator was up, which was not providing any connectivity to the cluster. Right? Yeah. I thought those were very names of the DaemonSet. Doesn't have the node names on it, but, you know, they were we were just running out of time to start debugging that. But it that's tricky as weird. I was like, that's not a DaemonSet because they have a very special name. Exactly. We also what is some parts on the we also configured some parts on the Cilium itself so it wouldn't

55:41 route for the service. Right? And when they if they discovered about this the DaemonSet, when they started the DaemonSet, it would Cilium would it would configure itself and almost everything would work. Almost. Only the the application wouldn't. Like, CoreDNS would and the Kubernetes itself would pod to pod connectivity, but not the server itself. So, yeah, there there there is also Yeah. The configuration of this config map that we did. The the interesting part of this is that if you downscale Cilium, the nodes are gonna become not ready. But as you saw, the nodes were

56:18 ready. Right? Mhmm. So what we did instead is, like, we tricked basically the kubelet to believe that the node was ready when Cilium was not up, basically. So that's why they believed everything was working where on like, we actually put a a CNI configuration, like a dummy CNI configuration file on the on the host. Like, yeah, basically, making making Kubernetes believe that everything was working when it wasn't working. But we didn't change any any like, besides the the pause image, we didn't change any outside the the normal configuration of Kubernetes. Right? We didn't Of Kubernetes. Yeah.
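The dummy CNI file trick works because the node only needs some valid network config in the CNI directory (conventionally `/etc/cni/net.d`) to report the network as ready, and the container runtime generally loads the first config file in lexical order. A minimal sketch of that selection rule follows. The filenames are hypothetical, since the transcript does not name the dummy file, and the extension list reflects the common `.conf`, `.conflist`, and `.json` convention rather than anything shown in the episode.

```python
# Minimal sketch: which CNI config file would be chosen from a directory
# listing, assuming first-in-lexical-order selection. Filenames are invented.

CNI_EXTENSIONS = (".conf", ".conflist", ".json")


def pick_active_cni_config(filenames):
    """Return the lexically first CNI config filename, or None if there is none."""
    candidates = sorted(name for name in filenames if name.endswith(CNI_EXTENSIONS))
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # A stray low-numbered file would win over the real plugin's config.
    listing = ["05-cilium.conf", "00-dummy.conf", "README.txt"]
    print(pick_active_cni_config(listing))
```

Listing the directory and comparing it against what the CNI plugin is expected to write is a quick way to notice a planted file like the one described above.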

56:52 System or yeah. Exactly. Everything was Kubernetes based. Wow. That is that is clustered on hard mode. That is harsh. They're taking the CNI configurations for the pods already and then alright. Well, let's see what Giant Swarm have in store for you. So screen share is up. I have connected and started a session on their control plane node. So if you could please join a session and give me an echo hello or anything to let me know that you're there, and best of luck. Let me get the link. Thank you very much. So is gonna be

57:11 Starting Debugging Session 2 (Giant Swarm's cluster)

57:25 is gonna be driving. I'm gonna be taking the chat and, you know, helping from far away. Sounds good. Let me open the the machines here. The control plane's here. So hello. Oh, I think you started your own session instead of joining main. Oh, okay. Let me it's on clusters on activity Activity. Active sessions. Yep. Giant swarm, control plane, joint session. Yep. Perfect. I'm the leader as well. Hello. Awesome. That's it? Yep. Nice. Best of luck. K. Let's do this. Oh, sorry. Sorry, Max. Sorry. Go ahead. Let's start with the alias configuration. CTL. We need your code, Scott. We're gonna be

58:05 Debugging kubectl Binary (Found replaced binary)

58:21 looking at the /etc configuration, admin. So, oh, there is no completion here. Right? No. It seems like they already messed with the kubectl binary. So I would do an apt reinstall because I think it's gonna be faster than trying to fix it. Yeah. It looks like a Wise words. Download. Let me do just You can do an apt, I believe. apt install, reinstall kubectl should work. Yeah. The flag is install --reinstall. Oh, James. So Let me Probably just We have something in

59:23 the, yeah, in the path. Right? Oh, shit. Not sure. Let me just be sure this, okay. Maybe. which kubectl, you know? Remove that one and try again. Okay. Now it should be the first one. Yeah. Yeah. Like, the Bash basically. Secsorts. Yeah. Yeah. So we can Okay. So we don't have the connectivity to the server? You haven't reexported the KUBECONFIG in your new Bash. Oh, you're right. Thanks. Oh. Do nothing to cancel it. Okay. Down. Yeah. So the API server could be down. Can you check if the, do a ps

1:00:14 API Server Not Running

1:00:21 and see if the API server is running? If what's running? Do a ps and see if the API server process is running. No. So it's down. It's down. Let me check the containers here. Yeah. So probably they guys not here. They changed the manifests on Rawkode container. Right? Yeah. So there is no kube-apiserver here either. So let's check for the manifest. Kubernetes manifest. So they already changed it. Right? Let me get this. A copy. API server. Yeah. Try follow need to do a last. Right? Yeah. So that's my follow. Try running the, maybe they disabled the system,

1:01:29 Checking Kubelet Status & Logs

1:01:39 the unit file. So try doing Yeah. So what starts these manifests? This. Check the status. kubelet. Also, there's an error. CNI plugin not initialized. Not initialized. Okay. Good thing is they're experts on Cilium. K. Oh, ensure lease exists. Maybe etcd is not starting. That's why Yeah. The service is not done. Let me check again the Do we have logs on etcd? Oh, it's not this. I want to Oh, there's no There's no etcd either. Okay. We're on the control plane, so there should be one at least here. One thing that they can do is that

1:02:45 they can fool Kubernetes into using a different manifest configuration directory. Right? So maybe that's why the kubelet is not starting anything, because it's not actually picking the right configuration files. So you Actually, there's a kubelet configuration. I believe it was in /etc/kubelet. Do you recall, David, where it was? You can run systemctl cat kubelet. That shows all the different config options, all the different config places. But /var/lib/kubelet would be a good starting point. Okay. So we need to check the /etc/kubernetes/kubelet.conf. This is the file right here? Or

1:03:10 Checking Kubelet Config Files

1:03:36 I would start with /var/lib/kubelet/config.yaml. Yeah. Let's go there. Oh, okay. Yeah. Okay. I'm gonna open this on the web as well in case I feel like Vim's coming. So It's not that big. Okay. What? What did you open? Yeah. I think it's because So the static pod path, I think it was there. Right? So it's /etc/kubernetes/manifests. Right? Yep. It's right here, the static pod path. Can you do something Your kubelet was failing. Right? Your kubelet wasn't running. Yeah. Exactly. Wasn't running. Did I start it? No. It was running.
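The check performed here comes down to reading the static pod path out of the kubelet configuration and confirming it points at the expected manifests directory. A rough Python sketch follows, assuming a flat YAML file with a top-level `staticPodPath` key. The parsing is done by hand to stay inside the standard library, `static_pod_path` is a hypothetical helper name, and a real YAML parser would be the safer choice.

```python
def static_pod_path(config_text):
    """Return the top-level staticPodPath value from kubelet config text, or None."""
    for line in config_text.splitlines():
        # Only consider unindented keys; skip comment lines.
        if line.startswith("#") or line[:1].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() == "staticPodPath":
            return value.strip().strip("\"'") or None
    return None


# Illustrative config fragment, not copied from the cluster in the video.
sample = """\
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
"""
print(static_pod_path(sample))  # /etc/kubernetes/manifests
```

If the value differed from the directory holding the control plane manifests, the kubelet would silently start nothing, which is the scenario the team was ruling out.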

1:04:26 I think it's running. But it was failing. Yeah. There was lots of errors. Still some errors. Yeah. There was a lot of errors. But k. Okay. So can we check the logs and restart the kubelet? So let me just restart kubelet. Yeah. Let me check the logs. Let's see then. Do the journalctl. I think you're gonna need to scroll up a little bit. It's following. Let me, let's go to the end of the file. K. It just complains about not finding this. Giant Swarm control plane. Failed to pull and unpack

1:05:22 Kubelet Failure: Image Pull Error (DNS Issue)

1:05:24 image. There's a failure to pull an image. kube-scheduler. Failure to resolve reference. Seems like, can you try to see if you can pull the image with ctr? Those images, basically. This kube. Let me check here. So copy. ctr. i. It's pull. No. It's not pull. It's i pull, I guess, maybe? I think it's i pull or not. It's image. Yeah. Yeah. i pull. You're right. And this one right here. Oh, it's resolving to localhost. So they tampered with the DNS resolution, probably. resolv.conf? No. Okay. Can you do a

1:06:17 Debugging DNS Resolution (/etc/resolv.conf, dig)

1:06:24 do you have dig installed to check what k8s.gcr? Yeah. Try k8s.gcr.io. Dot gcr dot io. Yeah. So it's resolving to localhost. Wonderful. So So it's DNS. Okay. So it's DNS. So this could be probably a systemd-resolved configuration. I don't know if they installed dnsmasq or something here. It's But I think you can go to /etc/resolv.d, I believe it is. You are always so mean to my clusters. You want to go where, Marcus? There's no Look for resolv.d or systemd, resolved. No. No. No. This configuration. Yeah.
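The symptom found here, a registry name resolving to localhost, can be checked programmatically. A small Python sketch follows, under the assumption that the system resolver (which also consults the hosts file) is the path the container runtime uses. `loopback_addresses` and `resolve` are hypothetical helper names.

```python
import ipaddress
import socket


def loopback_addresses(addresses):
    """Return the subset of IP address strings that are loopback addresses."""
    return [a for a in addresses if ipaddress.ip_address(a).is_loopback]


def resolve(hostname):
    """Resolve a hostname through the system resolver and return unique IPs."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    name = "k8s.gcr.io"  # the registry name checked in the session
    try:
        addrs = resolve(name)
    except socket.gaierror as err:
        print(f"{name}: resolution failed: {err}")
    else:
        print(f"{name} -> {addrs}; loopback: {loopback_addresses(addrs)}")
```

Any public registry name reporting a loopback address points at local tampering rather than an upstream DNS problem.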

1:07:30 Let me check. I'm googling this right now. It is /etc/systemd/resolved.conf. systemd resolved.conf. Yeah. It's a Can you do a dig against that server? Maybe it's a So I should put any DNS and You can just put k8s.gcr.io, space, @ and the IP. So if you look at your resolv.conf again, the main one. /etc/resolv.conf, you mean? Right. So there's Oh, yeah. You mean the options there when they But that's okay. Right, David? Because it's using the systemd I guess we can change this and put the Google thing. Right? It's the same thing

1:08:31 pretty much. Because maybe they messed up with the systemd-resolved daemon, but if we change it here, it should, like, basically avoid using that. Okay. Okay. Nice. So let's start the kubelet now. Restart the kubelet? Okay. Restart kubelet. Now it's hidden. Yeah. Should we restart also containerd? Maybe. I really hope they just put something on the hosts file. I really, really hope it was the hosts file. Yeah. Let's see. It's not gonna hurt, right, to restart everything. Oh, it's still failing. Why is it still Try resolving this. So

1:09:12 Checking /etc/hosts File (K8s Registry Mapping)

1:09:31 there's some configuration in /etc/hosts. Try /etc/hosts. Maybe it's a kubelet configuration, one of the different files that the kubelet has configured. Maybe they configured, like, localhost there. Or containerd? No. There's no containerd configuration. Have you looked at the hosts file? Yeah. I did. Was there nothing there? On the hosts, it was this, the first one. Oh, there's a Oh, no. There was a Yeah. I'm so glad it was the hosts file. It just makes it funnier. What? Wait. That's cool. So when you, oh, wait. Maybe when you are trying to

1:10:23 Vim that, yeah. It's. Yeah. They tried to do a type Vim. Maybe they, yeah. They're basically fooling us there. Just, yeah, copy the content without that line and then echo it into /etc/hosts, and that's it. Right? You mean you want me to get this on the first line. Right? And whatever No. I think you need to, well, the other one is IPv6. So, yeah, I think we don't care. Yeah. Only the first line and echo that into /etc/hosts. A sledgehammer approach, Alex. No. No. No. It's a

1:10:41 Fixing /etc/hosts

1:11:15 hosts. No. No. No. No. It's hosts. Oh. You gotta watch it. Yeah. No. No problem. No problem. Break more stuff. Sorry. You can take another resolv.conf from another node if you want. No. It's just Just that nameserver. Yeah. Yeah. Nameserver. Yeah. Server. Good job. Nice. Let's see now. And now so pull. Nice. So let's see the log from, starts. kubelet. And let's see. CLS. So the kube-scheduler is now running. Let's put the, that's a l s here. The kube-apiserver is not running. Oh, it's running Okay. Okay. Try doing a kube-apiserver.

1:12:17 API Server Still Not Running (Cluster IP Range Mismatch)

1:12:27 Do you have the Get the pods. Or if I do get pods, well, it should work in any case. Right? Yeah. Yeah. Yeah. Same thing. So check the logs of the API server now. Where is it? API. Close this one. Copy. I think you can use autocomplete. Right? Then we have separate range. Service cluster IP range must be at least eight. Oh, okay. They messed with the, yeah, with the IP address of the Oh, that's configured on Kubernetes. Right? On the manifests. On the API server. Right? Manifest. So Oh, they messed up with Vim, so we

1:13:28 Checking API Server Manifest

1:13:40 can't trust. Yeah. We can't trust anything that comes with Vim. Advertise. So that seems to be okay. Oh. There's a cluster IP range line. There you go. Okay. Do you have the correct range, David? Sixteen. Sixteen. Sixteen. Let me just do a damn. I'm going to do a sed here. Uh-oh. No. No. Go with Vim. We don't know if Vim is gonna work. Right? We didn't check that one. Try to open it with Vim to see if it's. And that's all you can Why don't you trust, why don't you trust sed? Yeah. Why

1:14:06 Fixing Service Cluster IP Range

1:14:36 why wouldn't you trust Vim? Right? It's like the That's probably It's already failed us. It's probably just something in the bash profile for when you try to open that file, I think. Okay. I don't know. You can restart the API server now and see what happens. Oh, it should have been restarted automatically. Well, it might take a few moments, but, yeah, it will be restarted because the manifest changed. Wow. Giant Swarm certainly did not mess with Vim. What? Then we got a gimmick setup. Our troubleshooting doesn't agree on that, probably.
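The API server error read out in this part of the session complains that the service cluster IP range must contain at least eight addresses. A minimal Python sketch of that size check follows. The threshold is taken from the error message as spoken, `service_range_ok` is a hypothetical helper, and the `/30` value is an illustrative guess rather than the value that was actually set in the manifest.

```python
import ipaddress

MIN_SERVICE_IPS = 8  # taken from the error message read out in the session


def service_range_ok(cidr):
    """True if the CIDR parses and holds at least MIN_SERVICE_IPS addresses."""
    try:
        net = ipaddress.ip_network(cidr, strict=False)
    except ValueError:
        return False
    return net.num_addresses >= MIN_SERVICE_IPS


# "/16" is the prefix length the team restored; "/30" is only an example.
for candidate in ("10.96.0.0/16", "10.96.0.0/30", "not-a-cidr"):
    print(candidate, service_range_ok(candidate))
```

Running a check like this against the `--service-cluster-ip-range` flag in the static pod manifest catches the problem before waiting for the API server to crash and log it.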

1:15:18 I'm checking that bash profile, but then there's that, but there's definitely something in there. Maybe it's doing two skill kickbox. No? There is no accept mailer. Okay. So let's see the logs. No. There's, it's not that anymore. Still no control plane? You can do so. This one right here. 04:00. Yeah. Still not starting. Cats. Paste. No. It's still the same error. Let's go look. IP range must be at least eight IP addresses. It's /16. Maybe it didn't restart. Try restarting that manually. Try killing the API server process, and it's gonna start

1:16:16 Forcing API Server Restart

1:16:21 again automatically. You mean like this? Yeah. That process was started three minutes ago, which is probably before your change. I didn't find it. So Maybe Should I just click the what? Oh, the process ID is there. Right? Although that's the log file. Oh, so maybe they messed with the issue. Messed with the pass. Yeah. I just found that. Try doing a pgrep -f. It's pgrep, but it's not pgrep for the process itself? Yeah. Same thing. Yeah. It'll be pgrep -f API. -f API. Like this? Oh, sorry. Yes. No. Weird.

1:17:13 We don't see the process yet. Let's see the logs again. Logs. Let me see if it's the same. Let's just kill the pod, I think. Yep. Let's see the variable to kill it. Delete. Grep the API and just see rm and this one. Sorry. This one. So now, yeah. It wasn't recreated yet. You can force the kubelet to restart, Sandeep. Restart kubelet. Power manager. There you go. Now server. Okay. Going up. Just checking out. Oh, sorry. Alas. No. The other one. The other one. Shit. URL logs. No cursing here. It's complain. I'm sorry. You can curse if you want. Don't worry

1:18:25 API Server Starting, Checking Logs (Etcd Connect Failed)

1:18:50 about it. Maybe it's for children. That's the watershed. What's the new one? This one. So Oh, there are multiple logs, so we need to pick the latest one. Right? Maybe we're seeing that in old desktop the last time. That's why. You can always use /var/log/containers for the latest logs. Caught some of that noise. You can grab the ID of the container and just, oh, it's I don't know if you tested this. Is that a tail in /var/log/containers? And then it'll be the name of the, it'll be kube-api.

1:19:29 And it's always the latest one. tail on /var/log/containers. Yep. And then kube And API. kube-api. API. How do we define? We see the, it's the seven. It is the one that starts with seven. Yeah. 70. 7. Okay. No. That's the old one. Still the same problem? Or is that the old one? How do you know if that's the old one or The timestamp on it was 18:43. Okay. So we should try the other one. Try 88. Yeah. That's the thing. Now it seems like it can't connect to etcd. Right? Yeah. 2379

1:20:17 is the etcd port. Try checking the etcd logs maybe. Is etcd even up? Doesn't appear to be, let's check why. Manifest. Last Okay. Image pull policy Always. The image seems to be okay, I believe. List okay. Client URLs. The port is okay. Does it appear on ctr? I don't think so. Or else it would appear on the No. It's in log out there. Yeah. It doesn't. But it's applying every manifest on that folder? Yeah. It's applying. Right? It should be applying every manifest. Yeah. So maybe we can check the kubelet logs to see if it says anything

1:20:41 Checking Etcd Manifest

1:21:33 about etcd. Can you do a grep on etcd in the kubelet logs? So kubelet and do a grep etcd. Right? Oh, sorry. grep. Okay. There are a lot of things here. This from now. Right? I'm not sure. Yeah. But it's just Oh, it's trying to to system. Yeah. The last error. It says dial TCP. The last, oh, the last error that isn't syncing pod. Right? Unable to write in. Status for call, it says. It's, Giant Swarm control plane, dial. It's timing out to dial against Can I do But it doesn't, it doesn't

1:22:37 need, it doesn't need Cilium to start, right, because it connects to the API server directly? Yeah. Yeah. I think so. What is that IP? 139. Because etcd doesn't need to connect to the API server. Correct? So it only starts and then the API server It's the other way around. Right? Yeah. The, okay. So why is it not starting? Try restarting the and see if we have any pods again. I wanna see if it starts and then goes away or something. So I should do that really fast. Right? It's

1:23:33 So if any pods would appear, the logs would be on the directory, but it isn't. Yes. It should be there. Yeah. We don't have any etcd pods. So it's not even starting. Can we check the manifests again on the Kubernetes manifest? Yes. On the chat? No. We need some help, folks. Do we have hints for this? Maybe Yeah. We have. We have. I saw that. We have version. So, yeah, let's use some hints. Check the first hint here. Bring your binary. Yeah. That makes sense. Makes total sense. Hint two. Network. Network. Yeah. That's probably the resolv.conf.

1:23:59 Requesting Hints for Cluster 2

1:24:26 Hosts thing. It was open. Why is there a Oh, flush the iptables. Amigo. You're gonna straight-up flush them? You don't even want a look at them? Yeah. Let's do it. Oh, there are dash. Drop. Drop. Drop. Drop. Okay. So Yeah. TCP, API server port. Just drop, drop, drop. iptables? -F, capital F. Type iptables. Yeah. Then we need to restart everything. Right? Like No. No. Restart the kubelet, and that's it. But I don't have access to the machine now. Oh, we've lost access. Yeah. This didn't happen in the past.

1:25:13 Lost SSH Access (IPtables Default Policy Suspected)

1:25:23 Okay. So Let's see if I can create another session. I'm not sure. Right? Alright. Let's go for it. Go over it at. It's no longer active. You're on the right. It's gone. Maybe we, so we lost the machine? Alright. Let's see if we can get in another way. So mhmm. Okay. Then just restart it, I think. Restarting bare metal is not a fast process. It's not a fast process. But then restarting it could also I don't wanna do things that they did, so that's not that funny also. We have lost access. So let's take us through. All we did

1:26:45 was a flush of iptables. So not even SSH, like, works, David? Okay. It's not even SSH. How long does it take to restart? It's not feasible. Ten minutes. Ten minutes. I have console access. The trouble is the root password is only available for twenty four hours, and I'm pretty sure I'd be beyond that. I don't know if I keep it in my Pulumi code. I can definitely check. But that's funny. Why would the flush prevent us from access? It depends what has been done with the

1:27:31 network. Yeah. Mhmm. No. No. Maybe it's configured to drop everything that isn't there, and it could be that. Yeah. Let's see. I'm seeing if there's anything we can do in the worker nodes, but, yeah, we can't do anything pretty much. The policy for FORWARD is DROP, the policy for INPUT is DROP. So, yeah, that's it. Okay. So the machine is still alive at least. But I don't think I exported the password. One forty five. Okay. And I believe that's where we made it. Yeah. We didn't even start the kube-apiserver. Okay. Yeah.
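The lockout described here follows from how a chain's default policy interacts with a flush: flushing removes the rules but leaves the policy in place. A toy Python model of that behaviour follows. It is a simulation rather than a wrapper around iptables, and the port numbers and rule set are hypothetical stand-ins for whatever allow rules existed on the real host.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    port: int
    target: str  # "ACCEPT" or "DROP"


@dataclass
class Chain:
    policy: str                      # default verdict when no rule matches
    rules: list = field(default_factory=list)

    def verdict(self, port):
        """First matching rule wins; otherwise the chain policy applies."""
        for rule in self.rules:
            if rule.port == port:
                return rule.target
        return self.policy

    def flush(self):
        """Like `iptables -F`: removes rules but leaves the policy untouched."""
        self.rules.clear()


# Hypothetical rule set: remote access allowed, API server port dropped.
chain = Chain(policy="DROP", rules=[Rule(22, "ACCEPT"), Rule(6443, "DROP")])
print(chain.verdict(22))    # ACCEPT
chain.flush()
print(chain.verdict(22))    # DROP
```

On a real host, listing the chains with `iptables -L` before flushing shows the policy, and setting it with `iptables -P INPUT ACCEPT` first avoids cutting off the session.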

1:28:43 I don't have access to the password to go in over the serial console anymore. So I'll tell you what, I'll reboot it, but I just don't know how long that's gonna take to come back up. Do you wanna give it five minutes maybe? If you want, David, in the meantime, we can show what we did to break the other cluster if we have time. I don't know how much time you have. Yeah. It's So seems why is I mean, Teleport is still running. Right? Oh, yeah. Because I saw this on machine.

1:29:22 No. It's not connecting. No. Okay. Alright. Let's reboot it. Let's at least try. How much time do you have left? I mean, I don't mind going over. It's not a big deal. It's just we'll see what happens. We should be able to see the machine reboot at least from the serial console so we can keep an eye on it. But I just can't, I don't know what the flush, why that would I think it's because the default policy for input is, that's it. Right? I think it's because of the default policy

1:30:04 of the INPUT chain. Right? If it's on DROP, I'd say that you do because I flush iptables all the time on Klustered. It's normally safe. Normally. I don't know what's happened there. Giant Swarm says they don't believe that that was them. Well, we can try flushing the iptables on our control plane to see if the same happens. Right? Yeah. Go for it. I'm gonna try that right now. I'll open a session. Don't do that on the control plane because we'd also lose the cluster. Oh, you are correct. You won't have

1:30:32 Testing IPtables Flush (Confirms Default Policy Issue)

1:30:48 Do it on a worker so we can show. Yeah. Alright. I've opened a session on worker one. Okay. I'm gonna join in there. Wildlife Studios works. Joining there. Okay. You can see my screen. Right? Yep. Yep. Mhmm. I'm here. And I think that's -F. Yeah. I'm still connected to the machine. Right? So Giant Swarm did something weird. I don't know what they did, but it works here. Well, our machine is now at BIOS boot stage. So hopefully, we get this back in a minute or two, hopefully. And I'm going to change all my automation to keep track of the root

1:31:50 password so that we can actually get on the machine. -L. Right? So here, the policy for the INPUT chain is ACCEPT, and on their machine it's DROP. Oh, they changed the default policy. That's why Yeah. Yeah. When we flush it, the default policy for the INPUT chain is DROP, so we just get dropped. That's what Yeah. I still have my tab open with the logs of the Oh, yeah. That's right. So what did you see here? On the chain INPUT, it's the first one that should appear. You'll see that on

1:31:55 Explaining IPtables Default Policy Problem

1:32:41 the, oh, see here. The policy DROP. So everything that isn't listed there is going to be dropped. When we flush it, now there is nothing. So it's going to get dropped. But it's only port 6443. Right? No. That was an explicit drop, but everything that isn't there will also get dropped. Right? I think that's the, I'm not very acquainted with iptables, but I think that's the way that it works. Machine is coming up. Yeah. It's getting there because systemd is starting in there. So we're just we're just

1:33:24 we're not far away. Nice. Almost there. In fact, we may actually be okay. We're very close. Yeah. That's it. SSH root, Giant Swarm control plane. I see. Is getting connection refused? It's almost there. We're just waiting for Teleport to get healthy or maybe Oh, it takes a while, okay, to start everything. I have ten more minutes to spare, so let's try to see if we can finally get everybody. Hey. Come on. Control plane. It's annoying that I have a login prompt, I don't know the root password. Well, you always learn something new. Right? Oh,

1:34:31 yeah. It's an iterative process. I will. I'm never going to dismiss that root password again. I don't think we're gonna get access again. Teleport should have started by now. Okay. So if you want, David, we can use these last minutes to show the people the configuration of our cluster. I don't know if it's worth it or we can just call it for today. Yeah. Show us what you did, and then we'll call it quits. Okay? There's just too much networking badness for my blood. These are both cruel, cruel teams.

1:35:00 Wildlife Studios Explains Cluster 1 Problems (Wildlife Studios' cluster)

1:35:11 Alright. There's a session on your control plane, Wildlife Studios control plane one. Show us what you did, and then we'll wrap up. We'll try that machine again before we do it. It sounds good. So do you wanna show, Leo? Or I guess. I'm already there. I can do Okay. Let me join too. You are watching. Right? Yeah. I already did. Nice. So, basically, what we did, oh, I mean, I've been already. Yes. So, basically, what I did is, like, if you do, oh, let me export. So if you do, as we

1:36:01 said in our introduction, if you do kube-system, you check the DaemonSet, it's gonna be scaled down. Right? So Cilium wasn't working. However, the nodes are showing as ready, right, even though Cilium is not ready. So what we did in order to fool, basically, the kubelet that the node is ready, we actually created a configuration in /etc/net.d. No. It's not in c. It's Oh, CNI. CNI. Yeah. Right? So we created this Cilium conflist, basically, fake configuration file, and we are fooling the kubelet that

1:36:39 the node actually has networking enabled when in reality, it's not working. Right? So that's why the nodes show up as ready. So what we need to do to fix this is, basically, we need to edit the DaemonSet. What we need to disable is, like, we only changed the affinities. Right? So this should be Linux instead of AMP. If you check the DaemonSet, the describe is gonna say that it doesn't basically match the affinity. Yeah. We can't see that because of the weird bug on this CLI. But So now that Cilium is running,

1:37:24 the next step from here was basically to exec into a Cilium pod. Right? So I can pick a pod here, like this one, and then I can do kube-system. Then it takes the pod name, and then you can do a sh here. Bash, maybe? No. You need the dash dash before the sh. Right? No. You don't have to. Oh, no. No. No. You have You don't have to. Yeah. Maybe I'm executing the wrong pod. Let me, maybe it's still checking on the old ones. Yes. I believe I'm checking on the

1:38:04 old ones. So let me grab another pod, this one, the short names. For this one. And let's do watch. Okay. Now I'm in the Cilium pod, and you need to do cilium status. Basically, Cilium is gonna show everything here. And if you did, like, cilium endpoint, service list. Yeah. You can do also the endpoint list. Exactly. Only API server, and I don't know what the service is, are basically registered in. Right? So it is missing the cluster service. And the reason why it's missing the cluster service is because we changed a configuration file

1:38:50 in Cilium, in the ConfigMap. kube-system. That is this one. It's cilium-config. Dash config. Yeah. It's basically, we only told Cilium to Yeah. It's just that one. That's what I'm You can delete it. Or Yeah. Basically, this tells Cilium to only apply on the services that have the specific label. So you remove this. You redeploy the app, basically. Let's do, I don't know if I need to touch the services to cluster. Cluster it. Let's change, I don't know, annotation. Let's do macros. This, so the service basically gets modified. And now

1:39:46 if we exec again into the Cilium pods and we do cilium endpoint Endpoint. cilium service. Cilium You didn't, yeah. It raised the endpoint. Right. Service list. I think you need to re, yeah. You need to restart Cilium because it changes the ConfigMap. K. We did Pods in kube-system with the label. kube-system. You must have system. You must have name. Oh, Oh, nice. K. To the postache, and this should be it. Let me exec again into the pod. This one. cilium service Service service. As you can see, now everything should be

1:40:49 working. Yep. That's it. Alright. I mean, we have two members of Giant Swarm with us now, so I'm gonna pop over to camera mode. Oh, you need to restart the deploy. But after that, everything should be working. Alright. Great, Giant Swarm. Why did that flush destroy the world? Good game, mate. So, basically, we also enabled a UFW firewall. Okay. By default, it sets everything to deny other than things that we've automatically allowed. We set an allow in place for Teleport, but you flushed it. Mhmm. So yeah. Yeah. That went away. Do you want us to go through the other

1:41:06 Giant Swarm Explains Cluster 2 Problems (Giant Swarm's cluster)

1:41:46 things that we tackle? Yeah. I was going to ask exactly that. How much were we missing to finish? I So I have to drop guys. Sorry. It's getting late for me, but it's been an amazing experience. No worries. Thank you, Markus. Bye. I'll catch up with the recording later. Thank you. Have a good one. Take care. Cool. Yeah. So you caught the first one with the kubectl binary. So we didn't even replace the one that was there. We just curled down the macOS one and put it earlier in the path, basically. Oh, okay.
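The kubectl trick explained here relies on lookup order: the first executable with a matching name on the search path wins, so a wrong binary placed in an earlier directory shadows the real one. A short Python sketch that lists every candidate in lookup order follows. `find_all_on_path` is a hypothetical helper name, and the demo builds two throwaway directories instead of touching a real system.

```python
import os
import tempfile


def find_all_on_path(command, path=None):
    """Return every executable named `command` on PATH, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Demo: two fake kubectl files, the first directory shadows the second.
    with tempfile.TemporaryDirectory() as early, \
            tempfile.TemporaryDirectory() as late:
        for d in (early, late):
            fake = os.path.join(d, "kubectl")
            with open(fake, "w") as fh:
                fh.write("#!/bin/sh\n")
            os.chmod(fake, 0o755)
        search = early + os.pathsep + late
        for i, match in enumerate(find_all_on_path("kubectl", search)):
            print(match, "<- runs first" if i == 0 else "(shadowed)")
```

More than one hit for a binary that the package manager installs only once is a strong sign that something was dropped into the path on purpose.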

1:42:18 We didn't do anything with Vim. Didn't touch that at all. What happened in the hosts file is we just put a load of spaces. So the host names that we had were off the page. And then we disabled Vim's line wrap in the vimrc so that you won't see it unless you, yeah. It's Sneaky. See. Very sneaky. Yeah. I was quite happy with that one. So you caught the service CIDR range, which was good. What you didn't catch, the issue with etcd, was we changed the host network from true

1:42:58 to false in the static pod. So it wasn't running on the host network. That's why it was looking for a network plugin, which wasn't there. Okay. The other change we did was in the API server. We changed the port that the API server was listening on, but didn't change the admin.conf, kubeconfig, and everything else. So if you'd have got etcd up, you'd have probably So that's why there were the iptables rules for 8443 and for 6443. Yeah. Exactly. Yeah. So, yeah, we had the iptables so that you couldn't access it locally. We had

1:43:34 the UFW so that the other worker nodes couldn't access it. Mhmm. Okay. The final thing we did is we messed with the CoreDNS Corefile. Oh, the Corefile. Okay. Yeah. So in there, we added an ACL block to block type A lookups, which if you'd have looked, I think you'd have probably caught that one. The one that was a bit sneakier was in the kubernetes cluster.local block. We added in `namespaces all` in there. Now all is not a keyword in that case. That will actually match a namespace called all and will only

1:44:23 serve host names for that. Oh, that would pass this classic look. That would pass the eye test, wouldn't it? Oh, yeah. Yeah. I'm quite proud of that one. Just the final thing that we did, well, we covered our tracks by doing a touch on the entire file system so that everything has the same timestamp. So you couldn't see what we changed or anything like that. Sneaky. Sneaky. Sneaky. Yeah. We messed very little with the file system. We only added that file that messes with Kubernetes thinking that

1:45:03 the node is ready, but it wasn't. But yeah. So we That was a good one. That's Yeah. That was the first thing that we thought of, how can we disable Cilium and make the node think it's ready, but it isn't. Yeah. So we have no experience with Cilium, so we were never gonna get that. Next up, learn more about Cilium. Yeah. Yeah. That's on our to do list now, I think. Alright. Well, you're both evil, evil teams. So thank you for taking part in Klustered. I hope you enjoyed that, and I'll see

1:45:33 Conclusion and Farewell

1:45:38 you all again soon. Enjoyed a lot. Thanks for the opportunity, David. Yeah. Bye. Bye. Bye.
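The hosts file trick recapped in the wrap-up, pushing an extra host name far to the right with spaces so it falls off the visible line, is easy to miss by eye and easy to catch with a script. A minimal Python sketch of such a scan follows. The gap threshold, the list of harmless local names, and the sample content are all assumptions made for illustration, and `suspicious_hosts_entries` is a hypothetical helper name.

```python
import re

# Assumed harmless names that commonly sit on loopback lines.
LOCAL_NAMES = {"localhost", "ip6-localhost", "ip6-loopback"}


def suspicious_hosts_entries(text, max_gap=8):
    """Yield (line number, reason) for hosts-file lines worth a second look."""
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        # A long run of blanks can push names past the edge of the terminal.
        if re.search(r"[ \t]{%d,}" % (max_gap + 1), line.rstrip()):
            yield number, "long whitespace gap"
        extra = [n for n in names if n not in LOCAL_NAMES]
        if address in ("127.0.0.1", "::1") and extra:
            yield number, "loopback maps " + ", ".join(extra)


# Illustrative content only: 200 spaces hide a registry name on line 1.
sample = ("127.0.0.1 localhost" + " " * 200 + "k8s.gcr.io\n"
          "::1 ip6-localhost ip6-loopback\n")
for finding in suspicious_hosts_entries(sample):
    print(finding)
```

Printing the file with a tool that does not depend on editor settings, or running a scan like this one, sidesteps the disabled line wrap that hid the entry during the session.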
