Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Track down rogue static pods and remove stale manifests that block cluster startup and service health.
  2. Repair broken service networking by diagnosing Cilium, CoreDNS, kubelet state, and node iptables failures in sequence.
  3. Debug image and policy issues caused by admission webhooks, scheduler misconfiguration, and shell aliases that rewrite kubectl.

Eric Smalling and Carlos Santana race the clock to debug broken Kubernetes clusters. Expect rogue static pods, ZomboCom surprises, broken Cilium network policies, CoreDNS misfires, and a mutating admission webhook swapping images.

Chapters


  1. 0:00 Holding screen
  2. 0:40 Introductions
  3. 1:01 Introduction and Housekeeping
  4. 1:52 Guest Introductions (Eric & Carlos)
  5. 3:51 Challenge 1: Fixing Eric's Cluster
  6. 4:00 Cluster by Eric Smalling
  7. 4:28 Initial Cluster Check & Setup
  8. 6:48 Investigating Unexpected Static Pods
  9. 8:53 Diagnosing Service Connectivity
  10. 12:18 ZomboCom Discovery
  11. 16:28 Removing Static Pod Manifests
  12. 16:51 Worker Node Connectivity Issue
  13. 17:32 Removing Static Manifests via SSH
  14. 18:18 Scaling Up Deployment
  15. 18:57 Diagnosing Image Pull Error
  16. 21:31 Fixing DNS and IP Tables
  17. 24:52 Fixing IP Tables on Workers
  18. 25:55 Deployment Pod Running
  19. 26:04 Upgrade to v2 and App Check
  20. 27:15 Eric's Hacks Revealed
  21. 30:00 Cluster by Carlos Santana
  22. 31:11 Cluster Check & Setup
  23. 32:51 Diagnosing Stopped Kubelets
  24. 33:59 Starting Kubelets
  25. 35:06 Diagnosing Pending Pod (Wrong Image)
  26. 37:45 Pod Image Mutation Found
  27. 40:00 Discovering Missing Scheduler
  28. 46:39 Fixing Missing Scheduler
  29. 48:23 Removing Node Taints
  30. 48:50 Image Still Wrong
  31. 50:00 Diagnosing Database Crashes
  32. 51:16 Diagnosing App Networking (Network Policy)
  33. 52:51 Fixing Network Policy
  34. 53:47 Database Image Wrong
  35. 57:24 Using Hints
  36. 1:03:17 Malicious kubectl Alias Found
  37. 1:05:40 Deleting Mutating Webhook
  38. 1:06:48 Pods Running Correctly
  39. 1:08:10 Upgrade to v2 and App Check
  40. 1:09:18 Carlos's Hacks Revealed
  41. 1:10:10 Conclusion
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.


1:01 Introduction and Housekeeping

1:01 Hello, and welcome to today's episode of Klustered at the Rawkode Academy. My name is David Flanagan, although you may know me across the internet as Rawkode. Today we have a fantastic episode of Klustered, but before we start that, there is a little bit of housekeeping. Please remember to like these videos, add comments, share them, subscribe. It just helps other people find this content and we all have a lot more fun. If you really wish, there are membership options available on the academy to support the channel and to get access to an InfluxDB Complete Guide course, which is currently in flight. We

1:36 also have a very active Discord community available at Rawkode Chat, where you can come and say hello, talk all things Kubernetes, cloud native, Klustered, eBPF, and just so much more cool stuff. It's a great place. I look forward to meeting you. Alright. Today is Klustered. We are joined by two members of the Discord community, our book club organizers, Eric and Carlos. Hello, both of you. How are you both doing today? Hello. Doing great. Alright. Let's start with you, Eric. Do you wanna introduce yourself, please? Sure. So I'm the guy who's here to learn because I'm sure Carlos is going to decimate me

1:52 Guest Introductions (Eric & Carlos)

2:15 here, but not that it's competitive, as we were talking about before the call. Now, I'm Eric Smalling. I am a former DevOps consultant. I've been with VMware, Docker, VMware twice, actually. And I've been doing Docker containers for, gosh, since, like, the 0.6 days. Now I am a senior dev advocate at Snyk. I've been there for almost a year now, so focusing on container security and stuff like that. Thank you very much. Thanks for sharing. And Carlos. Hello. Hello. So glad to be here with Eric. We meet every Friday. We have an open book club on Kubernetes, so anyone

2:54 who wants to can join our Discord. So this is a fun thing because we always spend time on the book, and this time, we're going to actually implement the stuff we're learning. So, like I said, I worked at IBM for about the last twenty years. Lately, around cloud native, I implemented something called IBM Cloud Functions, serverless. And then the last four years, I've been doing cloud architecture with Kubernetes, mostly OpenShift, and, yeah, helping customers get into production with cloud and Kubernetes. And I'm also a technical lead for Knative, so I do a lot of

3:33 open source. Alright. Thank you very much. We have a question from Noel in the chat. Is Carlos wearing a Klustered shirt? Yes. I mean, David is trying to hire me, so I feel like one over one. Alright. Awesome. So we're about to kick off. First, it's going to be Carlos and I working on Eric's cluster. Yep. I have Eric's cluster available. So before we do that, I just wanna thank Teleport. They've been sponsoring Klustered recently. I really appreciate all their support. We've been using their product since, like, the first episode. I love it. You're gonna see us using it

4:00 Cluster by Eric Smalling

4:10 today. It's a wonderful tool. So thank you to Teleport. Alright. Eric, this is where you get to sit back, relax, pop a beer, have some laughs, taunt us in the chat, whatever you wish. Carlos, I'm gonna pause our timer. That's a bit harsh. I'll start that when we go. We have forty minutes. Our mission is to upgrade our klustered deployment on the server to v2 and browse to the application. I'm going to, on Teleport, pop open our root session on control plane one. If you could do me a favor and join this session and just type an echo hello,

4:28 Initial Cluster Check & Setup

4:44 anything to let me know that you are there, and we will get underway. And my mic's still working. Right? It is still working. We can hear you. See if that's me. Perfect. Cool. Alright. So I will start the timer now, Carlos. Feel free to get our kubeconfig set up, and let's check for a control plane. Cool. So let's do export KUBECONFIG. It should be in /etc/kubernetes. Should be admin.conf. Right? Perfect. My alias k equals kubectl. Good call. I'm a fan of completion. I cannot live without completion. So completion, I can never remember

5:39 the solution, but it's always in the help. So that, and then I'm a fan of completion on k. Right? So that's the problem. So is our stuff showing up on your side? It is. Yeah. Okay. Just, like, I'm being anal about my terminal and my Kubernetes stuff. I don't have all my tools here, so I'll see what I can do. To be honest, I didn't know how to set up the completion on the k alias, so that was nice. I like that. I think I did. Right? We're about to find out. Yeah. So

6:19 get, get, yep. Namespaces, network policies, and nodes. So do we have nodes? Oh. So nice. So Eric left me a control plane. Yeah. Control plane. Right? So API. I prepared all week for fixing the API server, so now I don't know what to do. So let's see. We have pods. Right? Yeah. We have... Interesting restarts going on there. Restarts. I mean, things are running, but they've been flapping. Yeah. And the one that we're looking for is klustered. Right? So let's see how our deployment is doing. Look at the format of those pod names.

6:48 Investigating Unexpected Static Pods

7:16 Those do not look like replica set pods to me. Right? You mean these two guys? Yeah. Klustered, and then there's a replica set ID, and then a pod ID, and then a suffix? Yeah. So it's like they're static pods. Sets or something maybe, or static. Yeah. Maybe. Static pods. Where are they coming from? So let's check who the owner is. Right? So we do get pod. Let's look at one of them, worker one. That's -o yaml. And this guy, oh, scrolling is working. And we have who created this thing? Owner reference. Right? Yeah.

8:03 The owner reference. It was created by the node. How come this thing is created by the node? So this should be a static pod. Let me check the deployment is still here. Yeah. Let's check it. Right. Scaled down. Let's see if the replica was scaled down. Let's see how many replicas we have. Spec. Where's the replicas? Replicas? Where's the replicas? Yeah. They're set to zero. You see it? I see it. In fact, if we use less, we can page it together. That might be a bit easier. But yeah. It's... We have ours?
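The ownership check the hosts walk through here can be sketched in Python. This is only a minimal sketch, assuming the pod objects have already been loaded as dictionaries (for example from `kubectl get pods -o json`); the helper names and the sample pod names are hypothetical and not taken from the episode. It relies on two things the kubelet does for the mirror pod of a static pod: it sets the `kubernetes.io/config.mirror` annotation, and the owner reference points at a `Node` rather than a `ReplicaSet`.

```python
from typing import Any, Dict, List

# Annotation the kubelet sets on the mirror pod it creates for a static pod.
MIRROR_ANNOTATION = "kubernetes.io/config.mirror"


def looks_like_static_pod(pod: Dict[str, Any]) -> bool:
    """Return True when a pod appears to be a kubelet-managed static (mirror) pod."""
    metadata = pod.get("metadata", {})
    owners = metadata.get("ownerReferences") or []
    annotations = metadata.get("annotations") or {}
    # A pod created by a Deployment is owned by a ReplicaSet; a mirror pod is owned by a Node.
    owned_by_node = any(owner.get("kind") == "Node" for owner in owners)
    return owned_by_node or MIRROR_ANNOTATION in annotations


def static_pod_names(pods: List[Dict[str, Any]]) -> List[str]:
    """Names of all pods in the list that look like static pods."""
    return [pod["metadata"]["name"] for pod in pods if looks_like_static_pod(pod)]


# Hypothetical sample data: one node-owned pod and one ReplicaSet-owned pod.
SAMPLE_PODS = [
    {
        "metadata": {
            "name": "klustered-worker-1",
            "ownerReferences": [{"kind": "Node", "name": "worker-1"}],
        }
    },
    {
        "metadata": {
            "name": "klustered-5f7c9d-abcde",
            "ownerReferences": [{"kind": "ReplicaSet", "name": "klustered-5f7c9d"}],
        }
    },
]

if __name__ == "__main__":
    print(static_pod_names(SAMPLE_PODS))
```

Deleting such a pod through the API does not help, because the kubelet recreates it from the manifest file on the node, which is why the hosts later remove the manifest itself.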

8:49 Ah, here it is. Right. But we have fake pods. We can try to delete them, but I guess they will come back. Right? I do, unless we want to see if we have endpoints in this thing. I don't know if they're actually harming us yet. I mean, we could just... Yeah. Scale back. They could be running. Right? Yeah. So maybe we can do... they're there. k get service, and klustered on 3,000. Right? So we do localhost 3,000. You have it configured for... Yep. 30,000 that is running on. Yep.

8:53 Diagnosing Service Connectivity

9:34 Not responding. Interesting. Okay. So network policy, just in case. We don't have network policies. k get Cilium... Yeah. CNP and CCNP, I think. CCNP? CNP and CCNP. Yeah. CNP is Cilium network policies and then CCNP is clusterwide network policies. I see it now. Cilium clusterwide network policy. Cilium, what's the other one? Network policies. Yep. Okay. So do you want to describe the service? Yeah. We can describe the service. Yeah. So, yeah, port. It's 8080, 3000. It looks fine. I'm getting endpoints. So let me describe the pods.

10:46 I did I think I already did describe the pods, right, underwear. Happy, I think. No events. Do you see it? Let me see. He's being sneaky. The image. The image. This guy. Eric? So it's a different image. Welcome. I kinda wanna try to present to it. See we have logs for this for this. It is not our image anyway. Right? So we don't Yeah. I'm hoping he's left as a wee a wee message. No. I don't think it's gonna resolve. You wanna see the logs? Yeah. Let's do the logs. It's just asking that we scale up the

11:46 cluster deployment. No. We've not done that yet. We're still exploring the the rogue image. Yeah. I will tell you those pods are hosting something. Are hosting something? They are. They're there. Alright. So if we fix the ingress to the cluster, we should be able to see what the sum is. Do you want to open it in the ingress? Yeah. It's not working. We're getting an internal server error. So if there's some network policy or IP tables blocking access to the node port? Yeah. This looks like just NGINX image, I guess. We can do a k exec inside the

12:18 ZomboCom Discovery

12:28 Yeah. Klustered. Let's see what we have inside. Oh, right. localhost 8080. Connection refused. There. Get the port run. So I'm thinking, like, maybe we don't care about this pod. Let's delete them and get yours running. Oh, but I wanna see what he's left us. Oh, okay. Let's see what's running on this thing. Ah, interesting. What? He's running getty. What sneaky thing have you been doing? So I think this is run as a static pod because it's probably running as privileged, and he's trying to pipe something through from the host, like a TTY console.

13:46 But it's not serving a web server. Right, Eric? Yeah. I think I saw a message in the logs. Since I started it. Yes. There is an NGINX server running in there. There was. I was gonna say crashed. There's no guarantees what I did was stable. And look at the PID numbers as well, actually. This is definitely a privileged container. Right? Oh, yeah. Okay. So we are inside. What's PID? So it's just running as a privileged container. Oh, we are not inside the container, are we? I think so. Yeah. No? No. We are on the host. Oh, no. No. No. I think

14:25 I was inside for a second. Right. Right. Right. Okay. No. No. I know. I'm not... Oh, you forgot -it. Yeah. Right, now we are. localhost. I don't have colors in my terminal. Tell me where I am. Alright. We have HTML. Okay. Right? And you said that maybe the ingress is broken? Yeah. I think iptables or something is blocking the NodePort access because we can't do a curl localhost 30,000. But let's confirm. Right? I mean, let me grab the worker one IP address, and I'll just browse to it in my

15:12 browser. You can do a port forward. Right? There we go. Oh, you got it? Yeah. So the NodePort service isn't working, but if I browse to the node IP, ZomboCom. This may be too old for most people on the call. Welcome. Are you familiar with ZomboCom? Welcome to ZomboCom. I'll leave that for a minute. You can do it. Welcome to ZomboCom. Alright. I've had enough. So maybe something's wrong with Rawkode's ingress? The ingress cannot talk to the thing? So Teleport runs on the control plane, and

16:00 it would appear the control plane can't hit the NodePort on one of the workers. I don't know if we need... yeah. We will need to worry about it, but we've seen what this... You can access it through... okay. Yeah. I think we can just get rid of those static pods now. I think Eric had his fun with ZomboCom and... Okay. That was meant for later. I didn't think you'd solve that one first. Okay. Let's get rid of the static pods, so then we'll do something. We'll work it through. How do we get

16:28 Removing Static Pod Manifests

16:33 to the other worker nodes? Yeah. I'll open a session on the worker nodes now, and you can just join a session. So we got worker one now available for you. Let me see. Maybe I need to refresh on my side? No. It's actually failing to get a connection. Hey. One rule, Eric. There's one rule. I didn't break that rule as of Friday afternoon or Saturday, whenever it was. Alright. I see now. It didn't load for me. I'm trying worker two. Oh. I saw a session on worker one. Oh, but it's not getting a prompt. Yeah.

16:51 Worker Node Connectivity Issue

17:18 Okay. That's not intended. Can you SSH from somewhere? I can. Yes. To worker one? Yeah. We could do it old school. I've got a terminal going in. We just want to remove the static manifest. Right? So I'll just do that and then... Yeah. Alright. The static manifest is gone from worker one, and I'll do worker two. I see. David knows that at one point, I did break rule number one, and he had to reset my cluster. Oh, really? Yeah. That mean you wanted to be? That's how much you hate me, I guess.

17:32 Removing Static Manifests via SSH

18:10 No. It was my mistake. It was not intentional. Okay. Now we get pods. Looks normal. So let's fix our scaled-down deployment, I guess. So let's do k scale deployment klustered. That's --replicas equals... how many replicas do you want? One? Yeah. One's fine. It's creating. It's creating. Let's see. Wow. It's taking a long time. So the Rust code must be big. Yep. Oh, we got an error. Yeah. Let's see. I cannot get to... It is a networking bug. There we go. Failed. So the DNS resolved.

18:57 Diagnosing Image Pull Error

19:12 DNS. Did it? Oh, no. Wait. So the DNS request is... DNS. Image pull. Yeah. Lookup ghcr.io on... why is it going out of my... that could be one of the node IPs, 147.75, actually. Okay. So failed to resolve reference. Failed to resolve reference. Failed to do request. Oh, the HEAD. Right? So it failed the dial TCP. I think it's DNS. Yeah. I'm getting back an IP address. But I think that's the node IP. Oh, yeah. This is 53. So this is, like, the DNS server. Yeah. I think this

19:52 is our control plane or one of our worker node IP addresses, which I can confirm. So wait. So the way DNS works is... oh, I cannot type any longer. We're using CoreDNS? We are. Yeah. And then CoreDNS uses our resolv.conf. /etc/resolv.conf. Does that look correct? So this is... How do I resolve Google? How do I resolve gcr.io? I think I need a... So the 147.75 is the correct prefix for this, that's what I would expect. But those last two octets are wrong from what I can tell. But you don't have a DNS server running

20:49 on these boxes? Not unless Equinix Metal are providing DNS, which I don't think is normally the case. Right? Because the other place would be /var/lib/kubelet. Right? Just because the first two octets are right, I'm inclined to think they're okay. I think we just need to fix the networking maybe. Right. So are you saying that this is not DNS? What do I have to resolve the DNS? Let's check. Bind. Now I need to do the weird 50 variations of DNS and bind tools. There we go. DNS tools. Oh, there we go. Okay. Yeah.
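The sanity check on the resolver configuration discussed here, listing the configured nameservers and asking whether any of them sit inside the node network, can be sketched as follows. This is a minimal sketch under stated assumptions: the file content is passed in as text (on a node it would typically come from `/etc/resolv.conf`), the sample uses documentation-range addresses rather than the real node prefix, and both helper names are hypothetical.

```python
import ipaddress
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Extract the nameserver addresses from resolv.conf-style text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (resolv.conf comments start with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def servers_inside(servers: List[str], cidr: str) -> List[str]:
    """Return the nameservers whose address falls inside the given network."""
    network = ipaddress.ip_network(cidr)
    return [s for s in servers if ipaddress.ip_address(s) in network]


# Hypothetical sample: one resolver inside the node network, one public resolver.
SAMPLE_RESOLV_CONF = """\
# hypothetical resolver configuration
nameserver 192.0.2.53
nameserver 8.8.8.8
search example.internal
"""

if __name__ == "__main__":
    found = nameservers(SAMPLE_RESOLV_CONF)
    print(found)
    print(servers_inside(found, "192.0.2.0/24"))
```

A nameserver inside the node network is not wrong by itself, which matches the hosts' hesitation: the address looked plausible, and the real fault turned out to be elsewhere.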

21:31 Fixing DNS and IP Tables

21:45 How are you going to install something? We cannot; the DNS is not working. Right? So let's... Yeah. Okay. So DNS is not working on this machine, so let's configure it with Google's DNS? Yeah. We can. Although, can we ping 8.8.8.8? Let's check if it's the full stack or if it's just DNS. I don't know if this thing will work. Okay. Yeah. Okay. So let's just swap out the resolv.conf then. So do we do the systemd one or the /etc one? I don't remember. We can just modify /etc/resolv.conf. I

22:27 don't think we need to change the systemd one. And we can just go with, yeah, 8.8.8.8. Server. So try this at first. Right? Correct. And then we should be able to ping Google. That might not be the one, unless we have iptables. Let's go to iptables. Show, dump. Dash L, capital L. That's it. Any drops? None. I don't expect this. Oh, wait. There. It's fine. Right? Yeah. Drop all UDP traffic. Is that a problem? That is a problem. Yeah. Yeah. That's a problem. So we need to flush that rule. I mean, we could just flush them all

23:29 and restart the... With the firewall? We could just do iptables flush, and then we should be able to... there we go. Yeah. That's it? That's DNS? Well, I mean, we also have flushed every Cilium rule, but Cilium will catch up eventually. Yeah. So let's see the pod. Right? And I'm impatient. So let's delete the pod. Try again. You left hints in this? No. I'm your verbal your homework? Started say verbally. You haven't spoke since we started. Yes. I did. Okay. And what's going on in the chat? Anybody making fun? Yeah. We have Noel.

24:31 Hi, Noel. Khalid. Walid. All the book club is here. I don't know. People, I think, want to skip homework for Friday. Alright. So we're still getting failed. Oh, because that's on the worker nodes. So I bet that rule's been replicated, hasn't it? I'll try the session. Alright. Yeah. So iptables -L for UDP. Actually, I don't think there's a check over there. He replicated the rules on all the machines. Oh, okay. Just to frustrate us. Yeah. So, yeah, I guess the DNS was correct.
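Spotting the "drop all UDP traffic" rule on each node can be sketched as a small filter over rule listings. This is a sketch under stated assumptions: it expects rule-specification output in the style of `iptables -S` (the hosts used `-L` on screen), the sample rules are hypothetical, and `udp_drop_rules` is an illustrative helper rather than anything shown in the episode.

```python
import shlex
from typing import List


def udp_drop_rules(rules_text: str) -> List[str]:
    """Return appended rules that match protocol udp and jump to DROP."""
    hits = []
    for line in rules_text.splitlines():
        tokens = shlex.split(line)
        # Only '-A <chain> ...' lines are rules; '-P' lines are chain policies.
        if not tokens or tokens[0] != "-A":
            continue
        # Look at adjacent token pairs so '-p udp' and '-j DROP' are matched as options.
        pairs = set(zip(tokens, tokens[1:]))
        if ("-p", "udp") in pairs and ("-j", "DROP") in pairs:
            hits.append(line.strip())
    return hits


# Hypothetical sample in `iptables -S` style.
SAMPLE_RULES = """\
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -p udp -j DROP
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p udp -j DROP
"""

if __name__ == "__main__":
    for rule in udp_drop_rules(SAMPLE_RULES):
        print(rule)
```

Since DNS lookups normally go over UDP port 53, a rule like this breaks image pulls even when the resolver configuration is fine. Removing only the matching rules would be the narrower fix; the hosts chose a full flush, which, as they note, also wipes the Cilium rules until Cilium restores them.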

24:52 Fixing IP Tables on Workers

25:17 It was just iptables. But if it was DNS, the pods get their DNS from CoreDNS, and CoreDNS from the control plane. So editing /etc/resolv.conf on the workers will not do anything. Okay. So you fixed the other two? Yeah. I have. I think if we delete that pod and try again, maybe we'll get back in. Let's try again. Okay. Delete. Delete pod. Let's do it. Get pods, -w. Oh, it's running. Wee. Alright. So we have... Refresh. I'll check if we have endpoints. Yep. We've got the watch.

25:55 Deployment Pod Running

26:03 Oh, cool. So now we can update? We can try. Deployment. Klustered. Change to v2. I don't know if you explained the rules for the audience if somebody's new. So the idea is to get v1 working and then upgrade to v2. Let's do clear. If you're not a fan of the show, you should be. But -w. Terminating, terminating. Let's see. I think running. Not yet. But sometimes it takes 15,000 refreshes before I get my dance. Haven't worked out why. What's going on? Maybe it's already running. You wanna confirm the image? Oh, yeah. It's

26:04 Upgrade to v2 and App Check

27:00 running already. So let's let's see. Yeah. We got events. Yeah. Are we done? Yep. There we go. That's it? There was a second hack that apparently didn't didn't work. What was the hack? Supposed to be able to do it after just clearing the IP tables. Interesting. Can you do a a a I don't wanna give it away, but okay. Can you do a t c We're done. I'm curious about something on my cluster because, basically, the DNS was just bonus. That wasn't really what I was aiming for there. I was trying to destroy the VXLAN communications

27:15 Eric's Hacks Revealed

27:45 Yeah. And which I did, which is why even without DNS, you weren't gonna be able to get that thing connectable. However, there's a second stage to it that apparently didn't work long term. I'm not sure why. I also had a traffic control command in there that, if you do a tc, space, qdisc, space, show. Q what? D-I-S-C. I-S-C, like disk, space, show. I wonder if this got cleared out. With the iptables flush? There should be a... so there should be something that's basically throwing away a %

28:37 of all traffic on the Cilium VXLAN on all nodes. And it looks like it got cleared out. So I didn't count on something maintaining traffic control behind my back. Yeah. You'll probably find... because if we take a look at the kube-system namespace, those Cilium pods were restarting a fair bit, and I wonder if maybe the... oh, I don't know. I'm guessing. What about persistence on the worker nodes? Did they reboot? How do you persist that traffic control configuration when the server... I thought it auto persisted. I'm not enough of an expert on

29:16 it. This was a bonus that somebody I was talking to said, hey. You could do this too. And then well, that way they wouldn't see it in the IP tables. I thought, oh, that's a good idea. But Yeah. I don't see it on the workers either, so not sure what happened there. I I so you went to the extra you went to the extra mile to ask someone how to be meaner with this this app. Right? Right. I was saying to try. Thank you for not trying. Sorry. Like I was saying, this was this was an

29:47 exercise in trying to David and Goliath, except I have no rocks. Let's put it that way. You are the much more experienced here. So this should be interesting when we flip the tables. Alright. Let's do it then. Okay. Let's drop the session. No. That would be boring. Thanks. Alright. Well done, Carlos. Eric, your turn. So now I go to hold on a second. Let me get over to the right window. Alright. I have got Carlos' cluster. I have opened a root session. Just let me know when you are there. Where is the sessions? I see the chat. I think the people

30:00 Cluster by Carlos Santana

30:39 Just ignore Noel. Noel's being mean. Noel, I was afraid of an eBPF hack on systemd, not that I watch every episode this week, but... There we go. Joining session. Mobile eBPF. Alright. Let's see. You see me? I can see you. Alright. I will start the timer. You set up your config. Check the API server, and let's see what happens. Let's see if we have... we don't have it yet. I think Teleport have fixed the scroll. That's awesome. Someone said in the chat, what are the odds that Carlos's break involves Knative?

31:11 Cluster Check & Setup

31:44 Do you have any admission controllers run with Knative, Carlos? It involves Knative in some way. Yes. It involves Knative, actually. Oh, I shouldn't say anything. Alright. Well, we have an API server. We don't have our two worker nodes. Klustered, it looks like it's not happy. Yeah. It's a little bit broken. And yeah. Alright. Okay. Postgres is scaled down, but it looks good too. And what do you wanna start with then, Eric? Hold on. Let me rearrange so I can see both your screen and mine. Yeah. Take a look. I'm... Well, there's no point. Sorry.

32:31 Go ahead. I thought you were setting some stuff up on your screen. I was gonna say I was just gonna describe a node and see if we could see the not ready states, but I'll... Go ahead and do that. Alright. So I'll pipe it through less so we share scroll. Yeah. Network unavailable false. Node status unknown. I think the kubelet's just not running. Yeah. I think our first mission is to jump onto worker one. So I popped a session on there. Let's check the status of the kubelet. Okay. I'm jumping on that session. Okay. There.
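The reasoning here, a Ready condition of Unknown suggesting that the kubelet has stopped reporting, can be sketched as a read of the node's status conditions. This is a minimal sketch assuming a node object loaded as a dictionary (for example from `kubectl get node <name> -o json`); the sample node is hypothetical and the helper names are illustrative.

```python
from typing import Any, Dict


def ready_condition(node: Dict[str, Any]) -> Dict[str, Any]:
    """Return the node's Ready condition, or an Unknown placeholder if it is absent."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition
    return {"type": "Ready", "status": "Unknown"}


def diagnose(node: Dict[str, Any]) -> str:
    """Give a one-line hint based on the Ready condition's status."""
    condition = ready_condition(node)
    status = condition.get("status", "Unknown")
    name = node.get("metadata", {}).get("name", "<unnamed>")
    if status == "True":
        return f"{name}: Ready"
    if status == "Unknown":
        # Unknown means the control plane has not heard from the kubelet recently.
        return f"{name}: status unknown, check whether the kubelet is running"
    return f"{name}: NotReady ({condition.get('reason', 'no reason given')})"


# Hypothetical node whose kubelet has stopped posting status.
SAMPLE_NODE = {
    "metadata": {"name": "worker-1"},
    "status": {
        "conditions": [
            {"type": "NetworkUnavailable", "status": "False"},
            {
                "type": "Ready",
                "status": "Unknown",
                "reason": "NodeStatusUnknown",
                "message": "Kubelet stopped posting node status.",
            },
        ]
    },
}

if __name__ == "__main__":
    print(diagnose(SAMPLE_NODE))
```

The hint only narrows the search; confirming it still means checking the kubelet service on the node itself, which is what the hosts do next.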

32:51 Diagnosing Stopped Kubelets

33:53 I think it did. Hopefully, he just stopped it. Scroll broke. There we go. You see anything? I don't see any. I think he literally just did a systemctl stop kubelet. I think we should just start it and see what happens. Yeah. It's up and running. Yeah. It looks good. Let's see if we can get some node status. Yeah. It's now ready. So I'll start it on worker two. Okay. And then we'll go back to the control plane. Oops. Sorry. Oh, I'm sorry. Go ahead. I was just trying to get nodes to confirm they were both ready.

33:59 Starting Kubelets

35:02 Go for it. Alright. What's next? So the pods are gonna count. It's pending now. Service up it is. So... The autocomplete thing, like him, I'm not fancy. Yeah. I think I can remember that command. So Carlos did a complete, like, create kubectl, and then he copied this and changed the second kubectl to k. That's now my favorite tip of all time. So nice. Did we load the completion at the start? Say that again? Did we load completion? Yeah. We did. Okay. Yeah. We're good. Take it away. Are you guys trying to see the pod?

35:06 Diagnosing Pending Pod (Wrong Image)

36:16 It was still pending, so I was wanting to see what's going on. If you watched the search magic on Cloud Native TV this afternoon, I discussed the pending state on a pod. No. I did not. Why is our image from Quay? Is that normal? No. That's cheeky. Also, did not rename my username from Rawkode to Carlos Santana. Then you can delete them, the pods, and get them with... Russell's giving me trouble for blindly starting the kubelet service without checking the definition. Whoops. And then Noel wants YouTube to have GIF support. I do not need trolled with GIFs too.

37:07 So the the hint you can look at the hints, but to retry, just delete the pods and then We don't need no hints, Carlos. We got this. Delete the pods so you don't lose time on the other stuff. That would make sense. To deploy the the image looks fine. So Yeah. Delete them. See what happens. They come back. You're a little too was not a little bit If I if I broke it correctly now. It should be okay. Three seconds. Now you can describe it. Yeah. Describe that. Pending is never good. Yeah. More useful message.

37:45 Pod Image Mutation Found

37:54 Oh, the image came back. So I think we need to check for a mutating admission controller. Nope. But pipe. Yeah. Mutating webhook configurations, I believe, is the name. Help me out. Yep. Okay. So we don't have one of those. Do you mind if I... let's describe the replica set. Well, you checked the deployment. The deployment had the correct image. So let's check the replica set and see if that's got the right image. I wanna know if it's happening at the pod level or the replica set level. Now I'm curious why you're not getting

38:42 messages. Oh, I see. I see. What I broke. I forgot that I broke something else. Check check the hints if you want. They're funny. I don't know. Yeah. Check the hints with three minutes left. Okay. No. Don't look at the hints. No. This is this is a good one. Like yeah. Don't don't look at the hints. Don't look at the hints? Okay. Why don't we Yeah. The white spot what's happening? This spot is not being what? What should be what? Yeah. Yep. You found it. Our replica sets. Okay. So the replica set is actually correct.

39:23 Yeah. Our pod was created. So interesting. What is the process of creating a pod? The pod gets created? And then I'll then customer must be care again. What's running in our cluster? Can we do a get pods all? Yeah. There might be something sick. Ambassador. Everything. Yeah. I'm gonna assume he swapped out the images on the controller manager. Let's describe the controller manager. Come on. Autocomplete. Helps to give it pod. I'm looking at your scroll. The image is good, annoyingly. So it's a nice effect. Delete the pod. The pod comes back.

40:00 Discovering Missing Scheduler

40:53 The image has changed. So we're gonna have to check the image on the API server now. Should we go to the static pod directory and take a relook? The worker on the oh, sorry. Which one? On the control plane. Yeah. It's a Kubernetes manifest. Let's see what is changed in here. Oh, beta dog 20. Yeah. We'll need to check the API server too. The image looks good. Assuming this is the right static pod directory, which you could also have changed. That's true. Let's check the API server before we go down another rabbit hole. Image is good. Okay.

42:09 Could be a containerd trick. Like what? Like he's... wait a second. Yeah. Did you... I thought you described klustered. Right? And it was... I was wanting to see if he changed the image pull policy and if it's only loading from cache. Oops. I'm failing. Yeah. Let's grep that for image and see what we get back. Pull policy: Always. I like it. So let's think what's... see, it's really weird because the API server is giving us back the image. Carlos, that's infuriating. Hot. Check that. Noel, we did check for mutating webhooks.

43:36 That is sneaky, Carlos, whatever you've done here. So the replica set is created with the right image. The pod is then being created with one. I think the only way I know how to do that would be for him to swap out API server or to a mutating admission control. Now he could be hiding the mutating admission control from us. Right? Like Yeah. But I'm I'm it's I will be over my head on that then. How would you do that? Rename it rename it something that looks innocuous? It's it's simpler than that. Look at the hint. Check. Oh, yeah. Check

44:16 the hints. I we checked the hints. I I think Unless but don't look at all of them. Just them in order so you see only the beginning. Alright. So we fixed that one. Yep. Fixed. Yeah. You let it down. Is that important? Yeah. That's fixed already. Alright. Okay. Yeah. I wasn't sure what there's a play on words for Kubelet, I believe. Yeah. This is I think you're using a static pod. Alright. Okay. So read read the sentence. You think in action this poor static pod. So you can hide resource. We can hide the the mutate and

44:59 admission controller configuration. Missing in action, this poor static pod. No DNS, but then that's not... scheduler, controller manager, API server, etcd. So what's... Is that the right hint for this problem, Carlos? Yes. Yes. Okay. Missing in action. This poor static pod. I'm going to check. So there needs to be a list of... yep. He did change the static directory. You even said that, and we didn't follow up on it. So I bet he has swapped our controller manager and API server. Yeah. Controller manager is maybe alright. I didn't search a bunch. And API server alright. We're gonna have to

46:14 go through the options then. And so with the API server, he could have configured a static admission controller, which he's not. I have no idea what I'm looking for. I'm just dim scrolling at the moment. Oh, there's no scheduler here. That's true. That is accurate, which makes sense. That's why it's that's why it's pending because it has it doesn't know what node to put it on. But the image has also changed in the pod. Right? Yeah. Scheduler wouldn't do that. I don't think. Okay. Let's bring back the scheduler. That's hiding the fact that yeah. It would

46:39 Fixing Missing Scheduler

47:11 not normally... it would say image pull problem or something unless he's got it. No. He probably has an image on Quay. Never mind. Yeah. I'm gonna trust that the manifests in here are actually okay. I think saying enable the hack. Okay. We have a scheduler now. Let's just kill those pods. Yeah. Let's kill the pod so we don't lose time. That's my sister in the chat, Russell. Delete the pods so you don't lose time on the other stuff. Let's take a look at why are we in Kubernetes with a c? Did you delete it?
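The "missing in action" hint comes down to comparing the static pod directory against the set of control plane manifests one would expect. A minimal sketch, assuming the default kubeadm file names (`etcd.yaml`, `kube-apiserver.yaml`, `kube-controller-manager.yaml`, `kube-scheduler.yaml`); the directory used in this cluster was deliberately changed, so the path is left as a parameter, and the helper names are hypothetical.

```python
import os
from typing import Iterable, List

# Manifest file names kubeadm writes for a control plane node by default.
EXPECTED_MANIFESTS = (
    "etcd.yaml",
    "kube-apiserver.yaml",
    "kube-controller-manager.yaml",
    "kube-scheduler.yaml",
)


def missing_manifests(present: Iterable[str]) -> List[str]:
    """Return the expected manifest names that are absent from the given listing."""
    found = set(present)
    return [name for name in EXPECTED_MANIFESTS if name not in found]


def missing_in_directory(directory: str) -> List[str]:
    """Same check against a real directory; the path is whatever the kubelet is configured to watch."""
    return missing_manifests(os.listdir(directory))


if __name__ == "__main__":
    # Hypothetical listing with the scheduler manifest removed.
    listing = ["etcd.yaml", "kube-apiserver.yaml", "kube-controller-manager.yaml"]
    print(missing_manifests(listing))
```

With no scheduler running, nothing assigns new pods to nodes, so they sit in Pending with no events, which matches what the hosts saw before restoring the manifest.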

48:12 No. Did you describe it? Describe it to see if you have events. Oh, he just tainted the nodes. You happy for us to delete the taints? Okay. So we have a container creating, but I'm still really worried about that image on the pod. I don't know how he's achieved that yet. You're seeing things. The deployment should have the right image. The deployment did. The replica set did, but the pod didn't. I don't know how you did that. Yeah. So now you're surprised. Check the deployment. I think the deployment was correct. Oh, correct. So is it running?
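The taint problem above can be reasoned about offline: a pod only lands on a node if every `NoSchedule` or `NoExecute` taint is covered by one of the pod's tolerations. A small sketch of that matching logic follows; the function names and the example taint key are hypothetical, and the dictionaries mirror the `taints` and `tolerations` fields of the Node and Pod specs.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint.get("key")
    return (
        toleration.get("key") == taint.get("key")
        and toleration.get("value", "") == taint.get("value", "")
    )


def blocking_taints(node_taints, pod_tolerations):
    """List the NoSchedule/NoExecute taints that no toleration covers."""
    return [
        taint
        for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]
```

An empty result means taints are not what keeps the pod off that node; a non-empty result lists the taints to remove or tolerate.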

48:50 Image Still Wrong

49:28 We have blank blank, but they are running. Would you like me to browse to your application? If the database is not working, I don't think it will work. Right? I tested your image. Yeah. I'm trying to think... you've been writing some Rust, Carlos. Yeah. I like Rust. Do we have a database yet? No. Crashed. Alright. Oh, crash? I didn't do that. Check the database. K. Yeah. We're getting an internal server error. Our database is in crash loop back-off. Oh, no. It's restarting. Oh, I need to do that, I think. Are you sure?

50:00 Diagnosing Database Crashes

50:31 It's flapping. Did you edit my StatefulSet? I didn't do anything with the database. Well, if you don't have a database, you should get your angry man. Right? Yeah. So we still have a networking problem. Yeah. So that one, I did. Let's see. Alright. Let's work on the networking problem, Eric. Okay. I was looking at documents for a second. Apparently the Kubernetes with the c, is that his alternative directory, I guess? Correct. So I copied the scheduler and trusted that the other manifests were okay.

51:16 Diagnosing App Networking (Network Policy)
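The network policy in this chapter had a name that suggested "allow" while its spec denied traffic. A rough sketch of how to read a NetworkPolicy spec for that, assuming the policy has already been loaded into a dictionary (for example from `kubectl get networkpolicy -o json`); the function names are hypothetical.

```python
def denied_directions(policy):
    """Return the traffic directions the policy denies outright for its selected pods."""
    spec = policy.get("spec", {})
    # If policyTypes is omitted, Ingress always applies and Egress applies
    # only when the policy carries egress rules.
    types = spec.get("policyTypes")
    if types is None:
        types = ["Ingress"] + (["Egress"] if spec.get("egress") else [])
    denied = []
    for direction in types:
        rules = spec.get(direction.lower()) or []
        if not rules:
            # A listed policy type with no rules allows nothing in that direction.
            denied.append(direction)
    return denied


def selects_all_pods(policy):
    """An empty podSelector matches every pod in the policy's namespace."""
    selector = policy.get("spec", {}).get("podSelector", {})
    return not selector.get("matchLabels") and not selector.get("matchExpressions")
```

A policy for which `selects_all_pods` is true and `denied_directions` returns `["Ingress"]` is a namespace-wide default deny, whatever its name says. An allow-all policy would instead carry a single empty rule, `ingress: [{}]`.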

51:40 More fun hints. We want to look at the hints. No. I think we've got enough threads to pull on right now. I'm curious about the networking. I wanna be able to fix the ingress onto the cluster. And I also wanna check the containerd configuration for using the matter alias. Perfect. No worries. If the pod is coming up and then crashing, is it not logging anything? Any reason? No. I'm saying the first time we see the pod crashing like that, I don't know. But it might be related to some builder hacks. Alright. Let's do curl

52:21 http. Let's check our node port. Broken. I'm gonna do that on a worker node as well. Broken. No. Fine. Okay. That's expected. So the control plane can't reach the worker nodes. What do you wanna check first? I mean, the first thing, just like you did for mine, let's look and see if there are any network policies. Oh, yeah. Oh, look at iptables, you know, I mean, obviously. Default, all pods. I don't think that's allowing. I think that's a denial. Yeah. That's saying the opposite of what it is. Yeah. Since we checked these on your cluster, we'll

52:51 Fixing Network Policy

53:26 check them here. Alright. Good. So we'll try the curl again. Okay. We fixed that. I like that you both used a network policy. I can't believe that... that's not the database issue, though. Wouldn't I? I mean, the database doesn't rely on anything. No. It's self bootstrapping. Does it have a volume that it's mounting that it can't? No. It's all self contained? Yeah. Let's see what happens after a few seconds. No. Press restart. Are you sure you didn't break this? Yeah. I think so. I think I

53:47 Database Image Wrong

54:30 broke it without... Is it the liveness probe? Yeah. Check. I mean, wherever you're going, we don't need probes. Check the image. Just... yeah. I want to help out. Check the image of that database in the pod. Did I delete too much? No. It's okay. I'm just removing them. Okay. Running. Okay. So there's maybe another networking thing then. Is there anything that your app logs that would be more... sorry. Go ahead. Check the images on that database on the pod. Oh, you've done a weird thing with both of them, you cheeky monkey. Yeah. I

55:49 wanted to break one, but I think I broke two. Yep. I hate you so much. I unintentionally broke the database. Alright. So we need to solve this. Right? Like, the deployment, the replica set, like I said, have the correct image. However he's done it, it's gonna be a great hack. It was going to be me dancing, but it didn't work anyway. Alright. I had a hunch that he'd used the containerd config, and he has not. Alright. I'm gonna be honest. I don't know how he's done that without a mutating admission controller.
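The symptom described above, where the Deployment and ReplicaSet show the right image but the Pod does not, is what a mutating admission webhook produces, because it patches each Pod as it is created. A minimal sketch of the response such a webhook would send back to the API server, assuming an `admission.k8s.io/v1` AdmissionReview request for a Pod; the function name and image names are hypothetical and this is not the webhook used in the episode.

```python
import base64
import json


def image_swap_response(review, new_image):
    """Build an AdmissionReview response that rewrites every container image in a Pod."""
    request = review["request"]
    containers = request["object"]["spec"].get("containers", [])
    # One JSON Patch "replace" operation per container in the Pod spec.
    patch = [
        {"op": "replace", "path": f"/spec/containers/{i}/image", "value": new_image}
        for i in range(len(containers))
    ]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            # The API server expects the JSON Patch as a base64-encoded string.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Since the patch applies only to the object being admitted, the owning Deployment and ReplicaSet keep their original image. The place to look is the list of `mutatingwebhookconfigurations`, provided the client used to list them can be trusted.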

56:48 And all the control plane components seem normal. I'm starting to think he could just have modified the DNS to resolve Quay, but I don't know why. It wouldn't say this. It wouldn't update it in the pod. Yeah. It would just be the one time use a different image. You're right. That's a really cool trick, Carlos. I hate you. I have to hit me. I know I said it wasn't competitive, but it's competitive now. Alright? That's it. Yeah. I'm thinking we use the hints, Eric. I'm at a loss, mate. Use the what? I'm thinking we look at

57:24 Using Hints

57:29 the hints. Yeah. They're in order. Yeah. What hint are we on? Well, Dave said that he was going to wait three minutes. Like, what's left to secret hint? It doesn't want to wait. Let's take a look at three or four. I can't remember which. Cannot or node. It may be tainted. Alright. We will do that. We will do that. Yep. We got that. Oh, please allow all, do not deny, policy. Yeah. You got that one. That was the thing that you were going to see, but you

58:15 didn't see. So that was the swap of the image. It was going to be me dancing, but... So the next hint... oh, no. Yeah. This is the hint. I'm sorry. Yeah. I mean, do we wanna go to six? Or I wanna check some more manifests. That's half of the hint in this one, and the next one has the other half. So, yeah, this hint. This problem is split in two. So what does this sentence say? Read it carefully, the hint. The mutants changed the music. You're hinting at

58:54 an admission controller. I just don't know... Music. How you're gonna end. I see someone. The mutants changed the music. I wonder if you could just inject it directly into etcd. Yeah. My kidney function, I had a bug. So I changed too much. I changed both the app and the database. I wanted only to change the app. But, anyway... So, I mean, the API server seems alright. I was checking ps because I was curious. In fact, I said I would start doing this in

59:44 future. I was looking to see what random stuff you've added to my system. Nothing. Nothing. But I thought there was maybe a second API server running on the machine, and we were speaking to the wrong one. Like, API server in my client function in the cloud? No. That's not it. I mean, we could add a blanket iptables rule to block all external traffic on 6443 and see if it helps. Although, who knows what port he's running on? I'm not that smart. I don't know much about iptables.

1:00:44 Okay. So this is our API server. We've got advertise address, allow privileged, authorization. We've got our NodeRestriction admission controller, which is all good. There's nothing wrong with this API server from what I can see. Yeah. This image appears to be alright. I mean, we could always describe it in the pod, but I think I did that and I think it was okay. I'm being really silly, aren't I? Hold on. What? What if he pulled the image and changed the pull policy? This isn't necessarily my API server. Case9Case7, emoji. Oh, you gotta put this namespace first.

1:02:04 Correct. There's two APIs. So it's gonna fix it. Right, Carlos? It's gonna fix it. What did you change? I set the image pull policy on the API server to Always. Because I think you seeded the machine with a fake image, a fake API server. API server is not in charge of changing those. Well, I could do it. The controller manager. Okay. So, yeah, you're telling me I just need to update them all. Alright. Got you. Don't trust anything. No. Oh, I don't think I touched any of these. Well, that makes me very disappointed.

1:03:03 No. Those images are legit. And they're coming from the right place. Right? So... oh, but you think I put them there and cached them? Uh-huh. Yeah. Otherwise... And you're telling me that's not what you did? Nah. Too boring. Alright. What hint are we on, Eric? We're too clever. I'm not that good. Six. Oh, hints. I want to see the hints. It's patient. Yeah. Oh, you mother. That's twice I've been burned by that. If it was. Want to explain to people watching what you're doing? Is Eric driving? What's Eric... No. That's

1:03:17 Malicious kubectl Alias Found

1:04:02 that's has. You should be fixing this, Eric. So can you explain it for folks? I think people are a little bit lost. Yeah. Let me run through that. So if we run alias and grep for kubectl, he's running a script called HVWC. The script called HVWC is grepping for anything where the resource is mutating and returning "No resources found". That's almost as evil as Duffy's fake SSH container. You want to see the YAML of it? What I want to do is fix Postgres and see you dance. So what I'm going to do is... yeah. We'll

1:05:02 get the YAML. It's for fun because you can see the hack. When we're done, remind me to tell you what I didn't do in mine that I could've, that was related. Okay. So you've got this running on app codeengine.appdomain.cloud. Is that some IBM service? Yes. Knative as a service. Nice. Just shoot it. Okay. So that's now gone, but I'm gonna describe the pod. If you go to Kubernetes SIG API Machinery, you will see some chat in there. I was trying to hide it in a different way, but I couldn't. With a validation webhook, hide the mutation webhook,

1:05:40 Deleting Mutating Webhook

1:06:02 but I couldn't. Okay. So I did it this way with the alias. Right? Where's my pod? Did we lose our scheduler? No. Did we lose our controller manager? Check kube-system. Get pods again. I think it'll just take a second. Yeah. I've not got controller manager. What's the problem? I don't know. It doesn't seem to like that change. Alright. It's back now. So we should get our pod. And I think the first thing we're gonna do is edit the deployment klustered. quay.io slash your username slash klustered. I mean, we all wanna see it. Oh, you wanna see it? Oh, you want to

1:06:48 Pods Running Correctly

1:07:10 see it? Okay. Yeah. I think you have an indentation problem. So is the database running now? Yes. Okay. Oh, no. We need to delete the database because I removed all the probes. So that's still your image. Yeah. The chat or else then, the one I see you dance. Okay. Oh, okay. Okay. I don't dance that well. So I was gonna do the same thing and wrap iptables to hide the drop, but Duffy and I were talking, because he's my network go-to, and he's like, oh, that would be mean. Nice. Well played. Alright. Let's finish this.
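The alias trick described in this chapter, and the iptables wrapper mentioned above as a variation, both put a filter between the operator and the real tool. The sketch below models that behaviour and a simple check for it. The wrapper logic is an assumption based on the description in the episode (anything mentioning "mutating" comes back as "No resources found"), and the function names and the wrapper path are hypothetical.

```python
def wrapped_kubectl(args, run_real):
    """Mimic a wrapper that hides any request mentioning mutating resources."""
    if any("mutating" in arg.lower() for arg in args):
        return "No resources found"
    # Everything else is passed through to the real command untouched.
    return run_real(args)


def suspicious_aliases(alias_output, command="kubectl"):
    """Return alias definitions, as printed by the shell's `alias`, that redefine the command."""
    hits = []
    for line in alias_output.splitlines():
        definition = line.strip()
        # bash prints "alias name='value'", zsh prints "name=value".
        if definition.startswith("alias "):
            definition = definition[len("alias "):]
        name, _, _target = definition.partition("=")
        if name == command:
            hits.append(line.strip())
    return hits
```

On a real machine the equivalent check is running `alias` or `type kubectl` before trusting any output, which is the step both contestants say they skipped.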

1:08:10 Upgrade to v2 and App Check

1:08:15 I forgot what I'm doing. Alright. Set to v2. Yeah. We gotta change it back to yours. I'm assuming it's just gonna work. Right? I would hope so. I said that I was mean by this time, so I said, like, it's gonna be day. There's no more hints. Right? I think you found the hints. You can check the hints. There's something else. And that should be... I guess it's cached. Even though you have it set up to... there you go. Ask the dancer people like, not me. Very nice, Carlos. I can't believe we got burned by an

1:09:18 Carlos's Hacks Revealed

1:09:20 alias. You don't run alias the first time you go to set up kubectl? No. I'm too trusting, which I should look at right now. I took the chance: oh, they might just run it. So that's the reason auto completion didn't work, because I had kubectl set to an alias. So you folks set up auto completion correctly, but it wasn't working because of the alias. Damn it. Alright. Well played. Two fun clusters. We got there. Cool. It was fun. I'm raging about that, I guess. I've gotta say. Like, I'm gonna have to write

1:10:04 a new list of rules now. Alright. Lots of love for the episode in the chat. This episode was awesome. Yeah. It was good fun. It was nice to do this. Please update the official v2 image with the Carlos dance. No. No. I think we will need to find a way for that to survive somehow. Like, that may just become the standard image. We'll see. Yeah. I learned a little bit of grass. I went to figure out how to convert an MPEG-4 to a WebM. So... Very nice. Alright. Well, thank you both very

1:10:10 Conclusion

1:10:41 much for joining me today, for breaking those clusters, for powering through and fixing this cluster, and for running the book club on the chat along with others, Walid and others as well. So, you know... Yep. It's very great. I'm gonna have a very strong drink after that, Aries. But other than that, have a wonderful day, everyone. Thank you, Carlos. Thank you, Eric. Thank you, everyone on the chat. Thank you, Teleport. Yep. Speak to you soon. Thank you. Bye. Thanks. Bye.
