About this video
What You'll Learn
- Diagnose a Kubernetes API server outage by auditing control plane components and kubelet/static pod manifests.
- Upgrade the broken clustered application from v1 to v2 in the default namespace while recovering failing controllers and services.
- Identify a hostile kubeauth helper by checking `at` jobs, process tables, and BCC tooling traces.
Community volunteers each get ten minutes to debug two broken Kubernetes clusters live. Cluster one hides a renamed etcd static pod manifest; cluster two has a rogue process scaling a deployment to zero via a malicious kubectl auth helper.
Jump to a chapter
- 0:00 Holding screen
- 1:26 Introduction and Episode Concept
- 3:08 Background on the Clustered Series
- 3:47 Sponsor Shoutout: Teleport
- 5:03 Sponsor Shoutout: Equinix Metal
- 5:35 Introducing the Audience Participation Buzzer
- 5:56 First Contestant: Benjamin Joins
- 7:00 Getting Benjamin Access to Cluster 1
- 8:53 Debugging Cluster 1 Begins (API Server Down)
- 10:15 Investigating Logs and Manifests (etcd Issue)
- 12:00 Offering Hints and Support
- 13:09 Examining Static Manifests Directory
- 14:31 Looking at API Server Logs (Connection Refused)
- 15:25 Checking Kubelet Status and Logs
- 17:31 Benjamin Hands Over
- 18:24 Second Contestant: Bogdan Joins
- 19:40 Getting Bogdan Access to Cluster 1
- 22:56 Debugging Cluster 1 Continues (Following Benjamin)
- 23:38 Manifest Timestamps (Red Herring)
- 25:11 Kubelet and Static Pods Investigation
- 26:52 Restarting Kubelet and Tailing Logs
- 28:16 Controller Manager Issues Appear
- 30:32 Cluster 1 API Server Becomes Responsive
- 30:57 Attempting Application Upgrade
- 32:38 Controller Manager Probe Failure
- 33:15 Modifying Controller Manager Manifest
- 34:56 Controller Manager Restarts, App Still Stalled
- 35:28 Success! Application Upgraded on Cluster 1
- 36:07 Post-Mortem Cluster 1 (Rawkode Explains Intended Breaks)
- 38:01 Third Contestant: FHKE Joins
- 38:56 Getting FHKE Access to Cluster 2 (Starting Fresh)
- 42:14 Debugging Cluster 2 Begins (App Scaled to Zero)
- 42:44 Checking Deployment Events
- 44:21 Searching for the Rogue Scaling Process (Cron, Jobs, Systemd)
- 46:28 Examining Running Processes (ps aux)
- 48:56 Considering Container Processes (nerdctl)
- 51:14 FHKE Hands Over
- 51:58 Fourth Contestant: Vladimir Joins
- 53:23 Getting Vladimir Access to Cluster 2
- 54:43 Debugging Cluster 2 Continues (Following FHKE)
- 55:15 Rogue Process Scales App Down Again
- 56:54 Re-examining Images and Processes
- 57:27 Vladimir Hands Over (Due to Family Interruption)
- 57:49 Fifth Contestant: Alistair Joins
- 58:41 Getting Alistair Access to Cluster 2
- 59:19 Kubeadl Conference Mention
- 1:00:21 Debugging Cluster 2 Continues (Following Vladimir)
- 1:00:56 Checking Images and Processes Again
- 1:02:00 Keyboard Enthusiast Chat
- 1:03:00 Chat Suggestions (Mutation Webhook, Process Table Focus)
- 1:04:17 Investigating Process Tree and Sleeping Processes
- 1:07:12 Hint: Consider the `at` Daemon (`atq`)
- 1:08:50 Closer Look at `ps aux` Output
- 1:09:59 Alistair Hands Over
- 1:10:32 Sixth Contestant: Seth Joins
- 1:11:44 Getting Seth Access to Cluster 2
- 1:13:21 Debugging Cluster 2 Continues (Following Alistair)
- 1:13:34 Investigating `atq` and `systemctl timers` Again
- 1:14:51 Installing and Using `nerdctl` for Container Processes
- 1:17:01 Examining Container Processes Again
- 1:19:57 Seth Hands Over
- 1:21:11 Seventh Contestant: Dan Joins (Faces Technical Issues)
- 1:24:04 Dan Hands Over (Technical Issues Persist)
- 1:24:16 Last Call for Volunteers / Community Discussion
- 1:26:26 Eighth Contestant: Bogdan Returns
- 1:27:09 Getting Bogdan Access to Cluster 2 (Again)
- 1:28:18 Bogdan Asks for Hints (Hint: Host, at daemon)
- 1:28:41 Investigating `atq` Again (Scheduled Jobs Reappear)
- 1:31:02 Rawkode Hints: How kubectl Authenticates
- 1:31:52 Identifying the Malicious Auth Helper (`kubeauth-off-metal`)
- 1:32:49 Revealing the Malicious Script Behind the Auth Helper
- 1:34:17 Explanation of the Cluster 2 Break and the Proper Fix (Trusted KubeConfig)
- 1:35:30 Debugging Tools Demonstration (execsnoop, opensnoop)
- 1:38:24 Conclusion, Thank You, and Upcoming Schedule
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:26 Introduction and Episode Concept
1:26 Hello, and welcome back to the Rawkode Academy. My name is David Flanagan, although you may know me from across the Internet and this channel as Rawkode. And I often say clustered is the worst idea I've ever had. I've now taken that back. This episode is the worst idea I've ever had. Echo. Can I ask if I have an echo? I'll work on that if I need to. There is an echo. Two people have said it now. If you have an echo, make sure you don't have two tabs open. Yes. There we go. Yeah. Don't have it open on Twitter at
2:28 the same time and cause me get the fear. Alright. We're on low latency mode. So we have around a one second delay from whatever you type and whatever I see. So we can have a good conversation today. And we're gonna have fun. So like I said, this is the worst episode. Worst idea I've ever had because I have broken two clusters. My goal is to have random members of the audience come on and join me and see if they can fix them. I was really confident when I broke these clusters that I I was like, yeah. Nobody
2:57 will fix this. And now I'm worried that they're too easy. So whether this episode is thirty minutes long or ninety minutes long, I have no idea. That is down to you, the audience. So if you haven't seen Clustered before, I would encourage you to check it out. We pop over to my screen share for a second. You will see that there is a Clustered oh, don't do that. There is a Clustered playlist with over 42 of these episodes. So if you wanna see how people come on computer. There we go. Clustered, full playlist. There's loads of these. Right?
3:08 Background on the Clustered Series
3:28 We've had teams from Adobe, Zapier. We've had members of the community, IBM. We've had Pixie Labs, Aerospike, Chainguard versus Chainguard. That was a fun one. Like, there are lots of companies. So please go check out the other episodes of Clustered if you wanna do more or see more people sweat on the stream. Alright. We're gonna get this thing kicked off in just a minute. What I need you to do is I have created a buzzer for today's session. If you go to this URL, this will allow you to register your name, and we're gonna allow you to buzz in
3:47 Sponsor Shoutout: Teleport
3:59 and join me for ten minutes to see if you can fix a cluster. If you don't want to, that's fine. You can just watch and laugh. There are rules, Andrew. Yes. No. Unicode has not been used to break any of these clusters. So join the buzzer. And while I do that, I'm gonna thank our sponsors as well. So get that at the sponsor's way. Alright. So from the very start of Clustered, we have been using Teleport. Right? Teleport is what we use every single week to debug these clusters and pair on them. I can,
4:33 like I'm gonna do today, add GitHub users to a GitHub team, and they are magically gonna have access to these clusters. They can SSH in and try and fix stuff. So if you're not familiar with Teleport, please go to Rawkode.live/Teleport. It'll make our sponsors happy. It keeps Clustered happening, and I would be well, yeah, just check it out. You're gonna love it. I'm gonna use it all through this episode as well, so you'll see how amazing it is. Oh, we've got people on the buzzer. Sweet. I was wondering what that noise was. Also, Equinix Metal provide all of the hardware
5:03 Sponsor Shoutout: Equinix Metal
5:06 for Clustered, and they have done since the very beginning. If you wanna check out Equinix Metal, go to Rawkode.live/metal. Use the code rawkode, and that'll get you $200 in credit to start playing around with it. These machines are chunky. Right? I could run clusters on a small VM with one core and one gig of RAM, but I don't. I run it on large bare metal machines with 48 cores and 60 or 90 gig of RAM depending on which instance I use. Why? It just makes it a bit more fun. If you've not tried that, go check it
5:32 out. Alright. Let's put this back down here. So I can see yeah. We've got people in there, we can start testing and get our first person on. I'm gonna reset the buzzers and lock them. No cheating. And we're gonna get our first contestant on. So I do have space for somebody here. So I'm gonna unlock the buzzers. First one to buzz in, I will paste the link into the chat, and you can come and join me. Ready, set, go. Benjamin Ritter, You can come on in. On our system here, I'm pasting this link to everybody, but I only want Benjamin to
5:56 First Contestant: Benjamin Joins
6:20 join. So click the link, turn on your camera. We'll get you access to this cluster. If you're not here in a minute, we'll pass it over to someone else. I hope you can hear me, Benjamin. We have Benjamin. Hi. Alright. Let's get you access to this cluster. I'm assuming you have a GitHub account. Right? Yeah. I do. Awesome. Okay. So I'm gonna add you to either cluster one or cluster cluster two. Do you have a preference? I have literally no idea what's going on the first time here. So I hope you've read it. Alright. Let's just give you access to cluster
7:00 Getting Benjamin Access to Cluster 1
7:21 one. So if you could tell me what your username is, please. I didn't type it in because that's not an that's gonna employ if I try to can I do Yep? There's a there's a private chat. Feel free to use that. There you go. Thank you very much. Alright. So this is gonna get you access to this cluster. You can accept that invite just now. Will unlock my password. Alright. You have an invite. Once you've accepted that, you will be able to log in at join.clustered.live. This will give you access to the cluster, and I will show you how to start
8:08 fixing it. Alright. I'll just hide this over here. Where would I find the invite? I think it's in GitHub. If you go to GitHub.com, it should be a notification. You'll also get an email to your email address. Oh, yeah. I'll just do that way. Sorry about my voice. I'm a little bit sick right now. No worries at all. So how's your Kubernetes experience? I am a systems engineer and IT consultant by day. And my main thing is Kubernetes clusters and operating them for customers, but mostly managed ones. I'm do I
8:53 Debugging Cluster 1 Begins (API Server Down)
9:14 have one for one customer that's on prem and that's fully managed by our team, including control plane and everything. It's on Hetzner Cloud, so that should tell you how much manual work is. Yeah. And then I'm I'm doing this, like, for two years now. Nice. Okay. So I just opened an SSH session onto one control plane. What that means is if you go to join.clustered.live, you'll be able to log in with your GitHub authentication. And then if you click once you're logged in, click on activity and active sessions, and
10:00 you will see my session, and you can use options and join. Alright. So go to act yep. Sweet. Alright. You now have access to a kubeadm cluster. I'll give you one bit of advice because everybody gets this on Clustered. You will have to export your kubeconfig, which I'm happy to do for you. Okay. Then the kubeadm location. Everything should be as usual. Your mission is to upgrade the clustered application in the default namespace from version one to version two. Good luck. Okay. And I'll give you ten minutes, and then we'll give someone else a shot.
10:15 Investigating Logs and Manifests (etcd Issue)
10:48 You already exported the kubeconfig, but is this is this part of the challenge, or is this a problem? This is part of the challenge. So it would appear that you do not have a working control plane. Oh, no. How was this cluster provisioned? With kubeadm. Are you familiar with kubeadm? Vaguely. Okay. Just true wrappers. So if we run `ps aux` and grep for kube, we'll see all of our control plane components. There's one missing at the moment, which is the API server. Okay. So by joining me, you already get a
11:51 $20 Amazon voucher. If at any point you wanna hand it over to someone else, just feel free to to say thanks and off you go. But if you need any advice, I'm happy to give little hints as we go. So, you know, don't feel uncomfortable. Don't feel under pressure. Like, just do what you're happy with. Of course. Okay. Just just to be real with you for a moment here, I'm just googling what to do to restart the API server for now. I I Google every day. Don't you worry, man. Yeah. Just just to to keep the viewers
12:00 Offering Hints and Support
12:26 involved and not staring at a blank screen. And I found, like, I'm I think from 2017, which says, oh, just restart the container. I guess that won't work since modern Kubernetes doesn't use Docker anymore. But There's no Docker on these machines, I'm afraid. There's containerd. And the way that kubeadm works is that all of our control plane components are configured via static manifests. Okay. Do we have the manifests on the control plane nodes themselves? They are indeed. They're in /etc/kubernetes/manifests. And thank you for using a pager.
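The check being described here, looking at what actually sits in /etc/kubernetes/manifests, can be sketched in a few lines of Python. This is only an illustration, not something run on the stream: the function name `audit_manifest_dir` is made up, and the expected file names are an assumption based on the defaults kubeadm normally writes on a control plane node.

```python
from pathlib import Path

# Assumption: default static pod manifest names written by kubeadm
# on a control plane node. A customised cluster may differ.
EXPECTED_MANIFESTS = {
    "etcd.yaml",
    "kube-apiserver.yaml",
    "kube-controller-manager.yaml",
    "kube-scheduler.yaml",
}


def audit_manifest_dir(directory):
    """Return (missing, unexpected) file names for a static manifest directory.

    Hypothetical helper: it only compares file names and does not
    inspect the contents of any manifest.
    """
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    missing = sorted(EXPECTED_MANIFESTS - present)
    unexpected = sorted(present - EXPECTED_MANIFESTS)
    return missing, unexpected
```

Calling `audit_manifest_dir("/etc/kubernetes/manifests")` on a control plane node would return two sorted lists: expected manifests that are absent, and files that are not part of the usual kubeadm set. A renamed etcd manifest would show up in both lists.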
13:09 Examining Static Manifests Directory
13:34 That means the audience sees exactly what you see, which is always great. Yeah. Okay. We got six minutes left to make some progress. Alright. Wanna chat? Help out. You guys are always really good at this. Oh, okay. I didn't even think about that. But it's No. They haven't ticked a lot too far, but the chat are usually pretty good at helping. So Alright. But right now, you have no API server. There are logs for the API server. I can tell you where those are if you want to go take a look. They're in /var/log/containers,
14:31 Looking at API Server Logs (Connection Refused)
14:36 which may be a good place to start. We have a symptom. Okay. Just uh-huh. So this is telling us that our API server actually cannot start, and we're seeing a connection refused on localhost port 2379. Do we know what port number or application that may be? I don't, but we'll find out soon. Joss, our objective is to upgrade the application from v one to v two, and the chat is in there pretty quick. Is down. It's what you see. Alright. Okay. So It's kinda sad. Let me look at the manifest here because
15:25 Checking Kubelet Status and Logs
16:04 it's like mean okay. That's a good idea. Oh, don't run kubeadm init, please. That will get us in a much worse position. But yeah. Oh, yeah. There's some etcd thing, and there's no etcd in the logs. Is etcd in this case part of the not like I'm not actually sure. It's part of the control plane, most definitely, but should it generate a log? Has it been running at all? So I mean This cluster was healthy at one point today. etcd was alive. It would appear that it's
16:48 currently broken. You certainly can take a look at the static manifest or try to find the logs, I think, would be a good shout. Okay. That's no. That's not is is not running. Seems it is not a problem. Okay. You've got a couple of minutes left. What are you thinking? What are you thinking? I'm thinking I have no fucking clue how do Can I swear on this show? You can swear. Yes. We swear all the time on Clustered. Don't you, Lenny? Oh, alright. I have no fucking clue how to use kubeadm and how to to
17:31 Benjamin Hands Over
17:44 deal with the manifest and how to to deploy them if they are not deployed and or check if they're deployed. Alright. Would you like to swap out, or do you wanna I'm a be real with you. Some somebody else can take it at this point. I can can Google it. No worries. Well, thank you for coming on. Email me. I'll put my email address is david@rawkode.com. I'll put that on the stream somewhere. I'll add it to the the description on YouTube, and we'll get you that voucher over today. But thank you for coming on, and we'll
18:19 speak soon. Thanks, man. Bye. Thank you. It was a it was a pleasure. Alright. So what we're gonna do is get the buzzer back on the go. Now I'm not gonna touch that session or that cluster. Whoever comes on has the choice to continue with cluster one or maybe try their luck with cluster two. That is up to you. So the link for our cosmobuzz.net is on the screen. I'm going to unlock the buzzers in about ten seconds time. In fact, I'll give you twenty seconds. That gives you ten seconds to get onto the URL. So get onto
18:24 Second Contestant: Bogdan Joins
18:51 the buzzer page very quick. First one to buzz in gets to come and join me and take a shot at this cluster. I am gonna post the link one more time into this chat. Alright. Buzzer's unlocking. I'm not gonna tell you when. I'm just gonna do it. Alright. Bogdan, you are up. Please come on, join the link that is in the YouTube chat. And then we'll decide which cluster you wanna take a pop at. And I do trust you, Benjamin, but I'm just gonna remove you from the team. Person. There we go. Hello. Welcome. Hey.
19:40 Getting Bogdan Access to Cluster 1
19:53 Hey. Alright. I've got lots of people trying to dial in now, so please only come in when you're invited. Maybe I should have come up with a better delivery system for that URL. Hey. How are you, man? I'm good. I'm good. Alright. Awesome. I wanted to go first, but I was a bit afraid. So yeah. Well, one of the things that we try to do with Clustered is to, like, try to dismantle this hero culture that we have. Like, we all know some stuff. We all know we all don't know a lot more. So it's alright when we Google. It's
20:23 okay if we say we don't know, and we try to work on the answer together. Vladimir, please stop trying to dial in. Alright. So could you please type your GitHub username into the interview chat on this web page? I will add you to the team. Oh, yeah. Yeah. Yeah. There's a chat. Yeah. Okay. I will get you added to the team, and then you have the choice between continuing with cluster one or trying cluster two. How do I write in the chat? I mean, you can just tell me your GitHub username if you want. It's
20:58 l Bogdan. L. So Bogdan with an l. Gotcha. Alright. You now have an email with an invitation to the team. Please accept that, and then we'll get you access to this cluster. Yes. Got it. Join. Okay. Let's okay. I joined. Alright. If you go to join.clustered.live in your web browser. Yes. And I guess I have to log in. Yep. Authenticate with GitHub. And then if you click on activity and active sessions Oh, wait. What cluster do want to work on? Do you want to continue with cluster one, or do wanna go on to Let's continue.
22:09 Alright. Then you'll see an active session under activity menu on the left, and then please click options and join. Okay. For login name, I can use Root. Oh, Root. Oh, no. Yeah. But you have to join, not open a new session. Oh, okay. Click activity on the left, and then there's active sessions. I I I can show you. So activity active sessions and then options and join. Activity active sessions. Yep. Yeah. I want to look into Teleport one of these days. Alright. Looks like you're in the session. So if you can just type echo hello, let me know that you're
22:52 there. Thank you. Teleport is awesome. It's such a cool tool. Alright. So you're continuing from where Benjamin left us. So what's your thought process so far? And your ten minutes section there. Well, I guess it's this is not running. I see some containerd, so I guess we're using containerd. That's correct. So okay. Then I guess we should look at few. I think you mean. Right? Yeah. I'm a bit nervous. I feel like on a job interview. Every day someone's on Clustered, typing is the first thing that goes. Yeah. I mean,
23:38 Manifest Timestamps (Red Herring)
23:51 while watching, it seems easy. But then when you're on the spot, it's like, oh, what should I do now? Okay. So controller manager failed to get status. I guess these are k. So let's do it like this and look for that CD. No. It's d. K. Manifests tier. Sorry. I'm I'm laughing at my own maliciousness now. It's just I don't know if you know what's the timestamps on those files, but it made me laugh. The what? The timestamps on the files. Oh, okay. It's in the future. Yeah. What do we have here? This seems not okay.
24:58 But I think the time stamps are super freshly. You can ignore that. I just I'm laughing. That's all. Yeah. Yeah. Yeah. It was a red herring. Yeah. Yeah. Yeah. For sure. But it's weird that we don't get any Yeah. I would have expected there to be etcd logs and the kubelet to be trying to start it. So yeah. Weird. Kind of weird. Is the kubelet binary untampered? It's untampered. I have not recompiled the kubelet, that I can tell you. Okay. Why would it okay. Let's see. But, yeah, I guess the manifest folder
25:11 Kubelet and Static Pods Investigation
26:04 is set correctly because it seems like it's trying to start the other static pods or not. Yep. So you're talking about a kubelet flag that allows you to specify a different directory for the static manifest. You can confirm that if you want. I'm happy to tell you that it's not changed. But it should be looking in /etc/kubernetes/manifests. Yes, Russell. You are the only person who has recompiled the kubelet. You evil evil person. Advertiser. Okay. So why would the kubelet not start a static pod defined in the manifest folder? Do we have this?
26:52 Restarting Kubelet and Tailing Logs
27:11 I find a good way to kind of check this. Restarting the kubelet and tailing the logs from the start usually does reveal a bit more information. You got five minutes left. Yeah. Okay. Let's actually would can we do we have screen? You can use screen. Yep. Okay. That's Loading cert starting controller. Okay. Let's see. K. So I see for creating already exist. And then get preparing. So did you see anything for etcd there? I yeah. I saw this, the deleted mirror pod. And then I saw failed to something about probe. But okay. Let's leave it like this.
28:16 Controller Manager Issues Appear
29:03 So it looks like we have Good call dropping down to crictl. I think that that should help. Yeah. I'm not sure how. Copy paste. I can do that for you if you want. Oh, you got it. Okay. A bit more oh, no. Find oh, because it's `ps`. Let's see. Looks like it's running. Copy. Is there a way to easy copy and paste? You just have to what operating system are you on? Windows Chrome? Yeah. Control c and control v in your browser should hopefully work just fine. Yeah. I tried, but it didn't seem to work.
30:19 Okay. So cache is open. Does it work now? Oh, okay. So it somehow fixed it. I have no idea why. It it shouldn't help? No. Okay. But, you know, the name of the game is to update the image. So if it works, it works. But, yeah, I'm a bit confused by that. So now we're two. So you have to upgrade the clustered application from v one to v two. Do we have k9s? No. We do not. Do we have nano? Yeah. The name cluster. Oh, no. How do I well, I guess I don't. Where is the image? So to v two.
30:57 Attempting Application Upgrade
32:12 Is it working? Can you still still see my screen? Yeah. Run kubectl get pods. Do we have a new pod? We do not. Okay. Did not. Interesting. Do we have a Oh, look at that. The controller manager is also broken. Yeah. Alright. You've got about one minute left. Okay. Yeah. Just let me know when You've made amazing progress. I I have no idea why that API server is working right now. What can you say what should have been the the issue? I'll wait till the end of the episode just in case it comes back and breaks again.
32:38 Controller Manager Probe Failure
33:10 I don't wanna give it away, but we'll go over the breaks for sure. HTTP probe failed. So I guess there's a liveness probe in the manifest. kube-controller-manager, I think. Can I remove it? Yeah. Or You can if you want. I'll let you finish this change, and we'll see if anything's working. And then we'll swap you out for someone else because you're a little bit over, but I'll let you finish what you're working on. Oh, okay. Okay. Yeah. That's fine. Does the controller manager usually have a liveness probe? It does indeed.
33:15 Modifying Controller Manager Manifest
34:00 Yes. Describe. Yes. We should oh, yeah. You can move it out and back in. That's a good trick. Yep. It usually helps speed it up. Alright. The comments are saying that Benjamin did actually run a kubeadm init on the machine. I didn't clock that. It must have skipped past me when I was looking at the comments. Cool. But I think that should be okay. I don't think we keep in it. Whatever. Think I changed. But I'll check. I'll check. And we dot Okay. It's restarting. Yep. You have a controller manager. And my cluster
34:56 Controller Manager Restarts, App Still Stalled
35:00 pod is still not doing anything. Just on this, and then I'll pass it over. A wrong name, Swiss. Oh, yeah. The part may have been rotated. Okay. Is it using the new meter? Let's check. There are local host on port 30,000. It it seems like it does, but yeah. Local host. 38,000. That's 30,000. You got v two. Well done. Yay. Alright. Thanks for Yes. Me see. An email, and I will hook you up with your $200 Amazon voucher. I have no idea why it's fixed, but does it matter? You fixed it, so well done. It it happens sometime. Like, you you fix
36:07 Post-Mortem Cluster 1 (Rawkode Explains Intended Breaks)
36:09 it and you don't know why it works. Well, let's take a look at some of the breaks in this machine. Right? So if we take a look at the Kubernetes manifest now the Kubernetes may have broken may have fixed this, but I'm not sure. Oh, I need to test the browser because the then causes a little bit of trouble. I changed the name of the etcd node in the manifest. So, actually, etcd shouldn't have been able to come back up, which I'm a little bit confused, but it did because its real name is one.
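The break explained here, a changed etcd member name in the static manifest, is easy to miss when skimming the file. A minimal sketch of checking for it follows, assuming the manifest passes the member name as a `--name=` argument (as kubeadm-generated etcd manifests normally do) and that the name is expected to equal the node's hostname. Both helper names are hypothetical.

```python
import re

# Matches a YAML list item such as "    - --name=node-1".
_NAME_FLAG = re.compile(r"^\s*-\s+--name=(\S+)\s*$")


def etcd_member_name(manifest_text):
    """Return the value of the first --name= argument, or None if absent.

    Hypothetical helper: a plain text scan, not a YAML parse, so it
    assumes one argument per line in the container command list.
    """
    for line in manifest_text.splitlines():
        match = _NAME_FLAG.match(line)
        if match:
            return match.group(1)
    return None


def name_matches_host(manifest_text, hostname):
    """True when the manifest's etcd member name equals the given hostname."""
    return etcd_member_name(manifest_text) == hostname
```

On a node, the text would come from the etcd manifest file and the hostname from something like `socket.gethostname()`; a mismatch is a prompt to look closer, not proof of tampering.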
36:47 Yeah. One. Yeah. But oh, let's see. I would have probably not called that. Also, nobody set an alias, which is It's bit quite weird with, like, changing underlying IPs or host names or, like, it kind of breaks and you need to manually fix it. So yeah. But it came back on that, and that that's the only mission. The other gotcha here was that if you'd set an alias, k, which 99% of people do, things would have been a little bit broken. But because you kept using kubectl in fact, no. Now it's broken.
37:29 etcd is now broken. Anyway. Oh, well. Yeah. I changed the autocomplete on the k alias to be the get command. So just as a mild annoyance to people. But it doesn't matter. Like I said, please drop an email david@rawkode.com. I will hook you up with that voucher, and thank you for joining. We've got one more cluster to fix that I'm gonna get someone else on. Yeah. Thank you for having me. Cheers. Alright. Damn you, cluster. Cool. Alright. I'm gonna open the buzzers again. So if you wanna come on and take a pop at cluster number two,
38:01 Third Contestant: FHKE Joins
38:08 we've got more Amazon vouchers to give away. The link is there. I'll give you all another ten, fifteen seconds to get back into the buzzer room if you're not already, and then I'm going to unlock the buzzers. No idea how etcd got back online there. That's annoying. Alright. I'm going to unlock the buzzers at a random moment. Said you were close, but FHKE was first. So link has gone into the YouTube chat. Please come and join me, FHKE. I will get you access to cluster two. I'll open a session just now. I'm looking forward to this one,
38:56 Getting FHKE Access to Cluster 2 (Starting Fresh)
39:08 I think. Alright. I'll give you another twenty seconds, FHKE and then oh, there we go. Nice. Hello. Hey. How's it going? Yeah. Well, I'm a bit disappointed that my first break was only partial break, but sometimes that happens on clusters. I'm not gonna dwell on it. How are how are you doing? Yeah. Yeah. Good. Thanks. Alright. Can you give me your GitHub username, please? Yeah. Sure. It's f h k. So that's the one I use for here. K. Let's get you added. So you're gonna get an email just now. If you just click on that and accept
40:04 the invitation and then go to join.clustered.live. We will get you into cluster number two. So we haven't seen anything to do with cluster number two, so you're starting from a fresh slate. Good stuff. Dan asks, is this the first live Kubernetes debugging session you've hosted? No. No. We have over 42 episodes of Clustered. This is the first one we've made interactive with the audience actually participating, but there are loads and loads of episodes of Clustered on the channel. So go to rawkode.live and check that out. Russell asking, can I save the session history? Yeah. The session history will be there in
40:44 the audit log of Teleport and in the bash history. I'm gonna go back through it and see if I can work out what happened. It did appear etcd was broken at the end there as well. So but the image was updated, so I can't argue. So Where do I need to go to? If you go to your email, you should have a GitHub team invitation. Yeah. I've just joined that GitHub. Oh, alright. Okay. So if you go to join.clustered.live, Andrew is asking which episode was the Unicode one. Search for Clustered Guy Templeton
41:20 or Guinevere, who was my host cohost on that episode. And that is the one with the Unicode break that we never speak of again. Alright. So if you're logged on to, Teleport now, if you expand the activity menu you go to active sessions. Cool. Yeah. I can see that. And then options and join, and then just type echo hello, anything to let me know that you're in this session. Awesome. I will export your KubeConfig for you, and then the cluster is all yours for the next ten minutes. And I'll fix my spelling as well. Alright.
42:12 All yours. Cool. Best of luck. I'm sorry. What's the challenge on this one? Same again. Upgrade the clustered application from v one to v two. Okay. No event. The pending Cilium operator is okay and can be ignored. Okay. The chat comments are assisting you here if you can see them. Yeah. No. Sorry. I missed the I really thought I've done that. It's just clustered with a k, like Rawkode with a k. Oh, where'd it go? I think staying it down. What's going be done? Good question. Can't see anything obviously there. No. No. No. Guess, though.
44:21 Searching for the Rogue Scaling Process (Cron, Jobs, Systemd)
44:46 I wonder if there are any cron jobs or anything. Yep. Cron jobs. I can look in the APIs overall, but it logs up there. What's up? They don't have. What were you looking for in the manifest there? Just see if the audit log was set up for the API server. I should set them up, actually. That would be good. An improvement for future clusters, I think. I can just sleep there. So what's that? What's the current amount? So I'm gonna guess this is some kind of loop, and every sixty seconds, it's gonna scale it down.
45:47 This is the cluster that I broke Carl off. Yeah. Yes. My job is to laugh, Carlos, not to help. This is the only episode of Clustered where I have not been on the line with someone else for fixing. So I'm enjoying this one. I mean, I'm wishing I had hidden those sleep processes in the process table with LD_PRELOAD, but that may have been a bit too mean. But there's definitely stuff in roll capping on this machine. You got five minutes left. So you ran `who`, checking for other users. You'd run `jobs` to see if there was
46:28 Examining Running Processes (ps aux)
46:49 anything scheduled. Nothing so far. If you could use a pager on the longer commands, it just means we see what you see as well. Yeah. Sure. Thank you. It's obviously running. See, this is one of the breaks where I was like, this is super sneaky. And then five minutes before we went live, I was like, it's too obvious. Like, it's just far too obvious. And now I'm not I'm so swayed. I don't know. I hope you find it because I want I've gotta talk about it, but let's see what happens.
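The process-table search in this part of the session can be approximated with a small filter over `ps aux` output. This is a sketch only: the `find_processes` helper and its default keywords are assumptions chosen for illustration, and it trusts the command column, which a hostile process can disguise.

```python
def find_processes(ps_output, keywords=("kubectl", "sleep")):
    """Return (pid, command) pairs from `ps aux` text whose command
    contains any of the given keywords.

    Hypothetical helper: assumes the standard 11-column `ps aux`
    layout with the command in the last column.
    """
    matches = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 10)  # keep the full command intact
        if len(fields) < 11:
            continue
        pid, command = fields[1], fields[10]
        if any(keyword in command for keyword in keywords):
            matches.append((int(pid), command))
    return matches
```

The text could be captured with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. Since process names cannot always be trusted, an empty result here does not rule out a rogue process.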
47:58 Yeah. Cron would be too easy, wouldn't it? Peter has just joined and is asking what is the issue. Well, the issue is that every time our deployment is scaled from zero replicas to a positive number, some rogue process appears to be scaling it back to zero. We are looking for that process. I mean, if it's running on the host, maybe I could just move the kubeconfig somewhere else, and it would fail to authenticate. I like that. Very clever. Terminal just. If you need to resize the window, it does help. It's a bug in xterm.js.
48:56 Considering Container Processes (nerdctl)
49:26 Just resizing it by, like, one pixel does seem to kick it back in line or type in reset as a command. Oh, yeah. Seems to be working now. Okay. It's happening again. So it's not probably not something running on the host. So maybe it's, I guess, in a container. Potentially. Yeah. Alright. You have two minutes. I like your process here, though. You're doing well. Thanks. Yeah. I can't see anything, but, like, obviously, they'll be doing this when I get good data. Depends whether you trust all the names of the processes, I guess. That's true. There's a really cool command for debugging every
50:25 image in the cluster. I am not saying that's gonna give you any information whatsoever, but you never know. You're already doing it. So where did they make that one? Which one? This one here. I can't see what you select. You'll need to Sorry. It's the one that begins sha256. Yeah. That's just that's just one that's resolved to a SHA. If you were to, on your grep, do a dash capital B two, it would also show you the name, I think, before it. So it could be extra matches. So before the dash I do. Yeah.
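The image audit discussed above (listing every image in the cluster and keeping the owning name next to each digest, which is what the `grep -B 2` suggestion is for) can be sketched in Python. This is only an illustration: the pod list below is a hypothetical, trimmed-down stand-in for `kubectl get pods -A -o json` output, and the pod, container, and image names are made up rather than taken from the cluster in the episode.

```python
import json

# Hypothetical, trimmed-down stand-in for `kubectl get pods -A -o json`.
# None of these names or images come from the cluster shown in the episode.
SAMPLE_POD_LIST = json.loads("""
{"items": [
  {"metadata": {"namespace": "default", "name": "clustered-abc"},
   "spec": {"containers": [
     {"name": "app", "image": "example.invalid/clustered:v1"}]}},
  {"metadata": {"namespace": "default", "name": "postgres-0"},
   "spec": {"initContainers": [
     {"name": "init", "image": "example.invalid/busybox:1"}],
            "containers": [
     {"name": "db", "image": "example.invalid/postgres@sha256:0123abcd"}]}}
]}
""")


def images_by_pod(pod_list):
    """Return (namespace, pod, container, image) for every container spec."""
    rows = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        spec = pod.get("spec", {})
        # Init containers are included: they can carry unexpected images too.
        for key in ("initContainers", "containers"):
            for container in spec.get(key, []):
                rows.append((meta["namespace"], meta["name"],
                             container["name"], container["image"]))
    return rows


if __name__ == "__main__":
    for namespace, pod, container, image in images_by_pod(SAMPLE_POD_LIST):
        print(f"{namespace}/{pod}/{container} -> {image}")
```

Keeping the pod and container name beside each image means a bare sha256 digest can still be traced back to the workload that references it.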
51:13 All right, you got about thirty seconds and then I'll swap you out. But I feel that you've made really good progress for people here. Alright. Your time Yep. Nice work, though. Very well done. Yeah. Thank you. Yeah. Drop me an email, david@rawkode.com. I'm gonna send that to my other time. And I'll put it into the the comments as well at the end of the episode. But, yeah, you got a an Amazon voucher. Thank you for joining me. And Thank you. Let's see who else wants to come on. Alright. The link for the buzzer is there.
51:58 Fourth Contestant: Vladimir Joins
52:00 I'll give these all twenty seconds to get ready. I'll unlock them at a random moment in time. First one to buzz does get to come on and join and see if they can fix this cluster. Vladimir, you are first. Please join the stream. The link is in the chat. I'll see you in a minute. Carlos, a a different cluster or the same one? It is the same cluster. I only broke two clusters for today's episode. The first one was fixed by magic. Hey, Vladimir. How's it going? Can you hear me okay? Yeah. One second. I will turn off the
stream. Okay. I'm with you. So I need to send you my GitHub account. One second. Please do. Yeah. Asked me if you can have an Equinix Metal voucher. Use Rawkode. That will get you $200 in credits for Equinix Metal. Can I just tell you the my login? Because You can. For some reason, I shut up on chat. Okay. O I k seven four one. Alright. O I k seven four one. Right? Yep. Nice shift. Right. Alright. You have an email inviting you to the team. Please check it out. Accept the invite, and then go to join.clustered.live.
53:23 Getting Vladimir Access to Cluster 2
53:44 We already have a session active, so you'll be able to continue from where Farragall left us. I will share my screen so you can see what you're looking for as well. You want to come to active sessions under the activity menu on the left. Once you've accepted the invite, join dot clustered dot live. Yep. Activity. Active sessions. Transition. Sessions. I'm in. Alright. Type echo hello. Anything you want. Does that there we go. I can see you type it already. You're carrying on from where Fergo left us, so, you know, best of luck. You've got ten
54:43 Debugging Cluster 2 Continues (Following FHKE)
54:46 minutes. Sorry. I have a buddy here. That's alright. I've got two cats of my own. I know. I'll do well. So Yeah. Something is scaling a replica or deployment a replica set time to zero. I saw here. It's half Postgres. Oh, because she said I can't get. It's just k logs. I'm like, Oh, okay. Sorry. Sorry. Yep. Yeah. So I'll remove the get. There you go. Alice, Alice, Alice, Okay. Yeah. It looks like real. Then for some reason, mom's face me. Me. You're checking all the images right now. Oh, so it's was just just a pseudo
56:54 Re-examining Images and Processes
57:12 images. Yeah. Mean, previous guy was suspecting something that caused this. He's over to this chase, and it was just silly. Sorry, guys. I can't proceed. I need to Of course. No problem. To my daughter. Sorry. Sorry, guys. No. That's all good. Yeah. Thank you for coming on. See you. Bye. Alright. I know that feeling all too well. I have a three and a half year old daughter and an eight month old son. So we'll get Vladimir on for another episode. That's for sure. Alright. Let's get the buzzers back up. So the link is at the top of the
57:49 Fifth Contestant: Alistair Joins
57:53 page. I'm gonna unlock them in a few moments. First one to buzz in can come and join us. Alistair, first one in. When you come. Hey, friends with this. Hello. Alright. I should know this, but please tell me your GitHub username. Well, how do I type? You can just oh, you can type that into the interview chat on the web page that you joined from, or you can tell it to me. Can you see that? Yep. I can. It's in the guest chat. Wonderful. Alright. You have an email with an invitation to this team. Please accept it.
58:41 Getting Alistair Access to Cluster 2
59:09 I will. And while I'm doing that, you can sell KubeHuddle to everyone. You're right. I can't believe I haven't mentioned that yet. So I am organizing a Kubernetes conference in Scotland. You can go to kubehuddle.com where you can get information on our 2022 conference. We are in Edinburgh, Scotland's capital, in October. So go check out the website, and, hopefully, I'll see some of you in October. Alright. So if you could It's my wife's birthday on the third. So she's getting a trip to Edinburgh. Right? Well, she she was going to, but then oh, clustered.live.
59:19 KubeHuddle Conference Mention
59:55 Yep. join.clustered.live. Authenticate with your GitHub username. There we go. And then please go to activity, active sessions, and join the session that we have open right now. Yeah. She was gonna come, but then it's it's DevOpsDays London on the Thursday, Friday. So that makes it a bit Ah. Interesting. Once you're in the session, if you could please type echo hello or anything like that to let me know that you're there. Perfect. You have ten minutes. Best of luck. So It's very different when you're on this side of the screen, I've got to say.
1:00:21 Debugging Cluster 2 Continues (Following Vladimir)
1:00:36 Your ability to type will fall out the window, I guarantee it. I was just checking in case you put the deployment YAML in here, and it was being auto sort of scaled down or something like that. Where do we start? There's that command to get all the images. I'm just gonna Google it because I can never remember. I tend to use kubectl cluster-info dump and then grep for image. Yeah. I wonder if I can paste. No. Paste. Okay. What's the namespace? It's. Alright. Oh, it's before. It's really annoying. These two things. This sounds like some nice
1:00:56 Checking Images and Processes Again
1:01:45 switches on that keyboard. What are you rocking? MX Browns, I think. Oh, nice. I only have one mechanical keyboard, which shocks some people. I currently just got the Keychron Q3, which I am loving right now. Although it is a heavy beast. So Very nice. Is that, like, metal aluminum thingy? Yeah. I got the knob version, so, like, I can use that for random stuff. Don't let me distract you from fixing the cluster at all. What's kube-vip-ds and kube-vip? kube-vip is what we use to advertise the BGP address on this Equinix Metal cluster. You can ignore kube-vip
1:02:00 Keyboard Enthusiast Chat
1:02:26 for today. Okay. ctr. What was that other one that that does the CRI one? crictl, is it? That's the one. Yeah. Yes. Agent. What's agent? Oh, emissary ingress agent. kube-scheduler. I'm just doing the same things that everyone else does. Okay. So in case I see something different, well, that unless you can see it. Root root root. Look for the hybrid things. There's people saying stuff in chat that might help me. Not yet. No. Oh. I think we've stumped a lot of hopefully, I've stumped a lot of people with this one. Although Carlos is suggesting that you
1:03:00 Chat Suggestions (Mutation Webhook, Process Table Focus)
1:03:23 check your mutation webhook. Andrew thinks maybe something in an in a container. What is it? Mutating Mutating webhook configuration. Look. Configurations. Bogdan spotted sleep 60 in the process table. Yep. How can I not spell mutating? Mutating webhook config. Yeah. That's right. Right? It's actually being made. Run kubectl api-resources and grep for mutating. Which one has sleep? Sleep. There's a way to see the process tree. If I pick a PID Yes. You can see the process tree. And I can never remember how to do it. E f, e f a u x would
1:04:17 Investigating Process Tree and Sleeping Processes
1:04:37 be a good way to do it. ps. What is it? E f? A u x. Well, if you grep for sleep, you won't see the process tree. And if you use a pager, it means we see what you see. Thank you. And then we can search for sleep. I mean, there's lots of harmless reasons sleep could be running. Is there all down and on sleep here? We all use sleep in our scripts. Right? What are you thinking? See. I don't know. What am I doing wrong? CRI. crictl. What subscribed purely to comment on the mutating
1:06:04 typo? Thank you, bot. Oh, what did I do? Mutating. I don't know. It's been too long since we Too many a t a t a t a t I n g's, I think. Andrew is asking what's the kube-vip pod that is not named for the not named. kube-vip also runs a small daemon set to advertise the BGP on every machine. And the control plane one is the leader election and services routine stuff. State not ready. It's because it's being is it being paused Or something. What's doing that? What could do that? Good comments there from Carlos and Bogdan.
1:07:01 Not that I want you to be able to fix this or anything, but those are good comments. Carlos says concentrate on the process table. If we need to, we'll jump into eBPF. How's your eBPF, Alistair? Terrible. Bogdan noticed atq maybe. I don't know what atq is. What is atq? It is a command runner on Linux. Similar to cron. How do I view them? Minus minus b. No. Q. Oh, we can never send mail to the user. Minus c. Nope. Cool. Someone was saying processes. Yeah. Everything's in everything's everything's running as a process. Everything is a process.
1:07:12 Hint: Consider the `at` Daemon (`atq`)
1:08:39 Everything's a process. I mean, the ps efaux, I mean, it shows you much more than I think you you you actually paid attention to. Again, not that I want you to fix this, but there's a lot of information on this Yeah. systemd. No. Something's running cron. sshd is running. You've got one minute. Time flies when you're having fun. Right? I know. Yeah. containerd-shim. There's so many things on here. Teleport. Bash. Woo hoo. Teleport. Bogdan: somebody already moved admin.conf away, and that made no difference. Mhmm. Just delete all the pods. Please don't delete all the pods.
1:09:59 Alistair Hands Over
1:10:08 That's usually what I do on the managed cluster when stuff's not working. Alright. Your time is up, Alistair. Thank you for joining us. Please email me, david@rawkode.com. I will get you the Amazon voucher, and we will invite someone else on. Thanks, man. Alright. So the link is on the page. CosmoBuzz.net blah blah blah blah blah. Get your buzzers ready. I will be unlocking the buzzers momentarily. First one to buzz gets to come on and join me. I think we'll do two more people. We'll try and finish this at seven. And then if it's not fixed, I will
1:10:32 Sixth Contestant: Seth Joins
1:10:51 run through the break on this cluster. That oh. Well, it said Dan, but now Seth was first. So I think it that's some sort of weird sync on the buzzer. So Seth, click the link and come and join me. I will paste it one more time. And I will wait here patiently for Seth to join us. Seth has been active in the chat so far, so I know he's got ideas. Curious to see if he's gonna get to the bottom of this scale down. And I think Dan's the only other person in the buzzer that's not been on. So
1:11:35 Dan, we'll get you on after Seth if Seth doesn't fix it. Because that was pretty close. So Dan, you are next. Hey, Seth. Hey. How's it going? I'm having fun. Are you having fun? I am. I am. Alright. Dan's joining us too. Dan, you can fix it next. Right? We're gonna give Seth ten minutes and then and then Dan. So thank you both for joining me. Alright. Seth, let's get you access to this cluster. Cool. My GitHub is just first name, last name, Sid Palace. Yep. Alright. You have an email. Please go check it. Accept the invite and
1:11:44 Getting Seth Access to Cluster 2
1:12:22 go to join.clustered.live. You know the drill. You have been here before. Join join Rawkode Academy. Well done asking if execsnoop is installed on the server. No. But you're more than welcome to install BCC tools if you want to execsnoop. And then what's the the URL to join the Teleport? join.clustered.live. Alright. Once you're logged in, please go to activity followed by active sessions and join the session that we have opened just now on to control plane one. And then just type echo hello or anything like that so that we know you're in the same session.
1:13:20 Perfect. Alright. You got ten minutes starting now. Best of luck. Okay. Great. So, yeah, I think there were a couple of things that Alistair was doing that seemed useful. Someone had suggested atq, and so I think at dash l will give us anything scheduled there. It looks like nothing. Another potential place we could schedule stuff would be with systemctl timers, I guess. Let's see. So if we do is it just timers or list timers? List timers, I believe. Right? list-timers. One more well Yeah. No space. Interesting hypothesis, Dan.
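Because the `at` queue looked empty at the moment it was checked, a one-off glance can miss short-lived jobs. The sketch below parses `atq`-style text so it could be polled and compared over time. It assumes the one-line-per-job layout printed by the common Linux `at` package (job id, date, queue letter, user); the sample text, job ids, and timestamps are hypothetical.

```python
from datetime import datetime

# Hypothetical atq output, assuming the usual layout:
#   <job id><TAB><weekday> <month> <day> <HH:MM:SS> <year> <queue> <user>
# A queue of "=" marks a job that is currently running.
SAMPLE_ATQ = (
    "7\tTue Jun  7 18:05:05 2022 a root\n"
    "8\tTue Jun  7 18:05:30 2022 a root\n"
    "9\tTue Jun  7 18:06:00 2022 = root\n"
)


def parse_atq(text):
    """Turn atq-style lines into dictionaries, skipping blank lines."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, rest = line.split(None, 1)
        # The last two fields are queue and user; everything before is the date.
        *when, queue, user = rest.split()
        jobs.append({
            "id": int(job_id),
            "when": datetime.strptime(" ".join(when), "%a %b %d %H:%M:%S %Y"),
            "queue": queue,
            "user": user,
        })
    return jobs


if __name__ == "__main__":
    for job in parse_atq(SAMPLE_ATQ):
        state = "running" if job["queue"] == "=" else "pending"
        print(job["id"], job["when"].isoformat(), job["user"], state)
```

Polling this every second or two while issuing a command would show jobs that only exist briefly, which a single `atq` call is likely to miss.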
1:13:34 Investigating `atq` and `systemctl timers` Again
1:14:33 When I tell you what the break is or you discover the break, I'll talk about something that's important. Maybe. Alright. So what are you thinking? Nothing there jumps out at me. We people had suggested looking at containers running on the host. crictl always confuses me, so maybe I'll try to install nerdctl because it is more familiar. Yep. Do we go do See what happens. .local/bin is not in the path, so you will have to modify it or push it to somewhere else. It seems to be you pushed it to home slash .local/bin.
1:14:51 Installing and Using `nerdctl` for Container Processes
1:15:41 That should be it. Okay. Nothing there. Kubernetes pods run in a namespace that you would have to And so to add a namespace in nerdctl, I'm just gonna Google that. Try dash capital e. It might do all namespaces. Good idea. I think after the ps. Nope. Or for shop. Then you may have to do dash n k8s.io. Apparently, I don't know how to work nerdctl. Oh, there you go. Here we go. It needs to be before ps. K. Everything came up seven hours ago. We do have this
1:16:46 pause container. So you think the malicious workload is in a container? I wasn't sure about that, but I I figured we should look at them as a starting point. But, yeah, nothing here looks particularly interesting unless I'm skimming over it in the with the the pressure of the stream on me. You know? You know, it's amazing how quickly we forget everything we've ever known the minute cameras are in front of your face. Right? So you're just scrolling up and down this list right now? Yeah. It was just looking at well, I I made my screen large enough that I
1:17:01 Examining Container Processes Again
1:17:52 didn't have to scroll. But Just looking at the commands that they're all issuing. Venom is curious if the pod is scaling itself down. Leszek Yes. It's fine. Is it going to be something like a bash script pushed into the background of a session? Peter is asking if the images are real. Alistair's looking for processes running Python or Perl. Lots of ideas. Wonderful ideas, but they're all wrong. Get deploy clustered. Kind of. Maybe. Yeah. So there's no init container or anything on there. Would have been a good way to do it. The same effect, I guess. But, yeah,
1:19:00 not how it's done today. That would have been would have been neat. Let's look at the postgres one. Yeah. Wesley's asking, consistently triggers fire when a pod starts. Interesting. I'm not entirely sure, to be honest. No idea. Yeah. So Postgres appears to Russell's asking if the process runs on another node. These are single node clusters for the record and no external hardware we choose. I'm afraid you've only got one minute left. It Yeah. It it flew by. I have been rather sneaky with this one, I have to say. I was unconfident coming into this. Like, I know people are
1:19:57 Seth Hands Over
1:20:12 gonna get it too quickly, but I I'm glad that this has stumped people because there's something important to learn at the end of this episode. What was the the flag on PS to see the tree? E f a u x. E f will give you the tree and a u x will show you all the processes. I think the f stands for forest. That's what they call the tree in the PS format. Nice. That's it. HPAs were checked. Yep. Well, Dan, glad he didn't choose this one. Dan, still very a lot of people curious about the ATD.
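The point about the tree view is that a bare `sleep` line says little until you see what launched it. As a rough sketch of that idea, the code below rebuilds parent chains from `ps -eo pid,ppid,args`-style text. The sample listing is hypothetical and purely illustrative: the PIDs, paths, and parent processes are assumptions, not output captured from the cluster.

```python
# Hypothetical `ps -eo pid,ppid,args` output; PIDs and commands are made up.
SAMPLE_PS = """\
  PID  PPID COMMAND
    1     0 /sbin/init
  812     1 /usr/sbin/atd -f
 4101   812 /usr/sbin/atd -f
 4102  4101 sh
 4103  4102 sleep 60
  950     1 /usr/sbin/sshd -D
"""


def parse_ps(text):
    """Map pid -> (ppid, command line), skipping the header row."""
    procs = {}
    for line in text.splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        procs[int(parts[0])] = (int(parts[1]), parts[2])
    return procs


def ancestry(procs, pid):
    """Return the command chain from the oldest known ancestor down to pid."""
    chain, seen = [], set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against malformed, cyclic input
        ppid, command = procs[pid]
        chain.append(command)
        pid = ppid
    return list(reversed(chain))


def sleepers(procs):
    """PIDs whose executable name is `sleep`."""
    return [pid for pid, (_, command) in procs.items()
            if command.split()[0].rsplit("/", 1)[-1] == "sleep"]


if __name__ == "__main__":
    table = parse_ps(SAMPLE_PS)
    for pid in sleepers(table):
        print(pid, " <- ".join(reversed(ancestry(table, pid))))
```

Printing the chain for each sleeping process makes the parent visible at a glance, which is the information a plain `grep sleep` throws away.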
1:21:00 I'll give you another Yeah. Think thirty seconds that what do got? No. I I think I'm ready to hand it over. I don't have any ideas that'll bear that'll bear fruit in thirty seconds. Alright. Well, thank you for coming on. It's always a pleasure. Please email me david@rawkode.com. I'll hook you up with that voucher, and then we'll hand it over to Dan. Now, Dan, please come and join us. Thanks, Ed. Yeah. Thanks. Alright. I'm just gonna leave this up here. Dan, please click that link again. We'll get you added to the cluster. I'll okay. I'll share one again.
1:21:11 Seventh Contestant: Dan Joins (Faces Technical Issues)
1:21:33 There you go. Alex, joined late. What's the symptom? Our deployment and replica set has been scaled to zero, and no one has been able to work out how or why yet. Hey, Dan. Hey. Thanks for joining us. Can you Okay. What do I need to do next? Give me your GitHub username, please. I think it's Dan Kirkpatrick. Let me I'm not sure, to be honest. Let me check my one password. GitHub. Yeah. It's dan.kirkpatrick@gmail.com. Alright. This may be you. Please check your email and see if you have an invite to add a team on the Rawkode Academy. If
1:22:26 you do, please accept it. There it is. Alright. Join Rawkode Academy. Join. I'm in, but Now you can go. I'll paste you the link into the chat, but it is join.clustered.live. This will ask you to authenticate against GitHub, and then you will have access to the session that we are in just now. Alright. I've lost my window now. There we are. Share. I don't see the text window on the on the chat. Oh, there it is. I see it. It's on the side. It's on the side. Yeah. Anybody check the So if you go to that link, join
.clustered.live, you'll be able to join my session. Join dot I don't see a link. Just go to your browser, join.clustered.live. join.clustered.live. Let's try that out. Can't find that server cluster with a k? K. Oops. I missed all the clustered. My bad. Yep. That didn't work either. Clustered.live. So join.clustered.live. I'll type it again and speak a chitchat as that. Dot live. I'm no. That doesn't work either. And I've lost my YouTube window now. I'm gonna have to to bow out because I I'm having technical issues. Alright. No worries. Well, send me an email.
1:24:04 Dan Hands Over (Technical Issues Persist)
1:24:06 We'll get you an Amazon voucher. Thank you for jumping on. Okay. I'm not sure I deserve that, but thanks. Well, alright. Do we have any last calls? Does anybody want to join us before we we give away the secret sauce here? You wanna join? Let me know. Dustin asking, has anyone described the replica set events? We have seen the events on the replica set. We're scaling up and something is scaling it down. Russell is asking about OS cron versus Kubernetes cron. Could it be cron? Well, as it says Cron jobs were empty. Any takers? Alright, I am just gonna post a link.
1:24:16 Last Call for Volunteers / Community Discussion
1:25:08 Anyone that wants to come on and have a go at this, please click the link. We will not be rebuilding the cluster from scratch. Alright. I'm gonna give it one more minute and then I will walk through the break. If anybody wants a chance at fixing this before I reveal it, now is your opportunity. Bogdan, maybe you wanna come back on and try it? Dan, feel free to jump back on. Dan, if you've got the link when you come. Yeah. Alright. Hello. Alright. Hello. Hey. Okay. Let me see. I still have this open. What was your username again?
1:26:26 Eighth Contestant: Bogdan Returns
1:26:45 There we go. Sorry? It's a different team for this cluster, so I will add you to that. So it will automatically add you this time, so you may have to just reload your teleport page. In fact, I can always change the roles. Oh, that's better. One. Because she gave you one earlier. I will just give you access to everything so you don't have to reload. Please join the active session and type echo hello or something so that I know you're here. Okay. Is that your production account? Which one? No. No. I was just joking. I I
1:27:09 Getting Bogdan Access to Cluster 2 (Again)
1:27:38 asked if that's your production account. Yeah. These are my production clusters we're fixing. I actually don't know what's wrong with them. Okay. So should I see it in the active sessions? Yep. You should hopefully be able to see the active sessions. Looks like Dan is on a note. Join the first one. One. Right? To control plane one. The first session, forty nine minutes old. Join that one. Join. I mean, I I said yes, but I have literally no idea. Can I get hints to to maybe get it over with faster? Like, is it something on the host, or is it something
1:28:18 Bogdan Asks for Hints (Hint: Host, at daemon)
1:28:31 in the cluster? It is something on the host, and Dan has been very close a few times by telling us to look at the at daemon. Oh, okay. So atd, atq. I mean, that is definitely how the rogue processes are being scheduled. Okay. So can I just do I mean, you can stop it, but I'd like you to find the the bad thing too? Okay. Do we have the scale here? Okay. That's annoying. Like, I oh, okay. Where did this come from? Didn't you we already run atq? And Yeah. Something you did triggered some schedules.
1:28:41 Investigating `atq` Again (Scheduled Jobs Reappear)
1:30:02 Oh, okay. But how do they still run if atd is stopped? We've had someone else join, but I don't think Joe, are you there? Like, if there's Right. Someone who wants to try, I can just Dan removed all the atq commands, and they came back. That's that's a pain in the ass. I don't think anyone's got any ideas. Oh, okay. Interesting. Is kubectl tampered with? kubectl is not tampered with. No. Interesting idea. Alright. I'm gonna just start dropping hints so that we can uncover what's actually happening here. But I find it interesting that people are running kubectl
1:31:02 Rawkode Hints: How kubectl Authenticates
1:31:10 in a very trustful fashion. So how does kubectl know how to speak to our API server? Oh, okay. Well, did you set it? Oh, okay. Is there, like yeah. So tricky. I guess oh, because it's we also have oh. So the comments seem to think that I set up an alias or a function for kubectl. I have not. This is 100% a perfectly legal break. So there was something in that admin.conf that should be investigated. The kubeauth-off-metal? Mhmm. So do you know what that is? Just a take on which? I mean, it is a command that kubectl
1:31:52 Identifying the Malicious Auth Helper (`kubeauth-off-metal`)
1:32:37 runs and then it gets a token, right, or something like it's Yeah. So if we take a look at Like, an external Like authentication. Exactly. Right? So this is called a Kubernetes, or a kubectl, authentication helper. Mostly for managed Kubernetes clusters like AWS and GCP where you don't have a specific token and you run a command and it gets a token. But there's nothing stopping that command from doing malicious things in your cluster. Yeah. So, actually, our way sorry, sir. Our way Yeah. I'll let you drive it. And if we run a tape on this,
1:32:49 Revealing the Malicious Script Behind the Auth Helper
1:33:19 we'll see that we have a binary, which I thought was particularly mean of me. A file. File? Yeah. That's what I meant. As I I mean, we don't really have any visibility into what this is doing at all. We could run strings on it. Yeah. And I guess you And you may you may see something interesting or you may not. Yeah. Actually, you don't really see that much. But what I wanted people to be aware of with this is that, like, those authentication helpers are really opaque, and you should never really trust a kubeconfig without knowing what it's actually
1:34:05 doing. I have stored a plain bash version of the script on the host, which I can show you if you wish. Yeah. Which I sneakily did it. How how would you go and fix it? Because So that's what our authentication helper is actually doing. Okay. It sets up a so because this happens before the kubectl command, I had to use at to sleep them for five, thirty, and sixty seconds. So they would run afterwards. That's why your kubectl modifications would actually work for a short period of time, and then it would all go
1:34:17 Explanation of the Cluster 2 Break and the Proper Fix (Trusted KubeConfig)
1:34:49 to shit. And then I compiled it with shc, which gives you a binary of any shell script, just to kinda make it a bit more hidden. But that was it. The fix here is not to use that kubeconfig and to instead use trusted credentials, which I have left. You can see here, we have used our local BEN cache, which is a real credential file. Oh, okay. Yeah. And that's all the the results for dot key and dot certificate. Yeah. But it's it's it's really, really easy to write a bad kubeconfig authentication helper that does lots of nasty stuff inside of a
1:35:30 Debugging Tools Demonstration (execsnoop, opensnoop)
1:35:35 cluster. Like, I mean, all I'm doing here is catting out some very important credentials. Yeah. But I I think execsnoop would have helped because you would be able to see that kubectl executed that helper. And then yeah. Yeah. Let's let's do it. Let's because I like BCC tools. I just don't remember the name of the package, but search. Maybe in one word. Oh, no. bpfcc-tools. Yeah. Carlos is asking how I would debug this. I would use execsnoop to debug this 100%. So let me open a second session. We could run execsnoop.
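To make the execsnoop idea concrete: the tool logs every `exec` with the new process's PID and its parent's PID, so anything launched underneath `kubectl` can be picked out of the log. The sketch below assumes the default column layout of the BCC `execsnoop` tool (PCOMM, PID, PPID, RET, ARGS). The sample log is hypothetical, and `auth-helper` and all paths in it are placeholder names, not the real helper from the episode.

```python
# Hypothetical execsnoop log using the BCC tool's default columns.
# "auth-helper" and every path here are placeholders for illustration.
SAMPLE_EXECSNOOP = """\
PCOMM            PID    PPID   RET ARGS
kubectl          5001   4200     0 /usr/local/bin/kubectl scale deploy clustered --replicas=1
auth-helper      5002   5001     0 /usr/local/bin/auth-helper
at               5003   5002     0 /usr/bin/at now
ls               5100   4200     0 /usr/bin/ls
"""


def parse_execsnoop(text):
    """Parse execsnoop rows into dictionaries, skipping the header row."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 4 or not parts[1].isdigit():
            continue  # header, blank, or unrecognised line
        events.append({
            "comm": parts[0],
            "pid": int(parts[1]),
            "ppid": int(parts[2]),
            "ret": int(parts[3]),
            "args": parts[4] if len(parts) > 4 else "",
        })
    return events


def spawned_by(events, root_comm):
    """Return every exec that descends from a process named root_comm."""
    tracked, hits = set(), []
    for event in events:  # rows are assumed to be in chronological order
        if event["ppid"] in tracked:
            hits.append(event)
            tracked.add(event["pid"])  # follow grandchildren as well
        elif event["comm"] == root_comm:
            tracked.add(event["pid"])
    return hits


if __name__ == "__main__":
    for event in spawned_by(parse_execsnoop(SAMPLE_EXECSNOOP), "kubectl"):
        print(event["pid"], event["comm"], event["args"])
```

In normal use `kubectl` with static client certificates should not start other programs, so any row this filter returns is worth a closer look.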
1:36:42 We could run our kubectl. We've got a scale command here somewhere, don't we? Yep. And probably should have felt this on a few things, but Yeah. That's alright. Let's see. Yeah. There we go. I've got our three at commands being run. You can see our kubectl scale, which is then Mhmm. We've got our off metal and then off metal as it runs the ats. We could see the certificates that are being kind of injected in. So everything you kind of need, that is available from execsnoop. The BPF tools are awesome. Right? I I
1:37:20 I really feel that everybody should learn all of these. There's loads of them. Yeah. Actually, if we do an ls /usr/local/bin, grep, where do they live? No. I think it's in Oh, sbin. There there's a lot of scripts in /usr/share/bcc or something. Yeah. I guess We got execsnoop, killsnoop, mountsnoop. opensnoop is one of my favorites. That's fucking awesome. In fact, we'd see I'll grep it because I know what I'm looking for. If we grep for cash, it's not too noisy. Oh, that was too noisy. We would have seen the credential files in
1:38:13 there as as well. So, yeah, all those snoop tools are pretty cool. Yeah. Alright. Thank you, Bogdan. Thank you a lot again. Cheers. Bye. Alright. Thank you everybody for watching the community versus Rawkode episode of Clustered. Clustered is on every single week on my channel except for when I'm traveling. I'm actually at cdCon in Austin, Texas next week, so there won't be Clustered. But we will be back the following week. So, hopefully, I will see you all then. Remember to subscribe to the channel, like the video, and, hopefully, I see you all again soon.
1:38:24 Conclusion, Thank You, and Upcoming Schedule
1:38:47 So thank you for tuning in. And I think, actually, this was a whole lot more fun than I expected. So I'm gonna run more community versus Rawkode episodes and see if we can do this again. So I'll get that maybe once a month. Let's see. Alright. Thank you, everybody. Have a wonderful evening.