Overview

About this video

What You'll Learn

  1. Trace a compromised Kubernetes control plane by forensically locating a malicious static pod manifest hidden on disk.
  2. Remove persistence from a cluster by stopping a rogue systemd service and neutralizing an LD_PRELOAD injection path.
  3. Mitigate etcd database flooding by diagnosing quota pressure, identifying attacker IP traffic, and validating cluster recovery.

KubeCon special part two with Thomas Stromberg and Kris Nova. Two broken clusters: a malicious static pod manifest persisted via a systemd service and LD_PRELOAD hook, and an etcd flooding attack that exhausted the database quota.

Chapters

  1. 0:00 Holding screen
  2. 4:00 Introductions
  3. 4:23 Introduction & Welcome
  4. 5:06 Guest Introductions (Thomas & Kris)
  5. 6:15 Starting Challenge 1: Kris's Cluster - Initial Look
  6. 6:30 Cluster by Kris Nova
  7. 8:30 Initial Diagnosis & Forensic Tools (Using FLS)
  8. 12:41 App Deployment Fails - API Server Issue
  9. 16:50 Identifying Malicious Static Pod Manifest
  10. 22:25 Malicious Pod Manifest Keeps Returning
  11. 24:21 Tracing Persistence to Systemd Service & LD_PRELOAD
  12. 27:01 Fixing File System Access (LD_PRELOAD)
  13. 28:02 Restoring the Correct Kubernetes API Server Manifest
  14. 32:51 Cleaning Up Malicious Artifacts (Infect Namespace)
  15. 34:21 Challenge 1 Fixed & Exploit Analysis
  16. 40:00 Cluster by Thomas Stromberg
  17. 40:01 Starting Challenge 2: Thomas's Cluster - Initial Look
  18. 44:17 App Deployment Fails - ETCD Database Exceeded Error
  19. 50:44 Investigating ETCD Connectivity Issues
  20. 1:11:40 Identifying & Blocking the ETCD Attacker IP
  21. 1:15:01 ETCD Flooding Exploit Revealed
  22. 1:15:38 ETCD Cleanup & Validation (Increasing Size Quota)
  23. 1:16:41 Challenge 2 Fixed & Wrap Up
Transcript

Generated from the English captions.

4:23 Introduction & Welcome

4:23 Hello. Welcome back to Rawkode Live. I'm your host, Rawkode. This is our Klustered KubeCon special, part two. Today, we have three clusters all broken by people joining us on this call. We are gonna work in teams to fix them as quickly as we can. Before we begin, I just wanna encourage you all to subscribe to the YouTube channel and click the bell. This will get you notifications for all future episodes. If you're not watching live, or even if you are, but you wanna chat about cloud native or anything in between, come and join the

4:55 Discord. There's a few hundred of us on there now having a bit of fun. And lastly, thank you to Equinix Metal, my employer. They allow me to do this on their time, and feel free to check out the platform by using the code Rawkode. Alright. Let's get started. Today, I am joined by Thomas and Kris. Hello, Thomas and Kris. Hey. Alright. So let's start with you, Thomas. Do you wanna just say hello? Let us know quickly who you are and then we'll move on and get started for today. Sure. I'm Thomas Stromberg. Once upon a time, I was a minikube

5:06 Guest Introductions (Thomas & Kris)

5:29 maintainer. I work at Equinix Metal now, working on Tinkerbell. So looking forward to this. Feeling a little rusty in Kubernetes debugging, but I think we're gonna have a time. It's alright. Nice. The breaks are always really simple. You don't need to know much. I promise. Kris? Hi. My name is Kris Nova. I am just a computer person who does computer things and I'm here to play with Kubernetes with some new friends. There seems to be a lot of comments in chat talking about how everyone feels sorry for Thomas and I and your cluster. And

6:06 I believe some people may have watched your stream of this break during the week, so they probably know what's coming. But feel Yeah. Okay. Well, let's just get started. We are a little bit late today. We just had some technical hiccups that we have all resolved. We're gonna jump on to Kris's cluster first. So Thomas and I will be pairing on this. Kris, we will ask you for help as we need. Feel free not to laugh too loudly, but we will do our best. Have No. This is gonna be great. We're gonna have we're gonna have a great time.

6:30 Cluster by Kris Nova

6:33 It's gonna be a lot of fun. Alright. I have my screen shared. We are floating heads. I am gonna open a connection onto the control plane node. I can see that we have a message of the day. Thank you for that. Always appreciate it. Thomas, can you join the session and give me a echo hello so that we know we have a shared buffer? Oh, you're going straight into the node. Oh, wow. Okay. Well, I will use the Kubernetes admin token to try and execute a get nodes or get pods. We'll see if we have a

7:04 working control plane and we'll take it from there. I am here. I got coffee. I've got some kombucha. Let's do this. Alright. Thomas, you're muted right now. I don't I think Thomas is muted. Yeah. Yeah. You've muted yourself, man. Sorry about that. Alright. So I tried to connect in, and it made my own session. So how do I share your show? Of course. On the left hand side, there's some active sessions. If you can click on that, please. You should see those one by active sessions. Okay. I will join that one. I'm just gonna get us

7:47 oh, I just had that weird screen blip which means you're in a session. So I will give you the honors, Thomas. Feel free to type, get nodes, get pods, whatever you want, and let's see if we have a working control plane. Alright. Whoo. Okay. Wait. We do. A lot here for this. It it in an answers. So that's one of the more more impressive things that I've seen here. So so scrolling up. I see a lot of Novas. Interesting. Okay. So we have a lot we have a lot of stuff here. So one of the let's see.

8:30 Initial Diagnosis & Forensic Tools (Using FLS)

8:31 Google. What's it's the thing to no. It's the thing to dump all the config to a file again. cluster-info. That's what I'm trying to do. Alright. Okay. So, yeah, we got a lot of junk going on here. I love watching. I like I've already learned, like, three things just watching you so far. Well, so as I was talking to I I mentioned this before the the stream that, you know, when I broke a cluster previously, I realized that one of the things this is like is it's it's almost like investigating a security incident. Like, you've got a parasite

9:24 in your system. So I'm I'm trying to collect some information first. You're doing great. In in which case, like, I feel like I I I left a good amount of clues. So this is gonna be really fun. I'm stoked to see how this plays out. So this is really weird scrolling behavior, though. Alright. Yeah. Just reload the page. It'll fix the scrolling behavior. It it does happen, actually. It's really annoying. I got it. Sort of. I haven't. Sorry. I can't see the bottom of my screen yet. There we go. Alright. So what is the the test to see

10:01 the what do you call it? Isn't there like a WordPress app or something that we're supposed to do to see, like, how it works and to effectively say, is this thing healthy? It looks like it's What would rate for is this healthy? It's running. I I I would say if if you can get the WordPress app or whatever it is that that Rawkode has in in his repo, you can get that running and and get the cluster into a state where it'll continue to run. I will I I will buy you a steak dinner. Okay.

10:39 So we do have a Klustered pod running in the default namespace. Whether it's actually running or not, I'm not sure. Whether we can even upgrade it to the version two, I'm doubly not sure. Alright. So What is FLS? So one of the things I'm going to do is basically make a since I know we're dealing with a kernel hacker here, I'm going to make a, basically, a dump of the file system and a dump of all the file system changes recently. So FLS is part of SleuthKit, which is showing my age a little bit. I

11:36 I I've I put this on easy mode. If I would've known we were really gonna be doing this, I would've it would've made this a little bit more fun for us. Alright. So I'm act I'm just collecting information before mutation here. So so for instance, what I'm showing oh, this machine was installed recently. Yeah. These are cluster API clusters. They're spun up a few days before. So So I don't know when alright. I'm just gonna start with so for instance, I can see that the most recent changes on the file system are basically here. It's basically sorted order of changes

12:21 in the file system, and I'm going to collect it for later. I do notice some bizarre things already here. I see a systemd change here. It looks like some files were edited in this directory with Vim. So so yeah. Anyways, let's go to the let's now that I've collected the the data, how do we get can can you drive a little bit for how do we test your WordPress app? Yes. We can definitely do that. So Alright. What I would normally do is run get pods. And I should have just used my alias, but there we go.
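The "collect before mutating" step above can be approximated without SleuthKit. The following is a minimal sketch, not the fls command Thomas ran: it walks a directory tree and sorts entries by modification time, newest first. The function name `recent_changes` and the `/etc` starting point are assumptions for illustration. Unlike fls, it only sees the live file system (no deleted files), it trusts timestamps that an attacker can forge, and it goes through libc, so an active LD_PRELOAD hook could hide entries from it.

```python
import os
import time


def recent_changes(root, limit=20):
    """Return up to `limit` (mtime, path) pairs under `root`, newest first."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are reported, not followed
                st = os.lstat(path)
            except OSError:
                # entry vanished or is unreadable; skip it
                continue
            entries.append((st.st_mtime, path))
    entries.sort(reverse=True)
    return entries[:limit]


if __name__ == "__main__":
    # "/etc" is an illustrative starting point, not taken from the stream
    for mtime, path in recent_changes("/etc", limit=20):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Saving the output to a file before touching anything gives a baseline to compare against later, which is the point of the collection step.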

12:41 App Deployment Fails - API Server Issue

13:02 Well, okay. Our app is So it's gone. Gone. Yeah. Okay. And and just just for the record, I just I I really like the fact that they're all in CrashLoopBackOff. That that was by design and and I'm just really happy with this aesthetic right now. So this was a big part of my work. I love that you've got it to deliver a message all nicely lined up. That's that's pretty special. So where do we find your app to deploy? Okay. So we can just press space. Switch over to a do that locally, but

13:36 we can do that. Alright. Okay. Klustered. This is cluster 22, and I can just run just workloads, and we'll see if it works. This is what I love most about this. This just just this makes me gives me the warm and fuzzies. So we it did create it. At least it says it did, but running get pods on Teleport is a big no-no, which makes me think. I don't know yet. Are we speaking to the real API server? Are we speaking to the real API server? Yeah. So I'm gonna run my just workloads again. I wanna see if it says changed

14:41 or created. Okay. It's saying created. So we're getting the API server. We're getting a response, but nothing is changing state in the cluster. Alright. So let's take a look. I'm not I'm not quite following here. Sorry. So can you see my screen at the moment? Yeah. I I can see your your screen, which is the same as my screen. Yeah. So on my local terminal, which I I need to deploy the application, I ran Yep. Just workload twice. And you can see here we're getting created and created. So, like, I don't think it's actually being created. Otherwise,

15:27 we'd see unchanged on that second deploy. So while we are speaking to an API server, something's not happening on the back end, so we need to work that out. Alright. Interesting. Yeah. It's somewhat difficult, but not impossible, to have a fake API server. But what is going on here? Are we talking to the API server on the right machine? Yeah. So I I think we probably are based on the fact that Kris didn't know we were gonna jump on to the node and use this token. So I think Yeah. I mean, the certificates match and everything

16:11 else. Like Yeah. So I think this is a real API server. I'm assuming we've maybe got yeah. If it was a, like, an admission controller, it wouldn't say created on our side. So it must be something else. I can see Kris just giggling. I'm I'm I'm not giggling. I mean, to be fair, like, Thomas Thomas is the hero. Like, the game's kind of already over. But, like, I I wanna see how you would go about fixing this because this is gonna be really cool. And I think this demonstrates a lot of I mean, like, I I'll just let you

16:45 guys go through it. You're doing great. You're gonna you're gonna find it. Alright. I'm I'm gonna go first with identifying the weird things that I saw in the file system. Okay. And hope that will give us some of the some of the interesting bits here. And so we we have this wonderful bash script that I saw earlier, which is highly suspicious here. It's highly suspicious. And to be honest, I was I was anticipating this being the last thing we found. So watching you approach this the other way is gonna be really fun. I'm real I'm literally, like, tickled right

16:50 Identifying Malicious Static Pod Manifest

17:20 now. This is great. Yeah. Oops. Okay. So so what did we do here? All right. So you did mention oh, you are evil. Is this an API server aggregation? This is yeah. Look at this. The affected API server. So, yeah, you were, Oh, what did you do to my life? Oh, okay. I'm sorry. So okay. So there's there's a lot of evil happening here. Let's see. What what do we do here? Right. So okay. Okay. So I see so I'm base basically just dumping a list of here are all the images on running containers. Let me actually do this a little differently.

18:33 Thomas, I'm gonna have to get that FLS command from you. That that that thing saved the day here. That's the real hero of the show. If everybody here walks away with that copy pasta from Thomas, we're If you if you, yeah, if you ever have to deal with infected machines now there's probably better things. This is what I used, like, fifteen years ago. But Yeah. Let's see. So so infect API server. I mean, that seems to be, like, one of the first things to kill, I guess. Well, can we check some of the basics first? Right? Like Yeah.

19:08 Yeah. Let's check the basics. Look in the static manifest, like, I expect to see an API server here. Yeah. Immediately, see it's been date is different. Modified. Yeah. Okay. There we go. It's with that with that it's that temp file we saw. Yeah. So we could use kubeadm just to reprovision the API server phase. Right? Yeah. So I wonder now here's here's a question that FLS can help the FLS data can help identify. Was was the original copied was was Kris nice enough to copy the original one anywhere else? I actually, looking at the script, it

19:47 doesn't look like she was. I I I have a copy. I I have a copy. Yeah. I can I can send you one as well? I just have to go in the other room and get it off my local social team. So, yeah, I think that's look. Can we kill that that infect pod and then Well, we can remove the the API server from the manifest directory, and the kubelet will then get rid of it. We can either copy the kubeadm API server manifest from another machine or we can try and fluff our way through the

20:19 kubeadm commands, which I think is in a oh, no. Help. Is it not phases? Phases. Hold on. I'm trying to remember how to curate the end. A phase for just dumping the config, I think. I'm sure it's init phase something. Why did I not remember how to do this? So a a clue would be if you can get just get rid of the container, that little everything I put in the comments, that's that's really the only mutation you should you should care about. It's I I put a big block around it.

21:33 So it should be safe to say if if we can get that code block to go away, we'll we're we're back you know, Jurassic Park is back online. Well, that pod should already be deleted because we removed the static manifest. Okay. Sorry. What did you say there? I thought that pod would be away because we removed the static manifest, but then I can see in kube-system, we still have a kube-apiserver. Oh, that's just a real API server. Can we edit that? Oh, no. No. That can't be right. Yeah. Hey. Did I not delete this file?

22:23 No. We haven't deleted this file yet. I did. Oh, did it get reinserted by A mysterious force? I think so. Oh, sorry. On you go. Stop. This is great. I I love watching the the teamwork here, how we have kernel introspection coupled with Kubernetes expertise because this is this is good. This is good stuff. I I am so sure I did an rm on that file to to remove. I I did not see an rm. Alright. Let's let's do it again. We've always got the fact that copy and temporary. Maybe I just wasn't paying attention. But

22:25 Malicious Pod Manifest Keeps Returning

23:01 yeah. It's there again. Excellent. So so I think it's basically just reinserting itself. Yeah. Wonderful. What's the e attribute? Beats me, but it could just be the same as others. Alright. So we have unknown processes running on this machine that are that are mutating this file. So we could use, like, auditctl or something like that to actually watch the file system changes. But then I have a lot of copy pasting to do, and I don't I don't have auditctl memorized. So but So there's there's, like, three directions I can nudge us in, which would be fix

23:58 going about Kubernetes and and what we would wanna look for there and how we could find some introspection there. There's the the things on the file system we've already found. There's some more clues in there as well. Oh, yeah. And then ultimately, I think, like I said in the beginning, that that first FLS, that's like, you got the keys to the kingdom right there, baby. Like, that's got all you need. Yep. Yep. Alright. So let's let's go ahead and disable some of these random things we found in the file system. And there's there's a I

24:21 Tracing Persistence to Systemd Service & LD_PRELOAD

24:27 saw two suspicious systemd services, one NTP and one Nova. So let's let's go ahead and look at those. So And I I I did name everything very politely, if that makes sense. Yeah. Alright. So alright. So system so we have this Nova service that starts that up so we can and we've got this NTP service that was a little suspicious, but maybe not that suspicious. Looks It's clear. Fairly normal. Yeah. So that should be fine. Alright. So So we did a systemctl disable, but I don't think that stops it. Right? We'll need to stop it as well. Yeah.
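The distinction raised here is that `systemctl disable` only prevents a unit from starting at boot, while the already running process keeps going until `systemctl stop` is issued. A minimal sketch of doing both and then checking the result follows. The helper names and the unit name `nova.service` are assumptions for illustration; the stream only says the service was named after Nova.

```python
import subprocess


def neutralize_unit_commands(unit):
    """Build the systemctl invocations that stop a unit now and keep it off at boot."""
    return [
        ["systemctl", "stop", unit],       # end the running process
        ["systemctl", "disable", unit],    # remove the boot-time enablement
        ["systemctl", "is-active", unit],  # non-zero exit code when the unit is not active
    ]


def neutralize_unit(unit, dry_run=True):
    """Print the commands (dry run) or execute them and return their exit codes."""
    results = []
    for cmd in neutralize_unit_commands(unit):
        if dry_run:
            print(" ".join(cmd))
            results.append(None)
        else:
            results.append(subprocess.run(cmd, check=False).returncode)
    return results


if __name__ == "__main__":
    # "nova.service" is a hypothetical unit name
    neutralize_unit("nova.service", dry_run=True)
```

The dry run default is deliberate: on a compromised host it is worth reading what will be executed before running anything as root.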

25:29 You're right. Do we think that's responsible for putting the dodgy manifest back in this directory? Yeah. It well, it at least has the capability of putting dodgy manifests in place. Let's see. There we go. Well done. Alright. Well done. So since we know the calling card anyways, I'm just gonna look here and see if there are any suspicious other things. I see there is in the root directory also a Nova, but that might have just been development. Yeah. There's I mean, there's a couple of, like What is with this? LibNova? Oh, okay. Yeah. I'm happy you found the calling card. You've

26:32 done this before, haven't you, Thomas? I've investigated break ins before. That's bad. Not in a long time. Yeah. No. You're questioning. Alright. So I might have broken something. I hope I did. Go for it. Try to list. That'll let you know. Just try to list, and you'll you'll get it real quick if you I mean, try to do anything on the file system. There you go. Alright. So we found I forgot And so this is where ld.so.preload. That's the one. Okay. You can just echo echo empty to it, and that should that should fix it. Yeah.
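For context on the fix above: the dynamic linker reads `/etc/ld.so.preload` as a whitespace-separated list of shared objects to load into every dynamically linked program, which is why emptying the file removes the hook. The sketch below lists what such a file contains. The function name `preload_entries` is an assumption for illustration, and because Python itself reads the file through libc, an active hook could in principle hide or alter what it reports.

```python
from pathlib import Path


def preload_entries(preload_file="/etc/ld.so.preload"):
    """Return the shared-object paths listed in an ld.so.preload-style file."""
    path = Path(preload_file)
    if not path.exists():
        return []
    # The file is a whitespace-separated list, so split() handles
    # newlines, tabs, and spaces alike.
    return path.read_text().split()


if __name__ == "__main__":
    entries = preload_entries()
    if entries:
        print("Libraries preloaded into every dynamically linked process:")
        for entry in entries:
            print("  ", entry)
    else:
        print("No preload entries found.")
```

On most systems this file is absent or empty, so any entry at all is worth explaining.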

27:01 Fixing File System Access (LD_PRELOAD)

27:42 Beautiful. And list? Hey. Okay. K. We got one interesting bizarre thing out. Okay. Next is alright. So what do we gotta do now for looking at Well, we have no API server. Oh, no. We don't. We have an API server, but it's not really the right one. So I don't know what config you used to generate like, what the options were for your kubeadm cluster initially to create a new one. Okay. I will grab an API server manifest from one of the other machines. Alright. And I'm just gonna okay. So so what I'm gonna do is just look contextually

28:02 Restoring the Correct Kubernetes API Server Manifest

28:48 if there are any other interesting events in the changes file around here. And I see the well Alright. I see the ld.so.preload thing here. Do you mind if I paste in this manifest? Yeah. Go ahead and fix the manifest. Yeah. Thomas, share that that FLS command with me. Sorry. I your your reputation preceded you, so I I knew Yeah. Roughly So this is one of equals on my encounter. You did well. This is one of the things we we spent it was, like, last night and the day before, we spent a couple hours, like, meeting some of the the guys on Twitch.

29:45 I say I'll say guys, but folks on Twitch. And it was, like, do you know, should we should we name these, like more conspicuously? Should we have a common calling card? Like, how much cleanup should we do? And we we kind of were like, let's let's keep it fun. And I think this is I'm I'm thoroughly impressed. This is well done. Getting the rest of the cluster to clean up is not impossible. It's it's it should be pretty straightforward here once we start digging around. My teleport's not typing anymore. Yeah. Didn't Dave you David send you the message

30:24 that this was supposed to be an easy one? No. Oh, okay. He told me I should make it an easy one. So I said because it was time-boxed, we should take it easy. Oh, sorry. Well, if if you're interested, I can I can kind of guide us through the rest of this really quick? No. No. No. We're we're Yeah. Let's let's keep let's keep break. My Firefox has crashed. Breaking it or breaking it more. So can you if your Teleport is still working, I think there's a couple more IP addresses that need changed in that file.

30:57 Yeah. My teleport session is still working here. Yeah. I need to get mine back up. And really the a lot a lot of this is prior art from from other scenarios I've gone through. And, you know, if there's there's a lot of good Easter eggs and good, like, tips and tricks. So I I was hoping and then we already got one of them, which is that that fabulous FLS command. I was really hoping folks would would walk away from this knowing that there is some some things to watch out for. And let me grab this.

31:49 But really in the in that that static manifest now that's not being overwritten with that errant systemd service, you really should just be able to leave that that one container and and the clusters. And then it's just cleaning up the cluster at that point. Yep. Okay. So that should be a working, hopefully, API server. Can we delete the x file? Yeah. You can delete the x file. That's fine. Alright. Let's check for Sorry? Don't like me. Alright. So so what are you looking for here? Oh. I'm just waiting for the API server to come back. Static. Yeah.

32:33 In theory, I should. Oh. I should. I'll let you type, Thomas. That mean we can run get pods and actually see? Yeah. Let's see. Alright. So I we could probably just delete that whole namespace. Yeah. Let's see. I'll do this thing. Old school way, I guess. The first one's You can do dash dash all now, Thomas. You can do delete po or delete namespace now. Oh, yeah. Well As long as there's no admission controllers on there in that namespace, or other objects with finalizers, it would be fine. Alright. I'm just gonna do it this way.

32:51 Cleaning Up Malicious Artifacts (Infect Namespace)

33:21 Good. I love I love seeing somebody really proficient on the command line like this. This just this just touches me right in that, like, soft place in my heart that nobody ever goes. Alright. So those are terminating. We still have some of the default ones, but that's fine. I'm not I'll I'll I'll leave your calling card there. I accidentally killed two of them, but whatever. I'm curious to see if they come back. Yeah. Me too. Still terminating. That's not looking good. They they they should we'll see. Yeah. They're they're done. It looks good. Yeah. That in this case

34:18 is all cleaned up. Yeah. Cool. Oh, well done. Do you wanna try your deploy again and see if we have a healthy deploy? Unchanged. Unchanged. Excited. Yay. Good job. And they're running. We got Postgres and Klustered. And I guess the last thing is to verify connectivity. Well, have to edit deployment klustered. We should be able to upgrade Why are you editing? We're Oh, I was trying create a video. Okay. Which I suspect will work fine, and then we'll port forward and see. I'm still impatient. So I have to ask, since I didn't disassemble your library or even, you know, really dump

34:21 Challenge 1 Fixed & Exploit Analysis

35:34 any debug, what Slash temp. Go check out slash Go check out slash temp. I I figured if if you were looking, slash temp is like that's I feel like if if I put it in slash temp, it's that's the difference between hacking and security research. It's fleeting things in slash temp. Exactly. I I I saw there was the temp nova directory. I haven't looked at what all or the file, but I haven't seen what else. It's it's I love it. It's very Yeah. You should just list everything in temp and it's a hidden file called dot nova,

36:07 I think. It might be obscure. It's caught in there with last. And yeah. I left it's very polite in there. There's even a a lovely Makefile for you as well. Excellent. So let's Yeah. I guess before we drop off, I wanna quickly browse those files. But so far, things are looking too good to be true. No. You did great. You honestly, I I mean this sincerely. You knocked it out of the park. This is this is well done. Alright. So let's let's see. Oh, damn. Okay. So there is the dot Nova file where What all files were there in temp?

37:04 How do how do we get back? Are you back in temp? I don't I don't see Yeah. I'm I'm in temp and the the dot Nova was the Look for a different file. Just list list long temp. Everything in temp. Okay. Obscure. I see. Temp obscure. There you go. And this is I even think the shared object should be there. There you go. It is. Oh, hiding process the process. Okay. This is this is a very common I did so we we thought about doing an eBPF something or other, but this is just good old

37:48 LD_PRELOAD, and and that was kind of, the big talking point I wanted to get here. And, yeah, I can see that my ps that I did earlier, the Nova did not show up. So My heart was beating out of my chest when you did that ps. I can't even and it and it didn't work. And I was like, oh, I've got them. I fucked I fucking got them. And then it was and then David was just like, oh, no. You gotta do a systemctl stop. And I was just like, that's gonna get it. You can

38:12 know that the PS didn't show it. Cool. Yeah. No. Well done. This is great. Thanks. Alright. Well, our port forward didn't work to browse to the application. Does that mean there's more to It should it should I didn't touch the network. Alright. You're you're welcome. Alright. Good job, Thomas. That was a Thank you. That was nice to watch. FLS is now gonna be stuck in my brain. Yeah. So SleuthKit, it's it's an old technology, but there are many other things. There's but one of the nice things about FLS is that it also notices deleted files because

38:53 it is dumping the raw EXT file system data and not just relying on the current state. So Nice. So check out if you go to the root directory and look at the .bashrc, I hid a little there's a a bash forward in there and there's a I made the assumption a couple of assumptions with this. One of which was that if you were looking at the raw block device logs in any form like with FLS, you were gonna get everything. But even if y'all were to exit out of this and come back, it you it's gonna

39:25 clean up after you're on the way out. Check out dot Nova in here. There's it's I hid it somewhere in there. I don't know where the pointer is. Yeah. There you go. That very last line. Yeah. So yeah. I mean, it was superficial clean clean, but but really looking slightly deeper into the Linux system, you were gonna get a lot. And I think this this is good. Nice. Excellent. Thank you for for setting up this scenario. It was fun. It it Yeah. It exercises a lot of muscles that haven't been used in decades. So Great.

40:01 Starting Challenge 2: Thomas's Cluster - Initial Look

40:01 Alright. And I think the yeah. Yeah. What do I do? Tell me what to do. No. If you got something to say, go for it, and then we'll jump into Thomas' cluster. Yeah. I'm trying to figure out how to open up Thomas' cluster. I'm here in the teleport browser, so I just go to root and click on connect. Is that what I do? Hold on because I got logged out when I restarted Firefox. I'm I'll start a session just so it doesn't flash your IP address on the screen. Oh, because if I go to active sessions,

40:30 your IP will be there. Let's just then do Okay. Yeah. Yeah. So I'll open the control plane and if you go to active sessions now, you should be able to see that join and then just type echo hello or anything like that. Okay. So I see a few seconds ago. I see two of them. I see Rawkode options join. Oh, because I just have one when I was copying the API server. So if you yeah. The one you've just joined is the correct one. It's my screen flashed. Let's see. So I'm gonna have to admit, I I

41:04 was not nearly as creative as you. Okay. No. That's good. That's good. I so I didn't really know what to expect, and I I kind of live in a bubble. Like, I I've I've taken a step back from kind of what's going on in cloud native and Kubernetes and everything. And so, like, really, all I knew was here's here's a kubeconfig and an SSH to a cluster and break it. Yeah. And I was like, okay. Well, I've got some some good, you know, horizontal attack vector tools, and I just kinda put together a fun scenario. So

41:32 I I I have no idea what to expect here and I don't know very much about you, Thomas. So like, you're gonna get to watch me flail now. Yeah. I'll just say everything's working just fine. Oh, great. Well, I have exported the kubeconfig path and aliased kubectl to k. So if you wanna run, get nodes, get pods or anything else, Kris, and we'll we'll see where we're at. Well, I I want it I feel like I should return the same favor of collecting before mutating. But I I don't even know if that's gonna that's gonna be necessary.

42:05 But let's yeah. Yeah. I'm just gonna just do just check out the state of the cluster, I think. We'll start we'll start at the top work our way down. Whereas I think we we started at the bottom and worked our way up the last. So let's just let's just get let's get pods in the default namespace, and I'll run the command that I wanted everyone here to I wanted you guys to run. So that looks good. You both managed to drop a message and to get pods. That was It's tradition at this point. Yeah. Wow. Okay. So, like, now I'm I'm

42:44 wondering, should I should I be collecting information on on the the file system here? Well, he says he kept it simple. Maybe maybe we I kept it very simple. I don't I don't I'll I'll say there are no root kits on this machine. Okay. Well done. So then let's let's see what we have here. So let's let's get we'll get all in the kube-system namespace and see what we got here. And I'm just I'm just looking at this point. So everything looks fine. It's all running. The control plane looks good. Yeah. So no crazy restarts. I mean, already

43:28 the things I did were sticking out to me like a sore thumb. You know, I had two containers running in the API server, you know, 14, 15, 16 restarts. Let's get namespaces and let's see. We have Rawkode. We have MetalLB. These all look fine. Cilium looks fine. And I guess the success criteria here would be if I scroll up, we do not have our application running. So I'll I'll I'll do the same thing that that that we did on my cluster. David, could you please deploy your application and see what happens? Indeed. And I have to ask, was the was

44:06 the pod honk Whose idea was this? Where did this come from? Okay. Well, we got an etcdserver error: database space exceeded, so we can't write to etcd in order to deploy our application. So we've seen this before on Klustered. Sorry. Let me go. Can you oh, can you go back and and show that? I I missed what you just said. So we've got etcd is complete. We can't write to etcd to do the deploy, and it's because the database space has exceeded. Okay. We'll Let's see. Just wanna check out our file system here. So that's okay.
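As background for the error above: etcd stops accepting writes once its backend database grows past the configured space quota (the `--quota-backend-bytes` setting), and the default quota is 2 GiB when that setting is left unset. The sketch below is only the arithmetic behind that check, under the assumption that the database size has already been read from somewhere such as the endpoint status. The function names are hypothetical and nothing here talks to etcd.

```python
# etcd's default backend quota when --quota-backend-bytes is unset (2 GiB)
DEFAULT_QUOTA_BYTES = 2 * 1024**3


def quota_usage(db_size_bytes, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return the fraction of the space quota used by the backend database."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    return db_size_bytes / quota_bytes


def over_quota(db_size_bytes, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return True when the database has reached or passed the quota."""
    return quota_usage(db_size_bytes, quota_bytes) >= 1.0


if __name__ == "__main__":
    # Illustrative size only, not a figure from the stream
    size = 3 * 1024**3
    print(f"usage: {quota_usage(size):.0%}, over quota: {over_quota(size)}")
```

A quota set far below the default, as described later in the discussion, would trip the same check with a much smaller database.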

44:17 App Deployment Fails - ETCD Database Exceeded Error

44:55 This all looks good superficially. And I'm gonna trust Thomas here that there's no rootkits and that these are all actual real numbers. The last time I saw this error on this show, the problem was either, I think, the attributes were changed on the etcd files, so it couldn't be written to, and then the other one was someone modified the etcd configuration, which actually has a size attribute which says we can only write this much to etcd and then it stops accepting any more. So I think we'll need to do that. And I have a cheat sheet

45:29 that I always use for etcd because I can never remember etcd. So I'll get that set up for us. I'm also curious about these Honk pods as well. I don't know anything. apt install etcd-client. I don't know anything. I love this. Oh, this is evil. I was so mean to you guys. I'm so sorry. That's okay. So I've broken one other cluster for Klustered, and I did get excessively evil in that one. So I was encouraged to keep it simple this time.

46:16 I see. And Yeah. We destroyed the boxes so much, we couldn't go on to fix it. So I mean, I I was like like, what what else do we wanna do here? I do I recompile a kernel and like put a new kernel image in the bootloader? Like like, what are we what are we doing here? And I just was like, my my promise to myself was we will we'll stop at the layer right above the kernel. Like, when I started entertaining the idea of loading kernel modules or eBPF probes, I was like, nope, Nova, that's just you can't

46:43 be that mean. Yeah. Okay. So, yeah, back to fixing etcd. Let's see what we got here. Yeah. So etcd, we now have a working client so we can speak etcd if we need to. It doesn't look like files have been modified, but of course we know that means nothing on Linux. Yeah. Linux timestamps are adorable. Yeah. I think on part one today, all three of us modified all the date times on the static manifest. They can no longer be I thought about doing that or like mutating any

47:16 of the inode features and I was just like, no, just keep it nice. Keep it real. Well, I don't see anything weird here. Can we check out the etcd logs? I just wanna see what the etcd logs are saying in the control plane. Oh, there's two of them. Let's try You can tail *. I like if yeah. There we go. Dialing, connection refused. That doesn't look right. What's going on? 2379, that's the etcd socket we're listening on. Right? Yep. So what's going on with that? Can you hit that socket?

48:14 Can we send data? Can you nmap that? What do we got? Who's screwing with my network? We can, because we can run an etcd get. So we are successfully able to speak etcd. We got some data out. What command did you just run, etcdctl? Yeah. Get, and then pass in a path, and let's just close that Kubernetes service. Okay. Cool. Okay. And it's still saying it's unable to write? Where does etcd write to? So if we take a look at the static manifest, we will see that it's written to I

48:55 think it's /var/lib/etcd. There's a host path in here. Yeah. /var/lib/etcd. Well done. Thank you. And you can just delete any of the files in here, and it'll recover just fine. Don't worry. That didn't sound sincere. So these are all 62 meg files. Okay. So this is effectively the etcd database? It was yeah. Can you type history? I just wanna see if we have anything there in the bash history. Yeah. Let's see what else we got here. Scroll down. Yeah. This is just our stuff. Okay. That's clean.

49:49 So there is, because I only know this because this happened before, but there's an etcd max database size thing that I bet has probably been set. Yeah. I don't know. My work with etcd is really limited. So I'll at least give a hint that might help. Sure. I've actually I didn't SSH into this machine at all. Of course, you didn't. Well, okay. So I SSH'd in to verify that I could SSH in, but I didn't do any mutations over SSH because I wasn't sure how to SSH in to begin with

50:30 anyways. Me neither. I did everything. I just Over APIs. As did I. I just privilege escalated into one of my hack pods for everything I did. Yeah. Have we lost our API server? I don't know. Check and see. Hello? Yeah. That looks fine. Okay. That's not working. So I would be willing to say that because etcd is not responding, the API server's hanging. This IP address is wrong. Was that a cron job change? That would have been sneaky. I'll say there are no cron jobs on the system. And no systemd services. I tried to keep it very different this

50:44 Investigating ETCD Connectivity Issues

51:42 time. Did you use systemd services last time? Yep. Nice. Alright. So it looks like So we can't hit localhost? So we were able can you just try to list we tried to list pods and it's hanging now. Correct? Yeah. Unless I am losing my mind. But we were able to list pods. We've seen that Thomas has necessarily all those pods that said honk. Obviously, there was some sort of timer thing happening. So maybe it's like a Kubernetes Job. But Yeah. There's no timers that I implemented. Just keep in mind that oftentimes the API

52:28 server gets really grumpy if etcd is grumpy. Yeah. I mean, the etcd server in general can take thirty plus sixty seconds to kind of get back to a happy state sometimes. Oh, we've lost etcd server failed to apply request. No space. It's just no space again. It's gotta be something somebody flushed a large amount of data to a pod somewhere. I'm those honk pods, if we get rid of those, they're I would hope that it's something, like, they're just writing a large amount of data somewhere to the API server, like, the

53:10 form of a label or something. Okay. So we need to be able to shut down those processes then. I think getting rid of the honk pods would be a good first step. And then I also I also like the idea of trying to flush etcd as well. But I don't know how much the persistent state of the cluster we would nuke if we just started manually messing with etcd files. I have I'd again, etcd is like I just I haven't really had to debug it Yeah. In dev. My my recommendation, do not use the file system. Use use

53:44 the etcd client. Yeah. That's Unless that's really useful. That means I have to know how to work the etcd command line application. Is there an alternative for us? Do we got a shell? No. No. Can I drive for a second? Please. Please. I typed, you know, nothing happened. There we go. Yeah. I wanna see this really quick. And the connection was refused, and let's ps aux, grep kube. And which container runtime are we using? containerd. What's I see all these new fancy container tools. I don't know what these kids these days are doing.

54:39 Okay. So let's look and see. I wanna how do we what would be the docker ps for So we tested to a crictl socket. Okay. I can't remember exactly. This may work. Yeah. I can tell you've done this before. But then you have to add ps at the end. Oh, it's a runtime endpoint. Right? I remember in a previous episode, showed off the like, you can just set the default for setting the runtime endpoint in the crictl config to save keystrokes later. But I always I saw it, I was like, okay. I should remember this, and I don't remember

55:29 how he did it. Oh, there we go. It's containerd, containerd.sock. So Thank you. Okay. There we go. Cool. We have Well, we have what is speaker? That's MetalLB. So we can ignore the speaker part. And etcd was recreated thirty-nine seconds ago. And attempt 135. So etcd is yeah. It's rather unhappy. Yeah. So what's going on? Look at how do we like look at this? I wanna see what this container is. Like, how do we do, like, a docker inspect on this container?

56:12 So we can look at this I'm assuming this Manifest. Wait. What? This is it all here? See, we should have dumped the cluster when we could have. This is why he did this. I should have learned. I should have been paying closer attention. So we could just try, like, an etcd compact, which would remove all the deleted items from it and bring the database size back into hopefully, enough to get it started. Go for it. It's just compact, isn't it? Yeah. Okay. Number? Oh, how do I pick a number? Needs one argument?

56:56 Just do Required revision has been compacted. Can I I have no idea what you're doing? Me neither. So let's try the defrag command too. I'm very natives. Let's try that. Yeah. Defrag's fun. Let's do it. Let's do it. Is there wait. Is there really an etcdctl defrag? Is this Yes. There is. I'm buying a farm and moving. I'm buying some goats. I'm so done with tech. I'm done. Alright. Well, that failed miserably. Context deadline exceeded. That's a Go error. Yeah. So you should check out the alarms as well. That's kind of a cool feature of

57:46 etcdctl. And, yeah, and the health. Those are all really great things. Oh, yeah. So there's the NOSPACE alarm. Yeah. So how can we list all of the is there well, it would be like an etcdctl du is what we wanna do? Like, what's the etcdctl version of, like, du -h? Yeah. We're getting warmer. Yeah. I don't know. You're gonna have to help me here. I genuinely don't know enough about etcd. I can start to Google etcd du or etcdctl.
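For reference, the upstream etcd maintenance guidance for a NOSPACE alarm follows a fixed order: read the current revision, compact to it, defragment, then disarm the alarm. A minimal Python sketch of that sequence follows; it only builds the `etcdctl` argument lists and runs nothing. The endpoint, the kubeadm-style certificate paths, and the sample status JSON are assumptions for illustration, not values read from the cluster in this episode.

```python
import json

# Assumed kubeadm-style defaults; adjust for the cluster at hand.
ENDPOINT = "https://127.0.0.1:2379"
CERT_FLAGS = [
    "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
    "--cert=/etc/kubernetes/pki/etcd/server.crt",
    "--key=/etc/kubernetes/pki/etcd/server.key",
]


def etcdctl(*args):
    """Return the argv for one etcdctl invocation (nothing is executed)."""
    return ["etcdctl", f"--endpoints={ENDPOINT}", *CERT_FLAGS, *args]


def current_revision(status_json):
    """Pull the store revision out of `endpoint status --write-out=json` output."""
    return json.loads(status_json)[0]["Status"]["header"]["revision"]


def recovery_plan(status_json):
    """Compact to the current revision, defragment, then clear the alarm."""
    rev = current_revision(status_json)
    return [
        etcdctl("compact", str(rev)),
        etcdctl("defrag"),
        etcdctl("alarm", "disarm"),
    ]


# Hypothetical, trimmed status output used only to exercise the parser.
SAMPLE_STATUS = (
    '[{"Endpoint": "https://127.0.0.1:2379", '
    '"Status": {"header": {"revision": 1516}}}]'
)

for argv in recovery_plan(SAMPLE_STATUS):
    print(" ".join(argv))
```

Each list could be handed to `subprocess.run` once the paths are confirmed. Picking the current revision answers the "how do I pick a number?" question for `compact`.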

58:32 Yeah. I'm not sure there's a du. If there is, I would love to learn about it. Yeah. I don't know. This is just what I think. Like, basically, I'm assuming etcd works like any database and it just writes, you know, some sort of block abstraction to regular old ext4 disk. And I'm sure there's some way to browse and explore those files and that's what I wanna do. Whatever etcd and those guys have called exploring their proprietary block abstraction, let's find that and let's do that. Well, it looks like we can get something

59:07 called a revision test key. Sure. So we have some people in chat talking about disarming alarms and increasing size quota. Yeah. I was googling how to increase the size, but I haven't and then I think I'm, like, literally, like, getting started with etcd.io is where I am right now. Yeah. Also, my Twitch chat seems to think that etcd DB size is limited and time scale db. Did you run out of disk space? No. Tanya, while we're not out of disk space, I'm talking to my Twitch friends. Yeah. I don't know. How do you

1:00:05 let's see. I'll tell you guys what I'm googling. etcdctl size quota. Alright. I found something under the maintenance docs, so we can get that. It's space quota. etcd --quota-backend-bytes. What's that? Well, that didn't work. We can't get the current revision so that we could work out how to delete the old revisions. This has an alarm disable. I don't know if we can just do that just to keep it happy. List, disable, disarm. The alarm's gone. Does that mean etcd is happy? Try to get pods. That's our that's our

1:01:07 control. Right? Okay. Get pods. Alright. So we can just disable the alarm. I don't know what that does with the database quota though. I don't know if it's gonna come back, but I wonder if we can do the revision command now that I copy and paste it. It's just still gonna give us a yeah. Still got a context deadline. etcd is probably not very happy. Thomas, did you privilege escalate into something with etcdctl, mutate etcd alarm size quota something or other? This isn't an app. You didn't actually flush a bunch of frivolous

1:01:45 data to disk. You actually just told etcd to misbehave. This is fine. Well, the alarm's not coming back, and we do have a working API server. So maybe we can ignore it. Try to And delete these honk pods? Yeah. Delete the honk pods. I'm gonna well, you have the cursor right now, but I wanna dump the cluster config now and look at the events in Kubernetes and see what this guy's been doing. Alright. Okay. We still can't write to etcd, so we are gonna have to try and change the database size. Let's see.

1:02:22 etcdctl change database size. I don't like it when people mess with etcd, just for the record, Thomas. Yeah. So this is a learning opportunity. So I'll note that you did buy yourself about ten seconds of runtime. You've already squandered it. By turning Alright. Yeah. Let's see, etcdctl database. Yeah. Everybody in my chat is like, we were too nice when we broke our cluster. No. This is good. So I'm gonna give an interesting tidbit here. One of the fun things about etcd is that the way that it's set up, it's listening on a public IP and port.

1:03:32 The etcd server itself is listening on a public IP that's reachable if the node is exposed, but it uses its own internal auth. That's where all the TLS, etcd certs management comes into play. Correct. So are you saying that you are just flooding well, like I don't know. I wanna start checking in to see how etcd is actually mutating the file system and seeing if you're truly filling up our file system or if you're just filling up what etcd is

1:04:04 told is the limit for its file system. So yeah. Try to set that right there. Set that to something substantially larger. That's huge. I know. Oh, because it's in a container. Yeah. Exec into the etcd container. I don't know if we can do that with crictl. Here we can. Okay. exec -it, copy the name of Yeah. It's not there just now. Oh, that's probably what's happening is the container is crashing and that explains the restarts of the API server as well. And It's not coming back. Look at the API server. Oh, we can't.
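The setting being raised here is presumably etcd's `--quota-backend-bytes` flag in the static pod manifest. It takes a literal byte count, and arguments in a manifest's `command` list are not run through a shell, so shell arithmetic such as `$((8*1024*1024*1024))` would reach etcd unevaluated. A small Python sketch for computing the literal value; the helper name and the 1 GiB and 8 GiB figures are illustrative assumptions, not the values typed in the episode.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def quota_flag(gib):
    """Render etcd's --quota-backend-bytes flag with a literal byte count."""
    if gib <= 0:
        raise ValueError("quota must be positive")
    return f"--quota-backend-bytes={gib * GIB}"


print(quota_flag(1))  # --quota-backend-bytes=1073741824
print(quota_flag(8))  # --quota-backend-bytes=8589934592
```

Pasting the printed string into the manifest avoids relying on any expansion at all.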

1:05:15 Come back, etcd. Let me make sure. Yeah. It's not gonna work. Wonder if oh, we can just add it here. Yeah. Well, that should get restarted by the kubelet and Check the kubelet logs, journalctl -fu kubelet. My favorite command to type. Not sure if I see anything etcd-ish here. Where did you edit the etcd pod manifest? Yeah. And we oh, it's in a back-off state, so it may take a bit longer to come back. Although, I think maybe that's it. There? And can you check the alarms again? I wanna see if that alarm came

1:06:42 back. We can't speak to etcd yet. Yeah. Oh, yeah. Because it's not ready. Yeah. Wonder if that number was too big. Maybe I'm making it worse. That's 16 megs. So change that 16 to 1024. Nope. Get rid of that. There you go. That's a gig. Failed to delete. So if you do I guess, it'd be interesting to see what crictl is showing nowadays for Yeah. Yeah. So Oh. I mean, we can always just restart the kubelet and try and Yeah. Encourage it. Yeah. Go for it. The only other question I have is whether like, I didn't realize

1:08:05 you could do shell expansion in the YAML command and the kubectl command. I didn't realize it went through a shell. Is that in fact true? Where did we do shell expansion? Did we tilde when we should have homed? Yeah. And here, where we passed the arguments then. Yeah. That's not gonna work. Can you actually do that? I've never done it. So but I would be surprised if we can run through bash. That's a very good point. Yeah. Just change that to if it's in bytes, There you go. Alright. And restart the kubelet again. And can you list pods or containers?

1:09:02 docker ps, crictl ps? Yeah. We still can. Or it's not attempted to start etcd. Kubelet log What's the Should be it notes not sync. There it goes. Let's see what containerd says. Oh, it just came back up. Yep. There it is. Yay. Disarmed. Well done. And can you get pods and delete those pods quickly quickly? Can you delete all now? Goddamn it. Alright. I guess we're going bigger. Alright. Alright. Sorry for swearing on your I don't know the rules. This is, like, corporate something. I'm not allowed to swear like a sailor,

1:10:37 I don't think. People will get mad at me. No. There is a code of conduct which explicitly states you are allowed to swear, but that's about all that's accepted. Alright. Okay. I'm gonna give a hint that the pods are just a red herring. So something else is filling up etcd. Yeah. If you look at the pod manifest, you'll see it's just a sleep 86400. Okay. Alright. So making that number bigger and restarting the kubelet to restart etcd was probably pointless because the thing is still running

1:11:11 on the system. Alright. How do we find this rogue process then? If it's a rogue process that he put on the system, I mean, we can start looking in the process tree. I wanted to look and intersect and see, like, what has changed on the file system in the last day, in the last two days, the last three days. Thomas is shaking his head no, though. Well, I actually say no because he gave us a hint. Alright? He said that this listens on the public thing. So let's oh, no. Because it's gotta speak with other

1:11:40 Identifying & Blocking the ETCD Attacker IP

1:11:47 nodes too. How do we if you got something running on your own machine that's talking to our etcd. Perhaps iptables might help. Perhaps iptables. Yeah. List iptables, iptables -L. Let's just see what we have there. And, like, look. You may wanna see who all is talking to etcd. netstat -tulpn would at least tell us what all is listening and where it's listening. So do we wanna netstat and look for things on port 2379? Yeah. Sure. I Close. What's like the starting flags for netstat to

1:12:36 get more here? -tulpn. I usually do -tulpn, but you can do whatever. Do you wanna type? Because I'm not sure what flags you mean. Alright. Okay. Yeah. So all those acts might be suspicious. Right. So 157.131.124.242 is an IP address we should maybe block. Is that an IP address? Yes. What? Yes. So it looks like he's doing something to etcd from wherever this IP is, which looks like it's a public IP. I don't recognize it off the top of my head. Alright. So should we just add a drop from that IP?
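The check being done by eye here, scanning connection listings for remote peers on the etcd client port, can be sketched in Python. This is an illustration only: the sample lines mimic `netstat -tn` output, the addresses are documentation-range placeholders rather than the ones on screen, and the function name is hypothetical.

```python
from collections import Counter

ETCD_CLIENT_PORT = 2379

# Hypothetical `netstat -tn` lines; the addresses are documentation-range placeholders.
SAMPLE = """\
tcp        0      0 192.0.2.10:2379      203.0.113.7:51514    ESTABLISHED
tcp        0      0 192.0.2.10:2379      203.0.113.7:51516    ESTABLISHED
tcp        0      0 127.0.0.1:2379       127.0.0.1:40122      ESTABLISHED
tcp        0      0 192.0.2.10:6443      198.51.100.4:60210   ESTABLISHED
"""


def remote_etcd_peers(netstat_output, port=ETCD_CLIENT_PORT):
    """Count established connections to the local etcd port, per remote IP."""
    peers = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an established connection.
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue
        local_port = fields[3].rsplit(":", 1)[1]
        remote_ip = fields[4].rsplit(":", 1)[0]
        # Loopback clients (the local API server, for example) are ignored.
        if local_port == str(port) and not remote_ip.startswith("127."):
            peers[remote_ip] += 1
    return peers


print(remote_etcd_peers(SAMPLE).most_common())
```

On a multi-node control plane, other nodes may legitimately appear in this list, so an unfamiliar address is a lead to investigate rather than proof of an attacker.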

1:13:20 Sure. Go for it. Alright. I'm gonna have to Google iptables now. I was gonna say I wasn't expecting to see the hint in netstat. I was expecting that you were gonna have to look in tcpdump or something of that nature. So you did mess with the network. Have to solve that issue. Okay. So we'll just do 157 You can do a /24. A /24. That? No? You can do a /30. Or yeah, that's fine. -j DROP. And run the netstat again for me, please. Last act.
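The exact rule typed on screen is not captured in the captions, so the following Python sketch only shows one plausible shape for it: an `iptables` rule inserted into the `INPUT` chain that drops TCP traffic from a source address or CIDR range to the etcd client port. The chain choice, the port restriction, and the placeholder address are assumptions.

```python
import ipaddress


def drop_rule(source, port=2379):
    """Build an iptables argv that drops TCP traffic from `source` to `port`."""
    # strict=False lets a host address with a prefix (e.g. x.x.x.7/24)
    # normalise to its network instead of raising ValueError.
    network = ipaddress.ip_network(source, strict=False)
    return [
        "iptables", "-I", "INPUT",
        "-s", str(network),
        "-p", "tcp", "--dport", str(port),
        "-j", "DROP",
    ]


print(" ".join(drop_rule("203.0.113.7")))     # single host, rendered as /32
print(" ".join(drop_rule("203.0.113.7/24")))  # widened to the enclosing /24
```

Validating the address with `ipaddress` before building the rule catches typos such as a missing octet, which matters when the rule is being typed under pressure.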

1:14:10 Should we restart etcd to would that flush those connections? Yeah. Go for it. Did Thomas, did you grab the go ahead. I was gonna say I just noticed in another window that something that I had been observing has changed state in that I can no longer connect to it no it can no longer connect to etcd. So I see. You did put something outside of the cluster. So oh, okay. I was polite. I put in a teammate backdoor in that container that I could at least host exec through, but I didn't like, I'm not running anything

1:14:51 on any of my local boxes. That's just not nice. I thought this was easy, but Thomas, come on. Yeah. It's just a very simple denial of service attack. That's it. Alright. Did you h But now you still have to so I can, you know, it's no longer mutating anything. So Sure. So whatever it was mutated. Yeah. Alright. So we have etcdctl endpoint health. There's no response He DoSed etcd on the public port on our cluster, which was preventing us from scheduling work in our cluster because etcd was filling up.

1:15:01 ETCD Flooding Exploit Revealed

1:15:36 So I think let's try to tidy up etcd. Let's get the API server healthy and let's I, like, I mean, obviously, I wanna start looking at other networking rules in this infrastructure, but we can talk about that later after breakfast. I will say you have solved the main, like, root cause here. Although, you still have to clean up after it. Right. The honk pods are gone or at least they're going. So that's a start. Well done. Was there anything else on this system that we Can you list containers really quick for me?

1:15:38 ETCD Cleanup & Validation (Increasing Size Quota)

1:16:22 Okay. Thank you. And try to run your deployment, yeah, one more time. Sorry. I'm just barking commands at you. I apologize. Oh, no. Please, let's make sure. Yeah. I think Yeah. It's happy. So the key here was actually you increasing the space for etcd Mhmm. combined with the lack of my scripts to DoS the data. So can I just confirm then? Right? That is how etcd works Right. On a Kubernetes cluster. Like, so anyone could do that to any public etcd that they found. No. You have to have the certs. Alright. Okay. So I'm just

1:16:41 Challenge 2 Fixed & Wrap Up

1:17:12 actually That's what I was worried about. Okay. So Yeah. So the takeaway here, just to kinda play it back, you scheduled a pod, you grabbed the etcd certs from the system because I mean, we had root. Once you had the certs, you then effectively I mean, etcdctl is a well-known tool. So you used etcdctl to hit what you knew was a publicly listening TCP port. And from there, you had the keys to etcd and you just decided to pick on etcd for us. Yep. Just flooding etcd with writes. Lovely.

1:17:48 Okay. So can we just check a few things here? I'm just curious why we're here. This is a great learning opportunity. Can you go back to that main directory where those .wal files, whatever they were, the etcd writes? Yeah. The /var/lib/etcd Yeah. And the write-ahead log. So oh, member. There you go. And can we just see how big these are now? I'm just curious how much of these etcd writes are being persisted, and how much of them are still there?

1:18:27 Etcd by default will, at least as configured here, only allow 2.1 gigs of data to be stored. Yeah. It's pretty trivial to flood that. Interesting. This is a great exploit. Good job. I did not realize etcd was this subject to flooding. But again, like the certs this is a really good one. Yeah. I was going with TLS flooding instead, but it seems like it's no longer an issue for etcd a lot of the DoS attacks I had considered no longer worked. I see. I remember hearing

1:19:11 some HTTP floods on a couple of the Kubernetes components back in the day, but I haven't seen anything like this. So I'm just like, can you just share a little bit about your flood? Just, like, I don't have to see it, but I'm just curious. Were you just able to write, like, a megabyte of data and just flood that in a loop or something? Yeah. That's basically so etcdctl has a put command, and you can put random keys. In fact, you can actually just keep putting to the same key, and etcd

1:19:42 will store each revision. So it's a dd of /dev/random into etcdctl. And so Oh. You piped /dev/random to etcdctl? Yeah. So we talked about compact and defrag, and we showed those off earlier. And had it only been writing to the same key, that could have helped. But I decided in the end actually just to write to random key names as well, and it was no longer helpful. Interesting. Awesome. Great. I've learned an awful lot from both of you today. So thank you very much for joining me. I

1:20:20 don't think we have time to give you my cluster, so I'll maybe open it up to people on Twitter or something. I don't know yet. Because we've already gone over time. I won't keep you any longer. But thank you both for joining me. There's lots of knowledge here shared and I appreciate that. And I hope you both have a wonderful day. Any last words? Thomas, I'd love to catch up after this and, like, nice to meet both of you face to face. But, yeah, this has been so much fun. Thanks for inviting me. This

1:20:48 was I had so much fun with this. This was really good for me. My therapist agrees. This is this is good for me. Yeah. Same. I I learned a lot just from, you know, not only building the case, but seeing seeing the case solved. So and as well from your your evil. That that could have been really brutal. I I so I'm glad I I kept it where I did, and I left I left, like, a good a good set of breadcrumbs for you. Yeah. I I will do a write up on mine, and I'll probably do a write up

1:21:18 on yours as well on my blog after all this. Sounds good. Awesome. I'll send you the commands. Alright. Well, thank you both again. Have a great day, and I'll speak to you soon. Bye.
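The point made near the end of the episode, that compaction would have helped if the flood had kept writing to one key but not once it wrote to random key names, can be illustrated with a toy model. The Python sketch below is not etcd's real storage engine: the class, its methods, and the sizes are hypothetical, and unique numbered keys stand in for random names.

```python
class ToyMVCCStore:
    """Toy model of a revisioned key-value store; not etcd's real storage engine."""

    def __init__(self):
        self.revision = 0
        self.history = []  # (revision, key, value) for every write ever made

    def put(self, key, value):
        """Record a new revision for `key` without discarding older ones."""
        self.revision += 1
        self.history.append((self.revision, key, value))

    def compact(self):
        """Drop superseded revisions, keeping only the newest write per key."""
        latest = {}
        for rev, key, value in self.history:
            latest[key] = (rev, key, value)
        self.history = sorted(latest.values())

    def stored_bytes(self):
        return sum(len(value) for _, _, value in self.history)


payload = b"x" * 1024  # stand-in for a chunk of random data

same_key = ToyMVCCStore()
random_keys = ToyMVCCStore()
for i in range(1000):
    same_key.put("/flood", payload)
    random_keys.put(f"/flood/{i}", payload)  # unique names stand in for random ones

same_key.compact()
random_keys.compact()
print(same_key.stored_bytes(), random_keys.stored_bytes())  # 1024 1024000
```

In the single-key case compaction reclaims everything except the latest value; with unique keys every write is still the live value for its key, so the data has to be deleted before compaction and defragmentation can free space.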
