About this video
What You'll Learn
- Diagnose pod crash loops by checking image tags, readiness probes, logs, and restart patterns under pressure.
- Debug API server failures via static pod locations, kubelet behavior, and kubeconfig typo fixes.
- Locate hidden admission webhook and CNI issues by tracing policy rejections, network policy blocks, and namespace scheduling faults.
KubeCon special part three. Jeffrey Sica and Chris Carty break clusters with a modified kubelet, OPA Gatekeeper policies, a broken Cilium CNI, and a hidden static pod manifest, then race to find and fix the sabotage live.
Jump to a chapter
- 0:00 Holding screen
- 0:35 Introductions
- 0:37 Introduction, Housekeeping & Sponsor
- 1:32 Guest Introductions
- 3:30 Cluster by Jeffrey Sica / Jeefy
- 3:41 Debugging Jeefy's Cluster Begins
- 4:31 Identifying Crashing Pods
- 5:29 Investigating Postgres Image Error
- 11:18 Searching for Admission Controllers
- 13:57 Node-Level Investigation Begins
- 28:26 Requesting a Hint
- 32:41 Jeefy Shares Cluster Backstory (Hint)
- 33:29 Deploying a Test Pod (Nginx)
- 35:58 Switching to Worker Node Debug
- 39:32 Suspecting a Modified Kubelet
- 40:16 Reinstalling Kubelet
- 41:51 Jeefy Explains Kubelet Break
- 44:00 Cluster by Chris Carty
- 44:30 Debugging Chris's Cluster Begins
- 46:45 Gatekeeper Policy: Required Label
- 50:51 Gatekeeper Policy: Allowed Images
- 54:00 Deployment Scaling Issues
- 56:17 Finding Blocking Webhook on Namespace Deletion
- 59:03 Unblocking Namespace Deletion
- 1:00:12 CNI Failure (Cilium Issue)
- 1:00:38 Restarting Cilium
- 1:05:15 Network Policy Blocking Traffic
- 1:07:41 Deleting Network Policy & Success
- 1:08:00 Cluster by David McKay / Rawkode
- 1:08:21 Chris's Cluster Fixed & Rawkode's Offer
- 1:09:01 Debugging Rawkode's Cluster (Optional Session Begins)
- 1:10:45 API Server Rejection & Checking Kubeconfig
- 1:16:01 Using Curl to Test API Server
- 1:17:50 Re-examining Kubeconfig
- 1:21:33 Identifying Kubeconfig Typo
- 1:24:11 Fixing Kubeconfig Typo
- 1:24:41 Chaos Pod Breakdown (Due to Fixed Kubeconfig)
- 1:27:02 Chris Departs
- 1:27:24 Rawkode Explains Static Pod Bug
- 1:32:26 Searching for Hidden Static Manifest
- 1:35:07 Rawkode Explains File Hiding Techniques
- 1:37:30 Conclusion & Wrap Up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:37 Introduction, Housekeeping & Sponsor
0:37 Hello and welcome to Rawkode Live. I am your host, Rawkode. Today is part three, the final part of our Klustered KubeCon special. Before we get started, there's just a little bit of housekeeping. First, if you're not already a subscriber, please subscribe now. Click the bell and you will get alerts for all future episodes of Rawkode Live, where we explore the cloud native landscape together. If you're watching live, or even if you're not watching live but you wanna talk to a few hundred other technologists focused on cloud native, Kubernetes, Docker, containers, and all that cool fun stuff, there's a Discord server.
1:10 Come and say hello. And lastly, I wanna thank Equinix Metal. They are my employer, and they allow me to do this on their time. So thank you very much. If you wanna check it out, use the code Rawkode. This will get you $200 in credits. That's roughly four hundred hours of compute if you spend it wisely. If you wanna have a bit more fun with it, you can spend it pretty quickly. So let me know how you get on. Alright. Today's guests joining me on Klustered part three, the KubeCon special, are Jeefy and Chris.
1:32 Guest Introductions
1:37 Hey. How are you both doing? Doing well. I love that. I just throw the question at both people and see who answers first. But Jeefy, do you wanna say hello? Tell us who you are, what you do, and then we'll move on to Chris. Sure. Hello. My name is Jeffrey Sica. Pretty much everywhere on the Internet, I'm known as Jeefy, so they're kinda interchangeable. I'm a principal software engineer at Red Hat. I focus mainly on CI/CD with managed distributed systems like OpenShift Dedicated. I'm also a CNCF ambassador, so I'm kind of all over the place.
2:12 And I'm a Kubernetes contributor, so I'm a SIG chair for SIG UI, so Kubernetes Dashboard. And I help out with Kubernetes community and also Slack admin, YouTube admin, that sort of thing. Thanks for sharing. Chris? Hi, everyone. My name is Chris. My day job is a customer engineer with the Canadian public sector at Google Cloud. In my off hours, I do some community work. I'm nervous and terrified and excited right now, so I'm gonna stumble a lot. It's gonna be fun. I contribute to Kubernetes as much as I can where I can. I got the fun little
2:50 Kubernetes contributor shirt. I do this primarily through the monthly office hours, which is also cohosted by David here and Pop as well. So, if you're looking to learn more about Kubernetes, that's another great way to engage and interact with the community and learn a lot of fun stuff. So, I'm excited to see what we can do today. Well, anxious, excited, and scared are the three things I want people to feel on Klustered. You're hitting all the targets. Quite a little sweaty. I just wanna thank you both for joining me because you're both cornerstones of the Kubernetes and
3:24 cloud native communities. It's a pleasure to do this with you today. So Thank you. Alright. Quite in watch. That's the niceties over. Now we go straight out to debug. Alright. Hold on. Hold on. And that's nicely leads us into. We are starting with Jeffrey's cluster first. So, Chris, you and I are up. I am gonna connect to the control plane node of this cluster. If you could join the active session and type echo hello to confirm that we are sharing a buffer, that would be appreciated. Okay. So this is gonna show up under activity active session? That is right. Yeah.
3:41 Debugging Jeefy's Cluster Begins
4:02 Okay. I was waiting for it to refresh then. Come on. Oh, maybe I'm looking at the wrong cluster. Yep. One four five, please. Yeah. I had two tabs open, and I definitely went to the wrong one. So we're off to a blazing good start today. Alright. Well, you just joined because my alias set forth. Thank you. Yeah. Getting the kubeconfig and alias out of the way just saves so much pain as we dive in. But I am gonna give you the honors to run the first kubectl command and see if we have a control plane
4:31 Identifying Crashing Pods
4:35 or at least an API server. Let's see what we have here. What kind of journey are we going on today? Let's check what nodes we've got. It's all ready and working. That's refreshing. Very nice. I know I've given this. No. There's no niceties here. You got basic kubectl, and that's it. I thought I'd check. Yeah. And I'm just worried that this is like an exam. I'm gonna forget everything I ever learned. That's alright. We're just gonna be And we're doing this together. Talks. Yeah. Oh. So we have one or two restarts.
5:20 Yeah. We have both of our workloads here in the default namespace, both in CrashLoopBackOff and with over a thousand restarts. Control c works as the copy and paste. Right? It's probably gonna send a SIGTERM to the terminal. It does capture it and work it. So you might wanna do a shift and insert. Yeah. Control shift and insert, whatever that is. Yeah. Oh, I'm doing a bunch of bad stuff right now. I'm just gonna type it out because this is already off to a good start. Sorry. Web terminals aren't my forte. No worries. Here we go.
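Editor's sketch: the first triage step here, spotting which containers are crash looping and how often they have restarted, can be automated against pod list JSON. This is a minimal sketch, assuming input shaped like the output of `kubectl get pods -o json`; the function name, the `min_restarts` threshold, and the sample data are hypothetical and not taken from the episode.

```python
def crash_looping(pod_list, min_restarts=5):
    """Return (pod, container, restarts) for containers that look crash-looped.

    A container is flagged if it is waiting with reason CrashLoopBackOff,
    or if its restart count has reached min_restarts (an arbitrary threshold).
    """
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            restarts = cs.get("restartCount", 0)
            if waiting.get("reason") == "CrashLoopBackOff" or restarts >= min_restarts:
                hits.append((pod_name, cs["name"], restarts))
    return hits


# Made-up sample data for illustration only.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "postgres-0"},
            "status": {"containerStatuses": [
                {"name": "postgres", "restartCount": 1042,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"name": "healthy"},
            "status": {"containerStatuses": [
                {"name": "app", "restartCount": 0, "state": {"running": {}}},
            ]},
        },
    ]
}

print(crash_looping(SAMPLE))
```

On a real cluster the same function could be fed the parsed JSON from the kubectl command above instead of `SAMPLE`.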
5:29 Investigating Postgres Image Error
6:00 Yeah. There was the Klustered, what, two months ago where I pasted my password or typed my password into the session. That was a good thing. Yeah. I might do something like that. I haven't done a debug on a computer in probably about a year, so this is gonna be good times. I just have to reload, but I'm good. Okay. What have we got? Readiness probe failed. The probe failed? Oh, let's see. Let's see if Postgres is giving the same error. I feel like it should be also checking
6:36 for CRDs, but we'll get to that. Yeah. Let's take a look at Postgres first. Yeah. Paste it. There we go. We're just getting loads of evil faces from Jeefy in the chat. So I'm not even looking at me. It's, you're looking at the side of my face. I'm like, in the terminal over here, and then the camera's over here. So hope you enjoy the side of this. This isn't a good bet. So what we see with this one then, he's obviously changed the image because this is the Jeefy Honkernetes image on our Postgres. Apologies. I didn't even look at the images.
7:20 Was looking at what it described. That makes sense. Yeah. I wonder if that's gonna be as easy as just fixing that back to the Postgres image that we expect or not. But we can give that a try. Rollout history. Rollout history. How fancy. Oh, yeah. That's what I came here for. Alright. Like, I think for the first four episodes, I would just do, like, a kubectl delete pods --all, and then people kept yelling at me going, rollout restart. What are you doing, you animal? And then so I
7:59 started to try to clean up my act a little bit. Less sledgehammer, more dancing. Mhmm. Okay. So we can see the revision. But you never left a change cause. That was rude. No comment. I'm curious how much of your screen is actually off the screen because I only just now saw error Okay. With no history. I'll try to keep it up top. I've dragged it back up. We should be good. Cool. Cool. Yeah. There's yep. Yeah. Thanks for the reminder. We got a reminder in chat as well. I need to stop. I have Thank you. I use this window
8:50 thing that adds padding to all my windows, but not quite enough for the bottom of the screen. Yeah. So I just need to check to see if there's a Honkernetes or something in there, but no. I'm just gonna spin up a local item so I've got access to the real resources and manifests if we need to. Well, we will probably need to take a look. Yeah. Let's see what happens if I kill that pod. Let me just do. What journey is he gonna send me on? The Postgres image is just postgres with the tag 13-alpine if you wanna try
9:40 spinning it back. I don't know. Did you revert it manually there? Not yet. I just killed it just to see what would happen when it spawns back up. So I'm just waiting for it to do it. So it's saying, and then, yeah, we'll change the tag back to normal. This is where we twiddle our thumbs. Yep. I just want you to know that I'm even nervous, and I'm the one that broke it. We're getting closer, warmer, colder? Yes. Oh. Alright. This is the restart. So I think that that probe's gonna fail after x attempts.
10:27 Yep. I say we just revert the image. Oh, hello. Now Derek came to say hello. Say we just replace the image with what we expect and then we'll try and tackle this Klustered one maybe. You know, I will say one thing. One little tiny thing. You haven't checked the logs yet. Don't get me wrong. It's not gonna help you. Uh-oh. You just wanna show the logs. Right? Okay. Oh, yeah. My process is a little dated, I guess. Alright. Yeah. It's still running. Okay. It restarted. So we're going to
11:12 fully crawled. Stateful set? No. Yeah. The Postgres one is the stateful set. Yeah. Are you gonna update the image? Yeah. Oh, let's take a look at those logs first. I wanna see what Jeefy said in there. Yeah. Oh, right. Because that's implied. Oh, yeah. Yeah. That's a bad habit of mine. I always specify. Let's Alright. Alright. I think it's safe to throw the image back to normal now. Yeah. Nope. Cool. And where are we? That's about 20 lines. Oh, I think it went past it. Yeah. There you go. Oh, wow. Sneaky. See what else is in here.
11:18 Searching for Admission Controllers
12:42 Spec containers. Can you search for honk in this file? You're gonna have to remind me how to search in vi. Forward slash. Forward slash. No? Okay. So the state's okay. Were you thinking it was a mutating webhook? Yeah. It has to be, right? Mutating webhook configurations. There's no auto complete. It's really annoying. Okay. I'm sorry. k get mutating... Would you want me to type for a bit? There you go. I have it. Webhooks. Yeah. That'd be great. Yeah. It's just configurations. It's just such a weird Nope. So it doesn't mean that there isn't one.
13:38 There could be a static one in the API server manifest. Although that would be sneaky because it'd have to compile his own API server. Yeah. You are a sneaky gif. Okay. We can switch over to just drop out of that and go into one of the nodes. Right? What do you wanna check on the nodes? Yeah. Oh, we can. I'm just not sure what to check on the nodes. Like, let's let's try and think this out because this is annoying me already. Something is Yeah. Just wanna take a look at the stateful set again or say the pod.
13:57 Node-Level Investigation Begins
14:46 Map. See if there's anything weird in here. Alright. Do you mind if I try something? Yeah. Go for it. Ever run for dot stock. Can you list images? Yeah. Okay. There it is. I was wondering if maybe there were multiple aliases on the image and it was showing us the default even with the other one, and there was an image pull policy perhaps, but I don't think that's the case. But we can rule that out Mhmm. By switching this to Always. It's not doing anything. So I don't think my idea is
15:53 right. Yeah. Okay. That time's gone. Yeah. No. I keep coming back to the webhook, but yeah. So what else could modify something in the cluster? So I'm trying to think. Let's see if there's anything suspect in the API resources. And I'll check one more thing before maybe I start trying to talk out loud again, but what if we got running across the whole thing? Okay. There's only one other thing I think this could be. And then I'm probably completely out of my depth. containerd allows you to set overrides on an image. I was worried there would be some yeah.
17:11 Is not something I've played around with yet. So this is gonna be a fun adventure, a learning experience. Yeah. I'm not a fan of trying to debug this stuff either. Where is it? containerd is in /etc. CRI? I just started studying for my CKS. This is going to keep up with me. So containerd is running as a systemd unit, so there could be a systemd overlay maybe. You try systemctl cat containerd, and it shows the. Alright. Sorry. Cat, did you say? Yeah. Cat. Oh, cat. Yeah. That should print out all
18:10 of the, they're not called overlays in systemd. Can't remember what they're called. systemctl. We've been really Oh, sorry. No. No worries. Nothing. I'm seeing honk just in the root words. Where is it? And there was definitely no containerd here. There's no. I'm a very naughty goose. I appreciate it. This is good. Someone said, what's up with the guy in the middle? And I'm not talking. It's because if I talk, I might give a hint and give it away. Yeah. Okay. So I think we're gonna get you talking. We need to get you monologuing. Is
19:04 that it? I think Tahira is just not familiar with the new format where we have the breaker live to laugh at us. So the breaker will be mostly quiet during their cluster. That's very true. Because anything they say can be used against them. So Yeah. This is my brain is struggling to understand what's going on here. Do you want a hint? Not yet. K. Okay. So the stateful set shows the right image. That's the bit grating on me because Yeah. I'm just trying to think if you can have a stateful set where the pod would be
19:45 controlled by something else, but can't. Okay. Let's check the control plane. It's not been messed with. Can we go to the static manifests and maybe see if any of those files look like they've been Yeah. Where is it? It's /etc/kubernetes/manifests. Yeah. Yeah. We're Kubernetes. Seeing every small Kubernetes. That's the pressure of people watching you typing always just It's Yeah. It's a combination of stuff. Right? Oops. Sorry. I just did the long I was gonna do the exact same thing. Great minds think alike, fools seldom differ. Right? So he's not been tinkering. I mean, he What did you give us
20:42 access to this? Well, he could have changed the dates on them. I don't know if he'd be that terrible of a goose, but I'm trying to look at his smile and see if anything's slipping out. I'm really this is my poker face. My stupid smile face. Just take comfort in knowing that in fifteen, twenty minutes, I'm gonna be the one in pain. Yeah. Yeah. That's true. Where could it be? So there weren't any mutating admission controllers. Yeah. We didn't see Kyverno. He hadn't deployed anything new to the cluster. Right? Yeah. I did that whole namespace, I think.
21:26 I just wanna double check. You god. I'm trying to match together my hotkeys I have on mine versus yours, and it's just. And we never saw anything in the containerd directory to suggest an overlay. So there's a command on containerd. Oh. Done. There we go. I'm gonna have to remember this one. Thank you. I don't know if it's gonna be helpful, but Could be better than staring at the screen. I'm almost at the stage of going to the root directory and doing a grep -r honk. That's where we're getting to here. Alright. So we got /var/lib/containerd. I'm
22:21 gonna I think we should check that out. I think I mean, do we even know that containerd is real? Let's see. And then you swap out the binary. Right? That's why I was hoping when we did the I'm just gonna slowly go up. So I was hoping something would show here. Well, let's try doing along on March 11. Okay. So maybe not. Let's see. Version. Doesn't say honk. That's a good thing. It's a delightful. Alright. Let's think outside the box for two more minutes before we maybe get help. Yeah. Just cube. Yeah. Resources.
23:16 Did we already do this? I think we did. Yeah. I scanned it, but I didn't look in anger, to be fair. It's really turning into Jake Pisteck cloud kinda kinda scenario. I like it. Alright. But we're not covering the basics here. We're getting caught up at the moment. Let's look at the logs on containerd. Let's look at the Kubernetes event log. There's gotta be a thread to pull on here. There has to be. Yeah. journalctl -l? -l -u, or -lu and containerd. Oh, yeah.
24:05 And if we don't want the pager, add a -f as well. Oh, the u has to be at the end because it's the unit name. Alright. What have we got? Image. Create container. Create container. Looks normal ish. Nothing is popping up. So I'll take a look at these logs again of that Postgres. So KB has asked, is there a proxy somewhere? I'm gonna go out on a limb and say no. It has to be in cluster. At least I feel it has to be in cluster because the stateful set has the right image. Mhmm. So I don't think it's a
25:27 proxy. Alright. Let's try the events. Get events. Present. We're not friends anymore, Jeffrey. I'm sorry. I'm not sorry, but I do feel bad. You've done your job very well. So what I'm thinking is we could nuke the stateful set and try to reapply it to see if this weirdness persists. But then we don't really I suspect it may not. I'm not sure. And then I don't know if we'd ever find out what it is. Could do a rollout restart. Let's would you mind if I type for a moment? Because I don't really know what I
26:35 wanna do. Yeah. And I'm my brain's a lot like, maybe he replaced something in Postgres there that's calling the API server or I think one thing he's done something. I have absolutely done something. Oh, can I try? Oh, yeah. I'll let you. I just wanted to see if we had a kubelet fail here. We do not. I wanna see if our /var/lib/kubelet config has anything suspicious. You can feel free to try whatever you want. I k. I'm running out of ideas. API server, one word, and you'll need -u for the unit. And dash l k v p I. Thanks,
27:36 They're saying it's always DNS. Maybe not. Thanks, Yeah. There's no logs in the API server. That's terrifying. Oh, because the API server is a static pod. So we'll need to get the logs from Oh, right. Yeah. It's Okay. It's been a while since I used it since they switched over to that. Yeah. This is, if you don't use kubeadm on a daily basis, you'll lose track of where stuff is. Yeah. Alright. I think I'm gonna ask for a hint. What do you think, Chris? Will we get a hint? Yeah.
28:21 Unless he wants to watch us flail for a little bit longer. No. No. However, you get one of two hints. You can either get a hint, which is the backstory of what happened to the cluster, or you can get a hint about something you've already done. Cool. So did we get close at some point? Pick pick your do you wanna do, dude? I'm gonna leave this decision entirely up to you. What do you want? I'll take the hint of for something we got close to. The story is appealing, but I think we can we might be able
28:26 Requesting a Hint
28:58 to pull the thread a little. Right idea, wrong folder earlier. It has to be the containerd overlay. Oh. Alright. I'll give you one other hint. Not containerd. Always move to config somewhere? Oh. Okay. So the static pod directory could have been reconfigured. Yeah. So we wanna take a look at the kubelet config after all maybe. See, part of my problem is I don't know whether to say warmer or colder. And I don't know how much you want to flail. I don't know how much fun other people are having watching you. People seem to enjoy watching me suffer on
29:47 the show. I've gotta say. It seems to be the fun part, at least the the learning part. I watched the session yesterday and a couple others, and it's entertaining just seeing the how people try to approach debugging the clusters. And I I love seeing people work real time because everyone works and troubleshoot differently. Yeah. Like, yesterday, there were some awesome awesome hints. I will I will also say that I was going to do something different because I thought this idea was way too mean. And when I saw some of your previous episodes and went, nope.
30:28 Nope. I'm gonna do it. I did get opinions from both Christophe and Bob about whether this was too mean or not. Here we are. Okay. Alright. I'm just gonna cat this. And I will leave it up to both of you whether you want another hint or not. Can we just let that fail again? So the static pod path had not changed, which I am happy to see. Yeah. I was gonna move over to the kube-apiserver one again, see if there's anything kinda Yeah. Red herring there. Yeah. That's it. Yeah. I guess, yeah, the next hint
31:30 would probably depend on how much time we have left. I cannot remember. So the hint was wrong directory. Yeah. You were talking about something something, but it was the wrong directory just to think about it. Oh. What directory is it looking to? Yeah. I'm trying to we've been in Let me also say, not just wrong directory, wrong project. Do you put CRI-O in here? No. There's no CRI-O. I'm pretty sure. I will tell you the story of the cluster. This cluster was a cluster run by an intern
32:41 Jeefy Shares Cluster Backstory (Hint)
32:49 at a school. Nothing really important was running on the cluster, but we got a notice that we needed to upgrade the cluster. So the intern upgraded the cluster. I don't know how they upgraded the cluster, but they did. And then six months later, all of a sudden, we have this we have this issue where we can't schedule any workloads. That that's our hint? Yep. Sounds like certs. Oversell to the it it can what? I wanna see what happens if you deploy a pod. Can we just do, like, a NGINX deploy just to see if that goes
33:29 Deploying a Test Pod (Nginx)
33:40 up. Or Yeah. Oops. Please. I'm really excited. I can I know this You can type? It worked. Okay. We'll put a load on it, make sure it actually is NGINX. Yeah. You, my friend Ugh. Having a good time. I'm glad the chat are enjoying it. You are creative. And you're saying it's not containerd. So it must be the kubelet. Right? Something I'm committed to working on what he's done to our kubelet now. So let's go take a look at No. Okay. Next. Okay. So using standard here. /var/lib/kubelet. We already looked at this.
35:05 I also do have one more hint when you're ready. Don't talk to me. What's he doing? What's he doing? Let's do. Nope. Let's do k. I mean, that is all of the Can we list the pods that are running again? Oh oh, what? No. Never mind. I thought I saw something. Can we see what Honkernetes is running? Or can we check the logs on Honkernetes? I'm being silly. Right? He won't have affected the kubelet on this machine. Mhmm. So we're I just wanna see what that We need to be on this machine.
35:58 Switching to Worker Node Debug
36:08 Which one? S x six. Because workloads won't be scheduled to the control plane, and we're bouncing around the control plane, which I think is very naive of me. So I've opened a session on s x six if you wanna join. Yes, yes I do. And let's look for irregularities on this kubelet. Let me just reload since you've joined. Okay. So /var/lib. Alright. That looks okay. Next Our flags here. Let's take a look at the parameters. What? I'm sorry. It was good. What I didn't bother. Alright. We're gonna give it five more minutes and then Are you gonna wanna know the answer or
37:55 you're wanna put this in the pocket? No. No. We'll get the answer. Alright. Yeah. The audience will demand it. We need clarity for what what your nefarious honking you did. Alright. Can I ask a couple of questions right before you I let me let me also give you one hint, the last hint I was gonna give? We talked about our backgrounds when we first joined the call before we went live, and I said what my background was. I'm a developer. I'm not an ops person. Go ahead and ask your questions. Should we be in a control plane node
38:38 or the worker node? Yes. Did you build your own container to do something? I did not. Well, technically, I built Jeefy Honkernetes, but I didn't build Fair. I didn't build Postgres. I didn't build a new Klustered. Yeah. Just does the binary, wasn't it? How did you get the strings or something again? I don't remember. I thought it was just strings. I will also save you time. I did not put any honk bread crumbs in outside of the log. Did you build your own kubelet? Yep. Alright. I've got an idea. It's like, how do we
39:32 Suspecting a Modified Kubelet
39:39 why is that not working? I'll keep it Oh, yeah. And it really needs to remove everything. No. I have /var, the cache. I have /var/cache apt. I was hoping for a nice Debian package here. Yeah. How do we do a reinstall? apt install. Well, it's good. I've just upgraded the kubelet too, so there we go. Hopefully, that doesn't make new breaks. Alright. I think we have the normal one back on this machine, but that doesn't mean that he hasn't done that on every other node, assuming he has Yeah. Has his own kubelet. I can go back to the other one,
40:16 Reinstalling Kubelet
40:56 and I can restart. And, hopefully, that Yeah. Feel free just to And NGINX is back up and running. Or it's wait. No. We need to check the logs. Yeah. It's not been restarted yet. Yeah. I got excited. I saw the one. It's like, yes. I'm gonna go check the other two machines just in case the is different on them all. If you wanna change the workload. I will say, you are correct. You have solved it. However, there was a second part of this, which is you have to assume that all
41:32 of the nodes and the cluster itself are owned. Oh. There are no root there are no kernel changes. We did not host the operating system. You just repointed us to an outside cluster? No. No. I think didn't Seconk do that or, like, roll into a cluster inside of a cluster? Yeah. Yeah. Yeah. I'm giving it to you because you have absolutely figured it out. I compiled my own kubelet and placed it on all of the nodes. What you don't see is in my terminal window every sixty seconds, I'm SSHing to those nodes and overwriting
41:51 Jeefy Explains Kubelet Break
42:22 kubelet if it's been changed. Because I wasn't sure how long this was gonna take and now I feel bad. So I've already updated kubelet twice so far. You're very quiet, Tanya. That last bit is I mean, we'd never be able to work that out. That's harsh. Yeah. But Yeah. It's a nice one. Well played. The idea was if someone has a cluster that they didn't know, like, how it was installed or how it was managed, they can't actually validate the supply chain. So they can't validate that kubelet is actually what it should be.
43:01 And if a cluster has been taken hostage, just fixing the cluster isn't necessarily gonna fix the cluster. That was that was my idea. That was my little Disney story arc of being really, really mean. I'm sorry. So from now on Good. I'm disabling SSH access on all of these machines. That's crazy. Hey. Well done, sir. I I stopped you before that came into play. But it's probably a good idea. Yeah. See, this is one of the thing. This has happened before, right, with the binaries on the machines is that if someone does swap out the binary that
43:35 is 99% functional, it's really hard. Like, we need better audit, and there's Falco and there's other tools that we can work with in this space that would tell us this. But you know, by default, we have no idea that those binaries have been swapped underneath us. And that's a common theme there that I'm seeing people abusing on this show. So nicely done. I feel like it is a really good lesson because that's also something like, that's a problem if you're running your own cluster. Like, that's a benefit of managed offerings: they are going to be the ones
44:00 Cluster by Chris Carty
44:06 managing the whole supply chain start to finish. When you're managing your own, there are just so many different things you need to think about. Yeah. I'm gonna write a script to start doing checksums on all the binaries. I've seriously considered, like, starting that as a project just because of this idea. Like, there is no real good way to do that across the whole stack. Yeah. Alright. Well played. Now, we switch roles. Yep. Now I'm gonna feel the pain. Hopefully. Like, this is my first time purposely breaking a cluster. So this is either
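Editor's sketch: the "checksums on all the binaries" idea can be illustrated with a small script that compares each file's SHA-256 digest against a trusted value. This is a minimal sketch, not anything shown in the episode. The function names are hypothetical, and the trusted digests are assumed to come from somewhere you already trust, such as the release's published checksum files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected):
    """Return paths whose file is missing or whose digest differs.

    `expected` maps a file path to its trusted hex SHA-256 digest.
    """
    bad = []
    for path, want in expected.items():
        if not Path(path).is_file() or sha256_of(path) != want.lower():
            bad.append(path)
    return bad
```

A caller would pass something like `{"/usr/bin/kubelet": "<trusted digest>"}` (path and digest are placeholders) and alert on any non-empty result. As the episode points out, a check like this only helps if the script and its digest list live somewhere the attacker cannot overwrite.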
44:30 Debugging Chris's Cluster Begins
44:40 gonna be really easy or it's gonna be interesting. Alright. Well, I have opened a session on Chris's control plane. If you could please join and just give me an echo or something so that we know. I'll set up the kubeconfig. Session. I already see typing. Yeah. I get a weird screen refresh when that happens and my scroll breaks. I'm just gonna reload. It's Teleport. It's such a cool project, but they need to fix this bug. Alright. No more honk, please. Come on. Right. I will give you the pleasure of seeing if we have a control plane.
45:15 Alright. So far we do. So far, we are only missing the the website. I'm gonna try, George. I'm gonna try. Yeah. Make. Please make me try. Interesting. So we have zero of one deployments ready. Nothing up to date. Okay. So I I just realized that it's not gonna show. So I'm scrolling through here Oh, yeah. Just to see yeah. So there's one desired, zero updated, two unavailable. Also, rolling update strategy, 25% max unavailable, 25% max surge. Well, it looks like we got an old replica set and a new replica set. Maybe something is there no scheduler,
46:21 maybe? You go hard already. I think he's had a rough couple sessions. He's gonna be checking the checksums on every binary. Oh, we have admission. Oh, Gatekeeper. Nice. Alright. So error creating: admission webhook validation.gatekeeper.sh. Yep. Denied the request. So Gatekeeper's been installed and apparently is denying: all pods must have a clustered label for branding and association. So, Chris, if I add that label, what's gonna happen? Oh, you're just gonna find out. I'm gonna try to drag this out as long as humanly possible. So let's edit this. I'm bad at the Kubernetes.
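Editor's sketch: the rejection above applies to pods, so the label has to be present on the Deployment's pod template rather than only on the Deployment's own metadata. The check below is a minimal sketch of that idea in plain Python. It is not Gatekeeper's implementation, and the label key `klustered` and the sample manifest are hypothetical stand-ins for whatever the real constraint requires.

```python
def missing_template_labels(deployment, required):
    """Return the required label keys absent from the pod template, sorted."""
    template = deployment.get("spec", {}).get("template", {})
    labels = template.get("metadata", {}).get("labels") or {}
    return sorted(set(required) - set(labels))


# Hypothetical manifest: the label sits on the Deployment, not on its pods.
DEPLOYMENT = {
    "metadata": {"name": "web", "labels": {"klustered": "rawkode"}},
    "spec": {"template": {"metadata": {"labels": {"app": "web"}}}},
}

print(missing_template_labels(DEPLOYMENT, ["klustered"]))
```

Adding the key under `spec.template.metadata.labels` makes the function return an empty list, which mirrors the fix attempted in the episode.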
46:45 Gatekeeper Policy: Required Label
47:27 Oh, okay. You can just add labels. You want me to do it? Yeah, please. Alright. So what was the warning on? Was it on the pod? Was it on the... it won't be the pod. It was the pod. Right? So we need to... It was the pod. So, yeah, we need to add the label. Oh, I'm dumb. Yep. Yep. Alright. So we've added our label. Do we have pods? Of course not. Alright. So, whoop. Control-W does not work in a browser terminal. I've done that so many times. That's exactly what I'm gonna do. Oh, no. We got... No. Alright. So the
48:15 replica set... Oh, we got an invalid. Alright. Okay. So we've also got something blocking us from deploying anything that isn't an ancient database to IO repository. So we need to take a look at the... what does Gatekeeper call policies again? So I've tried to pepper some Stargate references throughout this. Just... Appreciate it. I always appreciate that. That's awesome. So we got required labels. That's the... there we go. Allowed repositories. Okay. Yeah. Yeah. Load. Ancient archive of knowledge. Mhmm. Gotta watch out for this. I love it. So you allowed all the control plane nodes? I'm not
49:13 super mean. That's first time. So I just wanna start simple. I will just bounce back to my kubelet patch. My kubelet patch allowed anything in kube-system, Cilium, metallb-system. Something else. Didn't wanna throw too many flags. I'm gonna suggest we just delete. Okay. Allowed repos, ancient archive of knowledge. Scaled it up. On a way. Yeah. That was the last message. I thought we'll need to... Yeah. Maybe do a rollout. Yep. Alright. We got another denied. All pods must have a clustered label for branding and association. Branded and associated Stargate unit?
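The allowed-repositories constraint discussed here boils down to a prefix check on each container image. A minimal sketch of that idea in Python, not Gatekeeper's own implementation; the function name `disallowed_images` and the registry name `registry.example/` are hypothetical stand-ins, since the transcript does not give the real repository string:

```python
def disallowed_images(pod_spec, allowed_repos):
    """Return images that do not start with any allowed repository prefix."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [
        c["image"]
        for c in containers
        if not any(c["image"].startswith(repo) for repo in allowed_repos)
    ]


# Hypothetical pod spec: one image from an allowed registry, one not.
pod_spec = {
    "containers": [
        {"name": "web", "image": "nginx:1.25"},
        {"name": "db", "image": "registry.example/postgres:13"},
    ]
}
print(disallowed_images(pod_spec, ["registry.example/"]))
```

Deleting the constraint, as done in the session, removes the check entirely; the other option is to add the needed prefix to the allow list.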
50:17 I mean, I don't know Gatekeeper. Is SG-star, like, Stargate, or is it Gatekeeper specific? Yeah. All the teams that go through the Stargate would be SG-1, SG-2. Oh. Oh. Alright. So let's look at our API resources again. So one of... and I'm assuming if we do that... okay. No. No. No. No. I think we need to update the label to not say Rawkode and Jeefy, but to say SG-1. Well, I was gonna just take a look at the policy so that we don't guess. Oh,
50:51 Gatekeeper Policy: Allowed Images
50:56 that's much better. Maybe. Just an idea. If you want to, probably smarter. I'm enjoying this, but if you wanna do that, by all means. So required labels, -o yaml. Must be clustered. Alright. So does that actually... see... Alright. I have to scroll up on mine. Yeah. That would be a cool feature that I'm gonna have to ask the Teleport team to work on. I love Teleport, by the way. Yeah. It's awesome. And you were also right. Clustered cannot be David and Jeefy. It has to be SG-number.
51:40 Yep. I'll give you the honors. No. No. If I... no. No. If I'm reading this right, we can just have a key of clustered, but we also need another key called team. Oh, you're right. You're right. I was hoping you were gonna... it looked like you were gonna miss it. Oh, you... hey. Off you go. You take it away. I'm just over here. I've already exceeded my expectations, so this is great. Match labels. Template. There we go. Boom. Boom. Let's see what SG we are. I'm saying we're SG-1. It's the best SG.
52:22 Yep. Oh. You may have to do a... well, in fact, the labels would have done a rollout, so it's just not working yet. That's true. I guess we're going back to the event log. Yep. Well, okay. So it scaled the replica set to zero, and then scaled it up to one. You wanna try the rollout just in case? I thought labels would've... What was the rollout command? Was it rollout deployment? rollout restart deployment clustered. Yeah. Hit Control-W again. That happens to the best of us. Yep. Alright. Let's describe our deployment.
53:31 Yep. Well, it's... yeah. We need two keys. One of them had to be clustered with any value, and one had to be team with SG-1 according to the policy. Oh, wait. Really? Yeah. No. No. You're right. I think, well, it misinterpreted the policy the way I did it first. Yeah. You did it right, at least. That's how it was intended, unless I... Unless I typoed. Just possible. Okay. So minimum replicas unavailable, progressing through new replica set created. So it's not scaling it up at all. Yeah. Okay. Now can we check for a scheduler?
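The two-key rule read out of the policy here (a `clustered` label with any value, plus a `team` label restricted to an SG-number value) can be sketched as a small checker. This is an illustration only: the rule layout loosely mirrors a required-labels constraint, and the exact regex, including its casing, is an assumption because the transcript only says "SG dash number":

```python
import re

# Assumed rule set reconstructed from the discussion; the regex is a guess.
REQUIRED = [
    {"key": "clustered"},                           # any value is accepted
    {"key": "team", "allowedRegex": r"^SG-\d+$"},   # assumed pattern
]


def label_violations(labels, required=REQUIRED):
    """Return a human-readable problem for each rule the labels fail."""
    problems = []
    for rule in required:
        key = rule["key"]
        if key not in labels:
            problems.append(f"missing label: {key}")
            continue
        pattern = rule.get("allowedRegex")
        if pattern and not re.search(pattern, labels[key]):
            problems.append(f"label {key}={labels[key]!r} does not match {pattern}")
    return problems


print(label_violations({"clustered": "rawkode"}))
print(label_violations({"clustered": "rawkode", "team": "SG-1"}))
```

Reading the constraint before editing labels, as suggested in the session, avoids the guess-and-retry loop: the first call above shows the state after adding only the `clustered` label.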
54:00 Deployment Scaling Issues
54:22 Damn it. Alright. Okay. Because we're not getting any errors. Right? No. We're not getting any errors at this point. Deployment controller is saying it's scaling it up and down correctly. I'm just gonna type randomly and you keep going because you're following what I'm following. I see two different replica sets that have a desired of one and I think that's part of the problem. Okay. So we're still getting hit by a... Oh, there's a validation policy too. And it has to do with the image. The image. Yeah. So we probably... there's probably a different... yeah. A different
55:04 Gatekeeper CRD that we need to look at. Okay. Key. That came back. I'm really fast with SSH. Okay. Okay. So there's either a rogue process on this machine, or Chris has another terminal open just to mock us like you did. I'm so sorry. Give me. Nothing in the crontab. That's the next thing I was gonna say. We have a rule system fail. It was lit normal. Can I take over for a sec? Please stay. So real quick. Sorry. Just for the record, the scheduler had a restart, which I always find suspicious. That's fair. Oh, look. It's running Flux CD.
56:17 Finding Blocking Webhook on Namespace Deletion
56:42 Oh. No. I say we delete that entire namespace. After you. Of course. Right. Alright. So we got a policy. We've got an admission controller, it's talking about deleting the namespace. So let's take a look at validating webhook configurations and then mutating webhook configurations. Or it could be RBAC. Maybe it's just simpler. Yeah. If I say too much, I'll give too much away. So I'm just gonna be a silent observer. We've not had anyone take this approach to sabotage. That's a nice touch. I want to point out... no. No. No. I wanna point out something, Chris. And this
57:46 is a little bit of a tangent, but it's worth it. When we did the first honk CTL session, Bob and I are sitting there. Everyone else is trying to break the cluster as hard as they can. And here you come in, try to install OPA, try and lock down the cluster, try and preserve the cluster's health while everyone else is trying to destroy it. Just a reverse hog. That's all. You are a lawful good, Kubernetes. You are a lawful good, goo. Thank you. Oh, sure. I'll break the cluster. Policies on it. Just not how you expect.
58:27 I love it. Yeah. There we go. We got a validating webhook to block the deletion of namespaces. Yep. I'm gonna try this again. I'm warning you, Chris. This better work. I will say this isn't a policy I put in there. So I don't know where that one came from, but I'm enjoying that one very much right now. I mean, we could just delete the pods in that namespace and that would get what we need done. Yeah. You're right. There's no RBAC here. We got root. So it's definitely an admission controller of some kind.
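Finding which validating webhook intercepts a namespace deletion means scanning each configuration's rules for the DELETE operation on the `namespaces` resource. A rough sketch, assuming the input is the parsed JSON of a validating webhook configuration list (the `items`, `webhooks`, `rules`, `operations`, and `resources` fields follow the admissionregistration.k8s.io/v1 schema); the configuration and webhook names in the sample are hypothetical, and subresource patterns such as `*/*` are deliberately ignored to keep it short:

```python
def webhooks_matching(configs, operation, resource):
    """Return (configuration name, webhook name) pairs whose rules match."""
    hits = []
    for cfg in configs.get("items", []):
        for hook in cfg.get("webhooks", []):
            for rule in hook.get("rules", []):
                ops = rule.get("operations", [])
                res = rule.get("resources", [])
                if (operation in ops or "*" in ops) and (resource in res or "*" in res):
                    hits.append((cfg["metadata"]["name"], hook["name"]))
                    break  # one matching rule per webhook is enough
    return hits


# Hypothetical sample shaped like a ValidatingWebhookConfiguration list.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "block-ns-delete"},
            "webhooks": [
                {
                    "name": "ns.example.invalid",
                    "rules": [{"operations": ["DELETE"], "resources": ["namespaces"]}],
                }
            ],
        },
        {
            "metadata": {"name": "pod-policy"},
            "webhooks": [
                {
                    "name": "pods.example.invalid",
                    "rules": [{"operations": ["CREATE", "UPDATE"], "resources": ["pods"]}],
                }
            ],
        },
    ]
}
print(webhooks_matching(SAMPLE, "DELETE", "namespaces"))
```

The same scan works for mutating webhook configurations, which share the rule layout.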
59:03 Unblocking Namespace Deletion
59:18 So you're suggesting we could do a customized controller. Done. And so now we can probably go back. Okay. It is terminating. Yeah. So now we can probably go back and delete the ancient archive of knowledge. It shouldn't come back. We might be good. Yeah? Yeah? Yeah? Alright. Our clustered pod is coming up. Is that it? Did we do it? It might be a little bit more. Alright. Of course. Container creating. Yeah. Like, hopefully, it's just pulling the image, but... CNI. Failed to find plugin. Oh, there's no Cilium. Oh, wait. No. No. Wrong namespace. Sorry. I
1:00:12 CNI Failure (Cilium Issue)
1:00:37 do that all the time. Oh, no. There's no Cilium. That's sneaky. We still have the namespace by the looks of it. Oh, wait. What? Hold up. So three are ready. I am going to, like, mentally delete Control-W from my memory. Yes. Just don't... please don't hurt me after this. It's all. Oh, we're gonna have to roll out everything. Alright. What is the rollout... rollout restart daemon sets. So the... Oops. Sorry. Here we go. It's the same note. Oh, no. Wrong namespace. Is Gatekeeper back? Is it? Other... oh, that's the wrong one. That one's not back.
1:00:38 Restarting Cilium
1:02:53 What other CRDs are there? Yeah. Thank you. What about disallowed tags? I see how we're playing the game there. I'm not even looking at him, Chris. It's definitely a thousand cuts on this one. Yeah. So this break does have a story. It's I, a Kubernetes administrator, misconfigured a few things, and this is the ensuing chaos that you get. It's mean. I didn't mean for it to be that mean. I love it. Alright. Okay. No. So so what's happening Yeah. Go for it. Go ahead. No. I was gonna say what's happening here is we never deleted
1:04:00 the policy for the clustered pod. We just added the labels and got the clustered pod working. Cilium isn't scheduling because we haven't added the labels to the Cilium stateful set either. But I just removed the policy. You removed disallowed tags. Did you actually delete the other policy as well? Alright. Okay. I just have to delete an entire CRD namespace. Is that a thing? There we go. Hey. Alright. My house of cards is slowly crumbling. Alright. So we need to do a rollout of the Cilium operator too. And technically, the Hubble stuff if we care about the state of that right now. Yeah.
1:05:11 We can probably get away with oh, there we go. We have a clustered. Chris? Yes? Hold on. Actually, I shouldn't clap. Is it working? Let's go find out. Go to the website and find out. I knew it. We'll see if it works. My man is not looking good. Insert evil laugh GIF. Right? Yeah. To be very unhelpful, my application doesn't actually log anything. So Okay. I'll give you a couple more minutes of this, and then I'll and then sake of time, I can give a hint if if you need it. Alright. So the pod is running.
1:05:15 Network Policy Blocking Traffic
1:06:38 Now we never really get the port forward long enough to load the page. It could be that it's timing out. We didn't speak to the database. So let's describe the Postgres service, see if we have any endpoints. Good point. Describe SVC Postgres. Alright. So we got one endpoint. Now, can our application actually speak to it is the other thing. Do you wanna check for network policies? Chris has stuck mostly to the Kubernetes API, so I'm assuming we have a netpol. I don't know them all. It's Cilium cluster-wide network policies and
1:07:28 Cilium network policies. I think you can just do netpol. Get netpol. There we go. An iris. Do I got it? That's good. That's good. Alright. Let's get that in the bed. Yeah. If you'd looked at the allowed tags policy, it had a list of the various planets in the SG... the planet codes. Alright. Port forward looks healthier. There we go. There you go. Nice. That was good. Chris, I love that. Yep. Thank you, Chris. Thank you for going on that little adventure. That was... it's fun watching.
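When a pod runs but cannot reach its database, one quick question is which network policies select that pod at all. A simplified sketch, assuming the input is the parsed JSON list of NetworkPolicy objects in the pod's namespace; it handles only `podSelector.matchLabels` (not `matchExpressions`) and says nothing about Cilium's own policy types, which were also mentioned. The policy names and labels below are hypothetical:

```python
def policies_selecting(pod_labels, policies):
    """Return names of NetworkPolicies whose matchLabels select the pod."""
    selected = []
    for pol in policies.get("items", []):
        match = pol.get("spec", {}).get("podSelector", {}).get("matchLabels", {})
        # An empty selector matches every pod in the namespace.
        if all(pod_labels.get(k) == v for k, v in match.items()):
            selected.append(pol["metadata"]["name"])
    return selected


# Hypothetical policies: one scoped to the database, one to everything.
POLICIES = {
    "items": [
        {"metadata": {"name": "db-lockdown"},
         "spec": {"podSelector": {"matchLabels": {"app": "postgres"}}}},
        {"metadata": {"name": "deny-all"},
         "spec": {"podSelector": {}}},
    ]
}
print(policies_selecting({"app": "postgres"}, POLICIES))
print(policies_selecting({"app": "web"}, POLICIES))
```

Any policy returned for the database pod is a candidate for blocking ingress from the application, which is why deleting the policy restored traffic in the session.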
1:08:21 Chris's Cluster Fixed & Rawkode's Offer
1:08:27 Alright. You two have a decision to make. Yep. You have twenty minutes in my cluster if you wish to take it on. If you wanna look around for twenty minutes, you're more than welcome. If you wanna call it quits and move on with your day, that is alright. I know twenty minutes is not a lot of time, so that's why I'm gonna make it completely optional. I would feel awful if, like, the two of us have made you suffer. You don't make us suffer. But you know how bad my weekend will be if you solve this in nineteen minutes?
1:09:00 Alright. Feel free to take a look around. Take a look around. Alright. Alright. So let's do the obligatory k get pods. Oh. Oh, I wanna make sure I'm in the right cluster. P4X639 is the the worst world charge. Just FYI. And I pop I am not crying. Thank you very much. I am holding that in holding that in well. Yeah. Are you on the Which one is Replicator planet or black hole? A window of opportunity. Window of opportunity. That's the infinite loop one, Groundhog Day. Oh, the best episode. Yeah. That one. Awesome. Alright. I've set you up. That's
1:09:01 Debugging Rawkode's Cluster (Optional Session Begins)
1:09:43 all I'm doing. Appreciate it. Are you in the same back discussion, Chris? Yeah. We got rejected. Yep. So, server rejected our request for an unknown reason. Is it actually up and running? It's my... That is. Feeling. Such a good episode. Yeah. It's not on Netflix up here in Canada yet. I'm waiting. It comes on in the UK every now and again, and then I start watching it. But then I can't get through all 11 series before they take it back off again. You're stuck in your own little loop then. Yep. What did he do?
1:10:40 Well, server rejected a request for an unknown reason. Mhmm. API server is up and running. Do we have any logs from the server? Oh, they have a chaos kubectl on there. That's nefarious. Mhmm. Are you also typing really fast right now? Oh, wait. Are we in the same terminal? Yeah. So I can... Okay. Okay. Cool. Yeah. I'm still coming down off the watching. Little shaky right now. So... You know, let me also shout that out. It was actually scarier for me to be the breaker than
1:10:45 API Server Rejection & Checking Kubeconfig
1:11:39 it was for me to be the debugger. Maybe that's weird. Oh, a lot of pressure. Like, can you break a cluster? Oh. Oh, there we go. Yeah. I think that's the part I was worried most about. Debugging is fun, yeah, but properly breaking something, it doesn't come natural. Bucks of tears. Bucks of tears. ClientConn switching balancer to pick_first. So I've never seen this... go ahead. I was just like, did he swap out the binary too? I have never seen this chaos thing. Did he maybe change some of the config, the kubeconfig? It's deleting some stuff.
1:12:53 Yeah. It's deleting clustered and then no idea what else. Okay. So kubeconfig or... yeah. The kubeconfig might be a good example. Let's see what that is. We're good to show the kubeconfig on screen. Right? Yeah. It's all good. Alright. I would answer that question pop, but I'm keeping my language clean this evening. I don't see anything out of the ordinary here. So maybe it's not this. Yeah. It's the endpoint. Right? Yeah. I think so. What do we have here? One four... what is it? 145. It's not any of the addresses here. I
1:14:09 think it's going to the wrong spot, but wouldn't that... no. Would it have timed out if this was going to the wrong server? So I'll spare you going down a rabbit hole. That is the EIP and the correct IP address for the API server. It is not any of the node IPs, but you don't know the setup, so I'm just gonna give you that. Much appreciated. My initial thought was he swapped IPs with one of our servers. Yeah. But didn't change the certs, so it would throw an error like this. Yep. And, like, I'm looking at... I'm
1:14:39 looking at the Teleport IP, like, is that the same IP? Ditto on my end. Alright. So we still... Midna has decided to wake up. We still have no idea why we don't have access to the API server. I do know that this chaos pod or chaos container is definitely going to stop us from... it's gonna kill the clustered pod if we get access to it. What if we delete that chaos pod? That's a nice idea. Let's do it. Let's see if it's outside of... Let's come back. Oh, wrong thing. I'm gonna have to start practicing my
1:15:29 crictl, I think. Oh, no. It's gone. No. And it's back. Okay. It's being managed by. I'm leaning towards something with certificates. Yeah. But, like, if it's the wrong cert, wouldn't it just stall and then give us the old cert error or x509? Or am I behind on my error warnings? Remember in the beginning when I said I'm more dev and less ops? You probably know better than me. Okay. I'm footing both doors a little bit. I'm trying to remember the curl command so we can just hit the API server and see
1:16:01 Using Curl to Test API Server
1:16:15 if the cert works that way. I'm happy to take the curl command for you if you want. I was just gonna Google it if you have it up. Yeah. If you got it. I mean, that's that's the next path that we're going down. Yeah. Yeah. You could start our DNS. See where was that IP address. Would it would it be DNS, though? I don't like, I would assume if it was DNS again, it would stall for a bit because that response was way too fast in my Do you want me to Run it again? Right. Because it should hang if it's just
1:16:53 it's trying. Did you need the the URL, David? That's alright. Alright. Oh. Uh-oh. You got a cat over there. Cat. Cat. Cat. Oh, no. It is. I just have my desk up high enough mine can't jump. It's great. If you saw my desk setup, you'd laugh. Cat tree over here. Cat bed over here. Cat wall behind me. I have a cat chair in front of me. He spent the mornings snoring. It was great. Alright. So so here we are. Certificate problem, unable to get the local issuer cert. It is still safe. Sir? So the certificate is fine.
1:17:50 Re-examining Kubeconfig
1:17:59 The cert is fine. The cert is fine. You've already seen the problem twice now. That's all I'm saying. Oh, right. What if we wanna go back to twice? We have catted the kubeconfig twice. Mhmm. And we've looked at the cry output a couple times. This is in my breaker notes as a superficial problem. Breaker notes is superficial problem because it probably has to do with this. Where's our kubeconfig pointing to? /etc/kubernetes/admin.conf, unless you mean there's. Yeah. This... okay. So it's pointing to the file that we're looking at then. Yeah. Yeah. Would you just like a hint? Well, I
1:18:57 think the... yeah. There we go. It's only one context. No. No. Like... Does the context match the name? kubernetes-admin at cluster 025 KubeCon EU 2021, cluster 025 KubeCon EU... yeah. I'll type in it. Client key data. Client certificate data. I'm surprised no one in the chat has noticed it either. Is it an indent error? It's worse than that. Like, the... Oh. Subtly worse than that. Oh, it's the little ones that are the... where is it? Yeah. One second. Comes. And Midna is now laying on my arm,
1:20:05 so this is gonna be a little bit hard mode. Did you just I can take over if you want. I did. Did I do something bad? Probably. There's a type of config and I don't know if you configs well enough to know whether that was I mean, you messed like a huge hint because when you open them, I actually remembered where my card goes. Oh. So you can watch that one back later. Oh, that'll be that'll be good. Alright. Just to make sure I didn't break it worse or better. Okay. Alright. See how, Chris, you were saying, like, you think
1:20:48 your break was gonna be too simple. Like, you're literally gonna kick yourself when you get this. This is just the slow-you-down problem. It's a little... Okay. So you confirmed the IP address, which I thought was nice. I'm confused. Again, like, bringing up a config file on my... Thank you. Machine so I can see. I'm stuck on this. Why don't you diff it with one of the other ones and see if something's different? What other ones? Like my own? You can diff it with the kubelet one. Papa's saying, can he call a friend and
1:21:33 Identifying Kubeconfig Typo
1:21:40 can that friend be Thomas Stromberg? As much as I enjoyed having Thomas on, he's barred. Oh, Christophe sees it. Thanks, Christophe. Oh. If chat sees it, don't type it. Or rather, you can control what pops up. Right? So, like, we won't see it. Yeah. Yeah. I will show you. Okay. When I don't sneakily have the YouTube channel opened anywhere else. So could you vim or cat the file again? Yes. Just keep in mind I have this config file over here. Cool. Look what's here. This is the diff between kubelet and kube... Oh. Calling.
1:22:25 So you can probably ignore the certs because they use different certs. It's a kind... kind or type bit at the top. Yep. Yep. Yeah. So... Oh, crumbs. I gotta get to something in five minutes. So... Here it is. Is that there. No? Do you have your own kubeconfig? Yeah. Do. Get rid of preferences. These are swinging and missing there. Yeah. And I'm just looking over here. I don't have it. It's just delete everything. That's how it's going. Right? Yeah. Nuke it from orbit. It's the only way to be sure.
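Diffing against a known-good kubeconfig is one way to spot a subtle typo; another is to check the file's top-level keys against the names the kubeconfig format defines. A minimal sketch using only the standard library. The transcript does not spell out what the actual typo was, so the misspelling in the usage example is purely hypothetical, and the line-based parsing is a shortcut that only looks at unindented `key:` lines rather than parsing YAML properly:

```python
import difflib

# Top-level fields of a kubeconfig (kind: Config, apiVersion: v1).
EXPECTED_KEYS = sorted([
    "apiVersion", "kind", "clusters", "contexts",
    "current-context", "preferences", "users", "extensions",
])


def suspicious_top_level_keys(text):
    """Return (unknown key, closest expected key or None) pairs."""
    findings = []
    for line in text.splitlines():
        # Top-level keys start in column 0; skip indented lines, comments,
        # list items, and anything without a colon.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key not in EXPECTED_KEYS:
            guess = difflib.get_close_matches(key, EXPECTED_KEYS, n=1)
            findings.append((key, guess[0] if guess else None))
    return findings


# Hypothetical file with a deliberately misspelled key.
example = """apiVersion: v1
kind: Config
current-contxt: admin@example
clusters:
- name: example
"""
print(suspicious_top_level_keys(example))
```

A check like this only covers key names. A typo inside a value, such as a misspelled `kind` or `apiVersion` value, would still need a diff or a schema-aware validator.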
1:23:25 Alright. I'm gonna put my own. Because, like, I have a good idea. Oh, it's because I have, like, a zillion con things in here. Client data, client key data. There we go. I mean The person is there. It was in the diff. I thought it was. That's that's what I'm looking at. But So just go through less. Take it slowly. Really? Oh. Cheap. It's so subtle. I think I had a dream last night where that was the problem. Just in a random config file. Nicely done. Is so good. So was that chaos thing a red herring
1:24:41 Chaos Pod Breakdown (Due to Fixed Kubeconfig)
1:24:44 then? No. The chaos is that'll infect in chaos. At least it should be unless I broke it. Everything's here. Is it still deleting the clustered pod? It's running. It's been up for ten minutes. Yeah. Because that's what it was doing, right, when we saw the output? Yep. And then let's see what the output is for that now because that is certainly... come on. Copy. So the funny thing... I can't tell you. Oh, he's sweet. Nice. So by fixing your Kubernetes config, you actually bring the chaos pod into effect. Wait. Because the chaos pod was broken
1:25:45 because of the kubeconfig. You've now fixed the kubeconfig. The chaos pod potentially is now chaotic. The chaos pod is actually throwing errors. Yeah. Because it can't get to the... because it can't get to it anymore. Yeah. It was working, like, a few minutes ago, but now it's... unless we scroll all the way down to the bottom of the output. Yeah. It's giving the same error we got initially. You played yourself. Alright. Can we port... can we port forward the... damn it. Yeah. Do you... Or I can... Do you have the thing? David looks like he's thinking.
1:26:29 Do me a favor and delete the chaos pod. Sure. Oh, it's going on 80 k. Do you just wanna open that on stream for us? Oh, wait. How do we... It's... how do you do? It gave me an NGINX. Sorry. Alright. Alright. So crictl ps. crictl rm. Okay. I gotta jump to a meeting in a minute. This has been an absolute delight. I cannot thank you enough, David, for inviting me to participate in this session. Truly divine. Oh, thank you for joining us. It was a treat. Yes. And with that, I bid you adieu.
1:27:02 Chris Departs
1:27:20 Farewell and take care. Yeah. No worries. Thanks. I want to finish this off. Yeah. You can poke around. So you've now fixed it. So I didn't that that's a bug in my I didn't know that that I can't say too much, but yeah. There's no chaos. There is no chaos and then the cluster pod will not run anymore. Correct. Because it'll just get deleted repeatedly. Yes. And repeatedly. And there's a really once we get to the end, I'll explain it. But this is a really interesting one, I think. Alright. Very similar to yours, actually. Alright.
1:27:24 Rawkode Explains Static Pod Bug
1:28:08 Rawkode chaos. Okay. Wait. If it's similar to mine, that means it's gonna be some sort of a matter... or is it though? It's not... nothing's remote. You can see the containerd pod running. You just can't see it in kubectl, which I think is interesting. To me, that means it's running outside of Kubernetes. I'll answer questions if you want, depending on how long you have. But I won't say anything more just now. So how familiar are you with containerd? Not. Alright. So you can do a crictl pods to see Kubernetes pods.
1:29:00 That's neat. But you will see chaos is there. Well, also a bunch of things are not ready. I'm not sure if that matters. Where's chaos? Cube control, etcetera d. Oh, it's right at the top. It's... Cube name is... oh, wait. What? Okay. So there's only one kube-apiserver running. Has eight restarts. Bad typo. Pods shows kube-apiserver that's part of chaos namespace. This is mean. So chaos isn't available either. You're gonna have to give me hints. Yeah. There's no chaos namespace. You wanna talk it through? So either there is something horribly, easily wrong.
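The mismatch noticed here, a pod sandbox the container runtime knows about in a namespace the API server does not list, can be checked mechanically. A small sketch under two assumptions that are not confirmed by the transcript: that the runtime listing is available as JSON with an `items` list whose entries carry `metadata.name` and `metadata.namespace` (the shape `crictl pods -o json` is assumed to produce), and that the list of namespaces comes from the API separately. All names in the sample are hypothetical:

```python
def pods_in_unknown_namespaces(runtime_pods, api_namespaces):
    """Return (namespace, pod name) pairs for runtime pods the API cannot place."""
    known = set(api_namespaces)
    return [
        (p["metadata"]["namespace"], p["metadata"]["name"])
        for p in runtime_pods.get("items", [])
        if p["metadata"]["namespace"] not in known
    ]


# Hypothetical runtime listing: one ordinary pod, one in a missing namespace.
RUNTIME_PODS = {
    "items": [
        {"metadata": {"name": "kube-apiserver-node1", "namespace": "kube-system"}},
        {"metadata": {"name": "chaos-node1", "namespace": "chaos"}},
    ]
}
print(pods_in_unknown_namespaces(RUNTIME_PODS, ["default", "kube-system"]))
```

Anything this returns is running on the node but cannot be found by namespace through the API, which is a strong hint to look at the node's static pod manifests.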
1:30:53 I'm, like, trying to decide whether it's at the Kubernetes level or at the CRI level or containerd level. Or are they intrinsically linked? It's in the Kubernetes level. You could always check the timestamps in those files. That's probably a good idea. Thanks. This cluster has been running since 1969. That's tricky. Yes. I am. Where we're going, we don't need... I... yeah. I will be honest. We are already out of my depth. Okay. So you're really close. You're close. So I'll tell you one thing and then this... this challenge that doesn't really matter. So
1:32:00 this is the bug with Docker as a CRI. I actually upgraded the cluster to upgrade it to cluster wherever we look at it. You use Docker instead of containerd because this does not work with containerd. The bug is you can have a static manifest with a pod that runs, and if the namespace doesn't exist, it will still run it. So there is a chaos namespace running a pod and a static manifest in this directory, but you've not found it yet. That's mean. I love it. And you've already said you're not an ops person,
1:32:26 Searching for Hidden Static Manifest
1:32:52 so you might find this a little tricky. Yep. Because I don't see but it's in this directory? It is in that directory. Do you know how to hide files on Linux? Nope. Alright. Chat, let's help you feel. Help me out. This is the point where I would be googling. So I'm phoning a friend. And by phoning a friend, I mean, I expect someone to tell me the answer in YouTube chat and then Yeah. David just pop it up. If Waleed, I know you're watching, Noel, you're watching. If you just wanna jump on, I'll send you the link. Feel free. Otherwise, start
1:33:35 typing some... think about how we try to fail on that. And if nobody guesses any answers in the next minute or two, I'll walk you through it. I enjoyed putting this one together. This is evil. That's a really... it's a terrible bug. It was actually my colleague, Dan Finneran, thebsdbox, who worked it out. I think he typoed the namespace on a static pod and worked out that it still worked, but it doesn't work with containerd. I noticed. I had to change it on time. The Docker doesn't validate it. Although we still get weird header.
1:34:18 Interesting. Well, we're not gonna use FLS cheating. Noel is suggesting, is it a bind mount, which ls -al would not show? It is not, actually. Nice idea, though. Alright. So... You need that voodoo. Give me that voodoo. So it is actually relatively easy to hide things from ls. And normally what you would encourage people to do is an echo *, but I've also hidden it from there too. So, we'll just walk through it. Let's wrap this up. Please do. Like... Okay. So if you can't see it with ls
1:35:07 Rawkode Explains File Hiding Techniques
1:35:30 and you can't see it with echo *, you got to start wondering if I could have been really cruel and mounted something in with eBPF. I have not done that. You could check to see if the command has been modified, which it has not. The one thing you could do would be to run env and look for certain flags. That's a really strong hint for anyone who's familiar with Linux, but there are also ways using the environment variables to subsequently hide more things from the shell, which I've also obfuscated just for the
1:36:06 laughs. So this is a particularly harsh one. So there is an environment variable called GLOBIGNORE, which will hide the file from the echo * by saying that we don't wanna see any files that end in the backup tilde symbol. So we could unset GLOBIGNORE. We still don't have ls, but the echo * now would work and you'd see there's a Rawkode YAML here with the bad thing. That is beautiful. Now why does the ls not work? Because I'm an asshole mostly. If we go to profile.d, there's my bad command. So ls was modified to always
1:37:01 pass a -B, which means don't show backup files. I then placed an alias for env to also omit and hide the GLOBIGNORE environment variable. Oh my god. I love it. So, yeah, it was particularly harsh. But now you and all the people that are watching are familiar with the Docker runtime and its ability to launch static pods into a namespace that doesn't exist, making it completely invisible almost to operators except for crictl ps. There we go. You're evil and I deserved every minute of this. Well, I'd say yours was a lot
1:37:30 Conclusion & Wrap Up
1:37:39 more evil than mine was, but I think that was cool. This is the first time... well, this is now the second time I've broken a cluster for the show. I really enjoyed it. Thank you for joining me for breaking your cluster. Some nice little things there. I like the story around it too. I felt helpless for the entire first thirty minutes of the episode, which is the best way to feel during this, I feel. I felt just as helpless now. Alright. Well, thank you everyone for watching. Thank you for joining me. Thanks to Chris who's
1:38:12 already left. He had a meeting to go to. There will be no cluster next week, but I will be bringing it back the following week. Jeefy, have an awesome day. That was really fun, and I'll speak to you again soon. Definitely. Thank you very much. It was so much fun. Bye bye. Cheers. Bye, man.
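The file-hiding trick explained near the end of the session relies on the shell: a GLOBIGNORE setting filters `echo *`, and an alias makes `ls` skip backup files ending in a tilde. Listing the directory without going through shell globbing or aliases sidesteps both. A minimal sketch using only the Python standard library; the helper names are made up for illustration, and the directory you would point it at (for example kubeadm's default static pod path, `/etc/kubernetes/manifests`) is an assumption, since the session only refers to "this directory":

```python
import os


def list_all_entries(directory):
    """Return every entry name in the directory, sorted, straight from the OS."""
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def hidden_from_plain_ls(names):
    """Names that `ls -B` would skip: dotfiles and names ending in '~'."""
    return [n for n in names if n.startswith(".") or n.endswith("~")]


if __name__ == "__main__":
    # Assumed location; adjust to wherever the kubelet reads static manifests.
    path = "/etc/kubernetes/manifests"
    if os.path.isdir(path):
        names = list_all_entries(path)
        print(names)
        print(hidden_from_plain_ls(names))
```

Because this never asks the shell to expand a pattern, neither GLOBIGNORE nor an aliased `ls` or `env` can filter the result.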