About this video
What You'll Learn
- Trace and fix kubeconfig port errors, broken kubelet units, and offline control plane symptoms in a damaged cluster.
- Debug real-world Kubernetes admission failures, including Kyverno webhook rejections and Pod Security Policies.
- Rebuild cluster state by resolving quota and restart issues, including vCluster recovery and kubeadm certificate fixes.
Leigh Capili and Marcus Noble debug broken Kubernetes clusters: kubelet systemd units, kubeconfig ports, admission webhooks from Kyverno, Pod Security Policies, quotas, and a kubeadm certificate renewal gone wrong.
Jump to a chapter
- 0:00 <Untitled Chapter 1>
- 1:32 Introduction: Welcome & Cluster Day
- 2:02 Sponsors: Teleport & Equinix Metal
- 2:40 Equinix Metal
- 3:22 Competition Announcement
- 3:31 Guest Introductions: Leigh Capili & Marcus Noble
- 3:49 Introduction
- 6:06 Lee's Challenge Begins: Access & Initial Setup
- 11:45 Lee Investigates Cluster State & API Errors
- 14:08 Lee Debugs API Server Connection: NGINX Proxy?
- 16:29 Lee Finds Kubeconfig Port Misconfiguration (localhost:443)
- 18:48 Lee Debugs Container Runtime & Kubelet State
- 22:08 Lee Finds Kubelet Service Disabled
- 25:51 Lee Fixes Kubelet Systemd Unit (Enable & Restart)
- 29:56 Lee Debugs Kubelet Startup Error: Invalid Max Files
- 30:30 Lee Fixes Kubelet Config (`containerLogMaxFiles`, `maxPods`)
- 31:33 Kubelet Component Config
- 33:06 Kubelet Starts & Core Pods Appear
- 35:47 Lee Checks Application Deployment (Scaled to 0)
- 37:10 Lee Attempts to Scale Deployment: Admission Webhook Error
- 44:56 Lee Fixes Webhook Issues (Deleting Configurations)
- 47:51 Lee Debugs Deployment Scaling: Quota & Pod Security Policy Errors
- 51:02 Lee Fixes API Server Config (Disabling PSP Admission Controller)
- 53:05 Lee's Time Ends & Marcus Explains His Breaks (Limit Ranges, Pause Container)
- 57:25 Why Do the Static Pods Work
- 1:01:38 Transition to Marcus's Challenge
- 1:01:48 Marcus's Challenge Begins: Access & Setup
- 1:03:08 Marcus Investigates Cluster State & RBAC Forbidden Errors
- 1:13:05 Marcus Finds User Identity (`uwu-admin`) & Break Glass Hint
- 1:15:07 Marcus Debugs Break Glass Cluster Role Binding Access
- 1:22:18 Marcus Attempts `kubeadm certs renew admin.conf` (Attempt 1)
- 1:33:16 Marcus Resets the Cluster (`kubeadm reset`)
- 1:34:55 Cluster Redeployment & Untainting Node
- 1:39:53 Lee Re-injects a Break (PSP Default Provider)
- 1:40:41 Marcus Debugs Postgres Crash Loop (Deleted StatefulSet/Pod)
- 1:41:43 Marcus Debugs Postgres: Waiting for PVC (etcd Issue)
- 1:44:11 Time Called & Lee Explains His Breaks
- 1:47:27 Conclusion & Giveaway Winners
- 1:47:40 T-Shirt Giveaway
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:32 Introduction: Welcome & Cluster Day
1:32 Alright. And welcome back to the Rawkode Academy. This is Thursday, which typically means it's Clustered day, and it is Clustered day. We're doing Clustered today. We have two wonderful guests joining us to fix some rather broken Kubernetes clusters. I'm going to be joined by two guests in just a moment, but we do have to start with a little bit of housekeeping. And I was a little bit flustered because one of my guests disappeared, but fortunately he's just dialed back in. So phew, panic over. So before we get started, there are two sponsors of Clustered and I wanna thank them
2:02 Sponsors: Teleport & Equinix Metal
2:07 both. Teleport have been sponsoring us for a while now. You're definitely no stranger to Teleport. We use it every single week on Clustered. Teleport allows us to use GitHub to SSH to all of the machines, which are bare metal machines, and allows us to pair and collaborate via SSH sessions, typing into the same terminal to work out what is wrong. Teleport is an amazing product that does much more than just SSH access. So you should go check it out at rawkode.live/teleport and support the show. That would be amazing. Thank you all very much. All right, next up we have Equinix Metal
2:40 Equinix Metal
2:41 who are also sponsoring Clustered as well. And they've always provided the hardware for this, so it was a no-brainer that they wanted to be more involved. And we use some pretty chunky machines for Clustered because it makes it more fun. We've got tons of cores, tons of RAM, bare metal is a fantastic way to check out Kubernetes. So check out rawkode.live/metal. Also, if you're already in the Equinix portal and already know everything that you need to know, then just drop in the voucher code Rawkode, and this will get you 200 USD
3:10 in credits. This will allow you to spin up one of their configurations of machines, depending on what size you pick, for up to a hundred hours. So go have some fun and play with that too. Alright. There's one more thing. We have a competition, which I'll get to in a second because I've lost this thing. Let's introduce our guests first. Hello, Marcus and Lee. How are you both doing? What's up, friends? Hello. Hello. I'm doing a lot better right now than I think I'm going to be in just a moment here. So Oh, you're gonna do great. I know why.
3:31 Guest Introductions: Leigh Capili & Marcus Noble
3:47 It's gonna be amazing. Let's start with a little bit of an introduction. Can you both, we'll start with Lee and then we'll go over to Marcus. Just say hello. Let us know who you are, and then we'll take things from there. Yeah. Friends, I'm so excited to be joining the Rawkode stream today. I have wanted to do a Clustered stream since probably the first episode. I've been a huge fan of this series. My name is Leigh Capili. I live here in Denver, and I have a history of writing software to do operations stuff on platform engineering teams,
3:49 Introduction
4:20 places like DIRECTV, Beatport, and AT&T. I did a stint as a developer advocate, a developer experience engineer with the Weaveworks team, also contributing to the Flux project and Kubernetes projects. And now I work with team Tanzu on a bunch of cool stuff with a bunch of very humbling other developer advocates and a staff dev advocate now. So pretty stoked about that and stoked to still be contributing to this awesome community and breaking stuff just as much as I can fix it. Awesome. Thank you, Lee. Marcus? Cool. Hey, everyone. I'm Marcus. I'm a platform engineer
5:00 with Giant Swarm. I'm based here in the UK. I don't really have as much to say as Lee does. So let's just crack on. Alright. Brave guests. Thank you very much. Now I do have the stuff for the competition. So if you wanna win a new t-shirt, which we have here, sorry, Lee, I'm gonna cover up your face. You can go to Rawkode.live. Hold on. Yeah. Go to Rawkode.live/win. And this is the new Clustered t-shirt. So you know, enter now. We're gonna give them away. And if you join well, even if you've already entered, go back and check
5:39 the link again now because there's a new tweet that you could retweet that will get you 100 extra entries into the vote. We will be doing the draw after both clusters are fixed, or if both clusters are fixed, we're gonna find out. So Rawkode.live/win will get yourselves a t-shirt. Awesome. Now before we went live, you can go right back from center now, Lee. It's all good. We're gonna jump straight over and get these things started. Lee is gonna go first. So I am going to open a session on Marcus's control plane. Give that just a second. Now I gave
6:06 Lee's Challenge Begins: Access & Initial Setup
6:21 you both access. I modified your roles using the Teleport UI and gave you access to every single machine. So you should be able to see the active session and join. I see that you're logged into this one now. I am here. These past two days have been the first time I've used Teleport ever, and I've been really excited getting to learn to use it. It's such a great tool. I mean, I know they sponsor the show, but I would use Teleport regardless. It's amazing. So Okay. I'm in here. Alright. Set up your kubeconfig. Check for our control plane. You have forty
6:56 minutes. Best of luck, Lee. Okay. Just give me a second here. I'm gonna do a little, probably a little bit more than just setting up my kubeconfig. I'm going to pull up my GitHub on my machine and just get this repository URL. Simple. Just need a little bit more of an ergonomic experience here. Okay. Git clone. Paste. Oh, did I Yeah. It does take the terminal. I need to, like, reset it. Here we go. Clone. And then Wait a minute. So every time we have issues with the size, reset is gonna fix it? Yeah. That's one of the things that you
7:50 can do. And I actually learned that, I think, from a commenter. I may have been doing a little bit of studying over the past couple of days to get some inspiration and to make sure that I wasn't gonna do the same things as everyone else. So Yeah. That is they're permanently naughty because xterm.js has caused us many issues with sharing the view. So fun story here. I'm actually joined in from my own terminal using the tsh command line tool, and I had to recompile it to override a behavior that was not working with the ACLs.
8:26 Ah, so you've managed did you manage to fix the TSH join command? I don't know if I fixed it to a place where they would merge it, but, yes, the answer is yes. Marcus, I think you're in trouble. A little bit. Yeah. So so I am I got full copy and paste ability right here. I'm I'm at home. I'm right here in my WSL terminal. Like, we are we're ready to buy. So, what do I gotta do here? Well, I I guess I could try to exec my bash. I see. Oh, sick. Look at this. Look
9:01 at this. Almost like I'm, like, ready to go here now. Alright. And then, let's go ahead and just, like, do one of these dudes. If I can type properly, which I cannot still. KUBECONFIG. It's gonna be, let's get out of the string for a second, /etc/kubernetes/admin.conf. And then I will pop that into a string, and then let's just echo that and append it to my .bashrc, just in case we end up, like, losing a Teleport session or something here, we have to re-log in, that I don't need to do this
9:52 kube.sh. How about that? And while we're at it, I'm gonna add an alias as well. Good call. Is it even worth adding an alias? k. Marcus has left me access to kubectl. Thank you. k, kubectl, and then we'll do kubectl completions bash, but that needs to be oh, but my I see. I need to, like, size my terminal a little bit better here because yours is a bit smaller than mine. Hopefully, I can get this a little closer. Okay. K. S k. Source. K. Russell's saying 5% of your time is already
10:48 gone. But sometimes it's worth setting up your tools. Right? You mean how much time you can save if I have it on times I'm gonna type k. Yeah. And then this is another thing I had to reference from my own .bashrc. Start kubectl. k. Completions. Oh, yeah. There we go. Completion. Cool. And then might as well put that in my .bashrc as well. I probably shoulda had that in there already, but .bashrc. Ta-da. We Oh, you put it in there. Alright. Okay. Yeah. Kube. Cool. Alright. k. Alright. Let's go see where we're at. So
11:45 Lee Investigates Cluster State & API Errors
11:48 did I not I don't think you have exported config. Does not have the resource type node. Okay. Cool. Do we have any resource types at all? And can I even get maybe, like, the cluster-info? Cluster-info dump doesn't have the resource type services, so we cannot see what the Kubernetes service IP address is. At this point, I'm kinda curious. I might go into the Kubernetes manifests dir and just take a look at the kube-apiserver flags. Seems reasonable. Right? Because, like, maybe we, like, disabled some APIs here or something. That's kinda my first guess.
12:33 The other thing that I'm thinking is maybe I don't have access to the discovery API, so then I don't have access to, like, kubectl api-resources. So it could be like an RBAC thing. Let's see. If you can use less as a pager, it just means we can follow your scanning as well. Oh, I see. Yeah. Okay. Yeah. Because my terminal has its own pager. Yeah. I can do that. Sweet. I don't remember to do that. Yeah. I don't see any I don't remember what the flag is to, like, enable particular API groups and stuff. I believe it would
13:17 have to be done on the API server because that's where all of, like, admission and looking up API groups happens, and I don't see anything in here that, like, looks super out of the ordinary. So the next thing that I'm interested in is, like, maybe I can ask who I am. And I need to find a resource where it's gonna, like, give me an error message or something that tells me who I am, unless I can figure out a Well, you can run auth can-i on its own, and it'll give you a matrix normally? No? Well, I thought it did.
13:58 I don't know what the command for that is, but I've seen it done before. Maybe chat will chime in. And but, I mean, I can it's just unfortunate if there's no resources. But then, actually, you know what's special about this, though, is that resources don't need to exist in the cluster for you to actually bind against them in RBAC. So I should still be able to, like, say, like, can-i get pods? And this is telling me that this is okay. I'm not expecting to see an NGINX server error coming back
14:08 Lee Debugs API Server Connection: NGINX Proxy?
14:41 from anything kubectl related. So I believe that we are being misdirected to something funky. I think it's worth taking a look at the kubeconfig. That's inside of /etc/kubernetes, admin.conf is, I believe, what we are using. It is indeed. So let's cat the admin.conf or, I guess, less the admin.conf. Did I oh. Yeah. So that's a server IP that's listening on port 443, which is not the Kubernetes default. But there is something listening on 443, which I'm imagining is, like, some NGINX container or systemd unit or something.
15:36 And, like, it's worth looking into what that thing is just to indulge Marcus' ego a little bit. So yeah. I mean, I guess, like, some ways that I could look at that is if I have netstat, then maybe I could, like, netstat -plant and then grep for NGINX. And I see that there's an NGINX process listening on 443. Indeed. I don't know what necessarily is running that, but I could also check to see if maybe there's a kubectl or something. Maybe not. I don't know. Moxygo says that the kubectl plugin for the
16:20 matrix thing is auth can-i --list. Now I'm not even gonna try that before we fix the kubectl, though. So just wanna maybe see if I can get an editor, assuming that my Linux server is working, which it seems so far, like, apt is pretty happy with that. So that's good. I like this editor. It's like a, like, kind of nano but bigger. So you know? Hence the name micro. Micro. Yeah. And so I might just go and micro the admin.conf and add a six here and see if maybe we are now potentially talking to Kubernetes.
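The port fix described here (changing the kubeconfig's `server` entry from 443 to the API server's default secure port, 6443) can be sketched as a small shell function. This is a minimal sketch under assumptions: the function name `fix_kubeconfig_port` and the sample address `127.0.0.1` are made up for illustration, and the demo runs against a throwaway file rather than the real `/etc/kubernetes/admin.conf`.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: rewrite "server: https://HOST:443" to use port 6443.
# Only a "server:" line whose port is exactly 443 is changed.
fix_kubeconfig_port() (
  file="$1"
  tmp="$(mktemp)"
  sed -E 's#^([[:space:]]*server:[[:space:]]*https://[^:/[:space:]]+):443[[:space:]]*$#\1:6443#' "$file" > "$tmp"
  # Write back in place so the original file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Demo on a sample file (illustrative content, not the cluster's real config).
demo="$(mktemp)"
cat > "$demo" <<'EOF'
clusters:
- cluster:
    server: https://127.0.0.1:443
  name: kubernetes
EOF

fix_kubeconfig_port "$demo"
grep 'server:' "$demo"
```

Editing the file by hand, as done on stream, is just as valid; the function form is only useful when the same check has to be repeated across several kubeconfig files.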
16:29 Lee Finds Kubeconfig Port Misconfiguration (localhost:443)
17:09 So there's nothing there, and I might then go check to see if oh, crictl config --set. Was it endpoint? runtime-endpoint. runtime-endpoint, and then I need to set it to the Unix socket, which is, like, /var/run/containerd/containerd.sock. Cool. Yep. So maybe let's go look to see if there's some pods here. And it looks like the scheduler and the controller manager are up, but the API server doesn't oh, here it is. Okay. But then the API server was modified. I'm just deducing that because this was probably spun up, like, forty-eight hours ago,
18:13 and then here we have you know? But does have a state of not ready for all of those containers. Yeah. I don't remember if those are supposed to be ready, but that's good to know. So let's go ahead and get the containers instead, which is okay. So there's no containers, which indicates to me that if CRI is up, maybe there's something in containerd that's not configured properly or, you know, we're not able Yeah. You've muted yourself, Lee. Sorry. Can you still hear me? Yeah. We can hear you now. You were muted for a second. My,
18:48 Lee Debugs Container Runtime & Kubelet State
19:06 headphones died. I should have charged this last night, but I guess I didn't. Hopefully, audio is okay. It's alright. Yeah. Don't worry about it. Cool. So, I'm thinking since we can't create container sandboxes and kubeadm depends on containers, that's the reason why Kubernetes is not up right now. And, we'll have to go and look at why containerd is complaining. So I think containerd in this setup probably runs with a systemd unit since we're on an Ubuntu system. It's gonna, I guess. Yeah. systemd is our init. Russell, you mentioned that kubelet might not be looking
19:46 at the manifest directory. But even if kubelet is not looking at the manifest directory, there are pods here that are inside of CRI. So the kubelet is reporting that we want, like, an API server. I haven't checked, like, the spec here, but I'm thinking if maybe the manifests dir isn't up to date or it's looking somewhere else, we can check that out later. And I'm gonna keep looking into the containerd problem, potentially, I'm guessing, which if I do, like, a ctr c ls inside of the k8s.io namespace, assuming that I actually know how to use
20:30 containerd, then here we see there's containers. But so what is happening here then? Why are there pods inside of CRI, but then there's no containers? That seems odd to me. Is that because the containerd ones are just pod sandboxes? There look to be a bunch of different images in here in queue, including things from Cilium and stuff, so that doesn't seem like it would be accurate. Running two container runtimes, and then maybe the kubelet is misconfigured. That could be a thing. Maybe it started containers in containerd, and then now it's not. But that also wouldn't explain
21:36 necessarily why I would assume that there's an API server running here. Although, maybe this is just a container that's created, but that is not actually running. And since I'm using ctr, I haven't gotten the status of this thing. So, it's not, like, telling me if it's actually healthy or not. So I should go immediately be stopped as well, right, or paused with this functionality of the container runtime. I think it's worth seeing if ps shows you any of these processes. Yeah. Let's go down all the way to the Linux level. Right? That's the
22:08 Lee Finds Kubelet Service Disabled
22:15 kernel. What do you think? Yeah. And I tried this earlier with netstat as well. Right? And I didn't see a kube-apiserver that was bound to any port. And now looking in the process table of the kernel, it's not showing us that there is, like, an API server per se. But I'm also not seeing the kubelet. Yeah. That's a problem. So what's the deal with that? Right? So maybe we should go ask, like, systemd if it likes how its life is, like, right now, which I think I can just say
22:56 systemctl, and then this gives me a tree, but then I need, like, the status. And there's this NGINX service. So NGINX is actually a systemd unit, which I guess is a sensible place to run NGINX from. I'm also seeing there's nothing listed under cron right now. And then we have all these NGINX worker processes, pocket number d. And then where is my kubelet? Am I missing it? No. I haven't seen it. And also very suspicious is the containerd tree is empty. And, like, normally, that is huge
23:39 at the top. That's up. Did I miss it? Yeah. It's like the second service in system.slice. Oh, yeah. I was like, we normally would see containerd shims and containers and a whole bunch of stuff, but nothing. Yeah. So even though systemd thinks everything is running and healthy and that there's no failed units, clearly, we have some misconfigurations here, and I don't see the kubelet service, which, like, normally, you would, you know, like, be able to, like, journalctl and then, like, follow the kubelet log. And I guess there is a kubelet
24:18 log there, but now it says that it stopped. Shutting down. kubelet.service successfully entered the dead state, and then, I guess, did we disable the systemd unit, maybe? I don't even know where oh, I guess crictl is pointing directly to containerd, so that's why crictl is working. And then containerd, like, has some residual stuff in it, but then the kubelet's not running. That yeah. Okay. This is making sense. So if I go maybe into, like, some of the systemd, like, unit tree maybe. And then maybe we, like, find here and
25:14 grep for if maybe the kubelet service was just disabled or something. Here we have the kind of leftovers of config. And since this is here, I'll just go and check that, yeah, this is just a normal kubeadm KUBELET_EXTRA_ARGS. I like your suggestion, the FireFlash, just to try and start the kubelet. So is there no kubelet unit file at all? Yeah. I don't see the unit file in, at least inside of the systemd system. Inside of /etc/systemd, there's no kubelet unit file. There was. Sorry? There was. There was in your list.
25:51 Lee Fixes Kubelet Systemd Unit (Enable & Restart)
26:16 There definitely was. Yeah. I agree. Because No. No. I mean, when you looked, it was listed. Oh, go into the system folder. There should be another thing in the system directory. Okay. This is just the service directory. We'll do an ls system. Maybe it's just not showing up for the find grep. Could be like, that just seems a bit weird. Okay. Yeah. Yeah. Let's I mean, we got That'll be inside system. Yeah. Oh, no. It's not there. Yeah. I don't see I mean, I would trust the output of find. Right?
27:06 I mean, I don't trust the f and f clustered. But yeah. That's Yeah. There should be a kubelet.service, which is not exactly a simple unit file that tells it how to start. I don't have access to the kubeadm config that was used to create this cluster, do I? And, like, normally, I would, like, try and get that config out of the Kubernetes cluster, but we don't have a very functional cluster at the moment because we have no cluster. Oh, okay. If I have the config file, then I could just
28:00 go and run kubeadm alpha phase kubelet. So if you go to var lib cache and grab the .deb file for the kubelet, you could just unzip it and grab the unit file. Var lib cache. You might wanna check the chat. Check the chat as well. Have you left the unit file somewhere for us, Marcus? And that'll save us the long way. I haven't moved the unit file. Have a look what the FireFlash has suggested. Maybe just try start. Yeah. So there's from the old media package can also live in /usr/lib/systemd.
28:44 Yeah. Do a systemctl status kubelet. It'll tell you. Like, we don't yeah. I shouldn't be making this up. It's just systemd can tell us the state of the world. There we go. It's in /lib/systemd/system. Thank you, Kevin. Yeah. It just says that it was disabled. And this is exactly what I was but I was just looking in completely the wrong path. Was I I thought it was for hundreds of these customers, and I was looking at the wrong path, but don't worry about it. Yeah. So this service is disabled.
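A disabled unit like this is usually inspected and brought back with a handful of `systemctl` calls. The sketch below wraps them in a hypothetical `run` helper that only prints each command unless `DRY_RUN=0` is set, so it is safe to try on a machine without systemd. The `run` and `recover_kubelet` names are assumptions for illustration; the `systemctl` and `journalctl` subcommands themselves are standard.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the command, and only execute it when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Inspect, enable, and restart the kubelet unit, then show recent logs.
recover_kubelet() {
  # status exits non-zero for a stopped unit, so do not let that abort the script
  run systemctl status kubelet --no-pager || true
  run systemctl cat kubelet        # prints the unit file and drop-ins with their paths
  run systemctl enable kubelet     # re-create the boot-time symlinks
  run systemctl restart kubelet    # start it now with the current config
  run journalctl -u kubelet --no-pager -n 50
}

recover_kubelet
```

Run as root with `DRY_RUN=0` to actually apply the steps. Enabling a unit does not start it, which is why the restart is a separate step.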
29:15 And the way that I remember how to do this on a systemd system involves putting units into disabled folders with Puppet. I know. You can do systemctl enable kubelet. Yeah. I just need to do this. Okay. And then we'll just check the status. And here we are. We have it enabled, and it's You're gonna have to do a restart on it, though, just to kind of take it all. systemctl restart kubelet. That was it gave me a successful error code. And here we have and then the kubelet's entering the failed state, and it says
29:56 Lee Debugs Kubelet Startup Error: Invalid Max Files
30:06 that invalid max files 1, must be greater than 1. So is this a sysctl? Is that what that's referencing? Because this is like a Go variable of, like, a max file sysctl. I mean I had a feeling that I should have exported all of the sysctls from my nodes when they were good. Well, you're gonna run a systemctl cat kubelet, and it will show you all the places that the kubelet is configured. I would start there. systemctl cat. Yep. cat kubelet. So you're gonna see there's you can ignore the bootstrap one typically, but
30:30 Lee Fixes Kubelet Config (`containerLogMaxFiles`, `maxPods`)
30:53 /var/lib/kubelet config might be important. And then there's the kubeadm-flags.env, which you probably wanna look in. And then lastly, /etc/default/kubelet. Yeah. Okay. We can start here. And it's pointing to containerd. It has a normal pause container image, which is definitely a good thing. And yeah. But this doesn't look like it's missing anything in particular either. K. We have this. This is gonna be the kubelet component config, I think. Yeah. Much longer. This is maintained by the kubelet team. It's not a Kubernetes API that you can use
31:33 Kubelet Component Config
31:46 inside your cluster. Right at the bottom. Max files. containerLogMaxFiles? Oh, okay. I never used this option ever. But Me neither. I'm also curious about the maxPods at the bottom. maxPods? Yeah. That doesn't look really good. Maybe we could bump that up to, like, 50 or something. And then you can have 50 pods on a node. Right? I think the default is 110. Okay. That's But you could also just remove it and have the default. I assume. I'm not entirely sure, but I assume. No. You are right about that because the structural
32:34 deserialize. And then yeah. I mean, we don't even need this either. volumeStatsAggPeriod, syncFrequency. I don't know if zero seconds. It seems like that's a, like, a default value. And then local editor is not happy with this terminal size discrepancy, but we're making it work. K. We updated the config. Let's go ahead and just restart the kubelet, and then we will look at the kubelet journal again. And at least now, it looks like the kubelet is trying to do things. It is. So that's better than, like, where we were before.
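The startup error here comes from the kubelet validating its own component config, which rejects a `containerLogMaxFiles` of 1 or less. A minimal sketch of a pre-flight check follows, assuming flat top-level `key: value` lines in the config file. The helper names and the sample values (including `maxPods: 5`) are invented for illustration and are not the values from the broken cluster.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the value of a top-level "key: value" line, if any.
get_field() {
  awk -v key="$2" -F': *' '$1 == key { print $2; exit }' "$1"
}

# Warn about settings that stop the kubelet or leave the node with little capacity.
check_kubelet_config() (
  file="$1"
  max_files="$(get_field "$file" containerLogMaxFiles)"
  max_pods="$(get_field "$file" maxPods)"
  if [ -n "$max_files" ] && [ "$max_files" -le 1 ]; then
    echo "containerLogMaxFiles=$max_files: must be greater than 1 or the kubelet refuses to start"
  fi
  if [ -n "$max_pods" ] && [ "$max_pods" -lt 110 ]; then
    echo "maxPods=$max_pods: below the kubelet default of 110; remove the line to get the default"
  fi
)

# Demo on a sample file standing in for the kubelet config.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxFiles: 1
maxPods: 5
EOF

check_kubelet_config "$cfg"
```

A low `maxPods` is not an error in itself, so the second message is only a hint; the first one mirrors a condition the kubelet actually enforces.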
33:06 Kubelet Starts & Core Pods Appear
33:31 And then error getting node, this is, like, not that surprising to me right now because Kubernetes is not up and running yet. So, like, it's very natural for the kubelet control loops to be, like, just throwing up. Typically, it means it can't speak to the API server. Right? Yeah. But we can, like, check and see. Okay. We got our pods here, but do we have our containers? No. Why is that? I just saw a comment from Moz in the chat. Static pod path is wrong. Oh, yeah. Oh, I missed that. Totally. Thank you. So static pod path. Where are you at?
34:15 Here. Yeah. That's not right. Passes the eyeball test too. Manifests. /etc/kubernetes/manifests, plural. That's probably correct. That is correct. Yeah. resolvConf that we're gonna be piping into every single pod in the cluster looks okay too. Cluster domain is normal. CIDRs are normal. Yeah. I think that looks good now. I think Cool. I think so. Yeah. Alright. Now we have a new manifest path. Let's just go and reboot that kubelet again. And then, hopefully, you know, like, say we give it a few seconds, and then we just are, like, optimistic that now we have some containers.
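Because the kubelet only starts control plane pods from the directory named by `staticPodPath`, a wrong path fails quietly: the kubelet runs but no static pods appear. A small sketch of a check for that, assuming a flat `staticPodPath: /some/dir` line in the config; the function name and the sample typo (`manifest` versus `manifests`) are illustrative and not taken from the cluster on stream.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: read staticPodPath from a kubelet config file and
# report whether it points at an existing directory.
check_static_pod_path() (
  path="$(awk -F': *' '$1 == "staticPodPath" { print $2; exit }' "$1")"
  if [ -z "$path" ]; then
    echo "staticPodPath is not set"
  elif [ -d "$path" ]; then
    echo "ok: $path exists"
  else
    echo "missing: $path is not a directory"
  fi
)

# Demo with a temp directory standing in for /etc/kubernetes/manifests.
root="$(mktemp -d)"
mkdir "$root/manifests"

printf 'staticPodPath: %s/manifest\n' "$root" > "$root/config.yaml"
check_static_pod_path "$root/config.yaml"    # path with the sample typo

printf 'staticPodPath: %s/manifests\n' "$root" > "$root/config.yaml"
check_static_pod_path "$root/config.yaml"    # corrected path
```

After correcting the real config, the kubelet still needs a restart before it rescans the directory, as happens next in the session.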
35:05 That's good. This is a good thing now. Hey. We got etcd coming up and stuff. And then from, like, that's like a container perspective. Now I do see that it seems like there's, like, a lot of restarts there. So, like, are these pods, like, happy? Yeah. And then all the base pods are fine, and then now we're getting some of the rest of the more, like, runtime control plane coming up, like, and all of these other things that are bootstrapped into the early Kubernetes cluster. So it seems like now it might be a reasonable expectation to be
35:40 able to, like, get some nodes from the cluster, and that's good. What's also good here is that our nodes are ready. So that should mean that, like, from a networking standpoint, we at least have, like, something that feels like Kubernetes. And then since I'm thinking about networking and we're, like well, you know what? Let's just look at the deployment, because, like, the whole point here is that we wanted to do a very simple thing with a single application. But instead, it looks like we've deployed Argo CD, and there's also an image registry. And there's a Go CD because we need
35:47 Lee Checks Application Deployment (Scaled to 0)
36:22 a CI/CD pipeline as well. And then, why not also have a place to store artifacts? You know, like, maybe who knows? Maybe now we're building, like, a Ruby on Rails application instead of the normal Clustered Rust. Well, you know what? It's nice to see someone. You know, I saw all these cores and all this RAM at these clusters. Now it's finally getting tickled a little bit. So that's good. Thank you, Marcus. Yeah. Good question from Russell. Have I gone down to one worker node? I have lowered this back to one worker and one control plane
36:52 just so that we don't have to keep fixing things on two worker nodes every week. I think it's simplified a wee bit. Let's wrangle in the complexity for me a little bit. Right? Yeah. So my intuition here and this I don't know. Maybe, like, should be good this sometime, but, I mean, we got some beefy nodes and nothing is passing the readiness check here. And why would that be? Oh, we have Kyverno. That's interesting. Do you think his Argo is to hide the Kyverno deployment? You think it was an illusion? I mean, well, there's, like, Kyverno just
37:10 Lee Attempts to Scale Deployment: Admission Webhook Error
37:37 in the middle of the screen. I'm I'm just assuming that that is a policy controller that's hooked in with an admission webhook to the API server, and that could be doing all sorts of stuff. That's actually an area where I don't have a lot of experience yet. So I'm about to learn something. Have ten minutes. Yeah. Now what can we do in ten minutes to make this interesting? How many hints are here in the, let's say if, like, I went over to the root directory? Alright. So we got five hints. We must have, like, already dealt with, like,
38:12 one of these. Okay. I mean, you could describe the Clustered pods and see why it's not ready. Yeah. Let's maybe describe the deploy for clustered. And here, it says that it's scaled to zero. Oh, no. That's just the replica set, but it also is zero desired. And Yep. It's scaled. Yeah. And then, again, I'm not in a pager, but it looks like you guys can see most of what I'm seeing. Yeah. We can see all. It's definitely set to zero replicas. It's saying that it's pointing to your image of v1.
39:02 So I imagine, if I just try and update this, that something is gonna fight me with all of the stuff that's in this cluster, but there's no reason not to try. So I don't know. Yeah. Just edit it. Take it easy, though. Yeah. Let's kubectl edit deploy. What do we got? It's Vim. This is Vim. Yeah. We do normally two replicas on this. Just one, but you can go for two. Why not? Look at the RAM. I am in some sort of, like, macro mode at the moment. I accidentally
39:45 used a different editor's shortcuts, and now it's not letting me hit escape. Yep. Normal. Same. I think I'm recording a macro right now would be my guess. But Alright. Someone Google how do we quit Vim. I'm trapped. I Shall we open a new terminal? Sure. Or actually, you know, I can just go and kill the Vim process myself. Yeah. Go for it. Why not? Yeah. Here. I have the ability to control my own destiny. I just need to tsh into the Marcus control plane. Marcus control plane one. Yep. Verb. Yeah. There was a bug actually in a
40:44 recent point version of Teleport that broke control c. So but that has been affected, unfortunately. But that was fun. It's a it's a VI process. Right? Mhmm. I just did a kill all, say kill v I. Is is a kill not enough to winch editor from the screen? Oh. Alright. I I see the process group here. I believe we're wasting time. It's gone. Let's just let's just let's just just I've got a new session. Okay. Cool. If you just join that. We'll get the session. We'll add on a couple of minutes just to get around that.
41:46 New session. It's the one that's from thirty five seconds ago. Yep. Cool. Echo hi into it. Yeah. Fine. Coffee. What's it doing? I was So you can set EDITOR to, yeah, micro. There we go. Okay. Okay. Edit the deploy for klustered, and I want to find the replicas. I should have just used the scale command. Oh, you were also going to update the image, weren't you? Oh, we can't your Salesforce Outlook. So there is probably Kyverno or oh, x509 certificate is valid. Okay. So is the webhook cert maybe not good, potentially?
43:16 What's the fastest way to get that error again? Scale the deploy. Okay. It says that it's scaled this time. I feel like that's not right, though. Okay. It has a desired. I was able to patch it. So maybe whatever that webhook was is only firing with kubectl apply or something, and when you issue a raw patch, the mutating admission webhook that's configured in Kyverno is not firing on patch or something. Not a lot of ideas there. So I should probably learn a little bit about the CRDs that make Kyverno work.
44:31 So that okay. So I can get the policies inside the whole cluster, and I guess there's none, but there might be some cluster policies. I'm having a terrible time with the editor size right here. Cluster policies. There's no cluster policies either. So why not check for mutating admission controller configurations? Mutating webhook configurations? Yeah. Mutating webhook. Let's look there. So there's this vault agent injector. We could just, like, delete that. Good call. I think so. Because I mean, like, we literally have no secrets in our application anyway. Right? So I just don't trust mutating webhook
44:56 Lee Fixes Webhook Issues (Deleting Configurations)
45:24 configurations. So No. Let's just delete all of those in all namespaces. Who cares? Yeah. And then let's go and describe that deploy again and see what else it's complaining about. It says that it has two desired. Let's look at the replica sets. And for klustered, there's only one replica set, but we oh, yeah. It should have scaled this replica set up to two, but it doesn't look like it did that because this replica set still says zero. So what's responsible for that? That's a good question, but I think I might just go and try editing the deploy
46:17 again, and setting the container image to v2 because that's, like, the goal anyway. Failed to call webhook, validate. That port 443 is still there. This is saying post to localhost:443. Okay. Where is this webhook configuration? It's a validating webhook configuration, not a mutating one. Let's check them. So let's go look at those. And that was only four eighty five hours ago, which is the same time that the API server was modified. I also don't know why the API server was modified. We didn't find any differences in the API
47:13 server manifest that we wanted to change at the time that we had looked at it. But I'm just gonna come in here and delete all of these as well, and then we'll try our edit one more time and come in here. Hey. So now we were able to actually edit the deploy. And we have two klustered replica sets. One of them has two desired oh, no. Only one desired. And then it's not so let's go and describe that replica set and see if there's, like, any events or helpful information about why it's not progressing. Alright.
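The webhook hunt above can be sketched in Python. This is a minimal sketch, assuming the output of `kubectl get validatingwebhookconfigurations -o json` has been saved and parsed; the configuration names, webhook names, and URL in `SAMPLE` are invented for illustration and are not the ones from the session. It lists every webhook whose `failurePolicy` is `Fail` (the default in the v1 API), since those are the ones that reject requests when their endpoint is unreachable.

```python
import json

# Hypothetical, trimmed-down stand-in for
# `kubectl get validatingwebhookconfigurations -o json`.
# All names and the URL below are invented for this sketch.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "example-broken-hook"},
      "webhooks": [
        {
          "name": "validate.example.invalid",
          "failurePolicy": "Fail",
          "clientConfig": {"url": "https://localhost:443/validate"}
        }
      ]
    },
    {
      "metadata": {"name": "example-ok-hook"},
      "webhooks": [
        {
          "name": "check.example.invalid",
          "failurePolicy": "Ignore",
          "clientConfig": {
            "service": {"namespace": "default", "name": "hook", "port": 443}
          }
        }
      ]
    }
  ]
}
""")


def blocking_webhooks(config_list):
    """Return (configuration, webhook, target) for each webhook that fails closed."""
    found = []
    for cfg in config_list.get("items", []):
        for hook in cfg.get("webhooks", []):
            # failurePolicy defaults to Fail when it is not set.
            if hook.get("failurePolicy", "Fail") != "Fail":
                continue
            client = hook.get("clientConfig", {})
            if "url" in client:
                target = client["url"]
            else:
                svc = client.get("service", {})
                target = f'{svc.get("namespace")}/{svc.get("name")}'
            found.append((cfg["metadata"]["name"], hook["name"], target))
    return found


for row in blocking_webhooks(SAMPLE):
    print(row)
```

Listing the fail-closed webhooks first is a gentler option than deleting every configuration, because it shows which one actually points at a dead endpoint.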
47:51 Lee Debugs Deployment Scaling: Quota & Pod Security Policy Errors
48:17 It says failed to create. Creating pods is forbidden for pod security policy. Cool. Okay. So just like kinda lots of policy stuff. Let's go. Let's look at the PSPs. This says my scheduler, and I honestly don't know if there's even any consequence to deleting a pod security policy because, I mean, we weren't using one in the first place. Just delete it. Smash, smash. Marcus is sad. He says my lovingly crafted webhook was just carelessly deleted. I didn't even call it by name. Yeah. And but you'll have to send me the write-up,
49:09 Marcus. At this point, I'm racing through your intricately crafted monument of Yeah. Of policy. I feel like we're close. So let's give it another couple of minutes and see if we can get this done. Okay. I mean, we're still making progress. Like, we haven't gotten stuck. Right? So it's just maybe I'm a little bit verbose in my approach every now and then. You wrote, yeah, klus instead of grep klus. Okay. Yeah. I started typing into some, like, text that's not part of the terminal that was just, like, in my PTY,
49:44 and, like, I couldn't see. Hey. Get deploy, rs, pods, grep for klus. So we have these replica sets, and is it, like, still complaining about something? If I go back to what I was looking at there, it says error creating pods. And if I there are no pods. Yeah. But there's So I actually caught this at the start. In the API server manifest, he enabled a static admission controller. Okay. Cool. Let's go take a look at the and I don't wanna do this from this bottom part of the terminal. Let's open a pager at the Kubernetes manifests,
50:46 for the kube-apiserver. And you mentioned that there is an admission controller. Yeah. The pod security policy controller was enabled on this API server. Right. So that would be the change that we know happened, and it's in this enable-admission-plugins flag. NodeRestriction should be the only thing that's there. Okay. If you resize your terminal, it should help a little bit, I think. I'm not sure. Resize. Yeah. Maybe I should make the terminal closer to the size that it should be. It's like, unfortunately, I don't know the exact number of columns, and it, like, doesn't tell me
51:02 Lee Fixes API Server Config (Disabling PSP Admission Controller)
51:32 when I'm resizing my terminal, like, how many columns. But and it's manifests kube API server, And looks like I'm kind of closer to where it should one? So do we want this node restriction one in in here as well? We yeah. We do want node restriction. Okay. We'll keep that. And if we do restart Kubelet just to speed that up. It should restart our API server for us. Okay. Cool. Yeah. And then maybe we should just take a quick look at the hints and see what we've solved so far. Yeah. And one, it says something
52:24 hook, probably web hook. Web hook. Yeah. Web hook. There's something between us. So that'll be the NGINX. Right? Yeah. Yeah. Yeah. Yeah. We did fix that. I can't remember. My memory isn't what it once was. Oh, I don't know if we did that. Yeah. I'm not sure. No. We haven't. Marcus says no. Okay. This one is a very small emoji that Pause. The pause emoji. The pause emoji? Oh. Oh. Okay. And, I hope you backed up those config files. What config files, Marcus? Which which ones? So what do you think is a good approach since
53:05 Lee's Time Ends & Marcus Explains His Breaks (Limit Ranges, Pause Container)
53:13 we are we've we've kind of run into the end of time. Do we wanna do a a walk through the remaining problems? You think, David? Or Yeah. Marcus, tell us what what what if we're caught yet. Okay. So you skipped over a bunch of stuff that was good because I did a load of stuff in NGINX and all that kind of thing, but couldn't get the certificates to play nicely. So I was just like, yeah, whatever. Just have it failed. Mhmm. So that's cool. You saw that I just absolutely flooded it with different apps. I basically just went through Artifact
53:49 Hub and installed all the top Helm charts. None of those were actually doing anything. So Kyverno was just installed with nothing. There was OPA on there as well. There was Chaos Agent, all sorts of stuff. Yeah. One you haven't got regarding memory. So in the default and the kube-system namespace, I created a new limit range that limited the min and max memory to one kilobyte. Everybody always forgets about limit ranges. Poor limit ranges. Yeah. So does this actually affect things, though? Or I don't know. I don't know. I just threw it in there.
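To make the limit range break concrete, here is a minimal sketch in Python of the kind of check a LimitRange with a memory `min` and `max` implies. It assumes "one kilobyte" was written as `1Ki`, which the session does not confirm, and the quantity parser is deliberately simplified: it handles only a few suffixes, not the full Kubernetes quantity grammar.

```python
# Simplified parser for Kubernetes memory quantities.
# Only the suffixes below are handled; this is an assumption of the sketch,
# not the complete quantity syntax.
_SUFFIXES = {
    "Ki": 1024,
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
    "k": 1000,
    "M": 1000 ** 2,
    "G": 1000 ** 3,
}


def parse_memory(quantity):
    """Convert a quantity such as '64Mi' into a number of bytes."""
    for suffix in ("Ki", "Mi", "Gi", "k", "M", "G"):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)


def limit_range_problems(container_memory, min_memory, max_memory):
    """Return the reasons a container memory value falls outside min/max."""
    value = parse_memory(container_memory)
    problems = []
    if value < parse_memory(min_memory):
        problems.append("below min")
    if value > parse_memory(max_memory):
        problems.append("above max")
    return problems


# A pod asking for 64Mi in a namespace capped at a hypothetical 1Ki.
print(limit_range_problems("64Mi", "1Ki", "1Ki"))
```

LimitRange constraints are applied when a pod is admitted, so pods that were already running before the limit range was created keep running; only new or recreated pods in that namespace would be rejected.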
54:40 I wonder. Because It should affect Cilium. I would be surprised if Cilium was working at all, which is probably why none of our pods are ready and stuck in that zero Yeah. On the ready stage because there's no Uh-huh. Working CNI on the worker nodes right now. So, definitely, we would just delete the limit ranges across the entire cluster. Yeah. There were actually two of those. There was another namespace where you had one. Cool. Okay. That was So the one, James has caught it, is around the pause container. So when you actually went past it
55:19 Did I, looking at the Yeah. It's the /var/lib/kubelet one, I think. Yeah. Or something. We got there from systemctl cat kubelet, if I can find it. And then, there is the /var/lib/kubelet/kubeadm-flags.env. Really gross. Take a look at the last line. Yep. Oh, the pod-infra-container-image right there. Oh, you overrode it over there. So is that, like, an old version of that container or something, or is that not a good tag? No. If you actually keep
56:04 you scroll across, so you've got some more width, I think. If we just cat that file, do you mind if I just Yeah. Where are we in? This one? It looks like there's a I was at the end of the file there. You know? I was trying to Is there more? Okay. Maybe I just completely removed it. That was weird. It was meant to be the same image, but on the tag for the version, I changed the dots to underscores. Oh, just There we go. That you need to Sorry. I can see that. I
56:47 couldn't see that either. I think that's the weird scroll bug. Yeah. I can see that. And I was using a pager, but I guess it didn't show up. But, yeah. So that tag is malformed, which means that the pause container that was being used by the kubelet would not get pulled. It wouldn't approve it. Yeah. Yeah. And so then that prevents containers from starting, which would be even if you got the pod security policy thing done, then you still prevent containers from starting inside the cluster. So why do the static pods work?
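The pause image break can be sketched as a small Python check. This is a minimal sketch, assuming the kubelet flags live on one `KUBELET_KUBEADM_ARGS="..."` line as in a kubeadm-managed flags file; the sample line, registry, and version below are invented for illustration. Underscores are legal characters in an image tag, so nothing rejects the tag as a syntax error; the pull simply fails because no such tag exists. The check therefore only flags tags that do not look like a dotted version, which is a heuristic of this sketch and not a rule the kubelet applies.

```python
import re

# Hypothetical flags line; the image reference is invented for this sketch.
SAMPLE_ENV_LINE = (
    'KUBELET_KUBEADM_ARGS="--network-plugin=cni '
    '--pod-infra-container-image=k8s.gcr.io/pause:3_4_1"'
)

# Heuristic: pause image tags normally look like 3.4.1 or 3.9.
DOTTED_VERSION = re.compile(r"^\d+(\.\d+)+$")


def pause_image(env_line):
    """Extract the --pod-infra-container-image value, or None if absent."""
    _, _, value = env_line.partition("=")
    for arg in value.strip().strip('"').split():
        if arg.startswith("--pod-infra-container-image="):
            return arg.split("=", 1)[1]
    return None


def tag_looks_suspicious(image):
    """True when the tag is missing or is not a dotted version number."""
    _, _, tag = image.rpartition(":")
    return DOTTED_VERSION.match(tag) is None


image = pause_image(SAMPLE_ENV_LINE)
print(image, tag_looks_suspicious(image))
```

A check like this would have surfaced the edited tag without relying on a pager that hid the end of the line.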
57:25 Why Do the Static Pods Work
57:37 Because I, you know, I kind of assumed, oh, okay. Well, my pods are running, so my kubelet is probably healthy, but, yeah, I guess not. Well, do they just restart the containers that were previously there? Well, this obviously would work regardless of the CNI because the CNI is then started by the API server and other things that keep the container. So the static pods don't actually need the pod sandbox because they're actually mirror pods. So those are not real pods. They're projections on the API server. So
58:12 but the pause container is used traditionally to create namespaces that the kubelet can create further containers underneath to support the multi container use case. But the static pods, I don't think take advantage of the pause container. Can the static pods not use multiple containers? Oh, I don't know. I guess I don't know. You could just schedule them next to each other in different sandboxes and not have, like, IPC sharing and stuff. But that doesn't make any sense because then how would you do volume mounts between, like, init containers and things? I'll have to look into static. Static
58:51 manifests must have a whole bunch of restrictions on them to avoid some of these weird things. There are restrictions, but, like, I remember when I was developing on kubeadm that I was surprised at how much worked. Right? Like, all of the volume mount stuff, like, projecting things and changing like, that whole API is load bearing for all of kubeadm's configuration. And it's a normal pod spec, so I would be very surprised to have, like, a different behavior between, like, multiple containers. But, no pod sandbox used for the
59:33 for the static pods, I guess. I mean, I guess, like, you could validate that by looking inside of containerd directly, and grepping for the API server, for instance. Alright. Are you still on the session? Because I jumped off. Yeah. Dashing has to go first, and you would be correct about that. Yeah. There's no pause container for that. I don't oh, well, I mean, it might not show up because if I look for pause here, maybe those pause containers are used for something else. It's not the API server. It's my guess.
1:00:20 So neat. That's using a different pause container version. That's not even the version that you linked. You'd linked it to, like, three forty one with, like, underscores and stuff. Right? So that that's just, like, leftover from before, that the perfect evidence of it not functioning, but the static pods are still working. So Alright. Is that is that all, Marcus? Yeah. You caught the others when when you well, when I did the typo. So when you caught that that error with, like, one whatever it was, that was because that one was meant to be one k I.
1:00:55 So it was meant to limit the log files to, like, nothing effectively. Yes. Uh-huh. Yeah. I got a little bit fortunate where that showed up in the Kubelet log, I think. That was a that was a nice quick Kubelet exit as opposed to, like, a a Kubelet error that's more devious because, like, the flag is, like, not functioning, and Kubelet is still running as a system. We didn't understand any password. Yeah. Pretty. That was so fun. Thank you so much for for putting that, that puzzle together, for us to to discover through. I'm not going to apologize.
1:01:36 Alright. Those were some cool breaks. Now is now is let's let's at Looking forward to this now. Yeah. Let's let's get started. So let's see what Lee has prepared for Marcus. So, Marcus, I have opened a session or I'm opening a session on Lee control plane one. Please just Okay. Yep. Join session. It looks like we have your dot files as well, Lee. So Yeah. Nice. Alright. So after a quick config, which may already be set up in the dot fails. Who knows? And Yeah. Could be. Oh, you've got some tools coming. Key names, JQ, KibTTL,
1:01:48 Marcus's Challenge Begins: Access & Setup
1:02:19 SSD control. Hell. Alright. Alright. Go have some fun, Marcus. Automated the setup. Yeah. The Helm one's actually useful. Cool. So I think Helm's actually already installed, but just in case. Yeah. We do use Helm for the the deployment. So yeah. I think you could have broken control plane, Marcus. Cool. Okay. Let's have a look what we've what we've got then. Okay. So let's first have a look at what we're dealing with. So okay. So the name's been changed. Let's let's change that back. Oh, you can see still see that. Alright. Yeah? Uh-huh. Cool. Alright. Let's
1:03:08 Marcus Investigates Cluster State & RBAC Forbidden Errors
1:03:35 get rid of that. What else we got? Okay. Is your best one? I don't think so. No? No. Okay. That's What are thinking so far about this error message? So the user is not what I was expecting. I was expecting that to be admin, especially as we changed it. What else in here? So let's see if there's been anything changed with the Kubelet. No. That's alright. Okay. So Good comment there from Russell. Do a dash v name and check the the debug logging control if you want. Yeah. Good. Good point. Oh, you now you can't see me scroll,
1:05:16 can you? Hold on. That's alright. I'm doing. Oh, you do? Yeah. Yeah. Okay. Okay. So response. Okay. So I I kinda feel like this is a real cube control and a real API server. Yeah. It looks alright. Yeah. Okay. That's the one I pulled out. Yeah. Try a cube control version curiosity. If you want a cuter version of kubectl, there's also kubectl. What was that, Lee? There's also cute cute CTL as well over there. Unfortunately, you overwrote my my shell script that was telling you to just mute the. I've already been forwarded. Luckily, you're full than I thought you'd be.
1:06:31 So Okay. So we're able to connect to the server. We can see the the version's responding. That looks correct. Okay. Let's Yeah. I didn't have enough time to read help, and I used to add more. I I nearly nearly did. Oh, I can no. There we go. Anyway, Where are you looking? Yeah. No. You can't see what I can see. I can. Yes. So now preference to authorization mode. Is that authorization mode correct? That's right. Yeah. I mean, when I first seen that, I was like, oh, that's been changed. But now I'm starting to second guess
1:07:38 myself. I only changed one line in this file. No. Okay. Hey. There you go. Sorry? Yeah. The the only change in this file is the name. Oh, here we go. Well, you can totally just leave it like that. It's it's completely gone. What am looking at? Sorry. What's the top. The metadata name has been changed, but that's superficial. So Oh. Yeah. Okay. Does that I don't think that makes a difference, does it? No. Superficial. Okay. So what else have we got then? Surely nobody with mass spec kit that. That would be weird. Yeah. I I know very little about it,
1:08:56 so who knows? Okay. I also I also from that, just because I didn't want them to do the announcements. Okay. So we've got some containers. That's that's good. See if we've got any pods. Okay. So we've got kube calendar. Okay. Okay. Cool. Yeah. Google Calendar. Sure. And the controller mule. Get it. The Oulu controller. Yeah. And Yeah. I I think I think you had more fun with this. Okay. So let's see if we can do anything. So we can't can't without the try without the, dash a flag there. Okay. Yeah. That dash a flag, implicitly lists namespaces.
1:10:17 So this is, like, an important thing for, like, any platform engineers. Like, if you're ever trying to share a cluster with, like, lots of tenants and, like, don't want them to know the namespaces of each other, then it creates, like, a huge usability issue in kubectl. Because you cannot use a resource name with a list in RBAC. There you go. I would have done it otherwise for you so that, you know, you wouldn't run into the problem as soon as you did. Okay. I mean, you only have to modify klustered, so maybe it's not the end of
1:10:50 the world. Yeah. Okay. Let's let's give that a go. Let's let's see what we can do. No. Yeah. Let's not done anything. Yeah. Failed to create new replica set. Forbidden exceed quota. Oh, quota thing. Oh, god. I don't know where quota is a diff a Well, I would take a look at API resources. Would be a good place to start. Okay. Resells. Super. Yeah. Oh. Uh-oh. And there's our backbite share. And have security issues with the kinda own pods. Oh, you think you can edit but not delete? It's sneaky. I see. Let's just yeah. 9,000.
1:12:58 Nice try. Okay. Run kubectl auth whoami. Just out of curiosity. Who what? Who am I? Maybe I oh, yeah. No dashes. It's just like that. Just whoami. Like the Linux command. Maybe that's the plugin. Yeah. That Well, I think I'll see what The scenario here is that the SRE team had a meeting, and they read this article from Rawkode Security about how system:masters is insecure. Because system:masters, it's the group that's used on the admin kubeconfig, and it's the group that gets bound to
1:13:05 Marcus Finds User Identity (`uwu-admin`) & Break Glass Hint
1:14:05 by the default cluster role binding. But if you delete every single RBAC object inside the cluster, system:masters has a bypass that's hard coded into the API server. And it still works. It still lets you access the cluster even if you've closed your RBAC. So I decided, you know, when I was talking with our SRE team that maybe it would be better if we use some least privilege roles, you know. Like, you can get in there and, like, delete some sensible pods and stuff if, like, you need to quickly change production stuff. But all the policy and
1:14:46 the things that you don't normally wanna touch, we'll keep that out of the normal RBAC range. But, of course, just like any Linux server that you log into with your own user. Right? You're gonna wanna have, like, some sort of sudo privilege. Right? Well, one of the hints in there Oh, what Marcus tried there was switching to the Kubectlone, which was a nice try, but that would be able to allow you to modify the RBAC. But I would encourage you to run can-i --list again and pay attention to a couple of things in there? Yeah.
1:15:07 Marcus Debugs Break Glass Cluster Role Binding Access
1:15:19 I I saw there's the impersonate. Oh, even before the impersonate, there's something worth checking out. Oh, is there another? I don't actually know if this shows up inside of Canon or not. Yeah. The the thing that you need, it may not show up there for a peculiar reason, but maybe it's there. Elephant strike is interesting. You know? Sorry. Where do you look at? The very I'm trying not to get too much away. The first five, like Oh. Yeah. Sorry. Your scroll is different to my scroll. Yeah. It may not show up for you. Do if you type into the head No.
1:16:07 Yeah. I've I've got it. I've got it. I've got it. Yeah. Yeah. Yeah. So there's a there's a rather interesting that you have complete unfiltered access to something called break glass, which is interesting. But the impersonate is also very interesting. So Yeah. But I'm curious about break glass. Let's take a look at that. So what was that? That was the Cluster role binding. Cluster role behindings. Yeah. I love that everyone watching is getting a deep dive into every single RBAC object. It's exactly what the United object community needs right now. Oh, I would have expected that to work
1:16:54 because you have unfiltered access. So it must be the impersonation there. Yeah. I have no idea how to do that. Yeah. So you don't have get on cluster role bindings or list and that sort of thing. But we thought you would do it on the name, though. So you have to add the name. So that's good to know that resource name shows up there. break-glass. Break glass. Thank you. Okay. So it's not inconsistent. Well, it's not in the namespace because it's a cluster role binding. Oh, of course. Yes. Of course.
1:17:41 But it's not found. Well, that should've worked. I guess the list fails. Try edit. You Yeah. Yeah. I just have to edit myself. It's it's legitimately not there right now. Alright. Okay. Alright. I'm I'm good home. Oh, okay. Okay. Oh, hold on. Right. Okay. Yeah. So can we create the break glass? Nice nice idea. So hold on. No. That's a custom role binding. Right. K. You gotta find yourself the system masters. Yeah. Oh, crap. Yeah. It has line If you change using QCTL, you will have a. What? Sorry? If you change your command to use QCTL
1:19:13 instead of QUKE, like, cute, like, it's super cute, cute CTL, it will auto complete. Yeah. Yeah. But you'd have to find some YAML for a cluster role binding. Right? Yeah. You don't have to. You can totally one liner these things. Just maybe look at the help. They're really, really simple once you get this one. Oh, yeah. Nice. Oh, yeah. Nice. Yeah. I actually really like doing a dry run client and then outputting YAML. It's usually how I author my cluster role bindings for my control repo. Very cool. What cluster role was it called?
1:20:05 Cluster admin. Yeah. Cluster admin. You need to bind it to a user. Oh. But I'm I don't have the permission to create. Oh, that's that's odd. Did you tell break glass properly? Yeah. Because I'm pretty sure those it's three I I did this. I tested it. And Did you create a cluster role binding or a role binding? Let's see. Yeah. Cluster role binding. It should be cluster role binding. Yeah. Yeah. Because it says right here, have any verb on break glass, and then you should be able to create it and bind it to cluster admin.
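One plausible reading of why the create was refused, which the session itself never confirms, is the documented RBAC limitation that `create` cannot be restricted by `resourceNames`: a create request goes to the collection, so no object name is available when the request is authorized. The sketch below is a simplified Python model of that matching behaviour, not the real authorizer, and the rule shown is a hypothetical reconstruction of a "break-glass" rule rather than the one used in the cluster.

```python
def rule_allows(rule, verb, resource, name=None):
    """Simplified model of how one RBAC rule matches one request.

    `name` is the object name carried by the request, or None when the
    request targets the whole collection (create, or an unfiltered list).
    """
    if verb not in rule["verbs"] and "*" not in rule["verbs"]:
        return False
    if resource not in rule["resources"] and "*" not in rule["resources"]:
        return False
    allowed_names = rule.get("resourceNames", [])
    if not allowed_names:
        return True
    # With resourceNames set, the request has to carry a matching name.
    return name in allowed_names


# Hypothetical rule: any verb, but only on the object named break-glass.
BREAK_GLASS_RULE = {
    "resources": ["clusterrolebindings"],
    "resourceNames": ["break-glass"],
    "verbs": ["*"],
}

for verb, name in [
    ("get", "break-glass"),
    ("update", "break-glass"),
    ("create", None),  # POST to the collection carries no name
    ("list", None),    # unfiltered list carries no name either
]:
    allowed = rule_allows(BREAK_GLASS_RULE, verb, "clusterrolebindings", name)
    print(verb, allowed)
```

Under this model, a break glass path built on `resourceNames` works for editing an object that already exists, but creating it requires a separate rule that grants `create` without a name restriction.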
1:21:11 Yeah. That's great. A oh, the yeah. That's the role. And then it should be your own Oh, admin. Not yeah. Oh. Yeah. It should be the uwu-admin. Yeah. I tried to do that. Failed to create. Okay. So there's one other solution to this. I don't know why this is not working because I made sure I put basically the exact command into the hints file. Let's check the hints. Let's have a look at them. Yeah. Let's stop. So there's another solution. Yeah. Here. Let's just use this other solution
1:22:09 here because I don't know why your command is not working, but it should work. I'll have to look through the history file that I saved. Because I tested it and it functions properly. This is a good way to, like, provide a way to break glass in a cluster that you don't normally wanna paint. So we can use kubeadm and the existing PKI that's inside of the Kubernetes manifest PKI directory to fix this problem, by running kubeadm certs, and then there is a sub command to regenerate the admin.conf. Yes. kubeadm certs renew, I believe.
1:22:18 Marcus Attempts `kubeadm certs renew admin.conf` (Attempt 1)
1:22:53 No. Oh, yeah. Renew admin, maybe? Renew all. You don't wanna renew all of them. Oh, okay. Well, sure. Yeah. You can. I don't think that that would spit out the admin.conf. No. Because there's a special sub command for the admin one. admin.conf. Yeah. Okay. This is the second solution to the problem, to brute force using the cluster PKI. Right. Cool. So you should be able to be administrator now, I think. Does that output to the same place? It should spit it out to the proper directory. Yeah. Looks like it's been regen. Oh,
1:23:50 interesting. You may have to restart the API server. Yeah. It should just be I don't think that file got overwritten. Maybe you have to move like, make a backup or something. Let's have a look. Renew. Additionally, regardless. Oh, okay. Yeah. No. Okay. Yeah. So it's there's defaulted. Yep. Do you remember that? Error reading configuration from cluster falling back to the default configurations. That's okay. We don't use skip proximity cluster. Yeah. Yeah. I think we might just need to move the file to a different name, just to verify that kubeadm is actually trying to write it out.
1:24:58 Yeah. Oh, it says missing. So there's another way to do this. Was it? Nope. Maybe not. No. Don't take it. All of the PPIs are used to do this, and I and I did generate this earlier as well. That's so I might log in and and tinker with that cluster role binding command really quick. Okay. So that Yeah. You shouldn't need to keep config params. That looks like it working. Right? Yeah. Try try to get pods. Oh. Oh, it's been really laggy for me. Are you are you seeing that as well? No. It's okay. Well, that won't restart.
1:26:54 And if I'm just full, you would have to do it. Power back is always a pleasure. Yeah. Oh, okay. I've got access to the cube system. Yeah. So maybe you can grab a token in there. Yeah. It's a fair solution. You can grab a token from one of the from something. I don't know necessarily what something might have the ability to give you. Oh, okay. Yeah. Yeah. I mean, one upgrade would be just, like, being able to see, like, workloads in all namespaces, which that would be, like, a lot of pods. Yeah. I'm not sure which of these.
1:28:16 K. So why can you access kube-system but not default namespace? No. I can't. I just can't access namespaces in general. Mhmm. So I can do it in the default, but I'm blocked from actually getting namespaces. Yeah. Since you have access to secrets inside of the kube-system namespace, another avenue for you to do a privilege escalation in this situation is to grab a legacy service account token from one of the more privileged pods. Okay. What we got? What do we think? Node controller, maybe? No. That won't have what you need.
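The token route mentioned above comes down to reading a secret and decoding it, since values under a secret's `data` field are base64 encoded. The following is a minimal sketch in Python, assuming the secret has been fetched as JSON (for example with `kubectl get secret ... -o json`); the secret name and token value are invented for illustration.

```python
import base64

# Hypothetical legacy service account token secret; name and token are invented.
SAMPLE_SECRET = {
    "type": "kubernetes.io/service-account-token",
    "metadata": {"name": "example-token-abcde", "namespace": "kube-system"},
    "data": {"token": base64.b64encode(b"not-a-real-token").decode("ascii")},
}


def bearer_token(secret):
    """Decode the bearer token stored in a service account token secret."""
    if secret.get("type") != "kubernetes.io/service-account-token":
        raise ValueError("not a service account token secret")
    return base64.b64decode(secret["data"]["token"]).decode("utf-8")


print(bearer_token(SAMPLE_SECRET))
```

Whether the decoded token helps depends entirely on what the matching service account is allowed to do, which is the point being made about picking a more privileged one.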
1:29:28 We could just do a kubectl delete of the bad role binding. Right? You could. You have to put through that. But it's that you would have less access than before. Right. Maybe I don't understand. Well, what have you done to the RBAC? Because I think I'm a bit at a loss right now. So I basically have issued a new certificate using the kubeadm alpha, or the kubeadm oh, sorry. There's another there's a do this. Go to kubeadm kubeconfig. There's a whole new stuff to manage that's GA now, for this.
1:30:22 Oh. And what user can we mint for this? You could mint the user of, like, a privileged service account, and then you have to format the service account username. So what I've done is I've minted a certificate, and using this and the kubeadm config file, you can ask kubeadm to give you certificates that are signed by the cluster CA. And since Kubernetes clusters will accept mutual TLS client certificate auth, the username that you were trying to change before, it's actually baked into the x509 data, and it's pinned like that based on
1:31:12 so that's how you have your identity, that uwu-admin. And then I've gone in and added privileges to uwu-admin to be editor in every single namespace except for two of them. And then, also, I added that break glass mechanism that I guess I didn't test well enough. I thought I used it properly, and I was able to use the mechanism just fine. But I must have accidentally created the cluster role binding using a higher privileged account while I was setting it up. And But I don't understand why our kubeadm certs
1:31:52 renew admin.conf used your minted username because it shouldn't. Right? No. I don't that's just not producing the admin.conf for some reason. You didn't change any of these, did you? No. No. The only thing I deleted was the admin.conf. But, if you use anything that might have access to secrets, or to exec in pods. You could also exec into a pod from the workload from the working org, and you would be able to get, like, tokens that would give you access. None but I've really not done you any favors, by messing up
1:32:54 that as you go route there with the cluster role binding create. We do anything with the reset. Screw it. Okay. That doesn't look like it's gonna do much. Did I leave a kubeadm config in the root home directory? Yes. I did. Yeah. So when you use your kubeadm commands, you can pass that. That's just a config map from the kube-system namespace. But I just wrote it out for convenience, so that you wouldn't have to go and find it. kubeadm stores its config in kube-system. And so when you use kubeadm commands like,
1:33:16 Marcus Resets the Cluster (`kubeadm reset`)
1:34:17 like, kubeadm, it asks for the kubeadm config as a flag, and you can pass this. Same thing for when you, like, renew certs and stuff, and this might alleviate some of the issues that you're having with regenerating the admin.conf. I think when I was doing it and testing that roughly, it worked. I'm pretty sure I've just deleted the cluster. Oh, did you run a kubeadm reset? I did. Okay. Well, I suppose when you and you didn't take an etcd snapshot. Nah. That's not from scratch. And then you'll need to deploy the
1:34:55 Cluster Redeployment & Untainting Node
1:34:59 workload. Would you like me to add the things back that can? No. I'm just gonna deploy klustered and then be done with it. This is the bare metal version of just rolling the nodes. Right? Why don't you apply everything back except the RBAC change? Can you do that? Yeah. I can. Alright. Okay. Alright, Marcus. We'll give Lee just a minute to inject the other breaks minus the RBAC break because RBAC's too hard. I've gotta get the node set up again. Get the hard thing up. Yeah. Mhmm. I actually block access to the metadata API
1:35:56 as part of my setup. So you'll have to do an iptables -F to flush that block. Capital F. But you should be able to run the public IP, etcetera. Cool. Alright. Well, that's silly. I'm back. So if you run a get pods -A, that's our cluster. Cool. Why don't I make this a bit easier for you? Hold on. All of the scripts that we use to build this up are in here. We can actually just cat the user data, which will allow us to do some of it. So let's take I need to maybe
1:36:57 jump. It should be alright. We don't need this. I don't need this. I mean, we probably only need the workload at this point, but let me just make sure. There's our kubeadm config. Join config. Yeah. We don't do any of that. So all good. All gravy. Let's just skip. Keep it. Yeah. We're probably gonna need that again there. So let's start from here. I don't have to run that actually. Wait. I will need to do this slowly where we run the script. So which one is kubectl? Part 11. Okay. So bash part 11.
1:37:50 12. Alright. 13. 14. Fifth deploy. Oh, I'm just gonna go straight to 18 instead of 17. At this point, it should be alright, though. It's just this stuff. 19. There's our workload. Come on. Where's our workload? Should we just come in? Oh, we don't need Teleport. We've got Teleport. Grab workload. Is Kubernetes now? Sure. Alright. We now have a yeah. We have a working cluster. Oh, we've now got vCluster and Tanzu being deployed. I'm assuming that's Lee. I don't know. It's not gonna be We will have to remove the taint from the control plane node so that we don't
1:39:04 have to fix the worker. So if you could do that, Marcus. Yeah. Yeah. Which one's let's great. Hold on. Well, I mean, you can just remove the taint from the node. Oh, yeah. Yeah. Yeah. Yeah. I'll just edit node. Control plane one. Remove those taint lines. Yeah. You need to remove the taint. Oh, did that work? There we go. It should do. Right? Run get pods, and let's see. Alright. Cool. Alright. Lee, is your break back in place, or do you need a minute? I'm waiting. It's a competition now. It's waiting for things.
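What "remove those taint lines" does to the Node object can be sketched in a few lines of Python. This is only an illustration, assuming the node carries the standard kubeadm control plane taint; the node name `control-plane-1`, the extra `example.com/dedicated` taint, and the helper `remove_control_plane_taints` are hypothetical and not taken from the stream.

```python
import copy

# Taint keys that kubeadm applies to control plane nodes.
# Older releases used the "master" key; newer ones use "control-plane".
CONTROL_PLANE_TAINT_KEYS = {
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
}


def remove_control_plane_taints(node):
    """Return a copy of a Node manifest without the control plane taints."""
    updated = copy.deepcopy(node)
    spec = updated.setdefault("spec", {})
    remaining = [
        taint for taint in spec.get("taints", [])
        if taint.get("key") not in CONTROL_PLANE_TAINT_KEYS
    ]
    if remaining:
        spec["taints"] = remaining
    else:
        # An empty list is equivalent to no taints, so drop the field entirely.
        spec.pop("taints", None)
    return updated


# Hypothetical Node manifest, trimmed to the fields that matter here.
node = {
    "apiVersion": "v1",
    "kind": "Node",
    "metadata": {"name": "control-plane-1"},  # hypothetical node name
    "spec": {
        "taints": [
            {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"},
            {"key": "example.com/dedicated", "value": "db", "effect": "NoSchedule"},
        ]
    },
}

untainted = remove_control_plane_taints(node)
print(untainted["spec"]["taints"])
```

With the taint gone, ordinary workloads can be scheduled onto the control plane node, which is why the worker does not need fixing. On a live cluster the same change is usually made with `kubectl taint nodes <node-name> node-role.kubernetes.io/control-plane:NoSchedule-` rather than by hand-editing the Node.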
1:39:53 Lee Re-injects a Break (PSP Default Provider)
1:40:00 So just give it a quick moment here. It might take, like, an actual minute because I am also installing Kubernetes. But because we're there, and then it needs to do this. And then Marcus was trying to feed your friend. Nice try, Marcus. It's worth a try. Oh, I forgot to do a thing. That's the one I can do. Sorry. We're, like, waiting on something that's not gonna finish until I run this one command. Okay. Now we should be on the path. I needed to add a PVC provider and set it to default.
1:41:43 Marcus Debugs Postgres: Waiting for PVC (etcd Issue)
1:41:45 Yeah. We have to reassign it. It's there now. And now the other pieces will fall into place, and then we will get to learn a cool thing. So where's Postgres gone? I just deleted that in anticipation of it coming back. But you deleted the deployment. Sorry? Like, the deployment for it's gone, or the stateful set. Sorry. We're just waiting for a CrashLoopBackOff. Yeah. It's the stateful set. We're just waiting for CrashLoopBackOff right now, and I suppose I could delete
1:42:42 this pod and expedite this process. So is the cluster currently broken, Lee? Is it broken? I mean, no. Can Marcus start to debug? I mean, yeah, it is. Sorry. Yeah. CrashLoopBackOff. There we go. Yeah. Five seconds. etcd is I don't know why that PVC is not because the etcd is not running. I could yeah. I mean, there's not unless I take a different approach to get the configs onto the cluster and, you know, like, just reprovisioning the cluster, basically, uninstalled all of this and save. I mean, kubeadm reset is
1:43:57 one way to win Clustered. That's for sure. Save. Yeah. But it's sorry. I just, like, I just wanna I just fast enough. Alright. Let's call it. kubeadm reset was always going to cause a bit of chaos in the mix there. But Lee, why don't you tell us what was your break that you put together for us? Yeah. Well, inside of a Tanzu Community Edition cluster, we have an operator called kapp-controller. And if you go share screen and just go to stealthybox/clustered on GitHub. Alright. And, yeah. Here, I just had the static pods.
1:44:11 Time Called & Lee Explains His Breaks
1:44:57 It was just the renamed names. The notes was how I got to the cluster, which includes setting up the kubeconfigs, applying the RBAC, downloading some tools like vCluster, setting up the vCluster, installing kapp-controller, and then bootstrapping kapp-controller with the vCluster app. Mhmm. So if you go to the vCluster app YAML, then this points to this Git repository, to the subpath config, and it applies configuration relentlessly, just like our friend Rick Astley. And then if you go to the config directory, this is where our GitOps control repo starts.
1:45:50 So we have that stateful set. We've got the deployment, and these are pinned at the version one. And then we've got the Cilium network policy, which was bugged and failed in a way that was actually worse than what I was trying to do. And then yeah, and then there was the RBAC here as well. This is where that's managed. So yep. But, yeah. So just, you know, a little bit of GitOps, little declarative config. It's like, if you're gonna break something, break it well. Right? So Oh, hold on. So in that break glass,
1:46:30 you're binding it to the user uwu admin? Yes. What if changing the values in the admin.conf prevented that from working? Those values are purely cosmetic for your own use. Yeah. They're config database values that are unrelated to the administrator name that actually gets baked into the certificate's common name. Oh, it just yeah. Yeah. So it's baked into the X.509 data. But when you were doing your cluster role binding creates, there were times where you were trying to bind it to admin instead, just, I think,
1:47:12 out of, like, frustration. Yeah. I tried both. Which, that's just my bad. I missed some verbs or something here. Might need, like, a Alright. Well, those were both very boring clusters, but I learned an absolute lot today from both of you. So thank you very much for both breaking and fixing your clusters. Before we finish up, I'm just gonna draw the winner for our t-shirt giveaway. This shirt graphic is so cool. And our winners are here we go. Kira and Omar. Kira and Omar, congrats. Alright. And a long cluster
1:47:40 T-Shirt Giveaway
1:48:03 Clustered logo. Yeah. I'm really happy with the t-shirt. And I wanna see, hopefully, loads of people at KubeCon with these. It's gonna be awesome. And of course, on the back, we've got a thank you to our wonderful sponsors, Teleport and Equinix Metal, who help keep Clustered ticking along. All right. That was a tough one, both of you. Thank you again for joining me. We haven't seen vCluster used in a break, and I've not seen kubeadm reset used as a fix. So thank you both for bringing us the firsts. Clustered will be back next week with
1:48:37 PexiLab and for our cluster teams. But until then, have a great day and weekend, and I'll see you all soon. Thanks a lot. Thank you. Bye. Thank you, everyone. Thank you for watching Rawkode Live.