Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Troubleshoot a Kubernetes CNI outage by fixing a mistyped Cilium daemonset image and validating pod readiness.
  2. Track DNS and service issues by tracing resolv.conf rewrites, flushing host iptables, and restoring CoreDNS/App connectivity.
  3. Resolve etcd blockers with role checks, client certificate generation, and validating webhook override to unblock a stuck deployment.

Two broken clusters from Adyen and William. Debug a Cilium CNI image typo, chase a rogue service rewriting resolv.conf, then unpick etcd client auth, a validating webhook blocking a new ReplicaSet, and a server-side apply fix.

Chapters

  1. 0:00 Holding screen
  2. 1:19 Intro, Sponsors & Guests
  3. 2:48 Guest Introductions (Adrian & William)
  4. 4:12 Debugging Cluster 1 (Adyen's) Begins
  5. 5:44 Initial State: Pending Pods & NotReady Nodes
  6. 7:11 Investigating CNI (Cilium) Issues
  7. 17:55 Fixing Cilium Daemonset Image Typo
  8. 20:04 Cluster Healthy, App Not Working
  9. 20:30 Attempting App V2 Upgrade
  10. 21:09 App Connectivity Error (Postgres Lookup)
  11. 23:23 Debugging Network Inside the Pod
  12. 25:31 Investigating Cilium Config & Host IP Tables
  13. 31:14 Flushing IP Tables & Restarting Cilium
  14. 35:46 DNS Resolution Failure Confirmed
  15. 41:35 Investigating Worker Node Services (Hint 3)
  16. 44:06 Identifying & Fixing Rogue OMD Service (Resolv.conf Break)
  17. 49:37 Cluster 1 Fixed (App V1)
  18. 50:40 Debugging Cluster 2 (William's) Begins
  19. 52:32 Attempting App V2 Upgrade (ETCD Permission Denied)
  20. 53:40 Investigating ETCD Authentication Issues
  21. 58:01 Debugging ETCD Client Access (`etcdctl`)
  22. 1:09:03 Temporary ETCD Access Achieved
  23. 1:22:07 Hint: ETCD Roles, Users & Root CN
  24. 1:23:51 Attempting to Disable ETCD Auth
  25. 1:24:28 ETCD Fails to Start Due to Manifest Edit
  26. 1:25:14 Correcting ETCD Static Manifest
  27. 1:31:10 Generating Root Certificate for ETCD Client
  28. 1:34:47 Using Root Cert to Access ETCD (Auth Success)
  29. 1:35:58 App Still V1, Investigating Deployment State
  30. 1:37:45 Hint: Validating Webhook Preventing V2 Replica Set
  31. 1:40:09 Deleting Validating Webhook Configuration
  32. 1:42:08 Debugging Pod Image & Pull Policy Again
  33. 1:42:47 Identifying & Fixing Host DNS Issue (Resolv.conf Break Again)
  34. 1:45:19 App Still V1, Deployment Spec Mismatch
  35. 1:47:11 Identifying & Forcing Deployment Spec Update (Server-Side Apply)
  36. 1:48:02 Cluster 2 Fixed (App V2)
  37. 1:48:59 Wrap Up & Debrief

Transcript

Generated from the English captions.

1:19 Intro, Sponsors & Guests

1:19 We'll bother pushing the video now because I've accidentally just pushed myself to the front. Hello. Welcome to the Rawkode Academy. This is our second episode of Klustered this week, and we have two great clusters, two great guests, and I'm looking forward to fixing some very broken clusters. Thank you to everyone who's tuning into the chat. I'll get your messages up in just a second. But before we do that, I do wanna say hello and thank you to our sponsors. So since the beginning of Klustered, we have been using Teleport. And almost since the beginning, Teleport have been sponsoring Klustered.

1:54 And I really just can't thank them enough. I genuinely believe that Teleport is a product, an open source tool that you don't even need to pay for, that everyone should have in their infrastructure to commoditize access, facilitate pairing and debugging. You'll see us use that today. I also need to thank Equinix Metal. They continue to graciously provide all of the hardware that we use for these three node bare metal Kubernetes clusters with tons of RAM and tons of CPU. Is it all necessary for Klustered? Hell no. But is it fun? Damn right. So thank you Equinix Metal.

2:27 Alright. Let's get some chat messages on the screen and bring in our guest today. Push the button. There we go. I am having an absolute failure of the button presses today. I think it's because it's a Friday and it's like 05:30 and my brain is clearly just switched off. However, welcome, Adrian. Welcome, William. Thank you both for joining me again. You're both returning cluster clusters. Let's start with some hellos. We'll start with you, Adrian, and then over to William. Please introduce yourselves. Yeah. Hello. My name is Adrian. Been doing Kubernetes things for quite a while

2:48 Guest Introductions (Adrian & William)

3:08 now. Not anymore, which is interesting. Doing back to back end development since recently. Yeah, look forward to mess around with some clusters. I currently work at field. They are looking for infrastructure engineers. They told me to plug that if you they asked me, and I said, no. I want to do back end stuff. And, yeah, I'm looking for it. How to see you struggle and or not. Yeah. Awesome. Thank you very much. William? I'm William Lightning. I'm from Field Medical. I'm a systems architect, and my job is to fix things and be responsible for the infrastructure and

3:57 basically stuff like that. So I'm always looking forward to learning new things and learning about new breaks before I see them in production. Awesome. Thank you both for sharing with us. Alright. We're gonna kick things off with William. We're gonna be taking a look at Adyen's cluster. If you've not joined us before, the rules are simple. Fix the cluster via any means necessary. In order to fix it, all we mean is you have to be able to upgrade the Klustered application from v1 to v2. My screen is now shared. I have opened

4:12 Debugging Cluster 1 (Adyen's) Begins

4:33 a session via Teleport's command line application. And if you could join this session oh, I haven't. You don't even have access to the cluster, do you William? Nope. Alright. That's because I'm sneaky and I don't give you access beforehand. So I'm gonna come in and modify your role. Very professional setup I've got here as you can tell. However, it does mean I get to show off Teleport a bit more. So we restrict your access with node labels. And if I save this, refresh your server list and you should have access. And I will join from the web too

5:14 because we'll have them at some point and there's a weird bug. Alright. If you can just give me an echo hello. Say I have your kubeconfig. Check for a control plane and take it away and best of luck. Let's see here. Kubeconfig. I always forget. Does it have an underscore in there or not? No. It doesn't. It's just as is. That's perfect. Let's just start with a Nope. No control plane. Oh. There we go. One control plane. Good. Oh, it's halfway through migrating something. Alright. Well, that's Yeah. Let's see here. Should I go for rollout

5:44 Initial State: Pending Pods & NotReady Nodes

6:08 status? Alright. Let's just describe the deploy, clustered. Let's just look at what we have here. So we're still in v1, minimum replicas unavailable, progressing, progress deadline, okay. So let's go back. What if I roll back the deployment? Yeah. I think it's worth a shot. Roll back? Rollout and then yeah. I don't think we used do a --help on that. I can't remember. Okay. Undo. There you go. Undo. Okay. Let's check out our pods. So we also have pending. Everything's pending. Oh, rolling back just created another one. No, it's the same ones I Let's

7:11 Investigating CNI (Cilium) Issues

7:26 check out all namespaces. Let's just look around. So we got Ciliums in ImagePullBackOff. I wish people would stop messing with Cilium. What did Cilium ever do to you, Adrian? Come on. Well, Cilium is oops. Gotta describe get deployments of all namespaces and that should we got the Cilium operator. Describe Cilium operators. That's gonna be in kube-system. You can tell I'm not using tab complete. Making a lot of interesting fun problems to play. Let's try. Let's see if less messes with things. There we go. No. I no. I encourage less as much as possible.

8:35 It just means we see what you see. So Yeah. Operator generic. So that is from Cilium at least. Right? What's you know what? I'm just gonna go get rid of this. Let's just edit it because then I get them. And I get an actual YAML file, which is what I'm used to seeing. So type rolling update. So I'm wondering I guess we should look at the logs. That's probably what I should do. Well, what's the first symptom you wanna investigate here? Is it the pending pods? Is it are you just fixing Cilium because you see that it's

9:27 got a problem? Or, like, what's your Well, figured the pods well, if Cilium has a problem, then I can't get networking, which means the pods might not come up right. And so I'm I'm trying to skip past the pods being the problem and looking for root issues. But I'm not sure that's because it's let's see here. If I go get pods. Right? Postgres is still running, but clustered is not. And a hundred and thirty six minutes means that all those are fairly recent, But they're yeah. So I'm trying to figure out, like, do I wanna go back and look at the

10:11 networking or do I wanna look at pod? I suspect let's confirm the events. Right? We could describe the pod and see why it's not loading. Yeah. I think that's a good idea. I mean, I think networking is an issue, but the pending leads me to think maybe node unavailability, taints. Yeah. There we go. Got two. One had a taint the pod didn't tolerate, two nodes had taint not-ready. Okay. So the master, that's expected because we have the control plane. Right? But the workers are both not ready. Yeah. That see if control follow that. Let's

10:55 go look at our nodes. Yeah. They're broken. You asked me to break it, so I break it. That's Well, that's what we asked you to do. Right? I don't know. I'm always worried I'm gonna break it and then not know how I broke it. So CPU limits good. Memory limits good. Not that that's necessarily. Container runtime not ready. CNI plugin not initialized. So we're back at our CNI. Right? Our container network interface. Yeah. We are indeed. Alright. But at least now we know why we're looking into the networking. Yeah. Instead of just jumping ahead.
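The node check narrated above (reading the Ready condition and its message) can also be scripted. What follows is a minimal Python sketch, assuming the output of `kubectl get nodes -o json` has already been loaded into a dict. The node names and sample data are made up for illustration and are not taken from the episode's cluster.

```python
# Hypothetical, trimmed-down shape of `kubectl get nodes -o json` output.
SAMPLE_NODES = {
    "items": [
        {
            "metadata": {"name": "control-plane-1"},
            "status": {"conditions": [
                {"type": "Ready", "status": "True",
                 "message": "kubelet is posting ready status"},
            ]},
        },
        {
            "metadata": {"name": "worker-1"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "False",
                 "message": "kubelet has sufficient memory available"},
                {"type": "Ready", "status": "False",
                 "message": "container runtime network not ready: "
                            "cni plugin not initialized"},
            ]},
        },
    ]
}


def not_ready_nodes(node_list):
    """Return (node name, message) for every node whose Ready condition is not "True"."""
    results = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            if cond["type"] == "Ready" and cond["status"] != "True":
                results.append((name, cond.get("message", "")))
    return results


for name, message in not_ready_nodes(SAMPLE_NODES):
    print(f"{name}: {message}")
```

The message on the Ready condition is what points back at the CNI here, which is the same conclusion the describe output led to in the session.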

12:00 Okay. So then I should go look at my Cilium stuff. Right? I'm gonna say no. I don't know. Like, it's tough. Right? I don't know what's broken just as you don't know what's broken. And there's definitely a Cilium issue. The thing popping into my head is that all the nodes are not ready. We've got a whole bunch of pods in a pending and terminating status, which makes me think that we might not even have a kubelet, which is why the pods are so stuck on the Yeah. The control plane's not even

12:43 ready yet. That should be at least ready. Let's look at the well, let me look at that. Ripe node. Type two. No. Not available. You can look at completions. How I look. Left. Okay. So CNI plugin not initialized. Okay. So because that's not an /etc/kubernetes manifest. Right? That's not you don't usually put that in there. Trying to remember. Because Cilium is just a Helm chart you deploy into your cluster. Right? That is correct. Yes. So that means So it would have bypass. Right. Well, it means we need a kubelet to start Cilium and if there's no kubelet

13:48 on any of the nodes or the control plane node then we'll never get a working CNI. We do appear to have one. Okay. I don't know. Okay. So that IP does exist on the server. That's what I was looking at there. Maybe take a look at the kubelet logs. That's gonna be in systemctl. Right? Yeah. Correct. I don't know. systemctl. Oh, that's everything. Is it logs? No. You can do journalctl -flu kubelet. Flu, f, l, u. Sorry. My Scottish accent there. There we go. Okay. CNI, pipes, pod syncing, pod skipped.

15:15 Okay. What are we what are you? Okay. So it's definitely Isn't there a way that Cilium is skipping that not ready because the CNI plugin would not be ready before Cilium is ready. So this comes down to the way the CNI initializes itself, which I will say my knowledge of it is not great. But the kubelet will start the Cilium pod which will actually put some binaries onto the host. And as those binaries exist, then the CNI would be considered initialized. And I'm assuming that those binaries do not exist. So what you probably need to do

15:56 right now, because there's a working kubelet, is take a look at the logs and start debugging that Cilium pod. So the thing you were doing like five minutes ago and I dragged you down the wrong path, not the wrong path but I made you take the longer road, I think, there. So oops. Sorry. All about showing people things. If I'm going too fast, I can't show people. Yeah. Okay. kube-system. Cilium. Let's go you wanna go let's check. Yeah. You won't get let's say ImagePullBackOff. Don't know if you said that. Yeah.

16:40 So you need to describe operators what you're talking about. I know that the operators are actually okay, but if you look at the three Cilium pods on the three nodes, they're ImagePullBackOff. So I think we'll need to describe those pods. Typing while people are watching is the fun thing. Right? Yeah. So I think this is okay. So if I remember correctly, this is the right image. But I'm wondering if he did something to the actual worker node to make it so that it couldn't pull the pod. I think that's a typo in the image.
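Spotting which container is stuck pulling, and which image string it is trying to pull, is the step being worked through here. Below is a minimal Python sketch of that check, assuming `kubectl get pods -n kube-system -o json` output has been loaded into a dict. The pod names, container names, and image strings are invented for illustration; the transcript only establishes that a stray "x" was in the image, not where.

```python
# Hypothetical, trimmed-down shape of `kubectl get pods -o json` output.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "cilium-abcde"},
            "status": {
                "initContainerStatuses": [
                    {"name": "clean-cilium-state",
                     "image": "quay.io/cilium/cilium:v0.0.0",
                     "state": {"terminated": {"reason": "Completed"}}},
                ],
                "containerStatuses": [
                    {"name": "cilium-agent",
                     "image": "quay.io/cilium/ciliumx:v0.0.0",  # made-up typo
                     "state": {"waiting": {"reason": "ImagePullBackOff"}}},
                ],
            },
        },
        {
            "metadata": {"name": "cilium-operator-xyz"},
            "status": {
                "containerStatuses": [
                    {"name": "cilium-operator",
                     "image": "quay.io/cilium/operator-generic:v0.0.0",
                     "state": {"running": {}}},
                ],
            },
        },
    ]
}

PULL_ERRORS = ("ImagePullBackOff", "ErrImagePull")


def image_pull_failures(pod_list):
    """Return (pod, container, image) for every container waiting on an image pull error."""
    failures = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = (status.get("initContainerStatuses", [])
                      + status.get("containerStatuses", []))
        for cs in containers:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_ERRORS:
                failures.append((pod["metadata"]["name"], cs["name"], cs["image"]))
    return failures


for pod, container, image in image_pull_failures(SAMPLE_PODS):
    print(f"{pod} / {container}: cannot pull {image}")
```

Checking init containers as well as regular containers matters for the same reason it came up in the session: the init container image looked fine, and the typo was in another image.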

17:28 And part of me just wants to grab the Helm chart for Cilium and redeploy it. No. Let's do an edit of the Cilium deployment. Okay. This is, so, quay.io. You think there's a typo in that image there? Not there, but at another part of the image. So I think that init container is okay. But look at the message at the bottom there. Cilium x. Okay. Alright. So let's check out the deploy because that's gonna be what has it. Right? Yeah. I think so. Because the pod itself is just gonna

17:55 Fixing Cilium Daemonset Image Typo

18:13 I think it's a DaemonSet. Oh. There we go. So I see okay. So let's go find what we want. You said there was an x in there. Okay. There's the x right there. Yeah. I would just remove it. Just remove it? Just the x. Yeah. Alright. Do we think he's only changed one image? No. But what do we have here? So warnings, but none of those should stop us. Okay. No. Those look I don't wanna say safe, but they're not issues yet. Oh, there we go. Problem one fixed. And look, all our Kubernetes came

19:26 back online and Wow. Reconciled. Amazing. All of our nodes are happy. Oh. That's it's like dangerously positive. Like okay. So can we get to the website? I guess that's the question. Can we get to the existing one? You can curl on localhost port 30000. That one? So the answer is no. Well, I'll just confirm with one of the workers. Oh, yeah. That's true because I'm on a control plane. It is not working. Yep. It's timing out. So not quite there yet. Well, do we wanna save a little time and update our deployment so we go to

20:30 Attempting App V2 Upgrade

20:33 our v two clustered? But once we do get things, we're gonna roll out. Sounds good to me? No. No. Go for it. Alright. So we're gonna go too many v ones in here. There it is. We'll replace that with a two, and we'll write this out. And in fact, I'm gonna take off my old namespaces switch. I love watching. So there's our error message. Well, it was a very long time out. But we can see It's failing to look up Something's happening. Right? Yeah. It's failing to look up the database. Do we have any network policies?

21:09 App Connectivity Error (Postgres Lookup)

21:24 Cilium supports network policies. Right? It does indeed. Yeah. K. It also supports cluster-wide network Exactly. Yeah. CCNP. K. I'm going to type this correctly. NP. Yeah. Oh, network policy thinks so. Okay. No resources found. Okay. So our PostgreSQL and our clustered are not talking to each other basically is what it comes down to right now. Yeah. Right? So, like, clustered pod expects there to be a service called postgres. Let's look at our services. There is a service there. That's good. Age is a bit different than everything else. Always a bad sign. Always.
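Three policy kinds are in play in this exchange: Kubernetes network policies, Cilium network policies, and Cilium cluster-wide network policies. As a small Python sketch, assuming the Cilium CRDs are installed under their usual resource names, the commands that cover all three across every namespace could be assembled as below. The helper name `policy_check_commands` is hypothetical.

```python
# Resource name -> whether the resource is namespaced.
# The two Cilium names assume the Cilium CRDs are installed.
POLICY_KINDS = {
    "networkpolicies": True,
    "ciliumnetworkpolicies": True,
    "ciliumclusterwidenetworkpolicies": False,  # cluster-scoped, no namespace flag
}


def policy_check_commands():
    """Build one `kubectl get` argument list per policy kind, covering all namespaces."""
    commands = []
    for kind, namespaced in POLICY_KINDS.items():
        cmd = ["kubectl", "get", kind]
        if namespaced:
            cmd.append("--all-namespaces")
        commands.append(cmd)
    return commands


for cmd in policy_check_commands():
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run` on a machine with cluster access. Building the list up front makes it harder to forget a kind or the all-namespaces flag while under time pressure.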

22:26 Let's go check it out. Actually, probably ought to describe it first. Throw that in less so we can see. Oh, there's not a whole lot there. That's not the right port. I know that right off the top of my head. 5432 is not right. You do 8080 on yours. That's for the web application. Postgres does run on 5432. You're right. You're right. Okay. Dang. Oh, I was gonna It does have endpoints with which makes me think that the service might be okay. Okay. I mean, I say that may, of course, that's my, like, a JLC card, but it

23:10 looks alright for me. Okay. What's our next hypothesis? Well, I was wondering if I could I know kubectl debug is new stuff. Does your clustered pod have a shell that I can get into? Yeah. I leave bash in there, so you should be okay. Okay. Gonna be our pod here. But you will need to -it. I probably need to do that before the pod. Yeah. I think so. Oh, I love this. My home and end key work. Mac always broke my home and end key. Of course, I spoke too soon now that

23:23 Debugging Network Inside the Pod

23:57 I've got some weird stuff going on with the shell. Oh, no. I'm inside of it. You are. Yeah. Okay. I'm inside. Yeah. That worked. Okay. So I think yeah. You should be able to do an apt update and install whatever you need. I'm not harsh. It's not distroless. It's not scratch. It's a although there's an issue. So no networking at all. Yep. That is a bit surprising to me as well, but it's fine. Or is it? Wait. I haven't I would have expected networking to work in

24:49 in the pod, interestingly enough. Net networking in Kubernetes can be challenging. So Well yeah. Alright. So we've got no egress from this pod whatsoever, which would explain why it's unable to speak to Postgres. So we need to track that down. Well, I didn't start on that one. Because we looked for network policies. We looked at all that stuff. Well, let's look at the logs real fast for that pod. Right? There's nothing. Should get us. Cluster clogs nothing. I I I keep every time someone tries to look at the logs, I'm like, I really need to add

25:28 logging. This is so unfair. Cilium has a ConfigMap. I would maybe suggest there. Oh, yeah. That's good. Yeah. I see there's gonna be the Cilium configuration or iptables on the hosts perhaps. I mean, we'll need to check both, I think. Okay. So cilium-config in kube-system. And we're gonna go for cilium-config. I cannot spell Cilium, so I'm gonna cut and paste. Show me the channel. Look at all that fun stuff. Nobody's ever broken Cilium on me before, so this is new. New fun stuff. So it's not in debug mode.

25:31 Investigating Cilium Config & Host IP Tables

26:29 Auto direct node routes. From here is I don't have a known good version of this to compare against. Do any of the flags sound ominous to you? Well, always sounds ominous, but I think that one's safe in this case. IPv four true, IPv six false. Policy's default. If you need hints, by the way, there are three hints in the home directory. You don't have to use them yet. They they are just there, just so you know. I'm not completely evil. Just a bit. Well, I'm I'm only three minutes to my halfway point. Yeah. Exactly. So

27:24 let's see here. So the only thing in here that I can never remember and we kinda glossed over it there is the enable policy default. Now I think that's actually correct at default. If my understanding is correct then everything is allowed until a policy exists, at which point it blocks all until allowed or something like that. So I don't think it's Cilium, but I'm not 100% sure. Am I allowed to bring up the config on my other cluster and look and compare it? I mean, you would do that in your

28:18 real job. Right? Yeah. So go for it. Always go for the known good and comparing the Exactly. Good with the Yeah. Feel free to dump it out on your own and then drop it into a file on this terminal and run a diff and see. Okay. Let's see if I can figure out how to do that. Oh, I like that. Run a diff. Run a diff is always the best way to go. Okay. So I just did yeah. I can just cut and copy this command. I don't even need to Oh, a really cool comment from Walid as well in the

28:48 chat. Just while you're playing with our cluster is we do. Did we add k as an alias? I don't think it did. It's not deployed. Oh, well. Well, you just said Hubble was enabled on the config. Well, he's right. However, it is not deployed to our cluster. I can't remember if I You don't enable Hubble, though. Right? I've never seen you do that. I should though. It's super cool for seeing why network traffic is not going where you want it to go. Is that the UI thingy from Cilium? It is. Yeah. The other

29:25 thing we could do is use the Cilium CLI tool to run a status and a you know, everything that Hubble does, you can do with the Cilium CLI as well. Is that installed? No. You'd have to go into a Cilium pod and run a Cilium status. But stick with your diff first. I think that's like that feels like something I would do and you would do it in real life, so let's try it. Okay. So there it is. So now we're gonna take this a bunch of things are gonna actually diff on this. I don't know. Well, unknown unknown bads. Right?
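Comparing the suspect ConfigMap against a known-good one is the technique being applied here. A minimal Python sketch of that comparison follows, assuming both `data` sections have been loaded into plain dicts. The keys echo the flags read out in the session; the values, and the tampered example, are made up for illustration.

```python
# Hypothetical `data` sections of two cilium-config ConfigMaps.
KNOWN_GOOD = {
    "debug": "false",
    "enable-ipv4": "true",
    "enable-ipv6": "false",
    "enable-policy": "default",
}

# Made-up example of a tampered copy, to show what a non-empty diff looks like.
TAMPERED = dict(KNOWN_GOOD, **{"enable-policy": "always"})


def diff_config(known_good, suspect):
    """Return {key: (known_good_value, suspect_value)} for every key that differs.

    A key missing on one side shows up with None on that side.
    """
    changes = {}
    for key in sorted(set(known_good) | set(suspect)):
        if known_good.get(key) != suspect.get(key):
            changes[key] = (known_good.get(key), suspect.get(key))
    return changes


print(diff_config(KNOWN_GOOD, KNOWN_GOOD))  # clean: nothing differs
print(diff_config(KNOWN_GOOD, TAMPERED))
```

An empty result corresponds to the outcome in the session, where the diff came back clean and Cilium's configuration was ruled out.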

30:03 Yeah. Okay. None of those things are gonna cause us any problems. That is clean. So it's not Alright. Well, it's not Cilium. I'm mildly confident it's not Cilium. So how is your Linux networking knowledge? It's old. Wow. I remember that command. That's always good. Let's see here. I need to go on the left for this. I mean, just looking at the iptables and seeing in the firewall, is there anything weird here? Anywhere. So one of the things I encouraged last week on an episode was just to flush

30:56 the tables and let kube-proxy and Cilium put them back. However, that did get us into more trouble. But it's normally safe to do that. So you could do that if you wanted. I think it would be okay. I think so too. You think so too. Yeah. So just do an iptables -F and that'll flush all the rules and then Cilium will put back the ones that it needs for networking to work. If you think it's a firewall thing at least. Okay. iptables -F. F for flush.
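To see what a flush actually removed, it helps to count rules per chain before and after. The sketch below, in Python, parses text in the format printed by `iptables -S`. The sample rule text is made up. Two details worth keeping in mind: `iptables -F` with no `-t` option only flushes the filter table, and it removes rules (`-A` lines) while leaving chain definitions and default policies in place.

```python
from collections import Counter

# Hypothetical `iptables -S` output before a flush.
BEFORE = """\
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-A INPUT -j KUBE-FIREWALL
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m mark --mark 0x8000/0x8000 -j DROP
"""

# After `iptables -F`: policies (-P) and chains (-N) remain, rules (-A) are gone.
AFTER = """\
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
"""


def rules_per_chain(iptables_s_output):
    """Count appended rules (-A lines) per chain in `iptables -S` style text."""
    counts = Counter()
    for line in iptables_s_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "-A":
            counts[parts[1]] += 1
    return dict(counts)


print("before:", rules_per_chain(BEFORE))
print("after: ", rules_per_chain(AFTER))
```

Running the same count again a little later shows whether the networking components have repopulated their rules, which is the question raised right after the flush in the session.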

31:14 Flushing IP Tables & Restarting Cilium

31:31 Yep. And then if you run -L again, it should be there we go. We have clean iptables. So but that should come back automatically. Right? Because it'll take there's noted tables and then Well, I mean, this is better than last week because we got kicked off the machine last time. So you're still able to type commands, which is cool. So Yay. So you could try your curl again, I suppose. Well, if there's no rules on iptables, that's not good. It's not bad either. Right. Yep. Is it? We might have to restart

32:10 Cilium, I think, for the rules to be. But I don't think we need the rules just yet. I don't know either. Oh, well. We're still getting a table now. So yeah. It's still saving that. Oh, well, better than what I got last time. Okay. Let's kill off the Cilium pods and see what happens. When in doubt in Kubernetes, kill the pod. That's what I learned from you. I'm really glad that Klustered is teaching people these viable production skills. Delete. Yeah. Just smash, smash, smash. Yeah. You can use that or you can copy paste it. Go for

32:53 that. Can you start with that? No. But you could do -l app equals cilium and that would also do it. Right. Pod. You've got delete delete. Oh, I do. Just when one delete is not enough, go with two. That's a good rule too. I so badly want to like involve my Google voice and like breaking the cluster live for you guys. That would have been fun. Yeah. Where you could just say, okay thing destroy. I was gonna be like, okay thing, engage or activate contingency plan alpha. And then it said, sorry, Dave. I can't do that. I'd

33:44 be like, oh, okay thing, sudo activate contingency plan alpha. I mean, that would be risky though. If you wait till your opponent runs get nodes and get pods and it looks okay, and then you're like, go. Cluster chaos. Okay. So Cilium came back up. I'm gonna make that happen, Russell. Definitely. Sorry. I was just funny. We have actual rules. Okay. Our application still cannot reach the database. We still have no networking at our pods. Alright. We've got sixteen minutes remaining. Maybe we could peek at the hint, the first hint? I think we yeah.

34:34 I mean, what do you think? What are you thinking? Well, I'm wondering, did he mess with CoreDNS? Because that would but that would also because the pods use that for network resolution. Right? I mean, you could test that. You could try pinging. Oh, we don't have ping. You could There was a I can just run a container and see if that container explodes. Right? Yeah. So if you want to confirm whether this is DNS or not, you can jump back into the cluster pod. We can try and do something with an

35:17 IP address rather than a DNS name. Right? Yeah. I'll share that message from Carlos as well, actually. We do have a bit called Rawkode Discord run by Carlos, Walid and others, join at Rawkode.chat. Sorry, William. Okay. So you're back in the Klustered pod. Except for I need to do one thing first. That service should have an IP. Right? You can't ping a service IP, but you could ping oh, but you don't have access to ping anyway in the container. So you're gonna have to, like, curl a public IP address. Okay. Let's see here. So why don't you sorry.
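The test proposed here, reaching an IP address directly instead of a DNS name, separates "no network at all" from "network works but names do not resolve". A minimal Python sketch of that decision follows. The function name `diagnose` and the result strings are hypothetical; the resolver and connector are passed in as parameters so the logic can be exercised without touching a real network.

```python
import socket


def diagnose(hostname, known_ip, port=443,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Classify a failure as DNS-related or general connectivity.

    `known_ip` is an address obtained elsewhere (for example from a host
    where DNS still works), so the IP test does not depend on resolution.
    """
    try:
        # A 3 second timeout keeps the check from hanging on a dead route.
        connect((known_ip, port), 3).close()
        ip_reachable = True
    except OSError:
        ip_reachable = False

    try:
        resolve(hostname)
        name_resolves = True
    except OSError:  # socket.gaierror is a subclass of OSError
        name_resolves = False

    if ip_reachable and not name_resolves:
        return "DNS problem: raw IP traffic works, name lookup fails"
    if ip_reachable and name_resolves:
        return "network and DNS both look fine"
    if name_resolves:
        return "name resolves but the IP is unreachable"
    return "no connectivity at all (DNS may or may not be broken)"
```

Called from inside a pod as `diagnose("postgres", "<an IP known to be reachable>")`, the first result string would match the situation the session arrives at shortly afterwards: traffic to an IP works, lookups do not.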

35:46 DNS Resolution Failure Confirmed

36:05 Exit this container and then do a ping google.com, grab the IP address, jump back in the container, and then try that. Oh, you got IPv6. I did not do anything IPv6, which I'm very proud of. Yeah. It's very tempting, but I resisted. Yeah. That's not gonna work. Yeah. Oh, yeah. But you can call 4. Yeah. 6. 4. There. That works too. Oh, yeah. Didn't know that. That was the thing. That's cool. Yeah. They have one so that if you have problems with IPv6, you can jump in and talk to Google directly

36:43 on that. Nice. So let's curl that or It worked. We'll do a -v and a -k, which will just get past Oh, yeah. It's like thing. Ignore the do even need the p? You don't need the p. Okay. Well, there's a v which is for verbose. But you've done something cool. You've confirmed this is DNS. It's always DNS. Right? When in doubt okay. So if we go out of curiosity, google.com. Yeah. Okay. Nothing. Okay. Let's see here. Thirteen minutes left. And a DNS look at that hand. Okay. So DNS is running but is ten days old. So he didn't do

37:47 question is, what did he do to it that he didn't modify it? Or did he fake it out? Well, that just yeah. That's a pod that's ten days old. Right? So he's not restarted the CoreDNS pod, which makes it unlikely that he has modified the CoreDNS config. So this is a host DNS problem? Well, which would be on the workers. Right? Because but that it never gets to the host if it's on CoreDNS because it would talk directly to it. Right? We're not getting off of this

38:31 system. We shouldn't be getting off the system to talk to Postgres. Well, the way that these clusters are configured is that I mean, he could modify the hosts file on each of the machines, or he could have set up systemd-resolved or something like that. So postgres is the name. Right? Yeah. Try fully qualified. So postgres.default.svc.cluster.local. Okay. Do we have dig in this container? Of course, we don't do that. That'd be handy. No. No. I don't know. And dig is not there's a package for it. It's always it's a

39:14 package in every Oh, yeah. Or bind-utils or Okay. I think with eleven minutes left, we should probably look at the hints file. Well, our audience is telling us ClusterFirst is an option. But because the pods didn't restart, I don't think it is. Get service. I could've just deleted the service. It's Ten days old. kube-dns. Oh. And we don't use ingress for any of this stuff. Did we check policies in all What? Namespaces? The postgres service was modified though. That's interesting. Was that us? You already checked that. What's the first thing

40:13 we looked at? Yeah. It looked okay. Yeah. Alright. Let's check for network policies in all namespaces. Okay. Oh, yeah. Because the name were yep. That makes sense. Because we checked netpol, but we didn't check and we checked CCNP, but I don't think we checked CNP or all namespaces and netpol. Yeah. And netpol. Okay. That goes that. Alright. We're under ten minutes. Yeah. Let's go for the hint. Let's do it. Alright. Hint one. Okay. We did that one, Cilium. Yep. That one's already complete. Don't mess with Sorry. Damn it, Adrian. Is there a way to dump the BPF

41:14 thing so Cilium can recreate it? There's also a hint three, I would look at hint three as well. Yeah. Let's If you okay. Let's narrow this down before we start the hunt. There's just one more problem to fix with the cluster. There's just one. So Ah, okay. Let's take a look at the systemctl services. So I think Adrian is running a program. So he's not using firewalls to block our traffic. He's using a probe or something. Alright. So you're gonna tell me about this because I systemctl status. So this will list all of the

41:35 Investigating Worker Node Services (Hint 3)

42:05 the unit files. Okay. So irqbalance, containerd. Yep. So that looks good. Those look fine. We're just looking for something suspicious. Unattended upgrades. Teleport's Yep. Those look good. Is that the end? That's the end. Alright. We must have missed the one. Oh, no. You have to look on the workers, not the control plane. Uh-huh. This is another yeah. So Okay. Yeah. Okay. So I really have to SSH into the control workers. Can you get me there? Uh-huh. Alright. I'm on worker one. Feel free to pop, join that session. Oh, okay. Let's see here.

42:57 Sessions. Worker one. Join session. Okay. Echo. Yep. Good. I see it's live. Export. Oh, you just need to look at systemd. I don't care about the kubeconfig. I don't want there's a kubeconfig on the work Oh, that won't matter. Yeah. Because it doesn't have access anyway. Alright. So systemctl. And then what was the Status. Status. And then you just have to remember every service from the last one. containerd, irqbalance. There. Actually, this had its own. Yeah. It's got its own pager. There we go. That's better. What do we got? Why is there a serial?

43:47 That's okay. Toolkit service, network dispatcher, multipath, accounts daemon. Yeah. Let's go there. There's lots of stuff here that wasn't there before. Out of memory daemon. That's the end of it. Oh, I'm not seeing Oh, it used the pager. Yeah. Hold on. Let me try that again. Okay. Yeah. Paging on my screen. Yeah. It's been weird for me, but that's okay. Alright. So in a system, irqbalance. Okay. So that's ambassador. Cilium agent, Cilium health responder. It looks less obvious in this systemctl status than I would have hoped. There's at the alright. Let's see if we can just see

44:06 Identifying & Fixing Rogue OMD Service (Resolv.conf Break)

45:00 a rogue process. Sleep one. Who's sleeping? That's a good question. Maybe that's important. Alright. So let's do that with a tree. We have here, session ID. Oh, that's your grep. There's grep in this the journal stream. Oh, yeah. I should because I'm grepping. Yeah. You're right. Sorry. Yeah. My bad. Oh, okay. I should have done the grep with a -B or whatever it is. -B 5 will show you five lines before. You could do that. Yeah. I was gonna throw it in less and just search it out. Let's do a -B 5, then you get the

45:55 before five. Right? So sleep. Oh, so it's the systemd-oomd. Yeah. Okay. It's great. Ominous. So we can go to and here and this is your systemd unit files. So I guess you wanna be looking at an oomd. Right? Or or whatever you want. Yeah. But I don't see oomd. What? It's systemd-oomd. Need to check ExecStart. I think that binary is a script. Well, let's go check it out. Okay. But don't cat it, please. We've already had that this week. What? I'm showing my inability to use Vim.
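For readers following along, this is roughly what the "-B 5" trick does: for every line that matches, also show the lines that came just before it, which is how the parent of the rogue `sleep` shows up. A minimal Python sketch of that behaviour; the function name `grep_before` is made up for illustration and it only approximates real `grep -B`.

```python
from collections import deque


def grep_before(lines, needle, before=5):
    """Return each line containing `needle`, preceded by up to `before` lines of context.

    Hypothetical helper for illustration; it approximates `grep -B <before>`.
    """
    context = deque(maxlen=before)  # rolling window of the most recent non-matching lines
    out = []
    for line in lines:
        if needle in line:
            out.extend(context)  # emit the context that led up to the match
            out.append(line)
            context.clear()      # do not repeat context for the next match
        else:
            context.append(line)
    return out
```

Fed a process tree, `grep_before(tree_lines, "sleep", before=5)` would return the `sleep` line together with the parent entries printed above it.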

47:05 Quit. That makes you human. Okay. There we go. Oh, it isn't too bad. Oh, yeah. But if it was a binary Oh, there's gonna be a tool. Okay. Here. I can solve this problem. So, I mean, you can delete the script, or you could just disable the service. Or I just change this to false, and it'll never run. And then I can restart the service. Like that? I think it's very polite to leave his rogue script service running. Thanks. Thanks. I need that for tomorrow. Okay. So but the thing is that's We might have to restart Cilium for things

48:00 to get Yeah. Because it's it's adding stuff to Right? Yeah. So okay. Well, let's let's try the easy route. We're running out of time here. Oh, you can't do that here. He only needs to jump back over to the control plane. Oh, no. Okay. I'll do some quick typing while you come over. Oh, I like it. You're the label. Oh. Okay. Yeah. Also, just because we've got two minutes, make sure we get pods -o wide and make sure we fixed it on the right node and we haven't. Oh, no. So let's just delete the Let's

48:49 just hope it gets scheduled We can hit the notes that it won't it won't run on the other node? Let's just cheat. I think it's the So where's template? Spec. Oh, okay. I'm gonna have to do this on here. Templates back. RNVP worker one. Cool. Okay. So that's not work. Hit the right IP address of that worker because the other one is still gonna be funky. So yep. I think it did. Oh, and you have to move Postgres call too. It worked. It worked. Alright. Cool. Postgrescal's still working on the wrong one. Wait. Fifty seconds left. Yep.

49:37 Cluster 1 Fixed (App V1)

49:49 Wow. Added. I mean, it's harsh harsh. I wanted to do something nasty with but I didn't know what. So then I just decided, I'll just delete all the programs that Cilium inserts. I don't know what that will do because I don't know how BPF works, but I suppose deleting all of them should break things. But I'm glad you opted for the delete because if you had put something in, I don't think we would have known how to delete them. Yeah. That would have been a lot trickier. So Yep. Alright. Nice one. Good job, William. Yeah.

50:29 Yeah. It was very, very cutting very close, but we made it. Fifty seconds on the timer. That's good. Alright. Let's pop open. The boot is on the other foot. So control plane one. Okay. I've just opened a session, Adrian. Okay. Active sessions. Yep. Got it. So set up your kubeconfig. Best of luck. William and I will be here. K. Access denied. So it is there, but I cannot connect. Did you create your own session instead of joining? No. I joined the new session. You didn't change his role. Oh. Yep. Wait. No. You Well and now and now

50:40 Debugging Cluster 2 (William's) Begins

51:24 the session also disappeared. It was there, and now it's not there anymore. Yeah. But I didn't give you access to those machines, actually. The the the session shows up even though you don't have access. You just can't connect to it. Oh. I thought That's weird. Okay. Well, it did, and then it didn't anymore. Some weird race condition thing. Okay. You should be good now. I'm in. Mhmm. It's a good thing Teleport makes that so easy. I really love Teleport. It's such an amazing tool. Hey. Working cluster. Job done. I'm going to the pub. I'll see you later. We're done. Okay.

52:06 So this also works. It does. Okay. Let's pull it up just so we can see the watch. I'm so happy that it's working. Hello. Yep. There we go. Watch. Watch. Watch. K. So let's edit the deployment. Right? So let's update. It should be easy. Right? It's Friday night. We're deploying. Oh, it was the best time to deploy? Yep. Yeah. Wait. My exactly. No. Oh. It's been a while since we've seen an etcd error message. Okay. So could not be patched, etcdserver permission denied? Interesting. Do we have etcd pods running? We do. Some stuff restarted. That's also on the s.

52:32 Attempting App V2 Upgrade (ETCD Permission Denied)

53:22 But it was before the cluster what? Fourteen hours ago, but the age is 13. Okay. This is I'm gonna ignore that for now. Routing header. Yeah. When okay. So something something etcd. But we can read from etcd because all these commands are working. Well, I mean, we haven't tried to schedule anything new, so we're not really sure. Well, that won't work. So we cannot schedule something new. I think we'll need to take a look at those logs. Can I just do that with they're running as a pod? Right? And Yep. We're about to

53:40 Investigating ETCD Authentication Issues

54:25 Great job. Yeah. Failed to apply requests. Okay. Permission denied. Permission denied. Okay. Okay. So it does not have the permission to edit things. Could you have messed with the certificates or something? So it does say auth permission denied. So like, let's and there's a cool bit of history, which I think is cool. You see the key on the etcd thing here? It says registry minions, because that's control plane one. That's because SaltStack used to be part of the original deployment pipeline for Kubernetes. Ah, interesting. I didn't know that. Okay. So I don't see any files changed, any certificates changed or anything.

55:33 That was my first gut feeling. Like, maybe the client certificate was removed or or a new client certificate was made that's I don't know. But that's not it. Okay. Does etcd have any kind of ACL features? I don't know as much about etcd. No. I'm pretty I mean, I may have this terribly wrong, but my knowledge of this at the moment is that it's just mTLS certificate level. So the controller manager, the API server, the scheduler, etcetera, all have their own certificates that will be passed in via the static manifest to speak to etcd.

56:13 Where are it's good that you said it. I see the API server etcd client certificates, but where are the others? Oh, wait. Only the API server talks to etcd. Right? The others all talk to the API server. Yeah. Okay. Yeah. Okay. Never mind. I was like, oh, I found it. So but I would maybe go take a look at the manifests as well for the API server. This also all looks very not edited, which in terms of modification date. Well, all the etcd flags are at the top. Okay. So etcd-cafile,

57:07 that makes sense. Cert file, key file. Oh, the thing is it is connecting because we're getting data out of the API server. So Mhmm. We've established that the certificates have not been modified. Well, we don't know for sure, but the modification times suggest that they haven't been edited. But sure. I worked very hard to make sure you could still see things. Yeah. I see that. How is your etcd knowledge, Adrian? Zero. I know that stuff gets stored in there. Alright. Well, let's get etcd cheat sheet Kubernetes. I never remember how to configure the client,

58:01 Debugging ETCD Client Access (`etcdctl`)

58:05 but there's, like, three or four lines that we need. And I always get it on the same website. Oh, yeah. Oh, let's you go. Oh, we need to get the etcd client installed first, but we can do that. I know you're an XOS person, so I'll do the typing of the apt install. So just so you don't feel too dirty. Oh, yeah. Alright. There we go. If we get the same error message. So this is an etcd thing. Oh, but here we get the error message on a get, which is interesting. Right? So we can't read either.

58:50 But why does the API server still show things even though it can't read? That's odd. Right? The only other etcd command I know is etcdctl health. Sounds like Nope. Maybe not. State status? I thought it was health. I know one more. It's called dash dash help. And well, oh, look at this. There's something with users and roles. And auth. Oh, yeah. There's lots here. Oh, endpoint health and endpoint status. Okay. I was almost there. Do you want to try auth disable? Sure. So I don't know. I just I see it sitting there, and I'm like,

59:47 sure. We don't need authentication. Oh, but it says username not found when I do that. It's okay. But it doesn't help us. Okay. Can we list all users? Will that work? No. I already tried, it gave the same error. Alright. What about role list? Let me just type that. Username not found. Alright. Why don't we just try --user root? Oh, perfect shot. Adrian's back. Hello? Yeah. I was talking to you. I didn't actually realize you dropped. So Yeah. Yeah. No. You're like, etcd? Fuck this. I'm away. Exactly. I gave up. So let's go let's

1:01:05 go look at the keys. That'll have information on it. Right? Okay. The keys. What do you mean? The So if we go to pki/etcd Oh, those keys. Yeah. Sure. Yeah. Like, these are certificates that we can try and remember the openssl commands to get some information out of these. Yeah. Perhaps. And SSL Dash n. Yeah. Wow. I remembered this. I'm so proud of myself. I had to look that up. Yeah. Does it have anything? Okay. The subject alternative names looks good. Doesn't look so should be able to connect with local

1:01:59 host to the control plane to the control plane's IP address. Does that match? 214.5. Is that what we see here? Yeah. This all looks very not weird to me. Well, it it it makes you Yes. Do we want to check whether it's actually signed by the CA? Don't know how to do that. But Well, there's an interesting comment from Colin in the chat who says it seems fishy that the username has a space in it. Although, he doesn't know much about etcd or if that's expected. No. I don't think there I don't think there are usernames when you're doing mTLS.

1:02:59 Right? It's just certificates. Yeah. So I'm I'm not sure. Again, my etcd, I wish it was better than Okay. But let's let's what I wanted to look for is a way to check if this thing is healthy. Endpoint status and endpoint health. K. Assuming that let's just run those commands. Okay. Okay. I'm curious, William. If we look at the hints, will it guide us? Or have you just left us here to to I've just left you here. I've got notes so I can drop in hints and help if we get there. I figured, you know, if we get down on

1:03:57 time a little bit. Yeah. We can work this way. We're smart people. We're not we're we're We're doing good. Yeah. What what is an alarm? And we can disarm it. I don't know. Was there an alarm? The alarm list? Yeah. There's no alarm. Yeah. K. Don't know what alarm is, but I was like, it has a disarm command that sounds promising. Yeah. That's such really annoying thing that we've seen before where you can set the maximum size of the database just so it can be exhausted very easily. And even if you change the size, you

1:04:31 still have to manually disarm the alarm before it'll even work. That bit me. So this is a tricky one. I'm not. Okay. Let's let's run that etcdctl get command again. Right? Let let's work out what our error message is and see if we can work work out what's going wrong. Okay. I was like, what if I just try to put in a new key if that works? Okay. Did you say? Repeat what you said. We want to look at Well, we got it's been so long there. It feels like so long. But we got we did get an error message about the

1:05:10 user when we try to when we do the roles and user list. Right? So, like, user list. Yeah. It says username not found. But I think that's because we don't have anything here maybe. But so we are that's I think if you're doing standard authentication. Right? So let's let's see our setup for etcd that we stole from the Internet. Right? Yep. No. We've lost I've lost my buffer. Here it is. Now I've minimized Chrome. Hold on. We'll get there. We'll get there. Alright. So this is gonna be important. So set up etcd. Now we're set in

1:05:56 I need to look at it for cat. We're setting up that we want API version three. We're pointing it to the CA certificate so we could validate that that's not expired or anything, which may be good We're using a weird Oh, we're using a health check. That that's that's not the right one. Right? Okay. So let's use something with more permissions. Yeah. The peer certificates, I suppose. Right? Or even just the Yeah. Well, sir, I don't know. Can can the server that's the server. No. We want the peer one. That's the one that the API

1:06:33 server probably uses. Oh, we can confirm that, can we? Yep. Here. Okay. And I didn't plus x it, so you only have to source it. And we have to source it anyway because otherwise we don't get still permission denied. Okay. So let's look at the static manifest of the API server and see what certificate it uses just just to confirm. Sounds good. It is using apiserver-etcd. That's wrong, right? That doesn't exist. Should be peer.crt, right? Or oh, no. Wait. Peer is just okay. I always get these confused. Yeah. Think peer's for

1:07:52 other etcd members. Yeah. So we do have an apiserver-etcd-client. Cert, thank you. Yeah. Let's actually let's edit this Why is there an apiserver-etcd-client and an apiserver.crt? But can we just see the timestamps of all these files? Well, that all looks fine. Why have we got two sets of API server? In fact, three sets of API server stuff. Is that normal? I'm I'm starting to doubt everything. Yes. Yes. So I think apiserver is the serving certificate. The apiserver-kubelet-client is to talk to the kubelet server to get the logs.
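The "three or four lines" of etcdctl setup being sourced here are a handful of environment variables: the API version, the endpoint, the CA, and a client certificate and key. A minimal Python sketch that renders them as `export` lines, assuming the default kubeadm layout under `/etc/kubernetes/pki` and a local endpoint on port 2379; both are assumptions, and the helper names are invented for illustration.

```python
import posixpath


def etcdctl_env(pki_dir="/etc/kubernetes/pki", endpoint="https://127.0.0.1:2379"):
    """Build the etcdctl v3 environment for the API server's etcd client cert.

    The directory layout and endpoint are assumed kubeadm defaults; adjust to the cluster.
    """
    return {
        "ETCDCTL_API": "3",
        "ETCDCTL_ENDPOINTS": endpoint,
        "ETCDCTL_CACERT": posixpath.join(pki_dir, "etcd", "ca.crt"),
        "ETCDCTL_CERT": posixpath.join(pki_dir, "apiserver-etcd-client.crt"),
        "ETCDCTL_KEY": posixpath.join(pki_dir, "apiserver-etcd-client.key"),
    }


def as_export_lines(env):
    """Render the mapping as shell `export` lines suitable for a file you `source`."""
    return [f"export {name}={value}" for name, value in env.items()]
```

Writing the output of `as_export_lines(etcdctl_env())` to a file and sourcing it gives the same kind of setup the hosts pasted in; swapping the cert and key paths is how they tried the health-check, peer, and server certificates in turn.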

1:08:43 This all looks normal to me so far. But that yep. So far. But I want to Carlos. Sorry, what about him? I said Carlos from chat agrees that that's the one you wanna use. Alright. Okay. So we want to use the one from the API server. Right? Yeah. That's just apiserver-etcd-client. Alright. Let's source that. We got stuff? Yep. Okay. That works. That's good. I guess. So now can we list users and roles? I don't think so. No. Looking for hints in the chat. Look how etcd server got started and with

1:09:03 Temporary ETCD Access Achieved

1:09:57 what flags. Yeah. That's oh, yeah. That's alright. I was just gonna list the difference. Because what I was thinking was there must be some certificate here with elevated privileges even if our API server doesn't have them. But that's a good idea to look at the flags on etcd. So let's do that. Yeah. The key file, that looks okay. Here, client-cert-auth. This all looks normal to me. So can't we just use the server.key and then we'll have root on etcd? Maybe. I'm not sure if you can use

1:10:55 the server certificate as a client certificate, but we can try. But okay. Let's try it. I mean, we have to see here. If we also knew what we wanted to generate, we could build a new certificate. Right. I'm gonna leave this in. I don't have to type it again later. And if we can't use that, the peer must be able to write, considering it's etcd, etcetera. Right? This is yeah. But this is the first one we tried, and that one didn't work. So, no. Did you source it? Oh, I need to no. I did not.

1:11:57 Alright. Let me expect to mess up the peer certificate. You messed up. Come on. Oh, is that is that a Messed that up anyway. Is that a hint? Let's do a little command on the peer cert and see what we can see. Sure. My my other feeling is to just regenerate the certificates and restart everything. We could use kubeadm. Yeah. I remember I've never used this. Oh, it wants me to. So you would have to apply a phase. So So renew won't help us. What does this do? I have a feeling this command can help us, but I don't

1:13:01 know. Well, we could generate a new certificate key. kubeadm certs certificate keys, or key. That's that's help. That's what I meant. Flags. Print out the security. Okay. This is not what we need. Yeah. We need to run the kubeadm phases, I think. I think you're right. And then there is a. Yeah. Certs. So let's just do the apiserver-etcd-client, or you could do all, I guess. But we don't wanna we probably don't wanna change the CA. No. Exactly. Ah, okay. But we can guess we can move them. I was gonna

1:14:06 say delete them, but sure. Moving them seems safer. Yeah. apiserver-etcd-client.crt. Now, do I need to restart the API server? Well, why not try sourcing our file and seeing if we can run the GET? Assuming we've got it set up still. Yeah. It's in the folder. Yeah. Yeah. Let's get But now I'm happy that I left the comments here. Right? Because we probably also need to regenerate the server certificates because that was also okay? We're in. That looks promising, right? You know, to try and add it. Yeah. So I wonder if the API server has the

1:15:36 keys cached in memory. Yeah. So how about this? Well, that won't work. Does it not? Oh. I would restart the pods. Yeah. Okay. So I just delete the pod, and then it should Oh, you can't delete the pods. You'll need to kill the process. That's annoying. Oh, because it's a step because it's a static pod. Right? Well, yeah. Well, it just yeah. It won't delete. Although, it's not written to etcd for the static pod. So you may actually be able to do a kubectl delete on that because it's a shadow pod. Right? Maybe. Let's let's try it. Yeah. Because it's

1:16:17 that that never writes to etcd, so it may actually work. Oh, wait. No. No. Just delete the process. Yeah. Yeah. And then it's +1 7818. 1 7 8 1 8. And it should be back. Wait. Let's just see if this now works. No. It still is broken. You just restarted the API server, right? Yeah. Just the API server. There's some more things we need to restart. So Let's do you mind if I type? So we run an etcdctl get on a key. Right? Mhmm. But we never tried the so that's us configured with that new

1:17:38 cert still. So kubeadm rerunning the certs phase actually didn't help us. No. Which means I'm just curious. Maybe we should just regenerate all the certificates except the CA. Do do you need to restart etcd also? It doesn't hurt at this point. Right? Yeah. We could. Done. Okay. Now let's try that command. Yeah. Okay. So Nope. I That was one of my thoughts as how to fix it. So I'm glad you tried that. I don't think the certs are a problem. So remember how I said this thing about, there was a thing that restarted, but it cannot be?

1:18:51 Uh-huh. Maybe should not trust my initial instincts. And there's something funky with the controller manager and the scheduler. Maybe we've been But we but we can't we can't use etcdctl locally either, which is, like, seems like a big flag. Yep. That's that that is true. That is true. So it is probably yeah. But let's do do alright. William, can we can we ask a question? Yeah. Yeah. I was gonna offer a hint at fifteen minutes. It's not certs. Right? It's not it's related to certs, but the certs are not messed with. Yeah. That's what I was

1:19:43 thinking. Okay. So and etcd doesn't have a config. Right? Let's look at these flags. Why didn't why has it taken us this long to look at the etcd flags? We look we look at them, I think. Yeah. Oh, I think I see it. I I I The flags have a hint, but I didn't change anything in the flags. Oh, you you didn't mess with the certificates. You didn't mess with the flags. I was thinking, oh, maybe the trusted CA file was messed with, but no, that looks good. Right? Does it? Yes. There is no, that also. There's client-cert-auth

1:20:47 in here twice. But Oh, it's client-cert-auth and peer-client-cert-auth. What's the volume mount on this? Carlos suggests checking the file system permissions, which again seems quite obvious. Maybe this is nothing to do That's what I did last time. Okay. Volume mounts, etcd certs. Okay. Well, /var/lib/etcd for the data. That doesn't look good. Right? Does it? Wait. No. It's /etc/kubernetes/pki. Yeah. But what if you changed the permissions on the actual database file in /var/lib/etcd? But this is running as root. Right? So let's go to /var/lib/etcd.

1:21:48 This all looks read, write, execute. This looks fine. Right. Yeah. This is all read write. Okay. It looks fine. Yeah. Okay. Do we want Do you guys want a hint? Yeah. Let's do one. Okay. Roles and users was the right path in my opinion. So we need we need to find a working credible certificate to speak to etcd. Right now we don't have one. Right? User. Yeah. Has he left this one in temp? People are always nice that way. No. I was not nice this round. There's one special user and one special role, root. Okay. Right.

1:22:07 Hint: ETCD Roles, Users & Root CN

1:22:53 etcdctl user list. User equals root. It tried to ask for a password. What's the default? There's a flag in your etcd start that gives you a hint on how you can work around that. There's Oh, we could turn client auth off. Yeah. Like, client auth. A client there. Okay. Who knows? client-cert-auth false. Maybe it has to be before. Yeah. Let's look at the flag. So Kubernetes manifest. There was this here. Oh, yeah. We could just restart etcd without oh. What? Yeah. That's a very good idea. Yeah. Yeah. Very good. False.

1:23:51 Attempting to Disable ETCD Auth

1:24:02 And we don't need to worry about the peer. Start off with not getting peers, but yeah. Why not? Just turn it all off. That that took a very long time. Oh, I still have the password. But oh, this is doing Well, so maybe it's not healthy yet. Did I did I oopsie yet? Well, let's check the logs. We may have to check it the old school way. Yep. Oh, it's not started. Why don't because we are very smart people sometimes. Copy etcd.yaml onto etcd.sh. Well, we don't have we we don't

1:25:14 Correcting ETCD Static Manifest

1:25:32 have etcd on the host. Right? Give me a minute. Okay. And then shift j j j j j j j j j Oh, no. I'll put yammer. Alright. Okay. We'll we'll regex on that. I'm getting a bit Rawkode. Dash dash dash. Oh. Oh, no. Yep. One less. K. Oh, no. That didn't work. Oh. Alright. No. You replaced them all with g's. Dash slash Slash dash. And then yes. Yes. No. No. Oh, wait. So we want to do this but with a space. Got it. Yes. Damn you. There we go. Alright. And then apt

1:26:25 install. etcd. etcd. Then let's just make sure our server doesn't start up. Right? So This better be the same version. It's a server. Well, someone already installed it. No. It's trying to start it up immediately afterwards, I think. Alright. So bash etcd.sh. Alright. That didn't quite work. Less than metrics. Oh, well, then well, maybe that's why it's not starting up. We made a typo. Right? I edited the file. Maybe I made a mistake when we edited it. Like, in the even in the YAML file, that would explain it's not coming up anymore. No. It's we need

1:27:20 Do we need to tell it the version? No. Because there's no environment variables in the manifest, is there? I feel like I'm making this much harder than it used to be. Yep. Oh, well. Thank you, I am. I'm here trying to help you out. No. It's fine. It's fine. That feels like it should work. It's complete what what is the error that it's giving? Can we run one more time? Not provided, but not defined. Let's see if there's anything else. Why is there only one dash? I I just support both syntaxes, and it's a weird error.

1:28:02 Alright. Okay. I'm I'm gonna say this is a very old version of etcd maybe, and maybe it's not even version three. It is. Okay. Did drop a hint in the root folder for you. I'm sorry. You're taking it off. I I'm clearly not helping. You are. You are. Okay. The certificate CN is used as a username in etcd when okay. But, yeah, we currently don't even have a working etcd. So So we can always turn auth back on, put the manifest back, and then try this Yeah. Common name as a username. Although I feel like disabling client auth and

1:28:51 logging in as root user should be okay too. Right? No. Because then you get then you hit the password. What's the default password? They won't let you set one. So The biggest admin or password. Where did our etcd.yaml go? It's in /root. I put it there for safety. Okay. That's like that's Alright. You'll need to put the client auth back on. I feel like in the future, if I do an etcd break, I need to only do an etcd break and nothing else. Oh. There's more. No. That's fine. That's fine.

1:29:43 I'm gonna start using the Okay. Thing that comes with k3s where it's a SQLite database for state. It's good. This etcd malarkey. Okay. So the hint was the CN should be the same. Right? What is the CN in our case? Yeah. I wanted to openssl again. Do you know how to generate a new cert from a CA? kubeadm does? kubeadm does. Yes. Because you need you need a cert with a common name of root because root is the only required user in etcd. I see. We can but I don't think we can choose

1:30:34 the certificate here. So we need to do some SSL commands. Oh, yeah. And this is my my arch nemesis figuring out the syntax for. Or can we maybe? No. I don't think we can change the common name here. It has only predefined ones. Yeah. I think so too. You know what? Maybe it's in the bash history. No. That was too easy. We want to So we can go to Kubernetes the Hard Way on GitHub. Right? Yeah. Yes. And search for openssl. Are they using cfssl? That's also fine. Okay. So here's how we do this.

1:31:10 Generating Root Certificate for ETCD Client

1:31:33 So we can copy. Carlos is asking why won't regenerating work? Because we need one with CN root instead of CN whatever we had. Alright. So here's our CSR. Yep. That sounds looks good. Let's just go here. CSR. Oh, that's the best. And we want the c yeah. And we want the CN to be root. Yep. Right. So we can change that. Yeah. We can keep all the other things, I guess. Stupid thing. Alright. So there we have a CSR. We don't have cfssl though. We can install it. That's good. Alright. So, now we can use c f I don't

1:32:37 remember these commands. I'm just gonna copy and paste them. Yep. We just need to put the right CA dot PEM in here. Alright. So, let's just call this same dot SH. Mhmm. So it's are we in the directory of No. So it's that's /etc/kubernetes/pki/etcd. etcd ca.crt. Yes. And then, that's just pki/etcd/ca.key. I don't have a CA config. Maybe we can leave it away? Should this be Yep. I suppose. There is a syntax error in our JSON. K. We have keys. Okay. We have keys.
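The point of the hint is that, with client certificate auth, etcd takes the certificate's common name as the username, so the request has to carry `CN` set to `root`. A minimal Python sketch that writes a cfssl-style CSR as JSON, which also sidesteps the hand-edited syntax error mentioned above. The RSA 2048 key settings are an assumption and not taken from the session, and `root_csr` is a name invented here.

```python
import json


def root_csr(common_name="root", key_size=2048):
    """Render a cfssl-style certificate signing request as a JSON string.

    Key algorithm and size are assumed values; only the CN matters for the etcd username.
    """
    csr = {
        "CN": common_name,  # etcd maps the client certificate's CN to a user
        "key": {"algo": "rsa", "size": key_size},
    }
    return json.dumps(csr, indent=2)
```

Saving `root_csr()` to a file and signing it against the etcd CA with `cfssl gencert -ca=... -ca-key=... <file> | cfssljson -bare root` should produce the `root.pem` and `root-key.pem` pair the hosts refer to next; the exact invocation they pasted is not shown in the captions.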

1:34:01 Let's modify. Alright. ScratchAI@ct. What did we call that? It's root/root.cert.pem, and the other one is root-key.pem. Correct. Yes. Okay. I know. It's root.pem and root-key.pem. Why is there a connection refused? Are we still broken etcd? Yeah. I think so. I mean Wait. It's not a Let me do this. It's a static manifest. Stop killing etcd. No. But because she installed the package, they're probably oh, no. We we made it worse, I think. Did we? etcd is not running. Alright. Let's restart the kubelet to encourage that along. Why can't I type? Restart. William, you're not invited back. I'm just in.

1:34:47 Using Root Cert to Access ETCD (Auth Success)

1:35:53 Okay. Okay. We got something. That's good. Woo hoo. Uh-huh. K. Okay. We're root. That's the I Okay. Turn off. Is this How do we disable k. Boom. Yep. There you go. I mean, we could at least have tried to fix the users and roles in another thirty seconds. How hard could it be? Yeah. One, kubeadm doesn't set up those things by default. So they're not even there to begin with. It's running. It is. It may just be cache for why we don't have the dance. This seems to work. Right? Or is this still the wrong version?

1:35:58 App Still V1, Investigating Deployment State

1:36:47 I don't think it has the wrong version. I think it's just I really need to call the files something different. But then curl, it's also serving me v one. But they always do v one. Oh, also in v two? Yeah. Oh. The only thing that changes is the video and that isn't refreshing, but it could be cache. But it doesn't look like it. Because normally, would have reloaded by now. Why is your Cilium in CrashLoopBackOff? Oh, probably because of your etcd. Yep. But this yeah. Okay. So we are I can just confirm by doing source

1:37:32 to this. It's the wrong one. We're still gonna v one. Unless this is the There is a second level of evil, probably. I I don't hear laughs. So Was there more evil? That wasn't it? There there there is more evil. Still it's still v Okay. There's some admission webhook running. Some mutating webhook that is moving this back to v one. Alright. Just because I now hate William so much, I'm happy for us to go over by another five minutes as long as you both have the time. Sure. Let's let's get rid of the mutating webhook.

1:37:45 Hint: Validating Webhook Preventing V2 Replica Set

1:38:18 What what Configurations. Mutating webhook configuration. No. Was a good guess though. Right? That was a good guess. Alright. So we modified the image. We saved the manifest. It did not take. No. It did not. Let's look. Change it again. I wanna see if the revision bumps up to six Yeah. Or seven. Oh. Oh, it's There's nothing to change here. So wait. You can go back to So where where did you see v one? Did you describe the pod? Yeah. The pod itself is still v one. So he's messed around with the controller manager or Kubelet.

1:39:25 That Well, the controller manager did restart. So In the interest of time, I used a a validating webhook configuration. A validating webhook configuration that should not be able to modify things. I'm confused. Yeah. It can't modify things. It can deny things. But we didn't we didn't get it. Well, maybe we did on the replica set. That that validation dot d ns.kh.io, that's that's not good. There's no such thing. And that just makes it silently fail that it's Edit it. Go look at it real fast. Yeah. Just so that everybody can see it. Was there was there an error message when

1:40:09 Deleting Validating Webhook Configuration

1:40:20 we did the edit? Nope. Not on the deployment. Not on the deployment, but maybe on the replica set. Sneaky. Sneaky. Sneaky. Oh, yeah. Because the deployment was v two. That's right. And then the replica set wasn't scaled up because the one with the one on one was ten minutes old. Gotcha. That's a nice so we the part. So we have one with v one. Where's the one with v two? Well, if you deleted the webhook, then it should just schedule. Right? So we're running it. Well, I haven't delete I haven't deleted it yet because that was interesting. Just in the

1:40:54 yeah. Destroy it with fire. No. No. I don't want to get the YAML. I want to get the name. Why did they make this word? It's horrible, isn't it? K. Alright. Get pods. I don't want now. Let me let me just delete the pods just in case. Do know what? Just delete the cluster. Yes. K. There's a new pod. That looks old. There's there's more. There is more. Right? Four seconds. This is good. Okay. Okay. Is this good or bad? That's bad. This is not the right image. Excuse me. Pull policy Always. ghcr.io, rawkode, klustered v two.
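A validating webhook can only allow or deny, and when its backend cannot be reached a fail-closed policy denies the request. That is why the new ReplicaSet never appeared while the Deployment edit itself returned no error. A minimal Python sketch for spotting which webhooks in a `ValidatingWebhookConfiguration` (loaded as a dict) target a given resource. The helper name is invented, and the actual rules William used are not shown in the captions.

```python
def webhooks_matching(config, resource):
    """List (name, failurePolicy) for webhooks whose rules cover `resource`.

    `config` is a ValidatingWebhookConfiguration as a plain dict. When a webhook
    omits failurePolicy, the v1 API default of "Fail" is reported.
    """
    hits = []
    for hook in config.get("webhooks", []):
        for rule in hook.get("rules", []):
            resources = rule.get("resources", [])
            if resource in resources or "*" in resources:
                hits.append((hook["name"], hook.get("failurePolicy", "Fail")))
                break  # one matching rule is enough to report this webhook
    return hits
```

Running it with `resource="replicasets"` against the configuration from `kubectl get validatingwebhookconfigurations -o json` would point at the entry to inspect or delete.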

1:42:08 Debugging Pod Image & Pull Policy Again

1:42:29 Okay. resolv.conf. It's always DNS. You're gonna you're gonna wanna be in the worker. Yeah. Oh, yeah. But but you messed with DNS. Yes. Okay. Just as a cherry on top. Yeah. That's under your rule. If you fuck with etcd, do you know how to mess anything else up? Oh, just redirects /etc/hosts. Okay. /etc/hosts. We can delete it. No. Don't Don't delete /etc/hosts. No. No. But we can edit /etc/hosts. Which server are you on? I'm not on on worker. I need to join your session still. Are you trying to figure out on which

1:42:47 Identifying & Fixing Host DNS Issue (Resolv.conf Break Again)

1:43:14 one the pod got scheduled? Ah, there we go. Yeah. Remove that. I'll do it. Okay. So we need to delete the pod again. Me. You might wanna apply it to the other worker real fast. I never seen anything on the other worker. It should be on both. Yeah. I got stuck. I can not access in on the other worker for some Yeah. I accidentally ran it to the I accidentally ran containerd. Sorry about that. Okay. Okay. Now we're gonna delete this pod. Then it's gonna k. DNS might be cached in CoreDNS as well,

1:44:19 I suppose, right? I didn't intentionally break that. Yeah. You may just wanna restart the kubelets on both workers. Yeah. We can do that too. And containerd just in case. Can you just try to type? I don't think so. Oh, you can. Cool. Mhmm. And then we'll need to... Oh, wait. I guess because it pulled the image already. I checked that. It will not re-pull the image because it already has it. I told it to. Oh, okay. Okay. Interesting. Alright. Let's go to work this time. K. Stop laughing, William. Something is writing to /etc/hosts continually.
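One way to tell a failed edit apart from another process rewriting a file such as /etc/hosts is to hash the file before and after a short wait. The following is a minimal Python sketch, not something that was run on the stream; the function names and the wait interval are assumptions chosen for illustration.

```python
import hashlib
import time
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's current contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def was_rewritten(path, wait_seconds=5.0):
    """Report whether the file's contents changed during the wait.

    Hypothetical helper: call it right after saving an edit. A True result
    suggests that some other process rewrote the file in the meantime.
    """
    before = file_digest(path)
    time.sleep(wait_seconds)
    return file_digest(path) != before
```

Calling `was_rewritten("/etc/hosts")` immediately after saving an edit on the worker would return True if the file was changed again within the wait window. It only detects a change; it does not identify which process made it.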

1:45:19 App Still V1, Deployment Spec Mismatch

1:45:19 It's back. I did not do that. It was tempting. Oh, okay. Are we just saying I am incapable of modifying a file with vim? Because I don't think that's nice. No. We edited it. But it's still not working. No. It's not working. Okay. Hint. Yeah. Put us out of our misery, William. Yeah. Yeah. I'm just like, what's going on? Now you got me wondering. Oh, no. Where is that face? I have one... wait. I'm gonna just... Dash l, space. All right. Just in case, right? We want to delete the pod one more

1:46:28 time. Delete the cluster. I'm done. Try it. It's mocking us. Did you do a shift refresh? I did. Many, many. I'm just gonna go back to v1 and call it a day. We're back at v1. That's what a true SRE does. Right. We're like v2 again. Yeah. Yeah. I just have a feeling it's not pulling the image for some reason. Oh, no. Wait. The deployment changed. That was Always because I searched for image. Oh. Always is in the last-applied-configuration but not down here. That's why.
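The mismatch described here, a value that appears in the last-applied-configuration annotation but not in the live spec, can be checked mechanically. Below is a minimal Python sketch, assuming the Deployment has been dumped as JSON (for example with `kubectl get deployment -o json`). The helper names and the `example/app:v2` image are hypothetical; the annotation key is the one written by client-side `kubectl apply`.

```python
import json

# Annotation that client-side `kubectl apply` stores on the object.
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def container_fields(obj):
    """Map container name -> (image, imagePullPolicy) for a Deployment-shaped dict."""
    pod_spec = obj.get("spec", {}).get("template", {}).get("spec", {})
    return {
        c["name"]: (c.get("image"), c.get("imagePullPolicy"))
        for c in pod_spec.get("containers", [])
    }


def annotation_drift(deployment):
    """Return containers whose annotated fields differ from the live spec."""
    annotations = deployment.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(LAST_APPLIED)
    if raw is None:
        return {}
    applied = container_fields(json.loads(raw))
    live = container_fields(deployment)
    return {
        name: {"annotation": applied.get(name), "live": live.get(name)}
        for name in applied.keys() | live.keys()
        if applied.get(name) != live.get(name)
    }


# Hypothetical object: the annotation says Always, the live spec says IfNotPresent.
applied_cfg = {"spec": {"template": {"spec": {"containers": [
    {"name": "app", "image": "example/app:v2", "imagePullPolicy": "Always"}]}}}}
deployment = {
    "metadata": {"annotations": {LAST_APPLIED: json.dumps(applied_cfg)}},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "example/app:v2", "imagePullPolicy": "IfNotPresent"}]}}},
}
drift = annotation_drift(deployment)
print(drift)
```

Searching the YAML for a string such as "image" or "Always" can match inside the annotation, so comparing the parsed annotation against the live spec avoids reading the wrong copy.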

1:47:11 Identifying & Forcing Deployment Spec Update (Server-Side Apply)

1:47:28 So that means that William, when he modified it, did server-side apply with his kubectl. Mhmm. Okay. So go to the page. I'm dancing. I'm dancing. Good. Yeah. Nothing to say, William. Nothing to say. In the etcd break. I feel that was cool. I really enjoyed working through the etcd break, and then it was just sucker punches after that, I feel. See, I thought the other two would be the easy ones. The etcd break, I expected to take a while. I should have given you a hint earlier. No. No. No. No. It

1:48:02 Cluster 2 Fixed (App V2)

1:48:19 was good. And I don't know. You just don't think straight on these episodes. Like, generating a new certificate makes a lot of sense when you know the username is a thing and it's a special role. Yeah. I still feel like turning off the auth should've worked, but you told me there's no default password. So obviously, that's a bug in etcd. There's a reason why etcd makes you add a root user and set a password for it before it will let you turn on auth. And I set the password to 123456

1:48:54 so that you could... That would have been my next guess. Got that part. I have deleted the kubelet on the control plane and it's done. It's done. Yeah. I tried admin and I tried password. I should have tried 123456. But that would have been funny if we did it. And, yeah, I like the validation thing because it was sneaky, because the deployment edit worked, but then you don't realize the pod hasn't changed because you're too busy frantically looking at everything else. That is a bit of a

1:48:59 Wrap Up & Debrief

1:49:29 weird UX though. It would be nice if, like, the deployment would say, oh, this ReplicaSet that we created is not scaling up, like, just from a UX standpoint. Like, I know these things are disconnected concepts, but as a user... Yeah. It's not very user friendly. Right? And then... So when you edit the pod, it actually gives you a real error, which was helpful. But you can't edit pods. Yeah. They're immutable. Oh, it's true. So did you actually run your own registry in that cluster or on the host somewhere?
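The gap raised in this segment, a Deployment edit that succeeds while the new ReplicaSet never scales up, can be surfaced by inspecting the ReplicaSets directly. Below is a minimal Python sketch, assuming a list of ReplicaSet objects as returned in the `items` of `kubectl get replicasets -o json`. The function name, object names, and the webhook message are hypothetical; `ReplicaFailure` is the condition type a ReplicaSet reports when pod creation fails.

```python
def stalled_replicasets(replicasets):
    """Return (name, desired, current, messages) for ReplicaSets below their desired size."""
    stalled = []
    for rs in replicasets:
        desired = rs.get("spec", {}).get("replicas", 0)
        status = rs.get("status", {})
        current = status.get("replicas", 0)
        if current < desired:
            # Collect failure messages, e.g. a rejection from an admission webhook.
            messages = [
                cond.get("message", "")
                for cond in status.get("conditions", [])
                if cond.get("type") == "ReplicaFailure" and cond.get("status") == "True"
            ]
            stalled.append((rs["metadata"]["name"], desired, current, messages))
    return stalled


# Hypothetical data: the v1 ReplicaSet is healthy, the v2 one cannot create pods.
items = [
    {"metadata": {"name": "app-v1"}, "spec": {"replicas": 1}, "status": {"replicas": 1}},
    {
        "metadata": {"name": "app-v2"},
        "spec": {"replicas": 1},
        "status": {
            "replicas": 0,
            "conditions": [{
                "type": "ReplicaFailure",
                "status": "True",
                "message": "admission webhook denied the request",
            }],
        },
    },
]
result = stalled_replicasets(items)
print(result)
```

With the sample data only `app-v2` is reported, together with the message explaining why its pods were not created.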

1:50:05 No. I just used containerd to download an existing container and then retag it as yours. And then set it to IfNotPresent, and then it kinda snuck through. But I only looked at the annotation, which is why that tripped us up too. Damn it. Yeah. I originally put Never because I wanted to do it. And then I was like, no. You know, they're gonna see that in five seconds. That's too obvious. Yeah. Oh, yeah. Alright. Good job, both of you. Great breaks by both. Great fixes by both. As always, I am the liability in the group and I

1:50:38 make everything take 10 times longer. But thank you for your patience in sticking with me. Thank you to our audience for watching. Thank you to Equinix Metal and to Teleport for their continued support, and we'll see you all again soon. Have a great weekend, everyone. Adios. Have a great weekend. Bye bye. Bye. For watching Rawkode Live.
