About this video
What You'll Learn
- Trace Kubernetes service selector and endpoint issues to restore application connectivity between teams.
- Audit containerd runtime settings and logs to identify cluster-level component startup and image failures.
- Troubleshoot Cilium networking and kubelet certificate problems with a node reset and rejoin workflow.
Two Chainguard teams race to repair broken Kubernetes clusters, digging into containerd config, service selectors, Cilium CNI, Postgres StatefulSets, corrupted kubelet certificates, and a kubeadm reset to bring the apps back.
Jump to a chapter
- 0:00 Holding screen
- 2:00 Introduction
- 2:26 Sponsor Thanks (Teleport)
- 2:50 Sponsor Thanks (Equinix Metal)
- 3:13 Introducing Team East
- 3:37 Team East Introductions
- 5:18 Granting Access to Team East
- 5:39 Team East Begins Diagnosis (Initial Checks)
- 6:56 Investigating Service & Networking Issues
- 12:20 Attempting to Fix Service Selector
- 14:20 Investigating Application Image & Containerd
- 19:19 Examining Running Containers & Finding Hints
- 20:39 Diagnosing Containerd Config Issue
- 27:17 Checking Application Status & Database Controller
- 32:30 Investigating Postgres StatefulSet Persistence
- 34:12 Identifying Rogue Processes & Node Issues
- 42:18 Making Control Plane Schedulable
- 43:51 Attempting Application Upgrade (v1 to v2)
- 45:20 Further Debugging & Time Runs Out (Team East)
- 47:30 Explanation of Team East's Challenge Issues
- 49:55 Transition to Team West & Technical Issues
- 51:30 Resolving Audio Feedback
- 52:49 Introducing Team West
- 53:09 Team West Introductions
- 55:12 Granting Access to Team West
- 57:10 Team West Begins Diagnosis (Forensics & Initial Checks)
- 1:00:11 Fixing Kubeconfig & Checking Node Status
- 1:06:16 Examining CNI (Cilium) Configuration
- 1:09:55 Diagnosing Kubelet & Authentication Issues
- 1:15:14 Worker Node Access & Certificate Investigation
- 1:18:12 Corrupted Certificates & Considering Kubeadm
- 1:25:23 Attempting Kubeadm Reset & Join
- 1:30:11 Investigating Application Ports & Editor Issues
- 1:31:59 Fixing Application Ports (via Emacs)
- 1:36:16 Fixing Database Service Port & Typos
- 1:37:16 Finalizing Database Service & Upgrading App Version
- 1:40:34 Cluster Fixed & Final Application Check
- 1:41:54 Conclusion & Wrap-up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:00 Introduction
2:00 Hello, and welcome back to the Rawkode Academy. This is Clustered. Today, we have something special. We have Chainguard versus Chainguard. We have two teams from the same organization. Hopefully you're all familiar with Chainguard. They've been hiring up all the best talent in the cloud native ecosystem, but I'm sure we're in for one hell of a Clustered today. Now, before we introduce our first team, I wanna say thank you to our sponsors. So thank you to Teleport. They've been supporting Clustered from the very beginning, and I can't thank them enough. We use Teleport on
2:26 Sponsor Thanks (Teleport)
2:37 Clustered every single week. It's a fantastic product. I recommend that you check it out. And if you're not convinced, you'll see us using it on this very episode. So there's a lot to love there. Check it out at Rawkode.live/teleport. I also wanna thank Equinix Metal. They've been providing the hardware for every single episode of Clustered two. So thank you to Equinix Metal. Now I could run Clustered on some small VMs with a few cores and a wee tiny bit of RAM, but I don't. I run it on big chunky boxes with lots of cores and lots of RAM because it
2:50 Sponsor Thanks (Equinix Metal)
3:07 makes it more fun. So thank you Equinix Metal for putting up with that. I really appreciate it. Alright. Let us get our first team on. We have Chainguard East. They're broken down geographically, and this is team East. So thank you all for joining me. How are you all? Good. Doing alright. Good? Happy? Nervous? Excited? All of the above. Awesome. All of the above. Alright. Let's go in a clockwise direction. Can you please say hello, tell us your name, and anything else that you wish to share? Okay. I guess it's me. Like, my name is Carlos, and I work for Chainguard, as
3:37 Team East Introductions
3:54 everybody knows. And, also, I work on the Kubernetes upstream and the. Hi, everyone. My name is James Strong. I am the author of Networking and Kubernetes. I am also one of the maintainers of Ingress NGINX. So I fully expect today to be a lot of networking issues. Thanks, Thomas. And I'm a lead solutions architect here at Chainguard. Hi. I'm Adrian. I started at Chainguard a week ago. I hacked it on the show once before as part of Container Solutions. I was saying earlier, I intend to employ exactly the same tactics, which were to allow only as a teammate
4:39 to do all the work. Was gonna say, forgot what it was. Oh, I was at my first in person meet up for, you know, two years last night. So I am a little bit slower today. That's my I'm getting excuses in fast. Yeah. As many excuses are like that that don't worry about it. Alright. Lots of chat going on there. So that's awesome. Remember to help out our wonderful teams. They're not gonna see the obvious typos and other mistakes because this well, they've got a camera in their face and that's just the way things work. So
5:13 feel free to drop that into the chat and give them as much help and support as you can. We are gonna pop over to my screen share. Now the first thing I need to do is give you access to the other cluster. So this is Chainguard East. I'm gonna change your permissions. Ta-da. It's as easy as that. Alright. You should now all see, on the server page, all the machines. I am gonna open a connection onto Chainguard West Control Plane 1. Please feel free to join the session and type echo or anything else to let me know that you're
5:39 Team East Begins Diagnosis (Initial Checks)
5:52 in the session. Remember to use activity active sessions and best of luck. Alright. We've got one already. Awesome. Yeah. Gonna work someone. Adrian living up to these excuses already. Last one in the session. Alright. I would encourage you to export your cube configs, set any aliases that you want and check for a control plane. Best of luck. Jeez. That's flying. See, I'm glad I'm not doing this. Well, you have exported kubectl instead of kubectl. Yep. It begins. Capitalist. Oh, plenty of. Thank you, Thomas. I have no data. I know. Thomas. Yeah. Yeah. Alright. Should we try should we try deleting those
6:56 Investigating Service & Networking Issues
6:58 first? West side, the best side. Don't know. That can be a real control plane. Right? That could be a real API server. Yeah. It could be real. Let's just for fun, let's go look at the actual one. See if they've changed anything here. That looks like is it 8021? Is that the right one? Believe so. Yep. I think you need yeah. Let's let's go with it until proven otherwise. Changing the IP address would be particularly mean, but feel free to run ip addr if you really wanna check. Nope. I think we're good. Wait. You guys wanna just go ahead and
7:49 try to delete this? Let's see if there's any statements. Take a look. What's the Everything says running. So maybe it's all working. They've not broken anything. Yeah. What is it? 3030000. Let's just see if it's v one's running properly. What's the that is as a as a zero extra zero. Extra zero? No. 30,000. Right? Yep. 30,000 is correct. Oh. Do you have to do HTTPS, or are they just playing around there? Okay. It is not running with TLS. It's I can give you that. It's not running it's not running with TLS. They have a separate controller for Postgres.
8:32 Okay. I mean, I can try and open the app right away if you if you want to try and see it. Yeah. Cool. Just wanna see GCR Rawkode d one. Okay. That looks right. Oh, I should have written down, like, the digest of the image, actually. Yeah. That's fine. That's gonna be a thing to change the image, isn't it? Yeah. That doesn't look right, though. Six four four three. Doesn't it? Oh, they're clustered. Yes. Okay. Yep. But you do servers. Dash run on port eighty eighty. Yeah. But that's, like, con this I'm sure that's Kubernetes
9:35 session 6443, isn't it? The API server does. Yes. Was I looking at the wrong one? Clustered. Yeah. That endpoint doesn't seem right. Nope. Yeah. Are we just gonna leave till well, yeah, that's one thing at a time. I was gonna say we'll leave those pods. They're probably doing nothing. They're probably just decoration, all those pods. Well They're gonna be a painfully. Yeah. Let's go see who this guy is. What did you do? Get pods, dash o wide. Looking at the IP address, trying to figure out who that endpoint is. Could be a VIP. Yeah. What's that? So that is the control plane private IPv4.
10:42 Okay. Uh-huh. Why is your service pointing to that? Yeah. Yeah. Wonder if we just go ahead and reexpose it. Okay. Yeah. Can we just I mean, you can probably describe the service and I don't know what that gives you. Oh, yeah. The edit looked right. Yes. So endpoints is just wrong. Yeah. So just update the endpoint. Type NodePort. What are the labels on the pods? Yeah. Describe. Yeah. When I keep typing, like, clustered in Kubernetes, I'm gonna keep getting that wrong. Default app is klustered, k-l-u-s-t-e-r-d. Yeah.
11:58 So what's the selector on that service? Clustered. Selector. Oh, it doesn't have a selector. Okay. Told you they were gonna start with the networking. But you can't change a selector on it, can you, after you've created it? Because this looks right. You you can add the selector. You can add the selector. Yeah. Change. Session finding. Where is the selector in the spec? Well, after spec, you would just have to do selector and then Yeah. Match labels. I'm not touching the keyboard to put the but Carlos, just just give Carlos a keyboard. Yeah. I don't I don't like I don't
12:20 Attempting to Fix Service Selector
12:59 like to use Vim as well. Let's quit some. It's all fine. Emacs is not installed on these servers. Oh, that's even worse. So your hypothesis here is that they removed the selector and added a manual endpoint for the service. Is that what you're thinking? Yes. What is the select what's the selector on it again, guys? I don't know. It's app. App klustered. App, app klustered. It's a good first guess, I would say. You'll need to fix that indentation. Yeah. I'm actually surprised that it accepted that. It did. There you It did.
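The hypothesis in this exchange (a Service whose selector was removed and whose endpoint was set by hand) can be illustrated with a small sketch of how equality-based label selection works. This is plain Python, not a call into any Kubernetes client; the pod names and the `app: klustered` label are assumptions taken loosely from what the team reads off the screen.

```python
from typing import Dict, List


def selects(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Return True if every key/value in the selector is present on the pod."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints;
        # whatever endpoints exist were written by hand.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """List the names of pods that the selector would pick up."""
    return [name for name, labels in pods.items() if selects(selector, labels)]


# Hypothetical pods and labels, for illustration only.
pods = {
    "klustered-abc": {"app": "klustered"},
    "postgres-0": {"app": "postgres"},
}

print(matching_pods({}, pods))                    # selector removed
print(matching_pods({"app": "klustered"}, pods))  # selector restored
```

With the selector restored, the Service's endpoints are derived from the matching pods again instead of pointing at a hand-written address.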
13:46 Okay. Thomas, it's true. We shouldn't have left any editors installed. Was Well, that's a good point. Open Vim and type :oldfiles. That should be fine. Right? I'm just checking to make sure they're not messing with those binary output. It can mess up your terminal. I just put it to a file you want. One sec. Checks checking that. Okay. They they didn't mess it up there. We're not do we still have this old endpoint? No? Nope. Didn't change the endpoint. Just make sure it's going to the right one. I don't need dash. Just need the clustered
14:20 Investigating Application Image & Containerd
14:56 port. But one thing you could do I'm sorry. Okay. So it is it's pointing to the right pod. Right. So go into the pod and curl from the pod and see if you get the same thing. Good point. I tried to load it, and it doesn't work. Oh, need dash dash. Sorry. Uh-oh. Yeah. I can't. Oh, no. I'm oh, I just yelled at. This is on 8080 in here. Yeah. Act just blow me the terminal. We can always sign a new one. Yeah. Yeah. It's looking like To change the what's wrong? Do a ps aux for it. Is
15:50 NGINX running? That. So that's that's what you were using it in genetics, weren't you? Or were you using Apache, David? Yeah. Yeah. Yeah. David, is is it running Apache inside your image? It is not. Yeah. It is not. Okay. So Like, it just faked it. This is not David's image, I believe. No. No. It is not. Let's go ahead and go look at the deployments. Matt says he can have it. I don't know why my browser isn't letting me have it. Okay. Let's have this. That's quite does that look does that spelling look great? I could I could imagine them
16:39 putting something up in GitHub. Rawkode Academy, klustered v1. Okay. The next thing I would do is get the digest so we can see when that changes. Yeah. I'm checking more clearly here with this image. Yeah. And he could have done what I did in the previous thing, which was to play with the containerd config.toml. Where is that located? Is it it's a containerd I thought it's containerd. Let me check. What's that? It's containerd's config.toml, but possibly I I don't maybe that doesn't exist unless you've,
17:34 like, made a change to it. So that means you made no change. What's your hypothesis on this right now? What what are we worried about? What what symptom do we have and what are you trying to fix? Well, I see that's the right image. Yeah. It's not the right image. Yeah. I just grabbed the the digest from the Rawkode Academy klustered v1. Yeah. And it starts with F0D9. Okay. Give me one sec. Get pod get pods. Let's just do YAML, and I'll be lazy and just grep for image. What what was it? Yeah. That's correct.
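The digest check the team performs by eye here could be sketched as below: take a pod object as returned in JSON form, pull the digest out of each container's `imageID`, and compare it with the digest expected from the registry. The registry path and both digests are placeholders, not values from the episode. As the session goes on to show, a matching digest only rules out a swapped tag; it says nothing about a tampered runtime configuration on the node.

```python
from typing import List


def digest_of(image_id: str) -> str:
    """Return the part after the last '@', e.g. 'sha256:...' from 'repo@sha256:...'."""
    return image_id.rsplit("@", 1)[-1]


def mismatched(pod: dict, expected_digest: str) -> List[str]:
    """Names of containers whose reported image digest differs from the expected one."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [
        status["name"]
        for status in statuses
        if digest_of(status.get("imageID", "")) != expected_digest
    ]


# Placeholder digest and image path (assumptions, not real values).
EXPECTED = "sha256:" + "a" * 64
pod = {
    "status": {
        "containerStatuses": [
            {"name": "klustered", "imageID": "registry.example/klustered@" + EXPECTED},
        ]
    }
}

print(mismatched(pod, EXPECTED))
```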
18:31 That's right? Oh. E b yeah. That's correct. What happens if we just kill well, what happens if we just pull a new version of the I mean, So you're you're looking at the containerd config because you think they've somehow configured containerd to pull something out? That's what I did, remember. So I wonder if they're trying to get revenge. But it's possible they just uploaded a new image onto the the nodes. With a feature? Yeah. Yeah. Or edited the running one. Did we check these crazy pods with the stuff to see what's inside those? No. We haven't. So we can go ahead
19:19 Examining Running Containers & Finding Hints
19:22 and just look at one, see what they're actually doing. Thomas does suggest that you can cat -v for that binary stream if you want. Just running the box and busy box. I'll sleep forever. Yeah. Yeah. It's just a misdirection. Oh. Do we Let's go to the root. Let's check that read me stuff. Yeah. Let's try to know no spoilers here. Right? No spoilers here. Let's yeah. Okay. This four. Mean, how much have I done? Well, that's that's not ominous at all. Nice. The list of most places unexpected. Can we So I'm wiping out the clustered image and
20:39 Diagnosing Containerd Config Issue
20:41 just Yeah. I think I'm gonna So where is this pod scheduled? And then maybe just manually remove the image and force it to pull again. Yeah. That sounds Because I'm curious and I don't know the answer to this. Right? But the image pull policy was always. But if it reaches out the container registry and the digest is the same, I'm assuming it doesn't pull the same image. Yeah. So you might can you get in like So if they were if they were able to fake the digest and being a security company, I'm assuming that is where we send people
21:15 skill set, then that would be an interesting break. Well, I guess you can just go into you used to be able to just go into, like, the file system. But do it you can use CTR as well. You're in the 12 I mean, it's the 12. The ACLs, but otherwise, it wouldn't pull. So where's the pod schedule? So let's go delete an image. I gotta get that. Sorry. The wrong one. Clustered. Well, this well, this got their full partner. Oh, that's right. Yeah. Need the full one. It's not a deployment. Being lazy. Not YAML. Typing
22:17 okay. That still looks right, and it's probably still wrong. So do I get pods or wait and see where it's scheduled? Yeah. Okay. The previous one was in the worker. Alright. So I suggest we jump on to worker one. Okay. Yeah. You're start another session for us? I am indeed. So West worker one. Oh, it got a nice MOTD as I logged in. No. Of course, it does. Lovely. Alright. I might wanna do a PS first. Yeah. Well, at least, like, you're not in a container. There seems to be a lot of fun going on here.
23:30 Containerd. Okay. So are you just going to try and inject this image with crictl or ctr? Yeah. Let's Oh, I don't know what crictl can tell you. Got the money. Not pleased with. Just kill every one of them with honks. Why we have, like, there is some invoice here in the Busy ambassador? I think I wanna check and see I just wanna check and see if there's anything in here. No. It doesn't look like there's anything in there. Yeah. crictl's there, just like I would expect. There's ctr as well. You can use
24:21 --runtime-endpoint and point it to the containerd socket to cut out all those error messages. Runtime dash To the container? Dash endpoint. Yeah. Equals unix:///run/containerd/containerd.sock, I believe. This guy? Yeah. Although you might need it before the stock ID thing. Yeah. Okay. You can tell I use this one a lot. Just try ps, yeah. Or that. You're missing an x. Okay. X. I think you're missing the k at the end of sock. S-o-c-k, right? Oh, maybe it's my I'm just watching on
25:12 the the videos, I guess. Do you mind if I type for a second? Sure. Run. Enter the yeah. There it is. Oh. The ctr as well. It's not not easier. There you go. She's done. You should you should be able to list images as well. Yeah. It's been up, but it's also restarted. Thanks, Thomas, for the reminder. Yes. Let's delete those images. Yep. So if we run crictl images, we can see what's on the host. Okay. Probably don't need NGINX or BusyBox. What was well, I would focus on clustered. We've got twenty minutes left. Yep. Yeah. Exactly.
26:22 Wow. Time flies when you're in the hot seat. Does it just delete? Or is it del? rmi is the command that you want to delete: rmi, and then the image ID. Just as I thought, being in the hot seat is much different than just watching. I think your rmi needs a space after the socket. This is deleted. Yeah. I deleted it. Doesn't yeah. It doesn't exist. It's not there. Let's should we try to re should should we try to delete the one that's provisioned on the control plane and restart it? And I'm back onto the control plane.
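The `crictl` invocations discussed over the last few minutes (pointing the tool at the containerd socket, listing images, then removing one with `rmi`) can be assembled as in the sketch below. It only builds and prints the command lines; nothing is executed. The socket path is the one spoken in the session, and `IMAGE_ID` is a placeholder for whatever ID `crictl images` reports.

```python
import shlex
from typing import List

# Socket path as mentioned in the session; adjust for other runtimes.
CONTAINERD_SOCKET = "unix:///run/containerd/containerd.sock"


def crictl(*args: str, endpoint: str = CONTAINERD_SOCKET) -> List[str]:
    """Build a crictl command with the global runtime endpoint flag placed before the subcommand."""
    return ["crictl", "--runtime-endpoint", endpoint, *args]


# List images on the node, then remove one by ID (placeholder ID).
print(shlex.join(crictl("images")))
print(shlex.join(crictl("rmi", "IMAGE_ID")))
```

To actually run one of these on a node, the list returned by `crictl(...)` could be passed to `subprocess.run`; that step is left out here so the sketch stays side-effect free.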
27:14 Oh, I'm hoping it just restart. Yeah. Looks like it did. I would just restart. Yeah. Yeah. It's creating. It's creating. Okay. It's running. Do we wanna try to curl inside the pod again and see if it's the right one this time? You should be able to curl localhost 30,000. Yeah. That's right. There. Baylor. I didn't expect that. I yeah. I didn't either. But the image yeah. Should we look at it anymore of the hints? Apparently, there is a hint for this issue. So Did did we actually check the config the container of the TOML on that the worker
27:17 Checking Application Status & Database Controller
28:20 node or just on the No. We did not check it. Okay. So that's gonna be in /etc, containerd, the TOML. Nothing nefarious in there on the one side. Can we come up? So I can tell you that containerd doesn't ship with a config by default. So this was Exactly. Put there by them. And that snapshot to me seems suspicious. You know, I would delete the file. Okay. Alright. Yeah. I'll run with it. We ought to restart something. What we could look. What would yeah. I'm not sure what you'd restart. Yeah. You
29:05 would restart containerd and then delete the pod. They're rumbled. We have all nose and crying cat faces or shocked cat faces. Yeah. That just means it's like onto like Yeah. We've been to the next one. Yeah. Yeah. This is actually we stumbled into the The first this is the first line. Yeah. Yeah. Thomas is saying, deleting random files from /etc to fix production issues. What could go wrong? We're not deleting random files. We're resetting them to the default. Alright. I would suggest that containerd doesn't seem to be restarting, which is not Yeah. It doesn't seem to that's
29:48 not okay. You may wanna look at that unit file. Yeah. I would say, like, maybe we can get the default config TOML and put that back. You can have containerd output the default TOML if you want. But you could just edit the unit file. Assume it I'm assuming they must have modified something to look explicitly for a config. Yeah. Carlos, do you know that? I I don't know that. What? I was waiting How help is that? So systemd, system. It's gonna be okay. Where is it? Alright. Well, let's try systemctl cat containerd.
30:34 I didn't know that. Yeah. You I don't see Well, it's just about snapshot in there that didn't make a look of. Oh, I think that was above my last prompt. Okay. I don't see it It looks clean. Looking for a config. Yeah. It does it looks like it's not looking for a config file. Does it did you Ctrl-C the restart, or did it just finish? Like, is containerd now running? I did I did a Ctrl-C. I mean, we can do We'll do a systemctl status and see what comes up. Let's
31:07 test data. Yeah. It's just the whole thing. I would do everything. No. Come on. Okay. It's running. So you might have said it took a long time because there are quite a lot of pods and containers. Yeah. Okay. So you might be alright. Let's let's go back to control then. Let's play that pod. Yeah. Okay. You just have fifteen minutes left. Well, that's Okay. You definitely But you are doing great. Really, really good. I don't like that post SQL controller zero. So even if we get it get clustered running, something some something's going on with the database.
31:57 I can She's gonna ask me what databases alright. We have a clustered pod. Hey. Hey. Hey. There we go. Alright. Elements by tags. This this is v one. You make This looks right. That's pro that's gotta be the database. So nicely done. I thought we made progress, but, apparently, maybe not. Alright. What next? I'm gonna delete that controller pod, but I didn't see if there was a stateful set for it. There is. I'm gonna remove that guy. Is there So what How is it I think you were gonna go with the same thing. How does how does klustered
32:30 Investigating Postgres StatefulSet Persistence
33:09 know to talk to it? It's hard coded as a service endpoint called Postgres. Uh-huh. Take a look on the services. Postgres. Postgres. And that's 10108. Well, let's let's go see. Well, it came back up. That's fun. Well, it But that has a stateful set name, and you deleted a pod. Well, I deleted the stateful set. Alright. Thought I did. Oh, then that's different. I did delete it. Okay. Maybe maybe this is in a manifest, like, hardcore manifest in there. Let's see. That's what I'm gonna yeah. Let's If it was a manifest, we would see the control plane
34:03 node name as part of the Ah. Part. Okay. I'm I'm getting serious now. Install now you are. Sorry about the sudden Adrian is showing up. There we go. Do apt install ripgrep, and then change directory to /etc and ripgrep honk. You think we left some honks in the script? You have to change the Is it slash that way? No. You you gotta change the directory you want to start in. Is /etc. Yeah. Honk. Yeah. I see. Sorry. Okay. My plan failed. The other thing you can do is look in Vim and put :oldfiles.
34:12 Identifying Rogue Processes & Node Issues
34:48 Matt is also saying check the hints. And Thomas is saying that whatever situation we're in is my fault. So sorry. I'm not sure what I did. We already fixed that, I believe. I hope so. Yep. That's fixed. Yeah. Yeah. This is fixed. Sail. Where are we setting sail for? For land. Be on the realms of control plane. Go beyond the IP address. The IPCs of the land and the workers were more fun awaits. I don't think we've encountered that yet. Well, we did go on to work and change the the config. So we might have got
35:25 part of it, but it might be more there. I don't know. Mhmm. Okay. And then I think we already did the ctr. Okay. So maybe there's something still broken on that. Oh, there you go. It doesn't have to know. The controller the controller controls more than you think. We've not done that. So, like, look at the controller, like, the /etc manifests thing. Yeah. If you look under there Apparently, the worker part is completed. Okay. We have a node. Yeah. So the controller one is a Postgres hint according to Thomas. Oh, okay. Is a Postgres hint. I don't
36:05 know this bit. Yeah. So you go Let's talk about what's happening. You delete the stateful set and it came back. How are they doing that? Yeah. Right? There's a few ways to do that. So let's try and work through them. The logged in to the thing. They're logged in to let's I want could be a service that's wouldn't be. Yeah. That wouldn't be very nice, sporting. I wonder if there's anything we already looked in the manifest. There's no manifest in the worker node. Wonder if there's just a there's gotta be a script or something that's starting
36:38 the pods back up. Could be. A cron maybe? Yeah. Something you create the stateful set when you called it. Right? And he's talking. Let's go with three. It doesn't look like it is. That is not Does Kubernetes have cron? Yeah. Gotta love jobs. Cron jobs. Oh, yeah. Dash figure. Yeah. Not jobs. They're cron jobs. Yep. Cron jobs too. Is it just cron jobs? No. May No? And what if it's one of those pods, the crazy pods? Like, it's not BusyBox and it's doing stuff, actually. Yeah. One of them might be doing something, you think? Yeah.
37:32 Yeah. I could do. One of these, like Check the image names. Yeah. Grep. On all of on on all of them? Well, no. Can we not just, like, get all the? Get pods. Yeah. So you can run get pods and then get all of them and then just grep for the image name, I guess. Yeah. You can run kubectl cluster-info and grep that for all the images across the entire cluster. Say that again. Sorry. Cluster dash info. Oh, yeah. And just grep images? Maybe image. And maybe cluster-info dump. I can't really remember the exact one. Yeah.
38:19 It's probably dump. Yeah. It's got that's the one that's got all of the information. That's the one. BusyBox. BusyBox. And Thomas is saying, it's a worker process, and it's still my fault. Okay. Let's go back to the worker then. Oh, jeez. Let's just call it let right. Make the control plane schedulable and cordon the damn worker. Yeah. It doesn't matter how you fix it. Right? Good day to you, Adrian. Yeah. Okay. Not pods, nodes. Am I spelling it right? Quite far off. That's like a chef. Cordon. Yeah. Oh. Oh. Oh. Cartoon would be to make it religious, I
39:10 think. And then we probably wanna kill wanna kill cluster kill all the bugs. The rubber cable is asking, wouldn't cordoning the worker be a problem because it's the only one? Yeah. We need to, like, make the master schedulable. Yes. We can remove the taint from the control plane. Which is yeah. That's weird sent to We wanna edit the way Oh, there's a good hint from Thomas. I should be assistant to the worker. Clearly, I've started some horrible process. But we've cordoned it anyway. Right? So it doesn't matter. Yeah. SSHD isn't. Isn't SSHD. We go through Teleport. We don't go through
40:04 SSHD. Oh, jeez. What have you done to my cluster? My poor little cluster. It better not be just logged in and doing things. That's gonna annoy me. No. I don't think so. I I'm assuming SSHD is a custom binary for that. Yeah. Okay. So I killed that. Alright. We've got So try calling seven minutes. Do you want to remove a taint to make it schedulable? It's like kubectl taint. Edit. It would all help us. Well, you cordoned the node, but you also didn't drain the worker. Although, I can't remember if a cordon does a drain, but I still see I can't
40:51 remember. No. No. The the drain does a cordon. The drain does a cordon. Yes. So let's drain that node. Which is it gonna drain? Okay. Drain. Okay. What's the node's name? Uh-oh. Do, like, you know, daemon sets and dash dash delete dash dash data. Yep. That's the one. You wanna type you wanna type the One second. One second. I was looking. --ignore-daemonsets. Oh, what's this weird stuff? Well, did we do the drain? I don't think we did the drain. I don't think it did either. No. That is the "cannot delete pods not managed by replication
41:53 controller or replica set, daemon set or stateful set" error. Five minutes. East I think we still need to make the control plane schedulable, though. Right? Indeed. Yeah. Somebody wanna get the syntax for that. Yeah. It's gonna be like kubectl taint nodes, the control plane node, and then You can just edit the nodes and remove the taint. Oh, cool. Yeah. Let let's let's make this quick. We've got four and a half minutes. Cool. Kubectl edit. Yeah. Oh, let me get the name. God. That time is of the essence. Alright. So search for taints. That's the one.
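The two node operations debated here, draining the worker and removing the control plane taint, can be written out as command lines. The sketch below only builds the argument lists. Node names are hypothetical, and the taint key is an assumption: clusters of this era typically used `node-role.kubernetes.io/master`, while newer ones use `node-role.kubernetes.io/control-plane`. The `--force` flag is included because the drain error quoted above is raised for pods that no controller manages.

```python
from typing import List


def untaint_cmd(node: str,
                key: str = "node-role.kubernetes.io/master",
                effect: str = "NoSchedule") -> List[str]:
    """kubectl command that removes a taint; the trailing '-' means 'remove'."""
    return ["kubectl", "taint", "nodes", node, f"{key}:{effect}-"]


def drain_cmd(node: str) -> List[str]:
    """kubectl command that cordons the node and evicts its pods."""
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",      # DaemonSet pods would be recreated anyway
        "--delete-emptydir-data",   # allow evicting pods that use emptyDir volumes
        "--force",                  # also remove pods with no managing controller
    ]


# Hypothetical node names, for illustration only.
print(" ".join(untaint_cmd("control-plane-1")))
print(" ".join(drain_cmd("worker-1")))
```

Editing the node object and deleting the entry under `spec.taints`, which is what the team ends up doing, has the same effect as the taint removal command.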
42:18 Making Control Plane Schedulable
42:54 Bye bye bye. Alright. You now have a node that will schedule. Okay. We ought to restart anything for that. The pods should be scheduled there now. Yep. There you go. Okay. And the I've got the fun ones still coming up. To remove a little frustration, SSH into the worker and kill all of Rawkode's processes. Well, we we've removed that node from the equation. I'm hoping Yeah. Hoping. I wanna get rid of the stateful set too. Are you? I bet it's gonna come back up. Is it? No? No. No? Okay. So you have the v one image at
43:51 Attempting Application Upgrade (v1 to v2)
43:52 least. So let's get it upgraded. Oh, the image the image is there? Yes. So edit it and change it to v two and see what happens. Okay. Edit. Employed spelling spelling clustered. I think yeah. Oh, next one. Where'd it go? No. Well, I oh, I'm not looking at the stream. We got the image with the hair again. Again. So we we need to kill those processes. Okay. But that's interesting. Is it I mean, that's running on this pod, which presumably doesn't have the and you got pull always. Right? I just killed the the session for it.
45:14 Alright. I have logged off of that box, killing the two processes that appeared to be under my user ID. You might have to delete that image on on this part again, on the control plane, like, with ctr, crictl, whatever you want to use. Is it images? Let it yell at us. You're talking about the Postgres one. Oh, v two clustered. That's what I'm thinking. Okay. Deleted. And then I've removed or removing all their crap from the bashrc. Does that curl command work? Nope. Still getting it here. Sorry? Well, we still have those rogue processes
45:20 Further Debugging & Time Runs Out (Team East)
46:30 again, but I I need to remove Yeah. Okay. I have removed it. What? You said that control. I thought we got rid of that. Yeah. But I think Yeah. Me logging into this machine keeps bringing it back. What? This is gonna have to be explained. Yeah. So there's a process on that thing that SSH is into the control plane. Yeah. That's gonna be good. Where are we in a minute? What's what's currently wrong? It's still directing to the wrong database. Okay. We didn't look at some of the other license. We've only got less than a minute, anyhow.
47:28 Time is up. Alright. Thomas, do you wanna jump on and walk us through how to fix this before we switch teams? Come on. Looks like you're very active in the chat and you seem to know what's going on here. But apparently, there's an SSHD process. Which is running as Rawkode, not root. I killed it on the I thought I killed it on the worker, not on the control plane. Alright. So I've killed it again. So if we bounce the Postgres pod, we should be working again? Oh, I I closed out of my session. Somebody's still in.
47:30 Explanation of Team East's Challenge Issues
48:09 Oh, do do you need to come out? I'm doing. He said it's it is running as Rawkode. It didn't look like it to me. You probably wanna kill that postgres controller zero stateful set as well. That seems to be doing some bad things. Thomas says it's running as Rawkode. I'll look at it quickly. I mean, I don't see it. Alright. Here he is. Yeah. Oops. Hello. You hold off the page? Hi, Thomas. Did you get it did you get it fixed here? You did. We did. So what was going on with that? SSHD Postgres controller?
49:15 So the Postgres controller was overriding, adding a SQL injection attack to... Yeah. But on the worker, there was a program purporting to be SSHD, running as Rawkode, that would keep reapplying the manifests for people who thought to delete the StatefulSet. So I'm sorry. No. No. We were so close. But... No. It was good. You got it. Yeah. Well done. Yes. But there was a lot to fix there. Very well done indeed. Alright. Lairs. I'm gonna invite on Chainguard West and say goodbye to Chainguard East. Thank you for joining us.
49:55 Transition to Team West & Technical Issues
50:01 I really was on. That was great. And then I'll speak to in second. Just remember, tomorrow is April Fools' Day. Yeah. We tried violence everywhere. Alright. Let's see. Where is West? Is it fun watching someone try and fix your cluster, Thomas? That was a doozy. I'm actually really impressed with how much progress they made. I could not have made that much progress in forty-five minutes. That was incredible. Someone still has the stream open. Might be a good time to mute or close that. Alright. So... Can you hear me? Yeah. We can. Welcome.
50:50 Yeah. We can. Awesome. Someone has a stream open. Can you find the tab and close it, please? I have it. Is someone still watching? Someone's still watching. I'm not. Alright. I can mute these one by one, and we'll work out who it is. Alright. I could do. I believe it's Matt. Matt. Maybe my AirPods. Hello? Is that better? It is. Is that better? We're good. I don't hear myself. It is. Oh, no. I still hear myself. I hear myself. Okay. I'm still gonna keep talking. I've muted everyone else but Matt. I don't hear myself. Alright. Let's swap around.
51:30 Resolving Audio Feedback
51:50 Alright. Is you. Do you have the stream open? I've never said that handle out loud before. I don't know if I said it right either. Audio is breaking down at... what was that? You have the YouTube video open, I think, somewhere. Like... You have video sync somewhere. Right? The YouTube video? Nope. I have the chat, but that's it. Okay. I can't hear myself right now. Oh, no. I can't. Yeah. It's definitely coming from you. So you have a tab somewhere with the stream open. I found it and muted it. Awesome. Thank you. Alright. Let's get this underway. So you
52:49 Introducing Team West
52:49 guys are all a very, very mean team, but that was great fun to watch. So let's start with some introductions and then we'll see what Team East has in store for you. So let's start with Thomas, work our way around clockwise, and we'll get this underway. Hey, folks. My name is Thomas Stromberg. I am a software engineer at Chainguard, but I've done some other random things in the past, including being a minikube maintainer, which helps me understand some of the underlying issues here and know some of the bad things to introduce into a system since
53:09 Team West Introductions
53:31 people seem to keep introducing it into mine. So but nonetheless, yeah, I'm glad to be here, and it was awesome to see how Team East fixed our cluster. That was just an absolutely incredible job, folks. So... Awesome. Thank you. Who wants to go next? Don't be shy. You're both muted. Again. There we go. So is that all gone? Maybe. Yeah. My name is, and I am also a software engineer at. And I also work in Kubernetes, in SIG Release mainly. And I'm kind of scared to see what our friends have left for us.
54:28 Awesome. Thank you. Matt? Hey. Can you hear me? Yeah. We can. I did. Okay. Yeah. There was a mute icon over my picture even though I was unmuted. So hi. I'm Matt. I'm one of the founders of Chainguard, and I am incredibly nervous. That was a lot of fun, and congratulations to Team East. That was very well done. Yeah. They did great. They were so close. I think we should just give it to them as a fix. Like... Oh, yeah. Totally. I think I was hindering them more at the end than they were
55:05 themselves. So let's get my screen share up. Here we go. I am going to open a session on Chainguard East control plane one this time. Once this opens, please use Activity, Active Sessions to join. Type echo or anything so that I know you are there, and then we will get this underway. So let's see who we have in our cluster. I got access denied. Oh, I haven't given you access yet. Oh, there you go. That's a problem right off the bat. So apologies. Here we go. I wonder if this were our friends. You have access now. You can join my
55:12 Granting Access to Team West
56:03 session. Promise. There we go. We got our first echo. Alright. Matt's in. Echo. Enjoy, Thomas. I can't hear him. It looks like he's talking now. Thomas, you're on mute. Oh, is he frozen? There he is. Hello? No. I think he has frozen. So until Thomas... oh, he's in at least. So that's good. Alright. Set up your kubeconfig. Check for a control plane and best of luck. I'll reset your timer now, just so you've got a bit more time. Alright. So let's see. For the first thing, I think we should check the bash history. Empty. Covered their tracks. Nice try. Nice try.
57:10 Team West Begins Diagnosis (Forensics & Initial Checks)
57:40 Let's see. So the next bit of advice was to dump some useful events with Sleuth Kit. Welcome back, Tom. sdc3. Okay. You want this with /dev/sdc3. And Sleuth Kit returns. Not the first time. Thomas brought this on many episodes ago. Let's see. Alright. So... Some of the work we did was try to hide that. So for anyone in the audience that's not familiar with Sleuth Kit, who wants to give the TL;DR about what fls is doing? I mean, Tom, it's good. Yeah. Basically, yeah, it's generating
58:58 a file system timeline, is what this is doing, of all the most recent changes to the control plane. It may not be as useful since people know about this trick, but it definitely was effective the first time I presented it. So, yeah, Sleuth Kit, old school forensics tool. Right. Even shows deleted files. Did it recover the bash history? I'm afraid not easily, but kind of. There we go. I don't see... I mean, a lot of this stuff just looks like system libraries, probably pulled in through the apt-get install. So I'm assuming Sleuth Kit is handy if there's
59:51 rogue things still on the file system. But if people broke the cluster explicitly through kubectl, it's probably not as useful. Right? Because it's only etcd that's really gonna change. So let's see if we can... alright. No control plane. Maybe see what it looks like. Yeah. Let's see. Is it still on my clipboard? No. Let me grab the app. Oops. If we check this out, does this look right? I guess we can... I don't know. Should it be going, or... let's see. I mean, I can only see the key data. I can't see anything.
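The file system timeline idea described above can be approximated in a few lines of Python. This is only a rough sketch: it walks a directory and sorts files by modification time, it is not Sleuth Kit's `fls`, and it cannot see deleted files. The function name `recent_changes` is made up for illustration, and modification times can be forged, so treat the output as a hint rather than evidence.

```python
import os

def recent_changes(root, limit=20):
    """Return (mtime, path) pairs for the most recently modified files under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable, skip it
            entries.append((mtime, path))
    entries.sort(reverse=True)  # newest first
    return entries[:limit]
```

Calling `recent_changes("/etc", limit=10)` on a node would list the ten most recently touched files under `/etc`, which is where kubelet and CNI configuration usually lives.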
1:00:11 Fixing Kubeconfig & Checking Node Status
1:00:56 I have to debug right now. Oh, I can resize the window. Is that more readable? It is. Yep. At the very top... There we go. Yeah. At the very top, it's basically an IP, 443, and I assume that's right, but I guess we should take a look at the running containers. Let's see. So we need an API server running. So I guess we should take a look at the containerd stuff to see what containers we have. So let's... so you see a control plane. So we're on the control plane node. I don't... I
1:02:16 see the API server here, right, in the middle. So should we check out the logs on that? Let's see. Think you're overlooking something subtle. The right one. Right? Oh, it should be 6443. There we go. In the kubeconfig. I didn't wanna tell you. I didn't wanna just burst out with that since you were so cruel to the other team. Let's see. You got it all. Alright. So looks like that may have restored some stuff. But it looks like some pods are failing to schedule. So I'm guessing our node is AWOL. It may have the same problem on
1:03:18 the node. Let's see. kubectl get nodes. Bro. But, you know, they're not ready, including the control plane. So if we... let's see. Get pods. Pending. Yeah. Okay. So we've got one terminating, one... let's see. Someone talking? I hear something very faint. It may just be an echo, though. Yeah. I think there's still a YouTube tab somewhere in the background. Alright. We're gonna check for old tricks. Alright. No webhooks registered. So let's describe the nodes, I guess. I'm assuming this is all James. What does it mean when a node is not ready?
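The kubeconfig fix from a moment earlier (the server port was 443 where it should have been 6443) is the kind of thing a small check can catch. The sketch below is a minimal illustration in Python, assuming a kubeadm-style cluster where the API server listens on 6443 by default. The function name `check_server_url` and the sample IP address are hypothetical.

```python
from urllib.parse import urlparse

DEFAULT_APISERVER_PORT = 6443  # kubeadm's default secure port for kube-apiserver

def check_server_url(server, expected_port=DEFAULT_APISERVER_PORT):
    """Return a list of human-readable warnings for a kubeconfig `server` URL."""
    warnings = []
    parsed = urlparse(server)
    if parsed.scheme != "https":
        warnings.append(f"scheme is {parsed.scheme!r}, expected 'https'")
    # A URL without an explicit port falls back to the scheme's default port.
    implied = {"https": 443, "http": 80}.get(parsed.scheme)
    port = parsed.port if parsed.port is not None else implied
    if port != expected_port:
        warnings.append(f"port is {port}, expected {expected_port}")
    return warnings
```

For example, `check_server_url("https://10.0.0.1:443")` returns one warning about the port, while the same address with `:6443` returns an empty list. A cluster that deliberately fronts the API server on another port would pass its own `expected_port`.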
1:04:30 Well, it should show up in describing the nodes. Right? Unschedulable. NoSchedule. Let's see. That's, I think, normal for the master. Right? Let's see. Where's the resource settings? So allocatable, there's a lot of CPU and a lot of memory. Okay. Ready. Oh, it says, let's see. Complete posting ready since said former enabled. Not ready. True. Reason, kubelet ready. So not ready. True. Kubelet ready. This I don't know. Maybe that's normal, but that looks strange. Let's see. And then the other node is Ready, Unknown. Node status unknown. Kubelet stopped posting node status. So there's something clearly broken with the
1:05:54 with the worker. Let's describe them one at a time. Should we try and get the control plane working first? Have you checked the application yet? It's not... the pods aren't scheduling. Okay. Right. kubectl get pods. Right? So, I mean, the pods are a mix of terminating and pending. Right? I mean, we can take a look at the application, which is running on a NodePort here, with ip addr. Space addr. Sorry? ip, space, addr. Oh, where are we looking? That's a modified piece. So when you described the nodes, something stood out to me.
1:06:16 Examining CNI (Cilium) Configuration
1:07:11 A lot of the checks, the timestamps were quite old. Sorry. When we what? When you described the nodes. We described the nodes. Yeah. And when you were looking at the status checks, like, these timestamps are quite old. These timestamps? Well, the last transition time. The network one specifically. You mean the last heartbeat? The last heartbeat. Yeah. Yeah. So let me take a look at... so Cilium is running, but... Yeah. So it's running, but it was updated two days ago. I guess we can take a look at those. Now that we have the control plane back,
1:08:23 we should take a look at the images we are running. So let's take a look at these. Let's see. So we've got Emissary. We've got kube-vip. We've got a bunch of Kubernetes images. And then we've got the Quay Cilium images, which are cilium and operator-generic. Those look like the right versions and the right hashes. So it doesn't look like they tampered with the actual specs. Let's see. Maybe check how long some of those have been running, to see if we can maybe get an idea if they were redeployed. I mean, they've been running for
1:09:26 two days, two hours, it looks like, but it says... so I guess we can... Cilium's probably running as a DaemonSet here. Come here and get that. So let's see. So the actual thing itself is two days, two hours, and that's how old the pods are. They may have messed with the pods. I mean, we could try bouncing the pods. Let's see. Pod session. What's it? Cilium system. That namespace. No. They are static pods. Or they're running somewhere else. Let's see. Get pods. So they are running in kube-system. Oh, that's fine. Okay. So let's see.
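The stale heartbeat timestamps spotted when describing the nodes can also be flagged programmatically. The sketch below is a minimal Python illustration that takes node conditions shaped like the `status.conditions` list of a Node object and reports which ones have an old `lastHeartbeatTime`. The five-minute threshold and the function name `stale_conditions` are assumptions chosen for the example, not values taken from the episode.

```python
from datetime import datetime, timedelta, timezone

def stale_conditions(conditions, now, max_age=timedelta(minutes=5)):
    """Return the types of node conditions whose lastHeartbeatTime is older than max_age."""
    stale = []
    for cond in conditions:
        # Kubernetes prints these as UTC timestamps such as 2023-03-29T10:00:00Z
        beat = datetime.strptime(
            cond["lastHeartbeatTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - beat > max_age:
            stale.append(cond["type"])
    return stale
```

A condition whose heartbeat is days old while the others are a few seconds old suggests the component that owns it has stopped reporting, which is the pattern the host was hinting at.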
1:09:55 Diagnosing Kubelet & Authentication Issues
1:10:45 Next pod. We can describe this pod. No events on our projected volume. Let's see. But, yeah, the config map, I wonder, for Cilium. I wonder if anything has changed there. Because I saw that Cilium did restart. So where are they mounting it from? I mean, it's gotta be in kube-system. Mounts. Cilium. Config map. Let's... volume. So that's the mount. Let's see. Config map names. These which are post pads. Third API access. So there should be a config map. Map. Yes. Here we go. So it's cilium-config. Alright. So I got
1:12:04 see, I'm three. That doesn't mean anything. Okay. So let's see. I have no idea what this is supposed to look like. Let's see. So we missed you proxy since is some sort of mesh setting. Node port bind protection true, monitor aggregation. I guess it should be in the clustered repo. Right? So we should be able to compare with what it should be. Right? Slim endpoint, secrets feed two, port ranges. C b o. Nothing is jumping out at me. This feels like the standard Cilium config that I deploy with Helm. It doesn't mean I've not missed anything, but it
1:13:16 looks alright. Yeah. So, I mean, we could try bouncing the pods. What do you think? I would encourage it. I'm curious to see if it works. I reckon at one, I think. Let's see. Let's not delete everything. Let's delete... Maybe just bounce the clustered pod and see the error messages and... Because of the fact that we have a clustered pod in a terminating and pending status, I think there's an issue with the kubelet on that node. Alright. So kubectl get pods. So we wanna try, like, deleting the terminating one?
1:14:05 kubectl delete pod. I mean, it's hanging. Yeah. I think what's happening is the API server is trying to delete that, but the kubelet is unresponsive on the worker node. Right. And it's not been able to update the statuses of things. You might wanna... I don't know. I mean, I give the worst replaced ever, but you might wanna jump onto the worker node. Let me open up the session. Yeah. I wanna look at the logs from the API server now that we can. Yep. And I would be curious on the worker node
1:14:54 to check for any modifications to /etc for, like, CNI files or kubelet files. There you go. Unable to authenticate. Invalid bearer token. So I'm guessing they messed with the auth of the kubelet on the worker. There's a session open. Feel free to join the worker node. Alright. And you have twenty-four minutes. Alright. No pressure. See. Session in progress. Let me make sure I'm connected to the right one. Alright. Join session. A question from Robert Cable: bearer token? This is the API token sent to the API server to authenticate requests from other nodes
1:15:14 Worker Node Access & Certificate Investigation
1:15:48 in the cluster. Good question. Let's see. So I'm not sure where the kubelet pulls that from. Okay. Are you on the worker node? The same one as me? Yes. You are. Awesome. Yeah. You can see me diving. Okay. Let's see. Take a peek at the kubelet flags. That's a... I guess we can look at Sleuth Kit here. Let's see. So no kubelet config. That's exciting. Can we just take a look at /etc/cni? I'm just curious. I feel like there's something with subnets somewhere, and that would be a curious place to look. Is that enabled debug?
1:16:39 I mean... I don't see any subnets there. So... but the kubelet config maybe. Yeah. We can look at the kubelet. So we didn't set up a kubeconfig here, did we? There is no Kubernetes access from the worker node unless they have left it somewhere. There is a kubelet.conf inside /etc/kubernetes. Oh, and a honk. We'll get back to that. Oh, yeah. Port's not in here. Current .pem. Okay. Well, we saw auth problems. Right? I'm wondering whether there's something funny going on here. What's in the honk? Yeah. Well... Yeah. That seems a bit weird. Is that right?
1:17:43 This looks funny. We'll come back to that. Let's see. So kubelet.conf... you said honk. Yeah. So let's... nothing. Let's... I don't believe you. Okay. Well, I think that directory is probably just broken to signal that something's going on here. So I think that this is clearly our problem. We need to replace this with probably one of these other two, the .crt or the .key. Let's look at... I'm just looking on the server side at what that's set to for the config. Maybe check the kubelet logs? We can do that. Let's see. You have twenty minutes.
1:18:12 Corrupted Certificates & Considering Kubeadm
1:18:51 Time flies when you're having fun. Right? Yeah. So let's see. So we want service. Here we go. So we want this. Failed to start container for Cilium operator, CrashLoopBackOff. So do we think it's a static pod? No. Cilium runs as a DaemonSet. Alright. Well, I'm wondering whether... so syncing pod. Let's see. Actually, back off, though. It's interesting. Your Chainguard friends from Team East are encouraging you to read the certificate. Read the certificates. Okay. Let's see. So that was in, what, PKI? It was in /var/lib/kubelet/pki. /var/lib/kubelet/pki. Yeah. I haven't seen this setup before
1:20:16 with client current and timestamps. I don't know specifically if this is how the certificate rotation happens with Kubernetes. Maybe someone can fill us in if they know. I haven't seen it before. Interesting. Yeah. I think you may wanna use openssl here. I'm wondering whether they... who knows openssl? x509. Just without the dash x, and -in to the file, -text. Minus in... minus in, file. Yeah. Dash in. Okay. Mhmm. Alright. Then the file. Then -text. Okay. Yeah. Alright. Issuer, Chainguard East. Is that just because of the cluster name,
1:21:25 or are they messing with us? I'm not sure. I'm having problems. Oh, there you go. Yeah. Look at the validity. Oh, no. Wait. They're okay. Yeah. I thought it was 2022. It's 2023. Are you doing that? Yeah. I am. I wonder if we can rejoin the cluster with kubeadm. Don't know what's the kubeadm. Some doubts off my head. But... The chat is saying those are expired. But they don't look expired. Right? No. They... what's... They're okay. This is the thirty-first. Not After, March 29. Yes. This is invalid. Right? Because it's the thirty-first. But it's also
1:22:23 2023, Not After 2023. Oh, okay. Yeah. I made that mistake too. Yeah. Okay. Alright. So this is around when the cluster was set up, I think. Yeah. I think so. Let's see. I'm looking at the same files on the... Alright. We got sixteen minutes. Yeah. What about the hints? Yeah. Are there hints? For the hints? I think. So Carlos is suggesting who owns the certs, and James is saying maybe look at a different cert. Okay. So I'm getting the vibe that they definitely fucked with the certs. Let's see. What are the other files? Oh, we look
1:23:07 at the contents of this directory. Well, they may have covered their tracks. So... We know by now on Klustered never to trust the timestamps on a file. Right? Alright. So we've got this. We've got... let's see. They at least look like certs. We want what? You're at East worker one CA. Where's the validity? Yep. 2023. That looks... There we go. Unable to load, get_name, no start line. Well, I don't know. That's not a cert or a key. So I don't know. Let's see. Shall we get a hint? Sure. I'm poking at the same files on
1:24:20 the control plane. Let's see. Let's look at the file. Says it's 2359. Yeah. So the size of kubelet.crt is different on the two things, and... Yeah. They're pretty close to the same size on the key. I don't... but I guess the .crt file is... it's the same openssl. Me restriction. Yeah. I don't know how to fix these certs. So I guess we can try kubeadm. So if we... let me see if my thingy is still here. Yep. That custom data, that's how we provision the cluster. And inside of this is the join token.
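The back-and-forth over whether the certificate had expired (a Not After date of March 29 read on the thirty-first) is easy to get wrong by eye. The Python sketch below parses a validity date in the form that `openssl x509 -text` prints and compares it with the current time. The helper names are hypothetical, the time of day in the sample strings is illustrative rather than read from the stream, and month-name parsing assumes an English locale.

```python
from datetime import datetime, timezone

def parse_openssl_date(text):
    """Parse a validity date such as 'Mar 29 00:00:00 2023 GMT' into an aware datetime."""
    if not text.endswith(" GMT"):
        raise ValueError(f"expected a GMT timestamp, got {text!r}")
    # %b matches the abbreviated month name, which depends on the active locale.
    naive = datetime.strptime(text[:-4], "%b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=timezone.utc)

def is_expired(not_after, now=None):
    """True when the certificate's Not After time is already in the past."""
    now = now or datetime.now(timezone.utc)
    return parse_openssl_date(not_after) < now
```

With `now` set to midday on March 31, 2023, a Not After of `Mar 29 00:00:00 2023 GMT` is reported as expired, while the same date in 2024 is not, which is exactly the year confusion the team ran into.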
1:25:23 Attempting Kubeadm Reset & Join
1:25:35 I'll just leave that there. You can use it if you want. Okay. So, yeah, there's a kubeadm join, and one of the flags is, like, to provide that token. But I don't know if it'll reset the certs. Main API server endpoint. So this is looking for... A kubeadm reset would probably wipe all those certs away. Right? The endpoint is this. So we want, what, API server... join, API server endpoint. So that's this. Yeah. There's also the exact command in the cloud-init config, which I can point out to you as well if that helps. Like,
1:26:28 rather than working through that. Varlib instance. It's just a directory called script. Living. Grip. Join. Yeah. Part seven. kubeadm join, token, discovery, endpoint. Yeah. What's the name? We may want to do the kubeadm reset, though, first, just to make sure that the old configs get blown away. Okay. Sorry. kubeadm. I lost focus for a second there. I'm gonna copy this line just to have it. So reset. What flags does this take? We flag remove its number. Yeah. I don't think you need any flags. I think you can just run it. Any changes made.
1:27:21 It'll be reverted. Yes. Okay. Alright. Do we think that... oh, do we think that was... Now we need to join it. Yeah. Man, yeah, I don't know if it removed the certs, but hopefully it did. So let's try that join again, the kubeadm join that was in that file. Yeah. That's here. The join... so we need the join token. So let's see. That was... Search for custom data. This is... yeah. That custom data. So the join token, what's this? And the DNS name is also there. And for free, blah blah blah. Okay. DNS name is
1:28:16 East. Let's see. Oh, I have an echo in there. Kubernetes can proxy forbidden. Cannot access config maps in group system. There's been a value for resolv.conf in... this is interesting. Of course. Because when you upload this information to the config map, it only lives for two hours. Alright. We're just gonna have to call it on the worker, and work on the control plane, I think. I don't think we can recover from that kubeadm reset. Oh, the worker is ready, apparently. They would look at that for some time. So let's see. Shit. Let's check our pods.
1:29:13 So the pods are all running now. Should we try hitting it? You can try the admin. Yep. I guess we can... oops. curl localhost. Connection refused. That's not surprising. Alright. So there's a bunch of stuff still not running here. This app's not running. So the things that were terminating seem to have... okay. And there's a new replica set on the clustered deployment. So... There's a good comment in the chat, as you wanted to rejoin the worker, but I don't think you need to. We could have used kubeadm token create to get a new working
1:30:06 token. But I think we can just work with a control plane. Let's see. Let's take a look at the YAML for this because there's a new app. Rolling updates. Wondering what they changed. Yeah. Okay. The health checks are on 9090. That looks wrong. Let's look at the replica sets now. Yeah. Everything runs on 8080. Yeah. Whoops. Let's see. There should be two replica sets. We can look at the old one. 8080, and then let's see. View before, and the other one was this. And then after, and just see what else they changed. After.
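The mismatch found here (health checks pointed at 9090 while the application listens on 8080) can be detected by comparing probe ports with the ports a container declares. The Python sketch below takes a dict shaped like one entry of `spec.template.spec.containers`. The function name `probe_port_mismatches` is hypothetical, and since declaring `containerPort` is informational in Kubernetes, a reported mismatch is a prompt to look closer rather than proof of a fault.

```python
def probe_port_mismatches(container):
    """Compare each HTTP probe port with the ports the container declares."""
    ports = container.get("ports", [])
    declared = {p["containerPort"] for p in ports}
    declared_names = {p["name"] for p in ports if "name" in p}
    problems = []
    for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
        http_get = container.get(probe_name, {}).get("httpGet")
        if not http_get:
            continue  # probe absent, or not an HTTP probe
        port = http_get["port"]  # may be a number or a named port
        if port not in declared and port not in declared_names:
            problems.append(
                f"{probe_name} targets port {port}, container declares {sorted(declared)}"
            )
    return problems
```

Feeding it a container that declares 8080 with a liveness probe on 9090 yields a single message naming the liveness probe, which matches what the team saw when diffing the two replica sets.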
1:30:11 Investigating Application Ports & Editor Issues
1:31:10 Alright. So they changed the ports, and then the rest looks real. A failure threshold changed. Bill or replicas for your... okay. So that's status. So let's see. So editor, nano, kubectl edit deploy. Weird. Well, that's fun. I can't even Control-C this. They fucked with me. What have you done? Maybe it's still not working. Alright. I've opened a new session on control plane east. Nano does appear to be okay. Let's see. Move tab to the other group. Alright. Well, that's fun. I don't... it's... nano breaks on these shared xterm.js things. Oh,
1:31:59 Fixing Application Ports (via Emacs)
1:32:31 well, okay. I'm installing Emacs. I can't... I don't speak Vim. Eight minutes left. No. I need Emacs. Maybe a bio break. I don't know. This is taking a while. Shouldn't back out as next minutes to see if it finishes installing the time. Yeah. This is... There we... so let's see. kubectl edit. Let's try this one more time. Oh, you need to re-export your config as well. Export. Yeah. Okay. Let me grab the thing from my cheat sheet. KUBECONFIG, admin.conf. Alright. Five and a half minutes. I have confidence in you. Alright. I have here. We've never seen an Emacs
1:33:51 on Klustered before. That's a first, by the way. Just... Well, that's because when we broke your cluster a year ago, whatever we did to rename vi to emacs didn't work. Let's see. 8080. This should be 8080. There it is. Yeah. It should be 80. Does Emacs have search and replace? It does. Because vi does. It's really good. It does. I've been using VS Code. I'm a little rusty. That's... It's the shit. I mean, it's not installing VS Code web on these servers. I think, actually, a lot of people would really appreciate
1:34:45 that. Let's see. HTTP localhost at 30000. Still a connection refused. So let's see. Let's check that service as well. Let's see which service also changed. kubectl edit... that would be service clustered. 9090. Okay. So we want 8080, 8080. I keep using nano. Alright. It did not... oh, for some reason, it's sticking garbage at the beginning of the file. Let me save. And here, this garbage. 8080. Okay. That'll be fuck up. No. It's not okay. They also messed with this. No. Here we go. Failed to connect to the database. Three minutes. Yeah. So they probably changed the ports on
1:36:16 Fixing Database Service Port & Typos
1:36:20 that too. Yeah. And it's there. It's maybe. So let's see. Postgres. What port is it supposed to run on? 5432. That's the MySQL port. Right? It's the MySQL port. Yeah. Nice. Let's see. So I'm guessing that there's some similar changes on the Postgres side. So that's running as a StatefulSet. PostgreSQL. Uh-huh. The service is postgres. The StatefulSet is postgresql, mostly because I hate people. I really need to fix that. It should be the same. So this is named 5432. Yeah. Okay. Port 5432. Is that what the port is supposed to be?
1:37:14 It is. Yeah. But somewhere there's a 3306 in one of the configs at least. In the service, was it not? No? And there's garbage. Yeah. I think there's garbage at the beginning of the file too. Yeah. Yeah. I'm not saving it. So they may have just messed with the service on this. So 5432? Yes. Just put Postgres. Yeah. There we go. Yeah. For some reason, it's inserting garbage every time it's opened. So let's see. Yeah. There it... Yeah. 5432. 5 4 3 2. Great. Right? Yep. Alright. Let's curl. Failed to connect. Let's give it
1:37:16 Finalizing Database Service & Upgrading App Version
1:38:02 a second. Failed to connect. Not quite. Let's see. You were pretty quick on the endpoints changes. Let's grep endpoints. None. Do you think by fixing the typo in the service, you broke the... Oh. You mean, the typo in the service. So you think that maybe it's... yeah. Get deploy. You've got all. Get deployment. StatefulSet. Well, I only changed the selector in the deployment. Right? Well, maybe they've done the same with this service. That was the first one. Clustered. Oh, yeah. So this looks right. App... plus... We can look at the same thing on the PostgreSQL.
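An empty endpoints list, as seen here, usually means the Service selector matches no pods. The Python sketch below shows the matching rule for equality-based selectors: every key and value in the selector must appear in the pod's labels. The function names and the label values (`app: postgres` versus `app: postgresql`) are hypothetical, chosen to mirror the naming difference between the Service and the StatefulSet mentioned in the episode.

```python
def selector_matches(selector, labels):
    """True when every key/value pair in a Service selector is present in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())

def matching_pods(selector, pods):
    """Return names of pods (dicts with 'name' and 'labels') that the selector would pick."""
    if not selector:
        return []  # a Service without a selector gets no automatically managed endpoints
    return [pod["name"] for pod in pods if selector_matches(selector, pod["labels"])]
```

If the only database pod carries `app: postgresql`, a Service selecting `app: postgres` matches nothing and its endpoints stay empty, while correcting the selector value brings the pod back.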
1:39:11 So PostgreSQL, and if we look at the service, which is postgres. Promise to fix that for next time. Postgres. Postgres. Oh my god. Let's see. Postgres. Time is up. Nice to see if we can finish it in the next few minutes. I feel like we're close. I blame Emacs. Installing Emacs. Let's see. So that, curling it. Hey. It's back. If we edit the deployment. v1. Keep hitting Control-O. Alright. And now we have grab it. Some sort of end with the Control-L. Alright. So v1, v2. I think we may be done.
1:40:25 Yay. Did you curl it? Yeah. It should v two web m. Oh, there we go. Perfect. If you wanna try it, we can confirm, but I... Yeah. I think you got it. Let's have a look anyway. Right? So east. Sometimes it's... You may need HTTP. So as well. Let's see. Okay. Oh, there we go. Sometimes it just doesn't work, and I could role play an IP. But you do have to dance. But I'm really confused right now because there is no worker node. I think it's there. Think it's connected. I think
1:40:34 Cluster Fixed & Final Application Check
1:41:30 Yeah. Yeah. Yeah. I think it got joined successfully. But we kubeadm reset it, and the join didn't work. Right? The join did work, but it did error a bunch. But it did say in the end that it did join. Oh, alright. Alright. Alright. Oh, good. There we go. Phew. That was tough. Yeah. After the hours, got showed up in there. Wow. Must be fun to work at Chainguard because they're all pure evil. We use our power for good. Uh-huh. Alright. Well, it was good of you to use your power for evil for Klustered at least.
1:41:54 Conclusion & Wrap-up
1:42:08 Those are two great breaks and two great fixes as well. That was really good fun. It's much more fun breaking them. Yeah. I guess it is. I mean, when you do show up to a fix episode with Emacs though, I mean, things are always gonna go wrong. I actually really like Emacs, but I've never got the hang of all the different key combinations to really be effective with it. The funny control characters and my completely broken muscle memory from using VS Code and nano on my MacBook, I think, have ruined me for Emacs, but
1:42:51 it's still faster than me trying to use vi. So... Yeah. I might actually experiment with running the web version of VS Code on the server and just giving it access to the entire disk. I think that could be funny. Alright. Well, thank you all for joining me. I hope you enjoyed that. I'll let you get back to your day. To everyone who watched, thank you for all your comments in the chat. It was a pleasure watching you help out. And it's always fun for me to try and hinder the teams as much as possible.
1:43:23 I really felt guilty after that kubeadm reset when I didn't think we'd get that back. So I was really relieved that that worked. And thank you to our sponsors, Equinix Metal and Teleport. Mhmm. Klustered will be back next week. Until then, have a great day. Thank you. Yeah. Thank you.