Jetstack & CrashBeerBackOff | Rawkode Academy

Watch / Klustered Live

Overview

About this video

What You'll Learn

Fix Teleport session access issues caused by ACL and join permission changes when joining Kubernetes nodes remotely.
Troubleshoot broken control plane and worker behavior by reading kubelet, API server, and deployment controller logs.
Detect and fix networking, manifest, and scheduling mistakes in iptables rules, kubelet secure ports, and node selectors.

Team JetStack and Team CrashBeerBackOff tackle two broken Kubernetes clusters, debugging containerd log limits, a disabled deployment controller, kubelet service typos, and iptables NAT rules redirecting API server and DNS traffic.

Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

2:07 Introduction to Klustered and Guests

2:07 Hello, and welcome back to the Rawkode Academy. I am your host for today, David Flanagan, although you probably, and hopefully, know me from across the Internet as Rawkode. Today is an episode of Klustered. It has been a few weeks since we've had a Klustered, but I'm very excited to have two great teams to work with us to fix, well, some unknowns. And we'll get to that in just a minute. But I am on a new Mac, which means I don't have any of my scenes prepared, but, you know, I do want to thank both of our sponsors, Equinix Metal and Teleport.

2:39 Sponsor Acknowledgments (Teleport & Equinix Metal)

2:39 So we use Teleport every single week on Klustered. It's how we give the teams access to all the clusters using GitHub authentication. It's very, very cool. And it's also what we use live on the stream when we're debugging. So we actually get shared terminal access where we can all collaborate and pair together, which is awesome. So go check out Teleport by visiting Rawkode.link/Teleport. That keeps the sponsors happy and means we can keep doing more episodes of Klustered. Also, all of the hardware that we use for Klustered, and we go through a

3:12 fair amount of bare metal clusters here. We're not doing things cheaply. It's provided by Equinix Metal. So thank you so much to Equinix Metal. You should try it out. We have a $200 code. Just use the code Rawkode. And if you want to use that, go to Rawkode.link/metal. Simple. That $200 will get you, like, anywhere from a hundred hours, two hundred hours depending on the instance that you use. I enjoy spinning up the big, beefy ones, but you go nuts. Do whatever you want. Alright. Thank you to both of them, and thank you to our teams.

3:45 It's no easy task getting people to join me on a session and do live streaming of fixing broken Kubernetes. It can be stressful at times, as much as we enjoy it and learn and watch and have fun. So I really wanna say thank you to CrashBeerBackOff and JetStack. And we are joined now by team JetStack. Hello, team. How's it going? Hey. Yeah. Good. Alright. Could you please do me a favor and introduce yourselves? Just a little bit about you, and then we'll get started with this cluster. Cool. Yeah. I'm David Cullen. I'm a leading staff solutions engineer at JetStack.

4:00 Meet Team JetStack & Introductions

4:27 Hey, everyone. I'm Peter Phillips. I am a customer reliability engineer at JetStack. Hi, everyone. I'm really happy to be on. My name is Tom. I'm also a solutions engineer, as well as David. We're working at JetStack. So we're a consultancy working with customers and doing Kubernetes stuff. More recently, I've also been working in software supply chain security and trying to sort of better understand that space. So less time breaking Kubernetes, which I was really good at before. But, hopefully, today I can do a little bit better and fix some. Awesome. Alright. Well, with that, we're gonna jump straight

5:03 Starting with Cluster 1 (CrashBeerBackOff)

5:06 to it. So I have my my screen shared. I have given everybody access to all of the machines. I'm going to open a session on crash beer control plane one. Please look under active sessions and join the session. And just so we we oh, I need more faces on this. What am I doing? It's been so long. I don't even remember how it cost them. Alright. I'm gonna do this ad hoc. If you could all join the session, please. Just type echo or anything like that so that I know we're all on the same page. That would be fantastic.

5:43 This is what I get for buying a new machine and not having everything set up. Okay. I get an access denied. No. Yeah. Same goes for me. But our faces are on the screen. So well, we're halfway there. We're ready. Alright. We are running a new version of Teleport, so I'm assuming I have also broken something. So I'm gonna come into the JetStack role. I have given you access to all the nodes. You have access to log in as root. Looks fine. Still no access to join sessions? Hello? I bet this what that rule out. Okay.

5:48 Troubleshooting Teleport Access Issues

6:30 I don't know what to do. I'm gonna have to fix this first, aren't I? So this is a good excuse to look around Teleport though. Normally, I just update the node labels, which gives you access to the nodes. Can you join this? Can you open your own session on CrashBeerBackOff? You wanna try that? Do you get permission denied too? Are we okay to enter as root? Or Yeah. Go for it. I just wanna know if that works. Yeah. I got it. Yeah. I got the same. Should be fine.

7:15 Alright. Do we want to join each other's sessions rather than Well, I'll tell you what. I'll join one of yours then. I'll join David's. But I'm root. Come on. I can't believe I Klustered Klustered. Right. Okay. Access denied. That was useful. Alright. Let's take a look at the Teleport config. So simple, there's nothing to do here. Alright. So I'm going on to the Klustered Teleport node. So this is where everything is configured. There's going to be, won't be GitHub in here. It's okay. But there's a join token, please nobody join my cluster. This must be a new change.

8:39 I'm not sure what to do. I'm just going, why can I not join? Teleport join cluster permission. Because it must be that Teleport have changed the ACL, and there's a new permission for this. So resources session. We need any of this stuff. We don't need to press join session. Okay. Yeah. This is new. Okay. Don't bite. Welcome to this episode of Klustered where we turn the tables. Yeah. I need a shiny name for that's been teleported. It doesn't really work with the telephot. Let's call it telephot. Okay. So there is some sort of join

10:13 sessions thing here that I can apply. When Klustered gets Klustered. So Klusterception. Resources session, list and read. Alright. We'll try like, I don't wanna spend long on this, so we will try very quickly. And if we can't, I guess, I'm just gonna be doing a lot of typing and you're gonna be doing a lot of talking, or maybe one of you could share your screen. But let's try and fix it first. I don't wanna be here. I'm gonna go to role, JetStack role. And we want to allow those rules. Is that gonna work? Alright. Now I don't know what's gonna happen here

11:17 if this is even going to work. But could one of you just try and join that session? Yeah. It worked? No. Damn it. Okay. I'm root. This is what I don't understand. The tctl users get, where do I get users? Oh, wait. I'm not on my control. Right. Let's try one more thing, and then we'll have to come up with a backup plan. Very sorry about this. So from here, I can run tctl commands. So I should be able to run get users. But let's do a YAML file and modify all of our user accounts. Now

12:24 what is annoying is I am here and I'm going to oh, actually, because I don't use the normal roles. So let's do auditor, access, and editor. These are the main roles and I'm gonna do that for all users. It's gonna be fun. So three. Alright. Hey. Well, if anyone wants to learn how to configure Teleport, I do have a whole bunch of videos on my YouTube channel. So we've got this. Role. Let's just search for roles, shouldn't that? Crash beer. Crash beer. And we're good. Okay. Now we can do tctl. I think it's update -f user.yaml. Or is it just update? I don't know.

13:24 Thanks. I remember how to apply the config. Create -f user.yaml. Alright. Now we all have mega roles. Now I don't think the role permission is gonna work right away, so you may have to log out and back in. But, however, you should hopefully be able to join my session because I've given you access to pretty much the entire house. So let's go. Peace. And if I don't see somebody typing in the sessions, then I'm gonna cry. No. I don't think I've got access still. Nope. So you can't join? Unfortunately, not. Do we do we think of this do

14:26 do we think at this point? Did you log out? Did that have my share screen? Or I logged back out. Like, out and back in again. It still doesn't work. Yeah. Say Don't run that. I logged out. Teleport internal join. Access denied. If I keep it, it gives you everything. Users, roles: auditor, access, editor. Oh, logins is just wrong. Alright. Okay. Last try. Yeah. User. We've got this. I'm not the one that's supposed to be under pressure here. That's supposed to be feel completely safe. Alright. So I think what I missed here is logins.

15:39 So I think you've got the permission now, but not for this user. So I'm gonna pop all of this in. Let's save. Create dash f. Alright. All the users have been updated. Let's see where we are, and then we'll just need to come up with a backup plan if this doesn't work. No. Did you log out and back in? Yeah. Logged out. Logged back in. Damn it. Alright. Then backup plan. I will be your typey typey. Unless one of you wants to share your screen, in which case, I'm happy with that too. But otherwise, I am happy to type,

16:24 Switching to Screen Sharing for Debugging

16:33 and we can work through that. It's up to David, really, because I think we elected David as the typist. So it's up to him whether he'd rather, whether we instruct you or Alright. Amazing. I'll do the typing then. I've exported the kubeconfig. We now have the ability to run a kubectl command. Which one would you like me to start with? Oh, wait. I should've seen you're sharing your screen. So I can share your screen if you want to do that. It's a Yeah. That's fine. Yeah. I'm happy with that. Okay.

17:08 Initial Investigation of Cluster 1

17:08 Alright. Seven. Need to go to the game Alright. Can you zoom in and your font size will be perfect. No. I got it. I got it. One more. K. Go for it. That's all good. Cool. Right. Where do we start? Okay. Nice. Do we have a kubectl? No. Well, yes. But Yeah. That's what I was about to see. What are the permissions on kubectl? We have seen this before on Klustered. So now I'm gonna find out who has paid attention to the previous episodes. Normally, do I slightly mutable, but yeah. But your binary

18:27 Checking Kubectl Executable Permissions

18:27 does not have the executable bit. That's what it looks like to me. Yeah. I can't read it there. You can't chmod the chmod. You can't chmod with the chmod. Yeah. Do you know what I mean? Hold on. I'm frantically giggling. Hold on. Yeah. We looked in our interviews as well for some of our breaks. We've seen this on Klustered with team Red Hat and team Talos. This is a good break. Think this one. So you could, you know, you could use Perl to do this. There is another way, but

19:48 Perl would be one option, assuming they haven't removed the executable bit from Perl. That can also develop me. And you can also Oh, yeah. That's a good one. You can run the loader directly and pass it the command you want to run. I think we're on Yeah. You can use ld.so to also execute chmod, which is what team Talos did on the previous episode. It's a great trick. I love that. Alright. You've got a broken control plane. You don't need have Too early for hints. Come on. Use the hints.
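
The Perl suggestion works because any interpreter that can make the chmod system call can restore the bit, even when the chmod binary itself is not executable. A minimal Python sketch of the same idea, assuming a Python interpreter is still runnable on the node (the transcript does not say whether one was):

```python
import os
import stat


def restore_exec_bit(path: str) -> int:
    """Add the execute bit for user, group, and other; return the new mode."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    updated = current | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    os.chmod(path, updated)  # uses the chmod system call, no chmod binary needed
    return updated
```

Calling it on the affected binary's path would restore the bit. The loader approach mentioned in the same exchange instead runs the binary through the dynamic loader; the loader's path varies by distribution and architecture, so it is not hard-coded here.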

20:25 Executing Kubectl via Loader

21:02 Oh. So what was the error message? I feel like you cleared that screen really, really quickly. Sorry. Yeah. I have a habit of that. I actually have a Zsh plugin that clears the screen after every well, before every command, but people complain at me when it's live streaming. So, yeah, we got no containers on the control plane. Pay attention to the error message. Yeah. That sounds familiar, like, DNS or data things. Trick into the request and succeeded. No. Well, you run a ps, you grep for kube, and the only thing you have is a

21:07 Cluster 1: Kubelet Logs and Containerd Socket Errors

22:28 kubelet. Yep. Yeah. That looks pretty broken. Yep. containerd. Connection refused. Focus there. Switching it off and on again. That's exactly what it was about. Just In fact, stop. Think it's got something more sinister going on. It's happy. Yeah. Happy. Alright. Yeah. Cool. Maybe not. I see stopping Pods, Sandbox, and teardowns there. This is a case that they're running something that won't likely stop it. So we still don't have any Kubernetes component, but containerd looks like it might be happier. Do you still have error messages in the log? That was failing to speak to

23:57 Checking Logs, Disk Space, and Inodes

24:35 containerd. Right? No. We're getting the control. Crash beer control thing. Just that keyword. Are any other of the error messages ones that we should be looking at right now? Or are we I don't think there's any just like an actual file system. Do you see anything there? Do we do a hint, guys? I think that sounds like a good idea. I guess we need that. We will come from the. So there was that, you're very fast, but it says no log left to log. And I don't see a capacity warning

26:24 on your I think it was your log. I did see something around the file system. Yeah. Yeah. I don't I think use maps. Are they alright? They're fine. The ones. You send it to user. Else can Is etcd running? No. Kubernetes, isn't it? So it would probably be a container as well. It should be a static manifest. Yeah. Actually, Kubernetes is one of best. I'll check the inodes. Mhmm. I've seen that one before. Yeah. That's one of those ones that when the bungee bungee good. Yeah. I had to recover from a % inode saturation.
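
The inode check mentioned here is the kind of thing `df -i` reports. As a rough sketch, the same figure can be computed from `os.statvfs`; the function names are made up for illustration, and some filesystems report no inode total, which is handled by returning `None`:

```python
import os
from typing import Optional


def usage_percent(total: int, free: int) -> Optional[float]:
    """Percentage of inodes in use, or None when no total is reported."""
    if total <= 0:
        return None
    return 100.0 * (total - free) / total


def inode_usage(path: str = "/") -> Optional[float]:
    """Inode usage for the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    return usage_percent(st.f_files, st.f_ffree)
```

A filesystem can have plenty of free bytes and still refuse new files once this figure reaches 100, which is why it is worth checking separately from disk space.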

27:44 True. That's fine now, though. It does. That's free. As a I mean, right now, my biggest concern is there's no etcd. You're not gonna get a working control plane without it. Maybe start looking for etcd logs. Actually, one minute ago. Can you do a quite detailed logs? Yeah. Docker warnings? Mhmm. Well, I think that is the log entry. Log level info. I mean, you can go to /var/log/containers rather than using crictl if you want. It might allow you to search or get a bit more visibility. It's really gonna Is there nothing in that etcd file?

29:16 No. Nice. Can also say they cleared the history as well. So I thought asking to look in the history was cheating. If people forget to clear their history, that's a shame on them. That's perfectly perfectly illegal. Alright. So that etcd log. Is that all the Senate? Yeah. So, like, something's limiting the ability to print logs to the file. Right? Am I right in saying that's what's causing it to error out? Well, it's something causing or is there some sort of maximum? Because all the files are also four k, which seems a bit This is well, this is

30:03 the thing. It's like, can you set a pod log limit? The only thing is if with the pods all over, it meets the limit, I don't know. Ah, nice catch. Where was this? Sorry. I was And they could see their new config. There's some sort of Okay. Got you. Yeah. Yeah. The four k is kind of I mean, you could just remove that plugin's link altogether. Yeah. Yeah. It's a good Or you could remove that config. You actually don't need that config with containerd. So nice catch. Oh, dang it. Oh, we got queues there. Something is coming

30:21 Identifying and Fixing Containerd Log Size Limit

30:53 up. Wow. And it survived for more than eight seconds. That's good. Wouldn't you expect, just coming to this as a containerd newbie here, if the log size is set to be really small, would you not expect the logs to just flush and then keep going in that limit scenario? Or am I naive for thinking that? I don't know. I've never seen that setting tweaked before, so that's the first time I've ever seen that. I wouldn't have expected the containers to stop, but I guess containerd doesn't want them I I

31:32 guess containerd just says, hey. You're too noisy. And shuts them down. Yeah. I have no idea. It's something for someone to go find in the docs and share in the comments. Yeah. I know. Yeah. I just thought it'd be something interesting to consider for those that are watching. So Alright. So that error getting node error message typically means that etcd is not quite or it's not ready yet. Oh, there we go. I think It wasn't Yeah. I was more worried that actually we haven't started up in a So this is like a health fair.
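
The transcript mentions a 4k limit in the containerd config but never names the setting. As a hedged sketch of how one might hunt for it, the function below lists every integer-valued setting whose key mentions "log". The sample uses `max_container_log_line_size`, a real option in containerd's CRI plugin section, purely as an assumed stand-in for whatever was actually changed:

```python
import re
from typing import Dict

# Matches simple `key = 123` lines; other value types are ignored.
_INT_SETTING = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=\s*(-?\d+)\s*(?:#.*)?$")


def integer_log_settings(config_text: str) -> Dict[str, int]:
    """Return every integer-valued setting whose key mentions 'log'."""
    found = {}
    for line in config_text.splitlines():
        match = _INT_SETTING.match(line)
        if match and "log" in match.group(1).lower():
            found[match.group(1)] = int(match.group(2))
    return found


# Hypothetical config fragment, not taken from the episode.
SAMPLE = """
[plugins."io.containerd.grpc.v1.cri"]
  max_container_log_line_size = 4096
  enable_selinux = false
"""
```

On the sample, `integer_log_settings(SAMPLE)` reports the one log-related integer, which is the sort of line the team ended up removing.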

31:46 Cluster 1: API Server Connectivity Issues

32:07 Woo hoo. And you have some pods and an API server there. Very nice. Yeah. That break was interesting. I'm gonna have to look that one up in the docs. Yeah. So we just update the Docker image. Right? In theory. That's it? No. Oh, no. You don't have a it's not rolled the pod. So I'm gonna look on the worker node. Just kinda poke around on it. How are we doing on the timer? You've got plenty of time. Okay. You've had a bit just under twenty minutes. Good stuff. Hold on. My light keeps going off. Bear

32:21 Attempting to Update Application Deployment (V1 to V2)

33:12 with me. Yeah. My screen saver keeps coming on because I'm not doing anything. Let me disable that. Right. So you modified the deployment manifest. You changed the image tag from v1 to v2, but we don't seem to have a new replica set being rolled, no new pod rolling out. Something is not quite right. Yeah. I think some control plane being not ready as well. It's got me somewhat suspicious. Yeah. You don't know. Haven't Can we just restart the kubelets on the worker? It's not the worker nodes. It's the control plane that we've just started up.

34:11 Okay. What in here? Think it finds. Oh, if you do kubectl get events, there's a warning for invalid disk. That might be quite old, actually. Yeah. That's probably the thing we've already made so much. Was it unable to pull a new image because of some capacity warning? Was it, like, right, or is it something else? Because it's not even creating a new replica set. So I think the image capacity is a problem, but I don't think it's the one that's stopping the replica set being created. At least not yet. But how would you stop a deployment from

35:01 Debugging Deployment Update Failure (Image Not Changing)

35:04 creating a new replica set? I don't know. That might be giving the game away, but the other team. So the change, what? Well, can you run a get events on the deployment or describe the deployment to see the events? Alright. So we can see, is the image definitely v2? No. It's v1. Right? So your change hasn't persisted. Annotation revision two, though. Yeah. But go down there. Yeah. Your image is v1. The image tag? Yeah. It's v1. And then I'll remember that. Nope. Not muted. Webhook. That was my first thought as well, but

36:29 I kinda know. Just to double check. No. Maybe this is the image capacity. Maybe it wasn't actually able to write the change to etcd. To etcd. Yeah. Disk space looks good. What was it? The error message in drive disk capacity. Let's have a look. You might wanna take a look at the etcd manifest, see if anything's been changed. Let's have a look at the scheduler config. But why are you thinking scheduler? That's not the scheduler, though, is it? Deployment is from the controller manager. You were a bit quick to jump out of that

38:16 controller manager manifest, I think. That doesn't look healthy, does it? Yep. Future resource log. I think no. I think I don't know if Tom's talking to us or not. He's muted. No. I was yeah. Someone came in quickly. There are two things right there that have me worried. One, I think etcd is unable to write the writes. The fact that we don't see v2 tells me that the edit we did to the deployment just didn't work. And I think etcd doesn't know where to block that if it's not

39:21 for invitation. Also, you did exit the API server static manifest very, very quickly, and I think there was a change in there that you missed. Good. Oh, was it the controller manager? Yep. There we go. It disabled the deployment controller. Yeah. Alright. But I'm still worried about that write, the etcd. I mean, you could try adding the I don't think the disabled deployment controller would block that write. I never saw anything in the etcd logs to say, like, this is the etcd Okay. etcd logs. Yeah, I couldn't see anything related to that.
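
The transcript does not show the exact flag that disabled the deployment controller. kube-controller-manager's `--controllers` flag accepts entries prefixed with `-` to turn a named controller off, so a sketch that spots such entries in a static pod's command list might look like this, with the sample value in the note being an assumption:

```python
from typing import List


def disabled_controllers(command: List[str]) -> List[str]:
    """Return controller names turned off via a `--controllers=` argument."""
    disabled = []
    for arg in command:
        if arg.startswith("--controllers="):
            for item in arg.split("=", 1)[1].split(","):
                item = item.strip()
                if item.startswith("-") and len(item) > 1:
                    disabled.append(item[1:])
    return disabled
```

For a hypothetical command list containing `--controllers=*,-deployment`, this returns `["deployment"]`. It does not catch the other way to lose a controller, which is a list that omits both `*` and the controller's name.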

40:50 Further Debugging: Deployment Update Still Failing

40:50 Still not seeing a new replica set. I guess you could try, yeah, try and edit it again. See if the generation changes. Yeah. No. No. Damn. Alright. So what would stop kubectl edit from working? I'm stumped too. I'll give you that. Can we get it as YAML and use a kubectl apply rather than I don't know if that's gonna be any difference. Yeah. kubectl edit does hide some error messages, and doing a kubectl apply would definitely bubble those up. And a reminder in the chat that there are hints.
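
Watching the generation is a useful way to separate "the write never landed" from "the controller never acted on it". A minimal sketch, assuming two snapshots of the Deployment taken as parsed `kubectl get deployment NAME -o json` output before and after the edit; the function name and return strings are illustrative:

```python
from typing import Any, Dict


def diagnose_edit(before: Dict[str, Any], after: Dict[str, Any]) -> str:
    """Classify what happened to a Deployment spec edit from two snapshots."""
    gen_before = before["metadata"]["generation"]
    gen_after = after["metadata"]["generation"]
    observed = after.get("status", {}).get("observedGeneration", 0)

    if gen_after == gen_before:
        # A spec change bumps metadata.generation; no bump means no stored change.
        return "not persisted"
    if observed < gen_after:
        # The deployment controller records what it has processed in status.
        return "persisted, controller lagging"
    return "persisted and observed"
```

An unchanged generation after an apparently successful edit points at the API path rather than at the controller manager.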

42:13 I think we're slowly creeping towards hint time, if you want my opinion. Let's take a look. Is it possible that someone's messed with sorry. I was looking at the chat. Is it possible that someone's messed with, like, the size of whatever disk that etcd is writing to? Would that but I guess it would spare us for that if it ran out of space. Right? It would. Yes. But I think this hint reveals quite a lot. We were on the hint. So is your kubectl actually talking to the API server? So, what's in your

43:01 Investigating API Server Connection Details

43:35 /etc/kubernetes/admin.conf. Is the config correct? Well, I can see the patch through. Oh, hang on. Oh, 14728. Is that what we see in the I can see 14728141217, port 6443. Yeah. Yeah. It looks okay. So now you're trying to look for IP tables rules to reroute traffic? Yeah. Not seeing anything. No. Nothing. No. I mean, are you familiar with Mallory? Nope. I am not. It is an HTTP proxy. So Why is that running? It's gotta be load somewhere. That looks okay to me. Yeah. Good call. API server. Is it the right API server?
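
One way to look for rerouted API server traffic is to scan an `iptables-save` dump, which covers every table; a plain `iptables -L` lists only the filter table unless `-t nat` is given. The sketch below assumes that dump format, and the sample rule is made up rather than taken from the episode:

```python
from typing import List, Tuple


def nat_rewrites(dump: str, port: int) -> List[Tuple[str, str]]:
    """Return (table, rule) pairs that REDIRECT or DNAT traffic for `port`.

    `dump` is the text printed by `iptables-save`, where a line such as
    `*nat` opens a table and `-A CHAIN ...` lines are its rules.
    """
    hits = []
    table = ""
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("*"):
            table = line[1:]
        elif line.startswith("-A "):
            tokens = line.split()
            rewrites = "REDIRECT" in tokens or "DNAT" in tokens
            if rewrites and str(port) in tokens:
                hits.append((table, line))
    return hits


# Hypothetical dump for illustration only.
SAMPLE = """*filter
:INPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
COMMIT
*nat
:OUTPUT ACCEPT [0:0]
-A OUTPUT -p tcp -m tcp --dport 6443 -j REDIRECT --to-ports 8443
COMMIT
"""
```

Calling `nat_rewrites(SAMPLE, 6443)` picks out the single rule in the `nat` table, which a filter-table listing would never show.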

45:55 Yeah. Oh, yeah. That's what I was going through. It looks good. Looks official. Are there any more containers listed on that file? No. No. Interesting. Joe in the chat is suggesting there are more deaths. I was so sure there was gonna be something binding on 6443 that wasn't the API server. Now I feel like I don't know anything. Have a look at the kube API server and pause for the load. It's only in there. Yeah. One container. Otherwise I don't think that's related. No. But I can't get logs.

47:33 Oh. I feel like the patch requests are being dropped. Why don't you do a kubectl get deployment, clustered, and write it to a file? Make the change and then do a kubectl apply -f, but with -v 9, and then see if we can see some of the noise. And if not, we'll need to just jump into some hints, I think. We've got about ten minutes left. Oops. Yep. Because when you do the kubectl apply, it won't use the I thought it's like it uses the patch bar, but I think

48:46 it gives you let's let's see. That looks better. We used the patch, but it's a little bit different. Did it change it? Nope. Was there anything in that that the output? Like, scroll back up. Like, did we miss something? It says the 200. Yeah. Yeah. Gen oh, wait. That response body still has the old generation. Where's the image source? Yeah. Alright. Oh, it's truncated. Okay. Andrew in the chat is suggesting you run a get all all namespaces to see if there's anything wrong in the cluster. Yeah. You got that. Oh, apparently, you were close with the IP

50:11 Hint Points to Networking & IP Tables / NFTables

50:11 table. What is it? I'm gonna take a hint. I think are they suggesting to look at nftables? No. This is a new one. Yeah. I think they're so they're not using IP tables. Ouch. Okay. I would just let's just see the next tab. I'm really kidding. Let's just go for it. Can you literally type nft? There is an nft command available, but I'm not sure what it does. I'm not familiar with the nft command. I have no idea. I'm angry at them for using this just for

51:27 Debugging NFTables Rules

52:18 the sake of having to say NFT repeatedly. Yeah. I was thinking that. What can you do? nft list? That's what I'm trying to find at the moment. It's magic. Okay. That looks fine. Yeah. So I was a bit like this ruleset. Oh, hey. Hey. List ruleset. Alright. How do we change them? I have no idea. I know. I click Google's suggestion. You can run nft delete rule. Flush? So there's nft flush. Yeah. Try a flush. But then Oh. Flush. Hang on. How about dpkg -P nftables? I don't know if that would just remove

53:50 everything or not. What was that again? Sorry, David. Well, you can run nft delete rule inet table ip nat. You can do nft delete rule or maybe ruleset. I don't know. And then table ip nat. Yeah. That sounds right. Or try rule instead of ruleset. Rule. No. Great. Team crash beer. Can you tell us how to delete this Try nft flush ruleset. Just Flush. Just that. Hey. That's something. At least I did something. I've got nothing. So yeah. Configured is good, maybe. Yep. Yeah. I hate NFTs.

55:03 It's still it's a shock, really, isn't it? Oh, they messed with IP tables as well. Thanks. I think it's saying it's blocked. Oh, they're saying run iptables -t nat. So you have to list the NAT rules. You still need list. Just flush the NAT rules? There it is. Oh, yeah. Redirect four eight five three. You could just do an iptables -t nat -F and just flush them all. Yeah. I was looking for the number output, so I can never remember the There's another one there, though. You know, we say add -B 2

55:55 Fixing IP Tables NAT Redirect for API Server

56:58 and -A 2 to the grep, and you'll get the extra lines. No number? PREROUTING. How do we do the numbers? You're right, though. Is this can we just do iptables flush? You'll need to do a -t there. I should get rid of it. Because then yeah. Anything like Cilium and things should put It they'll rewrite them. Yeah. Alright. We are out of time, but if you want a couple of minutes, I'll give it to you. But I think diving into the RBAC may require slightly more time. That's gonna take

58:05 Cluster 1: Out of Time / Final Status Check

58:16 80. Alright. Let's look at the rest of the hints, see what we've got. Maybe it's okay. There's maybe quite a lot still to debug here. So I think we'll just leave that there. Alright. I think we were so close. Well Yeah. That's gonna take a long time. Well, let's see. You wanna tell us what we missed? Really? Yeah. Hi. You did great. You had it in the IP tables. It was like five minutes into the session and you had IP tables running and then you were always missing the correct chain. We had a man in the middle

59:05 Cluster 1 Breaks Revealed (Proxy, Dry Run, IP Tables)

59:23 proxy running which was applying a dry run to every kubectl command you were running. Where is the proxy running? I thought that at the start the kubectl output actually showed the dryRun=All. But Yeah. Yeah. This is the thing. This is why you shouldn't keep quiet in these scenarios. I was gonna say, why is there a dry run? But then I just didn't say anything. Sorry. Bad teamwork. No. It's supposed to be You had it almost. And then we had the RBAC stuff. It was just you had to work as a

59:56 different user. You had to use --as and work as a different group and user. And the missing deployment controller you fixed already, and I think that was it. Alright. I got I thought that deployed the what is it? He was taking care of you with the, yeah, the controller manager. The dry run was a bit messy. Yeah. That proxy was mean. Very, very mean. Good fun. Alright. If you had deleted the pod, the proxy had a problem. So it was watching always that the pod was removed, but then it

1:00:36 was stuck, but we figured you won't delete any pods. But you did Okay. Thank you team JetStack. We're gonna move over to the other cluster now. So feel free to join us in the YouTube chat and we will see you later, but thank you. All right, let's get this faded up. Alright. User mean, but very good breaks there. So you were using Mallory as a proxy. Right? No. We were referring to this communication model where you always have Eve and Bob talking to one another, and Mallory is always the one eavesdropping or man in the

1:00:48 Transitioning to Cluster 2 (JetStack)

1:01:21 middle style. But it was the same. You gave a great hint because Mallory was intercepting the traffic. Yeah. Great. Well, let's see what JetStack have prepared for you. So because I have obviously got a broken Teleport configuration, I am kindly gonna ask one of you to please share your screen. And Yeah. I think I'll oh, you'll have to use Google Chrome for screen sharing. So, Joe, you're using Google? Yeah. That's right. Alright. Could you share your screen? I'm happy to do all the typing and share my screen if you would prefer,

1:02:00 and I don't mind either way. It's up to you. Do you have a preference? I can try to share. Okay. Yeah. Restless. Has nothing here. Next to the hang up button, so be careful. No. I only mute and hang up. That's an option for to share a screen. Alright. I could do the typing, and we'll work through this together. So let me close the Oh, you give me a second, and I'll try to get my GitHub login working on Chrome. And I'll fix all my stuff. Really need to Restore all the crew files from my old computer.

1:03:18 There we go. That's better. Alright. So JetStack control plane. Oh, we had prepared such a good script for bootstrapping. Okay. I can run a script for you if you want. We'll do it without. Okay. So but maybe you wanna do the completion? Uh-huh. Perfect. Maybe you wanna alias kubectl to k. Okay. How would you like to test for a control plane? What's up? Yeah. Let's do it. Get pods. No control plane. Okay. So let's see, are there any containers running? crictl ps. And they are running. Yeah. It does look like we have controller manager, scheduler. We just don't have an

1:03:55 Initial Investigation of Cluster 2

1:04:33 API server. There's no API server. So maybe the static manifests. Is there a yeah. Perfect. There is one. Man, what would you like to see? Scroll down a bit. I mean, it's interesting. Right? The manifest is there, but kubelet is not picking it up. Maybe the kubelet path for manifests is not correct. Yeah. Check that one. Maybe do a systemctl cat kubelet. Okay. So the kubelet config is in /var/lib/kubelet/config. And then there should be a static pod something. Yeah. Okay. Kubernetes manifests. Is there a typo? No. Nope. That looks good. Okay.

1:04:36 Discovering Missing API Server Pod & Checking Manifest

1:05:53 Do we wanna check the kubelet logs? journalctl, -u, yeah, kubelet. Well, they can't speak to the API server. And we just restart the kubelet. Sometimes it helps. Yep. Okay. But every other pod is picked up. Shall we check the logs for the API server? Yeah. I can see that the manifest for the API server is modified. So if you run a list command on the manifest directory, I can see a modified date. Yep. It does look like it has been modified. So this is probably the time where we will have our backups from our cluster,

1:06:46 Inspecting API Server Static Manifest

1:07:20 and now we are diffing the manifest files. Well, I mean, let's go through it. Right? Yeah. Yeah. All the flags. Is there anything here that worries you? Is the secure port correct? It is not. Good catch. Well, let's try. Well, would that stop it from running? No. This wouldn't stop it. Yeah. This is not the case. So the livez, I think, is correct, and the readyz is there as well. Yes. The ports are not right. Right? Yeah. Yes. Okay. And it's now? Let's try it. That's... wow. Okay. Now the port should be fixed. Right?
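
The secure-port check discussed above can be sketched in a few lines of Python. This is only an illustration, not what was typed on stream: the function name and the shortened command list are hypothetical, and the ports 6443 and 8443 are the two values mentioned in the conversation.

```python
from typing import List, Optional

# Port the team expects the API server to listen on (mentioned in the episode).
EXPECTED_SECURE_PORT = 6443


def find_secure_port(command: List[str]) -> Optional[int]:
    """Return the --secure-port value from a kube-apiserver command list.

    Handles both "--secure-port=N" and "--secure-port N". Returns None when
    the flag is absent, in which case kube-apiserver uses its default.
    """
    for i, arg in enumerate(command):
        if arg.startswith("--secure-port="):
            return int(arg.split("=", 1)[1])
        if arg == "--secure-port" and i + 1 < len(command):
            return int(command[i + 1])
    return None


# Hypothetical, shortened command list as it might appear in a static manifest.
broken_command = ["kube-apiserver", "--secure-port=8443"]
port = find_secure_port(broken_command)
if port is not None and port != EXPECTED_SECURE_PORT:
    print(f"secure port is {port}, expected {EXPECTED_SECURE_PORT}")
```

A check like this only compares the flag with the expected value. It does not look at the probe ports elsewhere in the manifest, which the team also compared by eye.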

1:07:43 Identifying and Fixing Incorrect API Server Secure Port

1:08:47 Yep. And I don't think there's anything else of particular interest in here. Although I am worried about one thing, but I don't wanna mention it yet. Let's write this. Okay. We have an API server. So... And it appears to stay. Let's check pods. Down again. No. The advertise address wasn't right. Right? It was a different advertise address. It's 1065. And in the admin.conf, it was, I don't know, something different. The annotation, you mean, on the upper? So this is the... Yeah. I see. This is the... yeah. 28141245. Yeah. So that's the IPv4 public address.

1:09:03 API Server Starts Running

1:09:50 Yeah. And the other one is the private one. Right? Yep. I think that should be... I think it should be okay. And it is still running. Do we have any cron jobs? Oh, the secure port actually is... Yeah. The manifest was probably changed again. Right? Yep. Could you please check the cron jobs? Let's confirm. That's... yeah. Let's... 443. No. Check... yeah. It was probably 8443 again. Mhmm. Okay. Check the... see cron. Yeah. This is empty. All the cron... there are multiple cron directories, like /etc/crontab, cron.d. Yeah. And do I cat all?
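
The hunt for a cron entry that keeps rewriting the manifest can be sketched as a small text filter. This is a minimal illustration under stated assumptions: the function name, the sample crontab text, and the `kube-apiserver.yaml` file name are hypothetical and are not taken from the cluster in the episode.

```python
from typing import List


def cron_lines_mentioning(crontab_text: str, needle: str) -> List[str]:
    """Return active (non-comment, non-blank) crontab lines that contain needle."""
    hits = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if needle in stripped:
            hits.append(stripped)
    return hits


# Hypothetical crontab content, for illustration only.
sample_crontab = """\
# m h dom mon dow user command
* * * * * root sed -i s/6443/8443/ /etc/kubernetes/manifests/kube-apiserver.yaml
0 3 * * * root logrotate /etc/logrotate.conf
"""

for hit in cron_lines_mentioning(sample_crontab, "kube-apiserver"):
    print("suspicious cron entry:", hit)
```

In practice the same filter would be fed the contents of each cron location in turn, since, as noted in the conversation, there are several of them.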

1:10:57 That's something. There is. Remove it. Alright. You're fine. You're bad. Okay. Now we have to update the manifest again. Perfect. Alright. It's pretty laid back if you don't have to type anything. Alright. It's not restarted. Alright. There we go. It's restarting. No. So the port is not... But it's still 6443, but it's not... No. The kubelet can take a few minutes before it does that. So we... No. Restart the kubelet. Yeah. Yeah. We can encourage that along. There we go. Okay. Get pods. Oh, wow. But it's terminating. The Postgres is terminating for

1:12:13 Cluster 2: Application Pods Terminating/Pending (Nodes Not Ready)

1:12:16 some reason. Let's describe it, the StatefulSet. It's the StatefulSet. Is it called like that? No. Yeah. Is there an event... oh, no events. Okay. One desired. So the StatefulSet looks okay, I think. But something is terminating it, and something is preventing the Klustered app from starting, but that's maybe the missing Postgres? No. I don't think it will be. Now, typically, in Klustered in the past, when we've seen things that are stuck terminating, it usually means our kubelet isn't alive on the worker node. And we run just kubectl get events.

1:13:23 Look. The nodes are not ready. Oh. Yeah. Okay. I don't think we have a kubelet. Joe, what do you think? Should we cordon the worker nodes? Or should we fix it on the worker nodes? I guess we should fix. Right? Yeah. Try to fix it. Okay. Alright. Let's pop open... Yeah. JetStack worker one. I'm assuming if we run a systemctl status, we're just gonna see the kubelet is not running. Yeah. Okay. So, but wait a minute. Is it enabled? Yeah. And it's failed. Okay. Check the logs. So... Yeah.

1:14:03 Checking Worker Kubelet Logs (CA File Error)

1:14:06 Yep. Unable to load CA file. We said JetStack had to mess with the certificates. We need a cert-manager right now. Yeah. Let's confirm, but I think... yeah. Oh god. Is it there? Or... What was it looking for? CA certificate dot CRT. It is. It is. Then maybe... yeah. So maybe it's the systemd unit file. Maybe it's blocking access to this directory. Yeah. Could you do a systemctl cat kubelet.service? I think it's easier than that. It looks okay-ish right now. It's all, or it was the complete file? I'll leave that there for ten seconds. Hey.

1:15:24 Close attention. Certs... there are certificates. There is a missing s. Yes. Alright. So I'll rename it on the, yeah, on the file system. XFL. There's no way to remove it. Like I said, I should also see template config. Or is this what the correct name... There we go. Okay. But fixing one worker, I think, is enough. And we could cordon the other one, so yeah. Thanks. Do you need cordon? Okay. No. It's okay. Perfect. Okay. So get pods again. Okay. Yeah. I think they're unfortunate that they're on that... Yeah. They just deleted. Yeah.
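
A one-character naming mismatch like the missing "s" above is easy to overlook by eye. The sketch below shows one way to surface it with Python's standard `difflib` module. The function name and the sample names are hypothetical; the real file and directory names from the cluster are not reproduced here.

```python
import difflib
from typing import List


def near_miss_names(expected: str, present: List[str]) -> List[str]:
    """Suggest existing names that look like a typo of the expected name.

    Returns an empty list when the expected name is present exactly.
    """
    if expected in present:
        return []
    # cutoff=0.8 keeps only very similar names, such as a one-letter difference.
    return difflib.get_close_matches(expected, present, n=3, cutoff=0.8)


# Hypothetical example: the config refers to "certs" but the disk has "cert".
names_on_disk = ["cert", "manifests"]
print(near_miss_names("certs", names_on_disk))
```

The expected name would come from the path in the kubelet error message, and the list of present names from a directory listing of the parent folder.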

1:16:34 Worker Node Becomes Ready

1:16:51 Nothing bad ever came from deleting a... Yeah. No. I don't want dictation. Uh-oh. Stop. Okay. Let's describe it. Temporary failure in name resolution. Okay. Someone messed up with DNS. Okay. I guess on the worker node, not on the control plane. Uh-huh. Of course. Because that's where the error is coming from. Yeah. Okay. So the DNS on the host is not working. Okay. Let me check if systemd-resolved is running. Yes. Could you do a resolvectl status? Yeah. Red. Resolve. Yeah. Yeah. Yeah. This is suspicious. Those are the Equinix Metal

1:17:36 Debugging DNS on Worker Node (Host DNS Working)

1:18:22 DNS servers. Those, I believe, are okay. And we check on the master, on the control plane, what's the correct DNS? But it was an external IP. Right? It's the same. Yeah. That is the Equinix Metal. So these 147.75 are Equinix Metal. So those are okay. Yeah. Some iptables. Just drop... drop to domain. Yep. Wow. And if you do the --line-numbers, I think, or it was four maybe. But... Line numbers. Yeah. Yeah. Okay. What was that? Oh. Oh, I don't remember. It's the chain OUTPUT. Right? Yeah. Yeah.
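
The rule hunt above can be sketched as a filter over rule specifications in the format that `iptables -S` prints. This is an illustration only: the function name and the sample rules are hypothetical and are not the rules found on the worker in the episode.

```python
from typing import List

DNS_PORT = "53"
# Targets that would stop or divert DNS traffic.
SUSPICIOUS_TARGETS = {"DROP", "REJECT", "REDIRECT", "DNAT"}


def suspicious_dns_rules(rules_text: str) -> List[str]:
    """Return appended rules that match destination port 53 and drop or divert it."""
    hits = []
    for line in rules_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0] != "-A":
            continue  # skip policy lines (-P), chain definitions (-N), blanks
        target = None
        if "-j" in tokens[:-1]:
            target = tokens[tokens.index("-j") + 1]
        touches_dns = any(
            flag == "--dport" and value == DNS_PORT
            for flag, value in zip(tokens, tokens[1:])
        )
        if touches_dns and target in SUSPICIOUS_TARGETS:
            hits.append(line.strip())
    return hits


# Hypothetical rules, for illustration only.
sample_rules = """\
-P OUTPUT ACCEPT
-A OUTPUT -p udp -m udp --dport 53 -j DROP
-A OUTPUT -p udp -m udp --dport 53 -j REDIRECT --to-ports 5353
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
"""

for rule in suspicious_dns_rules(sample_rules):
    print("check this rule:", rule)
```

Once a rule is identified, listing the chain with `--line-numbers`, as suggested in the conversation, gives the index needed to delete it.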

1:18:47 Discovering and Fixing DNS Redirect Rule in IP Tables

1:19:31 It's gone. Just want to confirm, is there an additional drop rule? Do we have a drop rule on the control plane too? Mhmm. But DNS was working on the control plane. It was more broken. Okay. There's at least another drop rule too. Okay. Okay. So we have DNS resolution. Right? Yeah. Perfect. Okay. Ta da. Running. Okay. Good. And more is pending. So describe the pending one. Okay. Now we went. Can we support? Ah. Okay. But... Oh, yeah. No scheduler. The scheduling. Could you edit the deployment? I mean, we have to edit it anyway.

1:20:38 Inspecting Deployment for Scheduling Constraints (nodeName)

1:20:43 Yep. What would you like to change? We go with the v2 first. Yep. And, Joe, what do you think? Should we do the node name? We can try the node name. Yeah. In the spec. In the spec? No. Here? No. No. It's in the pod template. I guess it's in the... Yeah. Yeah. Here. Yeah. Yeah. Yeah. Oh, but, David, you have to remember the name. JetStack worker one. Okay. Running. Are we done? Let's, yeah, curl it. I can't remember the back. It's been so long since I've done a Klustered that I

1:21:07 Fixing Hardcoded nodeName in Deployment

1:21:30 don't remember the plus. Sorry. Doesn't curl. Version one. Okay. Let's see what's happened with the pod. Is it version two on the pod, the image? Yeah. Already present on machine. Maybe we should... Always. Yeah. It's a low image. See, I can't type. There's too much pressure. Perfect. Alright. This is a working method I could get used to, so maybe I'll have to find someone typing for me. Okay. It's working. Version two. But I would like to get into the scheduler issue. There should be a scheduler.conf. Right? Yeah. It is Kubernetes scheduler.conf.

1:22:42 Application Now Running V2 / Cluster 2 Fixed

1:23:00 That's just authorization configuration. Oh, yeah. Basically, we are doing the same, but can we get the logs of the scheduler? I just assumed that it's gonna be... I mean, you could remove the node name. Right? We could. Yeah. But you want to do it and see the logs. Right? So let's... Okay. Let's do that. Actually, Peter in the chat is saying we should try the actual video URL. Okay. Yes. It does. Oh, wow. Beautiful. So you have... Okay. We have fixed it, but there is something wrong with the scheduler. Yeah. You're thinking that the scheduler name is

1:24:21 probably not the default one? You want to remove it? Let's remove it and delete the pod. Yeah. But if you remove the default one... but now it's running too. Mhmm. That's it. Well, the time, you absolutely smashed that. I mean, obviously, it's because I was doing all the typing and then I'm not... Okay. Alright. Well played. Thanks. Yeah. Have a good one. This was cool. Especially involving the worker nodes. We always thought about, maybe, should we do something with the worker? But then, I mean, we already had enough on the control plane.
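
The two scheduling fields discussed above, the node name and the scheduler name in the pod template, can be inspected with a short sketch that works on a Deployment loaded as a plain Python dictionary. The function names, the sample Deployment, and the node and scheduler values are hypothetical. The only fixed value is `default-scheduler`, which is the scheduler name Kubernetes uses when none is set.

```python
import copy
from typing import Any, Dict, List

DEFAULT_SCHEDULER = "default-scheduler"


def scheduling_findings(deployment: Dict[str, Any]) -> List[str]:
    """Report pod-template fields that change how the pods get scheduled."""
    pod_spec = deployment["spec"]["template"]["spec"]
    findings = []
    if "nodeName" in pod_spec:
        # A set nodeName places the pod directly, without the scheduler.
        findings.append(f"nodeName={pod_spec['nodeName']} bypasses the scheduler")
    name = pod_spec.get("schedulerName", DEFAULT_SCHEDULER)
    if name != DEFAULT_SCHEDULER:
        findings.append(f"schedulerName={name} is not {DEFAULT_SCHEDULER}")
    return findings


def use_default_scheduling(deployment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with nodeName and schedulerName removed from the pod template."""
    patched = copy.deepcopy(deployment)
    pod_spec = patched["spec"]["template"]["spec"]
    pod_spec.pop("nodeName", None)
    pod_spec.pop("schedulerName", None)
    return patched


# Hypothetical Deployment fragment, for illustration only.
sample_deployment = {
    "spec": {
        "template": {
            "spec": {
                "nodeName": "example-worker-1",
                "schedulerName": "not-the-default",
                "containers": [{"name": "app", "image": "example/app:v2"}],
            }
        }
    }
}

for finding in scheduling_findings(sample_deployment):
    print(finding)
```

Setting the node name gets a pod running quickly, which is what the team did first. Removing both fields, as they did at the end, hands placement back to the default scheduler.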

1:24:54 Cluster 2 Breaks Revealed

1:25:08 But, yeah, nice. And no real certificate issues. Thanks a lot. Oh, well. Welcome back, Pierre. Hi. Hey, guys. Well done. Let me pop you all back up. There you go. Yeah. Good job, both teams. Those were fun clusters. Annoyed that we couldn't use Teleport properly. I'm gonna have to work that out, but I will get that fixed for the next episode for sure. But I hope you all had fun. Alright? Perfect. It was good. Alright. Learned a thing or two for sure. Yeah. I was very impressed how methodically you guys worked through the problem.

1:25:17 Wrap-up, Thanks, and Reflections

1:25:55 Yeah. We were both very, very calm through that. I was just like, alright. Let's look at this. Let's look at this. You... yeah. Yep. The good thing, there was always an error message. And we knew that with the man-in-the-middle proxy we prepared, there was no error message. So we thought, ah, this is gonna be hard without an error. Yeah. That was very, very sneaky. Where was the proxy running? Sorry? The proxy was a bit hidden as well. We hid the PID, and we named it systemd-homed. Oh, god. Yeah. But... Yeah. It was never

1:26:30 we never wanted you to find the process or kill the process, because if you killed the process, then the connection to the API server wasn't working. You had to remove the iptables rule. And your proxy was dropping... oh, no. It was doing the dry run. Right? You were injecting the dry run on film. Yeah. Yeah. That is very sneaky. I'll never forgive myself for not saying, why is there dry run in that API request, in the first two minutes. I know. I never said it either. I saw it right at the start. I just ignored

1:27:00 it because I didn't think it... Yeah. For anyone watching, a top tip is: don't stay silent because you think it's gonna be a stupid thing to say. Always say it, even if it's gonna be stupid in your head. Mhmm. Alright. Yeah. Well, thank you all for joining me. Those were two very cool clusters. So well done, both of you. We're gonna see... David, thanks for this episode. Thanks for preparing everything. I mean, I gave you all a broken Teleport installation. I don't know why you're thanking

1:27:34 me, but I just love that we get to have, you know, Klustered at all, where we can... you know, I think that Kubernetes is still complex. Right? Like, we all have to learn the hard way when it comes to it, and it's always through trial and error and a lot of pain and sleepless nights, typically, that we learn how to configure it. Or just, like, why should anyone know you can disable the deployment controller or the pod controller? But all these things do exist. And I think Klustered is a really cool way where we can just take

1:28:02 that knowledge from all of you and share that with people that are watching. So, seriously, thank you for joining us, and thank you to everyone that is watching. This is what makes Klustered fun and what makes it work. Alright. I'm gonna say goodbye to you all, and thank you everyone for watching. There's loads of, like, thank yous and awesomes and thumbs up in the comments. We'll get them all on screen as we say goodbye. Thank you again to our teams, and thank you to our sponsors, Teleport and Equinix Metal. Hopefully, we will be back with more Klustered

1:28:28 soon. So thank you all. I'll see you all soon. Bye. Bye, everyone. Bye bye. Bye bye.

Meet the Cast

David Flanagan

@rawkode

Additional Resources

More from Klustered

View all 45 episodes

Alex Jones & Alistair Hey

Hans Kristian Flaatten & Zach Wachtel

The Community Vs. Rawkode

IBM & Nisum

Marino Wijay & John Anderson

Adobe & Zapier

More about Kubernetes

View all 173 videos

Hands-on Introduction to Kueue

Hands-on Introduction to Yoke

Navigating Kairos: Immutable Operating Systems with a Cloud Native Twist

More about containerd

View all 23 videos

Flatcar Linux: A Modern OS for the Always-On Infrastructure

Server-Side WebAssembly

Hands-on Introduction to KWasm

More about Teleport

View all 38 videos

Alex Jones & Alistair Hey

Hans Kristian Flaatten & Zach Wachtel

Let's Take a Look at Teleport 10
