About this video
What You'll Learn
- Bypass broken cluster shell access with corrected kubectl binaries, shell setup, and session-specific teleport fixes.
- Diagnose expired Kubernetes certificates and control-plane disruption by tracking teleport loss, API server downtime, and etcd quorum status.
- Repair worker-node readiness and Cilium networking by fixing kubelet hostname mismatches and restoring the loopback CNI plugin.
Teams from Talos Systems and Red Hat break each other's Kubernetes clusters and race to fix them, debugging expired certificates, etcd quorum loss and snapshot restore, Cilium network policies, and a broken CNI loopback plugin.
Jump to a chapter
- 0:00 Holding screen
- 1:45 Introductions
- 1:47 Introduction and Housekeeping
- 3:08 Introducing Team Talos
- 5:00 Team Talos
- 5:38 Talos Team: Initial Cluster Check (Permission Denied)
- 10:10 Bypassing Permissions with ld-linux
- 16:15 Replacing the kubectl Binary
- 19:09 Fixing Shell Issues (Installing Fish)
- 20:03 Identifying Certificate Expiration
- 20:41 Attempting Certificate Renewal (kubeadm)
- 22:28 Restarting Control Plane Pods
- 24:16 Talos Team Loses Teleport Access
- 26:04 Regaining Access
- 27:33 Debugging Kubelet and Static Pods
- 30:46 Checking Application Status (Database Failed)
- 31:27 Investigating Cilium Network Policies
- 40:27 Identifying Default Cilium Policy
- 46:00 Team Red Hat
- 1:00:50 Introducing Team Red Hat & Initial Cluster Check
- 1:01:41 API Server Not Running
- 1:04:50 Identifying the ETCD Issue
- 1:08:35 ETCD Quorum Problems ("No Leader")
- 1:11:40 Consulting Hints for ETCD
- 1:15:18 Hint 1: Insufficient Quorum
- 1:19:40 Attempting to Remove Failed ETCD Member
- 1:25:32 Hint 3: ETCD Snapshot Restore
- 1:29:04 Performing ETCD Snapshot Restore
- 1:33:30 Kubectl Working, Worker Node Not Ready
- 1:34:16 Application (v1) is Accessible
- 1:34:32 Red Hat Team: Attempting Application Upgrade
- 1:35:55 Application Pod Pending (Worker Node Issue)
- 1:39:29 Consulting Hints for Worker Node
- 1:41:01 Debugging Worker Node (Kubelet, Hostname)
- 1:43:41 Talos Team's Hostname Trick Revealed
- 1:44:18 CNI Plugin Issue (Sandbox Error)
- 1:45:15 Consulting Final Hints (Cilium CNI)
- 1:46:07 Fixing the CNI Loopback Plugin
- 1:46:44 Rescheduling the Application Pod
- 1:48:01 Application (v2) is Working!
- 1:48:10 Post-Challenge Discussion and Wrap-up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:47 Introduction and Housekeeping
1:47 Hello, and welcome back to another Clustered at the Rawkode Academy. I'm your host, Rawkode. Today is Clustered Teams part two. We have two great teams lined up. We have a team from Talos and a team from Red Hat. Before we get started on the Kubernetes debugging weirdness madness, I'm not sure which one to use, there's a little bit of housekeeping. So first and foremost, please subscribe to the YouTube channel and tick the bell. This will get you alerts for all new episodes on Rawkode Live, which means that you can come and learn all of the weird tools and the cloud
2:20 native landscape together. Also, if you wanna chat about Kubernetes, cloud native, and everything in between, there is also a Discord server that is available at rawkode.chat. And also, I wanna thank Teleport. Teleport have been sponsoring Clustered, and it was the easiest decision that I ever had to make because we have been using Teleport since the very first episode. Today, we're using Teleport with some new, easier mechanisms to access our applications via private link, and you'll see that just soon. If you wanna support the show, please go check out Teleport. You can do so at Rawkode.live/Teleport.
2:54 I would really appreciate it. Also, thank you to Equinix Metal. They provide all of the hardware that we use on Clustered, and you can also use the code Rawkode for $200 to spin up your own bare metal stuff. Alright. That is all the housekeeping and I am happy to introduce our first team. Hello, team Talos. Hello. Hello. Thank you. Excited or nervous? It's always a bit of both. Right? Yeah. Both. A little bit of both. Alright. Well, could we start, I guess, top right and then down to Sean and along? Just a small introduction. Share
3:08 Introducing Team Talos
3:31 whatever you want, and then we'll get into today's first cluster. Sure. You said starting from the top right? Yeah. You, Andrew. Do we see the same oh, okay. That's me. I don't know if we see the same phrase. Yeah. Exactly. I'm Andrew Rynhard. I'm the CTO of Talos Systems, creator of Talos, longtime container user, Go programmer for fun on the side of doing operations SRE work. I'll make it quick. I wanna get into the action. I'm Andrey Smirnov, developer at Talos Systems. So working on Talos and projects around it, like developing stuff, breaking clusters not because I want
4:13 to do that because I make bugs and fixing them all the time. Nice. Hey, y'all. I'm Spencer. I'm an engineer at Talos as well. Of course, I mostly work on our cluster API side of things. And I'm also here as Andrew's stunt double if needed. So Hi. I'm Sean McCord. I am an engineer at Red Hat. Oh, no. I'm sorry. At Talos. I've been building and breaking distributed and fault tolerant Linux based distributed systems since the mid nineties. Got in early on Kubernetes with CoreOS and loving it here at Talos. Awesome. Thank you very much all. And I
4:59 will also let you know that the Red Hat team are in the chat saying that they have been very bad today. So Yeah. I saw that. We we were debating how bad we wanted to be. We ended up being nice. Because, you know Well, you know, there there's, like, very few rules as you're now aware. Right? But one of the rules is do not break Teleport because if we can't access Teleport, we can't pair to fix the cluster. Very boring. Yeah. And what did Red Hat do? They broke Teleport. So Nice. Yes. They were Basically, there's only one rule.
5:00 Team Talos
5:35 Minus 10,000 points for Red Hat anyway. Let's get my screen share up. Now there's a few nice additions to our Teleport setup. I wanna thank Ben at Teleport for helping me with some of this new automation. But you can see we have a nice shiny URL. You no longer have to type "this is insecure" into your browser. Nice start. I have also exposed the clustered application. It is available here on applications. So when you want me to try and see if it is working, just let me know, and we will hit the link. I am going to open our first session
5:38 Talos Team: Initial Cluster Check (Permission Denied)
6:03 on the control plane. And if you could all do me a favor, go to active sessions and join and type echo hello, echo hi, echo whatever you wish. Hopefully, I can see everyone in here in the next moment. Alright. Looks like I got a couple of you. I've got three names here. I'm expecting to see five. Who have I got? Let's see. Andrew and Spencer. Sean and Andrey, I think you opened your own session, Or you're just not listening to me. I'm not sure. I'm just fighting with the CLI. I'll join on. Alright. Okay. Yeah. Yeah. No worries. He's
6:43 just nice. Alright. We've got a little bit of setup that we have to do anyway. So well, Spencer, Andrew, do me a favor. Export our kubeconfig, configure any aliases, and let's check if we have a control plane. Oh, where's my auto complete? You can enable completion on kubectl if you wish. It's admin.conf. admin.conf. We free to just go or what? Yeah. Let's get some nodes, man. Let's see. Permission denied. Oh. Let's check. And you've hit your first snag. There we go. I see we got a fourth. That means we've got we've got one off. There we
7:33 go. They're all here. Was someone driving there, or is that me? Sorry. That was me. Go ahead. Okay. Make it plus x. Yep. I need my autocomplete stuff. I'm lost. You didn't install fish while you're here too? I can I can type if you want? Oh. Okay. Nice. Okay. That's an interesting start. I think we were a large group. Think it's just a group. Okay. I think they took away Yeah. They took away chmod. They chmod'ed the chmod. Yeah. Yeah. And do an ls -la on /usr/bin/chmod. Or probably just /bin/chmod.
8:36 Oh, bother. You don't even have tab completion? What is this? I don't. Yes. It's telling me. But, hey. We'll make it work. If you don't, I think that's probably because they've changed that. Yeah. Exactly. So just look at the whole /bin. Is there anything we can execute at all? Well, ls is obviously. Well, ls is a builtin. So let's see. No. There's a /bin/ls. It's not a builtin, but that one is built in. So, yeah, let's do okay. Other I could scroll back. Can we, like But for anyone who's not
9:17 familiar, green means executable. Right? Yeah. Can we have to install something? I mean, just to fix things. Easy way. Yeah. You can do whatever you want. Yeah. What is this? coreutils? Just remove or move it out of the way and do an apt. Oh, sorry. Go ahead. Whoever was doing that. Go ahead. Oh, I mean, I was just I'm not sure what is coreutils? I think it's coreutils. Might be core dash. I you need a force, but oh, yeah. You can't even do that. Beautiful. Nice. Oh. So there's people in the chat saying this
9:59 is SELinux. Using Schmalt. But these are Ubuntu machines. Just FYI. Apps are using Schmalt. That's the worst. Yeah. So okay. Can I curl some Download the yeah? Let's download the Debian file, and maybe you can can we download the Debian file? Oh, look. That's there. Maybe it's already there. We just need to reinstall it. It looks like it's in the the cache. If we can unpack the Debian file, that's what I'm getting at. The apt cache. That's a good idea. Oh, interesting. I found a cool hack too. I can try that to execute things
10:10 Bypassing Permissions with ld-linux
10:50 We'll make it executable. The we're sorry directory is where they have left hints if you require them. Oh, okay. Oops. Sorry. We don't wanna cheat yet. No. Do we have this? Where is this? Apparently, you can have SELinux on Ubuntu these days. I had no idea. Thanks, Walid. Where is looking for? There is, like, a For ld? Yeah. No. ld, like, a cool hack to run an executable without marking it as executable. But I don't know whether ld-linux is here on this. Yeah. Where is ld? It should be in /lib somewhere.
11:40 Yeah. Where it is. Or /lib64. Okay. We have okay. Here it is, I guess. Can we oops. I did that. As the as I have the inspector now. How to paste here? I have no idea. Can anyone paste that? Copy and paste that. What are we trying to do? Oh, the name of it? Yeah. The name of that. I can't. Okay. Yeah. I can't even select. I got it. I can type I can type the rest. Thank you. Because I can't copy paste in the web browser currently. I think at least we were able to run
12:48 it. Right? Yeah. Yeah. You are. Yeah. Nice. So we need /usr/bin/chmod instead of /bin/chmod. Yeah. What it means we need attributes. Let's check extended attributes. I don't remember how to do that, but I think that's ls dash what to see the attributes. Oh. You can use lsattr, but they also removed the x from that if you check. Ah, yes. What is it? lsattr. All one word. Yeah. That's the command. lsattr. Okay. Not like that. That's the command. Sorry. Yeah.
13:32 lsattr. No dash. Yes? Yes. There we go. Okay. Oh, it did work. I guess immutable. Right? Yeah. Whatever it is. It's not our my Strange. Lord. So we need to copy paste that ld-linux. We need to copy paste that ld-linux, I guess, once again, to execute There you go. This. Okay. Yeah. Who knows what to do next? Is it chattr, chmod? chattr, I guess. How to remove immutable. Is it oh, I haven't used Minus equals i? That sounds familiar. No. What is it? I haven't wait. I wanna say minus i.
14:23 Minus i. It looks like minus i. Okay. Let's see. We can that now. Okay. We have. Oh, Okay. Oh, that's fun. User. That's a cool one, though. Mhmm. That was like Using ld. Yeah. Thanks, Nicole. Gonna tell me. Okay. We we are gonna have to install fish or something. This this lack of completion is maddening. Okay. Probably, it's same tree. Same tree as the using the router. Some other maybe? lsattr on that bad boy. Oh, yeah. Okay. Yeah. Let let's fix chattr probably first. Yeah. It's immutable as well. Yeah. So we need to fix chattr itself. Right?
15:17 Yeah. Oh, go ahead. So we need to You need that ld-linux copied again? I don't know. No. It worked Oh. At least. Right? Cool. Nice. It looks like. Right? So -i. Oh, I'm always pressing. Okay. Do we have Yeah. I know. You can load your bash now. Now it should See if my now can Yeah. Do I have to do this on that right now? Nice. What? Give it the full path? Oh, is well, kubectl is there. What is our path? Was there. Was there just a moment ago. Nice. Is it actually kubectl?
16:12 Oh, let's see what it what it is. Is it Yeah. Let's see if it's binary or not. Might be a script. Yeah. I think they're promoting some product. No. Okay. Let's different. I mean okay. Yeah. Let's just curl it. I think let's get Yeah. Square. We'll just get it from the source. Okay. Whoever can copy paste because yeah. We really could've you know, that's that's like we really could've just skipped all of this, couldn't we have? And that's cool. Good too, man. Yeah. We did all that just to just to get kubectl and kubectl itself is replaced. Yeah. Yeah.
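The workaround in this stretch can be sketched in a few lines. A binary that has lost its execute bit can still be run by handing it to the dynamic loader, and a file that `lsattr` shows as immutable (the `i` flag) needs `chattr -i` before `chmod` will work on it. This is a minimal Python sketch that only builds the commands and does not run them. The loader path and the `/usr/bin` locations are assumptions (typical for x86-64 glibc systems), not values read out in the session.

```python
import shlex
from typing import List

# Assumption: usual glibc loader path on x86-64; the session never names it.
LOADER = "/lib64/ld-linux-x86-64.so.2"
# Assumption: tool locations on a merged-/usr Ubuntu host.
CHATTR = "/usr/bin/chattr"
CHMOD = "/usr/bin/chmod"


def is_immutable(lsattr_line: str) -> bool:
    """Return True if one line of `lsattr` output carries the immutable (i) flag."""
    flags, _, _path = lsattr_line.strip().partition(" ")
    return "i" in flags


def via_loader(binary: str, *args: str) -> List[str]:
    """Build an argv that runs `binary` through the dynamic loader,
    which does not require the binary itself to be marked executable."""
    return [LOADER, binary, *args]


def repair_plan(lsattr_line: str) -> List[List[str]]:
    """List the commands needed to make one binary executable again."""
    path = lsattr_line.strip().split()[-1]
    plan = []
    if is_immutable(lsattr_line):
        # chmod fails on an immutable file, so clear the flag first.
        plan.append(via_loader(CHATTR, "-i", path))
    plan.append(via_loader(CHMOD, "+x", path))
    return plan


if __name__ == "__main__":
    sample = "----i---------e------- /usr/bin/chmod"  # hypothetical lsattr line
    for argv in repair_plan(sample):
        print(shlex.join(argv))
```

Building the argv lists instead of executing them keeps the sketch safe to run anywhere. On a real host the same two commands would be typed by hand, as the team does here.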
16:15 Replacing the kubectl Binary
17:03 I wonder if that's the only chmod. Again. See, we would do it anyway. Yeah. Yeah. Yeah. Okay. Here we go. Alright. Get nodes. Okay. Alright. Let's move this back. Alright. Sorry. Yeah. Okay. Time 16:46. It's six minutes expired. Oh, you're gonna have to load those completions. That's gonna get really annoying. Yeah. Yeah. How well, I don't know what you're talking about. You want me to type what I think it is? Yes, please. Let me fix this real quick. There you go. Yeah. Which completions are you I think bash completions. Right? Bash dash completion, I think. Is
17:49 that what it is? Let's see. Fire as an argument. And then normally, you would source it. Dash oh, that don't work. Bash dash completions altogether, all one or completion singular maybe. Oh, that's roughly the command. If one of you knows what it is Something like that. Take over. Something like this, Spencer? Oh, go ahead. Probably need to install it. I think it's just not enough. Hang on. Sorry. But, I mean, we we probably need to fix this cert. Yeah. Just install fish. Yeah. Who cares? Yeah. You can just install fish. Go for it. Why not? Then
18:40 you're definitely gonna be the one driving, Andrew. Yeah. Did. Did. There. We never fixed echo. We need to just do the chattr on all of the files in /bin and /usr/bin. Open everything up. I imagine it looks it looks a bit of a wildcard. Just do one star. And a comment in the chat from Waleed saying you could probably source /etc/profile.d/completion. Sorry. I missed that. Oh, that chattr. I don't know. You want you want the the chattr? Yeah. I someone do it because I heard five people talking. So who wants to do it? Who
19:09 Fixing Shell Issues (Installing Fish)
19:32 has the idea? Do it. K. Okay. There we go. Yeah. I think it's alright. Hello, fish. There we go. Okay. Cool. Alright. Does it feel more like home there? Yes. A little bit. Yeah. A little bit. Never look at the server. Certificate expired or not valid yet. Yeah. Certificate expired ten minutes ago. So Yeah. I think we need a Is there the time right on this thread? Yeah. It looks like I mean, it it's right enough. Yeah. We could probably just alter the time and cheat. That will break teleport. Oh, well, then let's not do that. Okay.
20:03 Identifying Certificate Expiration
20:27 That's what we can do. Alright. I have to regenerate the cert. Is it kubeadm? kubeadm. So that's Spencer. Oh, yeah. Alright. I don't know. Let's let's rebuild the cluster. Yeah. Oh, so know about it. Well yeah. But doesn't kubeadm have a command for this? kubeadm certs, yeah, has a looks like renew. Yeah. kubeadm certs. Only renew just a single one if you if you just Oh, what somebody somebody has been using Yeah. Someone tried it. Yeah. Oh, I need the vi command, the vi bindings. Mode. Oh, yeah. Just change the
20:41 Attempting Certificate Renewal (kubeadm)
21:20 clock back. Okay. So we could do that. Yeah. That's what he was saying. You know, that would break Teleport. I mean, I so, obviously So I don't think we wanna renew the API server. We wanna renew the the admin cert or whatever Kubernetes calls the admin cert or all yeah. Sure. Why not? I don't know. But admin.conf looks like. It's gonna be fun if we break it even worse. Yeah. Let's not break it before we know it's broken. Oh. I cannot oh, right. We have the other screen. Error reading configuration from the cluster. Is that not successful?
22:08 Yeah. Because it's it relies on something like API access, Yeah. Right. You're right with the all. If you run the all, you should be able to bypass that. All. Now we need to restart things. Yeah. Yeah. Probably Just kill the pods. Everything should be a static pod except for kubelet. Right? I mean, this is probably crictl and just kill the pods so that they will be restarted by by the kubelet. I forget what's the socket runtime endpoint. That's right. crictl, and this is going to be /var/run or is it /var/containerd?
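The expiry check the team does by eye ("six minutes expired") can be sketched as a small calculation. The snippet below parses the date format printed by `openssl x509 -noout -enddate` and reports how long a certificate has been expired. The sample date is hypothetical, chosen only to mirror the 16:46 clock reading mentioned earlier. On a kubeadm cluster, `kubeadm certs check-expiration` reports the same information and `kubeadm certs renew all` regenerates the certificates.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_openssl_enddate(text: str) -> datetime:
    """Parse a line such as 'notAfter=Jun 10 16:40:00 2021 GMT' into a UTC datetime."""
    value = text.split("=", 1)[-1].strip()
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)


def expired_for(not_after: datetime, now: datetime) -> Optional[timedelta]:
    """Return how long the certificate has been expired, or None if it is still valid."""
    if now <= not_after:
        return None
    return now - not_after


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    not_after = parse_openssl_enddate("notAfter=Jun 10 16:40:00 2021 GMT")
    now = datetime(2021, 6, 10, 16, 46, tzinfo=timezone.utc)
    print(expired_for(not_after, now))
```

Changing the system clock would also make the certificate appear valid, but as noted in the session that would break Teleport, so renewing is the safer route.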
22:28 Restarting Control Plane Pods
23:11 Sorry. Say that again, Andrey? No. I mean, just pods. Yeah. I mean, just kill all of them and make kubelet restart them. Well, they should be static. Right? Can't I just move them out? You can move them out, yeah, as well. But I wonder why they are not ready, though. But Yeah. We'll get x let's get API access first maybe? Or Yeah. So at least, yeah, resolve the cert error. Well, let's be let's see what's in here first so that I know I'm copying. We are halfway through. Okay. We don't even have the control plane. Pressure.
23:54 Yeah. I know. Right? You have been dealt a rough hand, I must say. ETCD, though. Is that Yeah. Yeah. Can a directory for it first. Obviously, you we were saying it would be cheating to check the bash history, but, you know, when fish is showing us the stuff in front of you. Well, it's their fault for not clearing it. history -c is your friend. Yeah. Well, this is a little strange. This move is hanging. Yeah. It seems bad. Yeah. And I don't have access anymore. It is. Anyone lose access? What? You don't have access? Oh, wait. Uh-oh.
24:16 Talos Team Loses Teleport Access
24:50 Is Teleport running? Did Teleport depend on something here? Oh, wait. kube-vip. Oh, no. Is that gonna remove the Well, kube-vip was not ready anyway. We have lost our BGP advertisement. Nice. But the Teleport was connecting to localhost, so it shouldn't have been using the VIP. The BGP address is for the DNS name. Can we attach it to the packet after the Equinix metal? No. It's we can still SSH in. Somebody can still SSH into that. Yeah. I can. Yeah. Yeah. Just as a point here, kube-vip was running, API server was running, and etcd was
25:41 running. There were just two copies. But, yes, what we have done is killed our so I'm guessing this is how Red Hat killed the Telnet or Tel Teleport, whatever it is, to begin with, is they killed the VIP if it's dependent on the VIP. Alright. So I am on the control plane. Well, I've now learned a vital lesson. We we can go through go in through the backdoor and SSH and through the through the worker and into the the problem is the audience wouldn't be able to see that. They can see me doing it right now. We're
26:04 Regaining Access
26:18 good now. Yeah. We're good now. You wanna go into /etc/kubernetes/pki. That's where I made a directory called what. And then the kubectl needs to be put into /etc/kubernetes/manifests. Like so? Move one star. Yeah. Let me see. Yeah. That's I don't know where you are, but yes. I think so. But yeah. Okay. The static pod manifests are back in the directory. Should hopefully start. We broadcast our IP. And then for all future episodes, I will never use the VIP for Teleport. Yes. Well, he just asking why do you need a a VIP and a single API control
26:58 plane? I don't. Probably a fair question. It's just a cheap and easy way of doing the BGP advertisement. Exactly. Alright. So back. Do I need to, like, leave and come back to the session? My queue has not started yet. Oh, wow. I bet you it won't start because Red Hat broke it. Now we can't start stuff. Well, I mean, that was their goal. Right? They wanted the breakings, maybe the pods just aren't starting because of whatever Red Hat did. And now yeah. Yeah. Okay. Okay. So that means we have to go now. We have remote happens,
27:33 Debugging Kubelet and Static Pods
27:40 essentially. Alright. Well, the kubelet looks like it's and ears on the ground. I'm not supposed to be playing, but I will type whatever you want. I don't know what you think. Let's look at crictl pods. Right? I mean, just to see what's yeah. This one. Yes. Doesn't look it's picking up these at all. Yeah. We need to look at the log now. Yeah. So We should be running static pods. Yep. Should always be doing that. Yeah. Exactly. Unless it has some command line options telling it to look in a different place. We know that was running, though, of course,
28:24 because when we moved them out Right. Yeah. Was. Yeah. Still, it's worth checking the command line arguments to the kubelet. Kubelet kubelet conf, maybe. Yeah. Oh, it was in that directory. It yeah. Yeah. What'd wanna see? Just look for manifest if it's there. Yeah. Yeah. Static pod path. Yeah. Did we I don't know. Let's try turning it off and back on again. Yep. Node not found. That's No. This this is normal. Yeah. That's it. Yeah. I think yeah. Right. You're I think you may be back in the game now. It's it's good. Yeah. There you go. Yeah. We're back. Teleport's back.
29:34 Nice. I got disconnected, but I got it. I'm gonna allocate you a little bit of extra time because of that. I was I I was afraid we were being too mean. We were way too nice. Alright. All the nodes are good. Alright. Where were we? Remember to join my session, please. I think I joined it. The the new one. Right? Oh, there's a new session. Oh, there's a new one too. Sorry. Okay. Yeah. I think the old one might be on the end of the Okay. A few seconds ago, first. Oh, yeah. There's two. Alright. I can see
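The check described here, looking in the kubelet configuration for where it expects static pod manifests, can be sketched as below. The kubelet reads that directory from the `staticPodPath` field of its `KubeletConfiguration` file; on kubeadm clusters the file is commonly `/var/lib/kubelet/config.yaml` and the value is commonly `/etc/kubernetes/manifests`, though both are assumptions here because the session does not show them. The parser is deliberately naive (top-level key only, no YAML library) so that it stays self-contained.

```python
from typing import Optional


def static_pod_path(kubelet_config_text: str) -> Optional[str]:
    """Return the top-level staticPodPath value from kubelet config text, if present."""
    for line in kubelet_config_text.splitlines():
        # Drop trailing comments; keep leading whitespace so nested keys do not match.
        text = line.split("#", 1)[0].rstrip()
        if text.startswith("staticPodPath:"):
            value = text.split(":", 1)[1].strip().strip("'\"")
            return value or None
    return None


if __name__ == "__main__":
    # Hypothetical config content for illustration.
    sample = (
        "apiVersion: kubelet.config.k8s.io/v1beta1\n"
        "kind: KubeletConfiguration\n"
        "staticPodPath: /etc/kubernetes/manifests\n"
    )
    print(static_pod_path(sample))
```

If the function returns a directory that no longer holds the control-plane manifests, the kubelet has nothing to start, which matches what the team sees after moving the manifests out.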
30:14 who's joining. No. There we go. Yeah. Oh, straight into fish. Yeah. That's Alright. There we are. Cool. Yeah. He's doing all of that. Yeah. Yeah. Okay. Good own thing. Alright. Looks good. Cool. Yeah. There's a there's a Wonderland namespace with a rabbit hole that's not like the look of. Oops. Those are rough. Everything. Cool. What does it look like when we hit the service? Is there anything where Let's check the application. Yeah. Can we check the application? Alright. The application. That's it. Yep. So let's hit applications. We hit clustered and Failed to connect the database.
30:46 Checking Application Status (Database Failed)
31:18 Alright. Alright. So Okay. Have Postgres. Going by, so I guess the broke. So Let's check the network policies or whatever else. Yeah. There's endpoints. Yeah. Let's make sure that endpoint actually matches up with the clustered pod, and they're not Yeah. Pointing us to the wrong place somehow. Yep. What is it called? Is it called clustered? Is that it? Just The Postgres service is the one that's supposed to communicate right now. Oh, right. Right. Right. Yeah. There's been some keen observations in the chat if any one of you wants to take a quick look. Selector is Postgres.
31:27 Investigating Cilium Network Policies
32:16 That looks good to me. Yep. Sorry. What was the chat? Are there any So Noel in the chat noticed some weird DNS configuration in the. Okay. Maybe we look at No. Network policies. Okay. That's a with that maybe. Yeah. That's probably good. Mean, these are in the default namespace. But either way, let's go take a look at that kubelet. That's that was sounds interesting. In the kubelet config The config. They were saying they found they saw something somebody saw. That was Barliv from it? Yeah. I think so. Yeah. Oh, I didn't realize this. Okay. I see
33:15 the chat now. I was like, How do I see this? I see DNS. That's the only mention of it. It looks looks nice. Nice. The resolv.conf was the one that Noel pointed at. That looks fine. It's a system system needs. So oh, unless that's an extra e on the end of resolve. Do an ls -l on /etc/resolv.conf. That should be a link to the run. Oh, it's not a link to the run. Okay. So go to /run/systemd/resolve. Cat that. Is that different? I don't think so. Well, maybe it was, actually. No. Looks the same. I'm gonna email. So
34:18 Yeah. So that's fine or should be fine, assuming we can say ping google.com and we are resolving. Well, we have a duration. Oh, but but but we're not resolving to the well, we don't know that. In other words, are they overriding the CoreDNS? Can we get into the application pod and try to But, yeah, let's let's take a look at Like, yeah, ping the endpoint and, like, debug from that. Yeah. We should be able to you should be able to exec a bash on that. I'm gonna Okay. How do we k.
35:09 I didn't know you could do that to a deployment. That's nice. Oh, yeah. Yeah. That's so great. How to use the So nice. So nope. Really nice. Useful. Just try to ping it. Yeah. That's what we tell us more about it. Alright. DNS resolve. Action. Mhmm. Yeah. And I think the IP is But is it to the right? It's correct. Alright. So it's just being locked. Do we have a route IP route? 0220. Maybe. Could be right. Well, it's isn't it weird? No network policies. So so ping Kubernetes. Yep. So we can't even get to the API server
36:17 in that subnet. But, also, interestingly, the Kubernetes service IP is different from the I wonder if they switched the service IPs. Take a look at the Cilium configs. Yep. That service IP looks okay to me. Were you expecting to see something different? Well, I don't know. I don't know what the service subnet is. It may be fine depending on the scope. What are we looking for? CIDR here? Or yeah. Cluster pool CIDR is 10.0.0.0/8. Oh, what is there? Is it? Yeah. Yeah. So it's fine. Yeah. Let's look at the nodes. Maybe. We'll have a suggestion to check the Cilium network policy
37:16 whether it's blocked on the We That's a good idea. Oh, yes. Cilium has its own Cilium Yeah. Policies that are not in standard network policies. Yep. Cilium network policies. Yeah. And also Cilium plus UI. I do understand why they did that. Everybody seems to have to their have their own network policies, but it's still to do. Okay. Yeah. This is one hell of a No. It's like CNP, I think, but okay. And, yeah, there's there's just Cilium network policies as well. Just Cilium network policies. Yeah. Postgres. Yeah. Yeah. Can we just delete them? Delete them. Postgres.
38:13 Yeah. Just delete them. Nothing bad ever came from deleting a policy. Yeah. Actually, once. Just do dash a with delete, and that's it. Even Then the no. Dash dash Yeah. No. Yeah. But okay. That's what You're fine for that. Alright. Can we check the whatchamacallit now? Yep. Is the egress, the Postgres one something we should think about too? I don't know. I mean, probably, but that's in kube-system. That's, I don't know. I mean, judging by the name. Who knows? Connection to database. Delete that anyway. Yeah. Delete it. Yeah. We should delete it anyway. They
39:00 were obviously not at the same. And yeah. I'm just gonna do the way I know. Dash dash Yeah. Just the dash oh. Yeah. I think you're just missing the namespace. That's not shit. It is the scrolling keeps messing up. Yeah. I don't know what's going on. It's probably a mismatch in the resolutions, and Teleport is not It's it's not dash dash. It's dash dash. Yeah. Dash dash. That's right. Yeah. That's the easiest. Dash dash. Dash dash Dash dash. There we go. Can we try it now? Failed to connect the database. Alright. Well, we got that out of the
40:00 way. So default policy. And do we have a default policy in Cilium? Should we check whether ping is working now? What? Yeah. Sure. Kubernetes. No. No. And why this didn't work? Because this I mean, I haven't looked into these policies but judging by the name. It should block access only. Okay. So Yeah. Let's check the check the Cilium config and make sure the default policy is not enabled. Just How can we configure my session? You'll want namespace as You don't have the plug in for config config and yeah. What are looking for? Pipe grep policy.
40:27 Identifying Default Cilium Policy
41:04 Should we check the hints if we're running out of time? You have ten minutes with the additional time because of the teleport break. I'm not giving up, but just Enable policy by default. So let's turn that off. Set that to never. Okay. I'm not familiar with the enable policy default. Do you wanna does anyone know what that is? Yeah. So we can add, of course, network policies, but you can also assign a default to require the network policy. And so if we are requiring a network policy and there is no network policy, the default is to block.
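The behaviour explained here, that traffic is blocked by default when a policy is required and none exists, can be modelled in a few lines. Cilium documents three values for its `enable-policy` setting: `default`, `always`, and `never`. The function below is a simplified model of those modes, not Cilium's implementation. It ignores the ingress/egress split and treats "a policy selects this endpoint" as a single flag, and the function and parameter names are made up for illustration.

```python
VALID_MODES = {"default", "always", "never"}


def traffic_allowed(mode: str, selected_by_policy: bool, allowed_by_rule: bool) -> bool:
    """Simplified model of Cilium policy enforcement modes.

    mode               -- value of the enable-policy setting
    selected_by_policy -- True if at least one policy selects the endpoint
    allowed_by_rule    -- True if one of those policies allows this traffic
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown enforcement mode: {mode!r}")
    if mode == "never":
        # Enforcement disabled: policies are ignored entirely.
        return True
    if mode == "always":
        # Enforcement everywhere: no matching allow rule means the traffic is dropped.
        return selected_by_policy and allowed_by_rule
    # "default": endpoints are open until some policy selects them.
    return (not selected_by_policy) or allowed_by_rule


if __name__ == "__main__":
    for mode in ("default", "always", "never"):
        print(mode, traffic_allowed(mode, selected_by_policy=False, allowed_by_rule=False))
```

In this model, setting the mode to `never` lets traffic through regardless of which policies exist, which is consistent with the fix the team applies next.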
41:43 Uh-huh. Never? Yeah. Never. And will that require It's a good catch all for for making sure that you don't accidentally expose anything on the network. I guess we're gonna have to do a rollout or a delete. Yeah. Maybe don't dash dash all that. No. Right. Please. Let's kill teleports once again and get more time. Clock, please. Right? Alright. Check it now. K. Cilium is not yet ready. It worked. Oh. Hey. Hey. So the objective is to right now? So is this v one, I guess? Just so we could go edit our deployment right now. What else are we gonna encounter for v
42:45 two? Yeah. But Yeah. True. We'll see. Let's try to edit our deployment real fast. Yes. So let's just edit it. Is that all we do? Uh-huh. For everyone that's not familiar, the objective is to upgrade the clustered application from v one to v two and then see v two working. I should be doing my happy dance. Only deployment. Deployment. That Spanish? Yeah. Deployment. Edit. Just push push enter, and it works. You have only one. Yeah. I know. Yeah. It's just a reset. It's medium OCD and pedantic. You can search. Search for image. Yep. Why is there that? So much space. You
43:34 have seven minutes to exit them. There you go. Go, Alex. Alright. Give me a second. It's running. That's my sad face. Let me refresh again. Are you sure we have v two? Well, it's I mean, the old one is not yet done or according to the last can do another get pods? Somebody? Yeah. It should be up now. But let's let's check the Yep. There we go. Again. You got the dance. Well, that's alright. With one minute to spare. Well, you you had six minutes to spare because of the extra time. So Oh, okay. Okay. You swear you smashed it. Well done.
44:29 Lots of claps coming in in the chat there. Damn. They did not make that easy. No. They did not. No. They did not. That was fun. Well Mhmm. It's tedious. That was a lot of fun. See, that's what that's what we wanted to avoid in ours is eliminate the tedious that nobody We could've But we No. We had to go through the tedious. But it was I could take ten minutes by just downloading kubectl. It was entertaining, and you explained the processes and what you were doing very well. I think we got a lot of learnings out
45:04 of that. So That was very cool. Yeah. That was Yeah. Now I know what to do. A lot when chmod is not executable. Yeah. Yeah. Who knew that trick. Right? What? Alright. Well, thank interview question. Yeah. Yeah. Thanks. Thanks, Red Hat folks. Thank you. Alright. Well, thank you very much, Talos. I'm gonna say you may now exit the call. I will invite Red Hat to join us, and we'll see what tricks you have in store for them. So thank you again. It was an absolute pleasure. Alrighty. You. And good luck, Red Hat. Alright. We'll jump over here. And while we
45:38 are waiting, I will just quickly say thank you to our sponsor, Teleport. It's been a pleasure working with them. I like all the extra things that they've helped me configure for the environment here as well. It's really nice being able to just pop open our secret clustered application through the Teleport interface and check if it's working. Red Hat, hopefully you are all joining our session right now. I don't see anyone yet, which means I just have to talk. Although the latency isn't too bad, there was lots of great chat there, great advice. Although Noel, come on Noel. The DNS,
46:00 Team RedHat
46:12 you send them on a wild ride. And I oh, we got a comment from Candice from team Red Hat saying that it's actually one of their interview questions. There you go. Oh, team Red Hat are here. Let's pop over. Who do we have? We have Christophe. We've got James. There's Candice. And fashionably late, Jeff. Alright. Let me get your names on the screen. I gotta mute the stream. Oh, yeah. Don't have the stream running in the background. Here's our crew. Yeah. You didn't give them access to and removed executable status from kubectl, which was wasn't
46:57 even kubectl anyway. Come on. So we did a lot more than that. Actually, them circumventing the Cilium network policy was amazing. We did not plan That was brilliant. If they would have looked at the image that was running Cilium, it actually was running our custom operator that would recreate the network policies over and over and over, so they couldn't delete them. And they couldn't update them either. But, I mean, turning off all the Cilium network policies was beautiful. Around that problem. Yep. Well, there you go. The objectives are clear. You just need to be able to upgrade
47:38 the application and browse to it. So good job, Talos. But now it's your turn. I'm excited. Alright. So I'm gonna pull up Talos.cluster.live. Oh, sorry. Introductions. Can we start with you, Christophe, and just work your way around? Can you please all say hello and tell us a little bit about yourself? Sure thing. Hello, everyone. My name is Christophe Lecker. I am a principal engineer with Red Hat. I also am involved in the Kubernetes upstream project. I'm a member of the steering committee, and I'm a technical lead for the Contributor Experience SIG. Either one of you, Jiffy, James, go for
48:21 it. Whoever whatever direction you just wanna look around. Oh, nice. Alright. I guess, hello. I'm James Harrington. I'm a principal engineer on the, OpenShift dedicated team here at Red Hat. I'm also a a team lead alongside, Christophe and I'm part of the team with Candice and and Jeff. Hey, everyone. I'm Candice. I am a senior software engineer on the team with Christophe, James, and Jeff here. I'm also one of the region leads for our North American region here at Red Hat. Hello. Hello. My name is Jeffrey Sika. I'm a principal engineer at Red Hat. I am
49:08 the team lead for the CI team that works on CI for OpenShift Dedicated. I'm also kind of all over the place in the Kubernetes community and typically posting, like, geese memes and random GIFs on Twitter. So just see me around. Alright. Thank you all, and thank you for joining me. I'm going to hand over now your first control plane node. So hopefully you all have access to what we need. I have opened a session on the control plane. If you can all join, we'll keep an eye on who is here and then we'll get this configured
49:41 for our debugging. Yeah. I can see one already. Oh, five. There we go. Everyone is here. It's always fun seeing what hello message people like to choose to make sure the echo is working. Sup? Good day. Well, hi and hi. There we go. Please configure your kubeconfig and good luck. No control plane. Who'd have thunk it? By the way, did we decide who's driving? We did not. Half the call is gonna be who's driving. James said he would drive. James, you're driving. Lucky me. Alright. Let's go. Let's see. What do we got here? The
50:37 connection to the server is refused. Did you specify the right port? I would hope so. Let's have a look. Where is this kubeconfig? So the first thing that I would check is look at the IP address for this host and make sure that they line up. Do you want to create that alias, by the way? We can. I know that Which alias? Alias. So you can the Yeah. Or. Does someone wanna do that for me? For everyone laughing in the chat, this is a problem. Did I do that already? So what is OC?
51:44 OC is the OpenShift flavor, like, the OpenShift client for interacting with a cluster via the CLI. So Got it. There we go. Today, I learned. I cannot reprogram my brain to use anything but oc. So these are different. Yep. So one of them will be the Elastic IP rather than the machine. Okay. I'll start with this one and use kubectl. Okay. So if. So this looks like If you look on the loopback, you should see the BGP address. Looking for a I see a bunch of IPv6 stuff. Is this this one here?
52:50 The local address? Is that what I'm looking for? If you Do you mind if I type? Yeah. Go for it. Oh, old school. Sorry. So, yep, that is the one on lo here as our BGP address, which I think was correct on the kubeconfig. So am I putting that one back? Yeah. I wish. Yep. No worries. Okay. So the next thing that I would look at is, did they change the port that the API server is listening to? Also, please don't let me be the only one talking. Alright. Let's have a look. I always feel
53:40 bad. Man, OC undone. Alias. OC is aliased for you, James. Thanks. You're always so nice to me. We try. Okay. Alright. Kube server API IP address to get that. I'm going to look at CRI-O, maybe. What am I gonna look at here? There's I'd check which flags kubelet's running with Okay. To start with. Like, I'd just do a ps aux and grep for kube. Yeah. So don't kill Don't kill kube-vip. Got it. So is it running on that local address? Is that what's going on here? The API server is just not running. It's
54:42 a node IP. Okay. The the the API server is not running at all. Like, there's kubelet. There's the controller manager. There's a scheduler. There's etcd. Yeah. But I don't see the API server at all. That's usually a problem. Just in my experience, I thought I'd share that with you. That might be an issue. Thank you. You're right. Alright. So how do we get this thing running? I'm assuming that maybe the manifest is missing. It's there. What's in there? Alright. And we need to look at some logs here, probably. I'd check next the Ubuntu service status,
55:32 Like Yeah. service --status-all. Do you like this? Oh, sorry. Status dash all. I don't know what the recommend is saying. Let me drive for a sec. Go for it. K. Not listed in there. What were you looking for there? Sorry. I'm seeing how the API server is configured to It's a static pod manifest. So it's run by the kubelet, which is healthy and running. Maybe we can look at the kubelet logs, Christophe, and see if they've got any errors in there. Maybe it can't be parsed or something because it was in the manifest
56:42 folder. K. Go for it. I don't know where the kubelet logs go. The kubelet logs are available over journalctl, and the API server logs will be in /var/log/containers if you wanna look. Here we go. So trying to read this. There are so many errors. Oh my god. There are a lot of errors. Yeah. It definitely failed to start container kube-apiserver. Network not ready, but that would probably be because the API yeah. So API server is definitely plugin not initialized. But it needs the API server for that.
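The team later notices that the log files in this directory keep getting rotated, so it helps to always read the newest file for a given component. A minimal sketch of that lookup follows; the helper name `newest_log` is hypothetical, and the directory and file-name prefix are passed in rather than assumed.

```python
from pathlib import Path
from typing import Optional


def newest_log(log_dir: str, name_prefix: str) -> Optional[Path]:
    """Return the most recently modified regular file in log_dir whose
    name starts with name_prefix, or None when nothing matches.

    Entries are followed through symlinks; broken links are skipped
    because is_file() returns False for them.
    """
    candidates = [
        p for p in Path(log_dir).iterdir()
        if p.name.startswith(name_prefix) and p.is_file()
    ]
    if not candidates:
        return None
    # The file with the largest modification time is the current one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

On a control plane node this might be called as `newest_log("/var/log/containers", "kube-apiserver")`, using the directory mentioned in the session.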
57:33 Yeah. Where is the API error CrashLoopBackOff? Yeah. CrashLoopBackOff coming from the API server. Error syncing pod, skipping. Okay. So Can I drive for a minute? Go for it. Are both of you done looking at the logs? Yeah. Candice? Yeah. So, the CNI issues, you know, that's usually in something like /etc/cni/net.d maybe. I don't know if that's where we wanna start. What I want to do I already forgot because I was waiting. It had to do with the manifest. Wanted to, like, verify that the image was correct first.
58:53 And, like, maybe they put in an image that just didn't make sense, and that would cause it to crash. That all looks good. We got a suggestion in the chat to collect the must-gather. Thanks. I think you really need to look at the API server logs. Like, because you don't know why it's not running right now. Right? Correct. But API server log? Oh, yes. So the kubelet will have attempted to start it. I assume it's run and we probably have logs in /var/log/containers. I mean, I I I would check. You'll probably find there's a kube-apiserver.
59:43 Yep. Okay. So that Okay. So that's now It can't connect to that's etcd. Right? Yeah. But etcd was running? It looked like it from when we were running ps. Is the etcd port changed? Like, there's some things there that I wasn't looking at. 2380 no. 2379. Oh, crap. Sorry. The Talos team are reminding you there are hints if you want them. Did we see etcd on ps, though? I don't remember seeing it. We only looked for kube. Read that. No etcd. Okay. That'll do it. We should check the journal logs for etcd, try and see if there's any errors in
1:00:50 Introducing Team Red Hat & Initial Cluster Check
1:01:05 there for that. Wasn't the etcd also a static pod? Correct. Yeah. Uh-oh. No leader. Well Is it configured for when it's not we're not in configuration? It's rough. Alright. I'm tagging someone else in because I have no idea how to look that up. What's that? We should probably look at the etcd configuration. What's in the static pod manifest? etcd is one of those things I keep saying I'm gonna get better with debugging, but I am terrible at that. So I'll need some help to figure out how to check if this is configured for It's also saying that the
1:01:41 API Server Not Running
1:02:17 obviously, the health endpoint wasn't working in that log. Right? So startup probe, localhost 2381 for the health check. Is that right? Should that be zero? I think that port is okay. About with etcd. I do think that port is okay for the health check. Okay. I think. Just trying to look quickly what we can do with etcd here. So the listen client URLs kill the second one out of there. Listen client URLs. Where are you seeing that? Wait. Actually, one sec. No. Kill the first one. A loopback. Yeah. I think that's a good shot.
1:03:28 Yeah. Keep keep the private IP one, but kill the loopback out of listen client URLs. In this where is it? Yeah. Up. It's in it's in the up, up. It's in the command line flags. So there's one --listen-client-urls. Kill the loopback off of that. Keep the private IP address. I guess what you're thinking there is that the existence of that second IP address is putting it into an HA mode, which doesn't exist. Right? Yeah. This one? That look good? Yep. Yep. Yep. Alright. Now You should have the logs in this directory.
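The edit being attempted here, dropping one URL from a comma-separated flag in the etcd static pod command line, can be sketched as a small pure function. This is only an illustration of the change (which the team later reverts): the helper name `drop_url_from_flag` is hypothetical, and the private IP used in the usage note is a placeholder, not the address from the session.

```python
from typing import List


def drop_url_from_flag(args: List[str], flag: str, url: str) -> List[str]:
    """Return a copy of args in which `url` is removed from the
    comma-separated value of `flag=...`; other arguments are untouched."""
    prefix = flag + "="
    result = []
    for arg in args:
        if arg.startswith(prefix):
            # Split the flag value, filter out the unwanted URL, re-join.
            urls = [u for u in arg[len(prefix):].split(",") if u != url]
            arg = prefix + ",".join(urls)
        result.append(arg)
    return result
```

For example, given `--listen-client-urls=https://127.0.0.1:2379,https://10.0.0.5:2379` (placeholder addresses), removing `https://127.0.0.1:2379` leaves only the private-IP URL. Working on a copy makes it easy to diff the old and new command lines before touching the manifest.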
1:04:13 I would just do it. Yeah. Let's have a look. Scheduler API. There we go. There it is. Looks healthy. Hey. Yep. Ready. Is the API server running now? Still refused. Will that be still changed? The API server logs next. Yeah. Might be a layered issue. Or actually, there is a yeah. Yeah. kube-apiserver, Talos control plane one. Oh, let's see. Loopback address. Still can't connect to etcd. What if we change that to the private IP as well? Yep. kube-apiserver. 137. Alright. Anything else in here? No. No. That looks good. That file will probably get rotated. You might
1:04:50 Identifying the ETCD Issue
1:05:49 wanna Yeah. It's a good call. Yeah. Because you all broke Teleport, you've only got fifteen seconds left. Good luck. Yes. They're the only one I think. That should be at least a new one. Right? Well, the thing is the logs are still showing localhost. Do I need to kill this container? Which it might it might just be the CrashLoopBackOff. Like, it might be it might not have restarted the container yet. No. No. No. Put a new address. Okay. Yep. Still not happy. Happy Erbil. Right? I mean, it No. Maybe not.
1:06:58 nc might be available. Netcat. It is. I have no idea how to use Netcat off the top of my head. Syntax? I've added -z to the start. K. A z. Oh, sorry. That's sorry. And it's a space, not a colon. So same syntax. So space, not a colon. Yeah. And then echo the status, I think that yeah. It's okay. Okay. So that's working. Okay. Let's have a look at running. The port is open. Sending new addresses to Looks like that file got rotated. Same thing. Okay. So it's a decent so, like, a whoops. I just lost my screen. There you go.
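The `nc -z host port` check used here only tests whether a TCP connection can be opened. When Netcat is not installed, the same probe can be sketched with the Python standard library; the function name `port_open` is hypothetical.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established
    within `timeout` seconds, False otherwise (refused, unreachable,
    or timed out)."""
    try:
        # create_connection resolves the host and performs the handshake;
        # the context manager closes the socket straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `port_open("127.0.0.1", 2379)` would answer the question the team is asking about the etcd client port. An open port only shows that something is listening; it says nothing about whether etcd has a leader.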
1:08:06 Was the etcd still is it happy? I have I have Yeah. Go back to the etcd logs. Yeah. Okay. I think it is happy. Oh, no. It's still throwing no leader. K. Health error, no leader. You wanna double check the manifest again? Yep. Advertise client URLs. Is that the right port? Client URLs. Yeah. But why would we so should we even be is on localhost. Does that need to be changed? No. That should be okay. Should we even be have the okay. My brain. Should we even have peers Yes. Set up? Okay. Is etcdctl available?
1:08:35 ETCD Quorum Problems ("No Leader")
1:09:23 I always use a cheat sheet, but I'm gonna grab it because nobody in Kubernetes. But you will have to install it to the client if you wanna throw that into the thing just there. No, we got it. Oh, that means that someone has definitely been playing with it. There you go. etcdctl is now configured. You should be able to speak etcd, but obviously it's got a bit of an issue at the moment. Do we need? Connection refused. I mean, it's not, like, iptables or something like that. Right? Well Well, hold on. Should should we have
1:10:05 okay. Before in the advertised peers, we deleted the localhost. Should we have instead deleted the private IP? It's looking that way. Yeah. We can we can check that and push it back towards using just the loopback. This Is this not listening on localhost? Nope. No. It's not. Is that the issue? It might be. Yeah. Let's let's edit the etcd manifest to swap back to the loopback as opposed to private IP, and then we also need to do the same for the kube-apiserver. Yep. So etcd. Listen client. And the listen peer, same thing.
1:11:13 No. It it you'd it doesn't have any peers. So I'd say that's fine. You might need to change the advertised client URLs. Yeah. This is mean. I don't know which one of these it should be. We deserve everything that we're getting right now. Oh, we were nice. There's also an annotation. Okay. Yeah. I see. Yep. Alright. You have twenty minutes left. And remember, there are hints if needed. Yep. Under pressure. Alright. So that'll probably take a second while that's booting. We need to update the API server. Yeah. Manifest. At least I got auto complete. That could've
1:11:40 Consulting Hints for ETCD
1:12:12 been that was that was me. I'm sorry for that one. That was Candice. Yeah. That was before sorry. Harsh. Yep. There you go. Yep. Okay. Now let's give it a hot minute. Until the I mean, you should be able to check if etcd is healthy just now while you wait on the API server. Now that etcdctl is configured. You should be able to just do an etcdctl Status. Status. Yeah. I think. I don't I'm maybe etcdctl cluster health. Like this one word? Or dash Oh, sorry. Endpoint
1:13:25 health. Endpoint health. Endpoint space health. No. Can't connect on localhost. Check the etcd logs. Like, did it boot properly? I love that error message. Retrying of unary invoker failed. It's good to have etcd. Nobody? Nobody. Yeah. Let's have a look here. Could not connect to checking for PR, and it's got this IP. Do we miss something somewhere? Delete delete that flag. Like, I'd I'd edit the etcd config and just delete that flag. K. So can I ask for something out? Like, the way that you spin up etcd on these machines is, it doesn't really care
1:14:35 if there are peers or not upfront. So there's probably something in its data directory that tells it where the peers are rather than modifying the flags at all. So Where is the /var/lib/etcd. Thank you. Not bad. Alright. I'm officially running out of ideas here. I've got an idea. Yeah. Where's the hint? Talos team, if you're still watching, let us know where the hints are. Can you try etcdctl member list? I guess not. Okay. So I was gonna do a find across the file system for that IP it's trying to connect to and try and surface the
1:15:18 Hint 1: Insufficient Quorum
1:15:59 file that it's in. Apparently, there is a hint-1 file. In where at? Root. Where, Talos? Come on. /root. Oh, yeah. They're saying /root. I think they've maybe it is not in the control plane. That would be silly, though. I don't know how long this is gonna take. Alright. Let me check the worker nodes just in case. Alright. I don't mean to worry you, but SSH is not working on the worker node. Andrew from the team is saying a member list would have worked if we hadn't restarted etcd. Hello? Alright. I don't know what I'm doing here.
1:17:22 And they're telling me to check the workers, which I cannot get access to. So what I'm hearing is Red Hat has broken Teleport twice today. Doing well. And they're breaking up in your control plane. Like, is that proper oh, that's okay. Yeah. It's just the workers aren't working. Alright. I will SSH on to the workers manually, see if I can find you the hint. Oops. I've now broken everything. Okay. So is it worth taking a look back at this? We're sure you'll find the answer just like that. Alright. If you just wanna pay attention to
1:18:15 the broadcast here, we do have MOTD and we have hints. You just want hint one then? Hint one. Quorum, I don't okay. So Latin? In insufficient quorum. Yeah. Is that a hint or a taunt? I'm not sure. That's Probably a little bit of both. So we Can't I can't say we don't deserve it, though. No. I'm fine being stumped. At this point, we know that there's something wrong with the etcd, like, how it's configured. It is trying to when we when we reconfigured it to listen to loopback, did we check that it's actually able to listen to loopback and
1:19:01 where we can connect on it? Yeah. But I I think the problem is they've added a second member at one point, and now the cluster expects to find quorum and it can't. Yeah. So maybe we should be googling how to go from an HA etcd to a single node. Already doing it. Or we can fire hint two up. I don't know if that'll help or not, but then, yeah, I'm stumped. That's is. It's harsh. Oh, so removing a failed etcd node, it looks like there's, like, a flag for etcdctl. I think you and I just found the
1:19:40 Attempting to Remove Failed ETCD Member
1:19:57 same thing. Capital C or something? What was that host? Forget the log again. It may have rotated again. Oh, yep. I'm also not in the right directory. That that won't help. So if we go in here, tail the latest log, which is going to be this one. Yeah. I think you wanna remove that member A5E7C240BB16. Oh, yeah. Probably should use the ID. Yeah. Yep. ID. Alright. So it's etcdctl go ahead. Was just I was gonna say the same thing. etcdctl member space remove. Like so?
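The "insufficient quorum" hint and the theory that a second member was added fit together through simple majority arithmetic: a Raft-based store such as etcd needs more than half of its configured members to agree before it can elect a leader. A minimal sketch of that arithmetic (function names are hypothetical):

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True when enough members are reachable to elect a leader."""
    return healthy >= quorum(members)
```

With two configured members the quorum is two, so a single surviving member cannot elect a leader on its own, which matches the "no leader" errors in the logs. The same arithmetic suggests why an ordinary `member remove` is hard to land in this state: changing membership is itself a write that needs a leader.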
1:21:14 Yep. Silence Oak. Hold on. Hold on. I think I can help a little bit. Let me make this a more pasteable fashion. Keyboard, hard to type. We have ten minutes. Yep. Where is Chrome? Where is Chrome? So that's how you remove a member if it's completely lost quorum. We need So all that cert stuff is already configured. So you can probably just add the -C. Well, the one so the one thing that we need is we need that member ID again. I think I've You need the surviving host IP address. Right? Which is Which is
1:22:27 You can use 127.0.0.1. Right? Yeah. Just on the loopback? Let's try it. Alright. I'm pretty sure I severely messed up the terminal. Sorry about that. Yeah. You kinda did, but, you know, it'll Yeah. That I'll take that one. Who's typing? I am I am typing. I swear. But I need tail to be I just need one of them. Copy. Member remove. There's a do you wanna type reset to fix the terminal? Thanks, Jason. I think we're ready for hint number two, actually. Unknown shorthand flag. Do we just need --endpoints there? Yeah. Oh, did I miss that? I think Candice
1:24:25 is right. I don't think that -C. I think that might be either an older version or newer version. Looks like endpoints is the same thing. Thanks, Candice. It could be an etcd 2 to 3 difference. It did a thing. An angry thing. Conflicting environment variable is shadowed by corresponding Oh. Yeah. We already have the endpoints configured in the environment from that cheat sheet that we copy and pasted. So that member remove should maybe be working, but it isn't. Is there a force flag? --kill-with-fire maybe? If only. So according to the chat, we already figured
1:25:15 out hint number two. Yeah. Yeah. Do you mind if I look at the help on this? Alright. We got a you wanna see hint three? I think so. Sure. There's a snapshot saved at var slash at c d dot d g. Oh, that was nice. That's that's mean. Okay. We can we can restore time. Yep. etcdctl snapshot restore. Where is it? /var. There you go. Is that the only way to recover from a highly available etcd to a single node is a snapshot? I'm not 100% sure. Member list. You may have to restart etcd.
1:25:32 Hint 3: ETCD Snapshot Restore
1:26:31 Sure. Okay. I'd probably play it safe and remove the static pod manifest, do the restore, and then bring it back in. They are saying there is a way to do it without the restore. Maybe they can share that with us in the chat, and we'll share it with the audience. That's a tough one. Alright. It's not running. Looks good. I would run that restore again before you move it back just in case. Oh, glad you. Boom. Is it? Maybe you can just start it. No problem. Maybe maybe the restore is done and happy.
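The "play it safe" sequence described here (park the static pod manifest so the kubelet stops etcd, restore, then put the manifest back) can be written down as an ordered command plan. The sketch below only builds the argument lists and runs nothing. All paths are parameters: the snapshot location was not clearly captured in the recording, and the manifest and data-directory paths in the usage note are the kubeadm defaults, assumed rather than confirmed for this cluster. Moving the old data directory aside instead of deleting it is a cautious variant chosen for this sketch, since `etcdctl snapshot restore` expects the target `--data-dir` not to exist yet.

```python
import posixpath
from typing import List


def restore_plan(snapshot: str, data_dir: str,
                 manifest: str, parking_dir: str) -> List[List[str]]:
    """Return the commands, in order, for a single-node etcd restore.

    Nothing is executed; the caller decides how and whether to run them.
    """
    data_dir = data_dir.rstrip("/")
    parked = posixpath.join(parking_dir, posixpath.basename(manifest))
    return [
        # 1. Park the manifest so the kubelet stops the etcd static pod.
        ["mv", manifest, parked],
        # 2. Keep the broken data for later inspection instead of deleting it.
        ["mv", data_dir, data_dir + ".broken"],
        # 3. Rebuild a fresh data directory from the snapshot.
        ["etcdctl", "snapshot", "restore", snapshot, "--data-dir", data_dir],
        # 4. Put the manifest back; the kubelet starts etcd again.
        ["mv", parked, manifest],
    ]
```

A call might look like `restore_plan("<snapshot path>", "/var/lib/etcd", "/etc/kubernetes/manifests/etcd.yaml", "/root")`. Older etcdctl builds may need `ETCDCTL_API=3` set in the environment for the `snapshot` subcommand.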
1:28:02 I'm not sure. Let's see. Fingers crossed. But I'm assuming if there was one component that Talos were gonna mess with, you were really hoping it wasn't gonna be etcd. etcd is hard. Is the API server connecting now, is the etcd log still showing the issue? Well, you'll you'll wanna make sure you can run a member list on that etcd. We need to know that's somewhat healthy. We need to we need to specify the data directory when we restore it, when it's off. There you go. I think the data dir will be
1:29:04 Performing ETCD Snapshot Restore
1:29:56 /var/lib/etcd and then the snapshot file at the end. I'm gonna go out in eleven. So let's delete and then do the restore. Are we feeling brave? I'll do it. Let's do it. Oh, I think did we overwrite the snapshot with the restore? Restore. No? That seems to have worked. Alright. And fingers crossed. Come on, We believe in you. Anyone have a hard stop? Are you okay to go for another five to ten? Okay. We've got a members list. If we have a members list, I'm happy to give you a a little bit more time if
1:31:08 as long as no one has to leave. I have to leave in an hour and a half, so I got plenty of Alright. I'll see. I'll sell in. Right. Okay. We need we need to okay. Okay. Now we need to make sure that we are talking to the right etcd service. I don't know if we updated the kube-apiserver's config to talk to localhost. I forget. That was years ago. It was years ago. Yeah. That'd be the next thing I'd check. Alright. What are we checking? K. We did. The etcd-servers line. Yeah.
1:31:48 Okay. Yep. The only other thing is etcd is happy. Is everything in that config file okay? Let's see if the API server's up now. Like, is is there something else in the API server logs? API server is up. So let's look at the logs now. Logs are in /var/log. Containers. Containers. Container. Yep. It's not erroring. It's not error. Notes. So on your etcd static manifest, the advertised address is the IPv4 address, I believe. Even though it listens on 127, I think it advertises another one. I don't know
1:33:01 if that matters. Alright. Well, this is the external address of the API server as opposed to it connecting to etcd. So that should be fine. Try get nodes now that it's up and running. Oh. There we go. K. We're closer. Pods. One node. So Talos worker two is not ready. Cilium is restarting like mad, but that was probably because of the API server. So Talos worker What Talos worker two might actually recover on its own because of Cilium recovery. Perhaps. What's the, what's the status of the actual application? Let's check. Are we getting it are we
1:34:16 Application (v1) is Accessible
1:34:18 getting an error from the application itself? I'm assuming so. Oh, it's loading. It's working. And now your mission is to upgrade it. Oh, boy. Alright. I'm concerned. As we are. As are we all. Alright, Al. Is it deployment? Yeah. Deployment in the default namespace called clustered. It seemed a bit easy. I'm scared. It's exactly what I'm concerned. I mean, etcd problems are not easy. No. True. No. And we did we are we are five minutes over. Yeah. They were they were much kinder than we were. So it's either not pulling or still pulling an image.
1:34:32 Red Hat Team: Attempting Application Upgrade
1:35:40 So did I read the image name right on that deployment? It was, like, ghcr or something like that. Yep. It looked okay. Well, that's not ideal. Nope. So is the I thought you said you were pretty nice. Okay. So whatever is wrong with Talos worker two is actually causing this. Yeah. What are the well, does it need to schedule on Talos worker two? Is it on Talos worker two right now? Is there a node selector on here somewhere? You can just add nodeName to bypass the scheduler. That look great? Yeah. I think so. That's a hard way.
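Setting `nodeName` in a pod template tells Kubernetes which node the pod belongs to without involving the scheduler. A minimal sketch of that change on a Deployment represented as a plain dictionary follows; the helper name `pin_to_node` is hypothetical, and the deployment and node names in the test data are placeholders.

```python
import copy
from typing import Any, Dict


def pin_to_node(deployment: Dict[str, Any], node_name: str) -> Dict[str, Any]:
    """Return a copy of the Deployment with spec.template.spec.nodeName set,
    so new pods are bound to node_name without going through the scheduler."""
    pinned = copy.deepcopy(deployment)
    pod_spec = (
        pinned.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    pod_spec["nodeName"] = node_name
    return pinned
```

Bypassing the scheduler does not bypass the node: the kubelet and CNI on the chosen node still have to be healthy for the pod sandbox to start, which is the failure the team hits next.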
1:35:55 Application Pod Pending (Worker Node Issue)
1:37:33 It work. Pending. Well, if worker two is not ready, that's never gonna schedule. Wait. If worker two's. Is there is there a taint or something on node one that's preventing it from scheduling? No. It did schedule on node one, but we got problems setting up a sandbox because it's got networking CNI issues. And I don't think node two is actually ready, which means it won't be able to schedule on it. Schedule there. Yeah. Cilium is up and running on Cilium is up. False. Is not up and running on this node. Yeah. I'm sure when you ran get nodes,
1:38:45 it said worker two was not ready. Right? Yes. Have you described the node? Do we get a message? So I didn't see any errors. I mean, it could just be tainted. Oh my god. Is it really just that? Alright. Or you can look at hint four. Is try this one out. Yeah. It's not it's not cool enough. Mind if I drive for a sec? Go. Is kubelet not running on that node? Because it says kubelet has stopped posting node status. This space? No. Look. It so in Talos worker one, it did start kubelet after it had everything.
1:39:29 Consulting Hints for Worker Node
1:40:16 And you if you look at ours, it doesn't start kubelet. K. So kubelet's not running. David, can you open a session on worker two? Indeed. I would love to do it through Teleport, but I don't know what Talos have done. I have a session open on worker two. Do you want me to just start the kubelet? Oh, did your session work? Yes. Interesting. It does appear that the kubelet is working, but let me try and do this in a way that we can share it with people. So let me try this one more time.
1:40:59 Oh, yeah. There we go. Alright. You can join that session. Take it away. Let's call it five more minutes, and then we'll we'll wrap up. So use the hints if you need it. Have a look at the flags passed to the kubelet. Maybe it's just got the wrong IP or something like that since it was running. Wrong session. Which session is the right session? A few seconds ago. That's the right one. That's the one? Alright. Interesting. Why is it trying to register itself as Talos worker one? You have an idea? How we start? The kubelet. Sorry. What? I heard two things at once.
1:41:01 Debugging Worker Node (Kubelet, Hostname)
1:42:25 Sean McCart suggested just restarting the kubelet. Who actually takes thought service? Who's that? Me. Alright? Just because. At least you don't grab the whole thing first and then get the name and run it. It looks good. People do that. Alright. I'm gonna switch back to The node is ready now. Alright. We are back on the control plane. Why is the pod pending if you describe the pod? That's the one we added the node selector to, which may or may not work. Actually. Are you at Down. Sneaky change there from Talos. Sean is saying they changed the host name, restarted the kubelet,
1:43:41 Talos Team's Hostname Trick Revealed
1:43:45 and then changed the host name back. Oh my god. That's so cool. That is nice. Alright. Now, Ion, why don't you start? Yeah. Describe the pod. Container. So whenever I see sandbox, you wanna check the Cilium pods are scheduled. I'm seeing invalid capacity zero on image file system in the events. That's on Talos worker one as well. Oh, and Talos worker one. Look at capacity and items. Is that what's going on here? Have we really gone this far? Alright. We're gonna have to start wrapping things up because I do have to do daddy bedtime, I'm afraid. So why don't we why
1:45:15 Consulting Final Hints (Cilium CNI)
1:45:15 don't we look at these hints? Like, Sean is saying, let's burn the hints. So let's work let's work our way through. So five, let's turn it off and on again. That must be the kubelet. Six, it's not CNI. There's no way it's CNI. It is CNI. Definitely a Cilium issue. Alright. There's your hint. In /opt/cni/bin, there are lots of files. Running loopback gives an output that's wild. Well, let's go take a look at /opt/cni/bin on a worker. We will do it on worker two.
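One way to look for a swapped plugin without executing anything is to hash every file in the CNI binary directory and report files with identical contents. This is a sketch under the assumption that the tampering took the form of one plugin binary being copied over another; the function name `identical_files` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def identical_files(directory: str) -> List[List[str]]:
    """Group regular files in `directory` by SHA-256 digest and return the
    groups (lists of file names) that contain more than one file."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Calling `identical_files("/opt/cni/bin")` on a worker would list any plugins that are byte-for-byte copies of each other; two distinct plugin names sharing one digest would be worth a closer look. An empty result does not prove the directory is clean, since a replaced binary need not match any other file.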
1:45:54 /opt/cni/bin. Am I dumb? I'm probably an ls? Yeah. Oh, there's a loopback.bak. So the original loopback is a bridge plugin and then the loopback.bak is a loopback plugin. Oh, yeah. And also, I mean, that was quite brave, not even running them, just going for it. But sure. I yeah. I I yell at them, definitely. Alright. Let's see if we can get our pod scheduled then. You may have to delete the Cilium pods. Alright. Don't know if you'd have to delete the Cilium pod. I'm not sure to be honest,
1:46:44 Rescheduling the Application Pod
1:47:02 but you definitely need to delete the clustered one. We can default. Oh, it's running. It already finished. Yep. You Do we get a dance? Let me I don't think we have v2 yet. Alright. And there are oh, there are no more hints, so that potentially may be the last one. Oh, it depends where it's scheduled. Okay. Well, let's delete the pod, see it get recreated, and then Yeah. I know. Oh, try refreshing it, David. I I just loaded the app on my side and that it works. There we go. Bravo, Talos. Bravo. Yeah. That was tough. Well
1:48:10 Post-Challenge Discussion and Wrap-up
1:48:10 done. Well done. Those were two very, very brutal clusters. These are all terrible, terrible people. I gotta say. Aren't you happy you're not the one fixing them? I am very glad. I am not the one. Like, the minute I seen it was etcd, I was like, I'm I'm glad that's not me. Like, I I think that's that's my my crux. I just can't deal with the etcd stuff. So good perseverance, good working through it. And I think we I definitely learned a lot. I'm assuming everyone else learned a lot there. But the snapshots and the member lists and stuff like that.
1:48:44 Absolutely. Absolute evil. I'm not sure what's worse though. Removing executable on the things and then removing some binaries or etcd. Both particularly harsh. Alright. Thank you, team Red Hat. Was it fun? Did you enjoy it at least? Thanks for having us. Yeah. It was fun. Alright. Okay. Thank you very much. Have a wonderful day. Thank you for joining me, and I will speak to you all soon, I'm sure. Thanks, Dave. To our audience, thank you. That was absolute slog. That was very difficult. Thank you, Talos. Thank you, Red Hat. Thank you to Teleport. Remember, if you want a sponsor of the
1:49:24 show if you want to support our sponsor of the show, Teleport, please visit rockwood.live/teleport. Have a wonderful day. We will be back next week with more Clustered. I will speak to you all soon. Thanks a lot. Bye.