About this video
What You'll Learn
- Debug Kubernetes control-plane failures by tracing etcd authentication, API server startup, and control-plane connectivity issues.
- Resolve node readiness and scheduling problems by checking taints, run-once settings, maxPods limits, and container runtime socket paths.
- Fix image pull and registry access faults by correcting imagePullPolicy, DNS hosts entries, and kubelet manifest and certificate issues.
Teams Aerospike and Pixie Labs tackle broken Kubernetes clusters via Teleport. Aerospike untangles etcd auth, ImagePullPolicy and registry DNS; Pixie hits etcd ports, a rogue manifest, kubelet run-once and containerd socket paths.
Jump to a chapter
- 2:02 Introduction & Giveaway Info
- 2:41 Sponsor Thanks (Teleport, Equinix Metal)
- 3:55 Introducing Team Aerospike
- 5:37 Team Aerospike Begins Debugging
- 6:47 Investigating No Control Plane / API Server
- 11:46 Investigating API Server Ports & Logs
- 13:21 Investigating etcd Connection & Certs
- 16:22 etcd Permission Denied & Red Herring Hint
- 21:03 Hint: Check Root Directory (File Permissions)
- 23:11 Discovering & Disabling etcd Authentication
- 25:46 API Server Starts
- 25:59 Nodes Not Ready
- 28:30 Debugging Node Status / Removing Taints
- 29:46 Nodes Become Ready
- 31:08 Checking Service Status (v1 Working)
- 31:25 Attempting to Update to V2
- 32:14 Discovering ImagePullPolicy 'Never'
- 33:59 Image Pull Back Off (V2)
- 34:42 Checking Worker Node & Container Runtime
- 36:16 Investigating Registry Certificate / DNS Issue
- 37:33 Fixing Hosts File (DNS)
- 39:07 Pod Stuck in Backoff
- 39:22 V2 Pod Running & Service Working
- 40:05 Team Aerospike Success & Debrief
- 41:11 Transition to Team Pixielabs
- 41:55 Team Pixielabs Explains Breaks
- 43:30 Introducing Team Pixielabs & Product
- 1:00:57 Why the API Server Was Not Started
- 1:01:03 CNI Plugin Not Initialized
- 1:09:49 Error Logs
- 1:45:33 Team Pixielabs Begins Debugging
- 1:46:10 Competition Winners
- 1:46:57 Investigating No Control Plane / API Server (Pixie)
- 1:50:30 Finding Incorrect etcd Port in Manifest
- 1:51:48 Fixing etcd Port in Manifest
- 1:55:38 Kubelet Cannot Process Suspicious Manifest (`delete-me.yaml`)
- 1:56:06 Discussing & Removing `delete-me.yaml`
- 2:01:37 Fixing YAML Parsing Errors
- 2:06:16 Checking Kubelet Configuration & Logs
- 2:17:38 Discovering `maxPods: 1` Setting
- 2:19:00 Investigating Excessive Pod Creation
- 2:20:51 Investigating Worker Node Kubelet Issues
- 2:22:37 Finding Incorrect Containerd Socket Path
- 2:23:41 Fixing Containerd Config
- 2:26:02 Discovering Kubelet Run-Once Setting
- 2:26:40 Hint: Remove Kubelet Run-Once
- 2:28:24 Fixing systemd Unit Run-Once
- 3:00:35 Discovering & Bypassing Broken Scheduler (`nodeName`)
- 3:17:40 Finding Worker Node `maxPods: 1`
- 3:21:40 Reviewing Pods & Pending Issues (Scheduler, Resources, Port)
- 3:23:14 Investigating Port Conflict (NGINX)
- 3:26:38 Calling for Explanation / Cluster is Dumpster Fire
- 3:26:59 Hint: /run Partition is Full
- 3:27:20 Calling Time / Multiple Unsolved Breaks
- 3:27:59 Pixielabs Debrief / Reflection
- 3:30:01 Giveaway Drawing
- 3:31:11 Conclusion / Winners Announced
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:02 Introduction & Giveaway Info
2:02 Hello, and welcome back to the Rawkode Academy. This is clustered day. We have two great teams that are gonna attempt to fix a broken Kubernetes cluster broken by the other team. Now, if you want to win a clustered T-shirt, which I should probably have been wearing today, jump to Rawkode.live/win. All you have to do is retweet a tweet that will enter you into the draw, and we will draw two winners of a T-shirt at the end of today's episode. So Rawkode.live/win. I will pop that back up later to
2:38 remind you all, but first I need to thank our wonderful sponsors. So first up is Teleport. Teleport have been sponsoring the show for a long time now. We have used Teleport in every single episode of clustered. It's an amazing tool. I always say that every single cluster, every production environment should have a Teleport. Why? Well, you can use GitHub, Google, and other OAuth-based providers to commoditize access to your cluster in a secure fashion that doesn't rely on SSH keys. You can also do pairing and debugging like we're going to do today. We have multiple people in a single session debugging a
2:41 Sponsor Thanks (Teleport, Equinix Metal)
3:14 production issue. That kind of latency when fixing things is invaluable. I also wanna thank Equinix Metal. They have provided the hardware for every single episode of clustered, and they also gave us a code that you can all use to check it out. So if you want to check out Equinix Metal for 200 USD, which is around one hundred hours of compute, you can use the code Rawkode. Go to Rawkode.live/metal to find out more information. Equinix Metal is a bare metal cloud. We don't use VMs for clustered. We use big chunky machines with tons of cores and tons of memory.
3:48 Why? Because it makes it more fun for me. So go check out Equinix Metal and have some fun with that code. Alrighty. Now it's time to introduce our first team. We are joined by team Aerospike. Hello, team Aerospike. How are you all doing? Doing good. Alright. Can we please say hello? We'll start top right and feel free to introduce yourself, and we'll get started just as soon as we've done that. Angeli, would you like to go first? Yeah. Hi. My name is Angeli. I am a new grad DevOps engineer at Aerospike, and I'm excited to participate in this,
3:55 Introducing Team Aerospike
4:31 event. Awesome. Thank you. Jeff? Yeah. Jeff McCormick. I work for Aerospike on their database as a service project and Kubernetes developer. Anyway, happy to be here. I'm broadcasting or coming to you today from San Antonio, Texas. So, anyway, happy to be here and happy to give it a try. Awesome. Thank you. And familiar face, Merrick. Yeah. My name is Merrick Counts. I absolutely love clustered. I learn something new every single time I'm here. I'm super happy that the rest of the gang, we're a new team, so we're just kind of, this is
5:15 kind of a team building experience for us. I'm super excited that they were willing to join me and do this. And so yeah. Super excited to be here. I work for Aerospike, obviously, on their database as a service, and I work with Jeff. So yeah. Anyways. Awesome. Thank you, sir. Awesome. Thank you all for saying hello. We're gonna dive straight into this. We've got two clusters to get through today, and you're unfortunate team one. I don't know if fortunate or unfortunate. I guess it could go either way. It depends on the cluster. Right?
5:37 Team Aerospike Begins Debugging
5:47 So let's pop up my screen share. We have Teleport. I have already modified your role, so you do have access to the Pixie Labs cluster. I can see someone's already run ls, so there's someone else that's already in my session. But I have opened a session on Pixie Labs control plane one. Now if you can just all give me an echo or something to let me know that you're here, and we'll get things kicked off for today. Hello, hi. I'm not sure who hi was, but hi. Alright. We got some random characters. That's always
6:18 a good way to go too. Alright. And one more hi. Hopefully, not the same hi. But, however, I'm pretty confident that we're all in a session. You have forty minutes, team Aerospike. Best of luck. Take it away. So I guess, yeah, we can set that up. I guess the first thing we do is try to connect to the cluster and see if yeah. Uh-oh. No control plane. Alright. Since this is kubeadm, maybe we can check the manifest, but we could also just check to see if the kubelet's running, just like systemctl.
6:47 Investigating No Control Plane / API Server
7:02 The kubelet should be running on this. Definitely Kubernetes thing's running. And what appears to be a kubelet? Yeah. I never just say there's a kubelet running anymore because I never trust anything in these clusters after they've been broken. What are we thinking, team Aerospike? I don't know. We can run systemctl status on it. I have also learned not to trust the binaries. Yeah. Why would you trust the binary? Hello to everyone on the chat. Remember, you can win a clustered T-shirt at Rawkode.live/win. So it says the container runtime network is not ready.
8:28 Someone's opened them already. Come on. Let me try the session from the web. I quite like the the ping yahoo.com. I never trust a ping to google.com either. I usually go about off field and pick something like that. Just to challenge the ping command, you know. Google is a bit easy. So we have a kubelet running, appears to be healthy, but no API server. We're in the static manifest for the API server. I mean, the timestamp on it doesn't look like it had been changed, but I'm assuming you don't trust the timestamp either. Well, I mean, you wouldn't be able to
9:49 because you're not connecting the certs there. You're connecting to the SSL. You'd have to also curl with the proper certs. Right? It's true. Are there any endpoints that are not Let's see here. It shouldn't be. I I was gonna say, can you also just do, like, a port scan like, list of the ports and see if it's actually on that port? Oh, wait. I didn't see its port configuring that. Is it oh, its secure port is at 6443. Mhmm. We have to check one more. Yeah. It is also there at 6443. Yep. Looks alright to me.
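The port check suggested here can be sketched in a few lines of Python. This is an illustration only, not what the team ran in the session: `port_is_open` is a hypothetical helper name, and 6443 is the secure port read out of the manifest in the conversation above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (refused, timeout, unreachable)
        # when nothing is accepting connections on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 6443 is the kube-apiserver secure port mentioned in the session.
    print("kube-apiserver listening:", port_is_open("127.0.0.1", 6443))
```

A successful connection only shows that something is listening. As noted in the discussion, talking to the API itself still needs TLS and the right client certificates.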
10:45 Mhmm. We could check the wait. Wait. And I'm sorry if I missed something, I was reading chat. And thank you, Jeff, for typing because last time I typed, I catted a binary file and I'm a little scarred still from that. Happened two weeks in a row. You'd never believe it. It happened two weeks in a row. I was so happy when somebody else did it. So I was just like, ah, phew. Same thought. Easy. So the other thing we could do is check the kubelet because this is a static pod. So if it's not running for
11:24 some reason, we could check the container runtime as well as the kubelet logs for why it might not be running, if it's not running. Or do we think that we just don't have access? Maybe I'm a little behind. I don't know that we verified that it's actually running and listening on that port. Yeah. You said something interesting earlier. Can we port scan or list the ports? I think that's a good thread to pull on. Can we grep that on the yeah. So I don't think it's actually running for reasons. I think you're right. Can
11:46 Investigating API Server Ports & Logs
12:12 you get logs from the API server? So Remember the big log file position there. So container log Actually, could you check template. Was Mark, are you gonna say that, or should I continue? No. You continue. I stopped. Yeah. So you can go to /var/log/containers or /var/log/pods, and that will allow you to see the logs for containers that spun up temporarily. So there is an API server directory or something there depending on the colors. Yeah. Directory. There we go. Alright. So they screwed with etcd. I hate you guys. Every time I see etcd in a log
13:21 Investigating etcd Connection & Certs
13:27 file, I'm ready to go home. So Yeah. Yeah. Okay. So there's a couple things here. There's actually two certs I was going to say that they could screw up that I can think of off the top of my head. You have the one that allows you but you have the cert to talk to etcd. So can we actually just cat the etcd conf file? Not the binary. Not the oh, yes, please. Not the binary. If you go into the manifest go into the manifest. There should be a manifest that describes etcd. Right there. Right there. The
14:10 etcd.yaml. And here's where it's going to define its certs, and then we gotta make sure that the API server is loading the right certs as well. Oh gosh. Did you check to see if etcd was running first? Good point. It does appear to be. It's running as a static pod manifest, so it wouldn't be in systemctl, I don't think. Okay. Well, it seems to be running anyway. So Well, if you had etcdctl, I don't know if there's a nicer way to say that command. You could run health and status commands, but I think you'll need
15:09 to install it. It's etcd-client, maybe. And then there is a bunch of setup, which I never remember. So I'm gonna grab a cheat sheet that I always use. I was gonna say that you have to pass it to Yeah. The API version and a bunch of other crap. It's not the one I normally use. Where's the one I normally use? Oh, I'm never gonna figure that out because I use Dot go. That's the one. Do you mind if I paste onto your session? Please do. Alright. So that exports all the
16:07 stuff that you need, and then you should be able to use etcdctl get. And as you'll see, you now have the permissions that I've replicated. So you can maybe use some other etcdctl commands to work out what's going on. So it said permission denied for the get. I'm wondering if they changed those certs on us. Try an etcdctl health or status. I can't remember. I think it's health, but I don't remember either. Yeah. Unknown command. That may just be it is. I don't know. People don't play with that too directly. Right?
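The setup that gets pasted here is not shown in the captions. Below is a minimal sketch of what such a cheat sheet typically exports. It assumes the kubeadm default certificate paths under /etc/kubernetes/pki/etcd and the default client endpoint on port 2379; both are assumptions about this cluster, and the helper names are made up for illustration.

```python
import os

# Assumed kubeadm default location of the etcd certificates.
PKI_DIR = "/etc/kubernetes/pki/etcd"


def etcdctl_env(endpoint: str = "https://127.0.0.1:2379",
                pki_dir: str = PKI_DIR) -> dict:
    """Build an environment that lets etcdctl reach a TLS-protected etcd."""
    env = dict(os.environ)
    env.update({
        "ETCDCTL_API": "3",              # use the v3 API
        "ETCDCTL_ENDPOINTS": endpoint,   # equivalent to --endpoints
        "ETCDCTL_CACERT": f"{pki_dir}/ca.crt",
        "ETCDCTL_CERT": f"{pki_dir}/server.crt",
        "ETCDCTL_KEY": f"{pki_dir}/server.key",
    })
    return env


def etcdctl_command(*args: str) -> list:
    """Return the argv list for an etcdctl invocation."""
    return ["etcdctl", *args]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # on a machine without etcdctl installed.
    print(" ".join(etcdctl_command("endpoint", "health")))
```

On a node with etcdctl installed, the pair could be handed to `subprocess.run(etcdctl_command("endpoint", "health"), env=etcdctl_env())`.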
16:22 etcd Permission Denied & Red Herring Hint
16:56 That's too hard. I'm looking, Jeff. Give me a second. Try etcdctl --help. Endpoint health and endpoint status might be a good one to run. Think you have to pass an endpoint for that. Nope. Both returned good. So I think their etcd cluster might be alright. What else could the permissions problem be? So, actually, we could look at the API server config. Does it use the same certs as we're using here? So did they just update the certs that it's trying to use to communicate with it to bad certs or screw with them?
18:03 Potentially. You could always go into the PKI directory and check some file timestamps as a superficial check. But we do have someone from Pixielabs, Philip, in the chat saying not a cert problem. Oh. Oh. But where else do we have permissions? File system? I mean, is it a permissions problem? Or oh, actually, they could have changed the endpoint that it's trying to hit to be the wrong endpoint. Did we validate that it's actually going to the right endpoint? Oh, does etcd have RBAC? I don't know. Kevin also confirming that permission denied comes from
18:57 etcd, and it wouldn't be a cert issue. So definitely not certs. Yeah. Oh, yeah. Lots of confirmations. Stop looking at the certs. Alright. So it's not certs, permission denied coming from etcd. It could be the user, like so I guess RBAC, our user doesn't have it. I don't know how to configure RBAC. Let me look at RBAC for etcd. So And from the Vihang on the Pixielabs team what'd you say? Vihang from the Pixielabs team is saying a reminder that the etcdctl get also failed. That's important. K. Get also failed.
20:11 Could you also do the etcdctl user list? I'm not an etcd person, But Oh, it's always just jury commands. I thought you went on to one earlier. Oh, well. Yeah. Did you try user list? My session broke. I'm reconnecting. I'll be right back. There are hints in that root directory. Might be a good time to check the first one out. Yep. I agree. And we got some confirmation from Philip and Vihang. So firstly, the etcd image has not been changed. And secondly, I promise you, this is less sinister than it seems.
21:03 Hint: Check Root Directory (File Permissions)
21:22 You modified etcd. That's as sinister as it gets on clustered. Come on. Alright. So they're saying root to root. That looks like Linux file permissions. Yeah. That's where my brain was going to. So where's the etcd file database live? Oh, is it etcd that's having a problem loading its file? Like, it doesn't have permission to open the database file? It's just something data. Yeah. That's helpful. /var/lib/etcd. Jeff, could we look at /var/lib/etcd? Yeah. And look at the file permissions on the file? I'm sorry. What's the path again? /var/lib/etcd, I think, what the yeah.
22:26 Member. Well, everything's owned by root, isn't it? So Yep. They have said that the failed permission is a red herring, and we've been encouraged to run etcdctl -h once again. But no disk limits failed permissions, but we do need to be curious about the root root. Well, let's take a look at dash help here. Oh, I think I've worked it out. There's a user flag that can take a password there. You wanna try the get again? Passing on the user? Try root. Yeah. Hey. They've turned on authentication or at least non
23:11 Discovering & Disabling etcd Authentication
23:48 X.509 authentication. Can we just turn that off? What's the easiest way here? Like so I honestly don't know. I've never done this with it. Let's go take a look at the etcd manifest. I'm not entirely sure. Let's see if we can find out if it's in here. Oh, let me jump back over. Oh, wait. Wait. That's cert auth. I don't think that's Yeah. We've been told to look at etcdctl -h again. Or maybe we're celebrating the fact that we did look at etcdctl -h.
24:42 Not in the manifest. Okay. Not in the manifest. Alright. So Would be Let's see. Member add root maybe or So there's a change to the password. Do an auth disable. Yeah. Auth disable is good. Well, it doesn't turn it off. Oh, you'll need your dad. Yeah. You got it. So now we're looking to see if we have an API server coming online. So maybe worth checking or tailing those logs again, see what happens. And there we go. Alright. So the nodes aren't ready. So this maybe says something with the kubelet. We could give them just a second, but
25:59 Nodes Not Ready
26:10 I'm guessing if they mess with etcd, they're not coming back. Yep. I guess maybe we check the can you actually just do, like, a describe on the node? Overcommitted. No. No. It's not never mind. It's not overcommitted. Sorry. We've got twenty minutes left. Wait a minute. Loads of things. Like, are the nodes taking too many resources? Got plenty of memory. Yeah. Yeah. These are fairly chunky bare metal machines. So Yep. So what else causes a node to report its status as not ready? The labels. So I apologize. One question. My terminal keeps not tracking the like, tailing his
27:49 things. Am I doing something to screw that up? Or do I just have to Resize it like a pixel. It seems to kick it back into shape. It's an xterm.js bug. So, unfortunately, we do just have to work around it. So if you do think that it's not tracking properly, a small resize of your window will give it a nudge. Just don't resize it too much because then it throws off for everybody. Terminals are hard. I'm on i3, so the resize was taking up to changing the half of the screen. Oh,
28:21 well, we'll see how we get on. Alright. Let's work out this not ready. Alright. Now those taints can just be removed manually. You can always try that and see if they come back. Right? I mean, did they cordon them? They could have just cordoned them. I don't know. You'd have to do a describe or an edit on a node. But Philip in the chat says they weren't expecting not ready status. So Uh-oh. So we may have some collateral damage. This is what you get. Well Yeah. That's the way it goes sometimes. Yeah. Let's try and edit nodes. You just
28:30 Debugging Node Status / Removing Taints
29:16 remove it. If it comes back, we can look at the events and we can work our way. But edit nodes, pixielabs-control-plane-1. Oh, well. Worker one came back at least. So Is this just etcd being slow with, Yeah. It's back. There we go. Yes. I'm about to debug something that wasn't broken. Always fun with Kubernetes. Well, it looks good. Could you check the stateful set? There should be a database in the stateful set. Looks good. You're gonna have to run get pods now. Put us all out of misery.
29:46 Nodes Become Ready
30:20 Yeah. So we wanna do this. Let's see. Clustered. Puzzled people always call this your cluster d. Maybe I should rename it. Can you also do, like, a dash capital A and see if they, like, did something sneaky in a different namespace? Like, don't do get deployment. Do, like, get pods and then do dash A. Or could you actually do get all dash capital A. Yeah. I think rubber cable is right here. Like, you haven't checked the web page or the service. You don't even know if it's working or not working. It could be fine. Right?
31:08 Checking Service Status (v1 Working)
31:11 How do we check the web page again? I can try curl on localhost 30000. V1 is working. So now we try to update to v2. So we just go ahead and edit the deployment and go to v2? That's the mission. Uh-oh. So image never pulled. This is either networking You would have to get pod -o yaml. Yeah. There we go. It's not installed. It's containerd, isn't it? So they have an imagePullPolicy of Never. And the image is still set as v1, which looks interesting. Yeah.
32:14 Discovering ImagePullPolicy 'Never'
32:53 Wait. Wait. Oh, you grepped the one that's running. Do the get pod on the one that's not running. It's the six q yeah. Image never pull. Right. I think that we need to change the imagePullPolicy. It's like they did something else and expected us to look at that imagePullPolicy. Couldn't have been that simple. And now it can't pull the image. ImagePullBackOff. URL looks okay. What next? So that could be, like, the container runtime, which would be going to pull this. Right? Yep.
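The check made by eye here, reading the pod YAML for its pull policy, can be sketched as a small function over a pod object as returned by `kubectl get pod -o json`. The function name, the sample pod, and its image reference are invented for illustration and are not taken from the session.

```python
def containers_that_never_pull(pod: dict) -> list:
    """Return names of containers whose imagePullPolicy is 'Never'.

    With 'Never', the kubelet only uses an image already present on the
    node, so switching to a new tag that is not cached cannot start.
    """
    spec = pod.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    return [c["name"] for c in containers
            if c.get("imagePullPolicy") == "Never"]


if __name__ == "__main__":
    # Hypothetical pod: names and image are placeholders.
    sample_pod = {
        "spec": {
            "containers": [
                {"name": "web",
                 "image": "registry.example/app:v2",
                 "imagePullPolicy": "Never"},
                {"name": "sidecar",
                 "image": "registry.example/sidecar:v1",
                 "imagePullPolicy": "IfNotPresent"},
            ]
        }
    }
    print(containers_that_never_pull(sample_pod))
```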
33:59 Image Pull Back Off (V2)
34:22 Yep. containerd. Jeff, maybe we check oh, what were you saying, David? Yes. These are containerd clusters. containerd is responsible for pulling the image. I was gonna say maybe we could check the logs on containerd on the worker node since this would have to be running on the worker node. Would you like me to open a session on the worker node? Yeah. Alright. Opening that. All yours. Take it away. You have twelve minutes. No pressure. So does containerd ship with any CLI tools that allow you to pull images? I was wondering about that image is definitely
34:42 Checking Worker Node & Container Runtime
35:42 out there. I can see it. You know? So we have access to crictl (c-r-i-c-t-l) and to ctr. Those are two command line options for working with containerd. For anybody wanting to know, nerdctl is a fantastic one that is like a Docker-compatible API for containerd. Install it. Yeah. Oh, we got an error message. That's interesting. googlecode.com. So it looks like they made it so that they gave a certificate that's valid for things that aren't GHCR. So my image comes from GitHub Container Registry,
36:16 Investigating Registry Certificate / DNS Issue
36:44 and you're getting certificates for Google Container Registry, gcr.io. But how can they modify that resolution? DNS. Wait. Can we look at the hints? Are there hints on this one? Oh, I mean, you maybe look at the hosts file first, or the resolv.conf. Oh. Alright. I don't know what dot 82 is, but I'm guessing that's wrong. I can go look. Yeah. That worked, didn't it? So I missed that because I was joining the session. So did they add something to the hosts file? Yeah. They overwrote Yeah. G c r. So
37:33 Fixing Hosts File (DNS)
38:12 I commented it out, and then that ran, which means back on the other, now we should be able to get it to pull that image maybe. There we go. Alright. What next? Actually, Jeff, could you just they did it on both. Thank you. So what happens when you run get pods now? Well, maybe try deleting our back-off pod. Okay. Yeah. Rubber cable in the chat with the same suggestion there. Good call. Hey. It's working. Or maybe it may be working. Alright? Alright. So they got you with both a pull policy and a DNS check.
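The hosts-file override found here can be detected with a short scan of the file's contents. This is a sketch under assumptions: the captions do not show the exact entry, so the sample below guesses that ghcr.io was pinned to an address ending in .82, and the IP itself is a documentation placeholder. `hosts_overrides` is a hypothetical helper name.

```python
def hosts_overrides(hosts_text: str, watched: set) -> dict:
    """Map each watched hostname pinned in a hosts file to its IP."""
    found = {}
    for raw_line in hosts_text.splitlines():
        # Drop comments, including commented-out entries.
        line = raw_line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            if name in watched:
                found[name] = ip
    return found


if __name__ == "__main__":
    # Placeholder content; the real file on the node was not captured.
    sample = "127.0.0.1 localhost\n203.0.113.82 ghcr.io\n"
    print(hosts_overrides(sample, {"ghcr.io", "gcr.io"}))
```

On a node, the text would come from reading /etc/hosts. An empty result means none of the watched registries are pinned locally, and resolution falls through to DNS.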
39:22 V2 Pod Running & Service Working
39:42 And that is the v2 image. I can pull that up here too if you wanna confirm. Yep. That's interesting. Okay. Sometimes that happens. Yeah. I really need to work on it. Alright. I think that works, isn't it? You have to dance. Well done. Probably the end of our time. Because if there is such a thing as easy. Right? Well, that he's worked through that problem after problem. Good job. etcd is a good one. You know, honestly, I can say, like, I can't tell you how many times etcd has bitten me in the hind end.
40:05 Team Aerospike Success & Debrief
40:34 Yeah. It's a good one to actually troubleshoot. It's been forever since I've touched it last. So Yeah. Yeah. I'm not an etcd expert at all. So Well, at least now we've all learned a valuable production hack. It's etcdctl auth disable. Like, that's now my go-to. So, you know, awesome. That's exactly what you wanna do in production is disable all security. Alright. Well, you just managed to work through all of those problems with six to seven minutes left. So well done, team Aerospike. I'm gonna ask you to kindly go watch from YouTube, and
41:11 Transition to Team Pixielabs
41:13 I will invite on our friends at Pixielabs to see what you have in store for them. So thanks a lot. Good job, and I'll speak to you also. Thank Alright. I'll just cut my off there. Sorry, mate. But we'll wait here until we get our team from Pixie Labs over. I'll also remind you that you can win a clustered T-shirt by going to Rawkode.live/win. Cool. Let's see. We've got a few people joining now. So we got Philip, Kjell, and Ben. Hey. How's it going? It's going well. Thanks for having us. Fun watching someone
41:52 else fix your broken cluster? Yeah. Absolutely. Work to that team, to Aerospike. Yeah. Awesome. We actually had one more break left, but it somehow got overwritten. So, yeah, we added a network block on ghcr.io using XDP and eBPF. And I don't know. So maybe the container got cached or something like that. So, unfortunately, it did not show up. But you could actually see it in one of the, I think it was the crictl calls. Like, there was no network resolution there. So yeah. Unfortunately, something went wrong. So maybe we just kept it cached, the container
41:55 Team Pixielabs Explains Breaks
42:39 See if I have something. I mean, I don't wanna speak for them, but I haven't seen eBPF breaks on this before. Nobody ever gets them. It's it's one of those, like, black boxes. It's really difficult to, like, understand where the eBPF is happening and how to how to get around it. So that would have been will show you that one of the network interfaces has been messed with, which is which is what we were hoping for. I guess we'll send them over as part of the show notes, but if anybody wants to see them. But, unfortunately, did not get to
43:26 see them in action. So that's Oh, well. Well, I'm looking forward to reading that at least anyway. Alright. Can we start in the top right? Please say hello, introduce yourself. We'll work our way around clockwise, and then we'll hand you over a cluster. Hi, everyone. I'm Philip. I am an engineer at New Relic working on Pixie. Yeah. Oh, and my Twitter handle is at Phil because. Cool. I'm Vihang Mehta. I'm another engineer at Pixie, now a part of New Relic. And I think this will be a common theme. Philip, me, and Michelle all sort of
43:30 Introducing Team Pixielabs & Product
44:08 work on the same team at Pixie. Yep. And I am Michelle Nguyen. Just to give some context about what Pixie actually is, we are an open source CNCF sandbox project. And, basically, our thing is that we are a real time debugging platform. You deploy Pixie to your cluster. You know, you can deploy using YAMLs. You can run through our CLI. But you just deploy Pixie. And right there, out of the box, you get a bunch of baseline visibility metrics and information about your cluster. So we use eBPF to collect things like full body HTTP
44:45 requests. We get, you know, continuous profiling. So a bunch of interesting stuff without you really having to do anything but deploy Pixie. So, yeah, definitely check it out. You should check out px.dev. That's our website with all our information and docs and links out to our blog. You can join our Slack community. So I think that's slackin.px.dev. But, yeah, just check us out. We're super excited to just, like, get people using our stuff. Leave comments, feedback, and we love contributors as well. Awesome. Thank you for sharing. And if people do want to learn more about Pixie Labs,
45:19 they can go to Rawkode.live and search for Pixie Labs. I did a ninety minute stream with Natalie. I think it was around six or seven months ago when we dived into Pixie. Was an awesome product. It was a lot of fun. Encourage people to check it out. Alright. I see someone's already joined my session, so awesome. Let's get everyone else in there. I'm gonna pop my screen share up now. I've opened the session on aerospike control plane one. Feel free to join. Give me an echo. Let me know you're there. Thank you. We've got one. We got two.
45:54 It always helps when people tell me to type in. Yeah. Here we go. Perfect. Alright. You have around forty minutes. Best of luck. Take it away. Okay. Let me copy over some commands I want. Always nice to see the k9s Yeah. And then getting k9s. This might be interesting because sometimes it doesn't quite work with the xterm. Oh, okay. There we go. You know? People have tried to use it before, and it's okay in the web. So I'll keep it there. Oh. Oh. So as long as you're fine with that. Yeah. If you
46:46 could see it, it's fine. I'll just try and not use the terminal one for this bit. We're all good. Feel free to use it. Okay. Okay. I guess I'll see what we can do. Okay. So I'll just do this for now. Yes. That's our part of the thing. No control plane there. Connection was refused. Oh, okay. Got any ideas? Okay. So no control plane. We kinda need one of those. Right? Yeah. Wait. Do you wanna get etcdctl and just double check? Case. Where did we get it before? Apt install
47:33 etcd-client, and then I still got the cheat sheet here with this setup. I just use this one all the time. etcd hyphen client. Yeah. Oh, sorry. Alright. I will paste in the setup for you if you want. Oh, sure. Yeah. Looks good. There you go. Okay. Okay. So I'm not as familiar with this. Michelle, what command do you run here? Oh, get guest said that etcd is running, so we're probably okay on that. Yeah. Think it's fine now. Yeah. Okay. Okay. Okay. So control plane is missing. Do we have the
48:23 kubelet? You have a kubelet. Do you have an API server? Just kubectl. Sorry. Alright. You got Kubectl, scheduler, controller manager, Kubectl. You have no API server. No API server. Yep. You definitely need one of those too. Yeah. Okay. Control planes are overrated. Okay. The null channel is Marek from. When he messages that, you may wanna pay attention. Control plane is overrated. Oh. Oh. I think he's just joking at it. I would be able to take a the static manifest and see or maybe the logs for the API server. It would be a
49:27 good starting point. How do you find those? Is it under, like, /var/log or something like that? /var/log/containers or /var/log/pods depending on how much view you want of the cluster. kube-apiserver. Oh, okay. Alright. So connection refused. Cannot reach. Wait. No. It's not certs. Okay. I'm just trying to think. Well, what did the other team think? Well, the port looks fishy. Right? One two six nine? Yeah. I don't think that's the standard port. I have not seen the port number before. Yeah. Yeah. Good catch. Let's look at the manifest for the
50:43 for the API server. Yeah. Kevin's saying the port looks wrong to him too, so let's go check that. Wait. The manifest is at /etc/kubernetes/manifests? Yep. That one? Yeah. Oh. You can either use vim or cat it with a pager. Yeah. That's a good way to do it. I mean, it is. I thought someone's gonna have to, like, delay case c d, like, ran overuse of cat. I definitely do that all the time. It's trying to connect to 1269. What does etcd say? It is 2379, right? Yeah. They just narrowed it down, right?
51:37 Yeah. I did not know you could just go edit files like the manifest in the file system like this. It said 2379, right? Yeah. That's the one. Do we need to restart anything? The kubelet will detect that change. If you wanna speed that up a little bit, you can do a systemctl restart kubelet. Oh, systemctl. Sorry. systemctl. Oh, yeah. Okay. That's what I thought. systemctl. sysctl is gonna change something else. That's the kernel tool. There we go. Should we check it? Yeah. Oh, dot slash dot slash.
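The port mismatch found here (the API server manifest pointing at 1269 while etcd listens on 2379) can be checked mechanically. The following is a minimal Python sketch, assuming the manifest passes etcd endpoints through the `--etcd-servers` flag; the sample manifest text and the helper name `etcd_server_ports` are made up for illustration and are not the cluster's actual file.

```python
import re
from urllib.parse import urlsplit

ETCD_CLIENT_PORT = 2379  # the etcd client port confirmed in the session


def etcd_server_ports(manifest_text):
    """Return the port of every URL passed to --etcd-servers in the manifest text."""
    ports = []
    for match in re.finditer(r"--etcd-servers=(\S+)", manifest_text):
        # The flag accepts a comma-separated list of endpoint URLs.
        for url in match.group(1).split(","):
            ports.append(urlsplit(url).port)
    return ports


# Hypothetical excerpt of a kube-apiserver static manifest (assumption).
SAMPLE = """\
spec:
  containers:
  - command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:1269
"""

suspicious = [port for port in etcd_server_ports(SAMPLE) if port != ETCD_CLIENT_PORT]
print(suspicious)
```

Any port that ends up in `suspicious` is worth comparing against what etcd itself reports before editing the manifest.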
52:31 Yeah. Not there yet. Coming slow, I think. Should we check the API server logs again? Yep. I mean, yeah. /var/log, right? Yeah. I have it in the history. Let me check. /var/log. Oh, I guess the container ID might change. I would hope so. Yeah. That's the same one, right? And the 8 hasn't incremented, so I'm not confident that we have an API server. Alright. 3219. Okay. Yeah. This is four minutes ago still. So what else is wrong here? If you check the kubelet log? Yeah. Let's, before we restart. Oh, is it not in the same place?
53:56 Maybe in containers? The kubelet logs, because it's a systemd service, you'll need to use journalctl -flu kubelet. That's journalctl, dash f-l-u, and then kubelet. So that's standard. That just means it can't reach the API server yet. It could just be that our API server... Oh, I see. Oh, Goobernetes. What did you notice? Oh. A Goobernetes. Yep. The static manifest... bunch of goobers. Kidding. But you know what that message means? The static manifest location has potentially been changed. So it's not actually reading the manifests from
54:55 /etc/kubernetes/manifests. Where is that configured? A cool command, if you want me to share. Yeah. systemctl cat and a service name gives you quite a lot of cool information. Alright. So there's a bunch of files listed there that I would start with. Bootstrap kubelet and a bunch of other stuff there, right? Can you see Goobernetes? Oh, it's gonna be in one of those config files somewhere. You don't need to start... Right. There's no bootstrap here? You may wanna start in /var/lib/kubelet/config.yaml. Typing on camera is always fun, right?
56:16 Yep. It's not just three of us watching. It's, you know, a few dozen. Don't tell me. Don't tell me. Oh. Oh, that was it. That was it. There you go. I saw it. Nice. Okay. There you go. staticPodPath. We've un-goobered the cluster. I would definitely restart that kubelet again. Save yourself a few minutes. Alright. Now we can make a cup of tea, open a beer, have a sandwich, wait for the API server. Not yet. Note that we want journalctl. Pause it, maybe. I would expect to see the API server in ps coming up pretty
57:14 soon. Okay. Oh, yeah. But then I expect a lot of things. I'm wrong all the time. But, I mean, I hope you get an API server soon. Right. Nothing yet. If you go back to the logs... I think we're missing some files. The "CNI plugin not initialized", is that about... or is it a log? You got a new error message there, no? Failed... No. Okay. The lease one you can ignore. This is still just API-server-missing stuff so far. Well, do we know if the API server is on yet? I don't think it is.
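The static pod path break uncovered just before this point lends itself to a quick check: read the kubelet config and compare `staticPodPath` with the directory the manifests actually live in. This is a rough Python sketch, not the team's method. The "goobernetes" path in the sample is a guess at what was on screen, and the helper name is hypothetical.

```python
import re

# Directory the session's static manifests live in.
EXPECTED_STATIC_POD_PATH = "/etc/kubernetes/manifests"


def static_pod_path(kubelet_config_text):
    """Return the top-level staticPodPath from a kubelet config, or None if unset."""
    match = re.search(r"^staticPodPath:\s*(\S+)\s*$", kubelet_config_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip("'\"")


# Hypothetical kubelet config excerpt; the exact broken value is an assumption.
SAMPLE = """\
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/goobernetes/manifests
"""

configured = static_pod_path(SAMPLE)
if configured != EXPECTED_STATIC_POD_PATH:
    print(f"kubelet reads static pods from {configured}, not {EXPECTED_STATIC_POD_PATH}")
```

A mismatch like this explains why editing files under /etc/kubernetes/manifests had no visible effect until the path was corrected and the kubelet restarted.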
57:58 Right? No. No. It's not on. You know, what's preventing it? We may wanna check for API server logs again and hope we get a new pod and an incremented container ID. We don't really wanna see 8.log anymore. Pods, right? No. Kubernetes pods. No, pods is right. /var/log/pods. Just pods. kube-api... Oh. That's 8.log. Yeah. That's the old one. Yeah. I mean, you can cat it, but I don't think we'll see anything new from that file. Sometimes you miss a t. +1 935. No. O '2 '19 '30 '5. Yeah. I
58:50 don't think that's changed. Right. So we haven't had the kubelet attempt to start a new API server yet, it looks like. Well, we could tail it, but I don't think it's gonna tell us anything. Yep. No new logs. Yeah. That's not spat a log out, and I don't think it'll... Sixty hours. I don't think it'll log. How long has our kubelet been running, out of interest? That looks like seventeen. We restarted it three minutes ago. Yeah. So some more kubelet configs here. Does anyone know if the bootstrap kubeconfig needs to exist? Because I'm pretty sure it doesn't exist.
1:00:03 It does not need to exist. No. Alright. I think you should be looking at the journalctl command again and maybe take it slow. I'm assuming we could be missing something. I wonder if I can type this. Is it gonna... well, okay. Unable to write event. Connection refused. Yeah. That one is to the API server, right? What is usually at 6443? I thought that was the API server? Yeah. Which we don't have. So we're looking for something that would tell us why the API server was not started. Yep. That's CNI plugin not initialized.
1:01:03 Cni Plugin Not Initialized
1:01:07 Okay. I feel like the CNI plugin isn't dependent upon the API server, is it? It is not. No. And the API server isn't dependent on CNI either. Ah. Okay. Could not process manifest file kube-apiserver.yaml, couldn't parse as pod, please check config file. And I lost that. So there's an error in your static manifest for the API server. Where was that? /etc/kubernetes/manifests/kube-apiserver.yaml. I spent a lot of time in that file, strangely enough. And the error message pointed to line 130, if I recall correctly. Interesting
1:02:12 character. I guess that's it. That was said far too confidently. Oh, wait. That was the error. Yeah. Wow. I mean, that would definitely cause the YAML parsing to fail. So I mean, I'm not saying that's the only error, of course, but it's definitely a problem. Also, I'm pretty sure this is flipped. Oh, nice catch. I didn't notice that. The comments helped. Oh. Russell. Good catch, Russell. Oh, and James as well with that catch. Well done. Yeah. But here too, I believe. And team here, I respect shushing everybody in the comments. So Robert Campbell says the status looks out
1:03:13 of place. It's okay for the status to be empty in this. Oh, I'm sure that the status will get filled out later, right? So, safe. Yep. Maybe misplaced, but... Alright. Let's see if it's running. Not yet. Should we try systemctl? Yeah. You could try a restart. So the way the kubelet monitors the static manifest directory, it can take, like, up to four minutes, I think, before it will detect the change and restart it. A restart of the kubelet will do it a little bit faster. Should we journalctl again? Well, let's check this manifest file here.
1:04:11 I have a feeling there's a bunch of issues here. I think you have an API server now. Alright. Let me check. Damn it. Not yet. Is journalctl still complaining? Or are the logs still complaining about any manifest or something? Well, it still can't talk to the API server, which is because the API server doesn't exist. Connection refused. I don't see any other errors, which is good. Should we try this just to make sure? I mean, you can even grep for API server. Alright. So now... That's a kubelet restart. Yep. These are still infos.
1:05:36 Here's the first error. Could not process manifest file delete-me.yaml. That kinda looks suspicious. Alright. Let's see. Where do you see that? Sorry. I never caught that. Oh. It's all good. Let's just go take a look at it. Yeah. You'll see it. What is here? I'm pretty sure I should just delete this thing. Safe networking. Wait. Woah. Woah. Woah. Is it a trick? Oh, wait. Do we have another networking policy manifest? How about I do this? Well, I mean, I don't wanna be too confident in my statement, but I
1:06:40 don't think the kubelet would even apply that to your cluster. If it just exists? Yeah. Because the way static pods work is that it actually creates projection pods. I don't think putting a network policy in that directory will do anything. I mean, I could be wrong, and I often am. The Null Channel says it will try to run every manifest in that folder, apparently. Oh, so maybe they added that as a blocker, stopping it from deploying anything else afterwards, because it can't actually deploy a network policy from that directory. That's a cool break. Yeah. Yeah.
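Whether a stray non-Pod file really blocks the other manifests was left open in the discussion, so the sketch below only flags such files; it makes no claim about how the kubelet treats them. It is a Python illustration under assumed inputs: the file names and contents in `FILES` are invented, and the helper `non_pod_manifests` is hypothetical.

```python
import re


def non_pod_manifests(files):
    """Given {filename: yaml_text}, return the filenames whose top-level kind is not Pod."""
    flagged = []
    for name, text in sorted(files.items()):
        match = re.search(r"^kind:\s*(\S+)", text, re.MULTILINE)
        kind = match.group(1) if match else None
        if kind != "Pod":
            flagged.append(name)
    return flagged


# Hypothetical contents of a static manifest directory (assumption).
FILES = {
    "etcd.yaml": "apiVersion: v1\nkind: Pod\nmetadata:\n  name: etcd\n",
    "kube-apiserver.yaml": "apiVersion: v1\nkind: Pod\nmetadata:\n  name: kube-apiserver\n",
    "delete-me.yaml": "apiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: deny-all\n",
}

print(non_pod_manifests(FILES))
```

Moving flagged files out of the directory, as the team did by moving the file to the home directory, keeps them available for inspection while removing them from the kubelet's view.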
1:07:17 Nice. Alright. So let's go back to restarting and looking at logs. This is the way to do it. Alright. So far, unable to find data in memory cache, from CRI stats. I think that's alright. I think you're good, right? That's a healthy kubelet. An API server? Nope. Nothing yet. And did you delete that file? I moved it to home. Oh, kube-vip is readable by other folks. The other ones are not. Don't know if that... Pretty sure the kubelet runs as root. But, I mean, you can check that too. Didn't we have etcd? It won't work.
1:08:44 Or etcd worked, right? Yeah. etcd can work. Actually running this. Yeah. It should be running everything. Do we have an API server yet, or do we need to start panicking? No. Not yet. Still time to panic. See. Alright. Let's see. Cloud Fire. I can stop scrolling. Please just let me know. So, oh, no. There's connection refused. Before the connection refused, it says waiting for API server, actually, if you scroll up. Waiting for... yep. No. Creating container manager based on... Should we just look through error logs? Do we have an API server log yet?
1:09:49 Error Logs
1:09:57 That's not 8.log. Oh, good call. No. That looks like the old one. Yeah. Oh. Okay. So maybe we still need a CNI plugin thing. Or maybe I'm wrong. There's that log about the CNI not being initialized yet. Okay. Because I'm really confused. So I feel embarrassed to do this. But we have a bunch of static manifests in /etc/kubernetes/manifests. Can we verify that any of the control plane components are running via ps? Oh, it can just come up with that. Yeah. There should be a kube-scheduler,
1:11:04 a controller manager, an API server, and etcd. Alright. So etcd is up there. So etcd and kube-vip we have, but nothing else. Nothing else. So we expect a scheduler. So I'm curious if we journalctl -flu the kubelet again and just grep for API server, if there's anything. And if not, I'm not actually convinced that it's trying to start it. Yeah. That's it. Just trying to speak to the API server. So I'm not convinced it's trying to start the static manifest. Yeah. Right? We can also check hints. Yeah. We do have hints. Alright. "Trust but verify."
1:12:18 "Sometimes more is not better." These are subtle. Should we go to three? Yeah. We've got fifteen minutes left. Let's take a look at three and see if we get any ideas. Alright. Okay. I think you fixed that one. I think we fixed that one. Yeah. Alright. So we do have Marek saying that, at this point, he would expect the kubelet to try and start the API server. So, "Know your ports." Okay. I think we fixed that one too. I'm gonna make a suggestion, and this is a bit wild, but sometimes this stuff works.
1:12:59 Okay. If we go to the manifest directory, let's move stuff out of it except for etcd. Okay. So let me... yeah. Move... yeah. That'll work. Go for it. What's the worst that could happen? Alright. And then restart? Restart the kubelet. Yeah. So I'd expect to see kube-vip disappear and etcd stay as it is. And then we can add on the API server and see if it helps. kube-vip is still there and so is etcd. etcd is the only one left in that directory, right? Yes. I believe so. Yeah. Should we move the API server in then?
1:14:20 A cube head is that? It is indeed hard to write good heads. I hope that a kill made his name on cube head. Yeah. We can try that. 19788. Yeah. Oh, give it... I'm just curious if it comes back. That's all. I mean... Well, we can restart the kubelet. Marek's saying, does the kubelet shut down pods when the static manifest is removed? I thought it did. Maybe I am wrong. Oh. I think it does. I do think it does. Well, what was in delete-me? That was just a networking policy, right? So
1:15:20 I think that break was that, because it's not a pod, the kubelet actually stopped on that. So maybe I'm confused. Yeah. I guess we try to move the API server manifest back in here. Yeah. Let's give it a go. Okay. Nothing yet. Are you feeling brave enough to kill -9 etcd? Because, right, at this point, it feels that the kubelet is not starting anything, right? Yeah. Let's check the kubelet config. That's what the Null Channel is saying. I got the boring way. I think he means the /var/lib/kubelet... Oh, okay. Yep.
1:16:54 Alright. Ten minutes to go. It was config. Yep. Oh, that's... oh, wait. Zeros are okay. So it's okay to see zero seconds, I believe. Oh, I see something. But I'm gonna leave it. The healthz port? You can ignore that. I'm surprised there's no error message for this, though. Memory swap? Is that okay to be empty? Yes. But you're close. Yeah. Marek's given it up already in the chat. I was trying not to mention it, but maxPods: 1. Oh, there can only be one pod running. Ah. It's just a
1:17:52 delete. 110. That's a good one. Yeah. I expected an error message from that. Robert Campbell says, possibly no error message because report frequency is off. Well, I guess it's not an error, right? It did exactly what we told it to do. Normally, it would say it couldn't start your pod because of the max pods. Like, if you have the 110 limit and you hit it, it's very vocal that you hit it, but then maybe it's different for static pods. I'm not entirely sure. But good break. Nice one. Nice. Yeah. Hey. How's that?
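The `maxPods: 1` break is easy to miss because, as noted here, the kubelet did exactly what it was told and logged no obvious error. A small Python sketch of the check, assuming the limit sits at the top level of the kubelet config and that an unset value falls back to the kubelet default of 110; the sample config and both helper names are illustrative.

```python
import re

DEFAULT_MAX_PODS = 110  # kubelet default when maxPods is not set


def max_pods(kubelet_config_text):
    """Return the configured top-level maxPods value, or the default if absent."""
    match = re.search(r"^maxPods:\s*(\d+)\s*$", kubelet_config_text, re.MULTILINE)
    return int(match.group(1)) if match else DEFAULT_MAX_PODS


def has_room(static_pod_count, limit):
    """True if the node can hold all of its static pods under the limit."""
    return static_pod_count <= limit


# Hypothetical kubelet config excerpt with the break from the session.
SAMPLE = """\
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 1
"""

limit = max_pods(SAMPLE)
# etcd, kube-apiserver, kube-controller-manager, kube-scheduler: four static pods.
print(limit, has_room(4, limit))
```

With a limit of one, only the first static pod to start gets a slot, which would match etcd running while the API server never appeared.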
1:18:44 Nice. Oh. Oh, did it wrong. Do that. There we go. Oh, lots of stuff. Oh, boy. Oh, there's probably a deployment that's, like, making thousands of replicas. Oh, 179 requested. And I could just... I'm there straight with the delete. Like, go away. We don't care. Oh, I get NGINX here. What is busybox doing over here? Hey. Robert Campbell said this is the "less is not more". Yeah. There was a hint about that, wasn't there? Yeah. So Postgres is running. Yeah. NGINX is DDoSing our cluster. And nothing about why this is pending.
1:20:00 Could it be that NGINX doesn't... sorry? It could be the pod limit that we're still reaching. It should be terminating these, right? So, I mean, they are terminating. But is your worker online? Is the kubelet online for that worker? Oh, we don't know. Do we have a session for that yet? Oh, you can see it from k9s, check nodes. Yeah. Oh, yep. Yeah. So if we don't get a kubelet reporting status, those terminations will never complete. Alright. Open a session. Alright. There. I will remember to join the session. No kubelet. That's a problem. Another problem is six minutes left.
1:21:31 VLC? I do not remember these. -flu is the one I tend to use. -flu. Could not find the containerd socket. Yep. I would apologize to Pixie Labs, but you touched etcd. Does crictl tell me anything about containerd running? No. Yeah. I mean, you can use systemctl status containerd for a quick look. Claims to be running. It does. Is it at the right sock? Oh, kubernetes.sock. Yeah. That would be a problem. Where is the containerd config? Use systemctl cat. It is always your friend. Oh, the units.
1:23:05 Just /usr/bin/containerd. Where do you see that? Yep. Yep. ExecStart. Oh, no. That's just the exec. Oh, my bad. My bad. Yeah. That is a good directory or something. You can check in /etc for a containerd config file. That would be the default location at least. Google. Google. Alright. I love it when the cursor just lands right on the break. Yeah. Oh, I think you might have had a typo. Oops. Good catch. Nice. I can type sometimes. You know, I was told by the containerd
1:24:04 team that it's pronounced "containerd", but I still call it "container d". I don't think I've ever heard Amazon say there's a container d team meant to say it as container. I guess we can restart the kubelet too. That wouldn't hurt. Yep. You can do a systemctl restart kubelet. Oh, that was a status. I thought I definitely typed restart. Alright. Yeah. Interesting, James. Is it "nerd control" or... I think, officially, it's "nerd control". "Nerdy control". "Nerdy control". Yeah. I'm not... I'll need to... Not "nerd cuddle". "Nerd cuddle". My favorite was "Quebec-tl". So you
1:25:23 know? "Quebec". So many ways. Alright. You have a kubelet. That should be reporting statuses. Your API server, hopefully, has now terminated a whole bunch of... Yo. That looks better. Oh, that's the node. Nope. The node's not ready yet. Should we tail the logs? I would check your kubelet logs on that worker again then. Yeah. Oh, it's just not started yet? Is that it? Or did you restart it there? No. It's 55. It's running? Started kubelet as runonce. Yeah. That's not cool. Started kubelet as what, runonce? Yeah. I think they've broken more stuff in your
1:26:14 kubelet config or containerd config. But there's a bunch of interesting things in that log message. Well, you know, we can check that. Okay. No hints on the worker node. I mean, I can just throw something out there. You do not need a containerd config file. Alright. But you're also gonna wanna remove that run-once in the kubelet. That is a systemd unit, if you're not aware. So you need systemctl cat kubelet, and that will tell you where the unit file is. That's the sneaky one for people. It depends
1:27:07 on if you're comfortable with systemd. Yeah. /lib/systemd/system/kubelet.service, jump into that file. Wait. Sorry. It was /lib... The first line after the cat. Yeah. Just copy it. Oh, okay. Okay. That was interesting. This seems fine, right? This is where my systemd knowledge lets me down. I would maybe add to the service... I wonder if run-once is a default if you had to add Type=simple. Someone in the comments, help us out. What's the best way for this to get fixed? Let's see. Oh, it's in the kubelet config. Kubelet config. I saw it in the systemd config.
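Since the run-once setting could be hiding in either the kubelet config or the systemd unit, a single scan across both files saves some back-and-forth. The Python below is a sketch under assumptions: it looks for the `--runonce` command-line flag and a `runOnce: true` config field, and the file names and contents in `SOURCES` are invented stand-ins for the real files on the worker.

```python
import re

# --runonce on a command line, unless it is explicitly set to false.
FLAG = re.compile(r"--runonce(?!=false)\b")
# runOnce: true as a line in a kubelet config file.
FIELD = re.compile(r"^\s*runOnce:\s*true\s*$")


def run_once_hits(sources):
    """Given {path: text}, return (path, line_number) for every run-once setting found."""
    hits = []
    for path, text in sorted(sources.items()):
        for number, line in enumerate(text.splitlines(), start=1):
            if FLAG.search(line) or FIELD.search(line):
                hits.append((path, number))
    return hits


# Hypothetical file contents (assumption); real paths differ per install.
SOURCES = {
    "kubelet.service": "[Service]\nExecStart=/usr/bin/kubelet --runonce\n",
    "config.yaml": "kind: KubeletConfiguration\nrunOnce: true\n",
}

for path, number in run_once_hits(SOURCES):
    print(f"{path}:{number}")
```

If the unit file is one of the hits, editing it needs a `systemctl daemon-reload` before the kubelet restart picks up the change.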
1:28:14 Is that corrected? There it is. Ah, runOnce. Yeah. I wonder what the use case for that is. Someone go find that PR. Alright. Our time is up, but we're close. So let's see this. Alright. Oh, yeah. Marek said it's also in the systemd config. Yeah. Thanks for that. Yeah. You'll need to do a daemon-reload because we modified the unit file. Alright. It's just systemctl... systemctl reload? daemon-reload. Yeah. Then just restart the kubelet. daemon-reload will reload all the unit files, and then you can restart the kubelet from
1:29:28 there. Alright. Maybe it's working. I don't know. Yep. It... Working heads up. It hasn't terminated anything yet. But do you wanna edit the deployment? Oh, no. Why? Edit the deployment for Klustered. So if it works, you know, it'll just start. Alex is saying you're troopers. He handed you a dumpster fire. Kevin suggested that people are playing the long game and adding features to the... just for use on Klustered. So if that's why Rawkode was added, that would be great question. Let's just do IfNotPresent, and hope that it's been cached. Scheduler is broken.
1:30:37 You can bypass the scheduler. I don't think I know how. Edit the deployment, jump down to the template spec, and add nodeName. My favorite hack on Klustered. They do it all the time. A template... It's just that. Spec. Yep. So you're in the... yep. And then there, add nodeName, with a capital N for Name, and then just do aerospike worker one. Aerospike worker one. Yep. There's Russell in the comments: "Here comes the Rawkode scheduler." Honestly, I tell everybody about that trick. I'm like, we don't need a scheduler. Like, just never waste time fixing the scheduler on Klustered.
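The scheduler bypass described here works because a pod that already has `spec.nodeName` set is never handed to the scheduler. As a rough Python sketch of the same edit applied to a Deployment object held as a dictionary: the deployment contents and the node name `example-worker-1` are placeholders, not the names used on the cluster.

```python
import copy
import json


def pin_to_node(deployment, node_name):
    """Return a copy of the Deployment with nodeName set in its pod template spec."""
    patched = copy.deepcopy(deployment)
    patched["spec"]["template"]["spec"]["nodeName"] = node_name
    return patched


# Hypothetical Deployment, trimmed to the fields that matter here (assumption).
DEPLOYMENT = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "klustered"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "klustered", "image": "example/klustered:v2"}],
            }
        }
    },
}

pinned = pin_to_node(DEPLOYMENT, "example-worker-1")
print(json.dumps(pinned["spec"]["template"]["spec"], indent=2))
```

This is a workaround for a broken scheduler rather than a fix: the pod skips scheduling entirely, so taints and resource fit are not evaluated by the scheduler for it.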
1:31:25 Something is... Yeah. Something is controlling busybox. Is there another break? Aerospike. Come on. Does that busybox keep coming back? I was deleting it at the wrong level. Alright. Alright. Alright. Space. Alright. Do we have a Klustered v2? Woah. What's going on with the Klustered deploy? I'll set it in the day. How many replica sets do we have? Interesting. Oh. Oh, that one's gone, right? Oh, is the deployment set to only deploy one or something? Well, it doesn't wanna scale down to one. And Russell said he hasn't seen out of
1:32:38 pods before. That's when you hit the max pod limit on a kubelet, which is 110. Which is 110. But something is creating loads and loads of Klustered pods. These are controlled by a replica set, by 579. And it wants to keep coming back? No. I don't think it would create the same one. Yeah. I think the worker might have its own max pods. Yeah. Yeah. You can change your nodeName to the control plane and remove the taint if you wanna just quickly get this working. Alright. Super hacky mode there, right?
1:33:36 Control plane one. Did I get that right? I think so. Yeah. Okay. Schedule name Bobs. And then if you go to get nodes. Yeah. Yep. I guess delete that too. Yep. Can you have no spec? Can't. Alright. It's creating. Creating. Alright. Let's see if we have a... let's see if we got the dance. No. Come on. There can't be any more breaks. Do we have a service? We do have a service. If you click on the service, it does. I guess, will it resolve to the cluster? Yeah. Does the Klustered service have any endpoint?
1:34:58 No. You can just enter. k9s usually maps the service to a pod. NodePort. NodePort. This is supposed to be a NodePort, right? Yeah. It is. It is. Is there a selector on the service? Yeah. App klustered. Has our deployment not got those labels? Let's see. Is there a network policy still? We're getting a clue. Ah. Check the network policy CRD. It's just netpols? Oh, but yeah. This was an absolute dumpster fire. Not yet. Any Cilium network policies? Are we still getting... There's CNPs and CCNPs.
1:36:08 Nope. Nope. I think the pod hasn't been created yet. Oh, I thought we got it. I think it was stuck on ContainerCreating. There's no... oh. Cilium API client timeout. Alright. I don't think Aerospike will get invited back. Okay. Maybe try deleting it and restarting it. So no. Don't restart the kubelet. Restart the pod or kill the pod, and then we'll just see if it gets reset. Sure. Yeah. I mean, I think the new one's gonna also get stuck, presumably. It might have still been using the old network policy or something.
1:37:14 Yep. I'm still curious to know what's creating this on the worker. Yeah. That's a lot of pods for a replica set set to two. I guess the kubelet config is maybe still max pods on one of the nodes. I mean, we still have a session over there, don't we? And /var/lib/kubelet/config.yaml. Well, just give us a few more minutes, and we'll see if we can find whatever random thing is left here. But, oh, yeah. maxPods: 1. There you go. Strange. Oh, there was something alive there. Was it, or is that the old one still?
1:38:15 What's our newest deployment? This... Yeah. That look... Nope. It's still on ContainerCreating, right? Out of memory, out of disk. The disk is rather large on these machines. Yeah. We've got tons of memory, tons of cores. I mean, you can remove those limits and requests. I don't think I even added them. So... Yeah. I think those usually show up in the events. So you're editing the pods instead of the deployment? Yep. I can't explain that pod behavior, why there's so many. That's weird. I'm surprised that I can't seem to
1:39:12 get rid of the replica set that's creating them, which is 579, right? Wait. So k9s does this... we've seen this before, actually. Drop out of k9s, and I bet if you delete it from the CLI, you'll get another message. We have seen this before. That's the key... that's, can I propagate all the stuff? Yeah. If you try to get replica sets and then try to kubectl delete rs with the name, you'll probably get the correct error message. So we wanna get rid of
1:39:53 all of them besides the newest. Yeah. Yeah. I mean, you could just delete every replica set in the cluster if you really want, just to attach that at all. Like, the deployment controller will bring them back. That actually worked, surprisingly. Alright. And then... Did they just pull an image maybe? What node is that? Again. Restart kubelet on the control plane? Don't know why. I don't think we need to... It's on the control plane. So... Okay. Wait. What's this telling us? No. No. That's stuck on creating. Kill the pod again? I mean, we deleted the replica sets, right?
1:41:16 Yeah. It shouldn't be a problem. Postgres is also stuck. So... Oh, Postgres is stuck? Okay. Yeah. What else might be unhappy? Let's try and get pods, all. I think k9s is to our detriment here. I think it's hiding stuff. So... Absolute back off on API DXT. So we can ignore emissary. We've got one of our Cilium operators. That's okay because we don't have enough nodes for that third one. Our Cilium does appear to be okay, although it has restarted a bunch of times. We've got ContainerCreating for a whole bunch of things. Can we do -o wide
1:42:11 on this? Get pods -A -o wide. Yep. So, yeah, Postgres hasn't scheduled because of the scheduler. Oh, okay. Yep. The scheduler is broken. So... And I'm curious, if we describe that Klustered pod, if we can get any information too. Oh, okay. Yeah. Yeah. Great. There's really nothing. The pending can only be one of two things, right? It's the scheduler, which we've alleviated by hard-coding the node name. Yeah. Well, do we need to also hard-code the scheduler for Postgres? We could. But then the fact that the Klustered pod doesn't start tells me that
1:43:06 that's still pending because either the port isn't available on the node... Oh, is NGINX blocking it? Is NGINX necessary for this? No. You can get rid of that. Yeah. Nothing's on 8080. I have no idea what's wrong with this cluster. Let's cluster these. Alright. Marek, if you're still watching and you wanna jump on and tell us, just for a couple of minutes, what the hell is going on here, that would be useful. But there's so much broken in this cluster. This is literally a dumpster fire.
1:43:59 Jeff from the team has seen that /run is full. And he created RAM disks to eat up all the memory. Alright. I'm happy to call it there. I think there's too much broken here to fix in a couple of minutes. Yeah. Tricky. Wow. That was a tough one. Two really tough clusters for that last... Yep. There's /run. Was it? Oh, yeah. You see, you have to scroll up, but... Oh, yeah. I guess you can't speak to any of the sockets, potentially, on that machine. I'm not sure... Yeah. What the side effect of that
1:44:55 is, but it's not something I've seen before. Damn. I never got a party with Team Aerospike or Team Pixie Labs. Well, you know, our eBPF breakfix itself. So... Yeah. I'm not even sure if there's enough eBPF in the world that could detect that many breaks in the cluster as well. There was a lot going on in this one. But I think you managed to work through quite a lot of the things there. Some of them were particularly tricky as well, like the maxPods, the runOnce. These are just flags I had no idea existed, which
1:45:31 I think is great, because now we all know that these completely have-to-be-malicious flags exist, and we can delete them from all our links. Alright. Thank you so much for joining me, Pixie Labs. Your breaks were really fun and entertaining to watch. Fixing that was fun and entertaining, and thank you to... So I'll let you skip back to your day, but thank you again for joining in. Thanks for having us. Awesome. Thanks. I'm just gonna pop over here, and I'll do the t-shirts now before I let you all get away. Competition.
1:46:10 Competition Winners
1:46:12 Winners. Let me pop my screen up. Alright. Draw. We've still got a few of you on the call. That's alright. Go. Who wins? I don't see names. I'm sure that they're somewhere. "Your friendly cloud native gopher", I think is the name, and Hervey Nickel. Alright. To claim your t-shirt, just DM me on Twitter or email me at david@rawkode.com or ping me on the Discord. Just get in touch. Give me your address. I'll get you those t-shirts. Thank you to all of our sponsors and thank you to everyone who's watching Equinix by Rawkode and everyone else.
1:46:57 Investigating No Control Plane / API Server (Pixie)
1:46:59 I'm gonna go have a beer. Have a great day.