Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Diagnose pods or deployments that reset to zero replicas and recover running state with kubectl workflows.
  2. Trace CoreDNS and Cilium operator issues by reading deployment YAML, pod logs, and rollout descriptors.
  3. Unblock tainted worker nodes by adjusting tolerations, removing rogue process constraints, and validating services.

Teams from IBM and Nisum take on broken Kubernetes clusters live, hunting down sabotaged Cilium operator labels, a tampered CoreDNS ConfigMap, and other breaks with kubectl while pair debugging over Teleport.

Chapters


  1. 0:00 Holding screen
Transcript

Full transcript

Generated from the English captions.


0:00 Holding screen

2:01 Hello, and welcome back to the Rawkode Academy. This is Thursday, which means today is Klustered. Today, we have two teams joining us with some broken clusters and the hopefully successful attempt to fix both of them and share some wonderful Kubernetes knowledge along the way. Before we get started, we have a few thank yous to give and some housekeeping. I wanna thank Teleport. As I say every single week on Klustered, we have been using Teleport from the very beginning. It's such an amazing product, and you're gonna see us use it for the next hour and a half to fix some Kubernetes

2:39 clusters in a collaborative pairing way. Go to rawkode.live/teleport because it keeps the sponsors happy and makes it possible to keep doing Klustered. So please check them out. And plus, who doesn't want to access all those Linux servers using not SSH, but GitHub authentication, which is pretty special. So alright. Next up, we have a oh, there we go. See, I go away for one week to KubeCon and I forget how everything works. This one is Equinix Metal. They have provided the hardware for every single episode of Klustered. We don't use VMs. We use

3:14 bare metal machines with 48 cores, 96 gig of RAM, a whole bunch of 10 gig networking cards. It just makes it more fun for me and everyone else. So to say thank you to Equinix Metal and to try it for yourself, go to Rawkode.live/metal and use the code Rawkode. This will get you 200 USD in credit, which means you can get around one hundred hours of compute on their smallest instance. And by the way, their small instance is not that small. It's still a sizable machine, so go check it out. Okay. Let's see. We're gonna introduce our first

3:45 team today. Team. Hey, all. How are you doing? Fine. How are you? Well, thank you all for joining me today. Let's start with a quick round of introductions and then we'll get started on today's first cluster. Hello, Rawkode. Yeah. Hi, everyone. This is Ebrahim. I'm working at Nisum. So great to see you all. Thank you. Hi. This is Fahad, and I'm working at Nisum as a DevOps engineer, and I'm loving my job. Thank you. So this is Rasa. I'm also working at Nisum in the same team as Fahad and Ebrahim do, and we usually work with a lot

4:36 of Kubernetes clusters in the past year or so. And so we are quite excited to see what's in the IBM cluster that we are gonna be fixing. Awesome. Thank you all. I'm assuming you typically work with working clusters. Right? So this will be a little bit of fun. Alright. I'm gonna give you access to the IBM clusters. So I'm gonna pop open my screen share now. I'm gonna go to team and roles where I modify the Nisum team. And instead of your own cluster, I'm gonna give you carte blanche and you can now access everything.

5:18 On the command line, I will open a session to IBM control plane one. Once this opens, you will be able to see it under active sessions on Teleport, which is here. There we go. Please join a session. Type some sort of command so that I know you're there, and then we'll get this thing underway. I will also join from the web in case I need it. I can see that Carlos from the IBM team is also in the chat. So there we go. Alright. We got one, two, hopefully a third. But feel free to export your KubeConfig, check for

6:05 a control plane and upgrade a cluster or at least an application. Best of luck. KUBECONFIG equals /etc/kubernetes/admin.conf. Kubectl. Hopefully, this works. Kubectl. Yeah. Okay. It looks good. Seems you have no control plane. I see. This is localhost 8080. Well, this shouldn't be the case. /etc/kubernetes/admin.conf. And here, looking at the server, it says 6443. So there's something indeed wrong with kubectl. Yeah. It shouldn't have been going to localhost in the first place. Well, here it comes. Install hyphen You're not going to

7:03 take a look at it first? You're just gonna blow over the top of it? Yeah. We can surely take a look. Alias, it could be an alias too. Oh, here you go. Good catch. Okay. Let's just check with me. We know other teams do it and that this team has run. You can also run unalias if you want to remove it from your session too. And Carlos on the chat says, boo. He's not happy that you found that so quickly. So we can see the database running, and there's the deployment at zero.

8:03 So Yeah. Should I try updating the replica now? Yes. Yes. That might be that. Interesting. Interesting. New as STD in input output error. First, can you describe it first? Describe it, sir. I have deploy clustered. There's one revision. Select Replicas. Set labels here. We haven't created any replicas. So it is zero. Zero desired, zero updated, zero and zero. You're trying to edit again. The thing that interests me is, like, why are we not able to edit this here? How about we get the YAML and try to reapply it again? Would that work? Yeah. Apply clustered.

9:21 Demo. Clustered deploy. Got my ML. Yeah. That's Here you can see that the replicas are set to zero. Mhmm. And apply hyphen f. Let's not deploy. Let's deploy. It's still set to interesting. I love it when that happens. So so okay. So as it tried to even here, you can see that it scaled the replica set to one, but then there's something fishy happening around that is causing the replica set to reset to zero again. Maybe there's Yeah. Must can you check any service? Or jobs? It might be any job. Yeah. Here it is. The scale deployment set to

10:27 zero. Uh-huh. Yeah. Interesting. Still that. Yeah. The only thing is, could it be in the crontab that could cause it to fail? Like, let's just see if it gets created again. So here you can see that the process started again. So crontab -e, could it be there? No crontab for the user, using an empty one. So, David, like, we are not able to open Vim or something like an editor. I'm assuming that's a break. Yeah. So one But it So, Nasir, can you check if the service is running with some other user?

11:15 I can't exit it, like Yeah. Yeah. There we go. Control d. Okay. So if you look It's a break, and I know what it yeah. You got it. You got That is equals to VIM. So e q. Good job whoever did that. Wow. Okay. So that was a good thing to do. systemctl. It could be a systemd timer. That is we can search for something like kubectl. Any process that has been running. So here you can see that there's this cube for the separate. Yeah. Interesting. This sounds a little interesting. What's this? This is an eBPF filter

12:06 that is running, and this kube. Is this a pod? PostgreSQL, kube-resolver. There's something fishy going around with the kube-resolver as well. Fourteen hours, fourteen hours, CoreDNS is messed up as well. CoreDNS is messed up as well. Any ideas, folks? Ebrahim or Fahad? Let me see. First of all, I thought, like, some systemd service, but it doesn't look like so. Well, just to clarify. Right? You've found a process on the process table that is doing the bad thing. So how can you get its parent? Pardon me. What did you say? We got

13:14 the process back? Yeah. If you do your ps command again with kubectl. The correct p. So you've got a process here. Mhmm. Yeah. What you wanna do is find its parent. You're trying to work out how it's being scheduled. There's also something very interesting about this process. Even from what we can see right now. I'll give you a minute before. Minus. We see Hey, Nasir. Can you check the user file, the shadow file? Yeah. It belongs to 1001, and check the crontab for that user. Sorry, Fahad. Can you please share the

13:59 command again? Passwd, yeah. /etc/passwd. /etc/passwd. Yeah. There's a different user, basically. 1001. First 1001. Yeah. No user found, I don't think. Okay. Sorry. Okay. Can you check the crontab for that user, actually, the 1001? K. I think we're just not done with the user ID as well. How do we do that? Oh, can you switch to that user with the ID? su 1001? Yeah. So one I use it now. That user doesn't exist in there.
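The lookup attempted here, checking whether UID 1001 maps to a local account in /etc/passwd, can be sketched in Python. This is only an illustration: the function name `find_user_by_uid` and the sample file contents are assumptions made up for the example, not output from the cluster in the video.

```python
from typing import Optional


def find_user_by_uid(passwd_text: str, uid: int) -> Optional[str]:
    """Return the login name that owns uid, or None if no entry matches."""
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:UID:GID:GECOS:home:shell
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == str(uid):
            return fields[0]
    return None


# Invented sample content (assumption), standing in for /etc/passwd.
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash
"""

print(find_user_by_uid(SAMPLE_PASSWD, 0))
print(find_user_by_uid(SAMPLE_PASSWD, 1001))
```

A UID with no matching entry on the host, as with 1001 in this session, is one sign that the process may be running with a container's user database rather than the host's.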

15:16 What might be the issue? Address. Because, like, I could see that in systemctl status. So it could be a container. Yep. That has the user ID 1001. Mhmm. Let's see if this guy has something. kube-scheduler. No. DNS. It is a container because the user is 1001 and the user is not on the main system. It is inside a container, but we are not really sure which container it is. Yeah. If the user resolved on your ps command, you would see the username rather than the number. So it doesn't resolve to an active user, but you're looking for the

16:07 parent process so you can work out how to kill this thing persistently. So there's flags you can pass to ps, or you could use the proc file system. /proc. And the process ID is 415284. 4 1 4. Yeah. This is interesting. Session ID, net, maps, limits, cmdline. Go ahead and filter. What is this root here? Root, cwd. cwd, and there's this exe? Oh, I got into the container? Could it be that more hyphen a u I? This is the process inside this one. So two hyphen nine. Just check the name.

17:50 U x I didn't know you could get into containers like this as well. To see Are you in a container or are you in a namespace? I mean, it's a thin line. A thin line, but still, like, I mean, from what we see in crictl, I don't think you're necessarily in what we would call, like, a complete container. But Just checking, like, the crontab, and the crontab has no crontab. Crontab. Yeah. This is a Linux namespace. Okay. So you have some namespaces, but you still have access to systemd, which means you still have access to some

18:39 of the host namespaces. Take a look at the status file in the proc file system. Status Yeah. cat /proc/PID/status. cat /proc/PID. I think the process changed as I killed it from Uh-huh. The inside. And a good tip from Nolan in chat as well. Taking a look at the cmdline file will also reveal some useful stuff. But I would start with status. So cat /proc/418291 and two nine one. And what should I do afterwards? Status. Status. Okay. This is log data. sh, Umask, 0022. You see the PPid?
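The parent lookup described here, reading the PPid field from a process's status file under /proc, can be sketched in Python. The helper names `parse_proc_status` and `parent_pid` are made up for this example, and the sample text is a reconstruction: the name, umask, PID, and parent PID follow the values read out in the session, while the layout and the remaining fields are assumptions.

```python
def parse_proc_status(status_text: str) -> dict:
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parent_pid(status_text: str) -> int:
    """Return the parent process ID recorded in the PPid field."""
    return int(parse_proc_status(status_text)["PPid"])


# Reconstructed sample (assumption), not a capture from the real host.
SAMPLE_STATUS = """\
Name:\tsh
Umask:\t0022
State:\tS (sleeping)
Pid:\t418291
PPid:\t61856
Uid:\t1001\t1001\t1001\t1001
"""

print(parent_pid(SAMPLE_STATUS))
```

On a live host the same function could be fed the contents of `/proc/<pid>/status`, repeating with each PPid until the chain reaches something recognizable such as a containerd shim or PID 1.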

19:46 PPid. Yeah. Right. Yeah. So that's your parent process. We know you could start to track that. 61856. Oh, parent process. Yeah. Yeah. That's 618 618 56. 5 6. Interesting. containerd and some namespace. This is a container. It is a container. Yeah. Yeah. So crictl might not do that. There's this containerd CLI, if I could recall it properly. So the containerd CLI won't help you here. You can use ctr or crictl. I don't wanna give too much away because you're doing really well. But if you wanna hand,

20:36 I'll give you one. ctr --help. containerd CLI. Containers. ctr. ctr. My bad. ctr containers ls. Have you used the ctr command before? No. Alright. So everything running in Kubernetes lives in a namespace called k8s.io. Okay. Minus add, like, -n k8s. Well, that would show you the k8s namespace. Yeah. Do you think this is in the k8s namespace? Oh, sorry. I didn't get the command properly. So you'll have to put the -n k8s before ls, I think. Maybe even before containers.

21:39 I don't recall it. Correct. I don't remember exactly. Dash sounds interesting to do You can just use c ls instead of containers ls. It's up to you. Okay. c ls. c space ls. Sorry. Yes. Sometimes my advice is bad. Yes. So that's your k8s namespace, but there could be others. Okay. ctr. How do we get all the namespaces then? The namespaces Yeah. I mean, ns Namespaces. ns ls. ls. k8s.io and default. So we can have, like in both of these, there could be IO. Okay.

22:24 There's two VIP. All of it looks quite so, yeah, here's the container that is running, docker.io/bitnami with the kubectl, and this is the container ID. So hopefully, I think we might be able to kill it from here. Yeah. I think you've got it. Well done. -n k8s.io. Oh, the thing works like that. It's delete. It's The namespace. Cannot delete a non-stopped container. So I think you've got to stop it first. Stop. Let's see. Sorry. Let's see. Stop there. There's no stop there. I believe the delete slash rm command

23:21 does take an extra parameter. Stop. Okay. Delete. When I got the help for delete, so let's just get that: ctr c delete --help. Oh, I thought it was a dash f. Maybe there's something in the ctr c command, like, that could somehow stop a container, delete Yeah. ctr, it is ctr task ls you can do, and kill that task. ctr task ls. Namespace. k8s.io. Uh-oh. I've been in the habit of writing the namespace at the end, and this always leads to bad stuff. Uh-oh.
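The ordering problem that keeps tripping the team up, where the `-n` namespace option has to come before the `ctr` subcommand rather than at the end, can be sketched with a small Python helper that only builds and prints the command lines. The helper name `ctr_command` and the `<container-id>` placeholder are assumptions for illustration; nothing here talks to containerd.

```python
import shlex


def ctr_command(namespace: str, *args: str) -> list:
    """Build a ctr argv with the global -n option ahead of the subcommand."""
    return ["ctr", "-n", namespace, *args]


CONTAINER_ID = "<container-id>"  # placeholder, not a real ID from the session

steps = [
    ["ctr", "namespaces", "ls"],                        # list containerd namespaces
    ctr_command("k8s.io", "containers", "ls"),          # containers created via Kubernetes
    ctr_command("k8s.io", "tasks", "ls"),               # running tasks (processes)
    ctr_command("k8s.io", "tasks", "kill", CONTAINER_ID),
    ctr_command("k8s.io", "containers", "delete", CONTAINER_ID),
]

for argv in steps:
    print(shlex.join(argv))
```

As the rest of the session shows, killing the task this way is not persistent when something else keeps recreating the container.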

24:17 This doesn't feel so here's the task, Ebrahim. And how do we get the task of so here's the container ID. Yeah. So how do I stop this? Yeah. You can do basically ctr task kill, and you were missing the ID. Today I learned. ctr. Good tip. Nice. c ls. I could still see the container here, so should I now try to delete the container? Yeah. Just write. Cannot delete a non-stopped container. How do we stop a container in the ctr CLI? Can we try some task kill that task? We

25:17 tried killing the task, but it still seems that You tried to remove. You should kill the task. Like, this is how we did that. Like, ctr -n task kill. Oh, okay. And you tried. Okay. Okay. And then that didn't end up really well. Uh-huh. So I just wanna throw something out there. I mean, in my head, it feels like there's two options. Like, either Carlos has been very sneaky and run a container in the k8s.io namespace outside of Kubernetes. But this I mean, it came back after the task

26:01 kill. Maybe it's scheduled via Kubernetes and we missed it. I don't know. But it could be there. It should be a pod, I believe, in order to oh, it could be a CronJob or Job as well. Yeah. Job was, like, a CronJob that is running, like, intermittently. No. No. There's MSA, there's etcd, IBM, there's the kube-resolver that is failing every that thirty eight minutes ago. I think we could take a look at the hints. Oh, you don't need the hints. You're so close. I mean, are you familiar with the cluster-info dump command that allows you to view

26:49 all the container images? Like, that could be useful right now. Cluster dump command. kubectl cluster-info, cluster dash info space dump, I think. So that's everything you need to know about this cluster. And I suspect if you were to grep that, uh-huh, for Bitnami or kubectl, you might find something interesting. Maybe go for Bitnami. Yeah. Here's something that's Bitnami, and it says that the last applied, it is a deployment with the name of labels that has Cilium. Name is Cilium operator. No. Not this one. There's this API metadata annotations. Name is CoreDNS. So the CoreDNS

27:45 pod somehow seems to be running this. There's, like, the CoreDNS pod which is running this command instead of Let's take a look. DNS, interesting. kubectl get pods -n kube-system. And if you would add it to the edit to the where's it going in this one? Yeah. This is fourteen hours ago. So there's one running fifteen hours and there's one running fourteen hours ago. That calls for some suspicion. Mhmm. CoreDNS. Oh, pod. Not all pods. Pod yeah. I hate this. Code editor. Vim. Why is it still defaulting to EQ? Kubectl

28:55 editor. There's a way to set the kubectl editor? I mean, you could just do a get -o yaml for now until you need to modify it. But, yeah, I'm not sure what sneaky thing he's done to break that there. He has done something. That sounds good. Complex thing, I think so. Oh, but And Just get system. Yeah. Yeah. I think your grep led you down a bad path there because you were missing one flag for full context. Try adding a -B 10 to your grep. It should give you 10 lines before the match as well. So I think what

29:45 happened is on your grep. Sorry. There's a cluster-info dump grep. cluster-info and kubectl. cluster-info dump, grep. Grep. Uh-oh. kubectl, -B 10. Yeah. Done. No space, and I can't remember if it's a capital B or not. You'll need yeah. There you go. But search for Bitnami instead of kubectl. I think that'll make your life a bit easier. But, yeah, let's just do that. Because with kubectl, you're getting all the clients that apply stuff as well. Then that is running the metadata. No. We're not gonna continue. Look at the

30:41 double dash here. Again, we're being misled because we don't have enough back buffer. So increase the 10 a little bit more. You'll get there in a minute. Try it. Yeah. Try 15. Try 50. 50. We want that full JSON context there. Right? So you've got your Bitnami image here. Let's just follow up. I'm trying to see what's the name of the container status. There you go. You got it now. It's there now if you see it. The kube-resolver ID. Just go. Yeah. Rick, I am it But just jiggle the size a tiny little bit, and it should help.
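An alternative to widening the grep context until the pod name shows up is to filter the JSON structurally. The sketch below assumes a pod list in the shape that `kubectl get pods -A -o json` returns and reports every container whose image contains a given string. The function name `pods_using_image`, the pod names, and the image tags in the sample are invented for illustration and are not taken from the cluster in the video.

```python
import json


def pods_using_image(pod_list_json: str, needle: str) -> list:
    """Return (namespace, pod, container, image) for images containing needle."""
    matches = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        for container in pod.get("spec", {}).get("containers", []):
            image = container.get("image", "")
            if needle in image:
                matches.append(
                    (meta.get("namespace"), meta.get("name"),
                     container.get("name"), image)
                )
    return matches


# Invented sample data (assumption) in the shape of a Kubernetes pod list.
SAMPLE_POD_LIST = json.dumps({
    "kind": "List",
    "items": [
        {
            "metadata": {"namespace": "kube-system", "name": "coredns-example"},
            "spec": {"containers": [
                {"name": "coredns", "image": "k8s.gcr.io/coredns/coredns:v1.8.4"},
            ]},
        },
        {
            "metadata": {"namespace": "kube-system", "name": "kube-resolver-example"},
            "spec": {"containers": [
                {"name": "kubectl", "image": "docker.io/bitnami/kubectl:latest"},
            ]},
        },
    ],
})

for match in pods_using_image(SAMPLE_POD_LIST, "bitnami"):
    print(match)
```

Matching on the parsed structure keeps the pod name and namespace attached to the image, which is the context the wider grep window was trying to recover.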

31:42 Or you can type reset on the CLI, which will also help. Okay. Yeah. Right now. So what we did was we got the. What was the name of the container or the pod? I wasn't able to look at that properly. It was the kube-resolver. Yeah. kube-resolver. Just go again and just cross check it. kube-system. Like, if it's kube-resolver, then it should be a static pod. So I will this is the Kubernetes manifests, kube-resolver, there's nothing like this here. The core name, the kube-resolver one. kube-resolver and the whole thing one, I

32:31 think. So that's having something with it. But having this in the end, IBM control plane one, this calls for, like, it should be a static pod, but, like, it could be something that's, like, kept inside any of these files because, like, the kubelet's just gonna run and see for this. We could search for something like kube-resolver. It could also just be a pod. Can you just check it while deleting it? Scratch a pod? Can you remove it? I yeah. It will recreate, I think so. Let's just quickly see if it's the place.

33:27 Yeah. This is the one, so I could try deleting it. Yeah. kubectl delete pod -n kube-system. I think that was a good break from Carlos because it looks like a static pod, but it's not a static pod. It looks like a real control plane component, but it's not a real control plane component. And this came back, so it was not just the pod, which is also a fun mystery. So And this is a fun mystery. Like, we can look at if something is applying this kube-resolver. Can you do, Rasu, can you do kubectl get all

34:04 on namespace kube-system? Yeah. Okay. So the logs Should we try getting the logs of the kubelet? This is a cool comment from Russell on the chat. journalctl. YAML files can contain more than one definition, can't they? I was thinking about Yeah. You had to tag it and then didn't do the grep, but I was surprised. You were so close. I mean, assuming we're right, which we don't know. Good. So here we have Fahad, who is, like, our Linux, like, these utilities expert. So, Fahad, what would be the flag to get the

34:54 con search the content in constant sorry, content inside the file? Like, I want to search for something that could be in any of these files. Yeah. You can search for anything and so you are trying to find anything in all the files? If you want to list the files, it's a dash l. Yep. -l and then dot? Nope. And then you'll do resolver space star. Resolver. Space. Yep. Resolver. Dash l should show the space star, not dash star. Okay. Okay. Not there. There's nothing. So I am thinking that it could be the kubelet being fishy, and then there's another

35:45 flag which, like, which would get it to run static pods from some other location as well. Kubelet. And here, you should see. I don't see anything. Is this? This sounds fishy like c s n 10 up. This seems like someone messed up with this kubelet file. And, let me just read the file. If you'd open the file, this is the container. So it is the kubelet. Very cool stuff. Yeah. So where do we get systemctl? Edit kubelet? Where? Where do we have the kubelet configuration file? The system /var/lib/kubelet/config.yaml.

36:48 /var/lib/kubelet/ uh-huh. PEM, config.yaml. And there should be a flag as manifest. It's not too good. Can we But, like can we install the utility and bypass this error, I think so. Which utility? The kubectl utility. This seems to be not a kubectl issue, but something on the kubelet level. Like, whenever the kubelet is starting, there is this manifest URL, which causes it to run this additional pod. And in the pod, if you would visit this URL, you are gonna see the definition the pod definition passed to the manifest URL flag. And then if you

37:45 do open that, you are gonna see that it's a pod which is running kubectl run, like Mhmm. Scale the deployment back to zero again. Very good. So it is, like, the systemd service. This like, I would have to reset it again. Yeah. Uh-huh. So it does something with the systemd service. If you see system So your best command for that is systemctl cat and then the service name. It will show you everything you need to know. Yeah. Cat. And in that, we could search for a kubelet. Mhmm. So this is the thing, environment.
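The hunt above comes down to finding every place the kubelet can be told to load static pods from: the `staticPodPath` and `staticPodURL` fields of its configuration file, or the older `--pod-manifest-path` and `--manifest-url` command-line flags set in a systemd unit or drop-in. A rough Python sketch of that scan follows. The function name `static_pod_sources`, the sample config, the sample unit line, and the `example.invalid` URL are all assumptions for illustration; the real files on the cluster are not shown in the transcript.

```python
import re

STATIC_POD_KEYS = ("staticPodPath", "staticPodURL")        # KubeletConfiguration fields
STATIC_POD_FLAGS = ("--pod-manifest-path", "--manifest-url")  # kubelet CLI flags


def static_pod_sources(kubelet_config_text: str, unit_text: str) -> list:
    """List (setting, value) pairs that point the kubelet at static pod manifests."""
    found = []
    # Simple line scan of the config file; avoids needing a YAML library.
    for line in kubelet_config_text.splitlines():
        stripped = line.strip()
        for key in STATIC_POD_KEYS:
            if stripped.startswith(key + ":"):
                value = stripped.split(":", 1)[1].strip().strip("\"'")
                found.append((key, value))
    # Flags may appear as --flag=value or --flag value inside the unit text.
    for flag in STATIC_POD_FLAGS:
        for match in re.finditer(re.escape(flag) + r"[= ](\S+)", unit_text):
            found.append((flag, match.group(1).strip("\"'")))
    return found


# Invented samples (assumptions), standing in for the kubelet config file
# and the output of `systemctl cat kubelet`.
SAMPLE_CONFIG = """\
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
"""

SAMPLE_UNIT = """\
[Service]
Environment="KUBELET_EXTRA_ARGS=--manifest-url=http://example.invalid/kube-resolver.yaml"
"""

for setting, value in static_pod_sources(SAMPLE_CONFIG, SAMPLE_UNIT):
    print(setting, "->", value)
```

Anything beyond the expected manifests directory, such as a manifest URL injected through the unit's environment, is worth a closer look.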

38:32 Okay. Good job whoever did this. And this is really fishy. And the kubelet, it had bootstrap-kubelet.conf, kubelet config, and then I'm gonna just close this one out. Right? And it had only this. Yeah. Yeah. Will be for it. It should get back to kubelet. systemctl restart. Restart. Restart it, right. Kubelet, systemctl. I was telling the same thing to, like, yesterday when he made the same mistake, and now I'm doing the same thing here. But and, wow. It seems like, it should be reset. Yeah. systemctl. Yes. systemctl daemon-reload, system uh-oh.

39:35 systemctl restart kubelet. Yes. ps -aux. I still don't trust that thing, so I would need the kubelet. Good god. Who? Yeah. It's well, kubectl get deploy. So I think I got the deployment in the root location. clusterdeploy.yaml. So I think I should just, uh-oh, just like Yeah. Yeah. Every time. kubectl apply -f clusterdeploy.yaml. And hopefully, this wouldn't It is taking too much time. Weird. Operation cannot be fulfilled. Okay. You can deploy -l clustered. And then place it to clusterdeploy.yaml. Uh-oh. Not this one. Instead of solving, why are

40:44 we not able to run Vim? We are just, like, taking another path rather than trying to solve the whole thing That is and we could set the replica count to be one. There's something messy with CoreDNS as well. I feel that, so we are gonna probably see that later. Right after this, and we still don't You have not applied it yet. Ah, okay. This is, like, the trying to fix something and then just trying to see well, it should have trying deploy. Uh-huh. Very well done. Yeah. They should get that. And, hopefully, I would I see it.

41:26 Well, I think the pod is still there, and I would have to now kill the pod again. Also, the kubelet is restarted. There could be a copy of the pod in etcd. So let's just try getting the pod again. Update the hints. Yeah. We've got five minutes left. I don't know how many more breaks we have. I thought we were close, but then Carlos has also said, I think I did too many breaks. So Okay. Let's just check for hint view. And the alias as RC can be best to start. Okay. That there's something that we

42:08 You have fixed that. Yep. Yeah. Hint one. Can you There's hint one. I can't see the hint now because, like, I'll have to reset it. Weird. Super weird. Hint one. Those pesky typos don't let us resolve DNS. Okay. That was something that was expecting the what is the issue to how many CoreDNS pods do you really need? No. That's, like But that is something that is not related to us yet as well? Yeah. Yeah. I will check the rest of the hints and see if you can find anything related

42:55 to this respawning process. And quickly, like, do this one thing and try killing the pod again because I still feel kube-scheduler? What was it? kube-resolver. Resolver. That's what resolver, it is gone. It's not there anymore. Yeah. Oh. Yeah. That is interesting. How many okay. kubectl, kubelet, k. We fixed this one as well. A typo is always typo always miss an endpoint. I think looking at the current setup, like, the CoreDNS, these two, and I feel like there's something wrong going around with this. Okay. So has our deployment scaled back to

43:41 zero, that's what I'm curious about at the moment. We get deployed. kubectl get deploy. Yeah. It scaled back to zero. I could try, like, searching for kubectl again in case there's something running in the background. There's nothing that's causing it to behave like this. There's something wrong with the controller manager. It had a restart. Kubernetes manifests. You have controller manager. Could it be that there's a flag? Controller manager, systemctl. It looks good. It does indeed. Can you check, like, kubeconfig is going for the right path? It is That's correct. Kubernetes controller manager. Yeah.

44:45 Manager. Do we have any do. Sorry? Well, one of the hints told you to look at the CoreDNS deployment. Are those both really CoreDNS? kubectl get deploy. Perfect. CoreDNS runs as a deploy. Right? Okay. And I so is it just So one on one. Not able to see? So one on one. Like, can you go for, like, check the pods of kube-system? Oh, yeah. I can't, like, do anything because the so what should I do? Sorry. If you resize your browser, it will fix that. Okay. Let's see. Get pods.

45:32 Get pods hyphen e? No. Yeah. Can you do it? Yeah. So there's two CoreDNS there, and your deployment was set to one. Just kill both of them. Kill them both. Yeah. So delete pod, coredns, coredns, -n kube-system. I guess what you're hoping for with that is that one doesn't come back. Right? Yeah. Yeah. kube-system. And Ebrahim, you were indeed right. There's something fishy. And I think while doing this, I broke the controller manager. Have you done any changes on that? I am not really sure because what happened

46:24 was I wasn't able to see something. Why didn't you you don't have done anything wrong with it. Okay. Why is it getting in the error now? Describe pod. Okay. There was the CoreDNS pod running the same thing as well. So if you would scale it now, kubectl apply -f cluster deploy. Well, you don't have a controller manager. Yeah. That sounds Which may be an issue. Seems to be the opposite. kube-system. Not sure what caused it to go wrong. Why is it killing the pod? It says the pod sandbox changed.

47:25 The host that to the only one. Created container, started container, and then stopped container. Yes. ps -aux. There I see, like, there could be something fishy causing this to happen. The controller manager, the VIP, the scheduler. There's etcd running. There's the API server. There's the controller manager. I think we could refer to the hints one more time and see what's the those pesky typos don't let us resolve DNS. I think what happened here is when you were messing around with the proc file system and entered, you probably entered the controller manager, which was

48:18 responsible for starting the manifest URL pod. I think if you force delete or restart the kubelet, you may get it working again. But that's speculation. We restarted the kubelet. Also, I feel like deleting the pod with force, so that should do that. That should do it, like, getting this pod and -n kube-system. Uh-oh. Delete pod. Pod. Pod. Okay. kube-controller-manager. It can take up to four minutes unless you restart the kubelet. Okay. Let's just do that. systemctl restart kubelet, and it should come up. Interesting. Maybe it is, like, looking at the file and

49:28 to see if there's something that got changed. Yeah. You can add a blank line if you want. When hyphen hyphen. So should I just, like, add a Yeah. Just add a blank line to the bottom and save it. Let's see what happens. And try your get pods one more time. CrashLoopBackOff. Why is that CrashLoopBackOff? I think this is related to the DNS itself. Describe pod -n same. Namespace. -n kube-system. And what's the image that it's running? That is correct. CoreDNS, that is correct. Could it be the case that someone has,

50:29 like, overridden k8s.gcr.io to, like, point to some other registry? I don't think so. The hints never alluded to that, and I don't think Carlos would have us chasing down something that's not correct. I'm a bit concerned about the controller manager not coming back. I don't think that is intentional. Carlos, if you wanna just jump on the link though, you can walk us through the last step or two to fix this. Yeah. Because I feel like they are so close. They've worked through so many amazing problems here. So

51:08 let's see what Carlos says. Yeah. I didn't think that was intentional. Carlos, he said. I think that was the entering the namespace and doing the kill, that we probably made that controller manager broken by ourselves. That was weird. Like, I just opened it and then closed that, and then it didn't let the controller manager come up again. That's, like, weird. So Carlos would be jumping on the call. Right? Oh, yeah. He'll be here in a second. I assume. There's a link in your email, Carlos. Come on. Pay attention.

51:55 Alright. Let's I'm gonna just poke it already because I I'm super curious myself. Right? Oh, control pod. We have a controller manager running, which is good. Our core DNS is definitely broken, but I don't care about core DNS actually. I think we were I think we were just maybe very impatient there. You've got this cluster deployment and the root directory. Right? So let's reapply that, which I think will get you your pod. Try again. Now you'll have to. Oh, you're gonna have to do it again, aren't I? Okay. Yeah. D. It's not there. Yeah. Deployment.

52:33 Deploy at n o at the end the email. That's about it. Alright. And we're gonna get pods. That's surprising. There's no application set at all. I haven't I haven't edited the the KL file. Right? So Oh, you didn't update that. Yeah. Yeah. And we will reapply that one more time, plus Carlos has just joined. So what's that? Apply. Oh, come on. Yeah. Yes. You could update that. Yeah. There you go. So now r f pods. Right. Now you got a different number. But we are out of time, so I'm gonna bring Carlos in. Hey, evil person. Hey, Carlos. Hi,

53:44 Carlos. So I think I'm hearing too. Yeah. We can hear you. Can you hear me? Yeah. Yeah. Okay. Yeah. Yeah. Yeah. So, yeah, controller manager wasn't planned. So how do you fix that? We waited. And you apply the sorry? We just waited, and the controller manager had to come back after Yeah. Yeah. The forced delete and the restart. What's the error now? Image pull? Yeah. Image pull on the clustered application. And, yeah, do a describe on that. What's wrong. So, yeah, you fixed the two things that did the scale

54:24 to zero. So this one is telling you that it failed to pull the image from GCR, failed to request HEAD, lookup. Oh, yes. It can't look up GCR. That's inside DNS. Yeah. DCP. So it cannot look up gcr.io. So do you have a DNS issue? I see a ConfigMap. Yeah. CoreDNS, kube-system. So this ConfigMap for CoreDNS should look okay. But Well, we can't edit any files. How did you do that? Yeah. Sorry? We can't use kubectl edit. How did you break that? Oh, it's a global profile. export KUBE_EDITOR equals vim. It's an

55:10 environment variable called KUBE_EDITOR. Don't like that. Yeah. So Yeah. This looks correct. So the break was I messed with the DNS on the workers. Oh, okay. Oh, interesting. Wait one. That's a good one. Oh, you used the two workers. I don't know how I managed that. I must have broken my automation again. Sorry for too many breaks. It should have stopped there. So this one looks correct, but this is not the real one, right, in Ubuntu. So the real one is resolved? Yep. Exactly. I have to remember where that lives. So

56:04 that's It's /etc/systemd/resolved.conf. Oh, yeah. DNS is 0.0.0.0, the DNS and FallbackDNS, and DNS equals 0.0.0.0 is wrong. Yeah. Alright. Let's leave it there. I thought that the scale to zero was going to be, like, quick. Right? We're going to get into this one. So good job. Good job. Alright. Well, we don't have time to fix the DNS and work through that, but I think that was the last thing. I think you just did a really good job. Like, you just played through a lot of breaks

56:51 there. And Carlos is very, very sneaky. Yeah. I see that. Yeah. I did, like, a silly thing. I applied the same break twice. Right? Scale to zero with two different pods. He didn't. Alright. It was, like, fun to solve that because, like, at first when we deleted the first one, I thought the process was over, but then there was something doing the same sneaky thing again. So, yeah, that was quite interesting to look at. Alright. Well, let's see what Nisum have in store for Carlos. Well, thank you for

57:24 joining me. Feel free to watch the rest and join the comments on YouTube, and thank you again for working through that. You've done very well. Alright. Let's pop over here. Let's move Carlos to here, and let's get you access to the other cluster. While I do that, do you wanna just introduce yourself for anyone that is not familiar with you, please? Yeah. I am a principal engineer in IBM, work with OpenShift and DevOps assets for IBM middleware and also upstream. I do upstream Kubernetes with SIG Release, Kubernetes office hours, and Knative. Alright. Thank you. Well, you now have access

58:05 to every machine. I'm gonna share my screen now, and I'm gonna open an SSH session to our Nisum control plane. And I'll just pop these off. There we go. Alright. So the session is open. Please feel free to export your kubeconfig, and best of luck, Carlos. Hey. You're not pairing with me? By myself? No. No. I'm here. I'm here. Pleasure to work. So I think I'm in the right one. Control plane. Russell's saying, good luck team IBM, or is it just Carlos? Unfortunately, Carlos's teammate is unwell. So I am team Carlos today or team

58:53 boot camp as we're gonna call it. So did it work? I put echo Carlos. Did it show up? Nope. You've opened your own session instead of joining mine, I think. So I need to go to sessions. Sessions. Yep. And then the bottom one should be my session or at least the one that says Rawkode. I don't see that here. See clusters, activity, ah, activity, active sessions, and then which one? The Rawkode one. Here it is. The one that says Rawkode right here, if you can see my screen and my IP address. Yeah. Awesome. You are in.

59:37 I mean, you know what you're doing. I'll be there if you need that, but go for it, man. Okay. So I have kubectl. Okay. Finally, kubectl. So Well, we already have the klustered app running in the pod. Sorry. I have activated hacker mode on your background since you've got a green screen set there. Oh, thanks. Yeah. I couldn't figure out the thing. I'm doing the things. Okay. So the pod is in CrashLoopBackOff. Let me do something because I'm OCD on this. I use k. kubectl completion.

1:00:38 kubectl completion. completion bash, and then you'll need to Yeah. I'm lazy, so I was, like, get it from here. Why doesn't it have autocomplete on tab? What's wrong with this? You don't have auto did they break autocomplete? It normally autocompletes. I thought I had autocomplete on mine. I mean, crazy. Okay. Sorry about that. It's like it just I'm not That is a lot of restarts. Yeah. Let's see what's going on. Logs. Describe. Failed container. Pulling image. So it's back-off. Failed to create container. Failed to create shim task. OCI runtime create

1:01:44 failed. runc create failed. Unable to start container process. Error during container init. Error setting cgroup config for procHooks process. Unable to set memory limit. Current usage. Peak usage. Unknown. They messed with the containerd or runc. I think that's Is that a question, or are you talking to yourself? I wasn't sure. Oh, no. I'm just talking to myself. So I think it just could not start this because of something with runc create failed. Yeah. It says error setting cgroup config for process, unable to set the memory limit to this

1:02:46 value. Oh. But the memory limits are fine. 5m, 1m. I mean, does the machine have enough memory? I don't know. Maybe they messed with the nodes. It didn't say anything about the node. Oh, the nodes. Not even being sched did it get scheduled? K. It was scheduled. Yes. On which node? Does it say here? No. You'll need to get pods -o wide. Oh, because this is the pod. Okay. Yeah. Worker one. Yeah. I'm not sure how I managed to revert my automation to give you two worker nodes. I need to revert that. Does anyone want get two worker nodes? I

1:03:55 only get one worker node. Oh, yeah. But that rebase went wrong somewhere. K. Join us so I can see Vim. Conditions. You don't see Vim on your end? Oh, I've just joined the wrong session. That's all. Don't worry about me. It's done for me. It's running. The pressure, pods, capacity. Memory. Yeah. It's fine. The capacity. Yep. We have 48 CPUs and 65 gig of RAM. Yeah. Pretty big. Right? Yeah. Not too bad. Nice bare metal. Equinix Metal. Plug, plug, plug. I oh, that I've hinted in here. So how do we check the

1:05:03 runc configuration? /etc/systemd? Why runc? Question for you. Well, the error says something about runc cannot create because it cannot allocate the cgroup. So maybe So I would be checking the kubelet config or the containerd config if one happens to exist. The TOML from containerd. Let's check kubelet. But we'll have to jump on to worker one then for that. Right? Yeah. So how do we do that from the GUI session? Yep. I'll just open a new one there. A new session should be there entirely. There you go. I see. Yep. I see

1:05:52 something here. This one. Worker one. Right? You see that? Yeah. Yep. Yep. We're good. So, yeah, kubelet and containerd is what you wanna check. The plugin is not on this one. Right? Duration. Let me see if I see something funky here. That looks clean. Looks clean. Let's see. And the other one any other configuration? Well, there's the kube It's in the plugins. No? The plugin is not where it gets configured. No. Right? Why not try containerd config dump and see if that spits out anything? containerd what? Space config space dump, which will tell you if there's a custom

1:07:18 containerd config, of which there is not. Failed to load TOML. So that's not the one that it's using? That just means that containerd is not configured. That's the default set out of the box, so we don't need to worry about that. So let's check systemctl cat kubelet and see if there are other locations where they can modify the kubelet. So we've got over here /etc/default/kubelet, and we have the kubeadm-flags.env, the config.yaml, and the bootstrap kubelet. I'm not worried about the bootstrap kubelet. I

1:08:00 think that's mostly just authentication stuff. But the /etc/default/kubelet and the kubeadm-flags are probably worth checking. Then we can start panicking. Sorry. I missed what you said. How about checking which flag? So you see the environment file in that cat? There's the Environment? /var/lib/kubelet/kubeadm-flags.env. So we can look in that file Oh, yeah. As the password. Yeah. To see if there's anything in there. Yeah. And there's also the /etc/default/kubelet, which could be modified in some way. Environment file. So the two

1:08:45 The two environment files. The two middle ones. But the front ones. But it's using one of them. Right? EnvironmentFile and EnvironmentFile. Like, the variable is set twice. If they get appended. They get appended. Yeah. It's just in both. Okay. So let's check the first one. Yeah. Let's do that. Leave keep that. Keep anything from here. Looks good. Container runtime endpoint, the sock, and pause. And the other one is /etc/default it doesn't exist. The file doesn't exist, which I think might be okay. Right? Yep. That's fine if it doesn't exist. Okay. So they're not modifying

1:09:39 the kubelet or containerd to give us that error message, which makes this more fun. Where do I check the containerd config? Where's that TOML that it's using? Well, if the TOML doesn't exist, this is the default hard-coded one. So you don't have to worry about it. So I think let's see the error message again. Yeah. Let me see. I found the work at control plane node. And the other one. This one. Right? That's the one. Why is this tiny? My window. Because it is trying to resize our thing to be the same size. Yeah. You said it earlier.

1:10:28 It says, failed to create containerd task. Failed to create shim. OCI runtime create failed. runc create failed. So I want to see, like, where the plugin is configured. But you said that there's no TOML for containerd on this system because it's using the default one. The inner Why are you saying plugin? Am I missing something? Like, I don't see anything that says plugin. Oh, no. No. It's just me, like, defining, like, the kubelet using a plugin for running containers, runc. Yeah. I don't think this is runc

1:11:16 or containerd. Or cgroup related? I mean, I'm not gonna say it's not cgroups related. I don't know. I have never seen this one. I've never seen this error message before. Interesting. These are the memory. Right? There's no resource limit ranges don't apply here. The quota doesn't apply here. Right? There should not be any quotas for resource I don't think we're dealing with quota yet. Let's check the containerd log. I think that might be the next best bet on the worker node. And that's from system From journalctl. Yeah. But on the worker,

1:12:05 not the control plane. Do we want dash u containerd? I would do -flu, make it a bit easier to read and follow. -flu. Let's see. We got that same error message. Okay. So containerd is definitely the one that's getting this message, which tells me, because there's no containerd config, it's probably something on the host. Because that might yeah. I don't know. Let's try a systemctl cat containerd and see if there's any limits put on the process. systemctl cat containerd? Yeah. Like, limits on the

1:13:11 containerd process itself? Yeah. There's a couple of things here that I'm worried about. This OOMScoreAdjust. See that? This one? Let's remove that. Remove that? Yeah. And that's located in system Yep. You've got it. That file. systemd slash system slash containerd. The system again. So systemd slash system. Yeah. And I'll type today. Slash system. Is there a containerd there? There's no containerd here. Where did we I mean, this thing prints the location. No? The Yeah. The cat should oh, yeah. Sorry. /lib/systemd/system/containerd.service.

1:14:09 Where you saw that at the top? Oh, here it is. Yeah. Yeah. At the top. /lib/systemd/system, here. So I think what that OOMScoreAdjust is doing is, even though we've got sensible memory limits, it's actually augmenting them to make them appear worse than the system actually is. I think that's the main thing. By putting a negative number? Yeah. Yeah. I think. So I think if we just delete that and go with the infinity and everything else, we might be alright. Just deleting it, so the default should be infinity. Right?
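
A minimal sketch of the check being done by eye here: listing the resource-related directives in a unit's [Service] section. The sample unit text, the directive list, and the function name are assumptions for illustration, not what was on the worker. For reference, systemd documents OOMScoreAdjust as ranging from -1000 to 1000, where lower values make the process less likely to be picked by the OOM killer.

```python
import configparser

# Illustrative list of directives that constrain or protect a service process.
SUSPECT_DIRECTIVES = (
    "OOMScoreAdjust", "MemoryMax", "MemoryHigh",
    "LimitNOFILE", "LimitNPROC", "TasksMax",
)

# Hypothetical unit text; the real file on the worker is not shown in the transcript.
SAMPLE_UNIT = """\
[Unit]
Description=containerd container runtime

[Service]
ExecStart=/usr/bin/containerd
LimitNOFILE=infinity
OOMScoreAdjust=-999
"""


def resource_directives(unit_text):
    """Return the suspect directives found in the [Service] section."""
    # strict=False tolerates repeated keys (unit files may repeat ExecStart=),
    # interpolation=None keeps systemd's % specifiers from being interpreted.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep directive names case-sensitive
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return {}
    return {
        name: value
        for name, value in parser["Service"].items()
        if name in SUSPECT_DIRECTIVES
    }


print(resource_directives(SAMPLE_UNIT))
```

On a real host the input could come from the output of systemctl cat, with the caveat that drop-in files are concatenated there and later values override earlier ones.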

1:14:47 Yeah. Just comment it out. I think that's gonna be okay. And then we can do a systemctl daemon-reload and restart. Daemon. Oh, it's not it's then reload daemon-reload. You have to put reload daemon. I forgot. No. No. It should be daemon-reload. I just don't know how to spell daemon. Yeah. I think you had it right. Does it not work? No. It says daemon-reload. No. No. Cannot spell. Never mind. systemctl restart containerd. Said is in the chat telling us about the OOM score adjustment with the minus nine nine nine. I think he said it's actually

1:15:48 less likely to kill processes, but that could have been a complete red herring. I'm not sure. Okay. So So if we run that Yeah. Let's do get pods and see. Yarn all again. See the logs? We don't see the error in the logs. I'll restart the pods. Let me see. We're here. Oh, okay. That get that you just did? It's pending. It's pending. It's pending. Pending to Still pending. Yeah. It definitely feels too long, doesn't it? Yeah. The describe. Alright. Now we've got a scheduling error, so we can work with that. Right?

1:16:54 So it wasn't Although, the limit of five m. The five m is never gonna run. Sorry? We've got a limit of one meg of memory, which might be okay for this Rust process because Rust is awesome. But I don't think it'll run on five m, so we might wanna remove those limits. Okay. But let us schedule it first. Right? I think we can fix the scheduling thing first. Uh-huh. SchedulingDisabled. So it's tainted. Did not tolerate Tainted or cordoned. Taint. Yeah. We can just edit these two guys. Right?
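
A short sketch of why a limit read aloud as "five m" is suspicious, assuming it was written in the manifest with a lowercase m (the transcript does not show the file). In Kubernetes resource quantities a lowercase m means milli, an uppercase M means 10^6, and Mi means 2^20, so a memory limit of 5m asks for a fraction of a byte. The helper name is hypothetical, and only a subset of suffixes is handled.

```python
from fractions import Fraction

# Subset of Kubernetes quantity suffixes, enough to show the m / M / Mi difference.
_SUFFIXES = {
    "m": Fraction(1, 1000),
    "k": Fraction(10**3),
    "M": Fraction(10**6),
    "G": Fraction(10**9),
    "Ki": Fraction(2**10),
    "Mi": Fraction(2**20),
    "Gi": Fraction(2**30),
}


def memory_quantity_to_bytes(quantity):
    """Convert a quantity string such as '5m', '5M' or '5Mi' to bytes."""
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return Fraction(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(quantity)  # plain number of bytes


for q in ("5m", "5M", "5Mi"):
    print(q, "=", float(memory_quantity_to_bytes(q)), "bytes")
```

Exponent notation such as 5e6 is deliberately left out of this sketch.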

1:17:34 You could just So the other okay. So I thought it got scheduled before we did the containerd. But, anyway, we would just edit these two worker nodes. Right? Remove the taints. Mhmm. NoSchedule. Yep. You can delete all those things. And that. Schedule go through? Yeah. Delete that. This one. Right? Yep. Delete. Then I will not have a spec. That's alright. That's okay for nodes. Just empty object. Oh, you could just remove that line too. To make it a gist? I won't care. K. I'm OCD on YAML. Sorry.
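
The scheduling failure above comes down to a node taint that no toleration on the pod matches. Below is a simplified sketch of that matching rule, not the scheduler's actual code; the taint key example.com/sabotaged is made up for illustration, since the real keys are not readable from the transcript.

```python
def tolerates(toleration, taint):
    """Simplified taint/toleration match: effect first, then key and value."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists with an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def blocking_taints(node_taints, pod_tolerations):
    """Return the hard taints that keep the pod off this node."""
    return [
        taint
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]


# Hypothetical taint, standing in for whatever was set on the workers.
taint = {"key": "example.com/sabotaged", "value": "true", "effect": "NoSchedule"}
print(blocking_taints([taint], []))                        # the taint blocks the pod
print(blocking_taints([taint], [{"operator": "Exists"}]))  # [] once tolerated
```

Removing the taint from the node, as done here, and adding a toleration to the pod are two routes to the same empty list.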

1:18:36 Okay. So that should've moved better. Much better. But Okay. So now we need to remove that CPU limit. I guarantee you that runc container error is the CPU limit. I think so. Let's see. Yeah. For sure. 0/3 nodes. Pulled image. Created container. Failed to create container task. Oh, that was an error. That was the old one? No. Wait. Let's remove the limits. We don't need them. Our question is just to update this container. So There's no limits here. Where's the container? Here it is. I can't see because there's probably a bug, but

1:19:23 My bad. Are you editing or am I editing? You're editing. What? You jump around. Okay. So that's it. Let's see. Add new pod. There we go. Yep. We should be okay. We can check. Yep. We can test. Let me pull up the Oh, why? Yeah. It should be any IP on 30000. Right? Connection refused. On 30000? Uh-huh. Service. NodePort. 30000. Doesn't work. Are we running the right image? Yeah. Compute share issues. I was checking that. I'm back. And there's no logs. Right? No. Oh, sorry. I'll stop typing. Yeah. There's no logs. My application does not log.

1:20:46 But they'll fix that. No. We don't have endpoints. Oh, there we go. Yeah. Yeah. So what's that? Get pod. We'll check the service selector. So just save our service. Yeah. Let me get Oh, just show labels. First. And then oh, it's CrashLoopBackOff. I bet. Oh, yeah. It's sick again. Describe it? Yeah. Fine. There we go. Where's an error here? Terminated. Error, exit 137. But the image looks okay, klustered v1. Nice error. Let me see. I wish I had logs. It's just plain error now, and there's nothing in describe.
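
A small sketch of how to read that exit code. By shell convention, a status above 128 means the process was ended by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. The function name and the short signal table are illustrative only.

```python
# A few common signal numbers on Linux; not an exhaustive table.
SIGNAL_NAMES = {1: "SIGHUP", 2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM"}


def explain_exit_code(code):
    """Describe a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "unknown")
        return f"killed by signal {signum} ({name})"
    return f"exited on its own with status {code}"


print(explain_exit_code(137))  # killed by signal 9 (SIGKILL)
```

A SIGKILL can come from the kernel's OOM killer, in which case Kubernetes usually reports the reason as OOMKilled, or from anything else on the host that kills the process or stops the container.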

1:22:18 Yeah. Error and CrashLoopBackOff. I wonder if there's a rogue process killing it. Killing the container? Well, killing the process. It could either be on the host or it could be in the container. There's no logs. But the fact that we go straight to exit 137 makes me think, like, there's no Kubernetes events to say why the process would be killed. It's either the wrong image or something that's explicitly killing the process. Is that the correct image? I mean The GCR. Is that right? At least it's pretending it's the real image. We we

1:23:04 don't know. It could be. What's the pull policy? What is the what? The image pull policy. On this pod? Let me see. Edit pod. Always. Policy here. Yeah. It looks okay. So Is it Always? It is Always. Yeah. Which is fine. So at least we're not getting some stale cached image with the fake thing on the host. Yeah. Let's run a Yeah. ps aux. I'm curious if we've got something on this machine that shouldn't be there. ps aux? Yeah. Something that is running, maybe killing the container. 16. Well, let's check the I mean, you want

1:24:23 to check the pods? Let me check the pods. Everything's running. Yeah. It should be on the here. Cilium operator, two days ago. Because something is running xargs and crictl stop on the workers. Oh, there's a bash while true. Oh, I'm looking at the control plane. So I should be looking here. Right? Yeah. Sorry. Yeah. We've got a rogue process, which is in the state of a container. Oh, no. That's just It's running crictl. Is it the crictl bin? Yeah. It's running crictl. And that was the parent to that.
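
Tracing "the parent to that" by hand can be sketched as a walk up the parent chain from ps output. The sample below assumes output in the shape produced by ps -eo pid,ppid,args; the PIDs and command lines are invented for illustration and are not the ones on screen.

```python
# Hypothetical `ps -eo pid,ppid,args` output; PIDs and commands are made up.
SAMPLE_PS = """\
  PID  PPID COMMAND
    1     0 /sbin/init
  900     1 bash /opt/example/loop.sh
  901   900 crictl stop example-container
"""


def parse_ps(output):
    """Parse pid/ppid/args columns into pid->ppid and pid->command maps."""
    parent_of, command_of = {}, {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, command = line.split(None, 2)
        parent_of[int(pid)] = int(ppid)
        command_of[int(pid)] = command
    return parent_of, command_of


def ancestry(pid, parent_of):
    """Return [pid, parent, grandparent, ...], stopping before PID 0 or a loop."""
    chain = [pid]
    while chain[-1] in parent_of and parent_of[chain[-1]] not in (0, *chain):
        chain.append(parent_of[chain[-1]])
    return chain


parent_of, command_of = parse_ps(SAMPLE_PS)
for pid in ancestry(901, parent_of):
    print(pid, command_of[pid])
```

A chain that ends directly at PID 1 suggests something started by the init system, which is where the search goes next.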

1:25:10 The parent to that is bash and the parent to that is a container. 552816. Okay. So we've got a I can assess the emissary. Maybe I'm reading this wrong. I'm gonna let cat rock data Head one. Okay. Let's see if there's a systemd unit file causing chaos. Well, let's find it. systemd. Yeah. A process running a shell script, I guess. Right? Cron? Could be, but I'm not convinced yet. You think so? No. Serum service. Sometimes reading this dataset, I just find it too difficult. Can we grep all the systemd definitions for crictl? There's a damage. Service.
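
The "grep all the systemd definitions" step can be sketched as a search over unit files for a command name. The directory list and function name are assumptions; on a given distribution the unit directories may differ. Every line is searched rather than only ExecStart, because other directives such as ExecStop can run commands too.

```python
from pathlib import Path

# Common unit locations; treat this list as an assumption for your distribution.
UNIT_DIRS = ("/etc/systemd/system", "/lib/systemd/system")


def units_mentioning(needle, unit_dirs=UNIT_DIRS):
    """Return (path, line number, line) for each unit file line containing needle."""
    hits = []
    for directory in unit_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue
        for unit in sorted(root.rglob("*.service")):
            try:
                text = unit.read_text(errors="replace")
            except OSError:
                continue  # dangling symlink or unreadable file
            for number, line in enumerate(text.splitlines(), 1):
                if needle in line:
                    hits.append((str(unit), number, line.strip()))
    return hits


for path, number, line in units_mentioning("crictl"):
    print(f"{path}:{number}: {line}")
```

Sorting the matching files by modification time is another quick way to spot a recently planted unit.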

1:26:52 Because of the timestamp? Yeah. There we go. Let's cordon worker two. Worker two. Have to fix it. Yeah. And then do you wanna delete our pod, get it rescheduled, or mark our button, and see what happens? Yeah. To force it there. Let me see. Let me jump to pods. Get pods. Delete it. Yeah. I'm just gonna force kill this process. It's running now. We have an endpoint on 666. That's the port number for your Uh-huh. App? Okay. I don't remember. It says connection refused. Yeah. It's not letting me call the service, so we're

1:28:00 gonna have to do it the hard way. I was gonna p code bash, but then realized that would be a fucking stupid idea. There you go. Alright. Have you So that It should be okay now. So why did it work before the stop of the systemd unit? Because if you take a look at the damage file, we have an ExecStop, which also runs a command, which James c p. pointed out. Nice. Okay. Very sneaky. Nice. Alright. Let's see if we have our Give it a try again. No. It's not taking a while for me. Yeah. It's not

1:29:04 working. Let's check network policies. And iptables, I guess, as the last resort. Oops. Sorry. Okay. I'm so invested now. I wanna know what's going on. That looks bad. Yeah. Yeah. Okay. So now it's CNPs and CCNPs? I deleted it. Just checking. There we go. That's the Cilium network policies you were checking? I was checking the Cilium network policies. Do you wanna upgrade this to v2? Well, it's not loading. Oh, it loaded. It loaded. It didn't load for me. Okay. Now it's loaded. Okay. So let's edit. Edit. Deploy. V two. I can't see that, but I'm

1:30:07 not gonna judge. You can update record version. Hopefully, we don't have a mutation webhook like I was trying that so we bought Dance. Cool. I think that's it. Do you want to see the hints? Yeah. Let's see if we messed up. Let's see. Hint one. Simon says a worker can sometimes be sleeping. Okay? Yeah. That was the cordon. So request as much as you can. That was the memory stuff. Three is there's something going on in the mind of the worker. Something is super fishy. What's that one? That was the damage systemd unit file. Okay.

1:31:08 And then four is communication policy. Okay. Smashed it. Well done. Cool. Thank you, David, for your help. Well, Carlos, you're an evil, evil man. But oh, always a pleasure, and well done, team. Just did amazingly well. There was a lot of breaks there for you to work through on that. So cheeky Carlos for and extra. Starboard shredding. But good work, Carlos, working on that cluster as well. So thank you for joining me. It's always a pleasure. Thank you to everyone watching. Thank you to Teleport and Equinix Metal for their continued support, and we'll be back next week. Any last

1:31:53 words, Carlos? No. Always follow us both on Twitter. Right? Definitely. I will retweet the Twitter handles after this. Alright. Have a great day. Thank you, everyone, and we'll see you all again next week for a special episode. It's not Rawkode versus the community. It is the community versus Rawkode. I have two broken clusters and $400 of Amazon vouchers to give away to anyone that can fix them. So join us next week live, and we'll see you then. Have a good day. See you then. Thank you for watching Rawkode Live.
