Overview

About this video

What You'll Learn

  1. Use kubectl and kube-system checks to confirm API server readiness and identify pods failing to start.
  2. Detect and fix application issues by spotting port mismatches between container specs, readiness probes, and Service endpoints.
  3. Resolve startup and connectivity failures by adjusting Kubernetes resource limits, image pull policy, and selector labels.

Newcomers edition of Klustered. Jeremy, Jason, and Tom from Equinix Metal debug a broken Kubernetes cluster with kubectl, fixing a port mismatch, a 1Mi memory limit, a scheduler startup delay, a scaled-down Postgres StatefulSet, and a service selector typo.

Chapters


  1. 0:00 Viewer Comments
  2. 0:50 Introductions
  3. 1:01 Introduction and Show Premise
  4. 1:52 Guest Introductions
  5. 4:04 Connecting to the Cluster
  6. 5:00 kubeadm KUBECONFIG
  7. 5:04 Initial Troubleshooting Strategy
  8. 6:19 Checking Kubernetes API Server Connection
  9. 8:00 kubectl get pods
  10. 8:12 Listing Pods in All Namespaces
  11. 8:30 kubectl describe pods
  12. 8:38 Investigating the Failing Application Pod
  13. 10:28 Analyzing Pod Description: Sandbox Error & Port Mismatch
  14. 11:06 Troubleshooting Sandbox Creation Issues
  15. 13:31 Finding Pod Node Assignment
  16. 14:49 Connecting to the Problematic Node
  17. 16:50 Containerd logs
  18. 18:20 Kubelet logs
  19. 19:40 Describing the Application Deployment
  20. 19:45 kubectl describe deployment
  21. 22:16 Confirming Correct Application Port
  22. 23:20 kubectl edit deployment
  23. 23:21 Editing Deployment: Fixing Port Mismatch (8081 to 8080)
  24. 25:16 Checking Pod Status After Port Fix
  25. 26:18 Analyzing New Pod Describe: Resource Limits Issue (1Mi memory)
  26. 28:00 Pod Requests & Limits
  27. 32:37 Editing Deployment: Increasing Memory Limit
  28. 33:05 Application Pod Running/Ready (Problem 1 Solved)
  29. 33:46 Investigating Kube-System Pods: Scheduler Issue
  30. 35:00 Liveness & Readiness Probes
  31. 35:33 Describing the Scheduler Pod: Identifying Startup Delay
  32. 38:01 Fixing Static Pod Manifests (Scheduler Delay)
  33. 39:00 Static Pod Manifests
  34. 41:30 Checking Scheduler Status After Fix
  35. 43:00 Debugging Kubernetes Services
  36. 43:06 All Pods Running, Application Still Unreachable
  37. 45:11 Troubleshooting Application: Database Connection
  38. 45:51 Checking for Database (StatefulSet)
  39. 46:57 Identifying StatefulSet Replicas = 0
  40. 47:00 kubectl scale
  41. 47:51 Scaling the Database StatefulSet
  42. 49:44 Database Pod Running, Application Still Failing
  43. 51:17 Troubleshooting Application Pod Again (Checking YAML)
  44. 52:01 Analyzing Application Pod YAML: CPU Limit Issue ("1")
  45. 53:06 Editing Deployment: Fixing CPU Limit and Image Pull Policy
  46. 54:30 ImagePullPolicies
  47. 55:32 Checking Pod Status: CPU Limit Resolved (Pod Pending)
  48. 57:17 Editing Deployment: Fixing Image Pull Policy ("Never" to "Always")
  49. 58:13 Editing Deployment: Fixing CPU Limit Again ("1" to "1000m")
  50. 59:17 Pod Running/Ready, Application Still Timing Out: Checking Services
  51. 1:00:00 Service Endpoints
  52. 1:01:17 Describing the Postgres Service
  53. 1:02:32 Analyzing Service: No Endpoints
  54. 1:05:25 Identifying Label Mismatch: Service Selector vs. Pod Label
  55. 1:07:17 Editing Service Selector to Match Pod Labels
  56. 1:09:49 Application Access Confirmed!
  57. 1:10:10 Conclusion and Lessons Learned
  58. 1:15:22 Checking Containerd and Kubelet Status/Logs
Transcript

Full transcript

Generated from the English captions.


1:01 Introduction and Show Premise

1:01 Hello, and welcome to today's episode of Rawkode Live. I'm your host, Rawkode. Today is Klustered, but slightly different. Today is Klustered newcomers edition. I have broken a Kubernetes cluster in a few ways to, hopefully, provide a journey for people that aren't as long in the tooth in the Kubernetes landscape. They wanna understand some of the more primitive aspects, the basics, and a whole bunch of other stuff. It's really hard to break a cluster when the people have no idea. So we're gonna see how it goes. Now before we get started, I wanna encourage you all to

1:38 please subscribe to the YouTube channel, click the bell. This will get you notifications for all future episodes of Rawkode Live and Klustered. And, of course, if you wanna talk about cloud native, Kubernetes, and anything in between, please join us in the Discord server. Alright. Now joining me today, I have Jeremy, Jason, and Tom, three wonderful colleagues from Equinix. Hello, all. How are you doing? Hey. Go. Hi, Charles. Great. Today? Excellent. Awesome. Why don't we just do a quick round of introductions, get to know everybody, and then we'll get started for today. I pass the baton to you, Jeremy.

1:52 Guest Introductions

2:23 Excellent. I am Penguin, Jeremy to my mother-in-law. I am David's teammate at Equinix Metal, and I'm looking forward to this. He said, pretend you know nothing about Kubernetes. And so this is not method acting. This should be a real easy one. Oh, I suppose a quick question. Should I also be in the Discord, or is that where you'll be taunting us off screen? You do not have to be on the Discord. No. It's all good. Everything will come through the YouTube comments, and my laughter comes in all places. Very good.

2:55 Perfect. Yes. And I'm Jason. Also, of these fine gentlemen at Equinix Metal. I've been active in the operational life cycle management space in Kubernetes since 2015, and I'm looking forward to helping walk Tom and Jeremy through fixing this cluster. And I am Tom Crow. I also work here on the developer relations advocacy team. And I wear as a badge of honor or shame, depending on where you are, that I have never been able to keep a Kubernetes cluster up and running. So I have a lot of experience working with broken clusters, very little experience fixing them. So I'm looking

3:45 forward to what we can do today. Is there such a thing as a working Kubernetes cluster? I'm not entirely sure. That's my white whale. Alright. Well, I wanna say thank you all for joining me today. I know this was a little bit short notice, but I really appreciate this and I think this will be a whole lot of fun. So we're gonna start by popping over my screen share. We have Teleport, as we always use on Klustered. We have just the one cluster that is broken. My involvement today will merely be to click the

4:04 Connecting to the Cluster

4:16 connect button, which I think I do rather finely. I will zoom in on the font, and if you could all do me a favor, join this active session, give me an echo hello, not all at once. Well, I guess you can go for it if you want. I see one connection so far. Let's get two more. There we go. Thank you. Four connections. I've never noticed this on Teleport before, but it has this lovely little number at the top left, and then you can hover it to see the name. Cool. Yeah. Wow. There we go. Right.

4:49 We have hopefully a still broken cluster. Fifty fifty at this point, I think. But I am not gonna say any more just now. I'll let Jason share his thoughts and insights with you both. And Jeremy and Tom, best of luck. Yeah. So from the spot that we're in right now, I think we have one of two options. One of them is we can try to access the application deployed on the cluster and see where things are from that state. Or since we're already at the command line, we can do a little inspection with kubectl

5:04 Initial Troubleshooting Strategy

5:22 and see what the state of the control plane is and try to see if we at least get a response back that way. Yeah. How would we check on if the I was voting option one, but alright. Yeah. So our what is our desired state? What what should we be seeing if the application was working? So, ideally, I believe that we're looking at a WordPress application, normally deployed on these clusters. Although, generally, I don't think we actually expose it out to the Internet. So you you would actually have to proxy things through. So we probably

6:05 need to start with the control plane anyway and see where things are there, since kubectl proxy is probably not gonna work for us if our API server is not up already. Okay. So probably the easiest place to start is we can do something like kubectl version, which should, if the API server is up, actually tell us what version of the server we're connected to. And, actually, in this case, we are going to have to set the kubeconfig either through the environment variable or the command line flag to connect to it. So in this case, if doing the environment

6:19 Checking Kubernetes API Server Connection

6:47 variable, it should be KUBECONFIG, all capital letters, no spaces. And then we can set it to /etc/kubernetes/admin.conf, which I think is the default admin config. Yep. That's right. I think that's true for all kubeadm clusters. Right? Yes. Oh, you didn't actually export that environment variable. I was gonna say, I'd be a little surprised if McKay started to wrap up. Already? For times like this where we have to go back and redo things, David, it'd be nice if we had some circus music. I'll work it back in?

7:48 Nope. Yeah. I think it's supposed to be just conf, not config. Oh, admin.conf. Yeah. Ah, there we go. Okay. So we did get a response back from the server, and we can see that the server is running version 1.20.4. It matches our client version, which is what I would expect on a kubeadm-based cluster. Now, basically, we can start with just looking to see what the state of pods is in the cluster. So if you did, like, a get pods -A, or kubectl get pods -A.
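The "you didn't actually export that environment variable" moment above comes down to how child processes inherit their environment. The following is a minimal Python sketch of that idea, not the commands typed on stream: a variable only reaches a child process (such as kubectl) if it is part of the environment handed to it. The helper name child_sees_kubeconfig is hypothetical, and a small Python child stands in for kubectl so the sketch runs without a cluster.

```python
import os
import subprocess
import sys

# Path named in the transcript as the default admin kubeconfig.
ADMIN_KUBECONFIG = "/etc/kubernetes/admin.conf"


def child_sees_kubeconfig(env):
    """Start a child process with `env` and return the KUBECONFIG value it inherits."""
    code = "import os; print(os.environ.get('KUBECONFIG', '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Environment without KUBECONFIG: what a child sees when the variable was never exported.
base_env = {k: v for k, v in os.environ.items() if k != "KUBECONFIG"}
# Environment with KUBECONFIG: the equivalent of exporting it before running kubectl.
exported_env = dict(base_env, KUBECONFIG=ADMIN_KUBECONFIG)

without_export = child_sees_kubeconfig(base_env)
with_export = child_sees_kubeconfig(exported_env)
print(without_export)
print(with_export)
```

In a real session the same effect comes from exporting KUBECONFIG in the shell (or passing kubectl's --kubeconfig flag) before running kubectl version.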

8:12 Listing Pods in All Namespaces

8:28 That should tell us, you know, what the current status is of the pods across all namespaces. And it looks like most everything is running except for that one pod in the default namespace. So what we can do here is we can actually do a kubectl describe on that pod, and that will give us additional information, whether it's events that have been generated for that pod or the status of the pod. It'll give us a nice summary of where things sit there. You'll have to do a -n to set the namespace there. I think Tom was writing on top of

8:38 Investigating the Failing Application Pod

me. Sorry. Jeremy. Let's do this Passion. Pair programming styles. So Jeremy, you take just now and I'll yell switch, and then we'll have Tom take over for a bit. There we go. Alright. And you now actually have to tell it the resource that we want. So you wanna tell it a pod, and it doesn't matter what order you put the flags in with everything else. This is actually gonna give you all of the pods, but there's only that one pod in that namespace. So it works effectively the same. Okay. We got wait. Alright.

So we can see here that it could not actually create the pod sandbox for this. And we're seeing a connection reset by peer. So that's a fun one. The other thing we can see, we can see some various other information here. If we scroll up a little bit, we can see that there are no node selectors involved that we're dealing with here. Is it one port off? Should 8080 and 8081 be the same or different? Where are you seeing that? Where yeah. Where's 8081, Jeremy? Oh, up there. It says port 8081/TCP,

11:06 Troubleshooting Sandbox Creation Issues

11:24 but the liveness and readiness below are looking at 8080. 8080. Yep. So I think that you have something there. Now we just gotta figure out which one of the two is incorrect. So let's talk about that. Right? Are you gonna say something, Tom? Well, I'm seeing up here as I went up higher, it's saying the connection to the server 8080 was refused. Did you specify the right host or port? Okay. But just thinking out loud. Okay. So what were you gonna say, David? Yeah. I think it was a keen eye, spotting the 8081 and the 8080.
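The mismatch Jeremy spotted can also be checked mechanically. Below is a rough Python sketch, assuming a container spec shaped like the Kubernetes Pod API (ports[].containerPort, and probes with httpGet.port or tcpSocket.port). The sample container is a hypothetical reconstruction of what was on screen, and probe_port_mismatches is an illustrative helper, not part of any tool. Note that containerPort is largely informational in Kubernetes, so a mismatch like this is a strong hint rather than proof of which side is wrong.

```python
def probe_port_mismatches(container):
    """Return (probe_name, port) pairs whose numeric port is not a declared containerPort."""
    declared = {p["containerPort"] for p in container.get("ports", [])}
    mismatches = []
    for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
        probe = container.get(probe_name)
        if not probe:
            continue
        for handler in ("httpGet", "tcpSocket"):
            port = probe.get(handler, {}).get("port")
            # Named ports (strings) are skipped; only numeric ports are compared.
            if isinstance(port, int) and port not in declared:
                mismatches.append((probe_name, port))
    return mismatches


# Hypothetical container spec: declares 8081 while both probes target 8080.
container = {
    "name": "app",
    "ports": [{"containerPort": 8081, "protocol": "TCP"}],
    "livenessProbe": {"httpGet": {"path": "/", "port": 8080}},
    "readinessProbe": {"httpGet": {"path": "/", "port": 8080}},
}

mismatches = probe_port_mismatches(container)
print(mismatches)
```

The check only reports that the two sides disagree; deciding which value is right still needs someone who knows the application, which is how the question gets settled later in the episode.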

12:25 But, Jason, would that cause a container not to start? I wouldn't necessarily expect that to cause the container not to start. I would expect that to cause more issues after the container's already started, because the liveness and readiness probes don't kick in until after the pod has been created. Alright. So what are our likely candidates? So if we have an unable-to-create-sandbox error or a container not starting, what are the go-to things we should be checking? Well, probably the first thing, we'd probably wanna see what node this is assigned to so that we can go ahead and inspect things on

13:14 that node. Obviously, you might wanna check to see if containerd is actually running. And if it is, are there any errors in there or in the kubelet logs trying to connect to containerd on that server? Okay. So Jeremy, Tom, do you know how to check which node the pod was scheduled on? Nope. Where's that? So one of the places that I would probably look is if you actually do kubectl describe on the nodes, it'll show you what pods are scheduled on all of those nodes. You know what? I didn't know that.
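As a complement to reading kubectl describe nodes, the pod-to-node mapping can be derived from a pod list. This is a small Python sketch, assuming input shaped like a Kubernetes PodList in JSON form (items[].metadata and items[].spec.nodeName). The pod and node names in the sample are hypothetical, and pods_by_node is an illustrative helper.

```python
from collections import defaultdict


def pods_by_node(pod_list):
    """Group 'namespace/name' pod identifiers by the node recorded in spec.nodeName."""
    grouped = defaultdict(list)
    for pod in pod_list.get("items", []):
        # Pods the scheduler has not placed yet have no nodeName.
        node = pod.get("spec", {}).get("nodeName") or "<unscheduled>"
        meta = pod["metadata"]
        grouped[node].append(f"{meta.get('namespace', 'default')}/{meta['name']}")
    return dict(grouped)


# Hypothetical pod list for illustration only.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "app-1"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"namespace": "kube-system", "name": "kube-scheduler-cp"}, "spec": {"nodeName": "node-cp"}},
        {"metadata": {"namespace": "default", "name": "app-2"}, "spec": {}},
    ]
}

grouped = pods_by_node(sample)
print(grouped)
```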

13:31 Finding Pod Node Assignment

14:17 I had no idea you could describe a node and get a list of pods. That's awesome. This looks like ours. Yep. There we go. That's it. So what's the node name for that node? It is VVXCW. Yes. Yep. So we should probably switch over to a Teleport session on that node. I have made it. So DVXCW. Okay. Here we go. We're in. I think Tom oh, there we go. Alright. I had to do a weird reload, but we're back. Alright. So probably first thing first, we can check to see if containerd is actually

14:49 Connecting to the Problematic Node

15:28 running on this host or not. You know, you can do that either through, like, ps or systemd commands to check the status of the service. So ps on its own will actually only list a few processes. You'll want to add some selectors to it. I tend to use ps aux to list all processes. There we go. Alright. So we can see containerd is actually running. Running. Yeah. And we know the kubelet's running from knowing that, you know, there's a node resource there and it at least tried to schedule

16:30 it on this host. So we can take a look at either the kubelet logs first or the containerd logs and try to see if there's anything in either of those saying why we're not seeing a running container. Let's look at the containerd logs. Where are those found? Alright. So we should be able to use journalctl -u. I think it's just containerd, the service. And then probably give it a -l too to make sure that we get the full line length. You'll need to add a --no-pager.

16:50 Containerd logs

17:20 Or I think you can do -lf. I can't remember. Go for it. Yeah. Try -lf. No. The -lf. Sorry. Yep. Just add an f to the l. Yep. There you go. Yep. Oh, failed to create containerd task. What are we getting here? Just so that I feel less guilty, I did think the error message on this one would be a lot more explicit. OCI runtime create failed. It doesn't look like containerd is gonna give you much. I would maybe try Jason's suggestion. Kubelet logs would be where? It'll be pretty much the same journalctl

18:20 Kubelet logs

18:25 command except instead of containerd, we want kubelet. Alright. Only I'll switch now. Okay. RPC. Alright. I think I'll throw a hand in because the error message isn't what I expected it to be here. I only modified the deployment resource. So you can do this all from the control plane by taking a look at the YAML. Yeah. That's where I was gonna head next because as far as the kubelet and containerd, they're not really giving us much. Yeah. I'm surprised by that, to be honest. I should have checked it, but I

19:32 I thought it would have been okay. Yeah. Yeah. So I'll I'll jump back over. You can close this tab. Think it's safe to say we'll we'll work from the control plane. Yeah. Okay. There you go. Alright. So let's go ahead and do a describe on the full deployment for that. So if we do a kubectl get deployments, let's see what that deployment is called so that we can go ahead and describe it. We're doing a kubectl. What was that? Get deployment? Get deployments. Oh, easy enough. Deployment singular. It helps if you spell it right. I quite like deployments.

19:45 kubectl describe deployment

20:15 Alright. And so we can just do a describe on that klustered deployment. So kubectl describe deployment klustered. With a k. Yep. Yep. I see that. Got it. Alright. Nobody can take the nerve and watch. That is the rule. It's a lot more nerve wracking to tell you why, you know, you're live. This is insane. Are you on a nonstandard keyboard, Tom? I'm on a standard keyboard. That's what makes it worse. I hear you on your mechanical keyboard, so you have an excuse, Jeremy. Where are we here? Okay. Now, live link is ready. Yeah. So you should

21:02 probably just go through this top to bottom and look for anything out of the ordinary. Okay. So we've got There's default. Is that the correct URL for GitHub Container Registry there? Yeah. Yeah. That's all good. The next thing is we can take a look at the port configurations. There's one in there that's kinda standing out to me. It's a little bit odd just looking at it. The liveness and readiness URLs have no IP associated with them. Just a port. Yeah. There's also something else that's standing out as a little bit odd to me

21:56 too as far as some of that container configuration. What's that? Host port zero? Yeah. Zero TCP. Yeah. Host port zero is okay. We're not actually exposing the host port. That's just what that looks like. Okay. What is ghcr.io? That said, I do think y'all were on to something with the port configuration and the liveness and readiness probes not matching up. So we probably need to bring those back into alignment. Yeah. So I've spoken to the developer, and he expects the application to run on port 8080. Just put that there. Okay. So we need to change

22:16 Confirming Correct Application Port

23:01 this, the 8081 to 8080? So Noel also asked in the comments about the no IP in the probes. Yeah. This is just standard convention on liveness and readiness probes. The IP can be omitted: the probe runs against the pod itself, so only the port is necessary. The output always throws me as well. It's weird, but it is okay. Okay. So we wanna edit this thing in place. We can do kubectl edit, and that'll drop us into an editor to make changes to the resource. You wanna tell it what resource, so in this case,

23:21 Editing Deployment: Fixing Port Mismatch (8081 to 8080)

23:35 a deployment, and the name of the resource. So in this case, klustered. Klustered. I think with an e. Yeah. There we go. K. 8880. Looks good. Scroll down. Either that or you can search for 8081. Yep. I'm not seeing 8081. You hit the bottom? You can search in Vim by pushing forward slash and then typing in your search term. Oh, I see what I've done. Yep. There you go. Classic table. Yep. There we go. Where did it go? Here it goes. There it is. Yep. Line 51, I think. Yep. Shift and A to add to that line.

24:41 Shift. And then push one and then hit escape. Yeah. Sorry. Vim is the default editor. If you've not used it before, it is a pain in the ass. So just type one zero and then escape and then colon w q. Yeah. Sorry about that, Tom. No worries. How many ways are there to enter into There are probably I don't know. I use four or five regularly, but I'm sure there's many, many more. But shift A to stay in the current line, j to jump down the next line. There's a whole bunch more. Or I. I think so. But that's booting.

25:16 Checking Pod Status After Port Fix

25:16 That took okay. So if we we fixed that. Alright. So we can go ahead and take a look at those pods again to see if that pod does finish coming up. Because right now, the state of that deployment is showing one desired, one updated, two total, and two unavailable still. That would be too harsh, Kevin. What is it? He said, don't tell anyone how to explain them. I haven't touched them at all in a in a minute. Let's see. Good Throw the throw the whole throw the whole computer away. Did you run it get pods?

26:05 I think he ran the deployment again. Yeah. I did. Describe on the deployment. Hello. Get pods. Alright. And we still see it's ContainerCreating. So we'll wanna go ahead and do another describe on that second instance there, the one with the age of seventy five seconds. This one here? It doesn't show me your cursor, so we want the second one of the two. Yep. 44 v 22? Okay. You need to describe pods. Just the last part. Yep. And I need pods. Yep. Describe pods. Delete space. You need the whole name. There you go. Need a space.

26:18 Analyzing New Pod Describe: Resource Limits Issue (1Mi memory)

27:01 Okay. This is nerve racking. It's alright. You can swap back in in a minute. If you have not done this on Rawkode Live, I recommend I recommend it. That's a lot of fun. Almost a decent year. Alright. And we're still seeing the same issue there. Yeah. However, one of the things that I'm thinking of is as we're setting up the pod sandbox, we need to interact with another pluggable component of the cluster to kinda make that pod sandbox, and that's gonna be whatever CNI is deployed. I have not touched the CNI. You have not touched the CNI. So that

27:45 was good. It's it's really you'll kick yourself, Jason. It's staring you in the face. Alright. I I would like to phone a friend. Audience watching. Give us a hint if you see that. I don't think the audio like, the audience are normally pretty quick to to phone in what they think it is, and no one has mentioned this yet. I'm really, really surprised. But in the interest of time, do you want me to get throw something out there? Is that your way of saying that not only are your guests, but your audience is also disappointing you greatly on this one, David?

28:00 Pod Requests & Limits

28:14 Yeah. Exactly. Yeah. Can you throw us a bone? Alright. I've highlighted the oh, you might not be able to see the screen share. Yeah. I see the share. Limits c m. C p five. Oh, yes. So I really expected the error message to complain about the limits. But Yeah. That was disappointing. So we probably want more than one mebibyte of memory for our pod. I have so much respect for you for actually calling it a mebibyte and not, you know, mixing up SI and non-SI standards. Alright. So we probably wanna go in and

28:59 re-edit that deployment and increase that memory count a little bit. What's the full deployment here? After this command, it's swoppy time again. Nope. The deployment's just called klustered. That's the pod. Yeah. Yep. That's the pod. I'm not going to do the full thing. And, Jeremy, you're up. Alright. We are going back down to two mebibytes. We'll just keep adding a mebibyte until it works. So should we talk about resource requests and limits, get the quick TL;DR on that? Yeah. What do you need to get a Kubernetes cluster up and running? What

29:45 what should we have here as a minimum? So that's highly gonna depend on your application. And underneath the resources, you have two different kinds of resource settings: you have the limits and then you have the requests. Mhmm. The requests are used to make scheduling decisions to say, you know, where should I schedule this pod based on how much memory I'm gonna request from it. Limits actually put in place actual guard barriers. It will not let the application use any more. It'll define some cgroups on the host and limit the amount of memory that you

30:29 have specifically to what you set in the limits. So this is what'll keep me from deploying a 300 gigabyte server somewhere for my cluster. Right? Yeah. Exactly. This is what allows you to basically define, you know, how much over-provisioning you want in your cluster and Mhmm. You know, avoid, you know, too much over-provisioning for those applications. And generally for Kubernetes clusters, you disable swap on the host. So when things do actually run out of physical memory on the host, you're looking specifically at OOM kills rather than

31:12 thrashing swap when those things go bad. So you really wanna make sure that when you're defining your applications, you try to match your requests and limits closely to what, you know, the application generally uses. And you know, if sometimes it'll surge a little bit, you can set the limits a little bit higher than the requests, that sort of thing. In this case, we're talking about a WordPress application, or is this a WordPress application or is this something else? Yeah. It's just a small Rust binary. I mean, we could either remove the limits altogether or you can

31:48 throw it up to, you know, one CPU and a hundred meg of RAM or something like that. That's probably over-provisioning, but they also don't need to be set. You can set that to 500 or 1,000, whatever you fancy. And the way to work this out, as it's a question I think we all probably get a lot, oh, how do I set them, is continuous profiling, good metrics, all of that. Just really start to understand your application under multiple scenarios: low load, high load, etcetera. Yep. And there's things like the vertical autoscaler that you can use

32:26 to get a baseline for what your application uses. It'll automatically increase those limits for you as use kind of increases on the application. We're running. David, did we do the thing? I'll be honest. I did not expect that to be running after that. So clearly, one of the other breaks has not really panned out, but that's okay. But if you can do a get pods on the kube-system namespace, we can maybe take a look at what should have happened. There are more things to fix, don't worry. kubectl. Nope. kubectl. And I can see the chat.
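To put the 1Mi limit and the mebibyte versus megabyte remark from the requests and limits discussion into numbers, here is a minimal Python sketch that converts a few Kubernetes memory quantity strings to bytes. It is deliberately simplified and is an assumption-laden illustration, not the real Kubernetes quantity parser: it handles only whole numbers with the binary suffixes Ki, Mi, Gi, Ti and the decimal suffixes k, M, G, T. The name memory_bytes is hypothetical.

```python
# Binary (IEC) suffixes are powers of 1024; decimal (SI) suffixes are powers of 1000.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def memory_bytes(quantity):
    """Convert a whole-number memory quantity such as '1Mi' or '100M' to bytes."""
    # Check two-character binary suffixes first so '1Mi' is not read as 'M'.
    for suffix, factor in BINARY.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    for suffix, factor in DECIMAL.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


one_mebibyte = memory_bytes("1Mi")
hundred_megabytes = memory_bytes("100M")
print(one_mebibyte)
print(hundred_megabytes)
```

A limit of 1Mi is therefore about one million bytes, far less than even a small binary needs to start, while the "hundred meg" suggested on stream is roughly a hundred times that.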

33:05 Application Pod Running/Ready (Problem 1 Solved)

33:12 Oh, this isn't my turn anyway. This is your turn, Jeremy. Sorry. To follow up on the profiling discussion, Noel in the chat is mentioning that there was a talk at KubeCon about using the VPA to profile applications. Nice. Yep. There we go. Oh, okay. Which namespace am I? In the kube-system namespace, please. kube-system. That's where all of the Kubernetes control plane stuff runs. I think I'm about to learn something else today that I didn't know. Yeah. You see anything weird here? Our scheduler is not happy right now. But we did get a scheduled pod.

33:46 Investigating Kube-System Pods: Scheduler Issue

34:03 We did. It's running, but it's not ready. Yes. So we can go ahead and do a describe on that and see what's going on there. And then in this case, we'll also have to specify the namespace like we did above, so -n kube-system. And then you'll probably wanna copy and paste because we haven't configured tab completion on this host. So I do have a question. If the scheduler is not working, how did it schedule? So I didn't expect the scheduler to really respond to any requests because it wouldn't end up in a service

34:58 endpoint, but that was really naive of me and I now understand that that's stupid because it's a controller that's actually listening to the Kubernetes API server. So the thing I did is merely amusing at best, but completely pointless on Klustered at worst. Because there's only one scheduler. Anyway, it doesn't matter. You can fix it or not fix it. It's up to you. It may be interesting to look at what I changed. Alright. Describe Let me just scroll up, scroll down. Like Let's go up. So and because it's showing that it's not ready, that tells us that the,

35:33 Describing the Scheduler Pod: Identifying Startup Delay

35:54 readiness probe is not firing. And if we look at the liveness probe definition there, we can see that there are indeed failures. And we can also compare the actual command line used for it to that probe configuration. Anything jump out at you, Jeremy, or Tom? Not just yet. So there's a flag at the end of the defined command there that is standing out to me. I sense failure. So if we're looking just a little bit up of where McKay has highlighted on his screen, we see the tail end of command line

36:59 arguments there. Is that a little bit more? So you've got port zero there, but also port 10259 for liveness and startup. Yeah. Yeah. I haven't modified ports. Okay. Not out of. Yeah. That port zero is another thing that always confuses me on Klustered, but perfectly legal. Oh, yes. Is that disabling the HTTP endpoint for it? I don't recall, but let's say that sounds like a pretty good shout. Again, it's simpler than you're thinking about it. Anyone in chat see the

38:01 Fixing Static Pod Manifests (Scheduler Delay)

38:08 see the thing? Oh, you're gonna kick yourselves again. Hold on. I'm not It is a pressure, though. Do you know you got people watching you and you're looking at it, like but I promise you, I have kept all of these superficial breaks. You see any difference between the liveness probe and the startup probe? Oh, there's a nice long delay in there. That 50,000 second delay. You know how many hours fifty thousand seconds is? Thirteen. Okay. That felt too obvious though. I have kept the break simple and I love the one

38:57 that's just showing. It's like even the simple things are so easy to overlook when you're trying to fix something because but yeah, there you go. So how do we fix the scheduler, Jason? What's different about that to other workloads on the cluster? So, obviously, what we're looking at here is the static pod mirror manifest. So the control plane components on a kubeadm-based cluster are deployed via static manifests. And if we actually try to edit them through the API, that would not actually have any effect. So we'd actually wanna look on disk at the configuration here,

39:00 Static Pod Manifests

39:40 and that's basically gonna be in the /etc/kubernetes/manifests directory by default. So assuming that McKay hasn't thrown us a curveball and modified where the manifests are, that's where we wanna look for the kube-scheduler manifest, the scheduler configuration there. I have not done that. I promise. Yeah. So a lot of the comments there were pointing to the port zero as well, but that is a red herring. And that's one of the fun things because Kubernetes now is let's see. It's 2021, originally started in 2014, the project. So we have a little bit of legacy

40:32 now. So every once in a while, there are a lot of command line flags that need to be set in kind of awkward ways to kind of deal with that legacy. So you do end up with some of those things that look weird but are actually perfectly fine. Yep. And it's only through trial, error, and experience that you actually pick up on a lot of those. Like, even prior to Klustered, I had no idea about all these random port zeros and stuff. Yeah. Nor should you. I was gonna use Vim to edit it to see if that was it jumped right

41:05 to the place. So I assumed that that was the That's cool. Vim has been useful for that before on Klustered, I gotta say. I assume there's a way to erase your most recent edited line. Yeah. I'm sure there's probably some Vim cache file I could blow away, but yeah. You can save that five ten and it's all good. Just save it and then I should get. There you go. One bug fixed. Get pods is your North Star here. I'd keep focusing on that. Yeah. So that should be on. You should see that go to ready and

41:30 Checking Scheduler Status After Fix

41:54 what... so the delay to ten seconds. Yeah. Just let it open seconds. And so you can actually do a get pods and add a -w to the end of that. That would actually watch the events. Watching it? Yeah. It depends on how many. I think it was three. So when it gets to thirty seconds, we should see a 1/1. And after this is healthy, we'll swap back to Tom, please. No. Thank you. Noel is enjoying it: the simple things that get overlooked. Yeah. Definitely. It's easy to overlook the simple things. One scheduler.

42:37 We don't have to wait on it, as we found out the break doesn't even work anyway, because the scheduler continues to act as a scheduler even when it's not ready. So we can just get pods and move on just now. No. There are no honks or rickrolls. I was there. Let's take it easy on this one. Alright. Thanks for listening. We look like we're running. Right? Do we wanna go ahead and run that again for all namespaces and see where everything else is sitting too? So get pods -A? Yep. Yep.

43:06 All Pods Running, Application Still Unreachable

43:22 Everything is running. Everything is running. Everything is ready. Yep. Mhmm. Do you want me to try a port-forward from my local machine? Yeah. Let's give that a shot and check on the status of that application. Newcomers. Export. k get pods. He's a little faster than we are, Jeremy. Well, it's that auto-completing there. Exactly. Alright. So, our application. Uh-oh. Uh-oh. But we fixed the thing. You fixed one thing. Could you tell your browser that it should be working and ask it to try harder? It's the part of a horror movie where

44:20 everyone's dancing around the fire celebrating while an emergency emerges from the lake and creeps up onto the dock. Khalid thinks it seems fine. My browser doesn't seem to think it's fine. If we pop over, we're not getting any error messages. So we are successfully getting a request to our klustered pod. However, it would seem that we are not getting a response from our klustered pod. I'm not sure what the timeout is supposed to be here. You will see a connection to a database timing out when it eventually lands. And you are talking directly to the pod

44:55 here, so we don't have to look too deeply into things like the service networking and that sort of thing. So we can definitely just start looking directly at the pod. If we're expecting it to talk to a database, we probably wanna figure out where that database is supposed to be so that we can, you know, actually talk to it. One of the things standing out to me is that I didn't see any pods related to a database. David, are we supposed to have a database? I have a drone. kubectl something something. I mean, how does this speak to a

45:11 Troubleshooting Application: Database Connection

45:38 database? That's not a command, David. So I'm gonna have to tweak the timeout on my Rust application. Really, you should have an error message right now saying it's trying to speak to something called Postgres. So there's your cue. There's your cue. Okay. So one of the things we can look at here is, you know, what types of resources would we use to even deploy a database? So we can look to see if there are any StatefulSets defined in the cluster. So just like a kubectl get statefulset. Postgres is not ready.

45:51 Checking for Database (StatefulSet)

46:19 Yeah. So we can do the same thing that we did with the previous ones. We can run describe on there and see what type of data it gives us. And you'll need to tell it what you wanna describe. So in this case, a StatefulSet. StatefulSet. Okay. Alright. And one of the things that I noticed is it wasn't even trying to deploy a pod here. Otherwise, we would have seen that when we were doing the pod output. So if we look at the configuration of this, you know, there's something that stands out to

47:00 kubectl scale

47:05 me immediately that explains why we weren't seeing any type of pod creation. Yeah. Please don't write down my password. I use this for everything. I was gonna say, besides the terribly weak password? It seems that we've desired zero replicas, which seems like exactly what we've got. Yes. Yep. What we can do is we can... Databaseless. Databaseless. We can actually use the kubectl scale command here to go ahead and change that. Alternatively, you could just do an edit on that resource and change it like we did before. But since StatefulSets expose the scale subresource,

47:51 Scaling the Database StatefulSet

47:52 we can just do kubectl scale. And if you just hit enter there, it'll help guide us through what we need to do. So, actually, do kubectl scale --help, and that'll give us... and I think in this case, we just tell it what we want to scale. Yep. So we'll wanna specify that --replicas, and we probably don't wanna go higher than one here. I don't know if this... There's no PVCs being used. All the data is loaded through an init container. You can scale it to whatever you wish.
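As a rough illustration of why a StatefulSet with zero desired replicas produces no pods, the sketch below derives the pod names a StatefulSet would be expected to own from its name and replica count, and builds the matching `kubectl scale` invocation. The StatefulSet name `postgresql` is an assumption based on what gets typed later in the session, and both helpers are hypothetical, not part of any Kubernetes client library.

```python
def expected_pods(statefulset: str, replicas: int) -> list[str]:
    """Pod names a StatefulSet is expected to own: <name>-<ordinal>."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return [f"{statefulset}-{ordinal}" for ordinal in range(replicas)]


def scale_command(statefulset: str, replicas: int) -> list[str]:
    """argv for scaling the StatefulSet with kubectl (name is assumed)."""
    return ["kubectl", "scale", f"statefulset/{statefulset}", f"--replicas={replicas}"]


# With zero desired replicas there is nothing to create, so `get pods`
# shows no database pod at all.
print(expected_pods("postgresql", 0))
# After scaling to one replica, a single pod with ordinal 0 is expected.
print(expected_pods("postgresql", 1))
print(" ".join(scale_command("postgresql", 1)))
```

The point of the sketch is only that "no pod" and "desired replicas: 0" are the same fact seen from two sides, which is why describing the StatefulSet was the useful next step.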

48:39 It'll be fine. Yep. Replicas, however many: three, ten, whatever. statefulset, slash... And that says swap your time again, please. Anything else? We good? You know, postgresql. No space, and probably singular instead of... Big reset? Yep. Slash. So you can use the singular, the plural, or the short versions across all of these commands as well. Okay. But yeah. Best to pick one and stick to it. There's an extra e in there, looks like. Oh, there it is. Post egress. Failed. Alright. So we take a look at those. Take a look at that again, we should

49:42 see it's created a pod. Yep. There we go. Create pod, stateful. Successful. And now we're just waiting to see if the liveness and readiness probes finish up. I'd probably run a get pods or something. Yeah. Great tip, whoever's typing, but Ctrl+R and then typing part of a command will bring it front and center for you. Alright. So it looks like those Postgres pods are in a good state now, so we can potentially go back and try that application again. And, David, that'll be hitting Command+R. We'll do that for you? No. So if you

49:44 Database Pod Running, Application Still Failing

50:34 do... Oh, control. Ctrl+R and then type pods. There you go. And then hit return, you get your command. So that's... Oh. Still timing out. I meant... Nope. But so it's not gonna work if you refresh. So we got... well, look at your klustered pod. Up at the top there, MGTPW. Right here. That one's not restarting. Unready, restarted five times. Yep. So one of the things we can do here is we can actually look at the logs for that klustered pod and see... I'm really sorry. There's no log output. There's no log output. Just because it's a really simple application.

51:17 Troubleshooting Application Pod Again (Checking YAML)

51:36 I have to... terrible. Who built this application? You'll wanna just... I would just dive through... We can only do so much with what we're given here, David. Just do a... see, we've done a lot of describes. Do a get pod -o yaml and take a look at the definition. I'm gonna be honest. I don't remember what the last break in that file is. Get pod yaml? Get pod, the name of the pod. So just copy and paste the klustered one in and then do a -o yaml, which

52:01 Analyzing Application Pod YAML: CPU Limit Issue ("1")

52:07 will output the spec. And a great tip from the audience there as well: aliasing kubectl to k is a good time saver. Yeah. Really, I need to start upgrading these clusters to 1.21 so we don't get the managed fields. Let us see. Oh, okay. You broke it yourselves. That wasn't me. The CPU should be 1,000. Yeah. It wasn't me. Good. I couldn't remember what the other break was there. Yeah. You just wanted to kubectl edit on that. Who's got the keys? That was Tom most recently. Okay. Oh, it's your turn, Jeremy. Sorry. I'll stop

53:06 Editing Deployment: Fixing CPU Limit and Image Pull Policy

53:18 talking to you. Go ahead. Hands off. Alright. We're almost there. Come on, team. We want our kubectl edit deployment klustered. Is someone firing keystrokes in there? I'm judging by Tom's laughter, yes. You can't trip me while I'm crawling. And while we're in here, let's fix one more thing that I don't think we need to worry about, but let's search for policy. I did modify the image pull policy. Maybe Jason will be so kind just to tell us why that was such a stupid thing to say. Yeah. So the image pull policy,

54:26 basically, you know, it depends on how you're validating and deploying the containers that you wanna use in the environments. So right now, that one being set to Never means that, basically, it will never try to pull down the pod onto the host as part of bringing up that pod. It'll expect that that container has already been pulled onto the host by some other process. Generally, you want it set to at least IfNotPresent so that when the pod is scheduled, it'll, you know, pull that container image if it isn't already on the host.

54:30 ImagePullPolicies

55:09 In some cases, some people set it to Always, but that has some other implications too. Because if you're relying on the behavior of Always, that means you're modifying the tags that you're using for images, and that could lead to other potential issues there too. So it doesn't really matter what we set it to for this case. But... Go up, Jeremy. Which line? CPU one. Two more. Yeah. CPU is one. Down two. Down three. Right there. It should be 1,000. So one CPU actually means one millicore of a CPU. But I'm surprised that's a string. Is that supposed to be a string?
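Since the CPU values in this part of the session are easy to misread, here is a small sketch of how Kubernetes-style CPU quantities are commonly interpreted. As far as the quantity notation goes, a bare `1` is one full core (1000 millicores), `1000m` is the same amount written in millicores, and a bare `1000` (which the API may display as `1k`) is a thousand cores. The function name `cpu_millicores` is hypothetical, and the parser only handles the `m` and `k` suffixes, which is an assumption made to keep the example short.

```python
import math
from fractions import Fraction

# Multipliers for the suffixes handled by this sketch (assumption: only
# "m", no suffix, and "k" are needed for the values seen in the session).
_SUFFIX_TO_CORES = {"m": Fraction(1, 1000), "": Fraction(1), "k": Fraction(1000)}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity string such as "1", "500m" or "1k" to millicores."""
    text = quantity.strip()
    suffix = text[-1] if text and text[-1].isalpha() else ""
    if suffix not in _SUFFIX_TO_CORES:
        raise ValueError(f"unsupported suffix in {quantity!r}")
    number = Fraction(text[: len(text) - len(suffix)])
    cores = number * _SUFFIX_TO_CORES[suffix]
    # Round up so that a tiny non-zero request never becomes zero.
    return math.ceil(cores * 1000)


for value in ("1", "1000m", "500m", "1000", "1k"):
    print(value, "->", cpu_millicores(value), "millicores")
```

Read this way, a limit of `1000` or `1k` asks for far more CPU than a typical node has, while `1000m` asks for exactly one core.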

55:32 Checking Pod Status: CPU Limit Resolved (Pod Pending)

56:02 There's a few fields in Kubernetes that are defined as int-or-string, and will take either an integer or a string as content. So I expect that's probably what we're dealing with here for the CPU limits. Alright. So that should... assuming your container gets scheduled on the original node where the image is pulled, you should be okay. And it's pending. So we got really unlucky there, and you're gonna have to modify the image pull policy, I'm afraid. Yep. Where's that? Just... Oh, did it. Same file? Yeah. So you can just search for Never

56:52 here, and for simplicity, just replace it with Always. With a capital A. Yep. Are you on your ErgoDox today, Jeremy? Yes. That's not... does that what jumps me there? That jumps me there. Right. So Always and IfNotPresent. Always is probably easier for now. Perfect. Okay. There's one more thing, another to fix our application. And hopefully, just pulling on a new node should be harmless. You can do a watch again if you want. I dropped the -A here because just getting things in the default namespace will make the output a little bit clearer.
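The three pull policies discussed here can be summarised in a short sketch. It models only the decision described in the conversation: whether the node contacts the registry, and whether the container can start at all. The function names are hypothetical, and the model assumes that any registry pull that is attempted succeeds.

```python
def contacts_registry(policy: str, image_on_node: bool) -> bool:
    """Whether the node reaches out to the registry before starting the container."""
    if policy == "Always":
        return True  # simplified: the registry is consulted on every start
    if policy == "IfNotPresent":
        return not image_on_node
    if policy == "Never":
        return False
    raise ValueError(f"unknown imagePullPolicy: {policy!r}")


def can_start(policy: str, image_on_node: bool) -> bool:
    """Whether the container can start, assuming any attempted pull succeeds."""
    return image_on_node or contacts_registry(policy, image_on_node)


# A pod rescheduled onto a node that has never seen the image:
for policy in ("Never", "IfNotPresent", "Always"):
    print(policy, "->", "starts" if can_start(policy, image_on_node=False) else "stuck")
```

Under these assumptions, `Never` only works while the pod happens to land on a node that already holds the image, which matches the "we got really unlucky" moment above.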

57:17 Editing Deployment: Fixing Image Pull Policy ("Never" to "Always")

58:07 Handing is what I need to be honest. Mind if I type for a second? Go for it. Is that breaking the rules? Are we allowed to do that? Well, I mean, it's not a break that I brought, so I think it's alright. Okay. And there's me using my alias. Sorry. So I think I... Oh, insufficient CPU. One k? There we go. So I think that was that string conversion thing, and then it goes and changes it to 1k. It must be using, like, string to... whatever that function is. So the 1000m just explicitly says one

58:13 Editing Deployment: Fixing CPU Limit Again ("1" to "1000m")

59:04 one CPU, 1,000 millicores. That's weird. I've never seen that before. See, now you're all teaching me stuff as well. That's awesome. Hey, you got your pod back. You want me to do a port-forward? Can we check, see if it's working? Of course. Get pods. This one. Okay. Now it's complaining, and I'm sure we'll get a timeout in hopefully just a few seconds, but it can't reach the Postgres service. You'll need to do both of that. It's the last thing, I swear. Well, so if it can't reach the Postgres service, you know, some of the things that we

59:17 Pod Running/Ready, Application Still Timing Out: Checking Services

59:58 can look at are, you know, how's it trying to reach a Postgres service? How's the application configured to talk to the database? And based on that, we may need to look at the actual service that's fronting the StatefulSet, if there is one. I think that's a good idea. Yeah. That was subtle. So we can go ahead and just do a kubectl get services here and see what's defined as far as services. Who's up now? Jeremy. Yeah. Jeremy, go on. I may need to refresh my Teleport. You should hear banging, clicking. Alright. Tom, you're up. And

1:00:00 Service Endpoints

1:00:43 no... Alright. So kubectl get svc or get services. Oh, woah. Woah. What just happened? I could see. To reload sometimes when someone joins it. I guess a low can... Yeah. Okay. I could see it, but not control it. Secure kube control mode. I like it. SVC. The cuddles. SVC. Okay. There we go. Cubic TLS. Yeah. So we do have a Postgres service. I'll let Jason get... It is running. Yep. We can go ahead and... I'm curious, Jason. Do you mind if I jump in for a second? Yep. Like, based on all of the kubectl stuff we've been

1:01:17 Describing the Postgres Service

1:01:29 doing so far, Jeremy and Tom. Right? We've done a get service. What do you think's next? Probably describe that service. Describe that service? Good call. I was just curious if, you know, hopefully things were landing there. If at the there's said jumping in too. Describe it. One of the other things I was gonna say too is a lot of the commands also have a wide mode. So if you do, like, a -o wide, sometimes, you know, it'll give you more detail on that get as well. Yeah. That's a great tip too. You should do that as well.

1:02:05 Like the kubectl thing, as you know, it's a REST-like client: get, describe, edit. These things are just the staples of all Kubernetes interactions, at least through kubectl, anyway. So what's wrong with that? It's get service -o wide. Yeah. It works for the get, not for describe. Gotcha. And you guys need to describe. Delete, describe. Yes. Yes. Yes. Okay. Boom. Okay. Alright. So let's see. Looking at that Postgres service, we can see that it's of type ClusterIP. That makes sense. We have an app=postgres for the selector. It has an IP.

1:02:32 Analyzing Service: No Endpoints

1:02:58 And if we look at the port... And this is where it's been a while since I dealt with Postgres. I don't quite remember if it's supposed to listen on TCP in addition to UDP or if it's just a UDP service. Good catch. So we need both? No. TCP. Just TCP. Okay. Yep. So, edit. And then you've got to give the resource type first. Edit the service. Here we go. Protocol. Go down. Alright. You're almost there. You'll probably wanna describe your service again with a -o wide. So this is the output that's the most

1:04:42 important. Alright. And if you look at that service above, the Kubernetes service, you can see that there are some endpoints defined for that one. And we don't have any endpoints currently for the Postgres service. And the way that the service maps pods is it actually looks for that label selector that's defined. So in this case, app=postgres. So one of the things we can do is look at that Postgres StatefulSet to make sure that it has that label actually set on it. But we don't... so since that's not set,

1:05:25 Identifying Label Mismatch: Service Selector vs. Pod Label

1:05:29 no endpoints from Noel. No. Well, yeah, the only thing you know right now is there's no endpoints, and Jason has given you some great advice that endpoints are connected to selectors. So we probably wanna check that this selector exists on our deployment, the app=postgres. So if you describe our klustered deployment... yeah. They already answered it. They're jumping in with the endpoints and label check too. Because I think that's bitten everybody at some point. Right? Oh, yeah. Deployment custard. Yeah. I did consider calling the series Custard, but it wasn't as catchy. Alright.

1:06:24 So... The important part here is what we're gonna wanna look at is the actual pod template, not necessarily the labels on the deployment itself. So underneath the pod template... actually, this is the klustered deployment. We wanna look at the Postgres deployment. Oh, sorry. Not deployment. StatefulSet. Yep. So if we look at that pod template, labels: app=postgresql. And I think the service was for postgres, not for postgresql. Ah, Postgres. Clumsy fingers. Sorry, y'all. I don't believe you're sorry. Sorry is... So Sorry is the cover of box. And you would do it again.
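The mismatch found here can be sketched in a few lines: a Service only gets endpoints for ready pods whose labels contain every key and value in its selector. The pod name and IP below are made up for illustration, and `selects` and `endpoints` are hypothetical helpers that only mimic equality-based selectors.

```python
def selects(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every selector key/value pair is present in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """IPs of ready pods matched by the selector."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return [pod["ip"] for pod in pods if pod["ready"] and selects(selector, pod["labels"])]


# Hypothetical pod, labelled the way the StatefulSet's pod template labels it.
pods = [{"name": "postgresql-0", "ip": "10.0.0.12", "ready": True,
         "labels": {"app": "postgresql"}}]

print(endpoints({"app": "postgres"}, pods))    # selector as found on the Service
print(endpoints({"app": "postgresql"}, pods))  # selector after the fix
```

Either side could be changed to make the two agree; the sketch only shows that an exact string match is required, so `postgres` and `postgresql` never meet.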

1:07:17 Editing Service Selector to Match Pod Labels

1:07:33 Does it matter, Jason, which one we change? So it shouldn't matter. The one thing, the StatefulSet, I don't think a StatefulSet defines... Does it have a label selector for the pods that it manages? I do not... Yeah. I would edit the service just in case. It's more like... Yeah. I think even potentially the template labels as well are immutable on deployments and StatefulSets. I don't even remember if those can be changed. You want... No. No. Close. You don't need deployment there. You just want the service. The service is called postgres.

1:08:28 Yeah. I've got the typo there and an extra space. Let's edit that, and let's find our label. And it's actually the selector. Oh, selector right there. Okay. See it. Postgres. Let's change that to QL. Just go back. You wanna describe that service again. Yep. There should be an l in Cuddle. Yep. I'm gonna add documentation and a code of conduct for the show where you have to pronounce it "kube control", or you're not invited back. I'm just... Alright. So now we do have endpoints. Good job. So assuming the application is test

1:09:44 it? Configured correctly to talk to the service, it should be able to. Ta-da. That's it. He was just like, random box. I keep meaning to re-encode it. But we are getting our wonderful cloud native quotes at the top of our service. So it's all good. It's working. Good job. Fair enough there. Yay. Awesome work. I need a nap. I'm just glad for this time I wasn't having to type live on air. Come fix a cluster, they said. It'll be fun, they said. Well, I mean, let's focus on... right. Did you learn a little Kubernetes

1:10:10 Conclusion and Lessons Learned

1:10:28 today? Yes. I hope that's coming through. I think a lot of these breaks are things that just happen day in and day out in Kubernetes. So easy to make: label selectors being wrong, ports being mismatched, protocols, and just, you know, really simple things that can really trip people up. You know, I hope this helped build a little bit of familiarity with kubectl. Jason does a great job of guiding you through all of that. And I think you both asked a lot of really good questions throughout that as well. And hopefully we just provided some really good fundamental

1:11:01 Kubernetes knowledge. So thank you all for joining me today. I really enjoyed watching you suffer a little bit, but also... Was glad to be here, and it also easily illustrates the use of a second set of eyes even if you don't know what the problem is. Frequently phone a friend and bounce something off them instead of banging your head on it by yourself. I am now only referring to this as sadism with David. Well, I'm gonna move the comments so I stop getting my eyes covered up. So no. That was fun. Thanks for having

1:11:38 me. That was a lot of fun. Nerve-racking, but it's so much fun. And, Jason, thank you for joining us. This would have been... six hours later, I'd still be where I was when I started, so thank you for your help. Absolutely. Alright. Thanks for having me. Thank you, everybody. Have a wonderful day. We'll see you all next week. Thanks. Bye all.
