Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Diagnose a non-responsive Kubernetes API server by checking control-plane pods, logs, and etcd connectivity.
  2. Identify and kill rogue host-side processes and scripts, including rickroll symlink tasks, blocking cluster recovery.
  3. Repair DNS and storage stability by fixing resolv.conf, CoreDNS config, node taints, and corrupted manifests.

Jason DeTiberus joins to debug two broken clusters from Thomas Stromberg and CloudNative Wales. Expect rogue processes, mangled resolv.conf, exhausted inodes in /tmp, a rickrolled manifest symlink, and recursive bind mounts.

Chapters

  1. 0:00 Viewer Comments
  2. 3:18 Introduction to Klustered and Guest Jason DeTiberus
  3. 5:10 Introducing Cluster 9 Breaker (Thomas Stromberg) & Challenge Level
  4. 5:59 Accessing Cluster 9 via Teleport
  5. 7:05 Initial Check: kubectl get nodes Fails
  6. 7:36 Debugging Control Plane Processes (ps, ctr)
  7. 9:24 Suspicious Process Found ("while true")
  8. 9:49 Killing Suspicious Process
  9. 10:23 Checking Kubelet Logs (API Server Crash Backoff)
  10. 11:18 Checking API Server Logs (etcd Connection Issues)
  11. 12:20 Investigating etcd Port Status (Listening?)
  12. 12:50 Checking etcd Logs (No Obvious Failures)
  13. 14:45 Forcing API Server Restart (Touching Static Manifest)
  14. 15:24 Hint from Breaker: etcd Listening But Not Answering?
  15. 15:44 Attempting etcdctl Install (DNS Issue Discovery)
  16. 16:25 Fixing Host DNS Resolution (/etc/resolv.conf)
  17. 17:15 Verifying Host DNS (Ping Google)
  18. 17:40 Querying etcd with etcdctl (Success - Binary Data)
  19. 18:51 Forcing Control Plane Restarts (Touching Static Manifests Again)
  20. 19:31 API Server Back Online (kubectl get nodes Success)
  21. 20:16 Checking System Pods (CoreDNS, CCM Unhealthy/Pending)
  22. 20:50 Troubleshooting CoreDNS (Logs Unhelpful)
  23. 21:00 Explanation of Static Pod Restarts via Touch
  24. 21:50 Restarting Cloud Controller Manager Deployment
  25. 24:48 Restarting CoreDNS Deployment
  26. 26:50 CoreDNS Pod Running but Not Ready (Probes Failing)
  27. 27:20 Inspecting CoreDNS Manifest (Probes Config)
  28. 30:05 Inspecting CoreDNS DNS Policy (Default)
  29. 32:44 Checking CoreDNS ConfigMap (Corefile)
  30. 35:01 Identifying Potential ConfigMap Modification ("lame duck")
  31. 35:58 Teleport Session Interruption (Resource Exhaustion?)
  32. 38:47 SSH to Cluster 9 Fails
  33. 39:49 Moving to Cluster 10
  34. 40:00 Initial Access Issues on Cluster 10 (Teleport Hangs)
  35. 40:55 Direct SSH to Cluster 10 Control Plane
  36. 41:10 Cluster 10: API Server Down, Kubelet Running
  37. 41:45 Cluster 10: Static Manifest Backups Found (~ Files)
  38. 42:41 Returning to Cluster 9 (Teleport Restored)
  39. 43:15 Cluster 9 Continued: API Server Down Again, Checking System Load
  40. 44:51 Checking All Namespaces: Evicted Pods (Honk, Rawkode)
  41. 45:51 Checking Pod Eviction Reasons (WordPress Volume, Honk inodes)
  42. 47:50 Identifying inode Exhaustion Source (/tmp Directory Size)
  43. 51:35 Confirming /tmp Issue (ls Hangs)
  44. 55:00 Deleting Excessive Files in /tmp (Using find)
  45. 57:50 DNS Issue Returns Again (/etc/resolv.conf Modified)
  46. 58:58 Identifying and Removing the DNS Cron Job
  47. 1:01:50 Fixing CoreDNS ConfigMap
  48. 1:02:27 Restarting CoreDNS to Pick Up New Config
  49. 1:03:00 CoreDNS Pod Creation Failure (DNS, Image Pull Again)
  50. 1:04:40 Host DNS Reverted Again (Despite Cron Removal)
  51. 1:05:50 Deleting the Problematic Cron File
  52. 1:06:40 Re-fixing Host DNS Resolution
  53. 1:07:45 CoreDNS Image Pull Backoff (Still Failing)
  54. 1:07:59 Identifying CoreDNS ImagePullPolicy Issue (IfNotPresent)
  55. 1:08:39 CoreDNS Healthy, Control Plane Stabilizes
  56. 1:09:00 Changing CoreDNS ImagePullPolicy to Always
  57. 1:09:05 Troubleshooting Application Pods (Pending/Evicted)
  58. 1:10:06 Confirming CoreDNS ConfigMap Modification (Health Check)
  59. 1:12:58 Identifying Node Taints (Disk/Memory/Network Pressure)
  60. 1:15:50 Removing Node Taints
  61. 1:16:40 Pods Start Scheduling/Running Faster (Ceph Recovers)
  62. 1:17:20 Debugging WordPress/MySQL Volume Attachment Issues
  63. 1:18:20 Force Deleting Pod with Volume Problem
  64. 1:19:00 Continued Volume Attachment Issues
  65. 1:21:37 WordPress Pod Running but Stuck on Volume
  66. 1:21:48 Deleting WordPress PVC (Acknowledging Potential Data Loss)
  67. 1:22:55 WordPress Pod Recovers with New PVC
  68. 1:25:30 Cluster 9 Appears Fixed
  69. 1:25:55 Returning to Cluster 10
  70. 1:26:10 Cluster 10 Access Issues Resume
  71. 1:27:58 Investigating Cluster 10: API Server Down, Netstat Shows Port Not Listening
  72. 1:31:20 Discovering `/etc/kubernetes/manifests` is a Symlink
  73. 1:31:44 Viewing the Rickroll Manifest
  74. 1:31:50 Identifying Empty etcd Manifest
  75. 1:32:24 System Instability & Teleport/SSH Hangs (Recursive Mounts?)
  76. 1:33:10 Access Restored, Symlink Target Identified (`/home/kube/manifests`)
  77. 1:33:50 System Hangs Again (Recursive Mounts Confirmed?)
  78. 1:35:00 Using `mount` to Confirm Recursive Mounts
  79. 1:37:10 Access Restored, Confirming Recursive Mounts
  80. 1:38:10 Attempting to Unmount Recursive Mounts (Fails - Busy)
  81. 1:38:55 System Hangs Again
  82. 1:40:00 Access Restored, Unmount Still Failing
  83. 1:41:45 Stopping Kubelet Before Unmounting
  84. 1:42:10 Unmount Attempt After Stopping Kubelet (Still Busy)
  85. 1:43:25 System Hangs Again
  86. 1:44:45 Using lsof (from another terminal) to Identify Busy Processes
  87. 1:45:39 Identifying and Killing the Recursive Mount Process Script
  88. 1:46:10 Unmount Attempt After Killing Script (Still Busy, Faster)
  89. 1:47:20 Identifying Remaining Busy Processes (Shells in Mount Path)
  90. 1:50:00 Killing Remaining Suspicious Shell Processes
  91. 1:50:40 Removing the Recursive Mount Symlink
  92. 1:51:20 Static Manifest Directory is Gone, Need to Recreate
  93. 1:51:58 Locating Static Pod Path Configuration
  94. 1:53:05 Recreating Control Plane Static Manifests (Using kubeadm phases)
  95. 1:54:52 Lazy Unmounting Remaining Busy Paths (`umount -l`)
  96. 1:56:21 Starting Kubelet
  97. 1:56:55 Kubelet Logs Show Progress, API Server Pod Creating
  98. 1:57:50 Cluster 10 API Server Restores (kubectl get nodes Success)
  99. 1:58:30 Final Pod Check on Cluster 10 (Healthy)
  100. 1:59:05 Cluster 10 Resolved
  101. 1:59:30 Conclusion & Thanks
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

3:18 Introduction to Klustered and Guest Jason DeTiberus

3:18 Hello, and welcome to today's episode of Rawkode Live. This is the fifth episode of Klustered, a show in which we have two Kubernetes clusters broken by members of the community, and we will try to fix them live. Now before we get started and I introduce my guest for today, I'm gonna remember to do my housekeeping. So please remember, we have a Discord channel. Come and join us in the chat. Right now, we are giving away a Klustered t-shirt. All you have to do is join the Klustered channel and thumbs up on the contest, and you'll have a chance to win. The

3:50 prize should be drawn in roughly one hour. Also, remember to subscribe to us on YouTube and hit that bell. That helps other people find the content. And I wanna thank Equinix Metal, my employer. They donate bare metal, the resources, and the time for me to do this. As much as I hate it and enjoy it at the same time, I've gotta thank them. It's a pleasure. Alright. Let's crack on. Today, I am joined by a friend and colleague, Jason DeTiberus. Hey, Jason. How are you? Hey. Thanks for having me. I'm doing well. So for those who don't know me,

4:25 my name is Jason. I've been involved with Kubernetes now for, oh, wow, about six years now. Been doing mostly cluster lifecycle management and trying to help people get Kubernetes clusters. So it should be fun to help try to rescue some broken ones. It's been a while since I've had to troubleshoot and fix live clusters. Yes. Yes. Definitely. I think that your experience is gonna be fantastic today with all the things that you've done in the Kubernetes community and the projects that you've worked on. It's like I'm hoping we're gonna sail through this in

5:05 record time. That's what I'm hoping for today. Okay. Let's see. We have two great clusters today. Our first cluster was broken by Thomas Stromberg, a name that people from the Kubernetes community may be familiar with. He is responsible for the minikube project and has set a little bit of fear into my life today. I'm gonna pull up a tweet. Oh, scroll, scroll, scroll. There we go. Thomas was concerned that the break was too breaky, if that's a thing, and isn't actually sure if we'll be able to fix it, but I'm hoping we can surprise

5:10 Introducing Cluster 9 Breaker (Thomas Stromberg) & Challenge Level

5:46 him a little bit here. So let's crack on that and Thomas has joined the chat. There we go. Thomas is a horrible goose. Yes. Thank you. We're excited to tackle this one. Alright. Let's see. Let's pull up my screen share. We should have a shared Teleport instance. We use Teleport to commoditize access to the server, and, hopefully, it means that Jason and I can share a terminal. Let me remove the Discord chat for now. Alright. Jason, you should be able to join the session by going to active sessions and clicking join. We hope. And feel free just to type an echo

5:59 Accessing Cluster 9 via Teleport

6:24 hello if you can. Yep. Let's see. That's me. Oh, there we go. There we go. Hey. Nice. Okay. So we both have a shared terminal. Our cluster is a single control plane. I have learned that now after four episodes, three episodes that I don't wanna replicate my fixes to three different cluster nodes. Although we do still have three Oh, there's only two worker nodes here. I wonder if one of them is dead. We're gonna find out. So what would you like to do first, Jason? What do you wanna check? Well, I think the first thing we should

7:05 Initial Check: kubectl get nodes Fails

7:11 probably do is try to see if we can reach the API server and go from there. So and it looks like we do not have access. So Go figure. That's at least telling us something. We're already on the control plane machine, so we don't have to move to a different host. So we're using containerd here. So I can do a ctr namespaces because I can't remember exactly what the namespace is for Kubernetes, and it's k8s.io. And I don't have tab completion, but I'm just gonna reload that. That's that weird thing. We're good.

7:36 Debugging Control Plane Processes (ps, ctr)

8:14 And I So I've never used the ctr command before. I think I can do c ls to look at all the containers in there. Now, alternatively, I could probably use the I don't even know how you would pronounce it. The crictl tool, which would let us see pods. Okay. So we have an API server that's not responding. Would you not just use ps? Is there an added value to maybe going through the ctr command? Yeah. I mean, I think probably not. I mean, either way would work. Normally, if it was a Docker-based host,

8:59 I'd just do a docker ps and go from there, and then docker logs to get the logs for the host. Okay. Alright. But Cool. We can do a ps. That'll work. It helps if I grep. Right? No API server. Do we have an etcd? It looks like we at least have etcd. Oh, I see something very suspicious already with a while true there. Yes. So I mean, I'm I'm does it we've seen that? I'm assuming we can just go ahead and kill that. We probably don't want that to be running. Right?

9:49 Killing Suspicious Process

9:49 Yeah. Probably not. So let's go ahead and Thomas has commented and says that crictl is "cry cuddle", which I think is our advice for today. Alright. That at least got rid of that. We'll see if it respawns again at least. Yeah. So I think that was just killing etcd. Anyway, I don't think it was Yeah. And we can see etcd there, so that's good. Yep. The other thing we can do is we can also look at the kubelet logs as well. And that's because the kubelet is responsible for spinning up through static manifests all of the

10:23 Checking Kubelet Logs (API Server Crash Backoff)

10:36 control plane controllers. Right? Exactly. So in this case, we can see that there's a crash backoff loop. Failed to get status for kube-apiserver, TLS handshake timeout. That's not too surprising there. We've got quite a lot going on here. So I think So should we take a look at the logs? The API server logs. Right? Yeah. Let's go there. Okay. I think they get output to /var/log somewhere. Duffy is suggesting we restart the kubelet and I'm not falling for that one again. It caused me more problems than it

11:18 Checking API Server Logs (etcd Connection Issues)

11:42 fixed last time. What do we have here? We got kube-apiserver. Oh, I think I need to reload again. Oh, no. You were just looking at tab completion. Okay. That's not shared. Alright. Hey. So does that mean it's unable to speak to etcd? I see a context deadline, speaking to localhost 2379. Do I remember right that that's the etcd port? It is. So we can probably go there to etcd. Alright. And we have fixed etcd, so I guess we just need to either wait or hurry it up a little bit.
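The API server log above points at localhost port 2379, the etcd client port. As a minimal sketch of that first check, assuming only Python's standard library, the hypothetical helper below reports whether anything accepts a TCP connection on a given port. It does roughly what the netstat check does from the client side, and it is not a command used in the episode.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

Calling port_accepts_connections("127.0.0.1", 2379) on the control plane node would show whether something is listening there. A successful connection only proves that the port is open; it does not show that etcd answers queries, which is why a client such as etcdctl is still needed.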

12:20 Investigating etcd Port Status (Listening?)

12:49 So looks like we have two logs here. I'm guessing at one point, we had multiple containers. Oh, that's good. Yeah. That one looked good. At least no obvious failures. So the only error we've seen in API server logs was the context deadline for etcd. Etcd was broken by, we think, the infinite sleep or while true sleep kill, which we've stopped. Should we just check the API server's running again? We can. One of the other things, because we saw that not being able to connect to 2379, we can just do a

12:50 Checking etcd Logs (No Obvious Failures)

13:51 netstat and see if we're actually listening there. And it looks like we are. I don't see a 2379, do you? Oh, I guess. I I scrolled up, but I guess we don't share the same scroll button. No. We don't show the Yeah. Oh, yeah. Okay. Yeah. It is there. Cool. Looks good. So we at least have an etcd port. We can go ahead and take a look at the API server logs again, or we can see if there was a second container there too. Alright. Whatever you think is best. Alright. It looks like there's only

14:42 the one there. So if that was in a crash loop backoff rather than like It probably just hasn't Yeah. There'll be some sort of exponential backoff on that one there. So can we encourage it? Yeah. I should be able to just touch the file in here. But since I have watched an episode or two before, before I touch it, it doesn't hurt to take a look at the timestamps here. And it looks like the timestamps are relatively normal, so it's not gonna hurt to touch it. Now, Thomas is dropping us a small hint. He

15:24 Hint from Breaker: etcd Listening But Not Answering?

15:27 says, it's listening, but does it answer? We may have to dust off our Telnet command or something or etcd commands. I need to pull up my famous cheat sheet that I keep borrowing. So we don't have We don't have etcdctl installed. We should be on an Ubuntu host here. I don't know if it's prepackaged for Ubuntu or not. etcd-client. That'll work. Yeah. I've done that a few times now. If I can. Something a little bit iffy with the networking here. And the plot thickens. So we have connectivity, but we don't have DNS, it appears. Hold

16:25 Fixing Host DNS Resolution (/etc/resolv.conf)

16:29 on. Someone said there's always DNS. I need to pop that up there. Well, he was in early with that. There we go. Well, that's an interesting name server that we have there. Yeah. I think we could probably just drop that out for 8.8.8.8. And I know some people are probably screaming at the screen because I don't remember my vi shortcuts to do things like delete stuff. Do you wanna try curling Google or yeah. I'll ping Google. There we So we have that. K. So that wouldn't have I think that was just yeah. Luck that

17:15 Verifying Host DNS (Ping Google)

17:27 we found out trying to do an update. That won't be our etcd problem, so we still need etcdctl. Do you want me to grab the cheat sheet or have you got this? If you have the cheat sheet handy, let's go ahead and use that. Otherwise, I'm gonna be fighting with the I really need to bookmark it at this point I feel just because I use it so much, but these are the way my favorite. I'm gonna put this in the show notes actually. I keep forgetting to do that. So if I just, if I just paste this,

17:40 Querying etcd with etcdctl (Success - Binary Data)

18:07 So that last command was the etcd get Kubernetes. So we can speak to etcd. I'm not sure if that mumbo jumbo is what we wanna see though. I don't remember there being UTF or binary values there. Well, so a little while back, native Kubernetes resources were switched to a more binary protocol. So the native Kubernetes resources aren't stored in JSON. Only CRD resources should be stored as JSON in etcd. So it doesn't surprise me to see kind of funkiness from the result. Okay. And we got a comment from Duffy, who suggests that the running containers will probably still

18:51 Forcing Control Plane Restarts (Touching Static Manifests Again)

18:56 have an old copy of that DNS as well. Something for us to keep in mind if we just get further along. So it probably wouldn't hurt to go ahead and touch all of these and that should trigger the kubelet to pick those up. Okay. So we got etcd configured with the cheat sheet. We've confirmed that we can query etcd, so I'm not sure of that comment from Thomas about it's listening but not answering. It looked to me like it did answer. I wonder if we have an API server running. Like, should we check?
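The binary-looking output from the etcd query is expected: built-in Kubernetes resources are stored protobuf-encoded, while custom resources are stored as JSON. The sketch below is a hypothetical helper, not something run in the episode, that guesses the encoding of a raw etcd value. It relies on the four-byte prefix that Kubernetes puts in front of its protobuf-encoded objects; the function name and return labels are made up for illustration.

```python
import json

# Kubernetes prefixes protobuf-encoded stored objects with these four bytes.
K8S_PROTOBUF_MAGIC = b"k8s\x00"


def guess_storage_encoding(value: bytes) -> str:
    """Classify a raw etcd value as 'protobuf', 'json', or 'unknown'."""
    if value.startswith(K8S_PROTOBUF_MAGIC):
        return "protobuf"
    try:
        # Custom resources are stored as JSON documents.
        json.loads(value.decode("utf-8"))
        return "json"
    except ValueError:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
        return "unknown"
```

A value that starts with the prefix and then looks like noise in a terminal is therefore a healthy sign for a built-in resource, not evidence of corruption.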

19:31 API Server Back Online (kubectl get nodes Success)

19:38 Ta da. And yeah. So that's progress. So should we try running get nodes? Let's go ahead. Why not? Hey. So there we go. We only had the two worker nodes in Teleport, but we have three here. So it looks like one of those isn't accessible to Teleport for some reason, but it does appear at least superficially healthy here. Alright. Let's see if we can find any other symptoms. You wanna run a get pods on the kube-system namespace. See if things are happy, we'll try and schedule a workload and then maybe we're fixed. Oh, no. We got our CCM off. Hello.

20:16 Checking System Pods (CoreDNS, CCM Unhealthy/Pending)

20:28 That's happened a few times now. I don't know if that Thomas, let us know if that's intentional or something else. So this could be related to the DNS configuration. So I wouldn't get too worried about that yet, but we can just look at the logs and see what we get there. Not longs. So we got a question from William in the chat. What did the touch * command cause? So, apparently, if we touch the files in the static manifest directory, the kubelet will notice the change and then restart those pods for us. Is that correct?

21:00 Explanation of Static Pod Restarts via Touch

21:17 Yes. Exactly. So what it does is, basically, the kubelet looks in the static pod directory, and whenever it notices any change, it tries to reload it. And if things are running healthy after it reconciles, you should see a mirror pod for it, which are the pod resources that we see here. But it's the kubelet that actually controls anything defined in the static manifests. Cool. Perfect. So our DNS is still broken in the CCM. Should we just start restarting stuff now that we've fixed the DNS configuration on the host? I think so. So the cloud controller manager is a deployment.
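What the touch trick does on disk can be sketched as follows. This is a minimal illustration, assuming a directory of *.yaml manifests; the function name is hypothetical. It only bumps file timestamps, as touch does, and leaves file contents alone. Whether the kubelet then restarts a given pod is taken from the discussion above and is not verified by the code.

```python
import os
from pathlib import Path
from typing import List


def touch_manifests(manifest_dir: str) -> List[str]:
    """Update the timestamps of every *.yaml file in manifest_dir, like `touch *.yaml`."""
    touched = []
    for path in sorted(Path(manifest_dir).glob("*.yaml")):
        # Passing None sets both access and modification time to now.
        os.utime(path, None)
        touched.append(path.name)
    return touched
```

On a kubeadm-style control plane the directory would typically be /etc/kubernetes/manifests. Checking the existing timestamps first, as done earlier in the session, preserves evidence of which files were modified recently.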

21:50 Restarting Cloud Controller Manager Deployment

22:03 So what we can do is basically do a rollout of the deployment to kinda restart things. Yes. That's such a nice way of doing it. I tend to just kill all the pods. kubectl delete -l, label selector, off you go. Normally, I have tab completion and because I don't set it up very much, I'm going to look at how to re-enable it here. We could set up that. I think you can do kubectl completion yeah. And then source that into the completion and then bash. And then you can source that. It's a

22:44 little big sub shell. Yeah. Cool. Oh, you'll the sub shell will be the source. Yeah. We should be okay with just the sub shell, I think. Will that work? Yeah. No. Or do we have to do a source and then dash for standard and or redirect it in? Oh, I feel like I'm failing basic Linux knowledge test here. Yeah. Me too. Yeah. Try that. Okay. I've got the I've got the docs. I've got the docs. Kubernetes completion source bash. Yeah. There's the docs. Oh, you have to do echo source. Yeah. You want me just to copy this?

23:46 Oh, sorry. Oh, we got it. There we go. We got it twice in there. Alright. Thank you. I see Duffy in the chat with the help. Oh, yeah. There's a while it's saying no sub shell, and then Duffy got the command as well. Cool. Alright. Is it working now? Yeah. I gotta remember kube-system. Something weird with the control codes that they are all complete only goes to you, I think. Need to send that to the Teleport team. We want that fixed. Well, CoreDNS isn't very happy either. So before we continue with the controller manager, we

24:48 Restarting CoreDNS Deployment

24:55 or the cloud controller manager, we probably wanna take a look at DNS. kube-system. Okay. That wasn't very helpful. No. It was not. Is it running? Can you see it on ps? Like, I don't know if it's just not ready. Is it and it's running but failing? Like, I'm not really sure what session it's in at the moment. Let's do a rollout of CoreDNS because it might actually be affected by the DNS issue. Okay. And it's been a while since I gotta give it a restart. That's why. And I gotta give it Oh, it'll be

26:22 And I forgot the namespace. And we still don't have anything there. So it's running. It's just not passing its probe, which means the service won't be adding the endpoints. Do we wanna check the probes? Let's take a look at those logs one more time. I don't think we have enough time to go through all those logs. Yeah. That might take a while. One of the things that I didn't think about is is we fixed the local DNS issue, but you can also set DNS policy on the deployments as well. So should we take a look at the

27:20 Inspecting CoreDNS Manifest (Probes Config)

27:32 manifest for CoreDNS here? See what we're doing with it? I think so. So, here we have this. Well, I need to scroll up because we're Yeah. I just sat back letting you do all the work, but that's not gonna work now. Okay. Let's see. I'm always really bad for going through this too quickly and not actually paying attention. So I'm gonna focus this thing. So the liveness probe is set for /health on port 8080. I don't know if that's correct. We might need to check that. And there's a /ready. Oh, the readiness

28:26 probe is 8181. I don't think those will be on different port numbers. So should we you should maybe try and check that, and I'm just gonna scan for anything else. I think everything else okay. So I'm gonna assume that one of those probes is on the wrong port, which is causing it not to add the endpoints to the service. That is definitely a possibility. And my default here is basically to look at the code because that's how I've learned quite a bit about troubleshooting clusters, especially in the case of lack of documentation in the early days. So

29:23 what I'm gonna do is I'm gonna look into the manifest code that we actually deploy from kubeadm. Well, I've pulled up a random manifest from the Internet or from the CoreDNS organization though, and those ports are correct. They do run health and readiness on different ports. Interesting. That is true. Code never lies. You just gotta make sure that you're looking at the right code sometimes. The other thing that we have too is we have the DNS policy Default, which Default is not actually the default DNS policy, if I remember correctly. Where did you see that? Sorry.

30:05 Inspecting CoreDNS DNS Policy (Default)

30:19 In the Oh, here. Yeah. Deployment config. Yep. So what does the Default DNS policy mean here? I have to look. It's been a while, and I remember it being confusing. So what I'm doing is I basically just googled for documentation around it. See if I can see Default is not the default DNS policy. If DNS policy is not explicitly configured, then ClusterFirst is used. And you also have ClusterFirstWithHostNet when you need something with host network set to true. CoreDNS doesn't use host network, so that's not an issue.
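The naming confusion described above can be made concrete with a small sketch. It models, under the rules in the Kubernetes DNS documentation, which resolver configuration a pod ends up with for each dnsPolicy value. The function names and the "node" / "cluster" / "dnsConfig" labels are invented for illustration, and the pod spec is a plain dictionary, not a real API object.

```python
def effective_dns_policy(pod_spec: dict) -> str:
    """Return the dnsPolicy in effect; an unset policy means ClusterFirst, not Default."""
    return pod_spec.get("dnsPolicy") or "ClusterFirst"


def resolver_source(pod_spec: dict) -> str:
    """Return where the pod's DNS settings come from: 'node', 'cluster', or 'dnsConfig'."""
    policy = effective_dns_policy(pod_spec)
    host_network = bool(pod_spec.get("hostNetwork", False))
    if policy == "Default":
        # Inherit the node's resolver configuration (what CoreDNS itself uses).
        return "node"
    if policy == "ClusterFirst":
        # With hostNetwork: true, ClusterFirst falls back to the Default behaviour.
        return "node" if host_network else "cluster"
    if policy == "ClusterFirstWithHostNet":
        return "cluster"
    if policy == "None":
        # Settings must be supplied explicitly in the pod's dnsConfig.
        return "dnsConfig"
    raise ValueError(f"unknown dnsPolicy: {policy!r}")
```

For a CoreDNS pod with dnsPolicy set to Default, resolver_source returns "node", which is why a broken /etc/resolv.conf on the host carries straight through to CoreDNS.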

31:11 Okay. So the Default policy uses the node DNS configuration, which makes sense for CoreDNS. Right? That probably is correct since CoreDNS is providing the in-cluster DNS. Quite possibly. And I'm going into the kubeadm code just to verify at this point because that's just what I do. Yep. And I'm looking at the deployment on the CoreDNS organization and they do have a DNS policy of Default. So I'm inclined to think that's maybe alright. I'm gonna type stuff just because I'm not sure what else to do. So if we don't get services,

31:58 we've got kube-dns. I'm assuming if I describe that, we're just gonna see no endpoints. So Yeah. Okay. So we need to work out why Oh, I'm gonna alias that in a second. Alias k equals. Oh, okay. Weird stuff going on with our CCM as well, but we'll come back to that later. Yeah. One is evicted, the other one is pending. But I k. CoreDNS. Where can we get more logs from Core I don't think there's another source of logs for CoreDNS outside of the pod logs. Outside of checking the manifest, we can also

32:44 Checking CoreDNS ConfigMap (Corefile)

32:46 check the config map as well because it gets its configuration from the config map. Nice. Had we not run get pods yet? That's We did not. Alright. So what I want you At least not in the default namespace. Yeah. What I want you to do is jump inside one of these and actually try and use the dig command against CoreDNS to see if we're getting any responses. But we can't do that from anything here. Okay. Do we have anything running? Yeah. We must have some stuff running. Alright. I'm pretty sure CoreDNS is distroless or

33:32 scratch, so I can't actually get inside of that. Are any of these other ones gonna be able can I exec in any of these? Do you know off the top of your head? Not off the top of my head. I don't think etcd is, but I could be wrong there as well. And I finally got to the manifest for CoreDNS. I got into one of the kube-proxies, but we've got the same DNS problem that we have there. And I'm kinda worried right now that if we try restarting any of these ones that are running,

34:15 they're probably not coming back based on what I see in the default namespace on our CCM. So what are you thinking just now? So I'm still looking and just comparing against the manifest that's laid down by kubeadm. And so far it looks like everything lines up pretty well. I guess, could we port forward to CoreDNS and use that locally to see if we're getting a response. Is that worth doing just now, or is there something else you wanna chase down? Let's go ahead and do that outside of well, actually, I wanna double check the Corefile real

35:01 Identifying Potential ConfigMap Modification ("lame duck")

35:07 quick just because So is that mounted then as a config map? Yep. It is. It doesn't look like that's been modified though because the age is Yeah. But I don't think the age shows you updates to it. I think we ran into that before. I'm looking forward to all these clusters being 1.21 and never seeing managed fields again. Right. So we got the Corefile, port 53, errors, health. So I put But it looks like that health has been modified. Oh, did you lose your session? I did not. I'm not sure if something destructive has just

35:58 Teleport Session Interruption (Resource Exhaustion?)

36:11 happened, but there's no active sessions, and I cannot start a shell. Is your is your session still good? Yeah. I'm still connected here. You need to run double check. Run a command? That's funny. It's it's also still showing no active sessions. Yeah. No. I can still run a command and everything. So this is But we can go ahead and rejoin. Well, I guess we can't because No. I can't. Do you wanna try starting a new one and I can maybe join that? Oh, I I don't know. Oh, alright. I'm seeing an active session now, it

37:06 seems. And that says it's only a few seconds ago. Oh. So Thomas is gonna fix that. Apparently, the resource exhaustion went too far. So something they've configured in our cluster. And, yep, there there have been disconnected as well. So yeah. I guess we'll wait a moment. I think there's been a proper proper breakage happened on this machine. I didn't even think to check what was going on. Although, I wonder if that recursive honk pod we've seen. I wonder if there was something moving on there. Maybe I should have checked the CPU utilization or So I'll cover some of the rules.

38:14 Carlos has asked seem to be able to connect to Teleport at all now. Carlos has asked if there's cron eating resources. Yeah. One of the things I said I won't do on this show is is look at the cron so people could use it in creative ways. And so I'm not sure what's going on there, But maybe we can Maybe Thomas will maybe get us back in. Oh, it's it's it's gone. Yeah. Alright. So Thomas says problem with SSH is not being able to share the session. Alright. Let me pop open a terminal. IP address.

38:47 SSH to Cluster 9 Fails

39:19 No. Right. So I'm gonna just quickly debug what's going on from a normal terminal and see if I can get Teleport working again. Let's see if we can deal with that. Oh, SSH is not working. Thomas is just laughing there. Alright. Let's move on to the second cluster just now, and then we'll try and come back to this one. If I can't SSH in from the terminal, I don't think there's much we can do right now. So will we jump over to Cluster 10? Yes. I'm working on trying to re-log in to the Teleport session there

40:00 Initial Access Issues on Cluster 10 (Teleport Hangs)

40:05 right now though. Okay. This is. This one looks okay. Oh, I've got a terminal open. I'll cover the basics just now if you wanna just take a minute to get in. I'm pretty sure I'm not getting my victory beer today. I have a nice rhubarb custard server. There we go. Show people. Yeah. Thomas, if you can SSH in, try and get us back in the game, please. Otherwise, destroyed it. Which is very funny. I guess we took too long to fix it. That's funny. Thomas does that. I guess it's my turn to debug what the hell is going on.

40:55 Direct SSH to Cluster 10 Control Plane

40:55 Alright. So alias k equals kubectl, export KUBECONFIG equals our control plane. Get nodes. Okay. We have no API server. Go figure standard practice now. Did you manage to join the session at all? Or are you still having a little bit of trouble there? I am joining it at this moment. Alright. Cool. So I'm gonna do some basic checks. I'm gonna look for the API server. No, but I do love the watch cat command, so I'm gonna take a w w on this so we can see the build. Why did that not show me the full

41:10 Cluster 10: API Server Down, Kubelet Running

41:36 thing? Oh, this must be a Teleport bug. Oh, and the watch has disappeared now. Oh, I saw it though. It's coming back. Oh, right. Okay. I missed my opportunity to see what that command was. Okay. We have no API server. What else do we need to check for? Kubelet. Okay. So we do have a running kubelet. Let's do the same thing. Kubernetes manifests. We have files that have been modified and we have what loosely I even remember what uses tilde backup files, nano maybe. So now we know this cluster is broken by Lewis Denham-Parry, organizer of CloudNative

41:45 Cluster 10: Static Manifest Backups Found (~ Files)

42:28 Wales, and a little assist from the ControlPlane team. So I'm looking forward to fixing this one. Thomas, we will come back to your cluster in a minute. Thomas has said that Teleport is back online. I'm actually wondering if they made this one a little easy for us by those tilde files actually being the originals, looking at the timestamps and the size differences. Alright. Will we try to fix this one quickly, or do you wanna go back to nine while it's still fresh in your head? I'll let you decide. We can go

42:41 Returning to Cluster 9 (Teleport Restored)

43:04 ahead and jump back to nine and then... Right. Okay. Come back to this one. We haven't gotten too far in, but it looks like I'm gonna have to try to log back into the Teleport session there too. And do I have a session? I do have a session. Thank you, Thomas. So kindly shared the error with us as well. Runtime error, invalid memory address or nil pointer. Yeah. You definitely broke it. Okay. So it's only been like two minutes and already I have no idea what I was doing with this cluster. So we got the API server back online, core

43:15 Cluster 9 Continued: API Server Down Again, Checking System Load

43:46 DNS wasn't running. There was all sorts of crazy nonsense going on in... I've lost my alias... the default namespace. They say API server is down again. Oh no, I've not got the export. Okay. You've joined, which means we have that weird reload bug. Okay. So should we just start... I don't know that all these things have been fixed. This should have been indicative of resource exhaustion. Things are being moved off of this thing. That makes sense? So like, we have top. The load average isn't exactly revealing too much there. Let's check the memory. Oh, the memory is

44:38 there too. So we don't have a lot used now, but... Yeah. We don't know what was happening ten minutes ago, dude. Exactly. The other thing we can do is, so far, I think we've only looked at the default namespace, and we've only looked at the kube-system namespace. So we can also look to see if we have pods in other namespaces that may be taking up resources. Okay. So we wanna do an all namespaces. That's a bit funny. Let's see if I can maybe just make that a bit smaller. Yeah. That helps. Okay. So Cilium is running,

44:51 Checking All Namespaces: Evicted Pods (Honk, Rawkode)

45:14 which I like. Hubble's not. I don't know what's going on with the honks. But they're all evicted anyway, so probably not a problem. I'm saying that with a bit of hope rather than any knowledge. Our WordPress application is not working. CoreDNS is running, but not healthy. etcd is okay. API server is okay. In fact, the control plane besides CoreDNS appears to be alright. So let's work out this. Oh, no. Rook is dead in the water too. Yep. And I do like Thomas' suggestion of looking at the eviction reasons, so we can probably do a describe on

45:51 Checking Pod Eviction Reasons (WordPress Volume, Honk iNodes)

45:58 one of those pods and get the reason for the eviction, or events as well. Do you wanna take the controls and destroy it? Yeah. So we know the honk ones are evicted. But I don't know if there's going to be any type of a red herring there. So... You don't trust the... one of the WordPress ones. K. Question for mister Rawkode. Is this the first or second cluster? It's the first. Yes. We are back on the first cluster. So, unable to attach or mount volumes. So which pod was this you described? Sorry.
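The eviction-reason check discussed here can also be done in bulk rather than one describe at a time. What follows is a minimal sketch, assuming the pod list has already been exported as JSON (for example from `kubectl get pods -A -o json`) and that evicted pods carry `status.reason` set to `Evicted` with a human-readable `status.message`. The sample data is made up for illustration and is not taken from the stream.

```python
from typing import Dict, Iterable


def eviction_reasons(pods: Iterable[dict]) -> Dict[str, str]:
    """Map "namespace/name" to the eviction message for every evicted pod."""
    reasons: Dict[str, str] = {}
    for pod in pods:
        status = pod.get("status", {})
        if status.get("reason") != "Evicted":
            continue  # running, pending, or failed for some other reason
        meta = pod.get("metadata", {})
        key = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
        reasons[key] = status.get("message", "")
    return reasons


# Hypothetical sample shaped like the "items" list of a pod listing.
SAMPLE_PODS = [
    {
        "metadata": {"namespace": "default", "name": "honk-1"},
        "status": {
            "phase": "Failed",
            "reason": "Evicted",
            "message": "The node was low on resource: inodes.",
        },
    },
    {
        "metadata": {"namespace": "kube-system", "name": "etcd-node"},
        "status": {"phase": "Running"},
    },
]

if __name__ == "__main__":
    for pod_name, message in eviction_reasons(SAMPLE_PODS).items():
        print(pod_name, "->", message)
```

Grouping the messages this way makes a shared cause, such as every pod being evicted for the same resource, visible at a glance.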

46:51 This is the WordPress pod. Oh, yeah. Those volumes won't be able to mount because Rook is evicted as well. Yep. So why don't we take a look at... I think you probably should describe a honk. Let's just see what his talent is. Yeah. Let's. So failed to pull and unpack image. Resolve failed. Oh, yeah. That's using that old DNS entry 192168587. So that was the one that we fixed on the machine itself. But also, there's another issue up here. The node was low on resources, on inodes. Ah.

47:50 Identifying iNode Exhaustion Source (/tmp Directory Size)

47:53 Okay. So the resource exhaustion was inodes rather than memory or CPU. So there's a couple of things. The system could be configured with a low number. But just to double check that, you know, or it could just be exhaustion on the actual file system device. So we can look at... df has a flag to show inode consumption. Right? If it does, then I just learned it today. Yeah. Because I've had this a few times. Yeah. Dash i. Yeah. % on a few things here. Root file system. Now I don't know if that's gonna be

48:43 normal for an overlay root file system to show a % inode, but oh, and there's our root file system actually. It's almost run out of inodes as well. And I wonder if that's... So... Sorry. I'm gonna go. Yeah. So that would tell me there's a process running somewhere opening files. Yeah. Would you agree with that? Yes. Alright. So we could use lsof to see if we see anything? We can, but that's probably gonna be pretty noisy as well because we are running a system that's running containers. Okay.
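The inode figures that `df -i` reports come from the file system's own counters, and the same numbers can be read programmatically. This is a small sketch using Python's `os.statvfs`, which is available on Unix-like systems. The helper names are mine, chosen for illustration.

```python
import os
from typing import Tuple


def inode_usage_percent(total: int, free: int) -> float:
    """Percentage of inodes in use, given total and free inode counts."""
    if total == 0:
        # Some file systems report no fixed inode table; treat as 0% used.
        return 0.0
    return 100.0 * (total - free) / total


def inode_report(path: str = "/") -> Tuple[int, int, float]:
    """Return (total inodes, free inodes, percent used) for the file system holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree, inode_usage_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    total, free, used = inode_report("/")
    print(f"/: {total} inodes, {free} free, {used:.1f}% used")
```

A file system can show plenty of free disk space and still sit at 100% here, which is the situation the eviction message pointed to.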

49:29 What's plan B? So I see Thomas is suggesting a df command there. Dash l root. You can go ahead. I believe so. I'm having trouble reading that small. But... And, yes, inodes are not equal to open files. But they are part of the problem. Right? I'm like, you're either writing files, opening files... oh, maybe I'm failing miserably there. Well, I mean, the files could already be closed too. But yeah. Let's go ahead and do... well, we can do an ls on various directories. We could try the lsof

50:23 approach. Okay. So we see nothing in the root file system. So, you know, it would have been helpful if we'd seen 65,000 files sitting there. But... Exactly. Okay. What do you think that packer file is? Should we just use the metadata? Like, if we do a dash l, see any timestamps that are recent. No. I mean, I'm tempted just to reach for lsof because I feel like I'm swinging though for any idea. So I know if we were looking at actual file sizes, I would use du. I don't know if there's a flag in du that would give us something

51:23 to look at inode usage. I'm not aware of anything, but we could do ctime. I don't know the find syntax that well, but I'm sure we can do like five days. Create time within five days. Is that right? Think so. Fancy time. There is a way to do it. Google is my friend. So he's given us a hint that it was sitting right there in... Oh. In the root. And du dash i. I forgot how to Linux. Right. du dash d one. Oh, yeah. Dash h. Depth one and that will show me the

51:35 Confirming /tmp Issue (ls Hangs)

52:31 file sizes, but not necessarily the inodes. I think I caught something about the temp directory there. With this? Yep. Oh, yeah. So you want to take a look in temp? I mean, four hundred and seventy five megs does not look... because that ls dash l is taking time. If you wanna consume a lot of inodes and not consume a lot of disk space, you're gonna write a lot of really small files. Yeah. Most of the commands I would use are taking a very long time here. Alright. Let's jump to the comments for help.
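Part of why a plain `ls` stalls on a directory like this is that it reads every entry and sorts them before printing anything, and `ls -l` also stats each one. Streaming the entries avoids both costs. This is a minimal sketch with `os.scandir`, with a cap so it always returns; the function name and the default limit are illustrative choices, not something from the stream.

```python
import os
from itertools import islice


def count_entries(path: str, limit: int = 1_000_000) -> int:
    """Count directory entries lazily, without sorting or stat-ing them.

    Stops after `limit` entries so the call returns even on a directory
    holding many millions of tiny files.
    """
    with os.scandir(path) as entries:
        return sum(1 for _ in islice(entries, limit))


if __name__ == "__main__":
    # If the result equals the limit, the directory holds at least that many entries.
    print(count_entries("/tmp", limit=100_000))
```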

53:35 What have we got? Is that a dot dot dot file? Yes. But we can't run ls in this directory. At least not in... I'm not sure if that's ever gonna finish. And that's a good point actually from Thomas. Four hundred and seventy five meg is huge for a directory. I mean, that's right. Normally, I would always expect a directory just to be 4K. So I don't even know what would cause that. What causes that directory to be bigger? That would tell me it's still a directory or it's

54:21 something else? In fact, maybe is there something on... Well, the directory entry has a listing of all the pointers to all the inodes of the files that are in there, doesn't it? Right. Okay. So Thomas has suggested an echo star. Nice. Yeah. I'm almost kinda partial to taking the rm approach. Oh, the echo star has been a terrible guess itself. We should check the inode exhaustion before things get out of hand again. Assuming I ever get the ability to write to this terminal. Yeah. I'm gonna see if I can open a

55:00 Deleting Excessive Files in /tmp (Using find)

55:24 new one. Let's see. df dash i. And it still gets 35,000 inodes, which is what was free previously. So I'm assuming that the process in question has potentially been halted. Do you wanna just attempt deleting it? Is that where we're at? Yeah. Let's go ahead and... I like that Thomas is taking pity on us and giving us a find command to help clean it out. That's assuming it can read enough off the disk to make it work too. Fingers crossed. Guess we sit back and wait. I have a feeling that there's a potential that those Honk pods may

56:33 be contributing to this. So even after we clean this out, we may see it come back. Okay. We've also had a comment saying du can show inode exhaustion, dash dash inodes. That's good to know. We can try that in a minute. Thomas not filling me with confidence after giving us the command to help fix it, as he said, uh-oh. Just another day. Let's do it. Okay. Terminal's back. So let's try that flag that was suggested to us. Alright. The fact that that's printing out no such file, I think means the files were

57:13 being deleted as we ran the du, which is good. So hopefully that same command is working. And the syntax or the style of naming in that temp makes me think that there's a process running mktemp on a loop. That's the default file name, I believe. So we could just run... Oh, look at that. We've got over a million inodes now and 89% usage. So things are getting better. Yes. Now, I guess our concern is those Honk pods. Are they gonna come back? I like the comments right now because he's suggesting that we go check out

57:50 DNS Issue Returns Again (/etc/resolv.conf Modified)

57:55 DNS while find is running. If you want to take over. So that also tells me that this is likely to refill up on us again. Ah, that's a good point. Like, if our CoreDNS deployment looked okay and the probes were right, there's a reason it's not responding to traffic, and it's probably that it's not CoreDNS. Is that potentially...? The image could be swapped out. Yeah. We can go ahead and take a look at that. Oh, we're getting slow again. So looking here, we're looking for containers, image. Yep. So that looks like an official image. I'm

58:58 Identifying and Removing the DNS Cron Job

59:01 not sure about the version. I think we should change the pull policy to Always, just because we don't know how sneaky... Yeah. Should we change that? Nice. Yeah. We can go ahead and change that to Always. So kubectl edit deployment, kube-system, CoreDNS. Because that IfNotPresent means that Thomas could have built their own image on the machine running mktemp. And just for fun, I'm also gonna double check the... so it looks like the latest version of CoreDNS is one dot eight dot three. So I don't think one dot seven

1:00:03 seems unreasonable. Yeah. And a good point from Wily as well is we haven't even looked at the CoreDNS ConfigMap yet. We got very much sidetracked. Yeah. I had started looking at that, and I did notice that there was the start of a diff there. So looking through, and I'm comparing this basically to the kubeadm code right now. So we have the ConfigMap, Corefile, port 53, errors. There's a difference in health. So it's lameduck, five seconds. No comma there. We're also missing ready, I believe. Kubernetes cluster dot local.
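The comparison being read out here, the live Corefile against the kubeadm default, can be sketched as a plugin-list diff. This is a rough illustration only: `EXPECTED_PLUGINS` is my recollection of the plugins in the Corefile that kubeadm generates and should be treated as an assumption, the parser handles just the simple brace layout shown, and the broken sample is made up rather than copied from the cluster.

```python
from typing import List

# Assumption: plugin order as recalled from kubeadm's generated Corefile.
EXPECTED_PLUGINS = [
    "errors", "health", "ready", "kubernetes", "prometheus",
    "forward", "cache", "loop", "reload", "loadbalance",
]


def top_level_plugins(corefile: str) -> List[str]:
    """Return the plugin names declared directly inside the server block."""
    depth = 0
    plugins: List[str] = []
    for raw in corefile.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if depth == 1 and line != "}":
            plugins.append(line.split()[0])
        depth += line.count("{") - line.count("}")
    return plugins


def missing_plugins(corefile: str) -> List[str]:
    """Expected plugins that the given Corefile does not declare."""
    found = set(top_level_plugins(corefile))
    return [p for p in EXPECTED_PLUGINS if p not in found]


# Hypothetical tampered Corefile with the `ready` plugin removed.
BROKEN_COREFILE = """\
.:53 {
    errors
    health {
       lameduck 5s
    }
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
"""

if __name__ == "__main__":
    print("missing:", missing_plugins(BROKEN_COREFILE))
```

A name-level diff like this catches a removed plugin such as `ready`, but not a changed argument inside a block (the `lameduck` difference mentioned above), so it complements reading the file rather than replacing it.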

1:01:23 That looks the same. Pods insecure, fallthrough, TTL of 30. And all of this, like I said, I'm just pulling straight out of the code. Cache 30, loop, reload, and loadbalance. And maybe we'll get a working CoreDNS again. And I'm gonna go ahead and do the rollout because, generally, updating a ConfigMap does not cause a deployment to roll out an update. So at this point, they're still not running, but that could just be... We don't know how long the probes need to be healthy before. Exactly. And the container's not even created yet. Okay. So

1:03:00 CoreDNS Pod Creation Failure (DNS, Image Pull Again)

1:03:00 little bit of patience to be exercised here. I guess with that pull policy of Always, it may just be pulling down the real image now. I didn't want the grep in there. So looks like it's failing to create the pod sandbox. Failed to pull and unpack the pause image. Indeed. Is this still DNS? Oh, connection refused. Trying to look up k8s.gcr.io on local port 53. So is containerd responsible for pulling the images? So I believe it's a responsibility of the kubelet to interact with the CRI to pull the necessary images. So we're seeing

1:04:31 that we can go ahead and check the containerd configuration. Yeah. I'm wondering if there's anything funny. I don't know. Touching it, draws. And Thomas has suggested we take a look at our resolv.conf again. Although, I'm sure we fixed it. Oh, no. Is that a just in time hack? I don't know what you're saying up there, Thomas. Alright. Well, let's get that 8.8.8.8 back in there and get CoreDNS running. I like how it modified it differently this time. See, that's what I don't like about it. You wait to see if it's gonna change

1:04:40 Host DNS Reverted Again (Despite Cron Removal)

1:05:28 again? Hasn't yet, at least. Let's... yeah. There's something in cron. Okay. So since we have permission to look in cron, even though I said that I won't look in cron, we can check out what's going on there. So resolver, I guess. Either resolver. Nice. They're actually running every minute, so it's probably already goosed again. And I'm not overly familiar with running Ubuntu as a host OS to know whether those other cron jobs in there are... Let's cat them. Let's see what's going on. I mean, the e2scrub did not sound particularly healthy. That doesn't seem

1:06:40 Re-fixing Host DNS Resolution

1:06:59 I don't know what e2scrub is. Some sort of LVM thing. Yeah. We can maybe just ignore it. Do you wanna fix our resolv.conf again and let's check CoreDNS? Because I'm assuming that we maybe got caught by it. Yeah. There we go. Before we remove that. Alright. The rest are normal. Thomas has confirmed. And then politely followed up with don't sweat it. Thanks. Alright. Let's see. Come on, CoreDNS. Let's see if you're happy. Oh, ImagePullBackOff. So that'll be that pull back-off. So it could be a modified tag version because not all CoreDNS versions

1:07:59 Identifying CoreDNS ImagePullPolicy Issue (IfNotPresent)

1:08:04 are tagged with the Kubernetes release artifacts. Or it could just be that... Nearing the finish line. Yeah. So why don't we just try and pull that manually? I guess I can always just pull it from my local machine. Or maybe we're just being impatient and the back-off happened during the DNS break and... oh, there we go. Let's pull it up. Okay. It's running and healthy. Perfect. So we have CoreDNS at this point. We have etcd. We have the API server, and the controller manager and scheduler are at least running from a pod level, not necessarily

1:08:39 CoreDNS Healthy, Control Plane Stabilizes

1:08:55 beyond that. But since CoreDNS is rescheduled, I assume that the scheduler is actually functioning here. Yep. So that should bring other containers online. Now we still have the evicted Packet cloud controller manager, but that could very well, you know, be residual here. So we can go ahead and restart that. Shall we just grab one of the Honk pods before they start coming back? Just one, this is called honk. And what's it trying to run? BusyBox, while true, sleep, done. Yeah. I don't think they're actually doing anything, which is good. For fun, let's do a get all and

1:10:06 Confirming CoreDNS ConfigMap Modification (Health Check)

1:10:07 see thing here. The pod just seemed to be sitting there. Let's take a look at the describe again and see if there's anything indicating... Yeah. What created them. Still using that 182168587 DNS setting. You wanna try a rollout on our WordPress deployment and see if it's healthy? Yeah. We can go ahead and... well, first, we probably need to check the status of our storage. So that would be in rook-ceph. Okay. So one of the operators has come back. One of the managers has also come back. I mean, I would just take

1:11:13 the sledgehammer approach here and do a delete pod dash dash all, namespace rook-ceph, just to force all that to come right back up. Yeah. I got you to do one delete all, which means that I consider that a victorious stream. And hopefully, that gets our storage system back online, then WordPress can come back online. And I think that's it. Fingers crossed. Let's just see if we've got that coming back here. Oh, that was fun. I didn't even... Oh, you control C'd, that's why. I did. Get pods dash dash watch. And the ordering will be important for a

1:12:01 lot of these just because of the way that Rook Ceph works. I can see our provisioner is just coming on. Let's see what's running. Yeah. I think this is gonna take a minute or two. Yeah. Cool. Okay. So let's assume our storage system is gonna be online. A delete all has never hurt anybody. Yeah. Making a lot of assumptions now though, so let's see. Hope is definitely a strategy when it comes to Klustered, I've found. The thing was... Yes. We still didn't remove the honk pods. I suppose we can go ahead and do that. They seemed harmless.

1:12:58 Identifying Node Taints (Disk/Memory/Network Pressure)

1:13:13 At least the first one we looked at was just running a sleep for a very long time. Wonder if he was kind enough to stick a common label on it that we could delete them all. No. No labels. In fact, there was an inode message right on that. We should have described that a lot earlier. Oops. Want me to do some hacky magic to delete all the honks? Or we can do that. I'm also trying to keep an eye out to see if they start coming back too. Carlos, there was no common label and no owner ref. It was a straight up pod.
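With no common label and no owner reference, the honk pods cannot be selected with a label selector, so the remaining handle is the name. A small sketch of that filter over a JSON pod listing follows. The `honk` prefix comes from the stream; the sample pods and field values are invented for illustration.

```python
from typing import Iterable, List, Tuple


def bare_pods(pods: Iterable[dict], name_prefix: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for pods matching the prefix that have
    neither labels nor ownerReferences, i.e. pods created directly."""
    matches: List[Tuple[str, str]] = []
    for pod in pods:
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        if not name.startswith(name_prefix):
            continue
        if meta.get("labels") or meta.get("ownerReferences"):
            continue  # managed by a controller or selectable by label
        matches.append((meta.get("namespace", "default"), name))
    return matches


# Hypothetical listing: two bare honk pods and one Deployment-owned pod.
SAMPLE_PODS = [
    {"metadata": {"namespace": "default", "name": "honk-1"}},
    {"metadata": {"namespace": "default", "name": "honk-2", "labels": {}}},
    {
        "metadata": {
            "namespace": "default",
            "name": "wordpress-abc",
            "labels": {"app": "wordpress"},
            "ownerReferences": [{"kind": "ReplicaSet", "name": "wordpress"}],
        }
    },
]

if __name__ == "__main__":
    for namespace, name in bare_pods(SAMPLE_PODS, "honk"):
        print(f"kubectl delete pod -n {namespace} {name}")
```

Printing the delete commands instead of running them leaves a chance to review the list before anything is removed.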

1:14:31 Evil, I know. Oh, so we got a container created now in one of our WordPress. That's exciting. And one of the MySQL. Although Rook isn't particularly looking much healthier. Still sitting in pending on some of those mon and... Yeah. Instances. Well, maybe we need to describe one of those. See why it's pending. Let's grab the... oh, there's some taints on the nodes. So wide doesn't give it, but we can just do a describe on them. Yeah. Describe node and let it all... All flooders. Yeah. Locked through NoExecute. Let me just do an edit nodes and

1:15:50 Removing Node Taints

1:15:56 delete them? Yep. I wonder if it's funny. Yeah. Okay. Let's try. Oh, there we go. Oh, yeah. That's coming up much faster now. There we go. Okay. So... And maybe not too much movement quite there. I see a container creating there on... Yeah. Let's see if we're missing anything here. No. Right. Okay. So everything on the Ceph side is running. I'm gonna try and speed up this. Well, it looks like it's been creating for forty seven minutes now, so I only wanna delete that one. Yeah. So it's not getting the volume. PVCs, those seem okay. Yeah. We can

1:17:20 Debugging WordPress/MySQL Volume Attachment Issues

1:17:42 probably just encourage that along. Terminating. So we might wanna do a describe on that MySQL pod to see if we have any new information. The volume is previously attached. Was that just the terminating one now? Yes. Okay. So let's delete pod force grace period. Should be quicker than that. That is a good point. Is there an are we supposed to have an NGINX No. Out here? Yeah. Let's just Wait a minute. What was that one that you were just describing? That was nginx? The busy box one? I'm just gonna delete it. Get pods. I don't know why we're in a multi

1:19:00 Continued Volume Attachment Issues

1:19:30 attach thing here. I'm gonna try one more time to hurry up. Maybe we should... Is that still giving the same error? Yeah. It was still saying it was attached to another pod, which isn't scheduled anymore. But I wonder if that's just collateral damage. There we go. Alright. What's going on here? I think this is just a weird situation. Need to wait for things to... I think we've fixed this cluster. Will we move on to 10? You happy? I think I'd like to see that WordPress pod still up yet, because I'm not entirely sure there isn't another

1:20:41 curveball that he's thrown at us here. You just don't want me to have my beer. That's Riz. Alright. So the problem is, this week, a standard is attached to another pod. Thomas says he has no idea, so I don't think this is intended. But it's still a problem. So it seems to think it's attached to another pod. So it can't be attached to another. It's definitely not attached. Alright. No. The MySQL container is running. The PVC on the WordPress one isn't particularly important. I could delete it. Yeah. Oh, wrong one. K get PVC.

1:21:48 Deleting WordPress PVC (Acknowledging Potential Data Loss)

1:21:48 Delete PVC. That's the MySQL. Delete the WordPress one. And then if we encourage the pod to request a new one, hopefully that pending changes. Okay. So now... Oh, we deleted the claim. So now there's no claim for it to use. Yes. Yes. We are teaching bad practices at this point. Well, no. WordPress doesn't actually need a persistent volume. It's only for cache. I do not encourage anyone to ever delete any real... So yes and no. If it was an actual running WordPress instance, it would have deleted any of

1:22:55 WordPress Pod Recovers with New PVC

1:23:19 the file system assets that you upload. If you upload any media content, that doesn't go into the database. It goes onto the file system. Oops. That that's one of the reasons why I generally dislike the idea of WordPress as, like, an example app for systems like Kubernetes because it's really difficult to actually scale WordPress horizontally for that reason. You need to end up having shared storage that can be mounted on multiple places to be able to actually access that content. Thomas had no idea that we would help him make the cluster even more broken. You

1:24:04 underestimate me, Thomas. Alright. So at this point, we we likely need to recreate a PVC claim for the WordPress app. Okay. But I don't know exactly what that looked like. I have it on a disk. Export. Our PVC should be workload. Let's just pop this open. There. Yeah. So let's just apply that over the top. Jump back over here. We have our PVCs. I guess I maybe need to either be patient. Yeah. I was gonna say, just delete the pod, but I'll be patient. Yeah. It's just pulling the image now. Okay. That looks much better. Alright.

1:25:30 Cluster 9 Appears Fixed

1:25:35 So I never actually deleted any data. I only did the claim. The PV is still there. So that worked. It's running. You happy? I think so. I don't know that I've seen anything else to indicate anything else has been wrong at this point. And we do have some encouragement to jump over to Cluster 10. Alright. Let's see. Okay. Thank you, Thomas. That was painful. Okay. 9 go away. 9 go away. I don't even know if that's 10 anymore. Let's pop this open. And I'm going to go ahead and try to get logged back into Teleport.

1:26:10 Cluster 10 Access Issues Resume

1:26:38 Come on, Teleport. Timed out. Bad ass. What? That's not good. Pull up the IP address for Cluster 10 in case something has gone wrong. Was logged to this seeing a similar thing. Yeah. I was logged on to this earlier. Oh, it should be s. That's for ten. Alright. Let's see if we've got SSH access. And I will restart Teleport. There we go. Alright. Okay. So we have our cluster. I believe I already checked. Someone ran a watch. I should know that. KUBECONFIG, /etc/kubernetes/admin.conf, get nodes. Okay. So we don't have a working API server. Are you on the

1:27:58 Investigating Cluster 10: API Server Down, Netstat Shows Port Not Listening

1:28:21 session? Just let me confirm. Yes. I joined the session. Okay. So we can go ahead and look to see if we have the API server and probably etcd as well running at this point since we can't actually connect to the API server. Okay. etcd looks good. API server... well, I mean, it's running. Doesn't mean that it's good. Okay. I see a NodeRestriction plugin right away on the API server here. What does that admission plugin do? Do you know? I don't know off the top of my head, but we can look it up.

1:29:11 There's no restrictions that is dangerous. Okay. So this admission controller limits the Node and Pod objects that a kubelet can modify. Okay. So that's not a big deal. Alright. So let's go check out our manifest directory. We're assuming something has changed on our API server. It's running, but when we do... We should just export KUBECONFIG and stuff before I start typing things I don't wanna type. I think we might be on separate sessions at this point. What? Oh, no. I'm on my actual... sorry. Okay. I thought my terminal felt a little familiar there. Okay. So let's export KUBECONFIG.

1:30:16 And if we run alias k equals kubectl. So it's trying to speak to the IP address, which I'll assume for the time being is this machine, on 6443. API server does appear to be running. So if we check the manifest and make sure the port hasn't been touched, I guess is the first thing, and then look at other things. The other thing we can do is we probably can do a netstat and see what we're actually listening on too. Alright. Feel free to typey typey. It doesn't like me trying to get... alright. Now I see the kubelet, and I'm just

1:31:03 scrolling through here. I see etcd. I do not see the API server. So Yep. But I don't see the API server either. So that one that makes me wonder if we're not act if the manifest's been modified so that it's not actually running with host networking. Okay. So should we pop up in the manifest? Yeah. Did we just get Rickrolled? I think we did. I think we did. If I remember correctly from the last time we were on this cluster, there was a another file there appended with a tilde. But it's not the actual API server. It's

1:31:50 Identifying Empty etcd Manifest

1:32:00 not the API server. I wonder if one of the tildes is the other one, but no. That's actually a kube-controller-manager. What's that etcd one look like? Uh-oh. Oh, I think that's just empty. But it's not popped up in Vim. Or is it just my side that's frozen? No. It's... Guess I'm reloading. Oh, no. That session's no... That's gone. Nice. And that failed. Oh, it killed Teleport. Oh, god. I don't have this session either. So my normal SSH and my Teleport are now both dead. Oh, is it coming? I think Teleport's back

1:33:10 Access Restored, Symlink Target Identified (`/home/kube/manifests`)

1:33:15 at least. Yeah. Okay. So my SSH is there to Teleport that. Okay. So /etc/kubernetes/manifests is a symlink. They never got any. And then these... well, it's crashed again. There's 44 files in this directory and I did an ls. Gonna be one of those days, Jason. Alright. So we know the etcd manifest isn't happy either. And it's only 38 bytes. Okay. I don't know whether to reload it. Okay. So whatever we are doing is pausing. So that SSH session has come back. I'm assuming Teleport is gonna be right behind it. It's really peculiar. I don't know whether just to be patient

1:33:50 System Hangs Again (Recursive Mounts Confirmed?)

1:34:53 with it or prod or no. My SSH session is going to. That is a good point to see if it is overlaid. This is what happens when you get security people to mess with your cluster. Oh, and there's there's Ivan, a control plane. I don't know what you call yourselves. Control planer. Okay. So I think teleport's back at least. Is it? Let me oh, that session's gone. Okay. Let's start. Are you in a different session? I did not start a session. Alright. Okay. Okay. We have a new session. Let's be a bit more care I'm speaking to

1:35:00 Using `mount` to Confirm Recursive Mounts

1:35:46 myself there, not to you. Let's be a bit more careful and not trust anything that ControlPlane have done to this cluster. Screw you this. Alright. So I didn't like that this was a symlink, and was it no that suggested just removing the symlink. Correct? Yes. Oh, it froze on me again. You get two people to break a cluster and they both managed to break it so much that we can't even debug it. I mean... You know what? I'm cracking open this beer. I've had enough. Right. Take six. SSH still dead. I'll try a new session.

1:36:50 I think Teleport's hung up right now again. Alright. We're back. Does that mean the session's alive again? Yes. Okay. Yeah. Alright. So I'm assuming we've got a very fleeting amount of time to do something. So what do you wanna do? Before we do anything more, I wanna just see what's mounted on the system right now. Good call. I'll drink my beer. You do what you have to do. So Thomas, we do have serial access if needed. Yes. It just means going through my standard SSH tools instead of Teleport, which I try to avoid. Which session are you typing on?
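When the terminal only stays responsive for a few seconds at a time, summarising the mount table is quicker than scrolling through it. This is a minimal sketch that counts how often each mount point appears in text formatted like `/proc/mounts` (whitespace-separated fields, mount point second). The sample table and the `/mnt/example` path are hypothetical, not the paths from this cluster.

```python
from collections import Counter
from typing import Dict


def repeated_mount_points(mounts_text: str, threshold: int = 2) -> Dict[str, int]:
    """Return mount points that appear at least `threshold` times.

    Expects /proc/mounts-style lines: device, mount point, fs type, options, ...
    """
    counts: Counter = Counter()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return {mp: n for mp, n in counts.items() if n >= threshold}


# Made-up mount table with one directory bind-mounted over itself repeatedly.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
/dev/sda1 /mnt/example ext4 rw 0 0
/dev/sda1 /mnt/example ext4 rw 0 0
/dev/sda1 /mnt/example ext4 rw 0 0
proc /proc proc rw 0 0
"""

if __name__ == "__main__":
    with open("/proc/mounts") as f:  # Linux only
        print(repeated_mount_points(f.read()))
```

Stacked mounts on one path generally have to be unmounted once per layer, so the count gives a rough idea of how many unmount passes may be needed.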

1:37:10 Access Restored, Confirming Recursive Mounts

1:37:42 I thought it was the same one and I just typed mount to see what's going on now on the system. And I see quite a bit of rickrolling happening. Yes. We got a lot of... it's just like a recursive symlink thing going on here. Over the file system. Quite interesting. So we can go ahead and just try to start unmounting these. I'm not entirely sure that they won't come back, though. Well, maybe because... Alright. So what if we... be frozen again. Oh, we are frozen again. Oh, come on. How's your typing, Jason? This actually reminds me

1:38:55 System Hangs Again

1:38:57 of an issue that I've seen in production. Not quite identical, but I I once had to deal with an issue where we would get traffic in at the same time every day once a week, and we only had the window a small window to actually inspect what was happening. Because once we hit the threshold of the event, it would lock up our entire network. We'd have to restart our main routers, and then we didn't have any more traffic until the next week. So it took us a couple of months to figure out exactly what was going on

1:39:39 because we only had that, like, fifteen minute window one day a week to actually troubleshoot it. And, like, the limited time that we actually have on here is reminding me quite a bit of that. Take that control plan. And I wonder if he'll come back. I hope they've kept the scripts that set this up. I need to be I need to see what they did here. Because Lewis sent me a message saying, didn't really have a lot of time, so you might be able to fix it really, really quickly, but good luck. I don't know if that was a joke or

1:40:00 Access Restored, Unmount Still Failing

1:40:37 or not anymore. Hopefully, that's unmounted. I really hope it's working. I don't even know. How many were there? Hundreds? Thousands? I don't think there were hundreds. Alright. Okay. Millions. And Teleport's... Yeah. Don't think I... again. I don't think I unmounted everything because I grepped for never, so it should have been quite targeted. However, they're all busy, so this unmount has failed disastrously. Worth a shot, though. What you got, Jason? You're up. So we can go ahead and try to see if we can get in an lsof here to see what actually is... well, we know that the static manifests are in

1:41:32 there. So So knows that there were no RX to human there were pass through XRX. We should have been okay. So with the static manifests in there, I'm wondering if we might need to actually stop the kubelet before we can actually unmount these. Okay. So you wanna do a stop kubelet? Yep. And let's go ahead and give that another try. We could also try passing it the force flag as well. Well, I can still type, so the system hasn't locked up on us again yet. But if you wanna control C and try something, please feel free.

1:42:10 Unmount Attempt After Stopping Kubelet (Still Busy)

1:42:43 No. I think we should let it go out. Though I am thinking about opening up a second terminal. Looks like it's still busy, though. Yeah. Okay. That didn't work. We're hung again. No. That's okay. Yep. Are we? Okay. Oh. That was a most unfortunate double tap of the Enter or Return key. Interesting question from Carlos: I thought the task was to break Kubernetes, not the operating system. Well, in theory and in practice, they have broken Kubernetes through the operating system. While this may not be a production outage that you'd necessarily face, your clusters are still susceptible to wild geese. So

1:43:25 System Hangs Again

1:44:23 It's funny. I'd be laughing more if I wasn't crying on the inside. Yeah. We're kinda struggling with this one here. So on the other session that I opened up, the lsof did start returning. And I see there's at least a sleep going on. I see a process that looks to be rickrolling us as well. Should we start killing some wild processes? I have another session here as well. So we can just start seeing what we're looking at. Oh. I think we're gonna see a process that is rickrolling us as well. Oh, my sessions are

1:44:45 Using LSOF (from another terminal) to Identify Busy Processes

1:45:23 or, lsof has finished as well. Alright. So we do have the shell script here that is running. So I reckon we kill that. Yep. Are you okay with a kill -9, or do you want me to? I'd think we can safely kill this process ungracefully. Let's see if it comes back. Not immediately. Okay. Do we wanna take a look at it and see what they've done? Or just... I'm gonna go ahead and try to rerun that lsof in the other terminal. While over here, we can potentially just try the unmount again, but.
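
Finding out which processes keep the mounts busy, as done here with `lsof` in a second terminal, could look something like the sketch below. The `holders` function name and the `never` pattern are assumptions for illustration; the transcript does not show the exact invocation.

```shell
#!/usr/bin/env bash
# holders PATTERN
# Reads `lsof` output on stdin and prints "PID COMMAND" once for every
# process that has a file open whose path matches PATTERN.
# Relies on the default lsof column order: COMMAND PID USER ... NAME.
# Caveat: only the last whitespace-separated field of NAME is checked.
holders() {
  awk -v pat="$1" 'NR > 1 && $NF ~ pat { print $2, $1 }' | sort -u
}

# Possible usage on the node (not run here; the pattern is a guess):
#   lsof 2>/dev/null | holders never
#   kill -9 <pid>     # the ungraceful kill discussed above
```

Keeping the filter separate from the `kill` leaves room to inspect the list first, which matters when some of the matching shells are your own sessions.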

1:46:10 Unmount Attempt After Killing Script (Still Busy, Faster)

1:46:33 Yeah. Okay. Let's try the unmount again. Nope. Still busy. Still busy. But that returned quicker this time, so that's something. Okay. So we got some sleep processes running. Never has not come back, though. So the sleep processes could potentially just be pause containers as well. Yep. And it's hard to tell how many of those random bash sessions are actually just us. I think this looks more sensible now. But we do still have the big crazy weird mount thing. And it looks like that process has come back. I've never... oh, I don't see it. No. Oh,

1:47:20 Identifying Remaining Busy Processes (Shells in Mount Path)

1:47:53 okay. Never mind. But we do have some shell somewhere that is sitting in that .never directory. Okay. So what do you wanna try? So I've got ten minutes before my next call. Let's see if we can get this done. So there's a total of four shell sessions open. I'm assuming that's our IPs. I would expect so. I think I'm the 8082. So I think we can probably kill the early ones here. So 1824. I can get it. Having all sorts of problems with... my Teleport gets a little bit... Oh. Did I?

1:49:26 Start a new session. I think I got a little confused there. Okay. So I think we need to unmount that stuff. Okay. So we still don't have any API server. So let's try and get this worked out. So those shells are still there. Okay. Let's kill them. Alright. Those are gone. We've now just got this one shell, which I think you might be on or maybe not. I'm not sure. Let's... why don't we just try... So I think we're gonna have to remove that symlink. Potentially, I have to remove that symlink too.

1:50:40 Removing the Recursive Mount Symlink

1:50:50 But we should still be able to unmount it and it'd just be a broken symlink. It's stalled on me again. Still seems to be going. Okay. So now we don't have a manifests directory. Alright. We probably still need to unmount those. So kubernetes isn't a symlink or a mount or anything. So how do we configure the static manifest directory? Could that be put in somewhere else? Can we take a look at the kubelet conf? We can, but at this point, the kubelet should still be stopped right now. But we did have an API server and a... oh, yeah. We stopped the... Okay.
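
On the question of where the static manifest directory is configured: on kubeadm-provisioned nodes, `/etc/kubernetes/kubelet.conf` is the kubelet's client kubeconfig, while the `staticPodPath` setting lives in the kubelet configuration file, `/var/lib/kubelet/config.yaml` by default. A small sketch for reading that setting follows; the `static_pod_path` function name is made up, and the default file location is an assumption about how this cluster was built.

```shell
#!/usr/bin/env bash
# static_pod_path [FILE]
# Prints the value of the staticPodPath key from a kubelet configuration
# file. FILE defaults to the usual kubeadm location, which is an assumption
# about this particular cluster.
static_pod_path() {
  awk '$1 == "staticPodPath:" { print $2 }' "${1:-/var/lib/kubelet/config.yaml}"
}

# Possible usage on the node (not run here):
#   static_pod_path
```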

1:51:58 Locating Static Pod Path Configuration

1:52:10 So we don't have anything right now. Okay. Oh, no. Well, we'll still have any containers that were running prior to stopping the kubelet. Okay. Okay. That doesn't configure the thing. What did I do there? kubelet.conf. I thought that kubelet.conf had the static manifest path and stuff in it. Am I wrong? No. So that kubelet.conf is the client configuration for the kubelet. There's /var/lib. Somewhere in /var/lib is the actual bootstrap config for the kubelet and the actual configuration. Okay. So the static pod path is /etc/kubernetes/manifests. Yeah. So I think we're gonna end up

1:53:05 Recreating Control Plane Static Manifests (Using kubeadm phases)

1:53:07 having to recreate those. At this point, we can probably just copy them over from... yes. We could potentially use kubeadm to bring back the control plane manifests, or we can just copy them over from Cluster 09 too. Okay. Do you know what the kubeadm command would be? It's going to be one of the phases. So it'll be kubeadm init, and then there should be phases. Init phase. Init phase. And then the control plane one. Well, we'll probably have to do etcd and control plane. And it says, etcd, generate a static pod

1:54:28 manifest for a local etcd. Local. Yeah? Local? Yep. And then control-plane. All. And then will we start the kubelet? Do we still have those directories mounted? I'm thinking so by how long that took. Yes. I think we probably wanna try another unmount on those. Okay. Let's try one of them and see what happens. So let's see if we can get that to work first. Yeah. Okay. So that looks better. I've lost my history. Okay. So mount, awk, never, print $3, xargs. That look okay? Yep. And there was a suggestion to use a lazy unmount.
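
The pipeline being dictated here (`mount`, then `awk` matching `never` and printing the third field, then `xargs`), together with the lazy-unmount suggestion, can be sketched as follows. The `never_mountpoints` name is hypothetical and the exact command typed on stream may differ; `umount -l` detaches the mount right away and cleans up once it is no longer busy.

```shell
#!/usr/bin/env bash
# never_mountpoints
# Reads `mount` output on stdin ("SOURCE on TARGET type FSTYPE (OPTIONS)")
# and prints the TARGET (third field) of every line containing "never".
never_mountpoints() {
  awk '/never/ { print $3 }'
}

# Possible usage on the node (not run here):
#   mount | never_mountpoints | xargs -r umount -l
# -l is the lazy unmount flag; -r (GNU xargs) skips umount when nothing matched.
```

Matching on `never` keeps the unmount targeted at the injected mounts rather than everything on the host, which is the point made earlier about grepping for it.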

1:54:52 Lazy Unmounting Remaining Busy Paths (`umount -l`)

1:56:01 Do you know the flag? I don't wanna delete my text. I wanna say it was... yeah. Dash l. Okay. Looks better. Okay. Okay. So we've now started the kubelet. We generated static manifests with kubeadm and all of it. I always say that kubeadm is cheating for the purposes of this, but I think that we debugged the problem and we were just skipping out of the really boring bit of pasting in YAML. So I'm okay with it in this instance. If we run a ps aux, do we have a kubelet? We at least got a kubelet, but not much else yet. So maybe

1:56:55 Kubelet Logs Show Progress, API Server Pod Creating

1:56:55 give that... I mean, look at the kubelet log, see if there's anything standing out there. That's here. No pager. That looks a lot healthier. No. Connection refused. And I'd expect that if the API server is not necessarily up. Oh, yeah. I think it's coming. There we go. Guess we could just follow it. So let's go ahead and see if we have etcd running, see if we have the API server running. API server looks good. Etcd, I've seen earlier. I think that's alright. I wonder if the API server is responding at this point. It is.
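
Pulling this stretch together, regenerating the static pod manifests with kubeadm phases, starting the kubelet, and checking its log might look like the sketch below. It is a reconstruction under assumptions, not a capture of the terminal: the kubelet is assumed to be a systemd unit named `kubelet`, the `run` helper and `DRY_RUN` switch are hypothetical, and any extra flags the hosts passed (for example a kubeadm config file) are not shown in the transcript.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1 (the default), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_control_plane() {
  # Regenerate the static pod manifest for a local etcd member.
  run kubeadm init phase etcd local
  # Regenerate the API server, controller manager, and scheduler manifests.
  run kubeadm init phase control-plane all
  # Start the kubelet again so it picks up the manifests.
  run systemctl start kubelet
  # Show recent kubelet log lines without a pager.
  run journalctl -u kubelet --no-pager -n 50
}

# Dry run (prints the commands only):
restore_control_plane
```

Early "connection refused" lines in the kubelet log are expected while the API server pod is still being created, as noted above.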

1:57:50 Cluster 10 API Server Restores (kubectl get nodes Success)

1:58:09 Pods are running. It's fixed. Ish. Ish. I mean, Rook and Ceph, I wouldn't expect to be up so quickly. But I think we fixed it. How does our WordPress deployment look? Happy and healthy. Alright. And just double check kube-system as well just to be sure. Running, running, running, running, running. I mean, I think Rook and Ceph will just need a moment. But based on, you know, Lewis telling me they didn't have a lot of time, I think their hack was the mount overlay craziness and that script running in the background doing whatever the hell that was doing that we

1:58:30 Final Pod Check on Cluster 10 (Healthy)

1:59:02 killed, we got the unmount. I think we did it. That was quite interesting for sure. That was two really tough clusters. Thank you, Lewis and Control Plane. Thank you, Thomas. You definitely put us through our paces there. Not only did we have to try and fix Kubernetes in a multitude of ways, but we even struggled to keep our debugging tools running at one point. Well, that was a tough one. Thanks, Jason, for joining me and guiding us through that. You kept very calm all the way through, which is not easy, at least not for me.

1:59:30 Conclusion & Thanks

1:59:44 No. It's definitely not. There were definitely some interesting curveballs that were thrown our way to try to debug. So... Oh, yeah. Definitely. It was a fun adventure, though. Thank you for having me. Alright. Well, thank you to our breakers. Thank you, Jason, for joining me on Team Fix. Remember, you can open an issue on the GitHub page if you wanna join and do one of these sessions. Thank you all for watching, and I'll speak to you all again soon. Thank you.
