Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Trace webhook failures by reading pod events and validating Kyverno webhook service, pod status, and networking assumptions.
  2. Recover NotReady worker nodes by diagnosing image disk pressure, cleaning storage, and restarting kubelet after fixing eviction flags.
  3. Fix control plane start by isolating API server startup errors in kubelet logs, then regenerating corrupted PKI certs.

Marino Wijay and John Anderson fix a sabotaged cluster: Kyverno admission webhooks, a Not Ready node with a bad kubelet eviction flag, controller manager RBAC, a corrupt API server cert, and Cilium CNI mismatched with containerd.

Chapters

Jump to a chapter

  1. 0:00 <Untitled Chapter 1>
  2. 0:59 Introduction & Housekeeping
  3. 3:18 Guest Introductions (Marino Wijay & John Anderson)
  4. 4:43 Chat about KubeCon
  5. 5:26 Marino Begins Troubleshooting (Initial Cluster Check)
  6. 6:12 Export Kubeconfig
  7. 7:17 Investigating Pod Status (Container Unknown)
  8. 11:58 Discovering Kyverno Webhook Error
  9. 13:00 Debugging Kyverno Installation
  10. 14:03 Checking Admission Controllers
  11. 16:12 Deleting Kyverno Resources
  12. 17:51 Trying to Redeploy Application Pod
  13. 18:59 Realizing Wrong Cluster Context
  14. 20:20 Checking Nodes on Correct Cluster (Not Ready)
  15. 21:36 Investigating Node Not Ready Status
  16. 23:51 Debugging Disk Usage on Worker Node
  17. 26:10 Identifying Large Log Files
  18. 27:28 Cleaning Up Disk Space
  19. 28:19 Checking Node Status After Cleanup
  20. 30:58 Debugging Kubelet (Network Issues & Backoff)
  21. 34:55 Identifying Kubelet Invalid Allocatable Config
  22. 35:56 Checking Kubelet Configuration Files
  23. 38:25 Environment File
  24. 38:40 Finding Kubelet Eviction Flag
  25. 39:08 Restarting Kubelet
  26. 40:39 Worker Node Becomes Ready
  27. 40:41 Checking Application Deployment
  28. 41:04 Attempting Deployment Upgrade (v1 to v2)
  29. 42:30 Kube Controller Manager
  30. 42:31 Hint: Controller Manager Role
  31. 43:03 Checking Controller Manager Static Manifest
  32. 44:03 Checking Controller Manager Logs (RBAC Error)
  33. 45:31 Understanding Leader Election and Leases
  34. 46:39 Identifying Scheduler Config Mistake
  35. 47:17 Correcting Controller Manager Config
  36. 47:41 Restarting Controller Manager
  37. 48:27 Checking for New Replica Set
  38. 50:38 Controller Manager Healthy
  39. 51:23 New Replica Set Created
  40. 51:52 Testing Application Endpoint
  41. 52:01 Application Responds (v2)
  42. 52:19 Marino's Turn Ends, Discussion of Breaks
  43. 52:51 Swap Hot Seats
  44. 54:00 API Server Not Running
  45. 54:47 Checking Static Manifests for API Server
  46. 55:06 Noticing Certificate File Paths
  47. 56:10 Poking API Server Pod (No Logs)
  48. 56:55 API Server Not Starting, No Logs
  49. 57:59 Checking Kubelet Logs for API Server Info
  50. 1:02:18 Kubelet Logs: API Server Backoff
  51. 1:07:56 Using crictl to Check Containers
  52. 1:09:10 Getting Logs from Failed API Server Container
  53. 1:09:33 Identifying Certificate Trailing Data Error
  54. 1:09:55 Inspecting Certificates
  55. 1:10:52 Identifying Modified Certificate File
  56. 1:11:06 Finding Trailing Data in Certificate File
  57. 1:11:32 Deleting the Broken Certificate
  58. 1:11:58 Attempting to Regenerate Certificate (Wrong Command)
  59. 1:12:58 Regenerating Certificate via Kubeadm
  60. 1:13:56 Verifying Regenerated Certificate
  61. 1:14:46 Restarting Kubelet to Pick Up Certificate
  62. 1:14:52 API Server Running
  63. 1:15:31 CoreDNS and Other Pods Broken
  64. 1:16:19 Attempting Application Deployment Upgrade
  65. 1:16:58 Pods Stuck in Creating State
  66. 1:17:11 Pod Logs: Failed to Set Up Network (CNI)
  67. 1:17:13 Fail To Set Up Network for Sandbox
  68. 1:17:56 Debugging CNI Configuration on Worker Node
  69. 1:18:21 Identifying CNI Version Mismatch Error
  70. 1:19:23 Checking CNI ConfigMap
  71. 1:22:55 Checking CNI Binaries Version
  72. 1:25:14 Hint: Dependencies & Release Notes (Containerd)
  73. 1:31:40 Hint: Check kubectl get nodes -o yaml (Different Versions)
  74. 1:32:02 Identifying Different Containerd Versions on Nodes
  75. 1:33:03 Downgrading Containerd on Worker Node
  76. 1:34:27 Restarting Containerd/Kubelet
  77. 1:34:32 Application Pods Running
  78. 1:34:50 Discussion of Containerd/CNI Bug
  79. 1:35:45 Discussion of Kubernetes Error Messages
  80. 1:37:04 Closing Remarks & Thanks
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.


0:00 <Untitled Chapter 1>

0:58 You know, I keep forgetting that I've left my face in the middle of that last scene and I never know what to do for, like, that whole twenty-five seconds. So maybe not have that next time. Let's see. Welcome back to the Rawkode Academy. This is everybody's favorite day of the week. This is Thursday, which means it is Klustered day. We had no episode last week because I was at DevOpsDays in the UK, but we're back this week, and then hopefully I get to see some of you all live and in person at KubeCon in Valencia next week.

0:59 Introduction & Housekeeping

1:31 Before we get started today, we have a little bit of housekeeping. First, it's my pleasure to thank Teleport. They have been sponsoring Klustered for a while now, and we have been using Teleport since the second episode of Klustered, and I am continually impressed and amazed at its wonderfulness. We are able to pair on Kubernetes clusters, applications, databases, and Linux systems to fix all the broken things, which is coincidental considering we're trying to fix all the broken things on Azure. So go to rawkode.live/teleport to support this show. It keeps sponsors happy and it keeps us doing these episodes. So

2:08 thank you very much, Teleport. I also want to thank Equinix Metal. They have graciously provided all of the hardware that we use for Klustered for over eighteen months now. There's been hundreds and hundreds of bare metal machines and bare metal Kubernetes clusters. So thank you to Equinix Metal. I could use small virtual machines with a few cores and a few gig of RAM, but I don't. I use 48 core machines with 64 gig plus on each node. Why? Well, it makes it a little bit more fun. So I wanna say thank you, Equinix Metal.

2:44 You can actually use the code Rawkode. I'm gonna pop that back up. Equinix Metal can have an extra second. Use the code Rawkode because that will get you $200. So you can go to console.equinix.com, use that code, and it means you can spin up clusters for up to, like, a hundred hours. And also support the show and keep the sponsors happy by going to Rawkode.live/metal to learn more. Alright. Sponsor message is done. It's now time to invite today's victims. I mean, wonderful guests. Oh, we've lost one. Oh, no. He's gone. Oh, he's back. There we go. There we go.

3:17 Alright. Two wonderful guests today. I'm joined by John and Marino. Can you both please wave and say hello? Hey, everyone. Alright. Let's start with some quick introductions. We'll start with you, Marino, and then we'll move over. So tell us a little bit about yourself, please. Yeah. For sure. Hey, everyone. My name is Marino Wijay. I am a Canadian. I am a father, husband. I work for Solo as a developer advocate, and I love all things networking even though I haven't touched a switch or router in probably several years now.

3:18 Guest Introductions (Marino Wijay & John Anderson)

3:49 I work in the service mesh space and Kubernetes, so there's a very interesting intersection there where networking comes into play. I am excited for today's Klustered session even though I'm probably not sure if I'll make it out of here, but, looking forward to the session. Alright. Thanks for sharing. Yeah. And I'm John Anderson. I'm an SRE at Zapier. I'm based out of Puerto Rico. Clearly by my face, I'm not traditionally from there, but it's a great place to work on Kubernetes clusters. I live up in the jungle. So it's a good time. Yeah. And so I'm excited. I looked

4:26 up Marino on LinkedIn and realized he is gonna be a formidable opponent, and so I'm kinda scared right now. Alright. Awesome. Alright. Well, thanks for that. And a quick question before we dive into fixing some clusters. Are you both gonna be at KubeCon Valencia next week? I am. I will be there. I'm actually leaving tomorrow evening, catching up, like, because I've got a conference on Sunday, Cloud Native Rejekts. Nice. Doing a session there, and then I'll be hanging until KubeCon starts. So Awesome. Yeah. I won't be there, unfortunately. I wanted to, but I'm in San Diego this week. And

4:43 Chat about KubeCon

5:11 so I just, like, told my wife, I'm like, hey, I'll go to San Diego, then I'll go to Valencia, then I'll be home. And she's like, gotta pick. You're not doing both. Alright. Awesome. Well, let's get this thing started. First up, we have Marino. You ready? I'm ready. I'm ready. Alright. Let me go find the active session there. Alright. So I'm gonna pop open a session on SonTek control plane one. I'm using Teleport Connect today, their new desktop client for Teleport. Let's see how we're going with this. We have a session open, but, you know,

5:26 Marino Begins Troubleshooting (Initial Cluster Check)

5:53 you should be able to see that and join. And if you could just type echo hello or anything like that to let me know that you are there. Alright. Perfect. Alright. So you have just over forty minutes. You have to export your kubeconfig, do any aliases, and best of luck. Alright. Thank you very much. Let's go ahead and export KUBECONFIG. And I gotta remember this command now because I don't remember all these commands at all times. I will be honest. I don't practice Kubernetes daily or on the regular only because I have so many other things I gotta

6:12 Export Kubeconfig

6:29 worry about. But that's not the URL. I'm, like, trying to copy and paste, and it's, like, so undecided. /etc/kubernetes/admin.conf. There we go. Alright. So let's see. kubectl config view. Let's see what's going on there. Okay. So it looks like my cluster is nothing's messing or messed around here, so let's just see. kubectl get nodes. Okay. Okay. Things are looking good. Let's just double check some other things. Okay. Things are looking good. Let's see. Get all I'm trying to resize the window when it was good and painful. Container status unknown. Okay. Okay. Interesting. Alright. So what I wanna do is

7:17 Investigating Pod Status (Container Unknown)

7:29 I'm actually gonna describe the container because I just wanna know what the heck's going on with the pod I should really take. So let's just see a pod that's actually not doing well. Let's try that. Container state is unknown. It's not an error message we see often on Klustered. Yeah. It's not one that I've actually ever encountered before, so it's rather fascinating. The container could not be located when the pod was deleted. The container used to be running. Interesting. Okay. Container was low on resource, ephemeral storage. Here. Let's find something else. Oh, interesting. Okay.

8:28 Let's take a look at the deployment and see what's going on there. K. Deploy clustered. And what I'm gonna do is I'm just gonna output this to YAML and see if there's anything wrong with the actual image. If you pipe it through less or more, it just means we can follow along with your eyes. Like this. Perfect. So I gotta remember for a second. Is the Google Cloud registry GCR or GHCR? The Google Cloud registry is GCR. However, I use the GitHub container registry for Klustered. Oh, okay. GHCR. Yeah. Okay. Let's see if there's anything else here.

9:36 Okay. Let's see something else here. There's no metrics server deployed to these clusters. That's a good idea, though. I should deploy a metrics server to these clusters. And there was an interesting error message in your describe pod, which I think you may have missed. Maybe encourage you to look there again. Not that, the pod. Yeah. Alright. Darn it. Yeah. That's fine. Alright. Projected volume, multiple sources. The container could not be located when the pod was deleted. Was that the error you're talking about or something? Or the node was low on resources, ephemeral storage.

10:46 Yeah. That eviction with the message had some information there. Okay. So something to do with the maybe a PVC or something like that. Well, read the rest of that line. Which exceeds its request of zero using 24 kilo or is it kilobytes or kilobits? I can't remember. That is Ki. Kibibytes. That's just a small i, yeah. SI standards. I don't think anyone really cares and just calls them kilobytes anyway, but I get semantic over it. Or maybe pedantic is the correct word. I'm not sure. Maybe there's something in the deployment, yeah, that limits how much

11:34 storage I can offer. What volume is it referring to? This is fun. I would read that error that's right at the bottom of the screen. Yeah. Internal error occurred. Failed calling webhook. Validate Kyverno. Connection refused. Oh my gosh. Okay. Did y'all, like, throw a network policy or something in here? Can I get on my Press q to quit the pager. Just type q? Yeah. I am. I'm not taking There we go. Okay. I'll just do the q typing. That must be some sort of really weird bug. Unless you have, like, the Chromium Vimium

11:58 Discovering Kyverno Webhook Error

12:43 extension enabled? No. I don't. Alright. Alright. Alright. We'll see. Alright. So something in that error, something in that configuration that was an error here. Alright. Kyverno. What does Kyverno have to do with this? Did your cluster have Kyverno on it? I don't recall. I don't think so. Yeah. I think John's been a bit naughty with that one. I don't know. It's a deal. Get all dash n kyverno. Because I need to see what the heck's going on in this little namespace. Pleaded service deployment. Have you used Kyverno before? No. I haven't. So I unless, like, this

13:00 Debugging Kyverno Installation

13:28 is depend well, question, you didn't deploy it into our original clusters. Right? Because I don't have to call everything. That is correct. This is not part of the cluster deployment. Kyverno is a policy tool that creates validating and mutating admission controllers. So if I go ahead and kill this, then, basically, I should not have to worry about my admission issues or anything. Right? Well, I would check for validating and mutating admission controllers first. Because even if you remove the Kyverno install, I don't think it will remove those resources. Alright. They're called validating oh, what the fuck? Validating webhook configurations

14:03 Checking Admission Controllers

14:08 and mutating webhook configurations. Yeah. It's such a long keyword. I'm trying to remember the actual syntax. Yeah. I wish it was the short version of getting them bad. I think they're good. Jeez, John. You're throwing me into an area I've never ever Let me just take one and see what so there's one webhook in this one. I have no idea what this policy is doing. But I imagine if I delete it and then go ahead and delete the Kyverno namespace, that should resolve things. Right? I'm hoping. I think so. Yeah. I back

14:59 that plan and go for it. What's that? I backed that plan 100%. Start deleting stuff. Hold on a second. Before I do, though, what I wanna do is I just wanna back these up just in case. I don't know. Who knows what's gonna happen? I probably only have, like, twenty minutes left anyways. No. You're doing good. You've you've got thirty two minutes left. You're cruising. Nice. Nice. YAML into actually, hold on before I do that. K. Paul dot YAML. Let's take this one too, and then I'll just output it to another file just because I don't know.

15:53 It doesn't hurt to review them later on. No. It's very reserved of you. Validating. Actually, I gotta fill up this here. Okay. Delete. Validating. It won't it won't autocomplete a. Yeah. We'd have to enable autocomplete. Another thing I should do by default on these clusters, I don't. Okay. May maybe for next episode. Do you want me to load the auto complete for you? Yeah. If you don't mind doing it quickly, I'll I'll eventually Nice. I think I always get that command wrong too. We're we'll find out. One quick test is if I just erase this and do a tab. There we go.

16:12 Deleting Kyverno Resources

17:02 Oh gosh. Well, I got yeah. It pretty much got it. Yeah. Yeah. Just tab your way through this cluster. That's what I do in production as well. Goodbye. Delete the namespace. I think we could probably make a coffee now. Looks like it's gone. Good. That was quite fast, actually. No finalizers snuck in there then, John. Yeah. I don't like, I was remembering, like, when I was messing with the cluster yesterday, I'm like, it was pretty minimal. It wasn't that much on there. And when I see Kyverno there, I'm like, hold on. Alright. Let's try and restart it. Actually,

17:51 Trying to Redeploy Application Pod

18:03 you know what? Why do I need to restart? I'll just delete the pod, and we'll just redeploy it. So, by deleting the pod, because I already have a deployment specified, I don't need to worry about the pod recreation. The deployment will automatically spin up the new pod or the new version. What? You forgot to put pod. Oh, yeah. There we go. Yeah. That does it too. kubectl delete pod, and the name of the pod will allow you to delete the pod. And if I go back in, get pods.

18:34 Okay. So it's not recreating. Interesting. K. Let's go take a look and see why that might be. I get pods dash wide. The quick test on NGINX dash dash n equals NGINX. So I just wanna make sure that it goes into So I typed this in the chat, but one thing I noticed is that since this is running in Kyverno, we have a staging and a production cluster, and we only deployed Kyverno in staging. And you're supposed to be fixing the production cluster. Oh, so I shouldn't have deleted Kyverno? No. You should check your config context to make sure you're working on

18:59 Realizing Wrong Cluster Context

19:21 production. Oh gosh. Did you create another cluster with the same node name? Gosh. I may have. That's nasty. Wow. Well, that is a Klustered first. Nice work. Dang. Okay. So for those that are watching, John created a new context, and I'm logged into it, and I didn't realize it. And this is why you should always check your context, like, using a tool like kubectx or what's the new version of it? Whatever it is. K. So I'm in the right context now, I would assume. Let me But your nginx pod is there, so I don't think you

20:20 Checking Nodes on Correct Cluster (Not Ready)

20:25 have to use use-context. Oh, yeah. That's right. Yeah. Kubie is the kubectx flavor of the week right now. Right? Yes. Alright. kubectl. I got pods. Oh, okay. Nodes. So where's the other cluster running, John? Shit. It's actually running on EC2. And, like, my heart was beating because the first thing you guys ran was kubectl config view and the AWS stuff scrolled by. I'm like, please don't notice that. I didn't I understand that. Let's go back. I never thought it either. No. Let's do that view config again. I've never seen anyone run view config as

21:12 well. You got lucky and unlucky at the same time. Yeah. I totally didn't notice that. And, you know, it was really weird. When I was looking at my nodes, I noticed AWS somewhere. I'm like, I was pretty sure we didn't spin this up on AWS, but I couldn't think deeply enough about it either. So, wow, that was pretty damn good. Okay. So now my nodes are not ready. Interesting. Let's see. So I'm gonna describe the node. Actually, yeah, let's describe the nodes. Oh, shoot. Because this has never happened before, technically, if you

21:36 Investigating Node Not Ready Status

21:49 had got Klustered v2 running on his AWS cluster, you would still have succeeded. Oh, yeah. That's true. So, actually So which cluster is less broken, I think, is the answer you should be asking yourself. The question you should be asking. Time to delete that other cluster right now. Alright. So I've got the deployment here. I could just simply do this. kubectl get svc as well. And I could oh, it's a node port. Okay. So I could just do this. I'll get svc clustered dash o yaml plus clustered svc dot yaml.

22:38 And for the folks on the, folks on the YouTube here listening or watching in on YouTube, what I'm gonna try to do is get this quickly up and running on that cluster that we think is working. So we'll see how that runs. But I gotta export a whole bunch of services so that I got files that I can work off of. Postgres dash dash dash dash o vmlpg.pg s v c dot e m l m c t l get deploy. So what's wrong with the current cluster? I'm not sure. The nodes are down, and, I was just describing one of the

23:19 nodes. I think I described the control plane. Node is now has its sufficient node has sufficient memory. Node has no disk pressure starting Kubelet. Invalid capacity zero on image file system. So you may wanna jump on to that worker node. Yeah. Let me open a session. Actually, can I SSH to it or no? Well, I've just opened the session. So if you go to active sessions, you'll be able to see worker one session, and then you can type with me there. Okay. Let me load that up. Alright. Hello. K. Cool. So it said image disk capacity is zero.

23:51 Debugging Disk Usage on Worker Node

24:14 So I think we need to start looking at the disks on this machine. How the heck do I do that? We could use du. du. Okay. Oh, wait. That's for files. Right? df is the one I meant. Oh, there we go. Yeah. There's some pretty full disk stuff going on here. Right? Yeah. It's pretty oh my gosh. What the heck can we get rid of? Well, ironically, the first command I ran could probably help us with that. du. Are you familiar with du at all? Not really. Alright. Oh, we're not in the so du,

25:09 we can do depth one, dash h for human-readable size, which will give us a rough indication of where the large things are. Mhmm. Because see that /var has 387 gig. So then we could jump into /var and kinda chase it that way. We could maybe also use a thing to look for files larger than a certain size, but I don't know if he did a saturation attack with many loads of small files or if he's done something, maybe he just did a dd with a four gig block size. Who knows? Like, I don't know

25:40 how nasty John is here. So we could do another, what, du? You could do a du and change the depth from one to two or three and then try and get a look at the folders and try and see if you can work out where all this crap is. Let's try. So one will show you the top level. Three will break it down a bit further. So what is that? /var/log seems to be pretty chunky. Let me just sorry. Did you say /var/log? Yep. If you went to the log directory. And you see that. Where are you seeing

26:10 Identifying Large Log Files

26:25 that? Oh, run your du command again and just pipe it into grep for log, and you'll get the size. You forgot. We go. Oh, this loads. But you can see the second bottom line. Yeah. Right there. Right? Down to that directory, it's got, like, 80 gig in it. So could I just delete that? Like, goodbye? I mean, you could blow away that entire directory, but let's maybe take a look inside it first. Try maybe do a directory listing with sort by size or something. Yeah. There's some chunky and I like the naming of those big

27:20 files too. It's very appropriate. Yeah. It's always Log4j. Alright. Let's see if we can do an rm on these bad boys. Good lord. Why is it taking so long to delete? Alright. Let's see. Oh, what? Oh, yeah. I need the rm. Let's see. Alright. Maybe we should be able to well, I wanna have to go back to the other session. Let's just hope he's not reformatted the disk to ZFS, causing the delete not really to free up the space. But I'm assuming that would be a step too far. Yeah.
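The `du` walk described above (total size per directory, largest first) can be sketched in Python. This is only an illustration of the idea, not what was run on stream: the function name `largest_dirs` is made up for the example, and it sums apparent file sizes, whereas `du` reports allocated disk blocks, so the numbers will not match `du` exactly.

```python
import os


def largest_dirs(root):
    """Return (size_in_bytes, path) for each immediate subdirectory of root,
    largest first. Roughly the view that du at depth one gives you."""
    totals = []
    for entry in os.scandir(root):
        # Only descend into real directories, not files or symlinks.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # lstat so a symlink counts as itself, not its target.
                    total += os.lstat(path).st_size
                except OSError:
                    # File vanished or is unreadable; skip it.
                    continue
        totals.append((total, entry.path))
    return sorted(totals, reverse=True)
```

Calling `largest_dirs("/var")` and then repeating on the top entry mirrors the "chase it down one level at a time" approach used in the session.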

27:28 Cleaning Up Disk Space

28:17 Alright. So node, they're still not ready. Let's describe the node. We'll bring you back another session. Yeah. I'm back in the other session. Sweet. Kubelet node Twenty minutes to go. Oh my gosh. Node SonTek. ProPlane one status is now. Node has sufficient memory. Node has no disk pressure. Where did you run that? Sorry. I don't see that. What's that? Can you run the command again? Am I we yeah. Pass it through the pager so we can see the scroll. Yeah. There we go. I screwed up. Invalid disk capacity. Oh my gosh. Still? But this is describing the control plane, though.

28:19 Checking Node Status After Cleanup

29:08 Is it? I don't know. Well, hold on. Let me see what because my screen's a bit weird. Yeah. You described the control plane, which may also have filters. We wanna try describing the worker. Mhmm. So let me My screen's gone a bit funky, so I'm just gonna type reset. There we go. I'm done. Yeah. So let's describe that worker and see if we can see at least that node's ready again because that would be a good step forward. Sorry. What was that? Describe worker one or run get node to see if it's healthy. I

29:48 would expect worker one not to be complaining about the image capacity. Yeah. It's still describe SonTek worker. Which one? What? Node. Yeah. It's probably the word node in there. Oh my gosh. Right. These cycles ago, invalid capacity zero on image file system. Yeah. Can you scroll up? I expect so all of those are not what the actual problem is right now. I expect it should tell you what the problem is up higher, but that it's been complaining about those for a while. We might need to pass it through less. Yeah. Let's do that. Okay. Here we go. So Liam is up.

30:36 Memory pressure. Oops. Could've been already let's see. No taints. It's not that it's not schedulable. That looks okay so far. The important part there is kubelet not ready. Ah, okay. Then how are we getting a report on the status? Kubelet is ready. That's what I want you to think. We should check on the worker. That's a good point. Okay. So I'm switching over to the worker now. systemctl status kubelet. Running too. Hold on. Wait. Let's see. Doesn't seem to be anything wrong with it. Am I not seeing something here? We may have to get the logs from

30:58 Debugging Kubelet (Network Issues & Backoff)

31:51 the kubelet, but I'm not seeing anything right now. I'm curious about that eviction hard warning, but I'm not sure if that's normal or not. So the command I just ran, for those on the YouTube channel there watching in, I ran something called journalctl dash u kubelet, which gives me some more logging information about the kubelet itself to give me some details about maybe some files that might be misconfigured or something along those lines. I think it's interesting, the fact that our kubelet is using a memory state store. I don't know if that's normal.

32:37 Sorry. What was that again? It says initialized new in-memory state store, which may be completely normal, and I've always skipped it. But because we're seeing memory things or disk things, I'm curious about it. Well, I don't spend a lot of time in the kubelet logs. Why would you? Yeah. Okay. I don't know where else to look now. The log should be the right place. I would just say scroll through here. It should show you what the problem is. Waiting for node sync, container runtime initialized, starting to listen address. I feel like something was wrong here.

33:25 Container runtime network is not ready. Network ready. Network ready equals false reason. Adding debug handlers too. What's the reason? Yeah. Let's try running it again, seeing if we could get some more messages. Interesting. Error getting node. Error node SonTek dash worker dash one not found. And that means it's not able to because our kubelet is not online. It's not speaking to the API server. So I think that's normal at the start. Can you tail the logs? I think we're missing the error since we're only getting a snapshot in time. Just do, like, minus f l u instead of minus

34:19 u. Just do minus f l u, and that'll tail it so that we can see there. There we go. Hold on. I'm having a hard time finding it. Hold on. Yeah. I would kill the tail now. Yeah. There you go. Okay. So where did I run the tail? So the error's right there: failed to start container manager, invalid node allocatable configuration. Resource memory has an allocatable of blah and capacity of blah. I'm literally trying to find this message that I cannot see because I'm looking at a to start container manager invalid node allocatable. Oh,

34:55 Identifying Kubelet Invalid Allocatable Config

35:16 there we go. Oh my gosh. What does this even mean? I mean, I don't know right now. I think what we need to do I'm assuming he's modified the kubelet configuration, which means we wanna start looking at systemctl cat kubelet and see if he's added any extra flags to limit capacity. What else could this be? It could be where is the file located at? I can't remember now. Here. Right? If you run systemctl cat, it'll tell you all the config files that are used. Sorry. What was the command again? System control

35:56 Checking Kubelet Configuration Files

36:04 space cat space and then the unit name, which is kubelet. Okay. Here we go. So this will show you the unit file, which looks clean. And then we have the /var/lib/kubelet/config.yaml file here. We have the kubeadm flags env, and we have the /etc/default/kubelet. An interesting comment on the chat as well from one thousand and one saying this could potentially be containerd related as well. Containerd? Yeah. So we should probably sanitize each of these files and then take a look at the containerd config. I am of the mindset that

36:53 we could probably hold on a second. The files are located here in kubelet. Oh. The first one is /var/lib/kubelet/config.yaml. Lib kubelet. Oh, yeah. Config dot yaml. Maybe just cat that. Now you have to try and remember what is normal in this file. Healthz bind address. I think that's okay. Okay. And if you run the cat again, we can look at the next file. Sorry. The systemctl cat. There's another two config files. I don't remember where they are. What's the file we wanna look at? Kubelet. Yeah. Okay. Yeah. Kubelet. Now we've got this

38:10 /var/lib/kubelet flags. I'll just cat that. And that looks clean. Yeah. That looks good to me. And then the next file we need to look at is the environment file. Yeah. The /etc/default/kubelet. We could take a peek in there and see. There we go. We got extra args. Eviction hard, memory dot available less than a hundred gigs. So this is telling the kubelet to evict all pods if there's less than 100 gig of available memory. Wow. Okay. So we can delete that, I guess. You can indeed delete that. Yeah. What I'm And I will have to do

38:40 Finding Kubelet Eviction Flag

39:01 a restart of the kubelet. And if you jump back to the control plane and try to get nodes, and we'll see. Still not ready. But Yeah. Let's give it five seconds. It's gonna be ready. Yeah. That's all there was for making it not ready, so it should come back. Hey. Hey. The worker's good, but not the control plane. Oh, we don't need that control plane. It's not a part, but it's the same break, so you can fix it quickly. Okay. Hold on. Let's see what I did. So I actually went into which file again?
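Why the eviction flag found above took the node down can be shown with a little arithmetic. The kubelet derives allocatable memory as capacity minus reservations minus the hard eviction threshold, and it refuses to start the container manager when that goes negative, which is the "invalid node allocatable configuration" error in the logs. The sketch below is a simplified illustration, not kubelet code: the function names are invented, only binary suffixes are parsed, and the 64Gi capacity is an assumption based on the "64 gig plus" machines mentioned in the intro.

```python
# Binary suffixes used by Kubernetes resource quantities (Ki, Mi, Gi, Ti).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text):
    """Convert a quantity such as '100Gi' or '24Ki' to bytes.
    Only plain integers and Ki/Mi/Gi/Ti are handled in this sketch."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def node_allocatable(capacity, eviction_hard, reserved="0"):
    """Allocatable = capacity - reserved - hard eviction threshold (bytes)."""
    return (
        parse_quantity(capacity)
        - parse_quantity(reserved)
        - parse_quantity(eviction_hard)
    )


def check(capacity, eviction_hard):
    """Return 'ok' or a short reason, mimicking the kubelet's sanity check."""
    if node_allocatable(capacity, eviction_hard) < 0:
        return "invalid: eviction threshold exceeds capacity"
    return "ok"


# Assumed 64Gi node with the sabotaged flag from the stream.
print(check("64Gi", "100Gi"))
# Same node with the kubelet's default memory.available threshold of 100Mi.
print(check("64Gi", "100Mi"))
```

With the 100Gi threshold the check fails on any node that has less than 100Gi of memory, which is why removing the extra argument and restarting the kubelet was enough to bring the worker back.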

39:08 Restarting Kubelet

40:00 It was, um, /etc/default/kubelet. Yeah. systemctl restart kubelet. And then kubectl get nodes. Already. That's a watch. Let's wait for it to change. Eventually, it'll change. Hopefully, it'll change. There we go. Alright. Alright. Well, let's take a look at our deployments. Get deploy. Why not? Available. That's good. K. Get svc. Yeah. That's a node port. And what did you want me to upgrade it to? Version two or something? You have to upgrade our Klustered deployment from m h v one to m h v two. Alright. Let me just

41:04 Attempting Deployment Upgrade (v1 to v2)

41:05 And you've got six and a half minutes. Okay. Edit deploy klustered. Nothing. Darn. Why didn't we update? I was worried that was gonna work. Yeah. I mean, you could try the edit deploy again. Something tells me the edit oh, it did go through. Do we have a new replica set? We do not. No. Now what happened? kubectl edit the deploy, actually, let's just get the deploy. Deploy. Actually So I can give a hint for this one. So the replica sets are managed by a thing called the kube-controller-manager. Yeah. And so if we're not getting new

42:31 Hint: Controller Manager Role

42:36 one of those, you're gonna wanna make sure. Yeah. Wait. We're getting a new one of what? Yeah. So there's are you familiar with the kube-controller-manager? I messed around with it, like, a long time ago. So Yeah. If you go look at the static manifest, I think you'll try. So the one second. It's /etc/kubernetes/manifests. Right? That's the one. This is one of the weirdest things about Kubernetes. And, actually, I talk about this in the talk I did at DevOpsDays Birmingham last week. I was talking about the kube-controller-manager, how you can disable replica sets and pods.

43:03 Checking Controller Manager Static Manifest

43:17 And I'm just thinking, like why? Why do these flags exist? But of course there are always use cases. But yeah, it's always a sneaky break. Okay. Let's see. Oh, past it. I'll tell you, this is sneakier than you're gonna expect. Oh, it is. I expected to see a dash replica set controller thing there. If you check the kube-controller logs, it'll at least tell you what the problem is. Oh, gosh. It would have been faster just to deploy the new cluster that you create. Start with kubeadm reset. It's always the best way. Alright. Let's check the logs for this kube-controller

44:03 Checking Controller Manager Logs (RBAC Error)

44:17 manager and see what's happening because I have no idea. Sorry. What's the command to run through the logs again? For this, because it's a static pod, you need to go to /var/log/containers. You can also use kubectl logs on it because it's inside the kube-system namespace. Oh, yeah. Oh, yeah. Namespace kube-system. kubectl logs. That's a good point. Oh, yeah. Okay. I just noticed something here. You've got the namespace. Yeah. Oh my gosh. Noise. There's so much noise. That's not noise. That's information. Error retrieving resource lock, kube-system. Leases, coordination, is forbidden, system.

45:19 What? Okay. I feel like I'm so close. So Kubernetes I mean, I'm guessing, and I'm explaining this to myself to see if this even makes sense. But Kubernetes API supports leader elections, so the leases thing, and it looks like have you modified the RBAC of the role for the controller manager to not have access to the leases object? Leader election was true. Right? I made it look like I did. You can turn the leader elect off because there's only one. Right? Is that what you've done? No. So yeah. So since we're getting close to time.
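Editor's sketch, not from the stream: a "forbidden" message from the API server names the identity it authenticated, which is often the quickest clue. The sample line below is a reconstruction in the usual Kubernetes wording, not copied from the video, and the function name is hypothetical.

```python
import re

# Pulls the user, verb and resource out of a Kubernetes "forbidden" message.
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)"'
)

def forbidden_identity(log_line: str):
    """Return who was denied what, or None if the line is not a forbidden error."""
    match = FORBIDDEN.search(log_line)
    return match.groupdict() if match else None

# Reconstructed example line (assumption, not the on-screen output).
sample = (
    'error retrieving resource lock kube-system/kube-controller-manager: '
    'leases.coordination.k8s.io "kube-controller-manager" is forbidden: '
    'User "system:kube-scheduler" cannot get resource "leases" '
    'in API group "coordination.k8s.io" in the namespace "kube-system"'
)
print(forbidden_identity(sample))
```

If the user printed is not the component whose logs you are reading, the RBAC rules may be fine and the component may simply be presenting someone else's credentials.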

45:31 Understanding Leader Election and Leases

46:11 So the way this is working oh, we'll see if he it's sitting right in front of you right now. It's right in front of you. Oh, he's mounted the wrong authentication for the controller manager. Right? You've mounted the scheduler instead of the controller manager. Yeah. The error is saying that the scheduler can't get these leases. You're not supposed to be working as scheduler. Yeah. Go down four lines, but, you know, that scheduler.conf should be controller-manager.conf. Control Oh. You know, you could've given me an error. I don't think I would've seen that.
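Editor's sketch, not from the stream: the swap described above can be spotted by listing every kubeconfig-style flag in the static manifest and comparing it with the file the component is expected to use. The manifest excerpt is illustrative; which of the flags was actually tampered with is an assumption, and the function names are hypothetical.

```python
import re

# Matches --kubeconfig, --authentication-kubeconfig, --authorization-kubeconfig.
KUBECONFIG_FLAG = re.compile(r"--([a-z-]*kubeconfig)=(\S+)")

def kubeconfig_flags(manifest_text: str) -> dict:
    """Map each kubeconfig flag in a static pod manifest to the path it points at."""
    return {name: path for name, path in KUBECONFIG_FLAG.findall(manifest_text)}

def suspicious_flags(manifest_text: str, expected_path: str) -> dict:
    """Return only the flags whose path differs from the expected kubeconfig."""
    return {
        name: path
        for name, path in kubeconfig_flags(manifest_text).items()
        if path != expected_path
    }

# Illustrative excerpt of a controller manager manifest (assumed content).
manifest = """\
spec:
  containers:
  - command:
    - kube-controller-manager
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
"""
print(suspicious_flags(manifest, "/etc/kubernetes/controller-manager.conf"))
# prints {'kubeconfig': '/etc/kubernetes/scheduler.conf'}
```

The same listing also helps with the follow-up mistake in this session: whatever path the flag ends up with has to exist on disk, so it is worth comparing it against the output of `ls /etc/kubernetes` before saving.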

46:39 Identifying Scheduler Config Mistake

46:53 I tried it on two of my friends before this to see if they figured out, and they couldn't. So I'm like, this is a good one. This service is really interesting. I don't know if it's kube-controller-manager or just controller-manager. I just okay. We may have to check. But you can save it either way and then look in that directory and see what file exists. This is ls. I think it's kube-controller-manager. So you were right the first time when you listened to me. That was your big mistake. Very good. Very good. Okay.

47:17 Correcting Controller Manager Config

47:35 Alright. If you restart the kubelet, it will speed that controller manager redeploy up. Alright. Done. No. It's not. I feel like done. Yeah. I was gonna say there's one more, and then you're done. Sorry. This is literally death by a thousand cuts. Is it like okay. Here's the thing. I don't know if it's in the deploy. But do we have a new replica set yet? Actually, that's a good point. Nope. Alright. And did the controller manager restart if you check the kube-system namespace? Yeah. You should have a new replica set. Well, maybe not. Oh. No. We're crashing.

48:27 Checking for New Replica Set

48:47 So maybe it was controller-manager, not kube-controller-manager. Doing ls /etc/kubernetes. I'm assuming that's the file not found error. No. Just Kubernetes. Yeah. Just inside Kubernetes. I'd say Kubernetes. We're looking at the it is Kubernetes. Alright. You'll need to get the logs on that controller manager, and we'll see what it's complaining about. We'll go for a few more minutes, and we'll see if we can work it out. Yeah. That should be working now. So you might just have to that was just thirty seconds ago. Yeah. So there must be a typo on that line

49:24 we modified. Yeah. It's not picking up that file. Yeah. Cat the config. Let's figure out there must be a missing character or something. Kube controller. So go to the top of this file. It has the correct one. And then we can just copy it from there. Yeah. Right there. Oh, controller-manager. Okay. So it's not kube-controller-manager. How did we look at the file system and still think it was kube-controller-manager? Don't know. Because the whole verified this. Oh, no. We looked at the manifest. The manifest is kube, but the actual conf is

50:07 not. Yeah. We looked at the No. We could oh, there's two. That's why. There's the kube-controller-manager.conf and a controller-manager.conf. See? No new replica set. You may have to restart the kubelet again for that controller manager as well. Oh, it's happy. Okay. Cool. I'll do this. Boom. kubectl rollout. Alright. So give you the final hint because we're close though we're out of time. The scheduler is your final issue. Looks but it looks healthy. You might see a trend with these breaks. Everything looks healthy. Action. Just to see. I want it he's not gonna tell me anything here.

50:38 Controller Manager Healthy

51:12 Replica sets don't need scheduled. So why do we not have a new replica set? Oh, that's true. We should have a new replica set at this point. Not the controller manager. Oh, we do now. Alright. Okay. Okay. So how many pods do we have? Right. Okay. Now we can describe the new replica set, and we can see what the hell's going on there. Oh, it says forty seconds. Yeah. Yeah. Yeah. Yeah. Might be okay. Alright. Let's just do that. Yeah. How do I get to the okay. One second. kubectl get svc? Or am I pasting this in to my

51:23 New Replica Set Created

51:48 web browser? You can curl localhost:30000. It doesn't work in my browser, but sometimes that does happen. I need to work on that. Yeah, localhost:30000, see if we get v2. We do. Nice. There you go. Your scheduler break did not take. Hang on. I did that one on Adobe, and it didn't work out that either. I need to get better at it. That was good. Oh my gosh. Oh my gosh. Alright. Whoo. That was interesting. Thanks so much for that. Yeah. That mounting the wrong kubeconfig into the container.

52:19 Marino's Turn Ends, Discussion of Breaks

52:30 Gives you, like, a kill effect because it looks like something we've seen before and then was and it's just completely different. That was evil, but fun. I enjoyed it. Thank you so much. Alright. Great work, Marino. And I love the way you were talking and explaining things as you went as well, which is just fantastic. So nice work. Alright. It's time to swap hot seats. John, you are up. Rock and roll, man. You got this. I like the other side better. I don't wanna do this. Alright. I have opened

52:51 Swap Hot Seats

53:04 well, opening an SSH session on the control plane. Please feel free to join the session and give me an echo or something to let me know that you are there. And we'll get this thing started. Sweet. You have forty just over forty one minutes. Export your kubeconfig, set up any aliases, and best of luck. Russell was waiting on the Rawkode schedule there. We don't need to bust that one out today. You're supposed to need it, though. That's unfortunate. Alright. So let's look. Did you install a tool just now? I just, you know, some small scripts. Nothing.

53:52 Ignore. Ignore. Oh, you just saw the error. Oh, okay. There you go. Nothing to see here. Okay. You're on control. You're on JK. You're on CT control. You're on helm. You set up your alias. Yep. So Look. You were looking for policies. Yeah. I after last time, I'm like, yeah. I'm checking for all kinds of stuff. Alright. So that means the API server is not up. So alright. Yeah. Let's we have a kubelet, but we don't have a kube API server for some reason. So /var/log/containers, why aren't you working? Uh-oh. So alright.

54:00 API Server Not Running

54:45 Hopefully, it's still there. Manifest. Alright. And so we should have set PKI in there, and they're called API server. Okay. So let's see what it's what was the error? PKI API server etcd client. Is that different? Yeah. So that's not there. So what is oh, there's two. Yeah. There's two s's there. Okay. That was confusing. I'm like, it looks like it's there. Alright. So we got two s's there. Oh, I just deleted the wrong thing, didn't I? Yep. Yep. I was I searched for two s's. Turns out in English, you can't have two s's. Yeah. Quite common.

55:06 Noticing Certificate File Paths

55:46 Alright. Let's see the CA file. Okay. So I'm there we go. That's the right one. I almost made things worse. Yep. That happened. That would've been harder to detect. Alright. Just give it a little poke, and then alright. So it hasn't changed. So And you'll move it and put it back, Drake, as well. Although the tail that you've got, the container ID will change. Yeah. Well, I did star, so I should've got them all. Alright. Oh, yeah. But the star was always at that time. So when then Yeah. Interesting. So yeah. So it still didn't pick up that

56:55 API Server Not Starting, No Logs

56:58 change. Let me look back at it. Didn't pick up my change here, but I did save it. Another way to get logs is /var/log/pods. Mhmm. Yeah. Which just means you don't have to worry about the container ID changing too much. Yeah. No. We got all of these, though. So I would need to know which one. Alright. Alright. I'll try to be helpful. I thought you're getting. Alright. So I have a kubelet. I'm not getting oh, what I could do was it so I don't have a kube API server. So it's still not starting it up even

57:49 though I fixed the thing it's complaining about. Yep. That's annoying. So there's those logs. So it's doing that because it's not starting up. So for some reason, it's not starting. Let's go back to the manifest API server. It says command kube-apiserver, which is probably correct. I feel like you're making some assumptions right now. So? Yeah. I mean, I'm 100% speculating, but you've made one assumption, which I would probably wanna confirm if I were you. Well, I don't know what that assumption is, but, like, what I've noticed so far is API server is not running. Right? But I

57:59 Checking Kubelet Logs for API Server Info

58:54 do have a kubelet up. Well, if you modify a static manifest, then you don't see that change happen. Yeah. Yeah. You're that is an assumption I shouldn't be making. You're right. systemctl. Good call. I would have been spinning my wheels for a while if that's let's actually look at that. So I can't see my whole screen. I'm gonna wiggle this so I can Yeah. Wiggle it and type reset. Maybe it doesn't know. Kickstart it. Yeah. So let's see if I can And if anyone wants to rewrite xterm.js, we'd all appreciate it. That would be

59:35 that would be awesome. That's, like, Toml. Yeah. This guy. Very. Oh, that's a great place, though. It's in /etc/kubernetes/manifests. Right? I don't know. My screen's gone a bit funny on me now. Okay. So, yeah, it's looking in the right place unless there's a way to override it outside of that because we got static pod path here, which is correct. Let's run that cat again. So we can check we can check one little hint. Take a look at that error message when you run that kubectl command. Which command? But he's done a ps, and

1:00:35 there's no API server yet. Just throwing that out there. Yeah. That too. Okay. There yeah. There's something else that you'll have to fix. But Don't don't make it too easy on him. Come on. You see what he did? Yeah. That's true. Yeah. You're an s r. You can hear me. Alright. So you've confirmed that the static manifest location has not been tampered with. Alright. So we do have some extra ARGs, which we don't want as well. So we'll kill those off. Which shouldn't affect what we're doing right now. And then I also wanna check I just resized my window. I hope it

1:01:20 doesn't Nope. It's alright. It came back. I've been trying to drag the bottom of it for, like, the last three minutes. I just I was just like, screw it. I'm just gonna resize the thing. Alright. So and all this looks okay. Okay. So let's Still don't have API. And you're not getting any logs from it at all? Yeah. So that's what I'm wondering. Like, if I go to /var/log/containers and kube-apiserver, and then get all those. Yeah. So it hasn't logged out since that time that I Yeah. So it hasn't even tried to start it yet.

1:02:13 So what would cause it to not try to start a kube API server? Yeah. You even moved the file, and you restarted the kubelet. Right? So Yep. Yeah. Kubelet's been running for one minute. Yeah. It's user in. Check here. Are there any sort of, perhaps, certificates that might impact how all of this might come up? It shouldn't stop the API server from starting because the kubelet reads the static manifest directory and creates something called a mirror pod, which doesn't require yeah. It shouldn't require it doesn't require any authentication. Those containers should just be created. Right?

1:02:18 Kubelet Logs: API Server Backoff

1:03:04 Yeah. That's what yeah. I'm not sure. Like, since he's given us a hint, I'm questioning my my assumptions here. But my assumption would be even if it was a certificate problem, it would try to start it up and then Yeah. We see logs. Yeah. And we're not getting logs right now. What about our Kubelet logs? Did that give you anything? Let's let's look. There we go. Not found. Alright. Oh, slow down. It's usually a good idea to restart the Kubelet and just go through those logs so you see everything. Because a lot of the times, you get

1:03:42 errors early if they never come back. I'll just let her go for a little bit. Alright. What do we got here? Unable to register node with API server. Connection refused. No. No. I think that was because kube API server actually wasn't up. So some of those errors oh, there we go. CA certs. /etc PKI. So, yeah, those are kind of interesting since he mentioned the certificates. Right? Mhmm. Sure. Plan. Yeah. So, yeah, that's the only thing I'm seeing in there. I'll check a little bit more. But yeah. A lot there at all. No. But I also don't see any

1:04:42 error message with regards to certificate. Hey. Did you also check to make sure that you're on the right context? Did we both do that? I don't see it. Yeah. Yeah. I was gonna be like, I thought I was clever. Alright. Alright. Yeah. So what let's go back to these manifests. So because we should be starting the API server, and you gave the hint of the certs. I would really expect a log for those. So it's really interesting that it's failing to start the API server. I would expect log. I mean, we don't even have any

1:05:34 containers in containerd. Right? Yeah. So oh, here. Let me look at here. I'll show you. So, yeah, if you do this, we just don't have the kube API server currently. So let's oh, here. I was scrolling here. I'll just pipe this to less. So because, yeah, these are the only things I'm seeing, but they don't seem like a problem. Well, yeah, because the API server is not running. So that's exactly, but, I mean, I don't see it starting the mirror pod either. There's normally a line at the start of the kubelet that says read manifest in this

1:06:21 location. I mean, do we have that? Can you find manifest? Yeah. Right there. /etc/kubernetes/manifests. So it's looking in the right place, and then it's starting the kubelet. It's still on the volume manager. Oh, I thought I was. Yeah. Yeah. Oh. Yeah. It's right here. My terminal scrolls all the way back to the bottom. Oh, interesting. Okay. Yeah. So all that looks correct. Runtime status. I don't know. Like, let's select 2379. Is there anything in there for that? Nope. So because what I was wondering is, like, do we have, like, etcd? Yeah. So we

1:07:15 have etcd running and everything. So it's only the kube API server is not starting. So what would stop the kube API server from starting? It must be missing. Oh, right here. Nope. Wait. I see it. So, yeah, it's gonna CrashLoopBackOff. It's halfway through. I don't know if can you see yeah. I don't know how to tell you where I'm looking at. But there's a CrashLoopBackOff here. But that was here. I'll just grab it so you can see. Alright. Okay. There we go. It should not be that hard to get

1:07:56 Using crictl to Check Containers

1:07:58 that line out of that log file. Yeah. Okay. So, yeah, it's a back-off starting kube API server. But, yeah, but we're running lot. Container. Yeah. So it's failing to start it, but why? Let's see. Yes. Assuming the correct thing. IfNotPresent, I'm gonna just put this as Always just to be safe. Yeah. I would expect a log here. Oh, but I guess since it's failing to start the container, it's not gonna have a log. Okay. That makes sense. So I think that's what's happening. Yeah. Yeah. So now the question is why is it failing

1:08:46 to start that container? If you do a ps --all, does it show you non-running containers? crictl ps --all. Oh? Alright. There we go. So we got two exited kube API servers. Maybe Yeah. I'll get the logs of Yeah. Okay. About a minute ago. Or even run an inspect on that one that failed a minute ago. Maybe useful. Yeah. So it's there we go. Oh, but that's the old one. Old one. Yeah. So let's do this new one. Alright. There we go. Let's see. That's a good one. Alright. Loaded 11, validating admission controllers successfully.

1:09:33 Identifying Certificate Trailing Data Error

1:09:37 Oh. The trailing data is the secret sauce there. Yeah. So command failed, x509, trailing data. Bowie. So one of the certificates is wrong, but which one? Cat them all. Yeah. So, cd pki. Alright. So I don't know if this is gonna actually work, but maybe they put something in there. It doesn't look like they actually did put anything in there. I'm terrible at certs, so I don't know if there's a way to like, I would have to use openssl x509 to decrypt them if we wanna actually see the data inside.

1:09:55 Inspecting Certificates

1:10:33 Oh, let's do ls. I mean, the apiserver-etcd-client, those there has been modified. So let's see. Seventeen forty two. Yeah. So this is the only one that's been modified right now. Have you opened it in vim? What was the right thing? Okay. I opened all of them. Alright. So I'll get this. Ah, look there. It's right in front of our face. He's sneaky. I'll give you a minute. Yeah. It's not in front of my head, bro. Read the last line. The second last line. Sorry. The last line, I think. Oh, yep. Yep. Yep. You're right.
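Editor's sketch, not from the stream: "trailing data" means bytes are left over after the certificate's outer ASN.1 structure has been read. One way to check a PEM file without eyeballing base64 is to decode each block and compare the declared length with the actual length. The toy bytes below are not a real certificate, and all function names are hypothetical.

```python
import base64
import re

PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)

def der_trailing_bytes(der: bytes) -> int:
    """Count the bytes that follow the outermost ASN.1 SEQUENCE in `der`."""
    if len(der) < 2 or der[0] != 0x30:
        raise ValueError("data does not start with a DER SEQUENCE")
    first = der[1]
    if first < 0x80:
        # Short form: the length fits in this single byte.
        header, body = 2, first
    else:
        # Long form: the low 7 bits say how many length bytes follow.
        count = first & 0x7F
        if count == 0 or len(der) < 2 + count:
            raise ValueError("indefinite or truncated DER length")
        header, body = 2 + count, int.from_bytes(der[2:2 + count], "big")
    if header + body > len(der):
        raise ValueError("DER body is truncated")
    return len(der) - (header + body)

def pem_trailing_bytes(pem_text: str) -> list:
    """Return the trailing-byte count for every certificate block in a PEM string."""
    counts = []
    for body in PEM_CERT.findall(pem_text):
        der = base64.b64decode("".join(body.split()))
        counts.append(der_trailing_bytes(der))
    return counts

def to_pem(der: bytes) -> str:
    """Wrap raw bytes in certificate armour (helper for the demo only)."""
    b64 = base64.b64encode(der).decode("ascii")
    return f"-----BEGIN CERTIFICATE-----\n{b64}\n-----END CERTIFICATE-----\n"

toy = bytes([0x30, 0x03, 0x02, 0x01, 0x05])  # SEQUENCE { INTEGER 5 }, not a real cert
print(pem_trailing_bytes(to_pem(toy)))                # prints [0]
print(pem_trailing_bytes(to_pem(toy + b"\x00\x00")))  # prints [2]
```

Any non-zero count points at the file to regenerate. Characters pasted into the base64 body can also break the padding, in which case `base64.b64decode` raises an error instead, which is just as useful a signal.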

1:11:06 Finding Trailing Data in Certificate File

1:11:24 Oh, no. Totally all of it. Oh, I already did. Oh, that may be broken. That's broken now. I think one second. Maybe just to delete that. I hope you find this. Yeah. Right now. Right. It broke it, so I have to fix it. K. It says certs renew. I figured that wrong. No. What is it? Yeah. kubeadm init. So it's the phase, certs. I don't think you need to type phase. No? Thought that's what it was. Help. Yeah. Certs. Certs. And then renew. That's what I said. In sub command Oh, you have to do

1:11:58 Attempting to Regenerate Certificate (Wrong Command)

1:12:22 you have to fill it with certs. Yep. So what was that one that I just broke? apiserver-etcd-client. But is it the file name that we do? It's apiserver. Right? Yeah. The apiserver-etcd-client. I don't think you could put Okay. Yeah. Upload existing You may have to remove that one now because it Oh, yeah. Yeah. Would it. You made it worse. Okay. So I think we're good now. Right? Nope. We don't have our. No. So how am I gonna regenerate that now that I jacked it up? Oh, actually,

1:12:58 Regenerating Certificate via Kubeadm

1:13:24 here. Can we I know how we can do it. Alright. I got this. Which one was it? I forget. I need to go back. apiserver-etcd-client.crt. Yeah. So it did regenerate it. Oh, no. Okay. Oh, did you scroll back up to get the data? Yeah. I was like, I know I catted it before I deleted it. That was called backup. Alright. Good form of backup. Never clear your screen. That's the secret here. Alright. So now, we wanna do this again. Oh, yeah. So ps -a. Seven minutes ago. So we're gonna wanna because all we did was modify

1:13:56 Verifying Regenerated Certificate

1:14:38 a cert. So it's not gonna know that we did that. Right? Yep. Twenty minutes to go. Alright. So now that we fixed that, is it gonna spin one up? Get rid of the -a. It did. It's running. 337. That's not right. Alright. Only one place. 143. Oh, PDO. Right. We're getting somewhere now. You have an API server. You have broken things there then. One CoreDNS broken, but there's one up as well. So that's semi suspicious, and then same thing for the Cilium operator. But we're gonna focus on the klustered for now. See if that's the right choice. Oh, maybe

1:15:31 CoreDNS and Other Pods Broken

1:15:49 not. Network is not ready. CNI plugin is not initialized. And then this one, both of those are interesting. Right? Failed to fetch token. Yeah. Curious about the age on those errors, though. Like Yeah. They were a while ago. You just go straight for the update? Yeah. Let's not debug stuff that I don't want. Alright. Creating. Got a new replica set. Something tells me that pause is never gonna come. Oh, Alright. So let's do Just to get a clean. Alright. Do you know you don't need the --grace-period=0 anymore? That used to be mandatory for a force, but it's been

1:16:19 Attempting Application Deployment Upgrade

1:16:50 it's been Oh, they got rid of it. They got rid Alright. So these are creating now. So it seems like it's getting stuck in creating. So that means, like, that pause container or something. Right? See what it says. Oh, there we go. Failed to set up network for sandbox. Alright. I'll open a session on the worker node for you. Assuming you want one. Yeah. Yeah. I let's see. You got it? Oh, there. Yeah. It's there. Joining now. Okay. Given everything's running, so I don't know where to fix that, though, because we're we have a real network

1:17:13 Fail To Set Up Network for Sandbox

1:17:48 over here. We just need to fix what the sandbox is doing, but I don't know where to go for that. So what happens when you deploy the CNI is it puts some config and some binaries onto the host. Yep. My thinking was maybe they had modified something in there. Yeah. So I can compare the other one. 0.3.1, Cilium. Yeah. That all looks correct. Yeah. Actually, it says incompatible CNI versions. The config is 1.0.0, and the plugin supports 0.3.1. So what's on the host is right? Alright. One second. So go back here. Config map?
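Editor's sketch, not from the stream: the error read out above is a version negotiation failure, where the `cniVersion` handed to a plugin is not in the list that plugin binary supports. The config and the supported list below are assumed values for illustration, and the function name and message wording are hypothetical.

```python
import json

def cni_version_problem(conf_text: str, plugin_supported: list):
    """Return a description of the mismatch, or None if the versions agree."""
    version = json.loads(conf_text).get("cniVersion")
    if version in plugin_supported:
        return None
    return (
        f"incompatible CNI versions: config is {version!r}, "
        f"plugin supports {plugin_supported}"
    )

# Assumed inputs: a config asking for 1.0.0 and an older plugin binary.
conf = '{"cniVersion": "1.0.0", "name": "example", "plugins": [{"type": "example-cni"}]}'
old_plugin = ["0.1.0", "0.2.0", "0.3.0", "0.3.1"]

print(cni_version_problem(conf, old_plugin))
print(cni_version_problem(conf.replace("1.0.0", "0.3.1"), old_plugin))  # prints None
```

The check has two sides, so a config file that looks correct on disk does not rule out a mismatch: the version may be coming from whatever component builds the config, or the binary may be older than expected.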

1:18:21 Identifying CNI Version Mismatch Error

1:18:37 Yeah. So config is 1.0.0. But yeah. So we're on 0.3.1 on ours. So It's system. Ah, okay. Now is they were doing 1.0.0? I believe so. Nothing in there. So where else? Oh, here. Instead of editing it. Yeah. So where would you define that version if it's not in CNI? That's our only CNI. Oh, well, what is the correct I don't even know what the correct version is. Like, should we be 1.0, or is 0.3.1 the right thing? I think 0.3.1 is the correct one. Okay. So we need to find

1:19:23 Checking CNI ConfigMap

1:19:48 where happening during o. Creating a pod sandbox. Yeah. And there's nothing in the configuration that's defining that. Must be either containerd. Containerd or the kubelet. Yeah. So what is it? containerd config dump. Config dump. Nothing. So they didn't modify containerd. Oh, yep. Yep. Yep. Let's go check over there. Alright. Going back over to the worker. Is it over there? So how could you modify containerd or what else could it be? Trying to think what So it sounds like the kubelet config is one option, but it could be containerd's unit file. Maybe cat the service.

1:20:49 If you think it's containerd, I'm not sure. Yeah. No. That's pretty vanilla. Right? Yeah. That's pretty. So what I was thinking is, like, inside here, we have, like, the plugins. Yeah. So, yeah, I'm not sure what Private config.yaml. Just capture. So not some sort of CNI thing in here. No. Nothing. Let me do that there. Let's do it here. Oh, so, yeah, I'm not sure. Oh, I'm in the wrong place now. I'm gonna go back to the master for a second. System. Yeah. For here, I'm not really sure what would define

1:21:51 that version. I'm gonna reread that error. Maybe describe the DaemonSet for Cilium. Could it be modifying the args or something? Yeah. So it says failed to create the sandbox, failed to set up network for sandbox, and then it says an incompatible CNI version. Yeah. Tricky. So that has some mounts in it. I mean, it's passed an eyeball test. Don't see I was scrolling. I wasn't using Yeah. I'm scrolling too. I'm suckered onto this. I need to know what this is. This is an interesting one. So I'm trying to think of, like, when the sandbox starts

1:22:55 Checking CNI Binaries Version

1:22:59 up. There's this Yeah. Cilium custom CNI conf. Oh, that's just our config, though. So Yeah. Cilium custom CNI conf, and it's a ConfigMap, which oh, yeah. And we actually We looked at that. Yeah. k get cm. So we've got Cilium config. We've got yeah. Well, we can relook at this just to make sure we didn't miss anything. But it doesn't seem like that's it. I guess the question would be, is there, like, supposed to be a default flag in here and it, like, defaults to 1.0 if you don't have it, but that would

1:23:44 be kinda weird. Let's see. Let's think. Pod sandbox. Alright. So yeah. So pod sandbox. Yeah. Without a containerd modification, that's interesting. I just ran containerd. I've done that so many times. Yeah. It doesn't that doesn't respect the signals. Yeah. You just have to open a new session and kill -9 that. No. I think we're I just think you can stay there. Right? Okay. Yeah. We're good. So I did for future reference, I did control backslash, which is a bigger kill than a control c, and that seemed to The control c is, like, INT, what's the backslash? Is that

1:24:33 straight into a SIGKILL? I think so. Yeah. I forget at this point. I've just been using it so long. I just when control c doesn't work, I control backslash. I didn't know that. Fair enough. Alright. So something with the sandbox. Trying to think. What else? I mean, this could be completely out of left field, and maybe they're not this nasty. Not this nasty. But can we go into the CNI directory and actually run the binaries for the --version? Like, could they have recompiled anything? Which seems I think it's a

1:25:09 where did they put the bins? CNI? So the one hint I'll drop y'all, right, is think of dependencies for a second, and then also think about release notes. And that's all I'm gonna say. Yeah. That's the same CNI has been recompiled. That's been modified. Yeah. That's Let's run a --version on that. Like, run. Yeah. Yeah. That looks okay. CNI bucket at 1.10. Yeah. So we're gonna like, what is the I wanna just go to Cilium and, like, look. What is the version we're supposed to be on? Yeah. So they're 1.0. So that configuration,

1:25:14 Hint: Dependencies & Release Notes (Containerd)

1:25:53 I think we're chasing our tail. I think 0.3.1 is wrong. Right? Because they're on 1.11 right now if you look at Cilium's release docs. So you can remove alright. I think you can rm -rf /opt/cni and restart the Cilium pods if you want them to redo everything fresh, I think. Yeah. So oops. You're thinking, like, deleting the config right here? No. I was going to delete every single binary in /opt/cni/bin and then restart the so it redownloads them all. Because that's what happens when this starts

1:26:32 as a DaemonSet, it pulls down all the binaries that are needed for it to work. That was nice. I didn't get rid of it. Okay. And so But did you get rid of every binary, or did you just get rid of the one? Only Cilium was modified. So if you come back here to see time stamps on the old ones are 2020. So did they replace them all with old version except for the Cilium CNI, which is why we're having the config mismatch? That is quirky. Yeah. You're right. You're right. Temp, Cilium.

1:27:08 Alright. Let's get out of here. And then we wanna so you're saying if we delete Cilium, it'll redownload up? Yeah. Do a rollout restart daemonset cilium or something like that. Or delete them all. Whatever way you prefer to restart services, you go for it. Do y'all want a hint? No. No. Not yet. I think we're close. I think we're close. I can see you sitting there laughing, though. I'm enjoying it. I know. It's funny because I'm not gonna say anything more, but, I mean, what you said about chasing your tail is absolutely right.

1:27:54 Alright. So, yeah, so like this. Oh, but the problem is, like, now that we did that yeah. Let's just do this anyways. So k get pod. I understand. kube-system. That's our yep. kube-system. Alright. So we wanna get rid of So is there anything in /opt/cni/bin? No. No. So let's see. So we wanna roll out, and then it's set. Restarting. Oh, and then That's the one? And you're thinking There we go. Things are coming back. And then let's see. And those are better time stamps than before. So let's do version here,

1:29:02 and so that's 1.10. And so I think over here, you don't want 0.3.1. Right? Like, we want 1.10 whatever? Well, no. I think that's the CNI version versus the Oh, yeah. You're right. CNI version. Okay. You're right. So now I need to figure out what is the correct CNI version. Right? I don't know. Run get pods. Do we have anything yet? Let's see. Oops. Sorry. Let's get out of do k get pod? No. Oh, so see yeah. So we need to figure out what that correct CNI version is. So I'm

1:29:44 you can't see my screen, but I'm googling real quick to see what the correct CNI version is. Because, yeah, I feel like that CNI version's modified, and we just need to know what the correct version is. 0.3.1's correct. Can I just say that I did not modify the CNIs in any way, shape, or form? Alright. Okay. I've not modified the CNI. Okay. Which means we probably have to do an apt install --reinstall kubelet for any missing binaries then. Oh, well, we got the We only got two back. Yeah. So Yeah.

1:30:36 But we're still getting the same error. So this is the error we're still chasing. Failed to set up the network for the sandbox, and it's saying the config is 1.0, incompatible CNI versions. Config is 1.0, but plugin supports... oh, no. No. So it's trying to use the wrong CNI. Right? Incompatible CNI versions. Config is 1.0, but 0.3.1 is what we should be on. So where's that config at? Don't know. I think we may have to take that hint in a minute. Yeah. So trying to think. Like, where would we

1:31:24 define what CNI version it's going to use when it... do we have more than one? You know? Can I drop one hint for you? Yeah. Go for it. I'll speak to John, if you want a hint, John. Yeah. Yeah. Sure. Run a kubectl get nodes -o wide. Oh. Did you upgrade one of the nodes to Kubernetes 1.24? Yeah. Control... I don't think I changed the node version. Yeah. Control plane's 1.24 right now, and then the worker's 1.23. It's strange. I don't think I upgraded the node, though. Oh, that, I guess that could have been

1:32:02 Identifying Different Containerd Versions on Nodes

1:32:19 me a moment ago. No. I looked at the re... oh, I did that at the... Yeah. You just reinstalled that. Yeah. I've just... Yep. Should be fine, though. Yeah. It should be alright. Although, it's not like there were any major deprecations in 1.24. Right? Oh, yeah. Not like... or anything. If you look carefully at the output, the containerd version is different. Yeah. That's interesting. So, yeah, the worker is on a different containerd version. Could that cause problems? I actually wouldn't expect it to, but I guess if the CNI... like, I can go to the worker and

1:32:59 downgrade. Right? But... Yeah. Let's try it. See what happens. What version was it on? 1.5.9 is what we want. 1.5.9 is what we want. And if it's... I think it's just one equals. Two equals. You're right. Is that two? Okay. Oh, I thought it was two. Always giving bad advice. It's on the newest version. After the reinstall, maybe they just replaced the binary. No. It still says so. Yeah. We need to actually do a reinstall. Let's do version. Alright. So now that's there. And then do I have kubectl over here? That's not... yeah. We need to go over

1:33:03 Downgrading Containerd on Worker Node

1:34:09 the other one. Alright. So back to control plane. I don't think that restarted containerd, if you look at it. Oh, no. Nope. You're right. I'm on the control plane. I need to go back to the worker. Yep. K. Now back. Hey. Running. There you go. Eighteen seconds to spare. That surfaced itself in an interesting way because it's saying that the CNI was a different... it wasn't compatible, but it had nothing to do with the CNI. So I'm assuming containerd also must be what specifies the CNI version, and there must have been a bump then from

1:34:50 Discussion of Containerd/CNI Bug

1:35:00 0.3.1 to 1.0.0 with containerd 1.6, I'm assuming. So that error was actually intentional. I decided to use a nonworking version of containerd against this Cilium CNI. It was, like, a known bug. So you would have either had to go up to 1.6.4 or revert back to 1.5.9. Glad you got there, but if I hadn't said anything about the kubectl, y'all would have been chasing a little bit longer. Yeah. We would have... I mean, I would have missed... I wouldn't have caught that. So... and that's really, you know, you both

1:35:36 gave us clusters that stumped me completely, which is great because this is how we learn to invalidate some of those assumptions we make along the way. Yeah. The most fun I've had is, like, breaks like this where it tells you something, but the problem's completely something else. Like, the one I did where I changed the scheduler mount. It's like, oh, RBAC's broken. You don't have the correct permissions, and it has nothing to do with RBAC whatsoever. I know. We need better error messages. Yeah. That is pretty telling because

1:35:45 Discussion of Kubernetes Error Messages

1:36:08 you were basically chasing after a CNI-related version when, in fact, like, it had nothing to do with it. Right? And it was pretty interesting to see. And even the certificate one, like, if I hadn't said anything about that, it could have been going on for a while. But it's not very verbose about where to really look. Right? And that's like, okay. Maybe we file a bunch of PRs against the Kubernetes ecosystem to say, let's make this a lot more human-readable so that we can troubleshoot better. Right? But then again, this is why tools like Komodor

1:36:39 exist and whatnot. Definitely. Those error messages can be very opaque at times, and I think, yeah, maybe we should be using Klustered as a means to file issues against some of this stuff and see if we can get people to open some pull requests, because this stuff is already hard without the tooling fighting against us at times. So, you know, well played both of you. Alright. We're gonna wrap this up. Thank you both for joining me. Those were two terribly mean clusters, but you both absolutely smashed it. Well done working through all of those

1:37:04 Closing Remarks & Thanks

1:37:13 problems. Bringing another cluster, it was just plain cruel, John. Yeah. Marino finding an actual bug in containerd and bringing that to the party was equally as cruel. So well done. Thank you to Equinix Metal. Thank you to Teleport for their continued support. And we won't be back next week because of KubeCon, but I hope I do get to see some of you there, and we'll be back the following week with Klustered Teams. Alright. Any last words, John, Marino? No. I had a lot of fun. Yeah. I'd like to say the same. I had a great time. I learned a lot.

1:37:47 In fact, learning on the fly as I'm troubleshooting, and we've got that pressure on top where, like, why isn't it... why isn't it working? And you've got people in the chat. It was a very great experience. So thank you so much for setting this up, David. And great meeting you, John. This was fun. Well, I do need more victims. I mean, volunteers. So anyone watching, if you wanna come on Klustered, get in touch. Well, I feel like now that we're acquainted, we can be more evil with each other. So let's just do it again

1:38:16 with no holds barred. Yeah. That's an idea for future episodes, actually. Just getting people back and saying, right, the rules are gone. Do your worst. We'll see. Alright. Let's go to Chris. Let's go to... Thank you again both. Have a great time and a great day, and I'll see you soon. Adios. Bye all. See you.
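Editor's note: the hint that resolved the last break (around 1:31) was to compare what each node reports for its kubelet and containerd versions. A minimal Python sketch of that comparison follows, assuming the input is the JSON from `kubectl get nodes -o json`. The sample data, node names, and exact patch versions are hypothetical; the `kubeletVersion` and `containerRuntimeVersion` fields under `status.nodeInfo` are standard Node fields.

```python
import json
from collections import defaultdict

# Hypothetical, trimmed-down output of `kubectl get nodes -o json`.
# Node names and exact patch versions are made up for illustration.
SAMPLE_NODES_JSON = """
{
  "items": [
    {"metadata": {"name": "control-plane"},
     "status": {"nodeInfo": {"kubeletVersion": "v1.24.0",
                             "containerRuntimeVersion": "containerd://1.5.9"}}},
    {"metadata": {"name": "worker"},
     "status": {"nodeInfo": {"kubeletVersion": "v1.23.6",
                             "containerRuntimeVersion": "containerd://1.6.0"}}}
  ]
}
"""


def nodes_by_field(nodes_json, field):
    """Group node names by the value of one status.nodeInfo field."""
    grouped = defaultdict(list)
    for node in json.loads(nodes_json)["items"]:
        value = node["status"]["nodeInfo"][field]
        grouped[value].append(node["metadata"]["name"])
    return dict(grouped)


def has_skew(nodes_json, field):
    """Return True when nodes report more than one distinct value."""
    return len(nodes_by_field(nodes_json, field)) > 1


if __name__ == "__main__":
    for field in ("kubeletVersion", "containerRuntimeVersion"):
        print(field, nodes_by_field(SAMPLE_NODES_JSON, field))
```

A difference in versions is not a fault by itself. In this episode it was only the pointer that led to the containerd downgrade.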
