Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Diagnose Kubernetes pod restarts by checking resource limits, liveness probes, and logs to stabilize workloads.
  2. Trace sabotage-caused outages by inspecting mutating webhooks, NodePort routes, and unexpected processes on worker nodes.
  3. Recover API and database access by fixing kubectl execution, kubeconfig protocol, and PostgreSQL service port mappings.

Teams Kong and Armo trade broken clusters, debugging liveness probes, Postgres ConfigMap init scripts, a malicious mutating webhook, a rogue NodePort process, a fake kubectl alias, and a kubeconfig HTTP/HTTPS swap.

Chapters

  1. 0:00 <Untitled Chapter 1>
  2. 1:23 Introduction to Klustered
  3. 1:47 Sponsor Thanks (Teleport, Equinix Metal)
  4. 2:35 Introducing Team Kong
  5. 2:39 Team Kong
  6. 3:01 Team Kong Introductions
  7. 3:55 Granting Access to Armo Cluster
  8. 5:08 Beginning Debugging (Team Kong)
  9. 6:06 Initial Pod Status Check (Restarts)
  10. 6:23 Checking Events (Liveness Probes)
  11. 7:33 Problem: Low Resource Limits
  12. 8:06 Fixing Resource Limits
  13. 13:06 Debugging Postgres CrashLoop
  14. 14:11 Problem: Bad Init Script in Postgres Logs
  15. 14:47 Hint: Postgres ConfigMap
  16. 15:29 Fixing Postgres ConfigMap (Bad Script)
  17. 16:35 Problem: App Image Reverts to V1
  18. 17:54 Investigating Mutating Webhook
  19. 21:07 Fixing Problem: Deleting Malicious Webhook
  20. 21:44 Restarting Pods
  21. 22:20 Problem: Service Unreachable (Seeing GIF)
  22. 22:55 Debugging on Worker Node (Containerd)
  23. 27:40 Testing App Inside Pod (Curl)
  24. 28:56 Observing Service Behavior
  25. 29:37 Investigating NodePort Service
  26. 31:55 Reviewing Hints
  27. 33:58 Investigating Rogue Process on NodePort
  28. 34:21 Problem: Python Process Blocking Port
  29. 34:29 Fixing Problem: Killing Rogue Process
  30. 35:05 Team Kong Success & Wrap-up
  31. 36:06 Introducing Team Armo
  32. 36:58 Team Armo Introductions
  33. 41:37 Introductions
  34. 43:00 Beginning Debugging (Team Armo)
  35. 44:57 Problem: Malicious Kubectl Alias/Binary
  36. 46:00 Locating Real Kubectl Binary
  37. 48:15 Inspecting Fake Kubectl Script
  38. 48:40 Hint: Kubectl Help
  39. 50:06 Fixing Kubectl Binary
  40. 51:04 Problem: Kubectl Cannot Connect to API Server
  41. 52:32 Investigating API Server Status (Netstat)
  42. 53:32 Problem: API Server Not Running
  43. 54:19 Debugging Static Manifests
  44. 55:00 Inspecting API Server Manifest
  45. 58:55 Fixing Problem: Restarting Kubelet
  46. 59:53 Debugging Kubectl Verbosity (HTTP vs HTTPS)
  47. 1:00:59 Investigating Kubeconfig
  48. 1:01:31 Problem: Kubeconfig HTTP vs HTTPS
  49. 1:01:41 Fixing Kubeconfig Protocol
  50. 1:02:45 Problem: Worker Node Scheduling Disabled
  51. 1:03:57 Hint: Cordon and Uncordon
  52. 1:04:11 Fixing Problem: Uncordoning Node (with Host Help)
  53. 1:07:34 API Server Configuration
  54. 1:10:39 Pod Stuck in ContainerCreating (Image Pull Error)
  55. 1:11:00 Investigating Tab Completion Trick
  56. 1:11:31 Testing App Pod Status (Running)
  57. 1:12:34 Curl from Control Plane (Connection Refused)
  58. 1:22:51 Problem: App Cannot Connect to Database
  59. 1:23:33 Investigating Postgres Service
  60. 1:24:09 Problem: Incorrect Postgres Service Port
  61. 1:24:58 Fixing Postgres Service Port
  62. 1:25:31 Team Armo Success & Wrap-up
  63. 1:26:25 Host Wrap-up and Explanation (UFW)
  64. 1:26:51 Conclusion & Thanks
Transcript

Generated from the English captions.

1:23 Introduction to Klustered

1:23 Hello, and welcome back to the Rawkode Academy. My name is David Flanagan, although you'll know me as Rawkode, and there's a nice big articular area as I took this arm. Thanks. This is Klustered. We have two great teams today competing, trying to fix each other's sabotaged and broken clusters. Now, before we get started with that, I do just need to say hello and thanks to our sponsors. So first up, we have Teleport. We've been using Teleport on Klustered since the very beginning. It's how we secure access to the servers. It's how we collaborate and pair on the

1:47 Sponsor Thanks (Teleport, Equinix Metal)

2:01 debug session and fixing these clusters. And it's just an absolutely wonderful tool that I genuinely believe every kind of Kubernetes cluster should have installed. So go check it out. It's just the best product. I'm so happy with it. I also wanna thank Equinix Metal. I see all the comments in the chat. I'll get to you in a second. I also wanna thank Equinix Metal. They provide all the hardware that we use on Klustered. We go through six heavy duty bare metal machines every single week with tons of cores, tons of RAM. Why? Well, because it makes it more fun. So

2:32 thank you to Equinix Metal. Alright. Let's jump over to our first team of the day. We have Team Kong. Hi there. How are you both? Good. Good. Good. Alright. Well, thanks for having us. No. No. It's my pleasure. Definitely. And I've got alarms going off. Like, this is just the challenge of being in the middle of a random industrial estate. However, we'll just ignore that. Could you please both say hello, introduce yourselves, and then we'll get started with today's session? Yeah. Sure. I'll go first. My name is Kat Morgan. I am with Kong. Colin and I are both joining from Kong.

3:01 Team Kong Introductions

3:09 I am in the Pacific Northwest US and happy to be here to see if I could actually live up to expectations trying to solve this cluster stuff. It's madness. It's all madness. Colin? Thanks, Kat. Yeah. My name is Colin Loader. As Kat said, we're both from Kong. We're field engineers within Kong, so we are kind of, like, on the professional services side of things, helping out customers with issues they have with Kong. I came from both Red Hat and VMware, working more on the Kubernetes side. So I've been in the Kubernetes world for the last few years. So, yeah,

3:46 excited to see what the breaks that we have are that we need to fix. It's gonna be fun. Awesome. Thank you both for sharing. Well, your cluster was broken by Team Armo. They will be joining us later to fix the cluster that you have prepared for them. I am going to pop up my screen share here, and I am going to give you access to the clusters. So I'm gonna edit the Team Kong role. And now you have access. Nice and simple. So I will open the first session on the Armo control plane. If you could please

3:55 Granting Access to Armo Cluster

4:28 both use the activity page and active sessions, join the one at the bottom. And if you could just do me a favor and type echo hello or anything so that I know you're in here and then we will get things underway. Alright. Like, we're flying already. Alright. I wish you the best of luck. I can encourage you to export your kubeconfig, set up any aliases that you need, and check for the control plane. Best of luck. Alright. Straight into tmux. No. I'm gonna use that muscle memory and be mad if it's not there. So,

5:08 Beginning Debugging (Team Kong)

5:19 Colin, do you wanna drive first or you want me to dive in? Sure. Yeah. I can let me start some stuff. Let me just add some completions and whatnot first. And if we get it, I also will mess up if I don't have that. Yep. And I think this is the right syntax. Q t p l k, I believe. Alright. And then we can take. I didn't know you could set up the auto complete for the k. That's pretty cool. Oh, yeah. It's a lifesaver. Oh, it's admin.conf. It's not. And also Kubernetes. Kubernetes. Alright. So let's see what we have first.

6:06 Initial Pod Status Check (Restarts)

6:11 Alright. K. Well, we have a lot of restarts. That's always fun for both the app and Postgres. Alright. What do you wanna check first, Kat? We can go ahead and check events in the namespace if we want to. Alright. Liveness probes. That's easy enough. Ready. That's a bold statement right off the bat. Yeah. That's actually yeah. So I probably shouldn't say that. But okay. So liveness and readiness. I'm gonna check the deployment. Let's see. Deployment. Klustered. Alright. So we have for liveness probe, /health on 8080. Doesn't look bad to me, but

6:23 Checking Events (Liveness Probes)

7:22 I believe they're okay. Okay. Failure thresholds look okay. Alright. Well, I do see something already. Yeah. It's just the virus. CPU. I don't like that. Okay. I mean, it's a Rust application, but it doesn't require a lot of resources. So I'd say That's a lot of faith in that application. Alright. This Equinix Metal better really hold up. Okay. So let's start with getting rid of those. Are you just gonna delete out the limits? I sure Who needs limits? Yeah. Who needs resource limits? Just remove them. Okay. So let's see what that does.
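The fix discussed here, deleting the resource limits from the Deployment, can be sketched in Python. This is only an illustration under stated assumptions: the manifest is treated as an already-loaded dictionary, the function name `strip_limits` is hypothetical, and the container name and limit values in the sample are made up, since the captions do not show the real ones.

```python
import copy


def strip_limits(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest with all container resource limits removed."""
    patched = copy.deepcopy(deployment)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        # Requests are left in place; only the "limits" key is dropped.
        container.get("resources", {}).pop("limits", None)
    return patched


# Hypothetical manifest fragment; the values are placeholders, not the ones from the session.
deployment = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "app",
                        "resources": {
                            "requests": {"cpu": "10m"},
                            "limits": {"cpu": "1m", "memory": "16Mi"},
                        },
                    }
                ]
            }
        }
    },
}

patched = strip_limits(deployment)
print(patched["spec"]["template"]["spec"]["containers"][0]["resources"])
```

Removing limits entirely is what the team chose to do live; raising them to sensible values would be the more conservative option on a shared cluster.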

8:06 Fixing Resource Limits

8:31 Okay. Well, that's better ish. Seems happier. Yeah. But we didn't see we didn't see anything for the Postgres pod on events. I yeah. I didn't see it. Because that's all on the cluster pod. Okay. Well, one thing we can check, what are these this is on Port 30000. Right? Yes. Port. Okay. Okay. Well, we can at least connect to the app. Obviously, we know that database connection is gonna be not good now since we're not check? Would you like me to open it? Just Sure. Yeah. Yeah. We can see the That'd be fine. The image.

9:25 This is the Armo cluster. Right? Hey, Colin. Can I check and see what pods the or what nodes our pods are on? See if they're on the same nodes or not? Yeah. Absolutely. Alright. Connection with you. They are on different nodes. Oh, wait. I don't know. We might Yeah. That's something definitely to check. So we can look at I assume this is a deployment as well. Postgres is a StatefulSet. StatefulSet. Yeah. Alright. What is in here? So I don't see a problem with the health checks. Not yet. Okay. Let's keep going. Health.

10:45 Let's one thing we might check next is our storage to see if there's a PVC available. Defaulting to this. I don't know. I didn't actually see So the one is being used. There's no CSI configured. No persistent volumes. I used it. That's right. That's right. Okay. It's an emptyDir Postgres. Okay. Okay. Got it. Maybe something else in here that is off. Yeah. So I don't see anything here, so maybe it's not that. Can you Can check our actually, I was just wondering if the image I didn't check the spelling and stuff on the image if that's there or not.

11:53 I I think it was okay. It looks like it was just the Postgres 13. 13 outbound. That looks okay. Yep. Looks good. Think. No. No. That looks good. I assume that's the default. I mean, we can see if there's any doubt that this is gonna have any logs. So the thing that you're debugging right now I mean, I pulled up the page. So just make sure you're working on the right header here. Server connection timed out. I mean, I think we have to get past this since the it it's not even started. Okay. But I guess if it is timing out,

12:43 that does point to a network issue as well. Kat, do you have any ideas or anything else you wanna check? Yeah. Let me jump into So your Postgres pod is in CrashLoopBackOff. What does that mean? Something's preventing it from starting. Preventing it from starting? Well, that means it's not running or it's not passing liveness. So But it was running and now it isn't. Maybe log? Actually, we might here. So if we do that Navin Kumar is saying, check the log. Okay. So we should have You'd be surprised how often that actually fixes

13:06 Debugging Postgres CrashLoop

14:03 these things on Klustered. Just delete them. Usually quite effective. Can we look at the logs, Kat, for that? Yeah. Hang on. I think I've been relying a little too much on Lens lately, and my kubectl is a little rusty for debugging. Okay. Bad init script. That's interesting. We could check that. There's no. They wouldn't have changed because we don't have persistent. The command could have been edited by chance. So the way the Postgres works on Klustered is there is a ConfigMap mounted into the container, which is put into a special There is a ConfigMap. That Postgres will load.

15:29 Fixing Postgres ConfigMap (Bad Script)

15:29 Is it as easy as getting rid of the echo and the exit? I mean, those look suspicious. Right? They do to me. Yeah. Oh, well, yeah, there's no test unit test on them. Alright. It's looking a little better if it comes up. Well, that's two weeks in a row we've had people attack Postgres, which is the first two times we've ever had anyone attack Postgres, believe it or not. It's getting popular then. Alright. So David, do you mind checking to see if it's at least connecting to the database now? It looks like it is on the curl,

16:28 but okay. So it looks good. You've got the the watch tab. So yeah. Alright. So let's go for editing the deployment then for cluster then just change it to v two. I assume it's not gonna be that easy, but try it. Oops. Sorry. Oh, you're good. And it should be seconds ago. It's does that doesn't look good to me. It's still watch tap. So v one. The curl will show you the file name and the title. Yep. V one and v one. So yes. Alright. So let's take a look at that pod and just verify that it is

16:35 Problem: App Image Reverts to V1

17:26 if it's changing it back to v one or if we have bigger problems. Okay. So just changing it back to v one. But it's v two That one's fine. So that feels like it's does that seem like a mutation to you? Can we check that? Go for it. I don't know what the CRD name is. Yeah. Carlos in the chat. It's not a CRD, but The mutation webhook. It's a mutating webhook configuration. Mutating webhook? Oh, there it is. Okay. There's a Cilium one. Let's see what's in there. Failure policy. I Is that the whole thing?
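An image that silently reverts to v1 after the Deployment is edited is consistent with a mutating admission webhook rewriting the pod spec. As a rough sketch of the mechanism (not the actual webhook from this cluster, whose code is not shown in the captions), a webhook answers an AdmissionReview request with a base64-encoded JSON Patch. The function name `force_image_response` and the image reference are hypothetical.

```python
import base64
import json


def force_image_response(review: dict, image: str) -> dict:
    """Build an AdmissionReview response that rewrites the first container's image."""
    # JSON Patch (RFC 6902) operation applied by the API server to the incoming Pod.
    patch = [{"op": "replace", "path": "/spec/containers/0/image", "value": image}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            # The response must echo the uid of the request it answers.
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


# Hypothetical request and image name, for illustration only.
review = {"request": {"uid": "abc-123"}}
response = force_image_response(review, "registry.example/klustered:v1")
print(json.loads(base64.b64decode(response["response"]["patch"])))
```

Because the patch is applied on every matching pod creation, editing the Deployment alone cannot win; the webhook configuration itself has to be removed or corrected.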

17:54 Investigating Mutating Webhook

18:59 Yeah. I think so. That's all I see. What are you thinking? I don't know for sure. I would say that that's probably not real. But it's called Cilium. You think they have a sneaky n in their real name? Yes. Oh, we haven't, like, looked at labels yet. Alright. So you don't think the webhook is real. What's your options? Delete it and break the cluster more. Investigate it. I'm wondering so I know that in registries.conf, you can create aliases and things like that, That would be on the worker nodes themselves

20:04 if we were doing that, though. What do you think the aliases are doing? Go back to. What's that, Colin? I kinda wanna go back to the to that webhook, though, because I I mean, I don't know I don't work with these a lot, so I I don't know for sure. But, like, I don't like this URL thing. I kinda wanna see what's what's there. Curl it. Okay. Not allowed for this. Okay. Python. Oh, okay. Python application and root stuff Kubernetes webhook example. Yeah. I I don't feel like this is good, like, a real thing. So

21:06 I'm I'm just leaning towards deleting that webhook, to be honest. Go for it. I encourage it. We can we can we can save it off if we wanted to first. But, yeah, if we would just wanna live crazy, let's go for let's delete it. I guess so. I'm happy to confirm that. I don't think n as in Cilium is written in Python. So Yeah. That seems weird to me. That's true. Okay. I'm deleting this. Alright. Do we want us to smash our pod again, or do we need to I don't yeah. Yeah. I'm gonna delete that and have it

21:44 Restarting Pods

21:47 pull a new one and see if the image has changed. We need to flip the Klustered pod, not the Postgres pod. Oh, yeah. Right. Sorry. Okay. Just delete all. Delete --all. I mean, what bad thing ever happened from deleting all the pods in our namespace? Now. David, do you mind checking Happy tip. The browser. Okay. Alright. That's not a good sign. That's funny, though. That's awesome. Nice work, Team Armo. Nice work. Very well done. Makes me think that it is messing with containerd. The image. Yeah. So because it's the correct well, ostensibly,

22:20 Problem: Service Unreachable (Seeing GIF)

22:50 it's the correct URL to the image. Yeah. So we can go ahead and jump over to the worker node that that's scheduled on Yep. And jump into our registries.conf for starters. So they're both on worker two. K. David, do you mind creating a session for worker two? Armo worker. Alright. Okay. I do not remember where that lives. Is it let's see. No. I think I can find it. Okay. Go for it. Oh, snap. The Armo team are laughing in the chat. It's just FYI. Alright. That's fine. Well, actually, there is a good point. Local image which is used. I

22:55 Debugging on Worker Node (Containerd)

24:02 didn't check to see what the Wait. Policy was. But Oh, shush. That's a good call. We could just change it to always just to be safe. Do what? We could change the image pull policy to Always just to be safe, make sure it's not using a local I believe it was Always the last time you edit Oh, was deployment at least. Okay. You can check, but I'm pretty confident I did see it. This is Yeah. Okay. Yeah. Okay. So, yeah, we'll need to probably look at the history. In that case, I'm going to or

25:05 because we're not running see, I'm coming from, like, OpenShift mindset where there's, like, an image stream policy stream craziness, but this is not OpenShift. So I wasn't getting yeah. So there should be a registries.conf, But it's not where I was thinking to find it. I think that may be an OpenShift centric thing. What does registries.conf do with OpenShift? And I'll see if I can correlate it back to kubeadm. Okay. So well, on I know on CRI-O that registries.conf gives you the ability to, like, set insecure registries and but maybe it's in containerd Yeah. But

25:59 I'm not finding any containerd directory. No /etc/containerd directory. Yeah. So containerd does allow you to dump its config, which would show you if there were any aliases in that regard. Okay. So I can't remember the exact syntax. But I think it's container. Learn containerd? If you run containerd --help, it should give you what you needed. But there's a, yeah, config dump as Okay. That's the containerd as a default. Yeah. Yeah. Because, likely, they haven't configured containerd in that way. Oh, wait. It's just config. Yes. Update. Config.

26:50 Yeah. And then dump. So that filter with TOML means they have not tweaked containerd in any way with fake registry. So it's just running normal. Okay. Yeah. So if they're not pulling a fake image in place, what other options out there? Yeah. That wouldn't have worked any I don't know. Why don't you confirm something? Can you exec into the Klustered pod and curl localhost? Yeah. Oh, sorry. Colin, go for it. Oh, sorry. Do you jump into the other container? Yeah. You need to jump into the Klustered pod, which does have, I believe, sh available.

27:40 Testing App Inside Pod (Curl)

28:12 Oh, okay. I don't think it's slash s h, so we'll just try. And curl should be there. 8080? 80. Oh, what we see here is video v two. Yeah. Okay. Okay. So I don't know what can The pod looks alright. Oh, unless it's But if we browse to DNS. Yeah. We get that. Wait a minute. You is it Okay. The image is fine. The pod workload is fine. A curl to that IP address on that port is not fine. So the Klustered service runs as a NodePort on port 30000. I would maybe start there.

29:37 Investigating NodePort Service

29:50 Okay. Like, is my service still configured correctly? I guess you're in the control plane. Yeah. Oh, yeah. I'm sorry. That's alright. I'll keep up. Looks alright. So how else would it be sending traffic somewhere else? I'm sitting here thinking of all the ways that, like, an ingress controller could do all of this stuff, but, obviously, we don't have that in play. So we haven't looked to see we've only looked at the default namespace. We haven't actually seen what else is running in other places. Fair. Well, we could check, yeah, other services. There are hints as well if required.

31:07 Is it just me or is the TMUX very small? Yeah. I just had to kinda jump around. I think the TMUX is causing some dying. Issues. Oh, if you resize your window just a little bit, it'll refresh the the TMUX frame. Okay. So we're gonna do is there anything else? No. Okay. So I think that hint is for the webhook. Maybe it should be a hint one or two. Yeah. Yeah. I think each one's referencing a different thing, but so labeling things is not nice. We haven't done anything with labels yet, but no. I think number one is the

31:55 Reviewing Hints

32:19 yeah. Don't know. Those are very subtle hints. They're correct when they said that. Right? So It's one thing to live and other things to answer. Oh, okay. So hint one is subtle hints, but for these four different problems. Right. So okay. Each one Okay. And then each. So it's a yeah. A series of related hints. So those two middle ones seem relevant. Who owns has the right to answer, and it's one thing to live and another to answer. So our pod is alive and running, and we verified it, but it's not answering the HTTP request.

33:01 So I think that one had to do with the limits that were set because that's the same number in each of those. It's like the it is tight here. I can't remember. That has to do with the limits. So I think that one's it, but it's the second one that we need. So everything has to do with K8s. Python Python is everywhere. Oh, not everything has to do with Kubernetes. But Yeah. So how does NodePort work? If we're seeing that any request to this IP address Oh. For that port gets redirected to the service. We

33:39 could well, the pod wouldn't start if something else was listening on that port that was not in Kubernetes. Well, the pod isn't exposing a high port number. I think you're onto something there. K. Alright. Can see what else is running. Alright. I'm on the worker. Oh, next steps coming up. Alright. No. 30000 is the port we're worried about just now. That's right. Sorry. More Python. Okay. I'd say it's probably pretty safe to kill that process. Yeah. Okay. Straight in with a SIGKILL. Hope it's not supervised. Alright. How's our service looking now? Have a look.
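The check performed on the worker, finding which process is listening on the NodePort, can be sketched as a small parser. This is a hedged illustration: it assumes output in the style of `ss -tlnp`, the captions do not show the exact command or output used, and the sample text, process names, and PIDs below are invented.

```python
import re


def listeners_on_port(ss_output: str, port: int) -> list[tuple[str, int]]:
    """Return (process name, pid) pairs for sockets listening on the given port."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skip the header and anything that is not a listening socket
        local_address = fields[3]
        if local_address.rsplit(":", 1)[-1] != str(port):
            continue
        # The process column looks like users:(("name",pid=123,fd=3))
        for name, pid in re.findall(r'\("([^"]+)",pid=(\d+)', line):
            found.append((name, int(pid)))
    return found


# Invented sample output, for illustration only.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      5      0.0.0.0:30000      0.0.0.0:*         users:(("python3",pid=4242,fd=3))
LISTEN 0      4096   127.0.0.1:10248    0.0.0.0:*         users:(("kubelet",pid=901,fd=20))
"""

print(listeners_on_port(SAMPLE, 30000))
```

A userspace process bound to the node's port 30000 can answer requests before NodePort traffic reaches the service, which matches the symptom of a healthy pod and a wrong page in the browser.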

34:29 Fixing Problem: Killing Rogue Process

35:02 You got the dance. Much better. There we go. Alright. That was That was fun. That was Those are cool. That was really fun. We have ten minutes left. Well done. See? Good fun. Right? That was that was really good. What I'm curious about is do you think you have been meaner? You know what? First, Colin, virtual high five. Yeah. Absolutely. Yeah. Alright. And now I'm nervous if we've been too mean. I don't know. We'll find out in a couple minutes. Alright. Well I Sorry to go. Yeah. No. No. Okay. Well well done. You both absolutely smashed that. So

35:05 Team Kong Success & Wrap-up

35:50 thank you for joining me. I'm going to say goodbye and invite the Armo team to come on and see what Team Kong have for us to debug. So thank you again. Great job, and I'll speak to you both soon. Thanks. Bye. Yep. Thanks, David. See you. Alright. So, Armo, please come on and join us. Great breaks there. That GIF is gonna have me laughing for a long, long time. And let's get our sessions ready. Alright. Hey, Ben. Hi. How are you doing? Nice GIFs. Yeah. That was funny. Hope you like this. Yeah. That was funny. Very well done.

36:06 Introducing Team Armo

36:43 Some good breaks there. I like that a lot. And we've got some more teammates joining there. Oh, there we go. Alright. Hello. Hi. How are you? Doing very well. Thank you. Thank you for joining us. I think we're just waiting on one more. And then we'll start the introductions, get you access to the Kong Cluster, and then we'll see how that goes. Are you impressed with Team Kong? They did a great job. Very nice. Wow. Wow. That was awesome. Okay. So there is a problem with login. Wait a second. I haven't given you access yet. Wait a second for

36:58 Team Armo Introductions

37:34 let's check her check with her. Yeah. She said she's saying she has problems with login. Wait a second. I think that she's using the not the right link. Alright. Alright. Well, while we wait, I will just enable access to the cluster. So I'm gonna edit the role for Team Armo. Okay. Rawkode is moving to a different machine. She's going to be with us within a few Yeah. No worries. Seconds. We're all good here. Oh, we're trying Kong control plane one. So I have opened a session. If you all just want to use activity active sessions

38:41 and join, type echo hello to let me know that you're there. And then when we have a full team, we will get the introductions underway. Okay. So this is the last one. Right? Which one is it? Control plane one. So you go to the very bottom of the list. So activity, active sessions, and then at the very bottom, you will see a Kong control plane one session. Awesome. And I will join it too. Because sometimes Vim looks a bit weird on the CLI. Oh, no. We have multiple sessions because I have now joined one that isn't my one.

39:51 I would I tried the last one. Oh, yeah. It's it's not the last one. It's Oh, okay. Control plane one. Yeah. Yeah. I guess. I think someone just accidentally started a new session, which threw it off a little bit. But yeah. I think we Wait a second. Wait a second. Okay. So I I I see one that where You should see Rawkode, hello world, Rawkode echo. Yeah. David, my you told me you are in the wrong you are in the wrong one. Sorry. Oh, I'm in the wrong one. Okay. Yeah. Go to the go to active sessions

40:29 in under activity. Right. I'm under activity. There's Kong worker two. That's the last one of the list. Yes. So the one before that is Kong control plane one. Let's jump onto that one. I think there's a few active sessions. Few which is open for two minutes. You see me? Yeah. You are not in the right one. I mean, at least on the control plane one. Yeah. Yeah. But you're opening your own session instead of joining a session. Yeah. Yeah. So when just go exit this session. Okay? And go to active sessions. You should see which

41:13 users are in each session, and you should see David as Rawkode, We have slash Ben. Okay. Mhmm. This one here. Here we go. I'll just close all the other sessions open here. Yeah. I'm gonna have to start restarting teleport before we go live. Alright. Let's start the introductions just now. We'll start with you Ben and then we'll just work our way around. Please feel free to say hello, tell us a bit about yourself and then I will get you onto this cluster. Yeah. Hi, everyone. I I'm Ben from Armo, managing the engineering here. We are the maintainers

41:37 Introductions

41:56 of Kubescape, and, you know, we came here to have fun. Awesome. Thank you. That's true. Alright. Whoever wants to go next, David, Rawkode. Yeah. So I'm David. Also I work in Armo, and I'm also here to have fun. Let's see what is waiting for us. Oh, here we go. I found it. Okay. Yeah. I finally logged in. Oh, welcome. Yeah. Thanks. I'm Rawkode. I'm also from our team, and we're all very fun people. So I also want to help them. So that's about it. We like GIFs. Yeah. We have Patel here as well.

42:53 Hi. I'm from. And I'm here to have a lot of fun. Alright. Thank you all. But I think we are now all on the same session, so I'm going to bring my screen share back. I would encourage you to set up any aliases, configure the kubeconfig, and check the control plane and best of luck. Okay. David, do you want to start? Yeah. I can start if you want. Let's see with the copy paste work over here. Why not? Oh, not real. Not really. One second. Oh, my bad. Oh, it's this. Let's see. Are we are we? I think Yeah.

43:00 Beginning Debugging (Team Armo)

43:50 Are we in dashboard? Yeah. It's batch. What was your command that you did? Yeah. Oh, we have tried. Okay. I think the first you have just to export the kubeconfig. Oh, we need to export the kubeconfig. You Export kubeconfig. Okay. And And it's /etc. Yeah. You saw it there. And And Kubernetes. This is Alright. Give me a sec. We're losing time. Yeah. You can take it. Here. What's with the Oh, I think you found your first Yeah. Yeah. I think this is a might be

44:57 Problem: Malicious Kubectl Alias/Binary

45:04 the issue. Okay. Okay. So we have alias. Check for alias. Let's see. Echo. No. Let's try this. Okay. Now just write You can just run the alias command to list aliases. Yeah. Just yeah. Write alias. Okay. It's not that? No. That's okay. I don't see here anything. K. Let me try something. Okay? Sure. Go for it. Okay. Nice. Sure. Yes. I told yeah. Try where it's Yeah. Yeah. Yeah. No. It's okay. We are removing removing games. Nice games, by the way. Yeah. Very funny. Oh, wait. Should I copy paste for you? Oh, here we go. Yeah. It's it's fun.

46:00 Locating Real Kubectl Binary

46:25 Are you going to look inside the file at least? Mhmm. Oh, okay. Here we go. Oh, this also explains the auto complete. It does did. No. It's still not Let's try whereis kubectl. Wait a second. Let me there's shrinking. No. That it's still not the one. So, yeah, it's usually Let's try. Echo. What do we have here? Okay. What a mean thing to do. Let's try Oh, because it no. Because it's returned it from the .bashrc. Sorry. So or somewhere. Wait a second. Let's see. You could always absolute path your

47:29 kubectl. Yeah. That's Another option is to overwrite what they put over there in the games and just to copy there the kubectl. That's a good idea. Yeah. That's a good idea. Let's try doing that. This will probably solve it pretty quickly. ll /usr/local/games. Where is the real one? Let's try whereis kubectl, the whereis command. whereis? Yep. kubectl. Okay. Great. So we have /usr/local/sbin/kubectl. Let's cp to /usr/games. Come on. Let's at least look at /usr

48:14 games kubectl. Open it in Vendor. Okay. sbin. In curiosity. Oh, let's see what yeah. Let's see if this works. Did they What do mean? Oh, they maybe changed the name of the binary. That's also possible. Yeah. But, also, it's usually fine. Not nothing has been. So Kat from Team Kong has asked if help is actually helpful. Or maybe kubectl help will give you a hint. Oh, that's interesting. Okay. kubectl help. Let's see what happens. Oh, maybe we have over there another joke. Or we download simply. We should install

48:40 Hint: Kubectl Help

48:59 kubectl from scratch. Well, let's look at the script. Come on. Let's I wanna know what I've done. Oh, okay. You're saying it back. Yeah. I mean, Kat and it seems a bit brave, but go for it. What why you say it's crazy? You said you thought it was a script. Okay. ASCII. That was good. You can cat it. Yeah. That's a yeah. Okay. Oh, it's Yeah. Yeah. Okay. No. It's funny. Now let's see what the echo from. So it's from kubectl they oh, they simply just let's see. No. Okay. Now the other kubectl

49:51 let's see what the other kubectl is. Yeah. In /usr/local/sbin. I'm assuming that's possibly the same. Right? Yeah. So I would download, David, the I would download simply the root of kubectl. Okay. Let's do that. Download kubectl. Yeah. Zero trust Kubernetes. That's what we need. Yeah. So then you want me to do it? Well, whatever. Just copy kubectl to where the in this game directory. Let's continue with the nice game directory. Yeah. Let the games begin. Are you using it? Downloading it. Let's see. Wait a second. Oh, are you also?
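The underlying trick here is that a fake `kubectl` placed in a directory earlier on `PATH` shadows the real binary. A minimal Python sketch of listing every match in `PATH` order follows; the helper names `find_all_on_path` and `demo` are hypothetical, and the demo uses throwaway temporary directories rather than the real paths from the session.

```python
import os
import tempfile


def find_all_on_path(command: str, path: str) -> list[str]:
    """Return every executable named `command` on `path`, in lookup order."""
    matches = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


def demo() -> list[str]:
    """Create two fake kubectl files and report which directory wins."""
    with tempfile.TemporaryDirectory() as root:
        directories = []
        for name in ("games", "sbin"):
            directory = os.path.join(root, name)
            os.mkdir(directory)
            binary = os.path.join(directory, "kubectl")
            with open(binary, "w") as handle:
                handle.write("#!/bin/sh\n")
            os.chmod(binary, 0o755)
            directories.append(directory)
        matches = find_all_on_path("kubectl", os.pathsep.join(directories))
        # The first entry is the one the shell would actually run.
        return [os.path.basename(os.path.dirname(match)) for match in matches]


print(demo())
```

Calling the binary by its absolute path, as suggested in the session, sidesteps the lookup order entirely.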

50:06 Fixing Kubectl Binary

50:54 No. Wait a second. Someone has put something here. Let me try something. Okay? Cool. Sure. Okay. Get pods. Mods. Okay. That's better. Okay. So this is great. Okay. This is something. Okay. Now Right. I Now let's copy this. Now let's copy kubectl to the games. Right? Okay. That's okay. So it's not connecting the API server. Right. We need to find the kubeconfig file. Okay. Right? K. Have the kubeconfig where it is. It is sequence admin config. That's what we can figure. Let's try to

51:04 Problem: Kubectl Cannot Connect to API Server

52:11 David, let's try sorry? How about just let's say, is it over here? What is it possible they put it here? Kube. I'm not sure though. Why do you think that it's not right? I mean, the IP Oh. Yeah. It may just be there's no API server running. Have you checked that? Yeah. Yeah. The yeah. So let's first check with netstat whether it is running. Right. Okay. So just netstat are you writing? Netstat? I thought you were writing. Are you writing? Okay. Go ahead. Something is wrote. What is it? T u n

52:32 Investigating API Server Status (Netstat)

52:58 l p. Right? Rev six. Not really running something. Something Did you mean 8414? Oh. Kubelet is running. kube-vip is running. Salesforce is running, but no API server is running. So let's just check one more thing. API. No. Definitely not running. So the API server isn't running, and, obviously, this is not so good. So we need to bring up the API server. This is the time you need to say how we are bringing up our API servers. Let's think. So I'm sure that it was putting up the control plane with kubeadm.
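The netstat check above asks one question: is anything accepting connections on the API server port? A small sketch of the same test follows, assuming the kubeadm default secure port 6443 (the port actually used on this cluster is not clear from the recording). The helper name `is_listening` is hypothetical.

```python
import socket

KUBEADM_DEFAULT_API_PORT = 6443  # kubeadm default; an assumption for this cluster


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = is_listening("127.0.0.1", KUBEADM_DEFAULT_API_PORT)
    print("API server port open:", state)
```

An open port only shows that something is listening. It does not prove the listener is the real API server, which matters later in the episode when a rogue process is found on a NodePort.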

53:32 Problem: API Server Not Running

54:03 Yep. These are kubeadm clusters. Definitely. Okay. So I don't know kubeadm uses static manifests for the control plane components, which are in let's say Kubernetes manifests. Yeah. It was in its right to put up the control plane and manifest. Yeah. Let's look for the manifest. Let's look for the manifest maybe. Yeah. I would go to /etc/kubernetes/manifests and see what you've got. Mhmm. There is something is wrong with my browser. /etc/kubernetes. Right. Okay. We have admins kind of manifest. kube-apiserver. API server.

55:00 Inspecting API Server Manifest

55:10 Great. You wanna let's take a quick look. And if it looks okay, yeah, let's just apply this. Yeah. Do some of you know how we are applying to kubeadm? Yeah. So kubeadm monitors this directory and the kubel the kubelet monitors this directory and will attempt to deploy anything inside of it. So you don't need to run any kubeadm commands. So you're saying you don't need to run. The question is, who should bring this up? Because I didn't see kubelet running on this machine. The kubelet is responsible for bringing them up.

55:53 Oh, the kubelet is The kubelet. Yeah. Kubelet kubelet conf. I see here configuration of kubelet, bootstrap Mhmm. Container. I think there should be a static pods here somewhere. Maybe you wanna try the kubelet config? Yeah. Oh, it should be Kubernetes.com. It's in /var/lib. Here. We are not in oh, sorry. We're not in the first directory. Okay. So let's see the file. Control plane. Service. Check the /var/lib kubelet config. Let's see if we see there anything. I think that's good advice. That's Rawkode's idea. Which one? The /var/lib/kubelet config. Yeah. The lib

57:21 kubelet config. Yeah. Let's see. K. Authorization. There is some parameter there that is Static pod path seems right. Yep. Looks okay. Volume streams also shut down then. It is time out. Notes. Alright. And the static pod path also looks okay. Right? Manifests. Yep. It is Kubernetes manifests. We have a comment from Russell on the chat highlighting that there may be a small error with an s on the end. What was the directory called under /etc/kubernetes? Manifest or manifests? Oh my god. I don't know what is the right one, by the way, for savings. As long as they

58:35 match, you're all good. Yeah. Okay. Okay. I don't know what causes the queue. Should we restart the kubelet? Yeah. It's a systemd service, so you'll have to restart. Systemctl. Systemctl. Yeah. Systemctl service kubelet. Restart. Mhmm. And that's your System V stuff coming out. Systemctl restart kubelet. Systemctl. Too many init systems. Yeah. Delete the s. Yeah. And then the Yeah. And restart Why? This is attempt where I was working with Ubuntu. Fingers crossed. That looks better. Much better. Okay. Very nice. Team Kong, very nice. Well done.
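The fix above comes down to one rule: the `staticPodPath` in the kubelet config and the directory on disk have to match exactly, trailing s included. A minimal sketch of that comparison follows, assuming the kubelet config is the usual flat YAML file with a top-level `staticPodPath` key. It uses a regular expression rather than a YAML parser to stay dependency free, and both helper names are hypothetical.

```python
import os
import re


def static_pod_path(kubelet_config_text):
    """Extract staticPodPath from kubelet config text, or None if absent."""
    match = re.search(r"^staticPodPath:[ \t]*(\S+)[ \t]*$",
                      kubelet_config_text, re.MULTILINE)
    # Strip optional YAML quotes around the value.
    return match.group(1).strip("'\"") if match else None


def static_pod_dir_exists(kubelet_config_text):
    """True only when the configured path points at a real directory."""
    path = static_pod_path(kubelet_config_text)
    return path is not None and os.path.isdir(path)


if __name__ == "__main__":
    # Path assumed from the discussion above; adjust for the node at hand.
    with open("/var/lib/kubelet/config.yaml") as handle:
        text = handle.read()
    print(static_pod_path(text), static_pod_dir_exists(text))
```

If the check fails, either the directory or the config value can be changed, as long as the two agree. The kubelet has to be restarted to pick up a config change.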

58:55 Fixing Problem: Restarting Kubelet

59:52 Okay. Server rejected our request for an unknown reason. Let's look for okay. Let's try events. Let's try We are not talking. We are not talking. Oh, this is a bad request. Try get pods -v, -v=6. Yeah. I know, but I don't I'm not sure that this is the issue. We have an issue. Let's see. Here we go. This shows us exactly what we requested. Just go upload it from here we go. I think that -v=6 was an absolutely great idea. I think you've got Here, it couldn't get let's see. Body was

59:53 Debugging Kubectl Verbosity (HTTP vs HTTPS)

1:00:39 not decodable, unable to check for status. I'll get version kind. Okay. So the object that's returned Client sent an HTTP request to an HTTPS server, again with the s, I think. Same issue. Okay. So go to the kubeconfig, David. Go to kubeconfig. Okay. Russell in the chat is asking what is the -v syntax. So -v followed by a number will give you verbose logging from one to nine. No. No. Not here. To the /etc/kubernetes admin I see. admin.conf. Yeah. Yeah. Here we exactly. That's the problem.

1:01:31 Problem: Kubeconfig HTTP vs HTTPS

1:01:34 The HTTP needs an I love it when the cursor starts in the right place. Yeah. Yeah. Dude, HTTPS, s. You're in yeah. You're in the wrong mode. Yeah. Yeah. I'm getting out of there. Let's do it again. Hi. There we go. Nothing wrong. Yeah. That's it. Okay. So retry kubectl again. Let's try again. Get pods. Okay. Okay. Nice. Now this is sending less wires to sending. May want some less verbosity there. Yeah. So an s is missing also. You think they went for a theme, Ben, just removing all the s's?
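The break here was a single missing letter in the kubeconfig `server:` URL. A small sketch of how the scheme could be checked and corrected in text form follows. The kubeconfig fragment and its address are made up for illustration (203.0.113.10 is a documentation-only IP), and the helper names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical kubeconfig fragment; not the file from the episode.
SAMPLE_KUBECONFIG = """\
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: http://203.0.113.10:6443
  name: kubernetes
"""


def server_urls(kubeconfig_text):
    """Return every value that follows a 'server:' key in the text."""
    return re.findall(r"^[ \t]*server:[ \t]*(\S+)[ \t]*$",
                      kubeconfig_text, re.MULTILINE)


def force_https(url):
    """Rewrite an http:// URL to https://, leaving other URLs untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


if __name__ == "__main__":
    for url in server_urls(SAMPLE_KUBECONFIG):
        print(url, "->", force_https(url))
```

Running kubectl with `-v=6`, as the team did, prints the request URL, which is the quickest way to see which scheme the client is actually using.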

1:01:41 Fixing Kubeconfig Protocol

1:02:26 Yeah. So I think that from now on, every time I get another from the our production, I'm going to look for s's. It's actually not a bad idea. Yeah. Clustered. Alright. These are halfway through. So we got a whole twenty two minutes left. You're smashing it. Three out of nodes available. Okay. So they did the same trick we did to them. Let's remove the limits. Right? Do you think so? I'm I'm not okay. Get nodes. Just do a get nodes. One second. Let's first let's so Let's do a get nodes just to be You want get nodes? Yeah.

1:02:45 Problem: Worker Node Scheduling Disabled

1:03:12 Just to see what is going on here. Let's see what's going on. It's a good idea. Okay. Scheduling disabled on worker two. Okay? So we need to enable scheduling there. And Alright. Let's first look at the describe to make sure this is the problem. Describe nodes. Let's try the worker node. V two. Let's see. Labels, annotation. So how can you set a node that's not schedulable in Kubernetes? Actually, I haven't tried. I always trying to make them scheduled. Are you familiar with the cordon command? No. The cordon command? Do you? Cordon. Sorry. Suppose to call you Shanks then coming

1:04:11 Fixing Problem: Uncordoning Node (with Host Help)

1:04:18 through. No. It's okay. No. I haven't tried that, but we can play with it. Let's see. Oh, I'm sorry. Did you I was just sure I was just typing the command so that you could get past my Oh, okay. My funny person accent. But We have two commands to play with nodes, cordon and uncordon. Oh, okay. Sorry. I get it. So just write kubectl uncordon, I think, node. I haven't I know that it's a thing, but I haven't used it. Nodes. One second. What happened to my kubectl? Is it only me, something like the kubectl?

1:05:11 What's wrong with your kubectl? Okay. Sorry. Okay. Kubectl. No. No. No. Kubectl uncordon. U N C O R D O N, uncordon. Do you want me to write it? Yeah. I don't wanna Okay. I see it. Okay. Mark node as schedulable. Nice. Okay. And it was work what was the name? Kong dash worker dash two. Kong work dash two. You don't actually need to know it. Just uncordon Kong worker two. Yeah. Oh. Ah. Admission control. Admission control. Oh, Colin. Very mean. Right. Kubectl. David, remove all the mutating webhooks. One second.
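For reference, a cordoned node shows up in `kubectl get nodes` with `SchedulingDisabled` appended to its status. The sketch below picks those nodes out of the command's text output. The sample table is hypothetical: the node names follow the "Kong dash worker dash two" naming heard above, and the ages and versions are invented.

```python
# Hypothetical `kubectl get nodes` output; names, ages, and versions are invented.
SAMPLE_GET_NODES = """\
NAME                   STATUS                     ROLES           AGE   VERSION
kong-control-plane-1   Ready                      control-plane   3h    v1.22.0
kong-worker-1          Ready                      <none>          3h    v1.22.0
kong-worker-2          Ready,SchedulingDisabled   <none>          3h    v1.22.0
"""


def cordoned_nodes(get_nodes_output):
    """Return node names whose STATUS column includes SchedulingDisabled."""
    names = []
    # Skip the header row, then split each row on whitespace.
    for line in get_nodes_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and "SchedulingDisabled" in fields[1].split(","):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    for name in cordoned_nodes(SAMPLE_GET_NODES):
        # The matching fix is `kubectl uncordon <node-name>`.
        print("cordoned:", name)
```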

1:06:21 Kubectl. Who thought mutating webhooks were a good idea? Come on. Yeah. What's it? There's a name. What's the name for it? Mutating webhook configurations. Kubectl mutating. Oh, the I think something is not right with the Yeah. Auto complete. Yeah. Oh, it's auto complete. Okay. So one moment. Open a new bash. Yeah. Just run bash. Although, you'll lose your exported kubeconfig. Oh, okay. So I'm okay with this. Here, I would come here, I'll do it. Let's see. Kubectl. Good. That's the one. No resources found. No resources found. Let's try the validating. You have admission controller also in the API

1:07:26 server. You can Exactly. Static admission controllers. Yeah. You can also API. Let let's look at the like, let yeah. Let's look at the API server configuration. Where was it? It was in it was I don't remember if it was /etc/kubernetes/manifests. Yeah. Yeah. Look at the /etc/kubernetes/manifests. Okay. One second. It's always I'm sorry. Just looking at the Okay. And Just look at the Orest, yeah, open it in Vim. Okay? So they will going to edit it. I heard from David's voice. I'm sorry. Say again? So just open for editing. I'd like to see an episode of Klustered

API Server Configuration

1:08:23 where people just open Vim and have enough plugins to just, like, navigate the directory system and run kubectl and never even bother machine. We want the controller manager or the API server? The API server. Okay. Okay. So you have this what's the name? Enable enable admission plugins, NodeRestriction. Always Plugin. AlwaysDeny, that's bad. Right? Yeah. It doesn't move with the old node restriction. Did we remove also the node restriction? No. Leave that one, please. No. So let me check whether the API server has restarted. I don't know if you can check. So

1:09:16 that kubectl get pods. I can't believe because it's a oh, yeah. Here it is. It's off right now. No. Yeah. Give it a wee minute. It should come back. Not a lot of teams would have got a static admission controller. That was well done. Very nice. Yeah. It's probably, you know, these are things that are, like, coincidences of your last week. I needed to write a paper about it, and I didn't know it was beforehand also. So some things are just coincidentally. So if it's taken too long, you can restart the kubelet to speed it up a

1:09:50 wee bit. Oh, yeah. Oh, there you go. Yeah. Okay. So Get pods. Okay. Okay. So this is still pending. Let's see why. Well, you were trying to uncordon the node. Yeah. There you go. Yeah. Yeah. So let's do that again. Uncordon. Okay. That's better. Now let's do let me get nodes. Okay. This looks much better. It's a little less complicated. Container creating. This is better. Okay. Let's make sure this works. And Comp it's stuck in container creating. And it's Uh-oh. Okay. Let's see what Probably pulling the image.
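The static admission controller fix above was an edit to the `--enable-admission-plugins` flag in the API server manifest. The sketch below shows that edit as a string operation, assuming the sabotage was an extra `AlwaysDeny` entry next to `NodeRestriction`. The plugin name is inferred from a garbled stretch of the recording, so treat it as an assumption. The function name is hypothetical.

```python
def strip_admission_plugin(flag, unwanted):
    """Drop one plugin from a --enable-admission-plugins=... argument."""
    name, _, value = flag.partition("=")
    # Keep every plugin except the unwanted one, preserving order.
    kept = [plugin for plugin in value.split(",")
            if plugin and plugin != unwanted]
    return f"{name}={','.join(kept)}"


if __name__ == "__main__":
    # Assumed form of the sabotaged flag; not copied from the episode.
    sabotaged = "--enable-admission-plugins=NodeRestriction,AlwaysDeny"
    print(strip_admission_plugin(sabotaged, "AlwaysDeny"))
```

Because the API server runs as a static pod, saving the edited manifest is enough for the kubelet to recreate it. Restarting the kubelet only speeds that up, as the host suggests.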

1:10:39 Pod Stuck in ContainerCreating (Image Pull Error)

1:10:42 I think you're maybe okay. Let's look at this container. Describe. Wow. Really annoying. I'm sorry. Yeah. What was that command that was run by Team Kong, the complete command? I think we're missing that. Yeah. I took this from kubectl. They have their complete command, but I guess clustered. Clustered. Yep. That's yeah. That's the command from Colin in the chat. Complete dash f underscore. Oh, that's the f of the alias. That's how they made it work. Okay. Let's see. So you have a ready replica? Yeah. I wanna look I just wanna make sure

1:11:31 Testing App Pod Status (Running)

1:11:34 that everything is over okay. Over here, like, match labels Look at clustered. There are no describe on this. Do describe on the pod, please. Okay. Sure. Sorry. Running. Okay. The pod is running. No. It's running. Curl localhost:30000. Go. Go. Go. Okay. So do we wanna check it's okay? Let's see if we see the time the watch. Yeah. We can check it properly. Yeah. Sure. Okay. Can you check it? I would love to. So this is Kong cluster. Let me try the worker node. Sometimes it's a control plane.

1:12:25 Yeah. Funny. Refused. Think after the session, I'm taking the letter s out of my keyboard. Alright. So it says connection refused. So try curl localhost:30000 just to make sure on your side. Okay. And I'm not good. I'm sorry? Yeah. Is that good? You're me. Yeah. I missed it. It's the best thing that someone is working in the terminal. Right. It's easier. Let's see what services are we looking at. 30000 is what? 30000. Yeah. 30000? Yeah. Okay. One moment then. Alright. We have a networking issue by the looks of it. Yeah. And twelve minutes to

1:12:34 Curl from Control Plane (Connection Refused)

1:13:17 go. Oh, that's nice. So let's look into the service. Get svc. It's there. Okay. NodePort at 30000. Wait a second, David. Let's try something else. Let's try to do a port forwarding. Let's check where the issue is. Okay. Can I I'm gonna look for the I'm thinking here of something else? Let's just check this. Okay? Get all my cube Sorry. Clustered. Okay. So we have oh, there's an old service. So the service, it doesn't have right? Let's see. Let's take a look at the service. Describe. This is so annoying. Let's look at the service.

1:14:26 You're missing an s. Yeah. On describe. Describe. Oh, sorry. Thank you. K. So the selector is clustered. Okay. So this looks fine. Right? And it has an endpoint? No. Okay. Now let's see the port. It's 8080. The target port is also 8080. The node port is 30000. It looks good. Endpoints are 8080. Now let's do we wanna look yeah. Sure. Now you try the I will try a port forwarding here. Do we have the name of k. Let's try. TL port. Thank you. Colin, thank you. We tried that source, but

1:15:22 it didn't work for us for some reason. Yes, Ben. Yeah. Oh, it won't work because we downloaded kubectl and it's not the up-to-date version. So it probably doesn't have what you need. Right. Maybe. I have I'm sorry guys. I'm trying to copy paste here. Okay. I'm sorry for everyone. So let's try to I don't know. 10000 from here to 8080. Right? This Mhmm. Is where it should go. Okay. So that's not good because That doesn't look good. That's all good. Good. That's good. Okay. Let's try. Okay. So So you can exec into the pod and

1:16:28 curl is available. Yeah. That's a good idea. Let's try that. Yeah. Minus minus Wait a second. You yeah. But I think that Okay. Do we let's see what's blocking us. So there maybe a network policy? That's interesting. Members are our heads as well. Mhmm. You got eight minutes. Yeah. Wait a second. More problem is that I cannot exec, error for okay. So it looks like a connection problem with the kubelets. Yep. Port 10250 is the kubelet port. So something is strange there. Let's try. See, like, on every node you must.

1:17:44 Okay. So let's try the other node. Let's try Bandtest on this one. Let's see now. Yeah. Okay. I know that. Perfect. I don't know who's who can David, I will join you another node. I will open on Kong worker. What worker do you want? Where's the payload scheduled? Wait a second. It should be Oops. Sorry. What so sorry. Yeah. Alright. Worker 2. Worker 2. Right. There is no session on Worker 2. Please feel free to go to activity, active sessions, and join. Okay. There is a Worker 2. Oh, maybe it's not gonna get a session on Worker 2.

1:18:34 No. There is no worker. Worker 1 session is available. Worker two. No. I cannot get to Worker 2. Now both of them are running in Worker 2. You neither. Because we can't access Worker 2, I will encourage you to cordon it and move the workload to another machine. Yeah. Let's restrict it maybe to run on worker one. Let's try that. What is it saying? So should I cord but wait one second. We started by Okay. Because I can't get to worker one or worker two,

1:19:13 I am gonna help you out a little bit here because if we can't get to the machines, we can't fix them. So Okay. We're gonna cordon worker one, cordon worker two Cordon worker two. And we're going to untaint control plane one, and I can never remember the command. So I'm gonna do it the hard way. Taints. Taints. Okay. Your workloads can now be scheduled on the control plane. Okay. So, David Yeah. Okay. D w try to let's go and check on where the pods are. Yep. One second. I'm just finding the right

1:19:55 the right thing in the YAML. And The workload should have been moved for you. Oh, okay. Let's see. Okay. So let's check again. And you need to concur. So if that kubelet isn't able to speak to the API server, you will have to make a change to their deployment to force a new one. Oh, or just delete the pod? Well, the then we'll have to wait for the five minute time out because the kubelet isn't responsive. Okay. So you're saying You you can try it, but I was thinking of deleting the replica sets. Okay. Go for it. Let's try it and

1:20:42 then we'll make a decision. Okay. Because that is, like, delete it. I'm deleting Yeah. Yeah. Delete them all. Go for it. That's my favorite command. Yeah. Yeah. Oh, and we have the other one. It's a daemon set. Right? What is the You don't have to worry about the stateful set. Yeah. Avid's CV and get the pod rescheduled. There we go. Yes. So do you need to do the same for the stateful set now. Yeah. We need the same for the postgres. Yeah. So let's delete it. Delete button. No. No. Delete. There's a typo in the delete. Typo.

1:21:33 And while you're there, I would also edit the deployment. Already trying to we don't have much time. Yeah. Let's get that version bumped. Can yeah. Let's kubectl Okay. Kubectl edit deployment clustered. I'm sorry. I'm persistent. You know, old habits. Good habits also, edit deployment. Clustered. Enter and down That's fine. Under from the we want to be two. Exit. Yeah. So let's check on the pods. Let's see if they did the same trick we did. Five seconds. Okay. It's running. Alright. Let's test it. Okay. So let just Connection refused. Do you wanna try the curl?

1:22:40 Yeah. Let's try the curl. Yeah. Okay. So what do we see here? Failed to connect to database, error connecting to server, connection timed out. Well, this is better. Okay. So so concerns. Somewhat. So let's try to see, okay, what is going on with the Postgres database service. It's connecting to the service, not the same port. Do you want? David, do you want me to take over? Yeah. Yeah. Sure. You can what do you suggest over here? What? No. I looked at the service of the cluster. Maybe we should look also the service

1:22:51 Problem: App Cannot Connect to Database

1:23:30 There's no logs on the cluster pods. On the postgres. No. No. There is no logs in postgres? I would check postgres has logs. Yes. Yeah. Let's try the postgres logs. Okay. Server stopped. There It looks good. It's ain't good. So this initialized some functions. Is it possible they changed the similar to what we did with the config file? The config map? No. There's a create table with an answer. It does say your database is ready to accept connection. I would maybe start looking at the service. No. But but I think okay. So this is oh, okay. I guess so. Maybe.

1:24:09 Problem: Incorrect Postgres Service Port

1:24:28 Let's see if the service yep. Okay. So Postgres is bound to app Postgres. Oh, look at the port. 5431. You see the port, Ben? Good catch. Yeah. It's the wrong port. Let yeah. Let's confirm just that the our server is connecting. Do you oh, 5432. That's it. That should be 5432. Yeah. So 2. Okay. Let's try to David, do we need to restart the the clustered or or or is it No. You should be able to try the Yeah. Yeah. We don't need to start. And done. Fixed. Oh, yeah. Let's try to check. You there? Yep. V2 and V2.
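The last break was a one-digit change to the Postgres Service port, 5431 where PostgreSQL's default is 5432. A minimal sketch of the comparison follows, assuming the Service ports are available as plain dictionaries shaped like the `spec.ports` entries of a Service. Which field was altered (`port` or `targetPort`) is not clear from the recording, so the helper checks both. Its name is hypothetical.

```python
POSTGRES_DEFAULT_PORT = 5432  # PostgreSQL's standard listening port


def wrong_port_fields(port_entry, expected=POSTGRES_DEFAULT_PORT):
    """Return the port/targetPort fields that differ from the expected port."""
    return {key: value for key, value in port_entry.items()
            if key in ("port", "targetPort") and value != expected}


if __name__ == "__main__":
    # Hypothetical spec.ports entry; the episode only shows the value 5431.
    entry = {"name": "postgres", "port": 5431, "targetPort": 5432}
    print(wrong_port_fields(entry))
```

As the team notes, a Service edit takes effect without restarting the application pods, since clients resolve the Service on each new connection.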

1:25:31 Team Armo Success & Wrap-up

1:25:39 Let's check for it. Hold on. Can you please oh, yes. The week I mean Yep. Yeah. Okay. You seem surprised. No. There were many surprises over here. Guys, you did a great job, Colin, and Thank you very much. Well done. That was a tricky one. Lots of evil work there from Team Kong. Yeah. But you just worked through it one by one. Yeah. Definitely. Yeah. Hey. Great job. Good catch for that port as well. Thank you very much, David. Well, thank you for joining me. Great work, Diva. We're gonna cluster. Great work breaking the

1:26:23 cluster. Yeah. Okay. So the reason I couldn't get to it in the browser is because oh, well, UFW was enabled on some of the nodes, but it's not a big deal. We were able to curl the page and that's all about us. So thank you Team Armo. I'm gonna let you just get back to your day and calm down, have a beer, have a coffee, whatever you fancy. And thanks again. Have a good one. And to our sponsors, thank you very much to Teleport and to Equinix Metal for all their support. Thank you for watching. We will

1:26:51 Conclusion & Thanks

1:26:55 see you all again next week. Have a great day.
