Watch / Klustered Live
Overview

About this video

What You'll Learn

  1. Diagnose broken control plane behavior by validating controller manager logs, kubeconfig API server endpoint, and static pod configuration.
  2. Debug application readiness failures by inspecting health checks, deployment updates, and rollback conditions in Kubernetes pods.
  3. Track down DNS and networking faults by testing Postgres service connectivity, in-cluster lookups, and node-level runtime health.

Teams from Polar Signals and Pulumi tackle broken Kubernetes clusters, chasing a controller manager pointed at the wrong API server address, Postgres DNS resolution failures, and a containerd freezer cgroup trap hiding on the worker nodes.

Chapters

  1. 0:00 Holding screen
  2. 1:23 Welcome and Show Intro
  3. 1:45 Sponsors
  4. 2:25 Meet Team Polar Signals
  5. 4:47 Episode Challenge Setup & Unexpected Twist
  6. 5:46 Gaining Cluster Access
  7. 7:18 Polar Signals Debugging Begins
  8. 8:06 Initial Status Check & CrashLoop
  9. 11:36 Debugging Application Health Checks
  10. 19:43 Fixing Health Check Configuration
  11. 20:27 Pod Still Failing: Deployment Not Updating
  12. 23:28 Investigating Control Plane Components
  13. 26:00 Controller Manager Logs: Connection Refused
  14. 26:50 Checking Firewall Rules
  15. 28:11 Examining Kubeconfig & API Server Address
  16. 31:39 Editing Controller Manager Static Pod Manifest
  17. 36:03 Controller Manager Restarts & Pod Creation
  18. 42:04 Application Pod Running: Testing Endpoint
  19. 42:52 Database Connection Error
  20. 43:43 Checking Postgres Service
  21. 44:26 Debugging DNS from Inside the Pod
  22. 47:51 Polar Signals Wrap-up and Departure
  23. 51:21 Meet Team Pulumi
  24. 55:11 Host Reveals One Break (Containerd Freezer)
  25. 57:17 Pulumi Gaining Cluster Access
  26. 59:34 Pulumi Debugging Begins
  27. 1:00:06 Missing Pod & Deployment Replicas Zero
  28. 1:01:13 Editing Replicas (They Revert!)
  29. 1:01:36 External Controller Hypothesis & Generation Count
  30. 1:04:13 Identifying and Deleting the Rogue Controller
  31. 1:06:12 Setting Replicas Again (Success!)
  32. 1:07:27 Testing Endpoint Again (Database Error)
  33. 1:08:32 Investigating Database Connection Issue
  34. 1:11:00 Debugging Networking/DNS (Teleport Issues)
  35. 1:19:01 Teleport Session Freezing Issues
  36. 1:22:48 Hypothesis: Worker Node Problem
  37. 1:25:22 Untainting Control Plane Node
  38. 1:33:55 Application Works on Control Plane!
  39. 1:36:32 Polar Signals Explains Their Breaks
  40. 1:39:57 Recap and Conclusion
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:23 Welcome and Show Intro

1:23 Hello, and welcome back to the Rawkode Academy. Today, we have a new episode of Klustered, and it's a teams edition. We have two great teams, one from Polar Signals and one from Pulumi. It's going to be a great episode, or at least an interesting one, and I'll fill you in on why in just a moment. Now, before we get into that, I wanna say thank you to Teleport. Teleport sponsors Klustered. We've been using Teleport since the very first episode. We use it week in, week out. It is such an amazing tool that I genuinely believe every infrastructure, every production infrastructure at least, should

1:45 Sponsors

1:59 have Teleport running. You'll see why as we debug our clusters; it's a great piece of software. I also wanna say thank you to Equinix Metal, who graciously provide all of the hardware that we use on Klustered. We burn through a lot of bare metal machines and, just for the fun, I always use ones with lots of cores and lots of RAM. But I really appreciate that. It makes it just a little bit more fun for me and for the contestants. Thank you to Equinix Metal. Alright. Now let's introduce team

2:25 Meet Team Polar Signals

2:29 Polar Signals. Hey. How's it going, everyone? Good. Good. And you? Yeah. You know, I really wish I would learn not to throw a very generic question at three people at once and then see who answers first. But yeah, let's start at the top right. Frederic, do you wanna introduce yourself? And then we'll just work our way around clockwise, please. Sounds good. Alright. Hey. I'm Frederic. I'm the founder of Polar Signals. I've been working on all things cloud native for, gosh, six, seven years now. I'm part of the Prometheus team. Until recently, I was tech

3:07 lead for the special interest group for instrumentation in Kubernetes. Yeah. I started my journey through CoreOS, then CoreOS was acquired by Red Hat and I stuck around for a little bit. And then, eventually, I left Red Hat and started Polar Signals. Awesome. Thank you. Kemal? Oh, alright. Hey. My name is Kemal. I'm a senior software engineer at Polar Signals. I don't know what else to say. Yeah. I've been working with these guys for the past few years. So yeah. Thank you. Hi. I'm Matthias Loibl. I'm based in Berlin. And, yeah, I work with them as

4:03 well, obviously, and I've been doing, I don't know. I think I started Go development in 2014, and then through that became interested in Docker and then in Drone, and then kind of read about Kubernetes. So here we are. I looked for a Kubernetes job and ended up finding one in Berlin where, on the first day, I got introduced to Prometheus. So that was in 2016. And, yeah, later, Prometheus maintainer, Thanos maintainer, and, yeah, I've been enjoying working in observability ever since. It's been a great journey. Alright. Awesome. Thank you all. Alright. I guess

4:47 Episode Challenge Setup & Unexpected Twist

4:47 we'll fill the audience in on the fun things that are happening on this episode of Klustered. So this episode was originally scheduled to be Polar Signals and Isovalent. Unfortunately, team Isovalent had something come up, and we had to replace them very last minute. And I very luckily was able to convince a few of my colleagues at Pulumi to join us. So I have smashed through a cluster that you're gonna work through today. Hopefully, it's not too smashed. Although, I guess that would be fair considering the cluster that you prepared for us is currently

5:18 gone. It is not responding at all, which obviously cannot be fixed. However, we are spinning up a quick backup, and then hopefully we'll have something to debug in forty minutes' time. Fingers crossed. So it's been quite an episode already, but that's the way things go on Klustered. When you break things intentionally, things do go wrong. So I'm gonna wish you all the best of luck with the cluster that we have prepared, and I'm gonna pop open my screen share now. So the first thing I need to do is give you access. So I am gonna modify

5:46 Gaining Cluster Access

5:53 your role. Label. That's so cool. I didn't even realize we didn't have access. So, yeah, I'm not gonna give you access to all the machines. But now if you refresh your page, you will see almost all of mine. My Teleport control plane is hidden. But you will be able to see my session on Pulumi control plane one, which is here. And so if you could all please join this session, give me an echo hello, let me know that you're there, and we'll take it away. In fact, while we wait, I can tell you that you also have

6:28 another cluster. So at some point over the next thirty minutes, if one of you could reapply the breaks to cluster two, that would be appreciated. I'll just trust it doesn't disappear. Yep. I'll take care of it. I broke it half an hour, an hour ago. I'll try not to break it as much again. So, yeah, I'll take care of that. Well, maybe we'll get lucky and cluster one will just magically show back up again. But let's just be prepared anyway. So I've got Frederic in this session. I need two more.

7:10 Alright. Awesome. Cool. And Matthias, I know that you're working on the other one. You're here anyway. Cool. Alright. Usually, on Klustered, your best bet is to check the control plane. So I wish you the best of luck. And with that, take it away. Alright. So I think the first thing that we thought of, since we know that this was intentionally broken: let's have a quick look if there are some hints left in the bash history. Yeah. Looks like that was deleted. I guess it's not your first one. Too bad. Too bad. I will tell you, I

7:18 Polar Signals Debugging Begins

7:54 actually forgot to delete it until the tech check, when one of my colleagues, Laura, actually said, did you delete the history? I was like, yeah, I did. And then I didn't. So thanks, Laura. Alright. That was just the first thing we were gonna check. So I guess let's try getting the kubeconfig and see if we can do anything with the cluster at all. It's probably completely broken, as that's the point of this, but let's see. I think it's in, like, what was it? /etc/kubernetes or something. Yeah. That's right. The admin.conf is

8:06 Initial Status Check & CrashLoop

8:34 the file. admin.conf. Yeah. Are we supposed to just run commands from this node? Yeah. Just work from this node. That way the audience can follow along. Yeah. That makes sense. So I guess let's just start with... You'll need to export KUBECONFIG equal to that file. Ah, right. Okay. Alright. Well, I can see something. The control plane isn't entirely broken. That's something. Let's see. Can you do an all-namespaces? Uh-huh. Or I can actually, yeah. That's fine. Nice. Okay. So I would encourage you, when you're executing commands, to use a pager rather than scrolling

9:30 on your side, again, just so the audience can follow along. Yeah. Just when you're introspecting. So you're saying this one, the audience can't see very well? Oh, they can. But if you're looking at logs or doing a describe, if you pipe it through less, it means they can follow your cursor. Got it. Got it. Alright. So it looks like it's just the app that's currently in CrashLoopBackOff. Something interesting, as we were breaking our cluster, that we were thinking through also is that it could be interesting to

10:11 have a look at when the pods were last restarted, so we know what might have potentially been messed with. Yeah. So could be interesting. We can see that the, what is it, API server has been restarted somewhat recently. Ingress agent. Was there even an Ingress on our cluster? I don't remember there being one. The cluster shipped with the service. There's just a NodePort that you can access. Right? So, yeah, we just use the NodePort. The ingress, even if it was broken, it's not a big deal. Got it. Anyways, there are a couple

10:58 of pods that were restarted in the. I don't even know this ingress. Something new. It was formerly Ambassador's ingress. I can't remember the name before. I think it was just Ambassador ingress. Yeah. I remember Ambassador. Sounds familiar. Yeah. Alright. I guess then let's have a look at events and see why this thing is crash looping. Events. Incorrect. Very funny, Russell. But, no, I am not cheating and using my admin privileges. Alright. So it looks like it's the container that's failed, and let's have a quick look at whether we can see more in the pod

11:36 Debugging Application Health Checks

11:56 status. Does autocomplete work with this thing? I don't think it works with the names of the pods. Yeah. We could, like, copy it. We can enable the autocomplete if you want. I can give you the command for that. Oh, that would be nice. Yeah. Okay. So that just went totally out of my brain. Someone else can take over, because... Let me load the autocomplete. kubectl. Is it shell autocomplete or just autocomplete? We'll find out. Oh, I can't even type. That's the camera curse. I swear I've done this before.

12:53 Oh, completion. We'll get there. So this kubectl command will just return a bash script that you source? Yeah. Never do this. Now you can tab complete for your pods. Right. I think it's been, like, four years since I set autocompletion up, so I don't even know how to do it anymore. There go I, Bartlett. Yeah. So you're saying you wanna, like, look at the pod and then pipe that to less. Or did you describe the pod, Frederic? I just did -o yaml. I'm not sure exactly

13:38 why, but the moment I did that, the Teleport session got kind of messed up in my browser. Okay. I think it's okay to just rejoin it, right, since I'm not the... You can. It's because it tried to resize to everyone's screen. So, normally, you get a scroll within a scroll, but you should still be able to move the scroll bar at least. It just sometimes feels a bit weird. I can't scroll when having it opened in less, though. So what you were saying, like... That's okay. I'll

14:14 I'll just scroll. Okay. You can scroll like that. Okay. So, yeah, it's running. It just terminates with an exit code of zero. I don't know. Something about the volume I just saw. Yeah. The volume contains injected data from multiple sources. Okay. Interesting. I don't think I've seen this one before in real life anywhere. So the mounts mount a service account, and where are the volume mounts? They just called that. Yeah. They just called that. Right? But I guess further down, there's the kube-api-access volume. Maybe it's, like, hard to read the YAML in

15:26 this, like, different format, so I usually like to just return it as YAML. Maybe you can scroll like this. It's a bit easier. At least it's more similar to what you would usually type. Right? So why is the pod crashing? Yeah. I'm not seeing the volume. We can have a look at logs. Yeah. Definitely. I would run the describe again. But, Frederic, do you actually remember what we said? We wanted to just deploy kube-prometheus, and I think the control plane works so far. So we can just, like, deploy

16:30 the monitoring stack and see what's up. We can. We can. But, I mean, we can also quickly look at the logs first and then take... The application has no logging. Okay. Yeah. That's why I saw... But I would rather describe again. I think you missed something. Yeah. Nothing. Yeah. There's... Okay. Then let's have a look at describe again, Matthias. Yeah. I think it's still there if you just scroll up. Right? In the describe, we missed something, or in the... Yeah. In the describe. Right. Is that fine if I clean the screen? Yep.

17:13 Yeah. That's fine. Actually, there. Readiness probe failed. HTTP probe failed with status code 404. Are we even running the right image, like, version one of this thing? Would be good to compare with the cluster we have. I mean, I'm happy to answer questions if you wanna validate or verify the thing. So feel free to edit it. I already put the GitHub repository, and it's GCRGHCRAORawDashAcademySlashClusterV1. Correct. So that looks right. That looks right. Yeah. Yep. That is the correct image. Port 8080. That looks right. Think that was correct as well. Yeah. From having read the source and worked with

18:22 this the last twenty-four hours. Yeah. Like, in the source code, I can see 8080 is what it tries to bind to. Yeah. I remember that port also from the service. So... Mhmm. You missed one important piece of information in the header. It feels like I'm starstruck and, like, I have to concentrate on the thing. Usually, this should be easy. Right? Like, now I'm looking at the status code 404. What does that tell us? Well, that the request is definitely being made, but that our application is not returning what it should. There is no, there is

19:15 no /healthz endpoint on that application. That's right. There. Yeah. Yeah. I bet it's just called health. Yeah. So that's a good one. That's a really good one. So, nice try. So I guess, should we just fix it using kubectl edit? Yeah. Just edit. The fastest way possible is the best way. This isn't production. Deployment. Klustered. And then that's it. Oh, I think that broke on the stream with... Yeah. The CLI can't handle Vim very well. I switched over to the browser for that. I just deleted

19:43 Fixing Health Check Configuration

20:09 the z at the end of the health path. Nice. That's it. Then maybe we can actually get pods and then watch what happens. It's still not coming up though. Yeah. You're laughing like you're giving yourself away. I mean, let's have a look at... I just like it when, because when you make the break, sometimes you're really worried that it's just not gonna work. And I'm very happy that it worked. I'm definitely worried about one of ours as well, but let's get to that later. Alright. Let's have a look at the status of the pod one more

20:27 Pod Still Failing: Deployment Not Updating

21:01 time then. Mhmm. Do you wanna take over again? I can. Describe pod, I guess. Ah, whoops. Yeah. Oh, my god. Can't even type. Alright. So it's still saying... No. That's old. Now it's that it fails to start, I guess. Right? Yeah. That always throws me off still. I think the probe failing was, like, last seen seven minutes ago now, was it? Yeah. It still says /healthz in the description though, with the readiness and liveness probe. Yeah. So... We updated everything. Can you run get pods again? Right. So, yeah, this pod hasn't even been updated. Yeah.
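The check being done by eye here, comparing the probe paths on the live pod with what the application actually serves, can be sketched in a few lines. This is a minimal illustration and not something run on the stream: `failing_probes`, the `served` set, and the sample container dict are hypothetical; only the field names (`livenessProbe`, `readinessProbe`, `startupProbe`, `httpGet`, `path`) follow the Kubernetes pod spec.

```python
def failing_probes(container, served_paths):
    """Return (probe name, path) pairs whose HTTP path the app does not serve."""
    problems = []
    for name in ("livenessProbe", "readinessProbe", "startupProbe"):
        http_get = container.get(name, {}).get("httpGet")
        if http_get and http_get.get("path") not in served_paths:
            problems.append((name, http_get.get("path")))
    return problems


# Hypothetical container spec, shaped like the output of `-o yaml` for a pod.
container = {
    "name": "klustered",
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
}

# Assumption: the app only serves these routes ("/health" is the team's guess).
served = {"/", "/health"}

print(failing_probes(container, served))
```

Running the same check against the pod again after editing the Deployment is what exposes the second problem in this chapter: if the list is unchanged, the edit never reached a new pod.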

22:14 So what happens when you modify the deployment? There's, like, a controller that needs to kick in, and then the scheduler. Right? But it isn't reconciled. Like, can we describe the, it's called the replica set? Yeah. I mean, I'm guessing the replica set is not updated. Right? Right. The replica set is not updated. /healthz. Well, no. A new one gets created. Right. And that hasn't happened. Yep. At least the replica set that is there does not have the modification, which means

23:11 that probably it hasn't been updated. Laura said in chat that there's a hint if we need one. Thank you, Laura. Maybe you should read the hints. No. Yous are on the right path. Yous are talking about it. Yous are describing how this works. Oh, yeah. It's super valuable to everyone watching. So you know that. You definitely know that. So can we check kube-system? Yeah. I would check the kube-system namespace for, like, the controllers being up and reconciling our stuff, or probably the webhooks. I

23:28 Investigating Control Plane Components

23:59 But I think Russell is suggesting that I could have used a mutating admission controller. But yeah. Actually, the controller manager. Yeah. The controller manager is there. We were thinking of deploying a faulty admission webhook that would, like, deny every create or something like that. So seems like we aren't the only ones. Maybe we're not the only ones. Yous are on the right path. Don't go down the webhook path. Russell is very smart, but incorrect on this occasion. Yous were already on the right path with your conversation. So first thing that I can already see

24:44 actually, so there's also a readiness probe that is also pointed at /healthz. So that will also be a problem later. Mhmm. But maybe we wanna fix that already. Yep. Yeah. So I did miss one. Yes. I love how the livestream is just completely broken when using Vim. Yeah. I have to... Hey. Yeah. Switch it over. I still have Vimium on. Can you hit escape for me, Matthias? Yep. There you go. Thank you. That was also the Ctrl-C bug with Teleport, which was very fun, wasn't it? Yeah. Yep. But there's, like, Vimception for you, Frederic. Like,

25:33 you have the Vim in the cluster and then the Vimium in the browser. Yes. How to exit two-layered Vims. Alright. So what did we say? Admission webhooks. Right? No. We ruled that out. Controllers. Ah. It's the controller. Yeah. Let's look at the controller manager then, I would say. Alright. Okay. You wanna look at logs, or what do you wanna look at? Is it the controller manager or the scheduler, though? Well, no, the controller manager has the deployments controller. Right. Exactly. Yeah. Should be the thing that

26:00 Controller Manager Logs: Connection Refused

26:21 creates the new replica set. So we can have a look at the logs of the controller manager. Now, what I love about this is that you both know Kubernetes absolutely inside out, but it shows how difficult it is to debug problems in production even when you know that. For sure. We can definitely see here: leader election, error retrieving resource lock kube-system/kube-controller-manager. It's getting connection refused. Alright. So I think we were talking about firewall rules at some point. So maybe we've got something nasty there. So what is it?
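A "connection refused" in the logs carries some information of its own: something answered and actively rejected the connection, which usually points at a wrong port or a process that is not listening, whereas a firewall that silently drops packets tends to show up as a timeout. A minimal sketch of telling the two apart, assuming plain TCP and using a hypothetical helper name `probe_tcp`:

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except OSError:
        # Timeouts, unroutable addresses, and similar failures land here.
        return "unreachable"
```

Pointed at the address and port from the controller manager's error message, a "refused" result would suggest looking at where the address comes from before spending time on firewall rules. A firewall configured to reject rather than drop can also produce "refused", so this is a hint, not proof.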

26:50 Checking Firewall Rules

27:17 That was, yeah. That's inactive. That's inactive. Alright. So that's one. Gosh. I don't remember. What is it? iptables. It's flush or something. Dash capital L to list, and you can do dash capital F to flush. You know this too quickly. I think you've done that. Like this, you said? Oh, so you've just done the flush. So all the iptables rules will be rewritten by Cilium. And then... Now we can dash dash list-rules. Or just capital L. We'll list them. Yeah. Exactly. Okay. So there is no firewall in play.

28:03 Okay. Or I just flushed everything. Let's have another look at those logs then. How familiar are you with kubeadm clusters? Haven't used those in a while. We're using GKE for everything now. My camera is overheating. As long as we can hear you, that's fine, I guess. I was pushing that. We know you're still there. My backup. Right. Yeah. No. It's been a while since I used kubeadm as well. I think even, like, back at, now Kubermatic, they used this, I think, beneath the, like, the other schedulers

28:11 Examining Kubeconfig & API Server Address

29:06 they had. I mean, connection refused. Are we even sure that this is the right address that it's going to connect to? Like, I remember when I was looking at our kube-apiserver, there was some other address. It was, like, 6443 or something like that. I mean, we can just look at the kubernetes service in the default namespace. Right? That should probably tell us. Do you wanna take over again? For some reason, some resizing happened again for me. Yeah. I'll just clean the... Okay. Yeah. The screen, and

29:45 then it should be fine for everybody. So the service, like this: there's a default kubernetes service in every Kubernetes cluster that always has an IP to the API server. So that is 10.96.0.1. I think that was, like, something with a five in the logs. Right? That's completely different. Yeah. So the service is the internal cluster IP for the control plane. Mhmm. Yeah. But what we just saw was that the service just forwards to something somewhere else. Right? And that uses some different port, I think.

30:39 You're using an admin.conf to speak to this cluster. Right? Do you remember where that lives? Yeah. Matthias, can you have a quick look at the address that's in there? Oh, the conf? Is it fine to just show this? Yeah. Yeah. I'm asking you again. No. You could flash that. Yeah. That's okay. Right. Now you have to be quick typing, or at least let me grab something. So it seems like this address is just completely wrong. Like, it's neither the internal one, and

31:24 I don't think you even use the internal address for the setup. So let's just try to modify the address that it's trying to connect to, to this one. So I'm guessing let's look at the deployment of that thing. I don't think it's a deployment. No. It might even be a static pod. It is. A static pod. Yeah. Yeah. Patrick. No. You can't modify those. They're on disk somewhere. I forget exactly where. But... Right. But I should be able to get the... Yes. You can get the pod,

31:39 Editing Controller Manager Static Pod Manifest

32:11 definitely. The pod. Right? So that's called controller, the kube-controller-manager. Yeah. kube-controller. Thank you a thousand times for setting up the autocompletion earlier. Yeah. I don't know how to work otherwise. Okay. So lots of text. Yeah. It should be further up. Yeah. Where the commands and the flags are, I would say. Mhmm. Okay. So it uses a kubeconfig that is the controller manager conf. So maybe if we look at that one and compare it with the admin one. Yep. It is too easy for you, Frederic.

33:19 So something like that. Wanna pipe that to less, just for the... Yeah. Let me, well, just grep server. So one is different than the other. So which one is correct? I would say the admin one is correct. Well, there's a bunch of .confs in /etc/kubernetes. Actually, I just wanted to point out something, since people might be confused why Kemal hasn't spoken up. His Internet is entirely broken. Who knows? Yeah. His ISP is 100% offline. So... Alright. No worries. Unfortunately, he can't be here right now. But we had three replicas.

34:09 Oh gosh. I mean, let's compare with the scheduler one then. Right? Or kubelet. Yeah. So that's this one. This can be a lot of users. I wanted to compare these two. So there is the controller manager. It's just Kubernetes. You forgot an s. Sometimes I forget. I like to quit Vim, and I delete words. You know? It's an accident. Yeah. That makes a lot of sense. That happens in production. Too bad. I love Unix tools like this. It makes it so much easier.
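The comparison the team does with Unix tools, lining up the `server:` entry of the controller manager's kubeconfig against the admin one, can also be sketched in Python. This is an illustration under stated assumptions: the helper names are hypothetical, the address `192.0.2.10` is a documentation placeholder, and the wrong port `6444` is invented for the example (the stream only establishes that `6443` is the correct one). A regex stands in for a real YAML parser to keep the sketch dependency-free.

```python
import re
from urllib.parse import urlsplit

# Matches lines such as "    server: https://192.0.2.10:6443".
SERVER_LINE = re.compile(r"^\s*server:\s*(\S+)\s*$", re.MULTILINE)


def api_endpoints(kubeconfig_text):
    """Return (host, port) for every `server:` entry found in the text."""
    endpoints = []
    for url in SERVER_LINE.findall(kubeconfig_text):
        parts = urlsplit(url)
        endpoints.append((parts.hostname, parts.port))
    return endpoints


def endpoint_mismatch(reference_text, suspect_text):
    """Return (reference, suspect) endpoint lists if they differ, else None."""
    reference = api_endpoints(reference_text)
    suspect = api_endpoints(suspect_text)
    return None if reference == suspect else (reference, suspect)


# Placeholder contents; real files would be read from disk on the node.
ADMIN_CONF = """\
clusters:
- cluster:
    server: https://192.0.2.10:6443
  name: kubernetes
"""
CONTROLLER_MANAGER_CONF = ADMIN_CONF.replace(":6443", ":6444")

print(endpoint_mismatch(ADMIN_CONF, CONTROLLER_MANAGER_CONF))
```

Treating the admin kubeconfig as the reference is itself an assumption, justified here only because it is the file that demonstrably reaches the API server.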

35:18 So, yeah, let's edit the controller manager at that port. 6443. That's the only instance. Hold on. Frederic, do you know how to restart the pod? Do we need to do this? You will have to delete the pod. Yeah. You should be able to just do delete pod. And then it will be... I think, if I remember correctly, the way that static pods work is that even if you delete them, if they are present in that directory, wherever on disk, they will be recreated all the time. That is correct. I think the kubelet will

36:03 Controller Manager Restarts & Pod Creation

36:03 rematerialize them. They said there was, like, a checkpoint for Kubernetes at some point, but I think that's slightly different still. Right? Yep. The kubelet... Yeah. So I think what you're remembering is Bootkube. And Bootkube used checkpointing, used static pods to checkpoint things like etcd. If people still remember Bootkube, that was, like, self-hosted Kubernetes, where at CoreOS we used Kubernetes to spin up Kubernetes, and you needed, like, a bootstrapping cluster, and checkpointing was kind of a snapshot of a point in time. And then it kind of pivoted from static pods

36:53 to, quote, unquote, real pods. It was fun. So what I loved about that break is that at no point does the controller manager ever say that it's not ready or healthy, even though it can't... Yeah. Speak to the API server. There you go. Alright. The controller manager was restarted, but I don't think the replica set was updated. Right? Well... As it looks. Let's see if we progressed in any way. Okay. So that looks like it's still only one. Let's have a look at the logs of the controller manager again. That's what we

37:40 tried to fix. Right? Yep. That'll be the old one. Oh, no. Wait. Did you do logs on the actual pod, or did you do it from the... Yeah. On the pod. Right? Yeah. Let's actually tail this. I'm actually confused. That's still the wrong port. Yeah. Like, it's not the updated port. Yep. So we look at... No. Not delete the pod. Describe. Yes. Describe the controller manager pod. And if we scroll up, we see that we were using the controller manager conf, though. So that's the one we updated. Is there

38:48 any other place? Like, the bind address looks somewhat suspicious to me. Like, that's an easy thing to modify this way. But why would the... The controller manager just talks to the, right, Kubernetes API even if it were to bind to the wrong address? Did you update the port correctly in the controller manager conf? Yeah. I was gonna say that as well. Just double check that we actually... Yeah. There's no difference between those two when it comes to server, at least. So, I mean, the admin config will be different, I suppose. Oh, yeah. You did put

39:41 it right. So the control element, 10721415, and we grab find access here. You wanna have a look at the... Mhmm. Do you wanna have a look at the whole config just real quick? Yep. I mean, you can take over as well if you want to. Someone from the audience said that, I think, the kubelet doesn't restart static pods if you delete them using kubectl; either restart the kubelet or restart the controller manager pod using crictl. Yeah. I'm just gonna make sure that I don't leave an old break in that I

40:32 did remove. I'm just gonna... Open to here. I think my screen is doing a weird thing there. Yeah. That's what Charlie just said, isn't it? In /etc/kubernetes/manifests. Yeah. That's what we actually just opened here. Yeah. Okay. So that's the thing we need to modify. That's where the static pods are. I don't think we need to modify anything in here. Alright. Because we modified the... Someone else, yeah, someone else in the audience said that we might need to restart the kubelet in order to have it take effect.

41:29 I don't remember that. But... Yeah. I think using Vim on that file has restarted the controller manager now. You're using what? Sorry. It appears to have restarted the controller manager now. Ah, maybe just by giving it time. There you go. Yeah. Go back to your get replica set command, and then you should see something different. They sent me ten seconds ago. Alright. Alright. So it's there now. It seems like it's up and running. Yeah. Let's check the, check the size. Klustered. I guess it's, yeah. Let me... You can curl. We'll do port 30000 if

42:04 Application Pod Running: Testing Endpoint

42:40 you want to test that. Yeah. Okay. That also works. Yeah. I saw that one coming. Yeah. The database connection doesn't work. You have five minutes left for the database connection to work. So right. I'll take a look at our doc real quick to copy some commands for the database maybe. So let's say we have Postgres running. Actually, can we have a quick look at the error message that we saw there? Can you do the curl one more time? Error connecting to server: failed to look up address information: Name or service

42:52 Database Connection Error

43:35 not known. Like, let's first make sure that the... Yeah. Service for the Postgres is even there. True. Yeah. Good point. Do you wanna type, or do you want me to? Yeah. Go ahead. Start with typing. So Postgres is there, and then if we say Postgres and look at its YAML, you can see, yeah, 5432 is the port. Do you see something already? No. No. I was just saying, is that the right port? Is that the Postgres port? I'm not sure. Yeah. It's always, like, five and then counted backwards. That's how I remember it.
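The error quoted above is a name lookup failure, not a refused or timed-out connection, so the Service existing with the right port is necessary but not sufficient: the name also has to resolve from inside the pod. A minimal sketch of that check using the standard resolver, where `resolve` is a hypothetical helper and the Service name `postgres` used in the note below is an assumption about what the application looks up:

```python
import socket


def resolve(name, port=5432):
    """Return the sorted unique IP addresses for name, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver could not find the name.
        # UnicodeError: the name cannot be encoded as a valid hostname.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("postgres")` from inside the application pod would be expected to return the Service's cluster IP on a healthy cluster; an empty list reproduces the "Name or service not known" symptom and shifts attention from Postgres to DNS.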

43:43 Checking Postgres Service

44:20 He wouldn't have broken DNS, would he? Yeah. So, again, copying our commands. Let's see what we have here. Let's look at... Or do you wanna try... I mean, yeah, that looks fine. Yeah. I was gonna say, let's try to exec into the klustered pod and, mhmm, like, try to connect to the Postgres database from there. Right. I should even have the command ready. Came prepared. However, the problem is that there is no drill or dig in this container. You'll have to apt update and apt install dnsutils if you want dig. Oh,

44:26 Debugging DNS from Inside the Pod

45:17 there. See, I didn't even realize that earlier. Yeah. So it was, like, some distroless or some You have to apt update. There was, like, some super small container that didn't have a package manager. So I'm surprised there's apt in here. I didn't even try it earlier. That was me being very nice to people that come on the show. Yep. I think, yeah, now it broke on my side. I can't really see what I'm typing. Can I type clear? Ah, that worked. Now it's broken for the stream, though. I reset Okay. Yeah. So install

46:01 oops. Is it called dig or dnsutils? Rick just mentioned dnsutils. Then I need to be able to type install dnsutils. Yes. I'm just an Arch Linux boy. I'm going to Which shell did you run? This is a bash. I think that's why you're having so many Yeah. It's If you run bash It's like the adjust bash. Yeah. There you go. That'll work better. So now there is oh, there is yes. K. That doesn't work. I just read the livestream. So we're using postgres because that's the host that it, like, connects to. That's, like, baked into the application in the

47:04 source code I saw earlier. It is. Yep. Do you wanna, like, just telnet the address and port? Probably, there is no telnet yet. Post. Dang. Yep. Yeah. I mean, we don't get, like, a proper DNS response back on top. Right? Like, there's no IP address in this. Yep. And you're all out of time. So close. Too bad. Too bad. Like, I bet something in CoreDNS. But I didn't touch CoreDNS. Oh, you're still in the container. Yeah. In the container. So I went to the. And if we Giving it a different

47:51 Polar Signals Wrap-up and Departure

48:16 sorry. My terminal is really weird. Let's just do this. Messed up the, like, cluster DNS or what? Yeah. Yeah. So that cluster's IP address where the DNS lives is a different one than it actually is? It's in the kubelet. Oh, where the hell did I change that? Kubernetes manifests. Yeah. No. The kubelet service. Where did I change it? Web API. Where is that? No. It's definitely in the kubelet config. Oh, I changed it on the worker nodes. Oh, that's fine. Oh. Even on the worker nodes. If we pop over here and go to Lumi worker one

49:16 and I modified /var/lib/kubelet/config.yaml. Yeah. The cluster domain was changed here from cluster.local to kubernetes.local. Oh. Yeah. Okay. But that breaks because, like, the application is just looking for postgres as a host name. So I'm surprised it breaks the hardware. ndots fills in the rest of the domain as kubernetes.local. So by the end, I guess, the CoreDNS. There's never anything. And then there was one really horrible thing at the end where you wouldn't have been able to pull the container image anyway because I stuck containerd

49:52 in the freezer cgroup. So Particularly harsh, but you started awesome there. It's like there was a lot to unpack there, and I love the fact that you were sharing, you know, history and knowledge as you went along. So I wonder how long it would have taken to get to the other two now. But those I mean, the DNS one seems doable, but that other one, that seems pretty rough to be fair. Yeah. I would love that you sit on that one for too long. So but good

50:25 job. Now did I ask? Thank you. How is how is Cluster 2? Yeah. So, I mean, I was driving this, but I can quickly, in, two minutes I'm gonna check if Cluster 1 magically came back. Yeah. I I tried in between. Alright. But it's it's gone. But let's reload. I don't know what happened on that one, really. It's just completely disappeared. Alright. Well, I well, say thank you very much and goodbye to you just now. If you could break your cluster in the background and we will do our introductions for as long as possible. And if you can give us a nod

51:09 in the comments, that would be wonderful. But thank you, Frederic. Thank you, Matthias. Great job. I'll see you, Boston. Thanks. Thanks for having us. Cheers. Thank you. Alrighty. Let team make up. Up. Put my camera over here. Do need to update everything over to my other camera? Alright. Pulumi team, please join if you're still watching. I see that we've got one so far. Hey, Steve. Hey. How's it going? How's it going? It's it's getting there. Although, now as I move this on to the next one, I'm gonna there we go. Hey, Laura. Hello. Hello? Can you hear me? Yeah. We can hear

51:21 Meet Team Pulumi

51:54 you. You're all good. All good. Yeah. We're just waiting on one more. And how was that? Did you just have fun watching that? It's a little intimidating. Just a little. No. That was great. They did a great job finding, you know, getting down to it. When you know the answer, you could see them just scooting on the edge, you know. So it was great. Yeah. That was a good debug there. Russell in the audience is asking if I will show my final break. I mean, I guess we've got a minute

52:30 so we can do. I'll pop back over to the screen share. Jump on to one of these nodes. So there was a mystery service called Teleport Proxy running on this, which was running a script called Teleport Proxy, which was finding the PID of containerd and freezing it every one second. Yeah. Maybe too harsh though. I did freeze the kubelet as well prior to just going live, but there were far too many error messages. The whole system was falling over, so I thought I'd make it easier. Alright. When we have our whole team, we're

53:18 just gonna wait for someone from Polar Signals to give us a nod in the comments. But let's start with some introductions. And, Steve, if you wanna kick that off and we'll work around clockwise, that would be great. Sure. So, everyone, I'm Steve Sloka. I'm a staff software engineer here at Pulumi. I just joined, like, a month or so ago. I came here from VMware through Heptio. So I used to work at Heptio, and I'm a maintainer on Contour. So if you're familiar with ingress controllers, Contour is an ingress controller for Kubernetes. Awesome. Thank you. That's my story. I

53:48 work on the SaaS product for Pulumi is what I do here. Nice. I didn't know you worked on Contour. That's what I use on my own production infrastructure. It's such a cool one. There you go. So let me know. Yeah. Then whenever I get a bug, I'm gonna be hitting you up on the Slack. But, hey, man. Yeah. What's the problem here? Yeah. For sure. Alright. Vivek? Sure. My name is Vivek. I am an engineer here with Pulumi. I've been here for a little over a year and a half, I think. Yeah. And I primarily work on the providers

54:18 team, specifically spend a bunch of time sort of taking care of the Kubernetes aspects of Pulumi. Yeah. That's me. Spent a little bit of time before that running some large scale production Kubernetes clusters. Yeah. I'll leave it there for now. Awesome. Thanks. And Laura? My name is Laura Santamaria. I'm a developer advocate here at Pulumi, and I come here from all over the place. But I am on SIG Contributor Experience in upstream Kubernetes. And my biggest claim to fame with all of this is that I used to run an entire cluster before Kubernetes of Docker within Docker.

55:04 So all of my horrible experiences there hopefully will help here. We'll find out. Well, I have no idea what Polar Signals have in store for us. But the fact that I completely trashed their first cluster does leave me a little bit concerned and worried. And you left a few ideas in their mouth and I'm a little worried they're in here, like, breaking it again. Now I'm just wondering, like, what are they gonna add to all of this? Okay. So he's gonna leave out the bit that properly trashed it. I think that's a good

55:11 Host Reveals One Break (Containerd Freezer)

55:46 idea. I don't wanna have to be fixing this with any more of a timer than we already have. So Yeah. We just give PS another minute or two. Sounds fair. I love that they're not even sure. They think they're leaving out. I mean, these are They should be pretty beefy boxes. Like, I don't know what they've done. The fact that SSH like, you could ping the box. It was online with an IP address, but SSH was just 100% not responsive. We'll wait to wait and see. Alright. Matthias has said thirty seconds. I have to share what they did after

56:28 we're done to see, you know curious what they did to break it. Yeah. All the breakers' notes are available on github.com/rocketacademy/custard, where we will put the Pulumi breaks as well. Okay. Spoilers as they're not doing it. There it You used firewall to block SCD. Like, nobody gets firewall rules right the first time. So clearly, you've blocked everything. That's been our problem. Alright. Matthias, can I connect? In fact, know what? I'm connecting. You've had plenty of time. So, I'll share my screen. And I am going to open a session on control plane two. Matthias has given me

57:15 a yes now as well, so perfect. That's because it's my fault. That is not them. Alright. We have an echo hi from me. If you could please go to activity and active sessions, go to the bottom and join and just give me an echo to let me know that you're in the session. We will get this kicked off. Well, that's fun. I'm getting an access denied. Oh, yeah. I didn't give you access. I was wondering what I was doing wrong. I will modify the Pulumi team now and let you actually take part in the show.

57:17 Pulumi Gaining Cluster Access

58:11 So Sorry. My brain is just frazzled today. There we go. You're good. It's You're good, highly entertained thinking that they had already broken that so that they made it so that only you were allowed in. Oh, okay. Try again. You should be alright. Alright. Oh, Matthias was worried that it broke it. Nope. It was just me. All good. Firewall rule was only saying that bash would have been helpful. Alright. Steve, Vivek, you just end the session? I'm coming. I forgot the link. Here we go. It was like I was on worker one and I was like, why isn't this

59:02 working? Someone's in result. Whatever. I just had to look at it at my browser too in case we need to switch to Vim, which of course we will. So Right. Anyone feeling particularly interested in driving? Sorry, Vivek. I left you off the screen there. Sorry about that. No. You're good. It's fine. Alright. Pretty good, but I know it's all weird on my side. I don't know. Alright. Well, I'll handle the typing and just Sounds good. Communicate. Tell me what to do. So I have set up our kubeconfig and go straight into the alias and

59:34 Pulumi Debugging Begins

59:39 run You can get your alias. Okay. We have a control plane. Didn't seem to be it. Let's the pod. Actually working. I'm now very nervous. Well, we don't have a klustered pod. Yep. But we do have Postgres at least. Alright. What do you wanna check next? So we don't have the pod. Do you wanna take a look and see if maybe there's a deployment for the thing at all? Or There is. Klustered deployment? Okay. Should we edit? Yeah. Let's just go straight in. Yes. Yeah. Why not? We're gonna have fun. Let's do it. Alright. Alright. What are we looking for in here,

1:00:06 Missing Pod & Deployment Replicas Zero

1:00:28 team? So this thing exists. Let's can you go down, like, I guess? Let's see what the template looks like. Yeah. Request for your v one. Blah blah blah. I mean, I'm gonna guess so there is no pod. I'm guessing there isn't a replica set either. I was gonna say right now, it says replicas are zero. So Oh, yeah. Oh, hey. Good catch. Yes. I mean, maybe we should start there. Good catch, Laura. That's a good one. I don't know if that's gonna just get it. Oh, no. It is ready. Okay. Well, it's not done anything. So

1:01:13 Editing Replicas (They Revert!)

1:01:20 I know. But, I mean, like, at least now, it's closer. It's not, like, crashing completely. So let's go back. Yeah. Oh, replicas is back to zero. Alright. Nice effect. What are we thinking? Interesting. I mean, I can set this to one again, but I'm pretty confident. So what's the revision on that view? I mean, one might also wanna take a look at what's going on on the revision on that. Yep. So So generation six sorry. One more edit. Why not? Let's see what happens. Generation eight? Eight? Oh, that's interesting. Some big changes. Sorry. I was gonna use the wrong word

1:01:36 External Controller Hypothesis & Generation Count

1:02:12 there. Some sort of a mutating webhook or something, I think. I should really load it up quickly. I am really impressed that you can type all that. I am gonna load out a complete just now. But, yeah, there's no mutating admission controller. Okay. But that was my first thought too. Yeah. Interesting. So something's going around changing it. Yeah. That's crazy. My first thought was the selectors matched the labels, y'all went down a different path, which makes sense. Well, the generation doing that double bump. Yeah. That was the thing that's, like, kind

1:03:05 of interesting. Should we see some events or something? Or It's a good comment from Russell, actually. Like, if this was a mutating admission controller, it would have been a single generation bump rather than two. So something external. Got it. Right. Probably. Messing with this. I gotta ask. There's like there's no way someone else could be connected to this modifying it live. Right? Like, just making sure because that could always happen. I mean, I'm sure Matthias is a very quick typer but I mean, those edits were fast. Really fast. Yeah. Can we run top

1:03:49 on this? Just to see what's running in case there's something else running? No. Okay. I mean Okay. They just said no. It's all inside the cluster. Okay. I mean, I've got a hypothesis, but I'm gonna hold off because it's on the way of things. Are there any other pods? Like, if you say get pods all across the cluster to see what else is there. Control. So the age strikes me as suspicious. Yeah. I was looking at that VIP DaemonSet. Is that a DaemonSet? Yeah. Could be third

1:04:13 Identifying and Deleting the Rogue Controller

1:04:50 third one from the top. For BGP advertisement. So that is well, I mean, don't know if they've modified it, but it's certainly I am happy that it's there in the cluster. The thing I'm not happy about, just because I know the code, is metrics server. I don't think Klustered has ever had a metrics server before. Well, then why don't we go see what said metrics server is doing? What is said metrics server? I mean, they wouldn't give a safe name and then lie to us, would they? Why not? Why not? It'd be fine.

1:05:30 Why is that not all of the TV? The first thing that caught my eye. Yeah. Nice. So can we just like wipe this thing off? Just to see what it does? Yeah? You don't you don't want to look at it anymore. You just want to disappear. Is that what you're Well, no. I like I don't know. This is my version of debugging. Like, if it's something that I don't know what it is, it goes away. So we can, like, just disable it and try to turn it off. We're gonna see what it does. See it. It's gone.

1:06:12 Setting Replicas Again (Success!)

1:06:15 Alright. So then what's We will never know what it takes. Modify things. Alright. Oh, somebody's got a good question if there's any cron jobs running, which that makes sense. That's fair. And there are none. Yay. Okay. You just wanna try to edit again? Yeah. Let's do it. Let's try it. What did it do? What will it do? Now we were at eight. Right? So that was the version we were at? Are we still at eight? Yeah. Good. That's a good sign. Hey. That sounds right. Hey. Hey. It's keeping it. It's keeping it. Nice. Good job.
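Editor's sketch (not from the stream): the "double bump" reasoning from earlier in the session can be modeled in a few lines. The assumption, stated loudly, is the usual Kubernetes behavior that `metadata.generation` increases by one for each persisted change to the spec, and that a mutating admission webhook alters an object inside the same write, before it is stored. The `Deployment` class below is a toy model for illustration only, not the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Toy model of a Deployment's replicas field and generation counter."""
    replicas: int = 0
    generation: int = 1

    def write(self, replicas: int, mutators=()) -> None:
        """Persist one API write.

        `mutators` stand in for mutating admission webhooks: they run
        before the object is stored, so they are part of the same write.
        """
        for mutate in mutators:
            replicas = mutate(replicas)
        if replicas != self.replicas:
            # Only a persisted spec change bumps the generation.
            self.replicas = replicas
            self.generation += 1


def edit_with_webhook(dep: Deployment) -> None:
    # The user asks for 1 replica; a webhook rewrites it to 0 in-flight.
    dep.write(1, mutators=[lambda _replicas: 0])


def edit_with_external_controller(dep: Deployment) -> None:
    # The user's write lands first, then a separate actor writes 0 again.
    dep.write(1)
    dep.write(0)
```

Starting from generation 6 with zero replicas, the webhook path produces at most one bump per edit, while the external-controller path goes from 6 to 8, which matches what the team saw before deleting the suspicious pod.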

1:06:58 That's a good point. We have a pod. Nice. Let's It's a start. Curl back. K. Yes. And and to answer the chat, yes, I randomly will delete things that don't look familiar. So it was my cluster. I know what's supposed to be there. So Yep. Should we try curling and see what happens? Yeah. I think I think I think be here. Uh-oh. Database is not currently accepting connections. Okay. So let's take a look at the the the spec again and see what the, I guess, the Postgres connection string is, I guess. I don't remember if that's how it works. I'm assuming there's

1:07:27 Testing Endpoint Again (Database Error)

1:07:55 a connection string somewhere in the spec. It is hard coded into the Rust application as postgres. Gotcha. Okay. Database klustered is not currently accepting connections. Did they, like it sounds like it's actually establishing connection. It's like the actual database might have basically some it's, like, disabled connections. So do we need to, like, pop into Postgres potentially? I hope not. Steve is so well versed now with all of this. I'm curious what do you think? Apparently, Matthias is saying that that Locutus. I'm not sure if that's how you pronounce it. Is that JSON or JSON or

1:08:32 Investigating Database Connection Issue

1:08:45 however you wanna pronounce that was the controller that was setting the replicas to zero. Locutus of Borg. Very nice. Very nice. Locutus? Is that how you would say that? I'm assuming it's Locutus of Borg. I'm assuming that's what it refers to. Alright. Okay. Because Star Trek. But that's where my brain goes. I only watched Stargate. I've never seen Star Trek ever. Oh, David. We have to fix that. Anyway, so Is the Postgres service called klustered? Is that what we're supposed to connect to? Yes. Our service is there. It's called postgres. And it's port 8080?

1:09:26 Or no? That's the web application. That's the web app? That's the web app. 5432 is the Postgres. So What if you describe that Postgres? Does it select the pod? It has an endpoint which Okay. And it's 5432, that matches. So Frederic, it's not that I don't like Star Trek. It's that I grew up on SG-1 and have never watched Star Trek. Now, I'm perfectly willing to reconcile and fix this. Well, I will make sure we get a list. We'll make a list. Yeah. Make me a list.

1:10:13 Don't we do so the endpoint is ten zero one ninety one. Yeah, I see an endpoint. I mean, I'm relatively confident that Yeah. The selectors are okay. Do we have psql or something, like, that we could try connecting to the database ourselves, perhaps? Oh, we can exec into klustered. Yeah. If we exec into the worker, we might be able to, like, kinda poke at that into the pod. I do believe that's available. Oh, that's DNS. Okay. Alright. Interesting. So So who knows how DNS works in Kubernetes?
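Editor's sketch (not from the stream): one piece of "how DNS works in Kubernetes" that mattered in the first cluster is search-list expansion. A pod's resolver turns a short name such as `postgres` into fully qualified candidates using the search domains and the `ndots` option from its resolv.conf, and the kubelet derives those search domains from its configured cluster domain. The sketch below is simplified: it assumes the `default` namespace, the common `ndots:5` setting, and ignores any search domains inherited from the host. The function names are made up for illustration.

```python
def search_list(namespace: str, cluster_domain: str) -> list[str]:
    """Search domains a pod typically gets for a given cluster domain."""
    return [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]


def candidate_names(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Names the resolver tries, in order, for `name`."""
    if name.endswith("."):
        return [name]  # already fully qualified, no expansion
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


def answerable(candidates: list[str], served_zone: str) -> list[str]:
    """Candidates that fall inside the zone the cluster DNS actually serves."""
    suffix = "." + served_zone.rstrip(".") + "."
    return [c for c in candidates if c.endswith(suffix)]


# Healthy case: kubelet and cluster DNS agree on cluster.local.
healthy = candidate_names("postgres", search_list("default", "cluster.local"))

# Broken case from the first cluster: the kubelet config said
# kubernetes.local while the DNS zone was still cluster.local.
broken = candidate_names("postgres", search_list("default", "kubernetes.local"))
```

In the healthy case the first candidate is `postgres.default.svc.cluster.local.`, which the cluster zone can answer. In the broken case none of the candidates fall under `cluster.local`, so the lookup of the bare name `postgres` fails even though the Service exists.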

1:11:00 Debugging Networking/DNS (Teleport Issues)

1:11:12 Yeah. Okay. I guess I'm just wondering. 10.96.0.10. So that is the hard-coded IP address for Yeah. That is sorry. is fine and That sounds right. Alright. Yeah. Yeah. Yeah. I think so. Should we pop out and see if CoreDNS is running and stuff like that? Great question. K. I mean, it's quite fifteen minutes old. But it is Well, so that one was the one that had the lowest time since modified, since started among the entire thing earlier. Yeah. I spun this cluster up an hour ago. So Well, no. It said eight minutes where everything else

1:12:08 was eight minutes thirty seconds. Right. That one could have been modified after it came up is what I'm curious about. Should we just take a look at the CoreDNS spec? Yeah. You can override that, can't you, with a secret or something or extended, I guess. Yes. So at least it's the CoreDNS image. I'm pretty confident this is the correct Yeah. Yeah. Almost another Star Trek container. Anything you just wanna see in here? Volume mounts, /etc/coredns. Okay. So I might be kind of poking around because they rolled all the pods to hide the

1:13:11 metrics server. So that may be why maybe it's not actually in here. Curious if you look at the config, the /etc, like, the actual thing. Yeah. You could take a nap. Lameduck. Exactly. That's real and legit. I know. It's just every time I see it, I'm kinda like, wait. Underneath prometheus, the forward postgres. Oh, yeah? Where is that? I missed that on the line. It's yeah. Oh, okay. Yeah. Nice. Yeah. Good catch. Let's do that. Let's hope it. I guess we'll It should how should I match create? Will it?
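Editor's sketch (not from the stream): spotting an odd line in a CoreDNS Corefile by eye is easier with a baseline. The snippet below lists the top-level directives of a Corefile and flags anything that is not in the plugin set of the stock kubeadm Corefile as the editor recalls it (errors, health with lameduck, ready, kubernetes, prometheus, forward, cache, loop, reload, loadbalance), plus any `forward` whose zone is not `.`. The exact line the team found is not shown in the transcript, so the extra `forward postgres 203.0.113.53` line here is a hypothetical stand-in, and the helper names are made up.

```python
DEFAULT_PLUGINS = {
    "errors", "health", "ready", "kubernetes", "prometheus",
    "forward", "cache", "loop", "reload", "loadbalance",
}

# Baseline as recalled from kubeadm defaults, with one HYPOTHETICAL extra line.
COREFILE = """
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward postgres 203.0.113.53
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
"""


def top_level_directives(corefile: str) -> list[list[str]]:
    """Return the directives that sit directly inside a server block."""
    depth = 0
    directives = []
    for raw in corefile.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if depth == 1 and line != "}":
            directives.append(line.rstrip("{").split())
        depth += line.count("{") - line.count("}")
    return directives


def suspicious(corefile: str) -> list[str]:
    """Directives that differ from the assumed default layout."""
    flagged = []
    for words in top_level_directives(corefile):
        name = words[0]
        if name not in DEFAULT_PLUGINS:
            flagged.append(" ".join(words))
        elif name == "forward" and (len(words) < 2 or words[1] != "."):
            flagged.append(" ".join(words))
    return flagged
```

With the sample above, `suspicious(COREFILE)` returns only the hypothetical `forward postgres 203.0.113.53` line. The lameduck, loop, reload, and loadbalance entries pass because they are part of the assumed default.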

1:14:05 I think CoreDNS will just pick it up. Will it? No. We're gonna find out. I'm not convinced, to be honest. I mean, we can always just kill the pods and roll them, I guess, like Well, what's your preferred way to roll the pods? Because, I mean, I know what I was about to type, but I don't wanna Bring my bad habits. I'm dirty with this. I'll probably just, like, kill the pods. You're not the only one. That's all I do. So Laura's just gonna destroy everything. Right? She's gonna

1:14:41 I mean Because half the time when you do that, it works. You know? Have you rebooted it? Have you turned it off and on again? It works. I mean, sensible people would be like, let's do a rollout and do this nicely and I'm sorry. No. Well, but sensible people would put it on, like, production workloads. This is not necessarily a production workload, so why not? Hey. This application and cluster is very important to me. Well, I mean, we already knocked it over, like, three or four times in the twenty four hours. So

1:15:12 Well, I have rolled via the deletion method or the smash method, CoreDNS. What if we can throw it now? We should go back in and see if No. Okay. Okay. Well, that's a good point, Ash. I think really good eye catching the Postgres forward, but we couldn't run an apt update, which is not Postgres. So we fixed one thing, but not this thing we were actually trying to fix. Yeah. Definitely. Definitely. I'm curious if the thing connects now in theory, like, if maybe the application is able to connect. Is there anything in the Postgres pod, like

1:16:06 the command getting overwritten or something? Or you can can't you swap the DNS in the pod? You can change the change the thing. We we could apply a custom DNS config to, our deployment and pass that through to the pod. Because by default, it pulls it from the host. Right? You can override that with Yeah. Per pod. Is there anything running in, is this running Docker? Container d. Container d. Oh, mhmm. I was just thinking if there's something like, somebody modified a the daemon the the docker daemon. But No. I think there must be something else in that config

1:16:53 map. Right? Yeah. Let's go look. Cluster of. Looks alright. Yeah. I'm not sure That's good. If loop, reload, and loadbalance are normally there. Like, I don't remember. Yeah. It's been a while. Yeah. I'm not that familiar with, like, the config. I wonder if we should just, like, look up, like, whatever the, you know, like, standard thing that CoreDNS ships with and just, like, replace and go with that. Well Pretty sure the default is Yeah. Yeah. So really, they don't know happened. Here. Customization. I thought I would Can we actually the

1:18:04 the logs on the worker and, like, tail them and see what's coming up when we actually try to ping? Because it makes me wonder, like, is there, like, a a forwarding going on, or is there, like, a did they somehow sneak, like, a VPN in there or something that they've, like, really messed with everything? Like, that's the kind of thing I'm wondering. But, again, I'm I'm a very suspicious admin. Okay. So that does seem to so that's the control plane, though. Right? We should be trying on the worker that the thing is running. Yeah. Control c.

1:18:40 Oh, no. No. No. That's we're now just stuck watching the ping. Could you stop it there? No. Game over. So this is a fucking Teleport. I think really Teleport. Oh, I guess I didn't think really on that. Can I Oh, there's no way to like halt processes somewhere in the page? It's not responding to Control-Z either. Alright. I just open a new tab. Create a new session? Yes. Alright. Alright. So we all get out of the session and Can anyone else Control-C that or are we just stuck with that? I think it's yeah.

1:19:01 Teleport Session Freezing Issues

1:19:23 It's not responding. I tried. It didn't work. Okay. I guess if I just close the tab, I should, in theory, be out. Right? Alright. You can join there's a new session on your activity page. Let's jump out to there and then regroup. Okay. But yeah. So if we go over to the The chat are all telling me to open another session and do the pkill. Yeah. I can tell you that's not gonna work, and I'll show you it because I still have yeah. I still got the other session. So I've apparently just Oh, This one's frozen. Did

1:19:59 something break? I'm sorry. I might have, like, tried to type at the same time as someone else or something. No. No. No. It's completely frozen now. I'm gonna try rejoining. Is this maybe they've broken it again? I can't get in. Every time I try and join, it gives me like a resolv.conf error. Oh, now I'm in. I've done a pkill on that. We'll go back to our first session. And now you can't type. So this is the Teleport bug that I thought a downgrade would fix. Okay. So if I join the next

1:20:32 session and I say I don't know. I hope so. Right. Okay. We got Laura back. There we go. But yeah. So if we hop over to the worker and try pinging and see what it does. Well, don't ping. Actually, don't ping unless you're gonna just tell it only do ten pings or something. Yeah. Was it ping -c 1? Yeah. Good. Okay. We do have Internet on the host at least. We potentially fixed CoreDNS with the postgres forward. But, yeah, maybe we do need to jump on to the worker. Can we

1:21:15 look at which worker it's gone to? Like, that's -o wide on the pods and, like, hop onto that. That's -o wide, I think, to get the node, right, for the pods. Yeah. No. I can't. We're in our head and I can't type. I can type. So wait. No. I can't type. Can't type. Frozen again. That's okay. No. It's frozen. Oh, no. What are you just doing to my server, Polar Signals? Oh, no. Okay. We're gonna try that again. Is there anything weird here? No. Well, I don't think I've ever actually broken a

1:22:20 server by trying to ping things. This is a new one. Yeah. I don't know what is happening right now. So let's see. So get pods. Oh, you didn't do your alias? Okay. Okay. So I think you've got a good point, Vivek. What if they've messed with the workers? Right? Can we just untaint the master, the control nodes, and let them run on the control nodes? And then Mhmm. I don't know. Actually, we can That's an idea. Yeah. Because the goal is just to get it running. Right? I don't know how.
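Editor's sketch (not from the stream): "untainting" works because the scheduler skips a node whose NoSchedule or NoExecute taints are not all tolerated by the pod, so either removing the taint or adding a matching toleration lets the workload land on a control plane node. The matching rule can be modeled as below. The taint key `node-role.kubernetes.io/master` is an assumption about this cluster (newer clusters use `node-role.kubernetes.io/control-plane`), and the helper names are made up for illustration.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Does one toleration match one taint?"""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key:
        if key != taint["key"]:
            return False
    elif operator != "Exists":
        return False  # an empty key is only valid with Exists (matches any key)
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """True if every scheduling-relevant taint on the node is tolerated."""
    blocking = [t for t in node_taints
                if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations)
               for taint in blocking)


# Assumed taint on the control plane node (key is an assumption, see above).
CONTROL_PLANE_TAINTS = [
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
]

plain_pod = []  # no tolerations, like an ordinary application pod
tolerant_pod = [{"key": "node-role.kubernetes.io/master",
                 "operator": "Exists", "effect": "NoSchedule"}]
```

Here `schedulable(plain_pod, CONTROL_PLANE_TAINTS)` is false, while both `schedulable(plain_pod, [])` (taint removed) and `schedulable(tolerant_pod, CONTROL_PLANE_TAINTS)` (toleration added) are true. Neither route explains what is wrong with the worker; it only moves the workload away from it.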

1:22:48 Hypothesis: Worker Node Problem

1:23:00 Yeah. Yep. Yep. Yeah. Good idea. Hammer. Hammer. Hammer. If we'll record the card in, let's add the Yeah. No one said we actually had to figure out exactly what was wrong. Just make it work. We can also do the other the other smash approach, which is to come down to spec templates. I'd be curious the logs too at some point. Laura, you mentioned that what the app is actually telling us. I mean, I guess, I think a lot is that me? I don't know. I don't know what that was. Okay. What what what what what. What is it?

1:23:50 Maybe it's Vivek. We lost Vivek. It's just everything that can go wrong is going wrong. Alright. We have rescheduled our workload to our control plane assuming that they have done something naughty to the worker node. So why don't we exec and We could curl it. Right? So okay. So the database is still on the worker. Right? True. Should we check if we have at least connectivity in this pod and then maybe come up with another plan? I really need to keep that. Alright. I'm gonna try unmuting and seeing if I'm about to make loud noises.

1:24:49 No. You're okay so far. Okay. Good. Okay. So that was a really cool idea to cordon the nodes. We've got the workload on the control plane, but it's still DNS. So But I guess the database is still running on a worker. Right? Yeah. But we can't run an apt update. So, like, we still need to fix networking on the container on the host either way. And then we can maybe look at other things. So Well, I guess if something on that node is stopping the connection inbound,

1:25:22 Untainting Control Plane Node

1:25:27 what's can you log the app and see what it tells you? There's no I can know that the app. There's no logging. So all you know is that it's not currently accepting connections. Well, I mean, we ran apt update inside the container though, and it says it can't resolve Debian. And when we were in here, we did cat. Why does that matter? Why does Because there's no networking on the application. Okay. There's no DNS resolution. I guess we don't care about that. All we care is that the app can talk to Postgres.

1:26:02 Right? I don't know what happened to my machine. Yay. But you're back. Yeah. Well, sort of. I'm just gonna mute myself for a bit to make sure it doesn't happen again. Alright. So So we know it's DNS. Well, we definitely have a DNS problem, but what Steve's saying is that do we really care if Postgres was resolved. And we don't know if Postgres does resolve or not. I don't know if we have a way to test. Does this Postgres log anything? Can you log that? Tail the logs in the Postgres pod. What? Not real. Oh, I mean klustered. Oh.

1:27:10 Alright. Source. Source. You spelled source wrong. It's the pressure. It's getting to me. I know. Log of course. Why is it all? They're not working. Currently not accepting connections. Oh, and then I can't Control-C the terminal. Oh, wait. Wait. Wait. Wait. Wait. Let's see. Oh, we can't Control-C again? Nice. Oh, no. So okay. So it's having any go ahead. I don't know what that kubectl get configmap is doing. So Syntax error at or near "kubectl" at character 1. Yeah. I think the command has been overwritten or something. Yeah. So if we went into the

1:28:12 if we went into the pods spec, right, maybe we can find out if they they mucked with it. Well, hang on real quick. Real quick. If we look up a little bit, what's the docker entry point? Are there any volumes attached to this? Alright. I'll just open I'll keep opening the session. Don't worry about joining. Okay. We can just Yeah. We could just talk about it. I'm gonna read backwards in the in the logs while y'all are doing that because clearly, that was at Alright. Oh, that's there. Oh, there's a guest, which I can't. Oh, no. I've broken that again because I

1:28:48 did, like, Control-R. That's a horrible Teleport bug. Okay. Here he is. Alright. Alright. So let's add it. SDS. Exec is ready. Is your what you're saying is that it'd be the oh, you're in the session. Cool. Okay. Sorry. That was me in the session scrolling up and down by accident. Okay. But Will the service forward, though, if it's 127.0.0.1? Will that accept from outside itself? That's just the readiness probe. Never mind. It would, if I remember. Yeah. That's just the probe. Never mind. How about that Yeah. Config map. It's an init config So the config map

1:29:55 is how I load the data into Postgres. So let's take a look at that. Was there an image here? Or it might just be really silly. Oh, yeah. It's there. Okay. Let's check out the I'm still kinda reading some of them. Alright. Could the assertive be messed with? No. You're good. Annotations. That looks okay. Yep. That's my data. Okay. But something else is getting called, like, looking back at the logs themselves from the prior message. It's saying let's see. Where am I? Where you want me Sorry. I might have missed some context here, but,

1:30:57 like, it sounded like from the Postgres logs, we were seeing the klustered, like, sort of it's not accepting connections thing, and that was the message that we saw in the application as well. So are you think there's something I'm gonna go back to see if maybe, like, connections are not enabled on the database or something. Like, it sounded as if, like, we perhaps are well, actually If you do the DNS. It was weird that it said, like, kubectl get configmap from But I wonder if that was the Teleport bug because we're gonna

1:31:36 end it with them, and it was mixing the histories. Oh. Oh. Oh, I see what you're saying. Okay. I'm gonna go to the So Matthias left us a note in the chat. You were a little too fast deleting that line. Postgres' problem should be orthogonal, though. So that line that we deleted out of the Out of the thing? Yeah. But it should be orthogonal. Okay. So edit ConfigMap. We delete the postgres forward. But, I mean, that doesn't feel like it should be there. Right? So unless the Right. It's completely broken. Yeah. So in the

1:32:26 Postgres worker logs, it's saying the error is syntax error at or near kubectl at character one. The statement is kubectl get configmap -n kube-system coredns -o yaml alter role postgres set statement_timeout 1. So it makes me wonder, is it sneakily living somewhere? It was somewhere up. I think we're suggesting that that may just be Yeah. This one. The output being merged. Well, that does actually look like a locally. Right? Yeah. Like, it looks like if I didn't know better, they were trying

1:33:07 to put commands into Postgres somehow, like, feeding the command into Postgres and, like I don't know. I don't know. Like, it just something's weird about that. That's not a command I ever see. Like, not something I ever see in the logs for Postgres databases. So I'm suspicious. Yeah. It also says it was shut down. So on the right underneath the Unix socket, it says database was shut down. So it's not even running. So I don't can we just Yeah. But then it comes back up. It was saying the database system is ready to

1:33:43 Matthias is saying he just copied and pasted it wrongly. So I think that may be a red herring. Oh, okay. Still. I'm like, something doesn't look right there. Wait a minute. There's no networking inside of our pod. What's in the resolv.conf file? On the whole oh, and the container as I thought. Right? On the host. Why can't we move Postgres to the control node? Can't you just we can. Yeah. Alright. I don't know if we want to do that. So Oh, I don't know. I was just trying to, like, get rid of

1:33:55 Application Works on Control Plane!

1:34:31 those other nodes. I don't because I guess this we can't exec into the web app or to verify another container has networking, I guess. Another pod. Okay. It's there now. So let's try. Okay. And what's the log say again? What's log? This is a time out now, actually. Oh, it's working. Hey. Nice. So wait. I missed what you did. What did we do? Steve asked us to move We moved the program. Or was it sorry. I can't see. Now we can go take a look at what the health program. Yeah. Alright. So now we're seeing if

1:35:27 the if we updated it. We have. Great. Nice. Nice. Yeah. Yay. It's, like, three and a half. We don't know what happened. Oh, yeah. That's the idea. I mean, we can go chase that in that node. So which do you know was it like, which worker was it at? Like, sorry. I'm kinda disoriented. But We cart them Clearly, there's something going on. Earlier on. Okay. Got it. But I don't know if we did have Well, it's something that's not a good one. Right? Yeah. Yeah. Yeah. So yeah. Yeah. If Matthias or Frederic wanna

1:36:09 jump on and tell us what you did, I would love to know more because I have no idea how we fixed that. Well, while they're still jumping on, I almost wanna, like, just hop onto the workers and go dig around. Well, I think restarting Postgres fixed it because they modified Postgres. So when we moved it from Oh, nice. Okay. They Gotcha. But, like, the metrics server, first of all, that was like I didn't expect you to find it that quickly. That was, like, very, very impressive, I have to say. And, yeah, it just was setting replicas to

1:36:32 Polar Signals Explains Their Breaks

1:37:15 zero. And then the other thing I mean, we tried disconnecting with the firewall from getting a connection. Like, the API server shouldn't be able to talk to anymore and thus, like, basically break everything. But, yeah, fearing that that would break the entire cluster once more, we didn't wanna go through with that. And then the other two things, yeah, were just like DNS. So the line you had, and I don't think that's still fixed, but I think, like, there was, like, still state of DNS and stuff in the cluster where I just set this to forward Postgres

1:38:00 to the localhost address. There's usually a forward everything to /etc/resolv.conf and kind of like that connection between and the /etc/resolv.conf, that was completely off, and you just completely deleted that line rather than fixing it. Right? So that's why we could quit or that's why I can speak to Postgres, but I can't get anything external because we Yeah. Exactly. And I think the Postgres was actually still working because the service, like, the Kubernetes service that had the correct IP address given. Like, it still had that in the Kubernetes object.
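The CoreDNS break described here can be illustrated with a short Python sketch. It assumes the tampered Corefile replaced the usual catch-all `forward . /etc/resolv.conf` with a line forwarding a `postgres` zone to a loopback address; the exact line was not shown, so `BROKEN_COREFILE` and both function names are hypothetical. The check simply flags any `forward` directive that is not the catch-all zone or that points at loopback.

```python
# Hypothetical Corefile reconstructed from the spoken description.
# The real ConfigMap contents were not shown, so treat this as an assumption.
BROKEN_COREFILE = """\
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward postgres 127.0.0.1
    cache 30
}
"""


def forward_directives(corefile):
    """Return a (zone, upstreams) pair for every forward directive."""
    found = []
    for raw in corefile.splitlines():
        # Drop trailing comments, then split on whitespace.
        tokens = raw.split("#", 1)[0].split()
        if len(tokens) >= 3 and tokens[0] == "forward":
            upstreams = [t for t in tokens[2:] if t != "{"]
            found.append((tokens[1], upstreams))
    return found


def suspicious_forwards(corefile):
    """Flag forwards that are not catch-all or that target loopback."""
    flagged = []
    for zone, upstreams in forward_directives(corefile):
        loopback = any(u.startswith("127.") or u == "localhost" for u in upstreams)
        if zone != "." or loopback:
            flagged.append((zone, upstreams))
    return flagged


if __name__ == "__main__":
    for zone, upstreams in suspicious_forwards(BROKEN_COREFILE):
        print(f"check forward for zone {zone!r} -> {upstreams}")
```

Deleting the flagged line outright, as happened in the episode, leaves the cluster with no upstream forwarder at all, which matches the symptom of in-cluster names resolving while external names fail.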

1:38:41 So maybe there was a mistake or, like, we should have, like, deleted that service, and then it probably would have had a wrong IP address. But yeah. And then the last thing was Postgres where you, like, correctly found out that I was just copy pasting commands, and then, like, by mistake, I copy pasted a kubectl command into Postgres' interface. So, yeah, that was a good one. I had to laugh seeing that when you saw that. But, yeah, in Postgres, we revoked select statement access on the quotes table.

1:39:23 We completely disabled all the database connections for the klustered database. So that was the one you were seeing most of the time. And then we also set all the connections that were allowed to zero, and the statement timeout was one millisecond. I think that's what just one is. So, like, even undoing some of the statements that were altering the database wouldn't have succeeded because they would have taken longer than one millisecond. Of course. To get around that by, like, using a different user. Yeah. I went into, like, a postcard for a.
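As a rough illustration of what undoing these Postgres breaks might involve, the Python sketch below builds one SQL statement per break. The function name is hypothetical, and the database name `klustered`, the role `postgres`, and the assumption that the connection limit was set on the database rather than on a role are all inferred from the conversation, not confirmed. The SQL forms themselves (`ALTER ROLE ... RESET`, `ALTER DATABASE ... ALLOW_CONNECTIONS`, `CONNECTION LIMIT`, `GRANT SELECT`) are standard PostgreSQL.

```python
def repair_statements(database="klustered", role="postgres", table="quotes"):
    """Build SQL that would reverse the four breaks described above.

    All three default names are assumptions taken from the conversation.
    Identifiers are interpolated unquoted, so this is for illustration only.
    """
    return [
        # The 1 ms statement timeout goes first; until it is cleared,
        # statements run as the affected role are likely to be cancelled.
        f"ALTER ROLE {role} RESET statement_timeout;",
        # Re-enable connections to the database.
        f"ALTER DATABASE {database} WITH ALLOW_CONNECTIONS true;",
        # -1 means no connection limit.
        f"ALTER DATABASE {database} WITH CONNECTION LIMIT -1;",
        # Restore read access on the table.
        f"GRANT SELECT ON TABLE {table} TO {role};",
    ]


if __name__ == "__main__":
    for statement in repair_statements():
        print(statement)
```

Because the affected database refuses connections, statements like these would have to be run while connected to a different database, and, as mentioned in the conversation, possibly as a different user that is not subject to the timeout.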

1:39:57 Recap and Conclusion

1:40:04 But, I mean, you fixed it anyways. And, yeah, congrats. Yeah. Pleasure. Good to see you. Who was that that suggested moving the Postgres onto the control plane? That was the touch of magic. I'm deep down. I want to receive it. Nice. Nice. Alright. I guess it helped that the break wasn't persistent for that. Right? Like, that was the big one. So if Yeah. I wonder if, like, modified inside. Yes. The data coming back into the database. It's loaded as part of the init scripts. When the container starts, it pulls it from the ConfigMap. See.

1:40:43 Yeah. Every time you reschedule it, you'll fix Postgres every single time. So Okay. Yeah. I mean that's for their future teams. Right? We got quite lucky there, I think. But still, we'll take it. Right? So Yeah. But, I mean, like, all the other things were, like, pretty intricate for sure. And it was equally impressive, if not more than what we did. And seeing people really, like, getting into these, like, nitty gritty details is super fun to watch. I love the format. Alright. Well, thank you very much, everybody. Yeah. That was amazing. That was really fun. I

1:41:20 think we got a nice mix of problems there and a lot of knowledge shared by the way that we were talking and texting. So thank you to Polar Signals. Thank you to my Pulumi colleagues. Yeah. Thank you for having us. Also, saying hi from and Frederic once more. Thanks for having us. It's a pleasure. Thank you to Teleport and Equinix Metal for all their support. See you all next week. Have a good day. Bye. Bye. See you. Take care. Bye.
