About this video
What You'll Learn
- Remove a broken admission webhook and verify API calls return valid service-account tokens again.
- Trace mutating imagePullPolicy changes, then delete stale resource quotas and restart pods to clear blocked scheduling.
- Fix Cilium and DNS outages by validating service host and port settings and repairing CoreDNS ConfigMap typo rollbacks.
Teams from Adobe and Zapier debug each other's broken Kubernetes clusters. Issues span admission webhooks, Cilium networking, image pull policies, resource quotas, etcd manifests, kubelet permissions, and a CoreDNS ConfigMap reverted by Cloud Custodian.
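The service port check from the Adobe session (chapter "Service Port Mismatch (667 vs 666)") can be sketched in Python. This is a minimal sketch, not what the team ran: the function name `unmatched_target_ports` and the trimmed manifests are hypothetical, and which field held 667 is an assumption. Declared container ports are informational in Kubernetes, so a mismatch is a hint worth checking rather than proof of a break.

```python
def unmatched_target_ports(service, pod):
    """Return Service target ports that no container in the pod declares."""
    declared = set()
    for container in pod["spec"]["containers"]:
        for port in container.get("ports", []):
            declared.add(port["containerPort"])
            if "name" in port:
                # targetPort may also refer to a port by name
                declared.add(port["name"])
    # targetPort defaults to the Service port when it is omitted
    targets = [p.get("targetPort", p["port"]) for p in service["spec"]["ports"]]
    return [t for t in targets if t not in declared]


# Hypothetical, trimmed manifests; only the fields the check reads are present.
broken_service = {"spec": {"ports": [{"port": 667, "targetPort": 667}]}}
fixed_service = {"spec": {"ports": [{"port": 666, "targetPort": 666}]}}
app_pod = {"spec": {"containers": [{"name": "app", "ports": [{"containerPort": 666}]}]}}

print(unmatched_target_ports(broken_service, app_pod))
print(unmatched_target_ports(fixed_service, app_pod))
```

The same dictionaries could come from `kubectl get ... -o json` output loaded with the standard `json` module.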
Jump to a chapter
- 0:00 Holding screen
- 1:53 Introduction: Welcome to Clustered
- 2:14 Thanking Sponsors (Teleport & Equinix Metal)
- 3:27 T-shirt Giveaway Announcement
- 3:53 Introducing Team Zapier
- 4:30 Team Zapier Introductions
- 5:49 Start of Team Zapier's Session
- 6:11 Accessing the Cluster (Zapier)
- 7:15 Initial Cluster Checks (Zapier)
- 8:32 Investigating Pending Pods
- 9:33 API Server Authentication Error Discovered
- 10:40 Checking for Admission Webhooks
- 11:30 Deleting a Validating Webhook
- 12:39 Revisiting API Server Errors
- 14:49 Looking at Taints and Scheduling
- 15:09 Worker Node Kubelet Issues
- 17:21 Debugging Cilium Network Problems
- 18:33 Investigating Image Pull Issues
- 19:35 Identifying "Never" Pull Policy on Pod
- 22:00 Debugging Image Pull Policy Source
- 27:27 Mutation to "Never" Pull Policy Identified
- 28:19 Searching for the Mutating Source
- 29:38 Resource Quota Discovered
- 30:24 Deleting Resource Quota
- 31:11 Image Pulls Succeed
- 31:28 Fixing Postgres Stateful Set
- 33:33 Checking Network Policies (NetPol, CNP)
- 34:11 Identifying Problematic Cilium Network Policies
- 35:50 Application Not Responding (Network/Cilium)
- 40:21 Further Cilium Debugging
- 41:11 Discovering More CNPs
- 42:11 Deleting Problematic CNPs
- 42:57 Application Responds Locally
- 43:05 NodePort/External Access Issue
- 44:35 Team Zapier Concludes Session
- 44:45 Wrap-up with Team Zapier
- 45:38 Introducing Team Adobe
- 47:08 Team Adobe Introductions
- 48:47 Start of Team Adobe's Session
- 49:18 Accessing the Cluster (Adobe)
- 50:47 Initial Cluster Checks (Adobe)
- 51:25 Investigating Kubelet Failure
- 51:41 Kubelet Permission Denied
- 52:25 Fixing Kubelet Permissions
- 52:52 Kubectl Still Failing
- 53:15 API Server IP Mismatch Errors
- 54:38 Identifying API Server Advertise Address Issue
- 56:36 Wrong Kubectl Version Used Initially
- 57:58 Fixing Kubectl Alias
- 58:10 API Server Pod Not Running
- 59:35 Checking Static Pod Manifests
- 1:00:26 Unexpected Pod/ReplicaSet Manager Found
- 1:01:14 Removing Unexpected ReplicaSet Manager
- 1:01:57 ETCD Not Starting
- 1:02:11 Debugging ETCD Failure (Expecting IP)
- 1:03:15 Multiple Listen Peer URLs in ETCD Manifest
- 1:04:07 Editing ETCD Manifest
- 1:05:36 Restarting Kubelet (for ETCD changes)
- 1:05:55 ETCD and API Server Running
- 1:06:32 Kubectl Commands Working
- 1:07:05 Cilium Issues (CrashLoop, Restarts)
- 1:07:51 Debugging Cilium Env Vars (KUBERNETES_SERVICE_HOST)
- 1:09:16 Hardcoded Public IP in Cilium Deployment Env Var
- 1:09:54 Fixing Cilium Environment Variable
- 1:11:07 All Control Plane Pods Running, App Still Failing
- 1:11:15 Checking for Network Policies (Adobe)
- 1:11:45 Deleting LimitRange
- 1:12:25 Re-checking Cluster Objects
- 1:12:44 Checking Service Object
- 1:12:52 Service Port Mismatch (667 vs 666)
- 1:13:34 Fixing Service Port
- 1:13:55 Application Database Connection Error
- 1:14:10 Checking Deployment DNS Config
- 1:14:50 Hardcoded Postgres Hostname in App
- 1:15:00 Removing Resource Limits (Deployment)
- 1:15:40 Checking Kubelet DNS Config
- 1:16:16 Debugging DNS Resolution (Dig)
- 1:16:30 CoreDNS Config Map Typo ("health" vs "health")
- 1:16:49 Fixing CoreDNS Config Map Typo
- 1:17:34 Restarting CoreDNS
- 1:18:16 Revisiting DNS Issues (Host Resolv.conf)
- 1:19:21 Re-checking Kubelet DNS Config
- 1:20:05 Checking Search Domain Config Map
- 1:21:16 DNS Still Failing from Pod
- 1:22:22 Kubernetes Service DNS Also Failing
- 1:22:47 CoreDNS Config Map Reverted
- 1:22:59 Suspecting External Reversion (Cron Job?)
- 1:23:22 Creating New CoreDNS Config Map
- 1:24:26 Deploying App with New CoreDNS Config
- 1:25:21 Application Running Successfully (Fixed)
- 1:25:46 Team Adobe Concludes Session
- 1:26:19 Root Cause Revealed (Cloud Custodian)
- 1:26:27 Wrap-up with Team Adobe
- 1:27:29 Giveaway Winners (This Week)
- 1:27:51 Giveaway Winners (Last Week)
- 1:28:23 Outro
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:53 Introduction: Welcome to Clustered
1:53 Hello, and welcome back to the Rawkode Academy. Today is Thursday, which means it is Clustered day. We have teams from Adobe and Zapier who have broken two clusters with forty-eight hours' notice and will attempt to fix them, or at least fix each other's, in forty minutes or less. We'll see how they get on in just a minute. Before that, I do have some housekeeping and sponsors to say thank you to. So thank you to Teleport. I've been using Teleport on Clustered since almost the very beginning. It is a fantastic tool that allows me to share access to these clusters via GitHub single
2:14 Thanking Sponsors (Teleport & Equinix Metal)
2:29 sign on. Not only that, we use their wonderful web terminal editor so that we can all type in the same session. This makes pairing and debugging, especially on broken clusters, a little bit easier, although I won't be bold enough to say easy. So thank you to Teleport. If you wanna learn more, go to Rawkode.live/teleport. That supports the show and keeps the sponsor happy. It means that we can keep doing Clustered and giving away t-shirts, which we'll be doing in a minute. I also wanna say thank you to Equinix Metal. They have provided the hardware for every
2:59 single episode of Clustered, and these are not small machines. I'm not using VMs with virtual CPUs. I'm using big machines with 48 plus cores and a whole ton of RAM. Why? Because it just makes it a little bit more fun. So thank you to Equinix Metal. You can also check out Equinix Metal by using the code Rawkode. This will get you 200 US dollars in credits. Learn more at Rawkode.live/metal. So thank you, Equinix. We are also doing t-shirt giveaways. Thank you to our sponsors. So if you wanna win a t-shirt today, go to Rawkode.live/win.
3:27 T-shirt Giveaway Announcement
3:33 You can tweet or follow accounts. These will get you additional entries. I forgot to draw the winners last week, so we're going to do two winners from the last week's competition, and we'll do two winners from today's competition at the end of the episode. Don't let me forget audience. This is on you now. That's not a very nice thing to say actually, is it? Like just blame the audience. All right. Let's introduce our first team. Hello. How are you all doing? Hey. Hey. Doing good. How about you? Yeah. I did alright. I I I always have, like, this existential
3:53 Introducing Team Zapier
4:10 fear before Clustered, but then I settle into it and have a bit more fun. Are you sharing the fear with me a little bit here? I hope so. Little bit. Alright. Well, Team Zapier, can you please introduce yourselves? Tell us a little bit about you, and then we'll have some fun with our clusters. Yeah. So I'm John Anderson. I'm the engineering manager for the SRE team at Zapier. So I mostly create Jira tickets, but luckily, Adam wrote a CLI in Rust for interacting with Jira. So I at least
4:30 Team Zapier Introductions
4:44 stay sharp in the terminal. Hey. I'm Joe, and I used to perform live in a Pink Floyd tribute band in front of hundreds of people, and somehow this is a little bit more nerve-wracking. And, hey, I'm Adam. I'm an SRE on the team, and, apparently, I'm famous for writing some CLI that I guess I use a little bit, but, yeah, it's nice. Alright. Thank you all for sharing. And I feel like I've just been told that I've been pronouncing Zapier wrong my entire life. So thank you for correcting me live in front of our audience, John. Anyway
5:24 You will not be the first. That's probably the biggest probably I have. What if I don't even know if Zapier or Zapier? Like, say it again? It rhymes with happier. Happier. That's how you remember. Yeah. Zapier happier. Zapier happier. Alright. Well, thank you all for the introductions. We're all excited to see how you get on with this cluster. So I am gonna share my screen, and we have forty two minutes. Those two minutes are used as bonus minutes for me to give you access to the cluster. So I'm gonna come in and modify your role.
5:49 Start of Team Zapier's Session
6:03 You have not been able to see the other cluster, which I love about Teleport. And now you can. So I will open a session on our control plane, and this is gonna be Adobe. Please feel free, don't feel free, please join the session and give me an echo or some command to let me know that you're in it and then we'll get things kicked off for today. Alright. We got one. I think John and I are sharing the same session and not in the same one as you guys are. Yeah. Please go to active sessions.
6:11 Accessing the Cluster (Zapier)
6:52 Find this one, click the join button, and then we'll all be I mean, the audience can see everything that we type, which we found to be quite important on Clustered. So If it's just me then, I'm the only one that found the correct cluster. Alright. It looks like you're all in. So set up your kubeconfig. Set up any aliases that you need. Best of luck. Oh, we got a special script. We've got it. We're ready. We're ready to go. kubectl, g a q, etcd, helm. Oh, yeah. Even setting up an alias. Alright.
7:15 Initial Cluster Checks (Zapier)
7:27 Oh, yeah. So for this. Although I think your tab completion may have failed. Yeah. Okay. So let's see. So, yeah, so what that did, if you look we didn't trust Adobe. So we we have our own cube control now. So let's check. Perfect. That's exactly what we want. Okay. It's fun. Yeah. Yeah. We we don't need you. You go away since you're not ready. That's interesting. I would have expected that to Cartoon Creek live. So you're going Can we check sorry. I was gonna say, can we check the status of the like, all the pods,
8:15 like, especially kube-system? Yeah. That's sorry. That's okay. Yeah. That's That's fine. So That's fine. We got a lot of pending pods here. Yeah. All those terminating, we're not gonna worry about those. Here. So let's see. Where are those? So those are probably oh. All on the worker. Yeah. Makes sense. Yeah. Those are on the workers. So okay. So let's see. Get node. So we want to get rid of this, but that won't actually go away right now. Right? So so we're gonna have to figure out why that won't drain. So can we Yeah. Do something simple like That's
8:32 Investigating Pending Pods
9:06 good. No. I'm not sure. We should check check API server logs for that one. I'm gonna take a look at the logs on the worker itself, see if anything interesting is going on there. Alright. The API server is out. Unable to authenticate request. Okay. So this token is Uh-oh. The one thing we didn't want him to do. Invalid bearer token, service account token has been invalidated. Yep. Well, we've not seen that before on Clustered. Yeah. So they so we're gonna have to reissue that. Right? Let's see. Right. I got a little jiggle problem. I can't
9:33 API Server Authentication Error Discovered
10:08 see half the screen. Try typing reset. Hopefully, that'll kick it back. Okay. Alright. So how do you guys wanna fix the fact that we don't actually oh, well, it's actually not. Yeah. So the kube API server can't talk to stuff right now. Unable to authenticate the request, invalid token. So that means it's talking to and it can't talk to it. Right? Oh, and then there's also some sort of webhook. Check the admission.cluster.live. Yeah. I don't think there needs to be an admission webhook. That could be part of it as well. We should probably if we can
10:40 Checking for Admission Webhooks
10:54 Oh, yeah. Let's see. We'll do get all. What do we got in here? Yeah. Alright. I'll put it into less so that we can scroll. Let's see. I don't see an admission webhook in there. You get all doesn't return The dev ticket. Yeah. That's Oh, yeah. Get All is not all. Just to confuse you more. Yeah. There we go. And not in it's not mutating. It was admission. Yeah. But Out whoops. Yeah. There we go. There we go. That one. That one. Yeah. Straight in with the delete. Get out of here. Get out of here.
11:30 Deleting a Validating Webhook
11:47 We don't want you. Someone spent time handcrafting that validating admission webhook configuration. They just deleted it without even looking. Some somebody. Come back. Yeah. That's good. Yeah. Alright. So now we need, yeah. So now we need to fix that bearer token issue. Right? Because we still have that. Yes. Yes. Or was that the hook that was giving us the problem? Could have been. Check the yeah. Check double check the API server logs. Right. I've there you go. If your screen has given you a headache, just resize your window by a pixel. It it does help. Alright. So we still get the Yeah. Reset
12:39 Revisiting API Server Errors
12:39 issues. Yeah. And I'm assuming that's talking to ETCD. Right? So but a bearer token. It shouldn't be using Yep. Bearer oh, let's see. Yeah. Check. What is that? Endpoint. Check if was up. Is that what it is? Don't forget. Health. Health? Yeah. And then I think endpoint health. Yeah. Yep. Okay. Okay. Alright. There's also an endpoint status, which might be useful depending on what you're trying to check. K. That looks okay. Yeah. Yeah. So I think we're good there. That means it can authenticate. So what would it be using a token for? Shouldn't be using a token for anything.
13:42 Right. And we should probably check the the API server. There are other config files. Yeah? So which log was giving you the bearer token? Sorry. The API server. So the API server is making a request. Or is this is this a request coming to the API server? Oh, yeah. Probably. I mean, it could be the worker. We know the worker's not. Yeah. Check check yeah. We can check that because if, like, the rest of the control plane can access, that's fine. We also might get more information from the other control plane components. Yeah. Alright. I will open a session on Adobe
14:33 Worker one. Feel free to join that if you wanna do some debugging over there. Let's open that. Alright. So before I do that, so I've heard of this famous Rawkode scheduler. I think infamous probably more than famous, but yeah. So that goes on the spec. Right? So Yep. Oh, check at the bottom as well. One node had a taint, unreachable, so that's probably the network. Which one did? The worker, I assume. Unless it is it could be that the controller is also unreachable, but that makes sense. Well yeah. So we probably want yeah. We need to remove the taint off the master anyways if
15:09 Worker Node Kubelet Issues
15:30 we're gonna do that. Right? Yeah. If we want anything scheduled there. If we can do that now, that would maybe get the network scheduling. Probably, there's something else wrong with the network. Oh, the worker is configured wrong. When I did kubectl get pods, at least that was pointed to localhost 8080. There is no authentication on the worker for you to be able to run kubectl. Yeah. Right. That makes sense. Alright. Okay. So I just untainted the master. That would be a look now. So this is untainted now. Oh, sorry. Oh, it's still there.
16:38 Oh, so, yeah, I think we're gonna have to fix that API server first because when I When I did that, that should've the server doesn't have a node resource. Oh, whoops. That's my bad. That's why it didn't work. Okay. So that's untainted now. Okay. So we can there's no sorry. Go ahead. Nope. Go ahead. I know that on the worker journalctl says there are no logs for kubelet, so that might be part of why it's not there too. Yeah. And it doesn't seem to be running at all. Yeah. Or interested in a worker? Yeah. I'm gonna check. Yeah.
17:21 Debugging Cilium Network Problems
17:32 Where is it? Says that I'm on never pull policy as well. Connected. Okay. So there we go. Okay. Kubelet is back on the worker. Okay. So sorry, David. We're it sounds like we're splitting up. I I don't know what what you're watching. But I'm on a worker node right now looking at the journalctl logs for the kubelet, which doesn't look too bad. No. It doesn't, honestly. Better than I expected. And I see we don't have a launch. Back to the master. Yeah. I cordoned the worker and drained it, and so now I'm moving everything to the
18:18 control plane for now, at least. K. The worker does say it's ready, though. Oh, yeah. I think it was just cordoned and then kubelet shut down. Right? It kinda looks like maybe what happened there. Are we still seeing tons of API server errors or oh, error image never pulled. Yeah. But that's fine. So Yeah. Alright. So that should make a new one. Right? Hopefully. Delete the pod? Yep. Still. Still. Yeah. Oh, there it is. Well, there's a remove. Out of this? So it's just as a Yeah. See error. Yeah. Yep. Let's so that means my save didn't actually work.
19:35 Identifying "Never" Pull Policy on Pod
19:35 Let's see. That's annoying. Oh, I am. Yeah. Yeah. Because I have the all oh, let's see. Let's see. Pull. Oh, I think I where is it? I think I edited the wrong thing. Nope. Right there. Image. Yeah. It should be there. Alright. So let's woah. Alright. Alright. So let's see what this is actually doing in here. Yeah. It's not present with an policy. Okay. So let's see. There's a bunch of I'm back on the worker. There's a bunch of law I checked the Kubelet logs again. There's a bunch of logs that it's struggling to, I think,
20:31 destroy network. So it doesn't seem it can't create networks either. I'm not very good with crictl. We're still on a Docker control plane. So is crictl pull supposed to be what I'm doing? CTR is the containerd one, or or wait. Yeah. There's You can do it with both commands, I believe. Oh, yeah. You should do with the one that has a but how do I pull? Yeah. Can get the list. You were right. Great. It's awful. Okay. So that didn't seem like it was doing anything. Let's do ghcr.io. Alright. It's there. So it should be fallen.
21:20 And it appears to be GitHub, which is good. That better than we could have hoped for. Let's see. V one. Alright. We did get we we just got a lot of errors from on the worker. And, yeah, they're what Adam said. It's a bunch of Cilium failure of run network stuffs. So I'm guessing Cilium is broken. Alright. So yeah. Well, I think those pods are just not sorry. Yeah. Look at we got one pending on the worker. Yeah. That Cilium Oh, but on the worker, don't care. I cordoned the worker for now. So we only
22:00 Debugging Image Pull Policy Source
22:04 care about the control plane right now. Right. So yeah. So we have one that's there. That's fine. So what I'm thinking is there's some configuration that's preventing us from pulling that's not the deploy. Like, the deploy is saying pull. So what else would be saying don't pull images? So if you do a get deploy cluster dash o yaml and grep for pull policy, it it does say always? Yeah. Oops. Wrong cluster. Yeah. It does indeed. Cool. Yeah. So it updated, but then if you check here it's still erroring out there. And so if I delete that,
22:59 it's gonna make a new one. Oh, actually, it does. Yeah. So it does delete. Oh. Oh, it's green. Oh, that's not a problem. Yeah. Alright. And that's definitely scheduled on the control plan? Oh, it's good to go. Yep. Interesting. Oh, but Postgres never moved. Well, let's just That was bold. Yeah. That one's bold. What were they? Could've started with the pod. K. We should restore that. Yeah. I'll I'll restart it. So but, like, what I wanted to do is actually that. Okay. There we go. That sounds fine. That sounds fine. Still terminated. That part, you you probably wanna
24:02 gonna have to for because it can't clean up the it's it's Oh, yeah. Trying to clean up the network. Right? So it's not gonna do that. It's not gonna remove it until network goes there or the the network is fixed. So I think we should probably fix figure out the network problem. We should check. We've got different, probably, config maps first. Alright. Do you wanna run drive it for a little bit? Yeah. The audience is wanting to see the container d config. They think something's gonna be funny there. Container d config? Alright. That's worth checking.
24:37 Get, what am I looking for? You would just drop config dump from containerd. It's like I think it's containerd config dump or crictl config dump. I forget. Containerd config dump problem link. Yeah. There there is no config. Containerd config dump doesn't show anything on the It says that containerd/config.toml doesn't exist. One way default, I guess. Control plane? You're in your own session. Oh, yeah. Sorry. I was letting you guys do your thing while just poke around stuff. Yeah. Yeah. Let me join up. Alright. So if containerd has no config,
25:25 it's unlikely containerd is responsible for the pull policy, which makes it more interesting. Yeah. In fact, if we sorry. I'm just taking over a little bit. But if we do that, there's there's nothing interesting there at all. At least nothing that I see. Interesting. So Not a good way. What else yeah. What else controls image pull policies outside of the image itself or outside the spec itself? That's a good question. Like, maybe that kubelet now because it'd be on Well, The oh, kubelet is coordinating. So it would at it might have logs.
26:14 If we if we terminate the pod, get it to reschedule, and then check the recent kubelet logs, we might see more information about why it's choosing not to, possibly. I find that interesting that this data and the events show another pool. The status also, like, if you did So you described the the deploy. What if you described the pod itself? Does it have a pull policy? Did we check that? That's a good question. Oops. Sorry. We can do it for now. Oh, I'm lagging. Can someone type real quick? Put pod in there for me. It.
27:01 Oh. That's there you go. I think we need to add -o yaml to see if it's policy. Oh, no. That's true. Dash I on the oh, you did the dash I. Before the get pod, not describe pod. Oh, yeah. Oops. Thank you. Never. Oh. That's interesting. Getting mutated. Yep. But we got rid of all that. Did we check mutating go ahead. Yeah. Did we check all namespaces for mutating admission controllers? Or I believe you did. Yeah. I I sure I remember a dash capital A on that, but it doesn't hurt to double check things.
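The comparison made here, deployment template versus live pod, can be sketched in Python. This is a minimal sketch under stated assumptions: `pull_policy_drift`, the container name `app`, and the trimmed manifests are hypothetical, standing in for `kubectl get ... -o json` output. A difference means the pod was not created from the current template, whether through mutation or because it belongs to an older ReplicaSet.

```python
def pull_policy_drift(deployment, pod):
    """Map container name -> (template policy, live policy) where they differ."""
    template = deployment["spec"]["template"]["spec"]["containers"]
    wanted = {c["name"]: c.get("imagePullPolicy") for c in template}
    drift = {}
    for container in pod["spec"]["containers"]:
        name = container["name"]
        live = container.get("imagePullPolicy")
        if name in wanted and wanted[name] != live:
            drift[name] = (wanted[name], live)
    return drift


# Hypothetical, trimmed objects: the template asks for Always, the pod says Never.
deploy = {"spec": {"template": {"spec": {"containers": [
    {"name": "app", "imagePullPolicy": "Always"}]}}}}
stale_pod = {"spec": {"containers": [{"name": "app", "imagePullPolicy": "Never"}]}}
fresh_pod = {"spec": {"containers": [{"name": "app", "imagePullPolicy": "Always"}]}}

print(pull_policy_drift(deploy, stale_pod))
print(pull_policy_drift(deploy, fresh_pod))
```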
27:27 Mutation to "Never" Pull Policy Identified
27:48 We're halfway through. Twenty minutes left. Alright. So here, let's Go ahead. Yeah. So it it went away. What are the other ones? Validating? Validating. And that was actually what we had was the validating. Yeah. We removed the validating. And we don't have, like, or anything in here? No? That would also register a mutating webhook. So it's unlikely to be Kyverno. Alright. What else could be mutating our pod? Controller manager would be responsible for creating did we check the replica set? The replica set never. That's a good idea. Yeah. Yeah. That's a good idea. So I'm just gonna edit the pod as well
28:19 Searching for the Mutating Source
28:37 to see what happens. It's immutable. Rep rep yeah. Replica sets are immutable. Yeah. It's immutable, unfortunately. Yeah. Nice try, though. Well, here. I no. No. I I will not be fooled. I I like that this is the multifaceted break because you've got something modifying the spec to never pull, and you're actually unable to do a crictl pull as well. So I think you can fix either one of these and you might be alright. But Right. I'm definitely curious about both breaks. So look at this. So so ready. So let's just describe. K. So here, Adam, take over if you wanna, like,
29:28 get the replica set as YAML and look at what it's doing. Do we have There's that. Oh, look at that. It's forbidden. Yeah. We got some resources. Exceeded quota limits. Oh, yes. Sneaky. Yeah. Into Dash dash all. On you go. It's like Get get the spirit of things here. Alright. That was Check for limit ranges as well just to be safe. The obscure one that no one ever remembers, then there you go. Yeah. Somebody did. Oh, yeah. There we go. Gone. Alright. Alright. You will have to delete that pod to kick start the replica set switchover.
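The "exceeded quota" failure found here can be illustrated with a small Python sketch. It assumes the quota constrains `requests.cpu`; the transcript does not say which resource was limited, and `cpu_millis`, `blocked_by_quota`, and the numbers are hypothetical. The idea is that a new pod is rejected when its request plus what is already used would exceed the hard limit, which is why the new ReplicaSet could not create pods.

```python
def cpu_millis(quantity):
    """Convert a Kubernetes CPU quantity such as '250m' or '1' to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def blocked_by_quota(quota, requested_cpu, resource="requests.cpu"):
    """True if admitting a pod with this CPU request would exceed the quota."""
    hard = quota["status"]["hard"].get(resource)
    if hard is None:
        return False  # this quota does not constrain the resource
    used = quota["status"].get("used", {}).get(resource, "0")
    return cpu_millis(used) + cpu_millis(requested_cpu) > cpu_millis(hard)


# Hypothetical ResourceQuota status: 900m of a 1-core budget is already used.
example_quota = {"status": {"hard": {"requests.cpu": "1"},
                            "used": {"requests.cpu": "900m"}}}

print(blocked_by_quota(example_quota, "250m"))
print(blocked_by_quota(example_quota, "100m"))
```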
30:24 Deleting Resource Quota
30:35 That's a really cool effect. Like, because the resource quota was there, we thought we were getting a new pod, but we're actually getting an old pod for an old replica set. That was cool. Yep. Yeah. That was clever. Yeah. That was yeah. That's an effect you wanna expect. Yep. Delete pod. Yeah. Adobe. Yeah. Hopefully, we are evil enough. That that I didn't expect. Oh. Oh. We still have it. Okay. There so there's three replica sets. Can we get rid of two of them so we can know which one we're, modifying? I don't know how to do it.
31:11 Image Pulls Succeed
31:15 Good work. Someone's enthusiastic with the ban hammer. Oh, are we gonna get a container? We are. It's pulling an image. Oh, it's pulling an image. Nice. Oh, buddy. We still need to fix the Postgres one too. Running. We got that. Well, k delete Postgres SQL. This is risky if it doesn't The pod? Or Right. So even though you reapplied the stateful set, because it's a stateful set, that postgres zero means it'll never spin up a new one. Right? Even though it's an old stateful So we had to force it. Yeah. Like, graceful, whatever. Yeah. Because it does the Yeah. Yeah.
31:28 Fixing Postgres Stateful Set
31:58 Ordered rollout. I just wanna I wanna I wanna check for what did I type wrong? Post risk. Is it post guys go Oh, did it actually delete? Oh, it might have deleted. I didn't check afterwards. Yeah. It's gone. Okay. There you go. Describe that. Forbidden. We got more. Alright. It's the same. It just needs Yeah. Any change will do. Yeah. I'm just adding an annotation. Wait. Now I need to add it on the pod. Right? Oh, okay. This is easy. Great pod. Yep. Boom. Alright. Nope. Still over at zero. Oh, yeah. Because I think you There you
33:24 go. There. Oh, there we go. Curl local host 30000. Let's see what it's doing now. Check for network policies. You, Adobe. As with everything else. What's the point in knowing all the things if you don't use all the things? Eventually, we're gonna have an episode where someone brings us stress that just deletes all webhooks, all network policies. K. Oh, get the service. Check what endpoints we have. Well, I should have filtered then. It's API server. We'll just Okay. I'm gonna filter that. Yeah. What do we got? We've got those. Check the endpoints too. Okay. Yep.
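Checking what endpoints a Service has, as suggested here, can be sketched in Python against the shape of an Endpoints object. This is a minimal sketch: `ready_addresses`, the IP address, and the trimmed objects are hypothetical, and the port simply reuses the 666 mentioned later in the session. An empty result would mean the Service has no ready backends to route to.

```python
def ready_addresses(endpoints):
    """List 'ip:port' pairs for every ready address in an Endpoints object."""
    pairs = []
    for subset in endpoints.get("subsets", []):
        # Only "addresses" are ready; "notReadyAddresses" is deliberately ignored.
        for address in subset.get("addresses", []):
            for port in subset.get("ports", []):
                pairs.append(f'{address["ip"]}:{port["port"]}')
    return pairs


# Hypothetical, trimmed Endpoints objects.
healthy_eps = {"subsets": [{"addresses": [{"ip": "10.0.0.12"}],
                            "ports": [{"port": 666}]}]}
empty_eps = {"subsets": [{"notReadyAddresses": [{"ip": "10.0.0.12"}],
                          "ports": [{"port": 666}]}]}

print(ready_addresses(healthy_eps))
print(ready_addresses(empty_eps))
```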
34:11 Identifying Problematic Cilium Network Policies
34:31 Yeah. That looks good. So I'll do k get e p. Let's see what e p's we have. Right. At the right one, it looks like that's wait. Node name control plane. Yeah. That's right. I mean, it's no. Yeah. It looks right. Yeah. I tried it from a browser, and I I don't think we can. Yeah. Here. One second. I'll get back. Yeah. Go ahead. You're not in the pod. Oh, what? Yeah. So people don't know what's in the port has changed. That is my fault From a special eBPF episode we're doing at KubeCon, I did change the port to a
35:33 privileged port for reasons. So, yeah, this looks like it's look suspicious, but it's okay. But she's running, but she's not responding. It's p d. So why would she be Are are we still on v one? We should just change it to v two in case we did anything we is v two. Is it v two? Okay. Yeah. So is oh, is v two does v two still have the old part port number, or is it the new one? They should both be compiled and pushed with 666 as a port. Okay. Okay. Dumb question. We're on v 2 now.
35:50 Application Not Responding (Network/Cilium)
36:17 Okay. If that container doesn't have the port exposed specifically, can you still hit a new service? You can. I believe they're just they're just hints. Yep. Okay. Yeah. So, like, the problem right now is this. Right? That's not actually listening to anything. Oh, that should definitely work. Is is Cilium still down? Because it could be messing with the routing. We probably wanna get it at least try restarting those pods and see if it comes back. Oh, you were in the container when you ran that curl. Yeah. That should work. Yeah. That should've worked. Yeah. That should've
36:57 worked. We have a problem in there. So we're Just in case I am the I am the the reason this is broken, can you try curl an eighty eighty? I mean, just in case. Okay. Yeah. Makes sense. May maybe I'm the bad actor in this one. Oh, good. We'll multiple people to that for us. Nope. So the 666, it's in there. It's just hanging. It could be trying to reach Postgres and failing. Oh, you're not Right. Yeah. Exactly. It still could be the network. So Yeah. Okay. What's the emissary and ambassador pods? You can disregard those. Those are
37:37 a load balancer or an ingress controller that we're not using. So Okay. So even though they don't work, it's It's alright. Yeah. You're all good. resolv.conf's good in there. Alright. Adam, you can take over again. Yeah. So Choco in the comments says, could it be a Cilium network policy? I mean, it definitely could be, but the local and local host, I'm sorry, the container can't be affected by network policies because it never touches the network. So there's definitely something else. It could be Unless it's trying to get to Postgres. Yes. Like it's saying because it
38:13 can't get to Postgres. So I do set the time out on a container of five seconds. You should see a GIF come back regardless, which I don't think we were seeing. Describe that part. Oh. Got ten minutes. K. Yeah. Let's see why Cilium's not coming up. Zero nodes are available. Insufficient CPU. Okay. More specs. So edit edit Cilium. I mean, you can run an lscpu. There's definitely sufficient CPU in these machines. Looking at htop is always quite funny. Oh, good lord. Not not a lot. See, one one didn't match the pod's node affinity
39:09 slash selector. One node didn't match. Okay. So edit the oh, yeah. Edit the daemon set or deploy. Well, you deploy the stateful set cleanly from remote. Right? Oh, is this Cilium? Yeah. You're okay. This is Cilium, not Postgres. This is yeah. So they modified Cilium. A reminder that there are hints from Jake in the chat. So K. Oh, there are hints. Oh, cat. No. Don't do it. Don't do it. We got ten minutes. This is Alright. Well, let's set up the deploy first, and then we'll look at them with the deploy. Because we can see, you know, zero, two
39:57 nodes are available, one has insufficient CPU. They have confirmed that Cilium was not modified. Fix the affinity. Cilium was not modified. Okay. Well Thank you, Jake. Then why is it failing to schedule? Right. Oh, well, then that means kube-scheduler or something like that. Mike forgot to see if the hint fails, so you're hintless. But they are in the chat and active if you need hints. Alright. Yeah. No worries. So let's see. Warning failed scheduling. Default schedule. Is there multiple schedulers running? Like, check just edit the daemon set and check what its scheduler it's set to work
40:21 Further Cilium Debugging
40:35 with. But you you have a running Cilium on the control plane. Right? Are you debugging the worker or not Cilium? Well, we cordoned the worker, so it shouldn't be trying to put one over there. Oh, it's a daemon set. So it would still be scheduled even though the worker is cordoned. Yeah. So that's probably the worker one. So we're probably okay. So let's keep let's figure out why the pod can't talk to Postgres. Right. Right. I think network policies are the good So build up. CNPs. Let's see. So I forget what the Cilium ones are.
41:11 Discovering More CNPs
41:16 CNP and CCNP. That's the cluster-wide. Oh, look at that. Adobe. Now be careful with this delete because there's two there. Yeah. Don't delete it at fault until we at least store it if you delete it. So I assume we You may wanna look at these. Is the one we like. Yeah. We should, let's look at these. Yeah. That one's You guys 85. And then let's look at the other one. Yeah. Those are naughty. Seems really five four three two. Yeah. I think we get rid of both of those. Yep. Sometimes it's better to check. Right?
42:11 Deleting Problematic CNPs
42:15 Yep. What did I miss? You got a clustered extra. Oh, yeah. There we go. Alright. Let's check. CNPs? Pearl. Oh, yes. CNPs. But you said that CNPs and then Yeah. Large ones with their policies, man. Like, they're really see Alright. Five minutes. I feel you're close. Well, I have a guy and a delete in there. K. Yeah. Alright. Okay. That's text. That says me too. Do you want to put it in the browser? Well Can you open it in the browser for us so we can see the video? And try There's nothing listening on
43:05 NodePort/External Access Issue
43:08 port 30000 on the control plane. Yeah. So I think there's still The node port. Yeah. There's a node. Can we move it back to the worker? Let's just revert that. I think we're good from but your current local host, it doesn't work. Yeah. The end Yeah. It works. We're fixed, but we'd like to see it in the browser because the video's fine. Yeah. But right. That's the point: there's no external IP for that clustered node port, and it's not listening on port 30000 on the machine itself, so it's not gonna be able
43:39 to get out to the outside world. I don't know why it's not. Yeah. I can't get to it through a browser. Yeah. If you So, like, there is a node port, but if you do there's nothing there. Well, this is because, sadly, I'm using eBPF. Okay. I don't know. Adobe, confirm whether this is you or accidental, and we'll decide if we need to spend time. Because, I mean, we've got the curl. I can see the v two asset web m. I'm kind of happy. I would love to see the dance, of course, but
44:21 I don't know if this is intended or collateral damage. So I think this was you, David. Yeah. It could very much be me. But accidental. Alright. Then we'll consider this fixed. Well done. Well, well done. And you had a whole three minutes to spare. Perfect. Yeah. Oh, nice work. Yeah. Zach is a little smaller, so we don't use all those policies. Yeah. There was a lot to debug there. The resource quotas and the limit ranges were great because those are so esoteric. Well, I mean, resource quotas people probably do use. But limit ranges, I've
44:45 Wrap-up with Team Zapier
45:05 never seen them used in the wild, not for a long time. So alright. Good job. Well, thank you for joining me. Feel free to relax now and join us on the YouTube video and hopefully join us in the chat as we invite team Adobe to come on and take a look at your cluster. Oh, we're definitely gonna be in there trolling them after that one. Alright. I'll let you get back to it. I'll see you. Thank you very much and good job. Yeah. Bye. Bye. Thank you. Alright. I'll pop over here just now. Team Adobe,
45:38 Introducing Team Adobe
45:39 come and join us. I will get the session ready. Alrighty. Let's see. We got Mike, Jacob, Carlos. I'll just give that one more minute. So how's that? Are you enjoying watching someone else work through your twister? Oh, I thought they were gonna have it in the first couple minutes for a second. That's it. It never seems to go that way. But those were some some sneaky breaks. And Yeah. The A lot of repetition network policies, cluster wide network policies. Just making sure your blast radius. We thought for the validating webhook, we had tested it so they'd have to go to
46:38 etcd for the admission controller. But I allowed them to just delete it. We must have missed something. So that happens a lot as well. I wouldn't worry about it. And then only three minutes to spare, so you definitely kept them busy for that. Yeah. Well, thank you for joining us. We're now gonna attempt another cluster, but if we could start with some introductions. We'll start with you, Mike. Just work our way around clockwise. Feel free to say hello, share anything you want to share, and then we'll hand you over to a cluster.
47:08 Team Adobe Introductions
47:08 Yeah. I'm Mike Tujaran. I'm one of Adobe's engineers on what we call our Ethos platform. It's kind of our bespoke Kubernetes install that runs the vast majority of Adobe's containerized workflows. Adrian? Hey, guys. I'm Adrian. I'm a cloud software engineer at Adobe, and I'm also working together with Mike and Jacob on the Adobe Kubernetes platform. And I am very happy and hope we'll get to clean up the cluster today. Thank you. Thank you. Carlos? Hey. I'm Carlos Sanchez. I work at Adobe Experience Manager. We use the clusters, the Kubernetes clusters that these guys build for
48:04 us. We break them a lot, so maybe they got some experience from this. And now I'm scared about what's in here for us. Awesome. Thank you, Jacob. K. I think I'm back. You are back. You wanna say hello? Hello. Hello. Sorry. I was having some connectivity issues there. But, yes, I am Jacob from Adobe. I'm a lead engineer. I've been working with Kubernetes for five years now. It's been a wild ride. Have a lot of war stories, so some fun stuff here. Awesome. Thank you all for sharing. So I am gonna pop over to the screen share,
48:47 Start of Team Adobe's Session
48:47 give you access to your other cluster, and then I'll give you a minute to get set up. You have to export your kubeconfig, configure any aliases, and you'll be good from there. Awesome. Alright. In just a moment, this session to the control plane will be open. From there, you can use active sessions and join. Oh, we have a wee hello as well for you. Look at that. Come and join the session. Let me know that you're there and best of luck. Alright. This is where I get nervous. Alright. We got one in. Good. I'm gonna
49:18 Accessing the Cluster (Adobe)
49:43 join myself just in case we have a Vim, which of course we always have Vim. Alright. Four people. Awesome. Alright. Take it away. Good luck. Alright. Jacob, you still typing? Yeah. So first things first. Let's go get a script that the Zapier team just used, and let's go ahead and, you know, execute that. I thought it was great that they just made that a public gist that I could just go, you know, download and run. So, you know, thank you. I really appreciate you giving us our kubectl and our crictl and a couple of things there.
50:32 And now that that's done, now let's go add our own and add a couple of things there and source that. Okay. So now let's see what we got. There is no such thing as cheating on Clustered, although team Zapier are trying to claim otherwise. There goes the admin.conf. Okay. Well, let's go ahead and look at the API server. Nothing. Let's go look for a crew. Your manager, had a controller manager, and a scheduler. K. Let's see. From here, let's take a look at the manifest and see what we have. Actually, let's first look at the Kubelet to
51:25 Investigating Kubelet Failure
51:26 see if that's actually running stuff. Exited. Exec. K. Permission denied. Uh-oh. Did you swap out the Kubelet? I don't know. I guess it depends how much you trust the timestamp on a binary on a random Linux machine you've been given. Right? What is today's date? Yeah. The date is correct. It was it was correct. Yeah. It's the twenty eighth. That looks okay. So problem spawning the Kubelet. Let's see. Thoughts here, team? What are you what are you thinking? Where should we go next? What permissions on the Kubelet are? Yep. The config is not set. That's right. Yeah. Where's the wee x?
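The "permission denied" on exec, followed by the question about the missing x, points at an execute bit that had been stripped from the Kubelet binary. A minimal Python sketch of that check and fix, assuming a temporary file as a stand-in because the session never shows the binary's path; `is_executable` and `restore_exec_bit` are illustrative names, not anything the team ran:

```python
import os
import stat
import tempfile

# All three execute bits: user, group, other.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def is_executable(path: str) -> bool:
    """Return True if any execute bit is set on the file at path."""
    return bool(os.stat(path).st_mode & EXEC_BITS)


def restore_exec_bit(path: str) -> None:
    """Add the execute bits back, leaving the other permission bits alone."""
    os.chmod(path, os.stat(path).st_mode | EXEC_BITS)


# Stand-in for the sabotaged binary (hypothetical; the real path is not in the transcript).
with tempfile.NamedTemporaryFile(delete=False) as fake_binary:
    fake_path = fake_binary.name
os.chmod(fake_path, 0o644)  # simulate the missing x

before = is_executable(fake_path)
restore_exec_bit(fake_path)
after = is_executable(fake_path)
os.remove(fake_path)
```

On the real host this is the equivalent of looking at the file mode and adding the execute bit back before restarting the service.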
52:25 Fixing Kubelet Permissions
52:29 Yes. That one's simple. Let's see what journal is doing now. We might need to restart it. K. True version. Oh, starting stuff. That looks healthy. K. No. It's not found, so there might be a problem with joining the cluster. Let's go see now if got an API server running to scheduler controller manager. Not seeing oh, the API's are. Yeah. K. K. So let's see what happens. Nothing. Nothing. Wonderful. So let's go look at ports. Hold on. Right there. Do I go back to what you were just looking at? I can't scroll through. Oh, looks like there was a mismatch.
53:15 API Server IP Mismatch Errors
53:30 I mean, sorry, an IP mismatch between what IP it was trying to connect to and what the advertised IP was for the API server. K. Should those IP should that IP be just something different? What is the API server? KubeConfig KubeConfig settings. For five, should that be it, or should can we just change it to local host? You can. Yeah. Are you sure it's local host? That'd be fine. So I show the Not sure. It depends. The it should match. The CT scan is 145. It'd be 1454103121. Yep. That's what it's set. Right. So let's take a look at what the API
54:31 server's advertising as. I thought I saw a difference between the two. In the manifest? Yeah. Advertise address, it's the 10-dot network. Should that be set to the public? I'm not positive on that. Yeah. Please also check the real IP of the master node just to make sure that is not a misconfiguration here. So it probably should be advertising the private IP or not. So if it's advertising the private IP, the kubeconfig should probably be pointing to the private IP as well. K. Let's see. I'm pulling up some other resources right now. Mike, do you wanna you wanna
54:38 Identifying API Server Advertise Address Issue
55:30 take over and start doing that? Sure. I'm looking at something else really quick. Scroll on me. If you have scrolling issues, you can just drag your window one pixel or type reset, and it should help. Type reset? Okay. Alright. I can type new API server. You should be able to just copy that address, which I'm assuming you're doing, into the admin.conf. Nope. I find it interesting that you're getting the correct error message, but you're also getting something else on the output. Oh, yeah. So keep And even weirder that you used their script to switch
56:36 Wrong Kubectl Version Used Initially
56:51 out the kubectl. That's not the one the script downloaded, I don't think. Where's sbin. It's /usr/local/sbin. What, 1.23 dot what do we remember? Three. That's a month. It's /usr/local/sbin. Oops. /usr/local/bin/kubectl. Kubectl. There we go. Now we're on a more normal kubectl. So Kathy, wrap API server. API server is not sorry, John. Let's look at what's doing here real quick. So that's there. Crictl. Also not used to crictl yet. I think it's container list. No. I see the list. Crictl ps, I think.
58:10 API Server Pod Not Running
59:28 Yep. Or pods list or yeah. Yes. Yeah. So the API server still isn't running. So what's it like, Rawkode with the dash a? Yeah. Let's check the static. Yeah. etcd isn't running. 351. Oh, I see something's getting in there. Does anything jump out at anybody real quick off of that? If not This, the kube replica manager, that's not supposed to be there. Yeah. That sounds fishy. And our etcd config has peer URLs when these are set up as single etcd instances. So this is one thing. I'm just gonna remove this replica manager.
1:00:26 Unexpected Pod/ReplicaSet Manager Found
1:00:42 We need to Yeah. Oh, wow. Damn. There's a whole bunch of. Oh, man. This happened. I think someone pasted it into the terminal. I just pasted the junk into the terminal. Sorry. Did you look at the replica set manager? What was it? I did. Yes. It is the Rawkode Academy cluster d. You can scroll up and see it. Just selecting the image for Rawkode Academy clustered. So apologies. I just pasted all that. Okay. So let's see. What was this before? See if it clean up any of those files that are excluded. K. Anyway, those thing actions is the method. Folks are doing
1:01:14 Removing Unexpected ReplicaSet Manager
1:01:38 the clean stuff. So I think the admin one and the controller manager one are the ones that I just created. Now that looks similar to the case just above. That's good. Okay. So let's see. So etcd isn't starting. Right. I was wanting to think about the containers, so we should probably take a look at the logs. Yeah. It's a good idea. Let's see. Actually, three minutes ago. Oh my goodness. I know what it is. I use iTerm, and when I have it, something copies it. Jeez. It's not doing it. Sorry. Oh my goodness.
1:02:11 Debugging ETCD Failure (Expecting IP)
1:02:30 You're sabotaging yourselves now. This is all just for everyone's enjoyment. You're welcome. Okay. Removing that. Removing that. Oh my goodness. That's gonna trip me up so many times. I just did it again. K. Hey. Everything's going there. So your best bet would just be to go to /var/log/containers or pods, /var/log/pods, and you'll get the logs from there. You can just tail them, or you can just use crictl. One. Expecting IP for. K. Is there a config file for etcd somewhere that's not in the manifest folder? Nope. All inside of here? Yep. Wait. It's listing two peer URLs there.
1:03:15 Multiple Listen Peer URLs in ETCD Manifest
1:03:29 There's only one in the etcd. Hopefully. I'm not sure which one should be the right one. Where am I looking here? I'm in the wrong place. You're in the right Up five lines. Mike, do you I'm really blanking on how say this. Okay. So listen-peer-urls is here. Yeah. So it's got two IPs there where there's only one control plane node. So I don't know what that 1337420 is. Good question. So let's get rid of that. Yeah. Let's remove that. And then we gotta do this client URL should only have one of them as well.
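The log line about expecting an IP, together with the stray `1337420` entry in the peer URL list, suggests a list where one host is not an address at all. A small Python sketch of that sanity check, assuming a comma-separated flag value and a hypothetical `10.0.0.2` control plane address (the real one is not legible in the transcript); `valid_peer_urls` is an illustrative helper that accepts literal IPs only:

```python
import ipaddress
from urllib.parse import urlsplit


def valid_peer_urls(flag_value: str) -> list[str]:
    """Keep only the URLs whose host part parses as a literal IP address."""
    kept = []
    for raw in flag_value.split(","):
        url = raw.strip()
        host = urlsplit(url).hostname
        try:
            ipaddress.ip_address(host)
        except (ValueError, TypeError):
            # Host is missing or is not an IPv4/IPv6 literal: drop it.
            continue
        kept.append(url)
    return kept


# Hypothetical flag value: one plausible address plus the bogus entry from the session.
peer_urls = "https://10.0.0.2:2380,https://1337420:2380"
cleaned = valid_peer_urls(peer_urls)
```

Here `cleaned` keeps only the first URL, which mirrors the manual edit the team makes to the manifest.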
1:04:07 Editing ETCD Manifest
1:04:42 So local host in this case. Now I'm guessing it's gonna be I think that line's okay, to be honest. To compare that against one. K. Let's see what happens here. Not showing it there. We might need to restart the Kubelet in order to you know, just find out that would be good enough. Oh, did somebody else talk to you? Oh, sorry. I tried to reset my thing, and it screwed it up. Sorry. So let's restart the Kubelet to force it to pick up its manifest. Again, I know there's a timer there, but
1:05:36 Restarting Kubelet (for ETCD changes)
1:05:42 to k. Running two seconds ago. Still running. That's good. K. Much better. Running. Serving traffic. K. Let's see. Balance. We have API server. Hasn't restarted a bunch of times. Three eight forty four seconds ago. See if not that one. Let's see what's in here. Oh, let's see. Oh, okay. We have our control plane up. Awesome. See what's running on this. So, Cilium operator, necessary, etcd, clustered app, miscrease. Cool. All that is looking okay. That means we hit yeah. Yeah. And now it looks like we should also have the kubectl. So what's interesting there is
1:07:05 Cilium Issues (CrashLoop, Restarts)
1:07:06 Cilium operator, the one that's in crash loop back off, has a public IP, not a private IP for one trying to get on to the worker. And so does Cilium with 203 restarts. So let's take a look at those. Same operator. Nope. Not. Part of me just wants to rekick it and let it come up again. But oh, yep. Kubernetes service host is a public IP address in the in the spec. So let's look at the one that is running that doesn't have that. See what it's oh, dang it. It's highlighting here. Copy paste again?
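The observation here is that the `KUBERNETES_SERVICE_HOST` value in the pod spec is a public address while the API server was set up on a private one. A hedged Python sketch of that classification using the standard `ipaddress` module; the addresses below are placeholders, not the cluster's real ones, and whether a public value is actually wrong depends on how the cluster is wired:

```python
import ipaddress


def is_private_host(value: str) -> bool:
    """Return True if value is an IP literal in a private (non-global) range."""
    return ipaddress.ip_address(value).is_private


# Placeholder environment blocks, standing in for two pod specs.
suspect_env = {"KUBERNETES_SERVICE_HOST": "8.8.8.8"}      # public address
expected_env = {"KUBERNETES_SERVICE_HOST": "10.96.0.1"}   # private address

suspect_is_private = is_private_host(suspect_env["KUBERNETES_SERVICE_HOST"])
expected_is_private = is_private_host(expected_env["KUBERNETES_SERVICE_HOST"])
```

A check like this only flags the difference; it does not say which value the deployment should use.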
1:07:51 Debugging Cilium Env Vars (KUBERNETES_SERVICE_HOST)
1:08:14 Yeah. Seriously. Goodness gracious. Should I even be driving? K. So let's run up here. Kubernetes service host is a public IP address for that pod too. Interesting. Let's take a look at previous replica sets to see what an older one looks like and see if they nicely left us the older ones or not. No. That's do we have? So does run as a privileged host namespace, so at least for network on each node. So I think the public IP and I could be wrong. I think that's a red herring. Okay. Then let's look at
1:09:03 this container and see what's up. Yeah. So the API server, remember, we have running on the private IP, not the public IP. Right. Yeah. Let's check the deployment spec. I saw some interesting environment variables in the I'm for the You're thinking Yeah. Let's edit the deployment. Yeah. Or Yeah. Do you wanna look through that? No. Go ahead. I just opened it. But I think it's very obvious. There is an environment variable setting the API server to a public IP address. Right. That might be the culprit here. Yeah. Let's set that back to the
1:09:54 Fixing Cilium Environment Variable
private one. Is it 127? No. It would be the 10 dot. Do you remember that off the top of your head? Well, if it's in the public it is in host network. That's the Cilium operator. Right. But it won't be able to get one from the worker node. If it runs on there, it won't be able to get to local host. Oh, you're right. K. Then let's get where it says believe this is what it should be looking at, the 10.96.0.1. That's where the Kubernetes API server should be. Actually, so then config wise
1:10:38 well, we can try it and see what happens. So let's edit this. Actually, by editing, are we going to cause any problems with the pod that is currently running? Because we have one that is actually, both are running right now. Curious. K. Maybe that's not something we should pursue at this point. Let's actually I think it's probably now we have the API server up, it's probably a good time to go look for errant policies. Let's see what might those be. Part of the list during the other call. The little. K. So let's get netpol.
1:11:15 Checking for Network Policies (Adobe)
1:11:33 You see that in there. Good. CNPs. CNP, CNPs. The range. Oh, wonderful. How about we just, you know, nuke that? Nobody needs a net limit range. Called unlimited, though. Clearly, that's a good one. Unlimited. Look at it. Oh, thank you. You guys are so wonderful. I can't believe the name lies. It was more than that. Wow. Two level ranges and two clusters. No. That's Webhooks. Webhooks. Yeah. None of those. None of those. Resource quotas, netpols. K. That's all looking all the pods are running. So Yeah. Let's check the service objects too in the default namespace.
1:12:44 Checking Service Object
1:12:48 I also saw something interesting, like, the port is 667 instead of 666. Better than that. Table. Must have been accidental. Yes. K. So we know that's Let's confirm that, yeah, let's first confirm that our deployment is listening on 666. Yeah. It should be the same one as Normally, but who knows what Zapier team prepared for us. Yeah. It is there. Should we update it? Let's change this. Yeah. The service should be 666. But it's confirmed it is listening on 666 inside of the pod. K. So that part should be looking good. Now let's
1:13:34 Fixing Service Port
1:13:50 see. Okay. So the last thing we got, though, was failed to connect database. So there might be some interesting name or not known. Okay. Well, let's then get our deployment and look at that and see what our DNS config is. Edit this so people can scroll with me here. That's ClusterFirst. Okay. We might as well change the image while we're at it. Here's that image. I just wanna avoid DNS entirely. Do you remember what the other values are for this so we don't have to use ClusterFirst? Because, I mean, we can go
1:14:50 Hardcoded Postgres Hostname in App
1:14:50 look at the worker node and see what its DNS policy is set to. Let me Google it real quick. K. Just for the record, the Rust application has a hard-coded lookup of Postgres. So it will need cluster DNS. K. So I found a limit in here. We don't need that. We have plenty of resources available. It's the only app on the cluster. So let's see. You know, it's interesting. So not We could just look at the, like, what we were messing around with on the worker. Check the Kubelet's config.
1:15:40 Checking Kubelet DNS Config
1:15:44 I'll put this back to ClusterFirst. Yeah. ClusterFirstWithHostNet is the other one you were thinking of. K. But I don't think we need that. Yep. 666 and then next. K. Build lookup still. So let's go look at on this host on these. It might have messed with config. Is that a config map here? Yes. K. Or at least usually is. K. So we're seeing health. Is that not an r? Doesn't look like an r, does it? No. It doesn't. Let's go play with that. That's a fun little typo. That's breaking an unofficial Clustered
1:16:49 Fixing CoreDNS Config Map Typo
1:16:56 rule of no Unicode characters. Naughty naughty team. Naughty team. K. Forwarding that to resolv.conf as its upstream. I don't know if this Prometheus is necessary in here. There's obviously not on this cluster, but that might just be something default. I don't think it should mess with the filter pane of. So let's do that. And I know that will live reload, but And we should also check the resolv.conf file. Yes. That is very polite using a rollout restart. That is five years of a whole lot of other tenants on our clusters. Could have at least added a wee
1:17:34 Restarting CoreDNS
1:17:44 message just to log why you were restarting it as well. Yeah. If you don't know, at Adobe, Kubernetes, we're the Ethos team. We are a platform team for all of Adobe. So there are certain things that are baked into how we do things. I think those are good things to be baked in. So I'm gonna go actually, we can just do this from local host. Yeah. We're still getting DNS issues. Okay. So DNS. It's the wonderful thing that it's gonna be broken even when you don't think it's
1:18:16 Revisiting DNS Issues (Host Resolv.conf)
1:18:24 broken. So let's go look at, let's look at /etc/resolv.conf on this host. K. It says pointing to resolve the DNS server, a 147.75 address. David, what is DNS supposed to be set to on these hosts? That is an Equinix IP, so I think they're okay. 14700 out of this? Yes. Let's check the Let's check the /var/lib. 147.75 is an Equinix IP, which will be the which as the region for this cluster, you'll be hitting our local DNS server. Yeah. Okay. So 147.75 is At /var/lib/kubelet/config. Look at it. Yeah. There's some interesting DNS settings on
1:19:21 Re-checking Kubelet DNS Config
1:19:23 there. That's a good idea. That should be config.YAML, I believe. That's 126010? That is correct. Then what's the service? Is the service set to that? The you're saying the Kate service? Yeah. So, yeah, so the it is set to that. I confirmed it just above system. Okay. Can we hop over to one of the workers? Actually, I wanna look at that file. Me look at that search object really quick just to make sure nothing is messed up in there. I have also opened a session on Zapier worker one if you wanna join that session too.
1:20:05 Checking Search Domain Config Map
1:20:15 Yeah. While Jake's looking at the other thing, I'll hop over to that and Zapier worker one. K. I'm just going to exec into, just describe what I'm doing. I'm gonna exec into the pod and try DNS queries and see what happens. The cluster DNS in the config is fine, but there is that live reload that could happen with the config file on cluster. You can do an apt install dnsutils, I think, if you need access to dig. I was actually just gonna skip it because curl did it correctly. But what we should do is dig postgres.
1:21:16 DNS Still Failing from Pod
1:21:20 What is the It doesn't resolve. Yeah. What is it supposed to be, though? postgres.default.svc.cluster.local. K. So Does it work for Kubernetes? If you dig against the I mean, if you do it yeah. If you jump back into that container and try to get the kubernetes.default.svc dot local, does that resolve? And use Cluster Never remember what the actual DNS name is, so I'm not positive. Actually If cluster dot local. Oh, okay. Sorry. The I mean, ndots should mean oh, svc.cluster.local. But you should be able to dig Kubernetes on this one as
1:22:18 well. Yeah. So it's not resolved. I'm wondering if they removed in the cluster the Kubernetes config if they removed the service pieces. We got eight minutes. It got changed back. What's up with that? That's interesting. That's sneaky. They're keeping you back. A job or a cron job. So Yeah. That was a bit fast for a cron job, but maybe. K. None of those. But what we can do is just reconfigure it, create our own config map. That's all of that. Or maybe there is a cron job running at
1:23:22 Creating New CoreDNS Config Map
1:23:27 the. There definitely could be a cron job on the host itself, but we only have seven minutes. So I'm just going to copy this really quick. Let's see. I need to be able to create Okay. So does the Corefile look okay? Just It does. You fixed the r? I need to change the name to something else. It was there. It's better to deploy. Using the IP seems to be working. Maybe it's just a DNS now. Where is my volume? Oh, terminal is getting messed up. Where's the volume from config map there?
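The typo being fixed is a character that only looks like an r, which is why it is easy to miss when reading the config by eye. A minimal Python sketch that flags any non-ASCII character in a config text; the sample below is hypothetical, since the transcript does not say which character or directive was swapped, and `find_non_ascii` is an illustrative name:

```python
def find_non_ascii(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character) for every non-ASCII character, 1-indexed."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical Corefile fragment: the second character of "errors" is replaced
# by U+0433 (a Cyrillic letter that resembles a Latin r in many fonts).
sample_corefile = ".:53 {\n    e\u0433rors\n    health\n}\n"
hits = find_non_ascii(sample_corefile)
```

A clean config returns an empty list, so the same function works as a quick pre-apply check.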
1:24:26 Deploying App with New CoreDNS Config
1:25:03 K. K. Let's see what you got. Hey. Hey, Rawkode. Yeah. It's working. Up your limits. Is that what we needed? I believe so. I'm gonna try and have it anyway. So Yeah. It looks working. You got the Thanos. Hey. Welcome to the one. That's awesome. Well Well done. Your team. I don't know what was changing that. I don't know what was changing the Corefile. So was fun. Well, we never checked for cron jobs on the host, and I guess that could have been one way. Yeah. If it changes so fast, there's something I
1:25:46 Team Adobe Concludes Session
1:26:03 think it was something in cluster. I'm not sure. I'm not sure what. Maybe they'll drop it into the comments. Not in Missouri. But nice work. Yeah. I'm looking forward to seeing what they post and send for the Yeah. Oh, Cloud Custodian. Cloud Custodian. Alright. Alright. Well, this is a lot of fun. Yeah. Yeah. That was fun. Those were two pretty mean clusters, but you both absolutely smashed it. So well done, both teams. That was great work. And so Yeah. Thank you to Zapier team as well. Well done. Alright. Very well done.
1:26:27 Wrap-up with Team Adobe
1:26:45 Well, thank you for joining me. I'll let you skip back to your day, and we're gonna draw some winners for t-shirts. You're welcome to hang around if you want, or you can, you know, watch on the YouTube. It's up to you. But we've got let me get my oh, come on, Internet. You've been fine all day. There we go. Now my internet is going terribly. Alright. Let's share my screen. So this is today's competition. We'll draw the two winners. Come on. Let's go. Kevin and Avanash. So you're both in the Discord. So just
1:27:29 Giveaway Winners (This Week)
1:27:45 send me a DM and we'll get you sorted out with those t-shirts. We've got one more from last week that I forgot because it was a weird epic two and a half hour cluster. And I'm never doing a community one again as much as you may want me to. And last week's winners are Jeff and Jason. Okay. I'm not sure if you are on the Discord, but DM me on Twitter, YouTube comments, however you want. Just get in touch with me, give me your details, and I'll get you the t-shirt. And that's us. Thank you to team Zapier
1:28:23 Outro
1:28:23 and team Adobe. Thank you to Equinix Metal and Teleport for continuing to sponsor this. And John, you cannot get a t-shirt for suffering on Clustered. I'll tell you what, if you join me again, I'll definitely get you a t-shirt. Alright. Thanks to everyone in the comments. It's been an absolute pleasure. I will see you all for the next episode, which will be just after KubeCon. So we won't have one next week because of DevOpsDays Birmingham and I'll be away at the conference pretending that it's 2019 again. Have a wonderful day and
1:28:52 I'll see you all soon. Adios.