About this video
What You'll Learn
- Trace service changes over time to spot broken deploys and config drift.
- Use manifest diffs and RBAC checks to explain Cilium agent failures.
- Follow DNS and NetworkPolicy evidence to restore Drupal to Postgres access.
Komodor's CTO and engineers debug a maliciously broken Kubernetes cluster live, using their timeline to track down Cilium CNI RBAC, Flux GitOps reconciliation, and a NetworkPolicy blocking Drupal from Postgres.
Jump to a chapter
- 0:00 Introduction and Challenge Setup
- 1:21 Initial Service Investigation in Komodor
- 1:36 Identifying Problem 1: Deployment/PVC Config Change
- 2:53 Identifying Problem 2: Scheduling & Node Issue
- 5:46 Investigating Problem 3: CNI Network Unavailability
- 8:57 Debugging Cilium Deployment/Helm Release
- 11:59 Uncovering Cilium RBAC Issues via Manifest Diff
- 12:56 Granting Komodor Agent Permissions (via terminal)
- 14:00 Node Fixed (Cilium RBAC & Cordon)
- 14:42 Back to Drupal Service: GitOps Deployment Issue
- 16:43 GitOps Pipeline Status Investigation
- 20:48 Investigating GitOps Reconciliation Failure
- 23:20 Manually Fixing Cilium Pod
- 24:18 Cilium & GitOps Pipeline Restored
- 25:33 Testing Drupal Website (Encountering Problem 4)
- 26:52 Identifying Problem 4: Database Connection (DNS) Failure
- 27:10 Debugging Network Connectivity / Network Policy
- 28:33 Fixing Network Policy (Deleting Policy)
- 29:15 Drupal Website Restored
- 29:29 Recap and Final Thoughts
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:00 Introduction and Challenge Setup
0:00 So starting from left to right, starting with you. Guy, could you please say hello and introduce yourself and share a little bit more if you wish? Yeah. Hey, everyone. I'm Guy. I'm a solution architect at Komodor. I've been here for the last two years. Very excited to join Clustered. Cool. Hey, everyone. My name is. I'm a software engineer. I came a week after Guy, and this is it. And hey, everyone. My name is. I'm the CTO of Komodor. We like watching Clustered. Happy to be here. Alright. Thank you all very much. So this is a special edition of Clustered, and I
0:38 have broken the cluster personally myself with a handful of breaks to hopefully reveal and show us the power of Komodor as a tool for understanding and debugging problems within your cluster. So with that being said, I will now let Guy share his screen, and we will begin the debugging process. Best of luck, team Komodor. Thank you. Let's start. Cool. There you go. So we got our cluster with Komodor installed. So I will consider this cluster fixed if we can port forward and visit the Drupal website. That is your mission. Okay. Sounds easy. Right? Sounds easy. Right? Yeah.
1:18 Yeah. Yeah. Yeah. The reason is it's for scheduling. Yeah. So we can go into, like, the service in Komodor and see the timeline. Alright? Like, how did the service change over time? And you can see that it's currently broken, and you can try and understand why is it broken. So why is it broken? I I see there is a heavy deploy, like, in 01/1525. Oh. If you open timeline. So do you mean, like, oh, to zoom in? Yeah. Okay. There you go. I think it's, like, great story of what happened. Yeah. And you can see there is a
1:36 Identifying Problem 1: Deployment/PVC Config Change
1:52 change in GitHub. Yeah. There was, like, an awesome update. Like, they update. Was it an awesome update? We will Okay. No. Gotcha. And he, David, removed some environment variables. I think they are crucial and changed the environment. Even worse. Cool. So let's take a look at the problems. Let's see if they are, like, connected to one another. Oh, so we definitely can see that the volume, the PVC, is not found. Mhmm. So there's a problem with the PVC. Let's try to do the. Right? Yeah. Let's take rollback. We don't have permission to do it because
2:32 it's your cluster, David. It's also a GitOps cluster, so the rollback wouldn't actually work. It would be overwritten a few minutes later. However, first, you've discovered essentially the second problem. There is something that's more current with this deployment that you yes. We will have to fix this too. So Let's You mentioned the rates at the start. Somebody said scheduling. Right? Mhmm. Let's look at the pods, Guy. The pods themselves. Yeah. Let's look at them. Pending. Other than the previous name? Not scheduled. Not ready. Yeah. Let's go back. We have information in here, by the way, like, on the first
2:53 Identifying Problem 2: Scheduling & Node Issue
3:17 scheduling of the event. It would be much easier to see it from here. I think that one node is not available. No. Preemption. Let's check the pods list first. Let's try to check on the resources mode to see that there's a node that is not ready and scheduling. Maybe let's try to uncordon it. It's something called done. Mhmm. So you uncordon. Yeah. I don't think we will have it. Maybe let's go to the terminal and try to fix it from the terminal. And you add yourself? Can I try As a Yeah? As a user?
3:56 Try to add my personal name. Do you want you to I will do that. Yeah. Okay. So we're doing, like, a switch Yeah. Just because of security permissions and because David created the cluster and the account, basically. And then we need to add another team member to the Komodor account. So, David, if you can invite Neil, it would be great. Yes. Yeah. Let's go. Let's go to the nodes. Let's take the action and uncordon. And also, let's put the wall in between. Everything at the same time. That's good.
4:44 Cool. So now, nearly, like, rolling back the service. No. Maybe the first deploy was not in Komodor. That's why. I didn't say that it's probably. I think it says that I'm in five. Yeah. No more of this will be found for the comments. We can also take the changes from the GitHub and Oh, we can also take the changes from Komodor. Right? And then from GitHub. Yeah. From GitHub. I think the service was deleted and then reinitiated. Right? Like, it's generation one. It basically means that, like, David played with the service apparently, then he deleted the deployment, then he recreated
5:27 the deployment. With the failed configuration. Yeah. With the failed configuration. And that is the reason that Komodor can't roll back, because for us, it's a new kind of service. In that it has a different unique ID, and this is the first generation of the new Drupal deployment. So we need to roll out this workload. Yeah. Are the nodes okay now? Let's just check it. Yeah. They are okay. Do you, like, check it on your screen? Or No. No. I'm asking. I can also check it. No. It's still not ready. And I know it's still not ready. Why is that?
5:46 Investigating Problem 3: CNI Network Unavailability
6:05 Container runtime network is not ready. It looks more like a k3s issue, I mean. Yeah. We need the network plugin to be ready. Can we check maybe on the services what's configured as the network plugin? Maybe let's try to take a look in the YAML. The network unavailable? Okay. So what's the reason? CNI is not initialized. It's k3s. Right? No. These are bare metal kubeadm clusters. Bare metal? Oh, helpful here. Maybe maybe those are the things. Yeah. This is a 48 core 64 gig of RAM bare metal machine.
7:25 Okay. Okay. So you can have some fun with it. Right? Mhmm. Okay. So let's recap where we are right now. Using Komodor, we explored the broken service. So we identified two bugs. One is that my awesome update in Git, which you were able to visualize and see right away, potentially broke the PVC claim name, which we're gonna come back to, I would assume. I also highlighted that the cluster couldn't schedule our pod, and you went through the node dashboard and identified that the node was cordoned, and you were able to uncordon it directly from Komodor, moving us past the scheduling problem. However,
8:05 we now have the node being not ready because of a potential issue with the CNI networking plugin. Mhmm. Yeah. Yeah. Like, we can see that there are, like I know, like, four different plugins that are installed, CSI plugins that are installed. So CNI. We are looking for CNI, not CSI. Sorry. Sorry. Sorry. Maybe Let's try to describe maybe the node, Guy. To what? Sorry. Describe node. Oh, it looks like we have the Cilium operator installed Yeah. In this cluster. Yeah. It might be with the operator. Yeah. There is no Maybe the CRDs of. And there is the operator.
8:52 Maybe the Helm. Yeah. The Helm saying there is, like, using Helm. Oh, there's a fail. Deploy in here. Fail deploy. Yeah. Yeah. So we can see it's failed. There is an agent through. Agent not ready. Thank you as well. Minimum replica unavailable. Yeah. But it's just the operator itself. On the deploy. Let's take a look. Deployment version one, Cilium. There is a spec of, like, label, match label, IO Cilium operator. So let's see what Maybe it's the pod template that doesn't match the deployment. Do you think, like, the relevant part is in here?
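The two node symptoms from the recap above (a cordoned node, then a NotReady node whose CNI is not initialized) can both be read from the node object itself. The following is a minimal sketch, not what Komodor does internally. It assumes the node has been exported as JSON, for example with `kubectl get node <name> -o json`, and the sample node, its name, and its condition message are illustrative.

```python
def node_problems(node: dict) -> list[str]:
    """Return human-readable reasons why a node cannot run new pods."""
    problems = []
    # Cordoning a node sets spec.unschedulable to true.
    if node.get("spec", {}).get("unschedulable", False):
        problems.append("cordoned (spec.unschedulable is true)")
    for cond in node.get("status", {}).get("conditions", []):
        # A healthy node reports Ready with status "True".
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            problems.append(f"NotReady: {cond.get('message', 'no message')}")
    return problems


# Illustrative node shaped like `kubectl get node <name> -o json` output.
# The name and message are assumptions for this example.
sample_node = {
    "metadata": {"name": "node-1"},
    "spec": {"unschedulable": True},
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "message": "container runtime network not ready: cni plugin not initialized",
            }
        ]
    },
}

for problem in node_problems(sample_node):
    print(problem)
```

Uncordoning clears the first finding only. The second one stays until the network plugin is running on the node, which matches what the team sees next.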
8:57 Debugging Cilium Deployment/Helm Release
9:44 Maybe. Oh, it's funny. It's running and ready, but it's like, the node is not ready. It's always fun watching people fix a broken cluster. Right. Maybe, like have a look at the dashboard. Oh, like, in the dashboard, we can see, like, the. Like, we can see quite a lot on this. Like, what annotation does this have? Oh, the cluster. Yeah. No. In the cluster. Yeah. Just found it. Exactly. Maybe let's check if there is the Cilium ClusterRole and ClusterRoleBinding in the cluster. Do you mean, like, those? Like, resources
10:35 which don't exist? Mhmm. I think we need to create something. I'm not sure. Maybe let's check the logs of the pod, which is running. No. I think it's, like, on the annotation. Right? Like, it doesn't find annotation on the node. And this one doesn't install it on the node, I think. It's running on the node. So this may be a little bit harder to debug because I think I found a bug in Komodor, but try comparing the values from release three to release two. Okay. Of Cilium. Yeah. Revolve it. Okay. So there are changes, but they don't actually
11:16 show up here. Yeah. Maybe metadata changes? We have only the third version. We don't have the second one. No. We do have it. We do? Of the operator? Ah, in the Helm. That only shows changes? It doesn't show anything. No. I do two and then comparing revision two. It's two compared with revision three. No. Yeah. I don't know why it's not showing the change. For me, it showed the changes. Wait. Then manifests. Then compare with version two. And here when you do all the, it does show Here are the changes. You deleted the service account
11:59 Uncovering Cilium RBAC Issues via Manifest Diff
12:08 and all of those. The rollback, Guy. Do the rollback. I will do the rollback. No. Yeah. Maybe to do the rollback if we don't have permission to. I just performed the rollback, but try to quit. Alright. Maybe it's a permission thing. Yeah. I think the watcher doesn't have a permission maybe for that. Mhmm. Yeah. It's possible. Yeah. Let's see if also here we don't have any permissions. Secret faults. Let's do also we'll get to the We can't. We can't? We can't program for one agent. We need the web access to the cluster. Right? Yeah. So we will
12:47 use it. So do all the to our agent, and then we'll do to to the same Okay. So I will show my screen. I will stop sharing. Yeah. So we found out that we're missing permission inside Komodor, and it was installed without the possibility of, like, doing a rollback. Okay. This is What costly do ah, okay. Okay. Just a sec. I'm all good. Money. 4. That's it? $2.02. 2 I try to do all the things. Ah, okay. Can you okay. Yeah. That's that's it. Really? Okay. Okay. Now let's go back and check. It's gonna be spreading now.
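The manifest comparison between revision two and revision three, which showed that the service account and related RBAC objects had been deleted, can be approximated outside the UI. This is a rough sketch under stated assumptions: each revision's rendered manifest is available as text (for instance from `helm get manifest <release> --revision <n>`), the line-based parser is a simplification rather than a real YAML parser, and the resource names in the sample manifests are invented for illustration.

```python
def resources(manifest: str) -> set[tuple[str, str]]:
    """Collect (kind, name) pairs from a multi-document YAML manifest.

    Simplified line-based parsing: reads the top-level `kind:` key and the
    `name:` key indented two spaces under `metadata:`.
    """
    found = set()
    for doc in manifest.split("\n---"):
        kind = name = None
        in_metadata = False
        for line in doc.splitlines():
            if line.startswith("kind:"):
                kind = line.split(":", 1)[1].strip()
            elif line.startswith("metadata:"):
                in_metadata = True
            elif in_metadata and name is None and line.startswith("  name:"):
                name = line.split(":", 1)[1].strip()
            elif line and not line.startswith(" "):
                # Any other top-level key ends the metadata block.
                in_metadata = False
        if kind and name:
            found.add((kind, name))
    return found


# Hypothetical manifests; the names are assumptions, not the real chart output.
revision_2 = """\
kind: ServiceAccount
metadata:
  name: cilium
---
kind: ClusterRole
metadata:
  name: cilium
---
kind: ClusterRoleBinding
metadata:
  name: cilium
---
kind: Deployment
metadata:
  name: cilium-operator
"""

revision_3 = """\
kind: Deployment
metadata:
  name: cilium-operator
"""

removed = sorted(resources(revision_2) - resources(revision_3))
for kind, name in removed:
    print(f"removed in revision 3: {kind}/{name}")
```

Comparing sets of (kind, name) answers "what disappeared" directly, which is the question the team was asking. A line-by-line text diff would show the same removals with more noise.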
14:00 Node Fixed (Cilium RBAC & Cordon)
14:02 Yay. Is. Woo hoo. Yeah. He's right. Okay. Okay. And now let's check out our So before we continue Mhmm. The upgrades to Komodor, I did end Komodor because it turned on dashboard. But I see that it moved to secret access, which is probably why the values didn't show. Yeah. That's the reason. Okay. I just wanted to make sure I understood what happened there. Okay. Cool. So now the node is ready. Let's go back to services. And the only thing remaining is to roll back to the version that we know for the. Okay. So we have a working node and
14:39 you fixed the Cilium deploy. Nice work. What we can Yeah. Now we need to roll it back. So we can't roll it back because what we said So we need to edit, like, let's edit it. Yeah. I think that I need to show him screen because I Remember, this is a GitOps pipeline, so you may want me just to push a fix if you could tell me what you want that fix to be. So git revert? Yeah. Let's just get up to the link if we wanna do it. I don't know how to fix it. I
14:42 Back to Drupal Service: GitOps Deployment Issue
15:05 mean, I just put awesome updates. You don't need to tell me how to fix it. So please, I heard your bad call. Yeah. So Yeah. Let's just check out the, like, revision. If nothing has changed in between, then that's probably the easiest solution. We check out the rev before the change. Are you doing the prep, I have pushed an update to Git I'm Yes. I'm showing. I'm showing. Do you have, like, a pipeline that knows how to like, is it automatically deployed? Yes. Flux CD is running in the cluster. It will detect this change, and it will push
15:48 it out. Speed up the process, and I will just allow just so it's a bit quicker. Yeah. So what we can see is that we knew there was the PVC change, and we got some one variable, which can be missing and what they've been changed. Yeah. Once It's only the PVC or maybe Maybe we still need those. Yeah. So let's maybe start from there. Let's wait for the rollout to happen. Yeah. We should see it in Komodor once the rollout happens. Let me update here. We can take a look on the
16:26 Rawkode pods to see if it's pending still. Yeah. Yeah. But it's the previous one. Yeah. There wasn't any good one since So, honestly. So we are looking for a new one. What? So I did push the update. However, our GitOps pipeline is broken due to the fourth break in the cluster. So good luck. So there's another break? Maybe Argo. Right? Like, Yeah. Let's check Argo. Is there Argo? Flux. Sorry. It's Flux. Source control notifications. All of them look healthy. What are you checking on? Sorry. The Outlook. Alright. It's Slack. Yeah. But maybe it's misconfigured
16:43 GitOps Pipeline Status Investigation
17:17 or something like that. Seems like Flux is working fine. Let's check maybe the logs of one of the workloads. The controller or some other service? The source controller, like, the log message looks good. The source controller. The Maybe they didn't update by source controller? Or I think there is still a problem with Cilium. Like, one of the pods is unhealthy. And the source control logs The Cilium operator is pending. And schedule because it didn't match for the. If you go to the workload pods on the white. Right? Pods. Workload pods. Click on the The operator is okay. It's
18:06 just because we needed to roll back. I set the replicas to one because we were single node cluster, so you can ignore that pending pod. No. Take the the the first pod. Oh, you sound like it's just about fine. It's not there, if anything? No. No. It's like you said, like, in the logs of the source control. Yeah. Yeah. It was there. It was. There's, like, message for Let me go back. And then garbage collected one artifact. Why did the garbage collected it? And then a lot of changes. But why did the garbage collected one artifact?
18:41 Maybe it's related to that. I don't know if that's. Store. No changes in the. Like, this is the change. This is what you mean. Right? Yeah. And then, like, one afterwards, Type in. Yeah. This is the commit, like, David pushed. Yeah. But what does it find? Let's see if we got any warnings in here. Or You can do, like, maybe, like I don't know. So what happened is with one point, it define it it find out that there there was a change. Mhmm. But for some reason, the garbage collected it. Maybe I need to change something in the
19:42 box. Yeah. Let's check the configuration. Maybe it's something about this configuration thing. Yeah. There is By the way, in the kustomize-controller, it always failed. The Flux, the name is changed from flux-system to. And what is the name in the log? It's all that. Right? Yeah. Yeah. So your rollback for Cilium actually fixed this problem, but there's a ten minute sync time on the Kustomization. So I've just encouraged it to run again. So So we don't need to do anything? As long as this Kustomization runs? No. It's still failing. It says the network and cluster is not working.
20:30 Yeah. I don't know if your rollback for Cilium fixed the problem. I think the rollback of Cilium didn't No. Like, if you look at the logs of the kustomize-controller, there are really bad logs there. And it says that it failed on a HTTP and fairly common web. Yeah. Let me just show that everyone can see. Yeah. Reconciliation failed after a second. That's the CNPG webhook service. Who is it? Name. The CNPG thing is I think the network is still there. What is this service? The CNPG? Yeah. There is, like, one thing here. I'm looking at the logs of the CNPG.
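The reconciliation failure discussed above is also recorded on the Flux Kustomization object, in its `Ready` status condition. The sketch below assumes the Kustomizations have been exported as JSON (for example the `items` list from `kubectl get kustomizations -A -o json`). The object names and the failure message are made up for illustration and are not the messages seen in the video.

```python
def failing_kustomizations(items: list[dict]) -> list[tuple[str, str]]:
    """Return (name, message) for every Kustomization whose Ready condition is not True."""
    failures = []
    for item in items:
        name = item["metadata"]["name"]
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                failures.append((name, cond.get("message", "")))
    return failures


# Illustrative objects; names and messages are assumptions for this example.
sample_items = [
    {
        "metadata": {"name": "infrastructure", "namespace": "flux-system"},
        "status": {
            "conditions": [
                {"type": "Ready", "status": "True", "message": "Applied revision"}
            ]
        },
    },
    {
        "metadata": {"name": "apps", "namespace": "flux-system"},
        "status": {
            "conditions": [
                {
                    "type": "Ready",
                    "status": "False",
                    "message": "illustrative: call to admission webhook failed",
                }
            ]
        },
    },
]

for name, message in failing_kustomizations(sample_items):
    print(f"{name}: {message}")
```

Reading the condition message first can save a pass through controller logs, since it usually names the resource or webhook that rejected the apply.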
20:48 Investigating GitOps Reconciliation Failure
21:32 Is it the pod? There is a pod, but, like, the latest message is, like, periodic TLS certificate maintenance, which I don't really know. And that's pretty much it around this service. What was there in Volley in It doesn't, like, succeed, like, with the relevant service, basically. Yes. So let me give you context on that Cilium break. Right? Because you've done a rollback, but you didn't really identify what the problem was and what changed. And Mhmm. I don't want you to debug something that you can't have visibility into right now because of that secret value thing. So in the
22:17 Cilium Helm chart, what I did was disable the agent, which is definitely rolled back because we can see the agent is now deployed next to the operator. However, I also disabled the eBPF kube-proxy replacement, and you may notice there's no kube-proxy in this cluster. So in the interest of not debugging something that we're not entirely sure if it's been fixed or not, I'm going to redeploy Cilium right now and assume the rollback hopefully fixed it properly. And if we still have an issue, then I'm debugging with you because I'm not really sure what the problem will be after
22:47 that. So let's see. Know. Maybe it's worse. I see problem with the old. It's It's not that. I think So my update for Cilium has triggered a redeploy of Cilium. So the contract might definitely change. So we may be moving into a better situation. Yeah. Let me delete the latest Cilium operator. Oh, who can delete things? Mhmm. Delete the Cilium operator. Oh, okay. Oh, cool. No. By the previous one? Yeah. The previous. But that one maybe. Because the operator Wait a sec. Yeah. The one that is pending. No. Not the one that is
23:20 Manually Fixing Cilium Pod
23:45 pending. It's the other one. Okay. What will happen? Yeah. So I'll go to the Cilium operator, to the pods. Are you sure? Yeah. I'm going to delete the old Cilium operator pod. Oh, that's a bold move. I like it. Yeah. Yeah. We're not playing around here. You know? So now the new version is running and It should. Give one sec. Or we won't have anyone there running. Right? It seemed like it's failed schedule. Hey. That worked. Beautiful? Yeah. Yeah. Well, we have no doubts about that. Seems like the new pod of
24:18 Cilium & GitOps Pipeline Restored
24:34 Cilium doesn't I like I think that's okay because we have, like, two replicas. But now, like, it's a new one that is running. Right? So now what we should I can scale it to one. It's nice to try and just keep on working. Let's get a new one. Issues, you know. I'm scaling the Cilium one to one. No. Think that's the Symbol. No. I think now it's okay. Now let hopefully, create the logs of the Flux thingy there. The Kustomize one? I think. Right? Let's see. Source. I think the cus Oh, the group, I was
25:08 just thinking. Ah, it is? Yeah. You see? If it's the final wall of yeah. Let's see it. When in doubt, delete Cilium operators. Access everything. Oh, now it's healthy. Woo. Take a look at the pods. Yeah. Yeah. Seems like And I will do, like, a s s a. Yeah. Know this one. Forward. That's the default for. Okay. Let me share my screen, and I'll test the website for you. Alright? Yeah. And we understand, like, all you get is, like, Drupal working. That's, like, the best scenario. The new one, Drupal is running. We
25:33 Testing Drupal Website (Encountering Problem 4)
25:49 have a problem with our database configuration, but maybe we don't need it. So in the interest of testing, we can go port forward. Why not doing port forward for the Yeah. Why not? Yeah. Komodor also have, like, like, another feature. Okay. So it's almost working. Let's see if we can actually open up the browser. Don't be too happy. You have five to seven? No. He's going to try to use the the database. So this shouldn't actually be needed, but the edit script is unable to run for the same reason that this command will fail. Oh, no.
26:47 Our Drupal instance is unable to communicate with the Postgres database. Back over to you, and this is the last break. Because maybe the environment for that. Right? Like the It's gonna time out. Drupal cannot speak to Postgres. There we go. Temporary failure in DNS resolution. Yeah. DNS resolution. Back to you. Last break. What is the problem right now? So there we go. So it cannot resolve the database. And okay. So let's check the events. Of the pod? Of everything. Let's check. A network policy? Maybe. We did a lot of network policy changes.
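When a pod suddenly cannot resolve or reach another service, one quick check is which NetworkPolicy objects select that pod and in which direction they apply. The sketch below is a simplification under stated assumptions: policies are supplied as JSON objects (for example the `items` list from `kubectl get networkpolicy -A -o json`), only `matchLabels` selectors are evaluated (`matchExpressions` is ignored), and the namespace, labels, and policy names are hypothetical.

```python
def selects(policy: dict, pod_labels: dict[str, str]) -> bool:
    """True if the policy's podSelector matchLabels all match the pod's labels.

    An empty podSelector selects every pod in the policy's namespace.
    """
    selector = policy.get("spec", {}).get("podSelector", {})
    wanted = selector.get("matchLabels", {})
    return all(pod_labels.get(key) == value for key, value in wanted.items())


def policies_for(policies: list[dict], namespace: str,
                 pod_labels: dict[str, str]) -> list[tuple[str, list[str]]]:
    """List (policy name, policy types) for policies that apply to the pod."""
    hits = []
    for policy in policies:
        if policy["metadata"].get("namespace") != namespace:
            continue
        if not selects(policy, pod_labels):
            continue
        spec = policy.get("spec", {})
        # Without explicit policyTypes, Ingress always applies, and Egress
        # applies only when the policy defines egress rules.
        types = spec.get("policyTypes") or (
            ["Ingress"] + (["Egress"] if spec.get("egress") else [])
        )
        hits.append((policy["metadata"]["name"], types))
    return hits


# Hypothetical policies and labels, for illustration only.
sample_policies = [
    {
        "metadata": {"name": "deny-all", "namespace": "drupal"},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    },
    {
        "metadata": {"name": "allow-web", "namespace": "drupal"},
        "spec": {"podSelector": {"matchLabels": {"app": "nginx"}}},
    },
]

for name, types in policies_for(sample_policies, "drupal", {"app": "drupal"}):
    print(f"{name} applies to the pod for: {', '.join(types)}")
```

A policy that selects the pod for a direction and allows nothing in that direction blocks all traffic that way, DNS lookups included, so any hit from this check is worth opening before looking elsewhere.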
27:10 Debugging Network Connectivity / Network Policy
27:32 Indeed. Why? Why did we deploy? We're in Komodor events. We can see network policy changes. And the last network policy change was a lot of scraping. A lot of. So we saw that there are a lot of network policy changes. And it looked like someone changed, like, the unpacking policy. Yeah. There is a policy that prevents us from executing the request Yeah. Out in the cluster. There is a policy type of ingress. So let's try to take on action and details. I mean, what I love about Komodor here, right, is that the event log is a
28:20 gold mine of information. And you can see this network policy was created in the last twenty four hours. It's obviously well intended, but, you know, mistakes are easy to make in Kubernetes. Very Done. Alright. If you can stop sharing your screen, I will give the application another spin. I think we should be sitting pretty now. Because I still have my port-forward running. If we remove the install script. Yeah. We're already loading the views. And if we make sure Okay. Completed sixteen seconds ago. The database is now running. Oh, it shouldn't have to do this, but
28:33 Fixing Network Policy (Deleting Policy)
29:10 we run through it anyway. That's it. Well done. You fixed all of the breaks in the cluster, and Drupal is now working as intended. Cheers. So, you know, a small recap, and then I'll let you just get back about your day. Right? But that's a whole lot of fun for me. Right? I actually found it really difficult to break the developer, the consumer API of Kubernetes in a way that Komodor couldn't show right up front what the problem was. With the Git integration, the diffs, the Helm charts, the node information, even revealing all the labels and annotations,
29:29 Recap and Final Thoughts
29:54 Everything was just there in front of me. And I think that's just a superpower for people that have to operate Kubernetes. So I will thank you all for your work. A bit harder to break, but I hope you enjoyed each of the breaks that were presented to you. And then, yeah, any final remarks from anyone? No. It was super fun. I'm very fond of the movie.