Overview

About this video

What You'll Learn

  1. Restrict pods to specific nodes with node selectors and matching labels.
  2. Use node and pod affinity to co-locate workloads or keep them apart.
  3. Manage placement with priority classes, topology spread constraints, scheduling gates, and dynamic resource allocation.

A deep dive into Kubernetes pod scheduling, from nodeName and node selectors through node and pod affinity, topology spread constraints, priority classes and preemption, plus alpha features like scheduling gates and dynamic resource allocation.

Chapters

  1. 0:05 Introduction and Topic Overview
  2. 1:01 What the Kubernetes Scheduler Does (Basics)
  3. 4:11 Bypassing the Scheduler (Manual Assignment)
  4. 5:27 Node Selector
  5. 9:56 Node Affinity (Required vs Preferred, Operators)
  6. 16:48 Pod Affinity and Anti-Affinity (Co-location and Repulsion)
  7. 22:26 Three-Tier App Scenario (Affinity/Anti-Affinity in Practice)
  8. 30:28 Priority Classes (Preemption)
  9. 34:20 Topology Spread Constraints (Even Distribution)
  10. 42:50 Pod Scheduling Readiness Gates (Alpha Feature)
  11. 46:12 Dynamic Resource Allocation (Alpha Feature)
  12. 51:16 Conclusion and Summary
Transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:05 Introduction and Topic Overview

0:05 Hello, and welcome back to the Rawkode Academy. I'm your host, David Flanagan, although you probably know me from across the Internet and this channel as Rawkode. Today is our first video in a new series called Rawkode deep dives. I wanna thank our sponsor, Commodore, who have kindly sponsored my time to put this first deep dive together. Thank you, Commodore. This deep dive is on Kubernetes scheduling. Now we will start off with a few of the basics just to set the scene, but we're gonna move pretty quickly into parts of the scheduler that you may not be familiar with.

0:43 These are parts of the scheduler that become more important as your cluster grows by number of pods and number of nodes. We've got a lot to cover today, so let's dive right in. Okay. So before we move on to the deep dive content, let's just understand on the simplest level what the scheduler is responsible for. So here I have a Lima virtual machine running on my machine, which provisions a standard kubeadm cluster. Now kubeadm clusters have static manifests that run the control plane components. Let's not delete it, but let's move

1:01 What the Kubernetes Scheduler Does (Basics)

1:39 our scheduler to the temp directory. And what's gonna happen, probably already happened, is that this mirror pod for the scheduler has been removed. Now this just means that if we run get pods, we'll see the things that are running continue to run. But let's try and schedule a new pod. Here, I have the simplest pod spec that we can deploy to our cluster, and this is just applying NGINX and doesn't specify anything else. We can use Vim to drop this in and we can kubectl apply it to our cluster. Now when we run get pods, we'll see that our NGINX

2:37 is in a pending status. I'm not sure why Lima is running in NGINX, but I'm gonna let it go. Now because we removed the scheduler, this will never schedule. In fact, we can do describe pod nginx, and you don't really get any error messages. But if we scroll to here or look here, we'll see that the node is none, which is why the status is pending. Now let's look at our get pods again. Let's bring back our scheduler to the static manifest directory. We'll run get pods on all namespaces. We'll see our scheduler has now been running

3:23 for six seconds, and not only that, but our NGINX pod was scheduled. If we run describe on it one more time, we'll see we have a node and the status is running. So in a naive way, we can say the scheduler's one job in this entire world is to populate pod specs with a node name. That is all. However, there is a lot of complexity in how the scheduler determines what that node name should be. That is what we're gonna dive into today. So the first thing we're going to address is bypassing the scheduler. Why? Well,

4:11 Bypassing the Scheduler (Manual Assignment)

4:15 sometimes you just gotta do what you just gotta do. We're now looking at a real Kubernetes cluster. This time it's hosted on Google Cloud with GKE. I'm gonna copy one of these node names, this time being x 42 using the abbreviation at the end of this tuple, and I'm gonna go to our pod spec. We can drop the node name in on the spec, just like so, and this is 100% valid. We can then kubectl apply the pod, and if we run get pods -o wide so that we can see the node

4:57 that it's running on, we'll see that it's running on x 42. Now the scheduler doesn't do anything in this case because the node name is already provided by me, the operator or developer. When that node name is missing, the scheduler kicks in and applies a whole bunch of rules and heuristics to determine where to schedule each pod. Let's delete our pod. The next way of scheduling a pod that we're going to take a look at is through the node selector. Now in order to use a node selector, you have to understand your nodes a little bit more.
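
The manual assignment described above might look like the following manifest. This is a minimal sketch rather than the exact file from the video: the pod name and image are assumptions, and `example-node-name` is a placeholder for a real node name taken from `kubectl get nodes`.

```yaml
# Sketch: a pod that bypasses the scheduler by naming its node directly.
apiVersion: v1
kind: Pod
metadata:
  name: nginx                      # assumed name
spec:
  # Placeholder: replace with a node name from `kubectl get nodes`.
  # With nodeName set, the scheduler makes no placement decision.
  nodeName: example-node-name
  containers:
    - name: nginx
      image: nginx                 # assumed image
```

Removing the `nodeName` line gives the "simplest pod spec" from earlier in the video, where the scheduler is left to pick the node.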

5:27 Node Selector

5:37 So we're going to run kubectl describe node and paste in our node name. Now this returns a lot of information, but what we want to look at are the labels at the top. So you can see here that this node has a name, has a role, which is none in this case, but it has a whole bunch of labels. We have a label that tells us the architecture of the machine, in this case, it's amd64. We are also able to get some more information from the cloud, like the instance type, this one being an e2-medium.

6:16 This also tells us the operating system is Linux. We then have a bunch of Google Cloud labels telling us about the boot disk, the container runtime, CPU scaling level, logging variant, max pods per node, node pool, OS distribution, machine family, and private node. We then have some failure domain information. This allows us to schedule things across different zones within a region and we'll take a look at that in just a little bit. Now the one we're going to use just now to play with the node selector is kubernetes.io/arch. As we approach more and more clusters

6:59 that aren't homogenous, they're heterogeneous. We have clusters with AMD architecture and ARM architecture. You may even have machines with GPUs for machine learning and artificial intelligence. So we need the ability using the node selector to give the scheduler a clue that some workloads are not like others, and we have to be a bit more specific about where we schedule them. So let's go back to our pod spec. And instead of setting the node name, we'll say node selector. And now I'm going to use those labels that we saw earlier. The one I'm interested in is beta.kubernetes.io/arch.
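
A node selector along these lines could be written as below. It is a sketch with an assumed pod name and image. It uses the stable `kubernetes.io/arch` label; the `beta.kubernetes.io/arch` label named in the video is the older, deprecated form of the same label.

```yaml
# Sketch: restrict a pod to nodes whose architecture label is amd64.
apiVersion: v1
kind: Pod
metadata:
  name: nginx                      # assumed name
spec:
  nodeSelector:
    # Every key/value listed here must match a label on the node.
    kubernetes.io/arch: amd64
  containers:
    - name: nginx
      image: nginx                 # assumed image
```

If no node carries a matching label, the pod stays in Pending until one does.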

7:49 Now, all the nodes in my cluster are amd64, like Copilot is suggesting. However, let's break this and say that this workload can only run on arm64 machines. We can now pop back over to the terminal and do kubectl apply on the pod. Now when we run get pods, we'll see that this is pending. We can describe our pod and we get this nice message at the bottom. The scheduler is telling us that it failed to schedule. Of the three nodes that are available, none of them match the constraints that we specified. So let's change this.

8:39 So let's apply this again. We run get pods and it's running. Alright. Let's do one more thing with the node selector. The node selector doesn't just take one parameter. This works very much like match labels on your deployment spec. You can add as many as you want. So let's add one, without the beta prefix this time: kubernetes.io/os of linux. So now we need a Linux machine on an amd64 chip. We can now apply this, run get pods, and it's running. So the node selector is a really great way to take those labels, the knowledge

9:35 and understanding, the context that we have about the nodes within our cluster, and move our workloads around as required. We're not enforcing too many constraints. We're not being very specific about how it works or lives with other pods within our cluster, but we're gonna get to that in just a second. The node selector is pretty easy to grok. You have labels. You have pods. Match them together. Sometimes when scheduling pods, you need a little bit more logic than just basic label selectors, and that's where node affinity comes in. So before we take a look at node

9:56 Node Affinity (Required vs Preferred, Operators)

10:14 affinity, I'm going to run kubectl get nodes. Only this time, I'm going to add --show-labels. And these are all the labels that we can use as part of the node selector or node affinity configuration. We'll see here that I have machines, using the topology labels, in europe-north1-a, europe-north1-c, and europe-north1-b. Let's jump back to our pod spec. And this time, we're gonna specify affinity. And we're gonna use Node Affinity. And you can see Copilot is jumping straight ahead here, but I'm just gonna let it complete and we'll run through it quickly.

11:06 Now when it comes to Node Affinity, there are two keys that can be used. There is preferred during scheduling, ignored during execution, and required during scheduling, ignored during execution. Let's come back to preferred in just a moment. Now required during scheduling and ignored during execution just means that the constraints that we specify must be true at the time the pod is scheduled. Remember, Kubernetes is eventually consistent through and through. Labels can change over time. Just because a pod ends up on a node now doesn't mean the labels that were used for that scheduling decision will still be true in

11:54 five minutes, five days, or five months. Hopefully, your pods aren't running for five months. Now affinities are very much like node selectors. We're still going to be using labels, but we have the ability to apply operators and multiple values rather than just a basic 100% match. So you'll see here that we have the node selector terms and the ability to match expressions. I'm going to replace this with a label that we copied earlier. However, I'll remove the value and remove the equals, and we'll say that we want any value in europe-north1-a and europe-north1-b.
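
The required node affinity just described could be sketched as follows. The label key `topology.kubernetes.io/zone` and the zone names are assumptions inferred from the labels mentioned in the video; the pod name and image are placeholders.

```yaml
# Sketch: the pod may only be scheduled in one of two zones.
apiVersion: v1
kind: Pod
metadata:
  name: nginx                      # assumed name
spec:
  affinity:
    nodeAffinity:
      # Must hold when the pod is scheduled; not re-checked afterwards.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone   # assumed label key
                operator: In
                values:
                  - europe-north1-a
                  - europe-north1-b
  containers:
    - name: nginx
      image: nginx                 # assumed image
```

Swapping `In` for `NotIn` inverts the rule, so the pod avoids the listed zones instead.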

12:42 Let's apply this to our cluster. And you'll see here that our pod was scheduled on X 42, which is europe-north1-a. Let's delete and apply again. We got the same node. Let's try it one more time for the last. Oh, well. Let's mix it up a bit. Let's change our constraint to only accept b and c. We'll delete and apply. And now we're scheduled on F Zed F 9. This is europe-north1-c. So using node affinities, we can do match expressions using the same labels from the node selector, with the ability to select an operator

13:57 and work with multiple values. So let's make a change to one of these operators. Currently, we are scheduled to europe-north1-c. So instead of doing In europe-north1-c, let's say NotIn. Now we can jump back to our terminal, delete and apply, and now we're back on europe-north1-a. What operators are available? Well, we have the ability to do DoesNotExist, Exists, Gt, In, Lt, and NotIn. So enough operators to cover most of the basic scheduling constraints that you would want to apply to your workload. Now in the interest of this being a

14:44 deep dive and being complete, I will also add that match expressions isn't the only option. We also have the ability to match on fields. Like so. Now this isn't a feature that you're likely to use unless you're scheduling a daemon set on a subset of nodes within your cluster. The reason is that the labels we have on a node include the host name, but the host name isn't always your node name. So this match fields was added purely for that case, and really the only key you can actually match on is metadata.name.
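
For completeness, a match on fields might look like this. It is a sketch: `example-node-name` is a placeholder node name, and the pod name and image are assumptions.

```yaml
# Sketch: select a node by its object name rather than by a label.
apiVersion: v1
kind: Pod
metadata:
  name: nginx                      # assumed name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - example-node-name   # placeholder node name
  containers:
    - name: nginx
      image: nginx                 # assumed image
```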

15:24 So I've covered it. You don't really need it. You can just forget about it. Let's move on. So as I said, we have required during scheduling, but also preferred during scheduling. So let's set this now. This time, I'll let Copilot do its thing. We're gonna come back up here and use the kubernetes.io/arch label. Now we know this cluster has no arm64 nodes, so let's drop that in. Now we'll go back to the terminal and apply our pod. What do you think is gonna happen? We have no arm64 nodes. However, when we run get pods,

16:15 our container was scheduled. And that's because while the scheduler couldn't find a node that was a complete match, it was just a preference, so it settled for another. Now you can chain lots of these together and set multiple preferences and increase the weight from one to 100 depending on how much you favor each preference. Okay. So we've taken a look at node affinities. Now let's take a look at pod affinities. What's the difference? Well, node affinity allows us to say that a workload should run on these nodes with these constraints. Pod affinities allow us to take that one step further.
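
The preferred form discussed above could be sketched like this, with an assumed pod name and image. The weight of 1 is an illustrative choice within the 1 to 100 range.

```yaml
# Sketch: prefer arm64 nodes, but accept any node if none exist.
apiVersion: v1
kind: Pod
metadata:
  name: nginx                      # assumed name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1                # 1-100; higher means a stronger preference
          preference:
            matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - arm64
  containers:
    - name: nginx
      image: nginx                 # assumed image
```

Additional list entries with different weights can express several preferences at once.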

16:48 Pod Affinity and Anti-Affinity (Co-location and Repulsion)

17:02 Now we can say that we want this pod to run as close to these pods or these pods to run as far away from those pods. This gives us the ability to repel or co-locate pods within our infrastructure. Why would we want to do that? Well, if you have components within a mission critical system, it's likely you won't want to schedule them on the same node because nodes go away. Nodes fail. And if your node fails, do you really want to lose multiple components of a mission critical system? We have to minimize the blast radius of

17:39 the collateral damage when we lose a node. On the other hand, if performance is something that you're more sensitive to, it could be that you want your customer facing API as close to your database as possible to minimize latency and network hops around your infrastructure. So we'd use affinity to colocate and anti affinity to repel. Let's take a look at this by example. Here, I have four pod specifications. I have a pod called NGINX East one, NGINX East two, and NGINX West one, and NGINX West two. They're very creative, I know. We're gonna set up some pod affinities to

18:21 schedule the East pods as close to each other as possible and the West pods as close to each other as possible. And we won't bother with anti affinity just yet. So we're gonna come into our pod spec and we use the affinity key. And now we can see that we actually want pod affinity. And I'll let Copilot drop this straight on, but we'll walk right through it. Now the first thing we need to do is configure our label selector. The label selector is a bit different from the node selector this time because we're using the selector

18:58 to group the pods with the shared affinity. Which pods do we want to run as close to each other as possible? Now Copilot's already guessed exactly what I was gonna do here, which is to match on the names and say that we want to group east one and east two together. The last part of this pod affinity is the topology key, and all this is doing is telling us which label on the node to use as a grouping. So when we match these pods together, we want to ensure that they all end up in the same zone,

19:38 which is why we're using failure-domain.beta.kubernetes.io/zone. Of course, if we describe our nodes or show labels, we could also use this topology label here if we wanted to do this regionally, or we could use the zone label here. And instead of using the failure domain, let's use topology.kubernetes.io/zone. So let's copy this affinity, apply it here too, and we'll do the same for West, only this time we'll update the West keys, like so. Let's apply our four pods to the cluster. Now what we should see is that the two east pods are scheduled in the same

20:44 zone and the two west pods are scheduled in the same zone. But right now, that zone could be the same for all four pods because we haven't built any repel or anti affinity into the two different pod specs. So let's see where they all ended up. Alright. West is on F Zed, and East is on X 42. So updating these pod manifests is a little painful. So let's delete them for the last time. And now let's work with some deployment objects that we can scale up and down to see how these affinities and anti affinities

21:31 work at a slightly larger scale. So let's take a look at our new deployment specs. Here we have an NGINX East deployment, replicas 10, with a pod affinity to group all of its pods together on the same host. Down here, we have NGINX West, replicas 10, with the same affinity to group all of the NGINX West pods together on the same host. We can apply this to our cluster, run get pods, and we'll see the containers are creating. Let's run that again, hope it's been enough time, and get the wide output. We'll see 10 pods on f zed f

22:17 for nginx west and 10 pods on x 42 for nginx east. So we've removed NGINX East and West, and we're trying to represent something that is a more traditional three tier application setup. Here, we have some sort of critical application. This could be your front end proxy, like HAProxy, NGINX, whatever you're using to deliver cached assets to your customers. It's very likely you never want this to fall over because then your customers aren't getting their responses. So we're gonna call it critical app a. Next, we have web app a. This is

22:26 Three-Tier App Scenario (Affinity/Anti-Affinity in Practice)

22:55 the back end for the critical app. So if we can't deliver something from a cache, we have to go speak to the back end service and deliver the result. We're gonna call this web app a, and web app a needs to speak to a database. So we have database a. My naming, it sucks. So we have our three tier application setup. Now all of these containers are just NGINX under the hood. It's not important. We're just looking at the scheduling constraints. So in a production like environment, what do we need to do? Well, we're very concerned about the performance of

23:30 our web applications speaking to our database. So let's do something that we already know how to do and couple these together through affinity. Well, let's go to our affinity. We'll do pod affinity. We'll let Copilot drop in some things. And what we're going to say here is that we want to schedule web apps next to database apps, and we want them on the same host name. Doesn't matter where the databases end up. We're only gonna set the affinity on the web app. So whenever those come to be scheduled, we always find a node or hopefully always

24:15 find a node where there is a database pod available to couple those together. Let's apply this to our cluster and run get pods. Now we can see we have two copies of the critical app, on f zed f and x 42. Our database ended up on x 42, which means our two web apps ended up on x 42. So let's increase the replicas for our web application. Now let's say we need 10. Where do you think they're gonna end up? Well, let's run get pods -o wide, and we're now starting to heavily overload our x 42 server.
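
The web app's affinity to the database could be sketched as the Deployment below. The resource names and the `app` label values are assumptions based on the names used in the video ("web app a", "database a"); the exact spelling in the original manifests is not shown.

```yaml
# Sketch: web app pods must land on a node that already runs a database pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-a                  # assumed name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app-a
  template:
    metadata:
      labels:
        app: web-app-a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: database-a  # assumed label on the database pods
              # "Same place" is defined per node by using the hostname label.
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx             # the video uses NGINX for every tier
```

Because the rule is required and keyed on hostname, every replica follows the database pods, which is why scaling to 10 overloads the single node that hosts the database.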

25:06 Why? Because it's the only place we have a scheduled database, and the pod affinity rules mandate that all of those web application pods should run on the same host name. So let's horizontally scale our database application. And I'm gonna scale back our web application at the same time just so that we can scale it back up to see even distribution across more than one node. And we'll scale up our database to be three. Now when we run get pods, drats, we have three databases running on the same node. That's not ideal. So now we need to do something else.

25:55 We have our web applications matched in affinity with the database pods. Our two critical applications got scheduled on different nodes by chance, and our three database pods got scheduled together by chance on the same node. And I'm not a big fan of by chance in a Kubernetes environment. So we need to apply anti affinities to our database and our critical application. So we can come back in and add the affinity key and say that we want pod anti affinity. We're now going to say, thank you, Copilot, that all of our critical applications, which is all the apps with a label

26:37 app with value critical app a, will be repelled from the host name. That means when we schedule three of these critical applications, we will see one on each node within the cluster. In fact, we'll take that a step further and we'll request four. Now because we have a required during scheduling, we'll see that one critical web app is actually gonna fail to schedule. While I'm here, I'm also gonna copy this affinity block for our database. We will update the value here to database a, and it's the exact same rule. We never want to schedule two databases on

27:28 the same node because we need these to be resilient across node failures. And we'll say that we want three replicas, one on each node. Well, let's take a look where these all ended up. We can see now that we have three copies of our database, all on three different nodes. We can see that our critical app actually can't schedule right now, and that's because we are in a small cluster with pod anti affinities. So they're actually never gonna be able to schedule. We can help that along by deleting the old replica set like so. And now we can see our web applications

28:20 are both still scheduled on the same node. So let's kill some and see where they end up. And now they're on different nodes because they have multiple options, because we have multiple availabilities of the database pod. Perfect. And as we said, with critical apps, at four replicas and three nodes with the anti affinity, we have one pod which will never schedule. So pod affinities and especially anti affinities are great resources for increasing the resiliency of your application in a Kubernetes environment. With the caveat that you have to be aware of those scheduling constraints and the size of

29:16 your cluster. In a three node setup like I'm working with right now, it's gonna be very difficult to roll out updates to your application because there may not be enough nodes to satisfy those constraints. So let's tweak this example a little bit more. Now we have just two deployments, a critical web app, which we need three of, and it has an anti affinity against itself and the database. Why the database? Well, in the case of node failure, we don't want our database and our critical web app to both be unavailable. We then have our database, which has an

29:52 anti affinity on itself because we need resiliency at the database layer. When applied together in this configuration, we end up with this. We have the three database pods scheduled and running. However, we're unable to schedule a critical application. Why? Because the database anti affinity with itself spreads out across all the nodes in our cluster. The critical app has an anti affinity of the database, so now it can't schedule anywhere where the database is running, which just happens to be everywhere. And we have a tool for this too through priority classes. Let's jump back over to our YAML and
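
The critical app's anti affinity from this scenario could be sketched as below. Names and label values are assumptions drawn from the spoken names; the database Deployment would carry the same kind of block, listing only its own label.

```yaml
# Sketch: never place a critical app pod on a node that already runs
# another critical app pod or a database pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app-a             # assumed name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app-a
  template:
    metadata:
      labels:
        app: critical-app-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - critical-app-a   # repel itself
                      - database-a       # repel the database (assumed label)
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx
```

On a three node cluster with three database replicas spread one per node, this rule leaves no eligible node, which matches the unschedulable critical app described above.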

30:28 Priority Classes (Preemption)

30:35 create a new YAML document. We can copy an example from the documentation like so. Here, we're using the scheduling.k8s.io/v1 API. We're going to create a new priority class, and we'll call this critical. We set this value very, very high because it has to rank higher than the other priority classes within the system. We do not want to use this as a global default. All our applications are not critical. And then you can add a description for any operator that comes along and is curious what this priority class is for. There's just one other flag on this resource

31:16 that isn't covered with this first example, and that is the preemption policy. Now I'm gonna set it to never, but there is another option. How do we know what these options are? Well, we can rely on our good friend, kubectl explain. With kubectl explain, you can put in any resource name and then any member of that spec to get detailed documentation about how it could be configured. Now we can see that the preemption policy allows us to be Never or PreemptLowerPriority. Now all preemption means is that when new pods come into the queue to be scheduled,

31:59 we can take a look at the value of the priority class to understand if they should skip the queue. If we leave this as the default when a critical pod needs to be scheduled, but there's 12 pods in front of it, it's likely the scheduler will say, hey. I have to go schedule this one first. Kinda like being friends with a bouncer in a nightclub. And if you don't want this behavior in your cluster, set the policy to never, and pods of this class will never skip the queue. So let's add this priority class to our

32:40 deployment. We pop down to the pod spec where we can set the priority class name to critical. So let's set up a scenario where we can see that in action. We'll take off the never so we can get the default, which is to preempt the pods ahead in the queue. I'm going to comment out the critical app and scale our database to six. Now we can run get pods. We'll see that we have three database pods pending and three database pods running. Currently, we have three pods in the queue to be scheduled. Let's bring back our priority class

33:30 and our critical application. We make sure that our priority class name is set on our deployment object. Now when we apply these and run get pods, we'll see that our critical application didn't just skip the queue, It evicted the other database pods. Priority classes give you a lot of power when it comes to scheduling workloads within your cluster. Just be careful you don't blow away other important aspects at the same time. Alright. Let's kick things up a little bit and start playing with some more advanced scheduling techniques. So there's a relatively new feature in Kubernetes
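
The priority class from this chapter could be sketched as below, together with a pod that references it. The numeric value is an assumption (the video only says "very, very high"), and the pod name, image, and description text are placeholders.

```yaml
# Sketch: a high priority class that is not the cluster-wide default.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000                     # assumed value; higher ranks first
globalDefault: false
# Default behaviour; set to Never to stop pods of this class preempting others.
preemptionPolicy: PreemptLowerPriority
description: "For workloads that must be scheduled ahead of everything else."
---
# Sketch: a pod that opts in to the class by name.
apiVersion: v1
kind: Pod
metadata:
  name: critical-app-a             # assumed name
spec:
  priorityClassName: critical
  containers:
    - name: nginx
      image: nginx                 # assumed image
```

Running `kubectl explain priorityclass.preemptionPolicy` shows the accepted values for the policy field.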

34:20 Topology Spread Constraints (Even Distribution)

34:20 called topology spread constraints. Try saying that three times fast. Topology spread constraints allow us to scale up our application while maintaining even distribution across a dimension. That dimension could be region, host, architecture, whatever you want, and we have options to configure the maximum skew for that even distribution, meaning that we allow a certain deviation of that spread across the dimension based on what we configure. Now I know that all sounds like blah blah blah blah blah, so we're gonna walk through an example of that step by step to really help solidify what topology spread constraints are.

35:09 So I set up a situation where we have three database pods and three critical application pods. We still have the anti affinity, so the critical app cannot be scheduled anywhere there is a database pod. If we take a look at the YAML, we've added our topology spread constraint. Now we're gonna come back to that in just a minute. The first thing I want to do is scale down our database, giving us a little bit of room to deploy a critical application pod. Now what's important here: there is no anti affinity against the critical application.

35:55 We only specify anti affinity against the database. So you'd maybe expect that we're able to deploy all three critical application pods to the one zone which we're using as a topology key. Now because we have the spread topology constraint, this actually cannot be scheduled, specifically because we say that if we cannot satisfy this constraint, we don't want to schedule any new pods. So let's work through the topology spread configuration. The first thing we do is we apply a label selector. So you could use this to enforce even distribution of multiple deployments, replica sets, stateful sets, etcetera,

36:40 any pod from any higher level component if you need to enforce even distribution. You can set the topology key. Here, we're using host name. So what I'm saying is that when I schedule pods with app critical app a, I want even distribution across every node in my cluster. And the max skew of one means I'm only willing to accept one host, one node, being one pod above the even distribution. Does that make sense? Let's break it down a bit more. Let's set the max skew to two. When we run get pods, we'll see that

37:32 we have two critical apps now being scheduled, and they're both going to be on the same machine, f zed f, the only node that doesn't have a database pod and doesn't violate the anti affinity. Now we don't have even distribution because we have three nodes, but we have two pods on one node. Even distribution would mean one pod on each node. And with a skew of one, meaning maybe one node could push to two. I know it can be a little complicated. So let's set the skew back to one and drop our database to one.
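
The spread constraint walked through above could be sketched as below. Names and label values are assumptions, and the anti affinity against the database from the demo is left out to keep the fragment short. As the Kubernetes documentation defines it, `maxSkew` is the largest allowed difference between the number of matching pods in one topology domain and the minimum across eligible domains.

```yaml
# Sketch: keep critical app pods evenly spread across nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app-a             # assumed name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app-a
  template:
    metadata:
      labels:
        app: critical-app-a
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          # Leave the pod Pending rather than break the constraint.
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: critical-app-a
      containers:
        - name: nginx
          image: nginx
```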

38:20 Now when we run get pods, we have two being scheduled, and even distribution will be maintained. We're going to see one of these on each node that is available that doesn't have a database. So let's take the database out altogether now and apply. Now we have one critical application across each node. We have complete even distribution right now, meaning one on each of the dimension, the host name. With a skew of one, we can actually scale this to four, and that will be okay. And in fact, we can push this to 20 and apply.

39:17 And this will continue to work because as long as the scheduler can have even distribution, meaning it can schedule ubiquitously across the dimension, we have no problems. So let's bring back our database with one replica. We're going to delete our deployment. Like so. This puts us back in a position where we have one database once it finally schedules. There we go. And now we're gonna bring back our topology spread aware deployment. So we'll set this to three and apply. You'll see that one fails to schedule. Why does it work this way? Well, if we're able to schedule

40:31 evenly across the dimension, which is host name here, and we have the ability to schedule on all the host names, then the skew never goes above zero and you can scale the workload as much as you want. When you lose access to one of the nodes on the dimension or the node in this dimension, we're going to then have uneven distribution across the dimension, which means this skew becomes very important. Now, of course, I could say that I want a skew of 10 and I can scale this deployment back up as much as I want because we can

41:07 accept 10 nodes across breaking the deviation. But it's likely that you want even distribution for a reason. We'll apply it back to one. So when should you use topology spread constraints? Well, it's kind of an evolution of anti affinity. Let's assume that you want to decrease the latency or minimize egress traffic across zones and regions within your cluster or latency to your customers. You could use anti affinity to ensure that you schedule the pods in different regions and zones, but the anti affinity means that you can't then continue to scale up those pods within that region without adding new nodes.
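
For the zone case raised here, a spread constraint keyed on the zone label might look like the sketch below. The replica count of 20 and all names are illustrative assumptions. Unlike a required anti affinity keyed on hostname, this does not cap the replica count at the number of nodes.

```yaml
# Sketch: spread many replicas evenly across zones rather than one per node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app-a             # assumed name
spec:
  replicas: 20                     # illustrative; can exceed the node count
  selector:
    matchLabels:
      app: critical-app-a
  template:
    metadata:
      labels:
        app: critical-app-a
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: critical-app-a
      containers:
        - name: nginx
          image: nginx
```

Setting `whenUnsatisfiable` to `ScheduleAnyway` instead turns the constraint into a preference rather than a hard rule.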

41:57 Spread topology constraints mean that we can continue to scale up a workload while ensuring the even distribution. We could be in a position where we want 20 or 50 replicas of a pod. This could all be scheduled on one machine. So we add the anti affinity, and now we need 50 machines. But, actually, what we really want to do is just ensure that we have enough resiliency, redundancy, and performance benefit by maintaining even distribution across the zones and the regions. So we have the ability to do anti affinity in a controlled manner without the ceiling

42:41 on the scale or without our scale being bound by the number of nodes, and that's pretty cool. Alright. Let's kick things up again. This time, we're venturing far into the new. This stuff is 1.26 alpha. However, these are some new, cool scheduling features coming to Kubernetes soon. Now I can't use my shiny managed GKE cluster for this demo because we do need to enable a feature gate on the kube-apiserver. If we go into the kube-apiserver manifest, you'll see that on the feature gate line here, I have enabled a feature gate called pod

42:50 Pod Scheduling Readiness Gates (Alpha Feature)

43:28 scheduling readiness. What does this mean? Well, it means I can have a pod spec, which has scheduling gates. This is for situations where you want to create pods, but they're not ready to be scheduled yet. This could be very useful for blue green and canary style deployments, where we wanna start pushing resources around, but we need to gate their scheduling. Why? Well, when you have unschedulable pods for whatever reason, that can trigger your cluster autoscaler, which can spin up a whole bunch of new nodes unnecessarily just because you created a pod that could never be scheduled,

44:12 and it can't be scheduled because it's not ready to be scheduled yet. This is where scheduling gates come in. So let's apply this to our cluster and run get pods. We'll see here that our pod is pending, and we can describe our pod and we'll see a nice little message at the bottom saying that we can't schedule this because we have non-empty scheduling gates. We can come in and remove a gate. Wow. Did you just see me edit a pod? Typically, pods are extremely immutable. However, with scheduling gates, we do have some mutability on the spec.
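As a rough illustration of what a gated pod looks like, here is a minimal Python sketch that builds the manifest as a plain dictionary and prints it as JSON. The helper name `gated_pod`, the pod name, and the gate name `example.com/rollout-approved` are made up for this example; the `schedulingGates` list in the pod spec is the field the feature adds.

```python
import json
from typing import Any, Dict, List


def gated_pod(name: str, image: str, gates: List[str]) -> Dict[str, Any]:
    """Build a Pod manifest that stays Pending until its gates are removed."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Each gate is an object with a single "name" field.
            "schedulingGates": [{"name": gate} for gate in gates],
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    pod = gated_pod("gated-nginx", "nginx", ["example.com/rollout-approved"])
    print(json.dumps(pod, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -` on a cluster that has the feature gate enabled.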

45:04 Though it's not ever really a situation where you should be manually modifying these scheduling gates. This should be a controller within your cluster. You should be pushing these pods, or your deployments and replica sets should be creating the pods, with the scheduling gates. And your custom controller, whether that be for blue green, canary, perhaps it's a phased rollout across new zones or regions, whatever your reasoning is, your controller can monitor those gates and remove them as required. So now that we've removed the scheduling gates, we can reapply the pod and it will be scheduled as normal.
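The core of what such a controller would do can be sketched as a pure function over the pod manifest. This is only an illustration under assumptions: `remove_gate` and `is_schedulable` are hypothetical helper names, the gate names are invented, and a real controller would send the resulting spec change to the API server rather than edit a dictionary.

```python
import copy
from typing import Any, Dict


def remove_gate(pod: Dict[str, Any], gate: str) -> Dict[str, Any]:
    """Return a copy of the pod with one named scheduling gate removed."""
    updated = copy.deepcopy(pod)
    gates = updated["spec"].get("schedulingGates", [])
    updated["spec"]["schedulingGates"] = [g for g in gates if g["name"] != gate]
    return updated


def is_schedulable(pod: Dict[str, Any]) -> bool:
    """A pod is handed to the scheduler only once no gates remain."""
    return not pod["spec"].get("schedulingGates")


if __name__ == "__main__":
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "gated-nginx"},
        "spec": {
            "schedulingGates": [
                {"name": "example.com/canary-approved"},
                {"name": "example.com/zone-ready"},
            ],
            "containers": [{"name": "nginx", "image": "nginx"}],
        },
    }
    pod = remove_gate(pod, "example.com/canary-approved")
    print(is_schedulable(pod))  # one gate is still present
    pod = remove_gate(pod, "example.com/zone-ready")
    print(is_schedulable(pod))  # all gates removed
```

Keeping the function pure makes the gate logic easy to test separately from whatever client library the controller uses.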

45:48 So this isn't a world changing feature, but it does open up new levels of automation in-cluster with controllers, and I'm very excited to see what tooling does with this new feature. Alright. We're gonna kick it up one more time, and we're gonna go straight back into alpha features of 1.26. This is dynamic resource allocation. What is dynamic resource allocation? Well, let's understand everything we've covered so far. The scheduler, at least in the earliest days, only allowed you to schedule based on CPU and memory. Over time, that wasn't good enough for people deploying to Kubernetes,

46:12 Dynamic Resource Allocation (Alpha Feature)

46:31 and new options were added. The ability to schedule based on the hardware available, network topologies, availability of volumes, and so forth. Well, these are all baked pretty much straight into Kubernetes itself. Dynamic resource allocation allows us to dynamically define new things that can be claimed. Now the examples focus on hardware as part of the node, like a GPU. I actually think the use case for this will go much further, potentially out to claiming cloud resources or maybe locks on topics on Redpanda. Now it's very early, but I'm gonna show you what I've been able to put together

47:22 for dynamic resource allocation. Now, of course, when you're venturing into the wilds of alpha features, we have a feature gate called DynamicResourceAllocation as well as a runtime config to enable a new API group of resource.k8s.io. For this example, I did have to provide my own CRD, and I'm kind of leaning on the example from the KEP here, which talks about claiming cats, only there wasn't enough code in the KEP for it to actually deploy, so I had to fix that. So here we have a custom resource definition that describes a cat, and we have a

48:00 couple of properties, one being the size of the cat and the other the color of the cat. And this is now available within our cluster if I grep for cat. Ta da. Next up, we have resource.yaml. Now the first thing you need to do is provide a driver. This is something that is responsible for providing or allocating the claim and then removing it when a claim is no longer needed. So here, we're just making up a driver name of resourcedriver.example.com, and we have a resource class to back them. Next, I'm creating a cat. So I'm using the custom resource definition I

48:44 put together to describe a cat, which is large and black. Then we have a resource claim template. Much like a PVC, here I'm saying that I want to be able to claim something of this resource class name with these parameters. So I'm providing a template for someone who wants to claim my cat. So we can now apply this to my cluster like so. Now I can run kubectl get cats, and I'll see my large black cat. I can also get the resource claim template. My cat can be claimed, and we're just waiting for the first consumer

49:30 who wants my cat. So let's go back to our pod spec. Here, we have an NGINX pod, and that's it. So let's actually consume our new resource claim. Much like a PVC, we have to start by adding our resource claims to the top of our pod spec. We can give it a name and say that it's Salem, the name of the cat from my favorite TV show as a kid, Sabrina. We can say the source of this claim is our resource claim template name, which is large black cat claim template. Now that we have this available,

50:14 we can go to a container. And you see here where we have resources and limits for CPU and memory already. We're going to add that we want our claims, and we want our name Salem. Now we can come to pod.yaml, drop it in, and apply. Now there's no real driver piloting this claim, so we could get the resource claim template, and we'd see that it's still waiting for the first consumer. But I hope that shows you how we can start to bring in new hardware like the KEP proposes, but also claims on cloud resources. And I can't wait to see where this

51:07 functionality goes over the next year. So that's dynamic resource allocation. So that's it for our deep dive into Kubernetes scheduling. We took a look at the basics, scheduling to a node by setting the node name yourself. We then progressed into node selectors, affinities, anti-affinities, topology spread constraints, and some very fresh out of the oven features with dynamic resource allocation and scheduling gates. So I hope this video was useful to you. Remember to subscribe and click the thumb button to thumb up, please. And I will see you next time. Thank you again to Commodore for sponsoring my time

51:16 Conclusion and Summary

51:51 to make this video. I'll see you all next time. Have a great day.

Technologies featured

Kubernetes
