Overview

About this video

What You'll Learn

  1. Define Prometheus-backed SLIs in YAML and commit them to your Git repository.
  2. Use SLO quality gates to block promotion when response time breaches targets.
  3. Trigger automatic remediation that scales Helm-managed pods when performance degrades.

Juergen Etzlstorfer returns to walk through Keptn quality gates backed by Prometheus SLIs, deploy to Kubernetes via Helm, then trigger automatic remediation that scales pods on a response-time breach. Includes Litmus and k3s notes.

Chapters

  1. 0:00 Holding Screen
  2. 0:35 Introduction and Welcome
  3. 0:37 Introductions
  4. 1:17 Juergen's Introduction & Keptn Recap
  5. 3:15 Overview of Part I (Slides)
  6. 3:20 Keptn Overview & Core Concepts (Orchestration, Use Cases, Declarative Config)
  7. 8:40 Keptn Quality Gates Explained (SLIs, SLOs, Scoring)
  8. 12:00 Auto Remediation Concept
  9. 14:10 Initial Demo Setup & Triggering Deployment (Attempt 1)
  10. 16:30 Quality Gates for Continuous Delivery
  11. 17:05 Monitoring Deployment in Keptn Bridge (First Attempt)
  12. 21:00 Observing Quality Gate Failure (Prometheus Connection Issue)
  13. 31:50 Troubleshooting and Configuring Prometheus
  14. 35:30 Triggering Deployment (Second Attempt)
  15. 36:15 Monitoring Deployment and Tests
  16. 43:20 Codifying SRE Practices with Keptn
  17. 44:50 Observing Staging Deployment and Quality Gate Result
  18. 50:40 Keptn Deployment Architectures (Multi-cluster, K3S)
  19. 53:00 Automatic Remediation
  20. 53:10 Quality Gate Fails as Expected (Response Time Issue)
  21. 55:50 Setting up Auto Remediation for Production
  22. 58:00 Adding Remediation Instructions File
  23. 1:00:10 Generating Load to Trigger Remediation
  24. 1:00:50 Observing Response Time Increase in Prometheus
  25. 1:02:10 Keptn Remediation Process Explained
  26. 1:03:45 Observing Remediation Action: Scaling Up Pods
  27. 1:05:35 Keptn Integrations Overview
  28. 1:06:00 Litmus Chaos Engineering Integration Example
  29. 1:10:00 Building Keptn Integrations (Developer Template)
  30. 1:19:00 Future Keptn Developments (0.8, Shipyard, Multi-cluster)
  31. 1:19:50 Engaging with the Keptn Community (Slack, Calls)
  32. 1:21:15 Conclusion and Farewell
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:35 Introduction and Welcome

0:35 Hello, and welcome to today's episode of Rawkode Live. I am your host, Rawkode. Today, we're gonna be taking a look at the Keptn project. We have done an episode on this previously, where we looked at using Keptn for continuous delivery with some really cool checks inside that help gate and protect your deployments from errors and bugs. Today, I'm gonna be joined by Juergen once more. Juergen is an engineer at Dynatrace and a maintainer of the Keptn project. Hello, Juergen. Welcome back. Hi, everyone. Thanks. Thanks, David. I just gotta say, I love your Keptn T-shirt. It's

0:37 Introductions

1:08 really cool. Yeah. Thanks. It's our very famous logo. We're happy. Nice. Do you wanna kick us off with just a short introduction about yourself, and then we can start to talk about the Keptn project again and do a little summary of where we left off last time? Sure, thank you. So my name is Juergen. I'm one of the Keptn maintainers. I think we started this project around two years ago. It originated from an idea of how we can make continuous delivery and operating software better. The whole idea was about automating

1:17 Juergen's Introduction & Keptn Recap

1:51 a lot of steps, and then we first built kind of a methodology, and from this we built the whole Keptn project. So it's been around for some time. It was initially started by Dynatrace. It's now an open source CNCF project. We are in the sandbox phase of the CNCF. And yeah, in the beginning I was also one of the main developers. Right now, I'm more taking a look at the Keptn ecosystem, how we can integrate with other tools. We just built the Litmus integration recently. We're building a Locust integration. We have other

2:28 integrations for different testing tools, automation tools, deployment tools. So this is what I care about these days. Pretty excited. Okay, awesome. Yeah, I think what I realized last time when we were looking at this project is that it really is like the one complete tool for implementing continuous delivery. There were a lot of different components, a lot of different features that we can, you know, adopt and leverage inside of our Kubernetes environment. And we only got halfway through the tutorial. So today we're gonna try and take a look at the rest of that tutorial and the other

3:07 features that Keptn brings to the table. So I believe you've got a few slides that you wanna run through to introduce us to Keptn again and give us a quick summary of where we left off. Cool. Yes. So I just want to share a little bit of let me just, am I already in presenting mode? I'm not sure. I think now I am. Yep, there you go. Okay, perfect. So I just want to give you again a brief overview of Keptn and what we did already last time and what we are going to see today.

3:20 Keptn Overview & Core Concepts (Orchestration, Use Cases, Declarative Config)

3:41 So it's Keptn itself is the cloud native application lifecycle orchestration. So you can, that basically means Keptn will orchestrate your existing toolset for deployment, for testing, for evaluating or you can do this just with Keptn. There are different use cases that users, when users start with Keptn they go either full Keptn with having progressive delivery with Keptn, having quality gates with Keptn, doing auto remediation with Keptn or they go only for one part. For example, Keptn quality gates. That's one of the aspects of Keptn that we see that's frequently used and heavily used because it's a very

4:25 mature concept of quality description of a microservice and or can even be a monolithic application and then Keptn will evaluate based on different data sources like Prometheus evaluate the quality of the software and can make a decision if it should pass the stage, if it should be promoted to production or not, for example. It's all built on declarative configurations. So it's very much in all the Kubernetes space. It consists of basically YAML definitions and you can connect different tools to it. I already said what you can do with it. You can automate a lot of parts like

5:06 monitoring, how you want to do your delivery, how you want to have your remediation executed. And this is done event driven or data driven. So we are using CloudEvents for this, and CloudEvents are sent to Keptn and from Keptn, and then you can interact with these different tools, or different tools can interact with Keptn, based on the CloudEvents standard, which is also a CNCF project. Last time, I think we already covered the part why we did it. The main part is really automation and making your continuous delivery more robust and also giving you

5:45 guidance and a framework for automated remediations. And last time we already did the first two use cases. So we took a look at how to automatically configure observability and dashboarding tools. We were already using Prometheus last time. We will do this also today. We've been using two major concepts and two major files of Keptn. The first one is the environment definition. We call it the shipyard file. And I am from Austria. We speak German here. Keptn for us sounds like the captain of a ship. So we have a shipyard file. We have a Keptn uniform

6:29 which defines the tooling that you need or that you want to use. We have a lot of things in Keptn. We have the Keptn's Bridge, for example, which is our UI. A lot of things have these nautical terms and this idea behind them. Our tagline still is nobody ships apps like the captain. So yeah, we like this theme. So the shipyard file is basically describing your environment. And just with a couple of lines, you can describe a multistage environment, and this multistage environment basically also reflects a multistage delivery pipeline. Between the stages, Keptn automatically executes a quality

7:10 gate. And this quality gate can be based on SLOs, as we can also see here. But these SLOs, service level objectives, can also be used for configuring and setting up your alerting and your dashboarding and your observability tools. So for example, we can use these SLOs to configure the Prometheus Alertmanager, to set up Grafana, and also to have alerts in Grafana and just generally set up the scrape job rules. Also, what we can do with this SLO file, as I already said, it's the definition for our multistage delivery. So with this SLO file and with this shipyard file,

7:53 we have a multistage delivery definition already with a quality gate and Keptn will execute this quality gate and will trigger the quality gate whenever we want to move from one stage to the other stage. And if you want to do a direct deployment or a blue green deployment, it's just a matter of the configurations in the Shipyard file. We will see a little bit of this also in today's presentation. We saw this also last time. We went in more detail last time. I know that David has already prepared a couple of things for today so we

8:27 can kind of have everything that we did last time and continue from there. So we will see a little bit of this also, and I will just be explaining as we go through it. The new part that we will see is actually more in the remediation part. But just as a reminder of how Keptn quality gates work: we are based on two concepts coming more or less from the SRE community. So if you're familiar with the SRE book from Google, or if you're interested in this book, it's a great book. And they are promoting two concepts

8:40 Keptn Quality Gates Explained (SLIs, SLOs, Scoring)

9:04 there. The one is the SLI, which stands for service level indicator, and the other is the service level objective, short SLO. These are the major concepts there. So the first one, an indicator, is basically something like a metric. You can think of a metric. So in Keptn you can define, I would call it a library or a list of indicators, and you map the name of the indicator to a PromQL query. So then you can reuse the name of the indicator in your SLO file, and the SLO file defines an objective for an indicator. So for example, the error

9:42 rate has to stay lower or equal to 1%. So in this case we can easily define the error rate and in the SLO file we do not define where the data is coming from. So whenever we want to change how we retrieve the data, maybe we want to change from one tool to the other, maybe we just have to clarify the granularity of the data. We can do this in the indicator file. We don't have to change our quality gate definition. And it's not that we only care about one objective and one SLI but we can

10:14 mix and match different SLIs. We can have absolute thresholds, and we can have relative thresholds that will then be compared to previous runs. For example, the number of database calls is allowed to increase by 2% compared to the previous runs, but it also should stay within 10 database calls per transaction. So there are different ways to define this quality gate. And whenever it's triggered, either by an external tool or within the Keptn continuous delivery workflow (here it is just an example of triggering it with the Keptn CLI), then Keptn will reach out to the different observability tools and data providers,

10:58 will query all this data based on the service level indicators and will then score the service level indicators. Each time it's a full pass of the criteria, you will get a full point. If it's not a full pass but it's still inside a warning criteria, you will get half the points and then in total, Keptn will come up with a score. And this score you can then use to decide if it should be promoted to the next stage, if you want to keep it in this stage, if you want to roll it back. This is then if you're just using Keptn

11:29 quality gates, it's totally up to you. If you're using Keptn, I would say like the full installation of Keptn, Keptn can then automatically initiate a rollback, for example, of this deployment. And we can also use the same file to configure our alerting and then react to problems or alerts that are coming in. So an alert, for example, sent from the Prometheus Alertmanager can then be consumed by Keptn. And if there are some remediation actions defined and added to Keptn, Keptn can trigger these remediation actions and can again evaluate if the action was actually successful

12:00 Auto Remediation Concept

12:12 based again on the SLO file and then either execute the next action or close the issue. So we'll see this also in today's demo. For example, we have an alert coming in from Prometheus. Keptn will first take a look if there is a remediation action defined for this service that is affected in a particular stage or environment, like preproduction, production, whatever. And again, you might remember from last time, Keptn is based on a Git approach, so all the configuration files are stored in its internal Git repository or on GitHub or GitLab or

12:54 Bitbucket, wherever you want to use it. And it will store its configuration files in this Git repository. So it will also take a look in the Git repository if there is some kind of remediation configuration. If it can find this file, it will go ahead and trigger the remediation action. So in this case, it will scale up by an increment of plus one, which could also be like plus 10% or whatever you want to call it. It will reevaluate the quality gate based on the SLO definition. Then, if it cannot meet the required

13:31 quality, the criteria, then it will go ahead and execute the second remediation action. In this case, it's a feature toggle. It will again reevaluate the SLO and then it can escalate or close the issue just based on the outcome. So this is actually what we're going to look at today. I think we will start maybe with the quality gates since we already have an up and running working environment. I am not going into the architecture right now. Maybe if we have time in between, I can always come back to this and show a little bit of the architecture.
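
Editor's sketch: the remediation sequence described above (run an action, re-evaluate the SLO, move on to the next action, then close or escalate) can be illustrated in a few lines of Python. This is not Keptn's implementation. The function names, the simulated service, and the 1500 ms and 600 ms figures used in the simulation are assumptions made only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An action is a (name, callable) pair; the callable changes the system somehow.
Action = Tuple[str, Callable[[], None]]


def remediate(actions: List[Action], slo_met: Callable[[], bool]) -> str:
    """Run actions in order, re-checking the SLO after each one.

    Returns a short status string: the issue is closed as soon as the SLO
    is met again, and escalated if every action has been tried without success.
    """
    for name, action in actions:
        action()
        if slo_met():
            return f"closed after {name}"
    return "escalated"


def simulate(with_feature_toggle: bool) -> str:
    """Hypothetical service whose p95 response time is load / replicas."""
    state: Dict[str, float] = {"replicas": 1, "load_ms": 1500.0}

    def scale_up() -> None:
        state["replicas"] += 1      # increment of plus one, as in the talk

    def feature_toggle() -> None:
        state["load_ms"] /= 2       # assume the toggle disables a slow code path

    def slo_met() -> bool:
        return state["load_ms"] / state["replicas"] < 600.0

    actions: List[Action] = [("scale_up", scale_up)]
    if with_feature_toggle:
        actions.append(("feature_toggle", feature_toggle))
    return remediate(actions, slo_met)
```

In this simulation, scaling from one to two replicas is not enough on its own, so the second action is what closes the issue; with only the scaling action defined, the issue is escalated.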

14:08 But if it is fine, we just deep dive already into into the the hands on part. Nice. Excellent. Cool. So now we let's get my screen up then. So what I wanna do is just go over what I prepared in advance to make sure that we're on the same page there. I really hope I've not messed up which is you know, only happened nine out of 10 times on this show. So screen. And we also have a hello comment. Hey, Philip from Berlin. Thanks for joining us. So Hi, Philip. I just my calendar just chilling there. Don't know why.

14:10 Initial Demo Setup & Triggering Deployment (Attempt 1)

14:53 So I've worked through the tutorial which is available from the Keptn website. I've completed up to step 14, which is roughly where we got to last time. I think what we agreed was, on the previous episode, we did do the quality gates, but there's some value in going over that again quickly. What we have is a Kubernetes, I just must remember to clear my screen before I do this stuff. What we have is a Kubernetes cluster with the Keptn workloads. There we go. And we have our application workloads in their own namespace. Is that sockshop? Right?

15:34 Yes. I believe we have two or even three different namespaces: sockshop dev, production, staging. So that's also one thing. In the current version of Keptn, we are making the distinction between environments by namespace. We are already working on making this distinction between environments also per cluster. So in the future, with the next release of Keptn, it will be possible to have a multi-cluster setup and have a dedicated, let's say, staging cluster and a dedicated production cluster. Right now, it's separated by namespaces. Nice. I think that would be a very

16:11 welcome addition. I think from the people that I speak to, there's a lot of growing support for the multi-cluster setup, definitely. So that's nice to hear. Okay. So how do we... Oh yeah. And this repository, the Keptn repository, is available online. This is what we're continuously writing to when we make changes, and we have the Keptn's Bridge available too. So what we want to do now, and just correct me if I'm wrong here, Juergen, is trigger a deployment with the quality gates to show them stopping a build being promoted from one environment to the other.

16:30 Quality Gates for Continuous Delivery

16:49 Is that correct? Yes. So in your preparation also in the last episode what we already did was deploying in multi stage environment deploying the shopping cart of our Sockshop application. So we were deploying this. It went through dev staging all the way through production. And now we are adding SLO files, quality gate definition file and we trigger another build or actually and we we trigger another deployment with already pre built image and we will see how Keptn will prevent it moving to production. Okay. That sounds good. So let me see if I can remember. I

17:05 Monitoring Deployment in Keptn Bridge (First Attempt)

17:35 think I did this Prometheus step. So this does a keptn add-resource, where it adds the SLIs that we have defined in this Prometheus file. Should we pop that open or do we move on to the next step? What do you think is best? We can execute it again just to make sure that we have added this file. I think it's already uploaded. We can actually take a look into the Git repository. The screen is a little bit blurry on my end so I'm not sure if I'm a bit behind, but

18:17 in the git repository we should see in the staging branch we should see in there will be a Prometheus folder and all the configuration files that are responsible for Prometheus should be there and we should be able to see the SLI definition. If it's not there, we just execute this command and Keptn will upload Prometheus SLI configuration into the Prometheus folder. Okay. And we have the Prometheus SLI dot YAML, so I think we are okay. Yeah. All good here. So that's basically the definition how to query different metrics from Prometheus. In this case, it's all

19:03 about the response time, but you could add error rate, throughput, memory consumption, CPU saturation, whatever you want to add here with your PromQL, or Keptn comes with a couple of predefined metrics that you could also reuse. In this case we're just making it clear how the concept of SLIs is actually working. So we just define our own SLI, putting it into our Git repository. Okay. Alright. Let's move on to adding our first quality gate then. So what we're gonna do here is add another resource, and this time it's the SLO quality gates. We can actually see the contents of this

19:48 file here. Let me zoom in on that. I'm still seeing the GitHub screen. So maybe it's a little bit lagging on my side here. Oh, no. But if you're already on the SLO part, that's perfectly fine. What I will do is I will just have my tutorial also next to me so I can just follow along. Okay. Well, I was just there for now. I'm gonna add the resource. So, not found. That's because, am I in the correct directory? There we go. So we've now added the resource that has this SLO file.
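
Editor's sketch: the SLO file added here is explained in the next few minutes of the conversation, and the scoring rules were covered in the slides earlier. The Python below is a rough model of that logic, not Keptn's code. The 600 ms pass threshold, the 10% relative limit, the comparison against previously passed results, the averaging, the half points for a warning, and the key SLI behaviour come from the talk; the class and function names, the 800 ms warning threshold used in the tests, and the 90% overall pass score are assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Objective:
    sli: str                           # indicator name, e.g. "response_time_p95"
    pass_below: float                  # absolute threshold, e.g. 600 (ms)
    max_relative_increase: float       # 0.10 means at most +10% vs. the baseline
    warn_below: Optional[float] = None # assumed warning threshold
    weight: int = 1
    key_sli: bool = False


def grade(obj: Objective, value: float, passed_history: List[float]) -> float:
    """Return 1.0 for a pass, 0.5 for a warning, 0.0 for a fail."""
    # Baseline is the average of previous results that passed, if any exist.
    baseline = mean(passed_history) if passed_history else None
    within_relative = (
        baseline is None or value < baseline * (1 + obj.max_relative_increase)
    )
    if value < obj.pass_below and within_relative:
        return 1.0
    if obj.warn_below is not None and value < obj.warn_below:
        return 0.5
    return 0.0


def evaluate(
    objectives: List[Objective],
    values: Dict[str, float],
    passed_history: Dict[str, List[float]],
    pass_score: float = 90.0,          # assumed overall threshold
) -> Tuple[float, bool]:
    """Return (score in percent, whether the gate passes)."""
    earned, total, key_failed = 0.0, 0, False
    for obj in objectives:
        points = grade(obj, values[obj.sli], passed_history.get(obj.sli, []))
        earned += points * obj.weight
        total += obj.weight
        if obj.key_sli and points == 0.0:
            key_failed = True          # a failing key SLI fails the whole gate
    score = 100.0 * earned / total
    return score, (score >= pass_score and not key_failed)
```

Keeping the objective separate from where the value comes from mirrors the split between the SLO file and the SLI file: `evaluate` only sees indicator names and numbers, so the data source can change without touching the objectives.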

20:39 Let's see if I can work out what this is doing. So it's got a comparison key, which has an aggregate function, a compare with, an include result, and a number of comparison results. Do you wanna maybe just tell us roughly what this SLO is doing? Yeah, sure. So it's basically the definition of the quality gate, defining how we want to evaluate it and how we want to compare it. So we are comparing to the previous result, just a single result, and we're only comparing it to previous results that passed. So if we have a quality gate that

21:00 Observing Quality Gate Failure (Prometheus Connection Issue)

21:15 failed and another one is also failing and we're just comparing it to the previous failing one and we have, for example, relative thresholds, we would not get really good numbers because we would compare a failing error rate to another failing error rate and we would maybe allow it to pass. So we in this error, in this quality gate, we only want to compare it to previously passed results. And for if we are using more than one result we would do the aggregate function. We would just do an average. So we are not taking the minimum or maximum of the error

21:53 rate of previous runs but we would take the average. So if it's only one actually we don't actually need the aggregate function but in this case we are doing the aggregate average. And then we have a couple of objectives and these objectives are defined on SLIs. And actually in this case we only have one SLI which is the response time in the ninety fifth percentile. It's defined in the SLI file that we just, that David just showed you. And here it says key SLI is false. That means if a key SLI is true then the

22:30 Keptn quality gate fails if only this one single evaluation fails. So if you have 50 SLIs, but one is very crucial, that would be a key SLI, or five are very crucial. Otherwise, they are all weighted with the same weight. You can also change this with the weight key here. And then we have the pass and warning criteria with the combination of relative and absolute thresholds. So it's allowed to pass if it's faster than six hundred milliseconds and if the increase over the previous run that also passed is below 10%. So this is how we can build quality

23:15 gates, and we can do this by combining absolute values with relative values and thresholds. So that's pretty cool. Perfect. So we've already added that to our cluster. Now we can verify the current versions of dev, staging, and production and see what we're actually working with here. So let me just grab this command. And this should give us a URL which we could copy. So we can see our dev version here running 0.11.1 and it's green. We can change this to staging and production. What was it? Prod? Probably it's prod. Yeah. Yeah. Oh, no. I'm

24:16 just failing to type. So that's a pretty common occurrence. Alright. Where did I go wrong? Production? Oh, it is production. It's production. Oh, sorry. Let me just run this command just in case then. Oh, it's a different IP address. Okay. Nothing coming up here? No. I've broken production. Okay. So actually it's just not deployed. Maybe we can take a look if we can find the pods. Yeah. Let's do that. So let's get namespaces. Let's set our namespace to sockshop production and make sure we've got our pods running. We've only got the DB. We don't have

25:05 the carts. Yeah. So what we can do is we just start another deployment. We already have the quality gates up and running, which is totally fine, and once we do another deployment... actually it says here that we got a failing quality gate. Can we take a look at staging here? So it was not moved to production because the quality gate was failing, so actually we have proof that the quality gate is already working. So it's not the quality gate that we set up right now, but I assume it's the quality gate that you set up

25:40 yesterday for the preparation for today. So we can take a look at the carts. Probably clicking on the quality gate here will give us a little bit more detail. So so just to clarify there, the problem is is that Keptn is too good, and it detected a problem and stopped our promotion to production. Production. So I like that. Kind of. Kind of. Yeah. So we did not get any let me just get the the light here. So it says the response time was actually failing here. So it was probably too slow to or the response time was

26:22 not satisfying. So let us just do another deployment, probably with the same version. Maybe there was too much traffic on the cluster or it was not coming up. Yeah. So there was a command for that. Right. This one here. keptn send event new-artifact, passing the tag. So I could just run this to deploy the... Yes. Same version. Alright. Yeah. Cool. And what it does is basically it's sending a CloudEvent to Keptn. So it's building a CloudEvent and sending this CloudEvent to the Keptn control plane, and it tells Keptn there is a new

27:02 image of our service carts of our project Sock Shop. So that's the project Sockshop and the service carts was previously onboarded to Keptn. David has done this in the preparation and also in the last episode you can see this and we just informed Keptn now that there is a new container image which has the version 0.11.1 it's actually the same we want to deploy this one. Based on the shipyard file with the multi stage delivery defined, Keptn will start to deploy it first in in the first stage of this delivery file. And in our case it's called dev. So

27:41 it starts with the Helm chart's release version in our dev environment. It will then apply this Helm chart. It will also apply the changes to the Git repository. In this case it might take a couple of seconds, since it also waits until the deployment, basically a Helm upgrade, has actually finished. And I think the readiness probe takes about thirty seconds, fifty seconds, something like this, for the pod to be ready. So it just takes a couple of seconds, and then Keptn will tell you it's finished upgrading the chart and it's already deployed in

28:25 the first version or in the in the first environment. Yeah. So I think it's deployed to dev now. So is is there a way for us to follow this from the from the bridge? Do we see that rollout happening? Yes. We can take a look here in services. So we have the environment we can see all the different stages but in services we can see all the different services that we have already onboarded and we could see for example yesterday there was Quality Gate was failing and we can see the current run, the last configuration changed.
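
Editor's sketch: the new-artifact event that triggered this run was described a few minutes earlier as a CloudEvent sent to the Keptn control plane. The Python below shows the general shape of such an event. The attribute names `specversion`, `id`, `source`, `type`, `time`, and `datacontenttype` are standard CloudEvents 1.0 attributes. The `type` and `source` values, the layout of `data`, and the image name are placeholders chosen for illustration and are not Keptn's actual event schema.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def build_new_artifact_event(project: str, service: str, image: str, tag: str) -> Dict[str, Any]:
    """Build a CloudEvents-style envelope announcing a new container image."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "example/cli",               # placeholder, not Keptn's value
        "type": "example.new-artifact",        # placeholder, not Keptn's value
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": {                              # assumed payload layout
            "project": project,
            "service": service,
            "image": f"{image}:{tag}",
        },
    }


# Project, service, and tag are the ones named in the walkthrough;
# the registry path is hypothetical.
event = build_new_artifact_event("sockshop", "carts", "registry.example/carts", "0.11.1")
payload = json.dumps(event)
```

Because the envelope is plain JSON, any tool that can emit an event like this could start the same delivery sequence, which is the point made about integrations being event driven.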

29:00 So configuration changed here means we changed either the image or we changed the load distribution. We see the configuration was changed. We see it was deployed. We can see a quality gate in dev, but this one was actually green because we have not added any quality gate in dev. So it was allowed to pass. And we can already see here in staging there is an approval. There are a couple of approval events, but we set the approval to automatic. So whenever there is no quality gate or the quality gate is passing, there will be an

29:40 automatic approval. So for between dev and staging, we just allow it to pass basically and now it says there it changed the configuration so it's basically triggering the deployment. The next event will be once the deployment is finished And once the deployment is finished, Keptn will trigger the tests. That's right now I have to admit the small hiccup in the Keptn's bridge because it's not indicating that right now tests are executed. So also in the last episode what we've done is we have added some JMeter tests and also here JMeter tests are added and they

30:21 are automatically triggered by Keptn. Again, with the same approach, Keptn will take a look if there is, in this case, a JMeter folder in the staging branch of our carts microservice. Then it will take a look at the files in this folder, and based on a mapping definition it will then execute these files. So right now tests are executed. They might take about two minutes to finish. I think 5,000 requests are sent to some endpoint of our shopping cart, and once the tests are finished, JMeter will send back a CloudEvent to the Keptn control plane

31:02 informing Keptn that tests are now finished, please go ahead with the next phase, whatever the next phase is, and the next phase is our evaluation. So we can see tests are finished. For the evaluation, we're retrieving data from Prometheus, so we can see data retrieval, and we could see it's not working. Yeah. It's failing. So let us take a look at the response time p95, at the icon. There was a red icon that should give us a little bit of indication why

31:34 it's failing. Yes. Here. So Needs Prometheus. Could not connect to Prometheus service. Okay. Let's take a look if we have Prometheus deployed. So there should be Uh-oh. Okay. There is no Prometheus here. So no no worry. What we can do is we can configure Prometheus with Keptn and actually I think in step number 13 there is some Prometheus information. So we are here in, again, in this version of Keptn, we are Keptn is managing the Prometheus installation. So with the first with this command, we are adding the Prometheus service to Keptn, which is also responsible for configuring the Prometheus.

31:50 Troubleshooting and Configuring Prometheus

32:39 And with It does say it's all unchanged. Exactly. So that seems interesting. But Prometheus itself is might not be running. So there is another command for Keptn configure Prometheus. And this one we need that's also setting up setting up Prometheus, setting up the alert manager, setting up the SLO, setting up the scrape jobs all based on the SLI file, the SLO file, and the shipyard file. So now we should be able to see those pods up and running. Here we go. I'm I'm still confused by that because when I applied this, it says unchanged which means this

33:30 deployment already existed but we didn't have any pods. So does that deployment start with a replica count of zero, and then this keptn configure command scales it up? Is that right? Not exactly. Can we take a look in the Keptn namespace at all the pods? Because what we can see here is that Prometheus is also installed. Mhmm. And the way it works, I will explain it with the Prometheus SLI service. So the Prometheus SLI service, you can install it into the Keptn namespace. It's the one component that is responsible for reaching out to a Prometheus,

34:16 fetching the metrics and providing them in a way that Keptn quality gates can work with these metrics. The Prometheus SLI service is living in the Keptn namespace. The same is true for the service responsible for reaching out to Prometheus and configuring this Prometheus instance. So those have not been changed, and those have been running in the Keptn namespace for eighteen hours. But Prometheus itself was not yet started. It's only triggered when you do a keptn configure monitoring, and then you can provide the name of the different monitoring solutions. keptn configure monitoring prometheus will then kick off and send a CloudEvent

34:59 to the Prometheus service which will forward it to Prometheus. So only with this you will start actually the Prometheus. Okay. So let me let me summarize what you just said there because I I feel you're being too polite and what you're actually saying is David you had one job and you fucked it up like that's that's essentially what you're saying. So we don't have Prometheus. We would just you just missed one command. But it's it's it's not it's not a problem. So what what we want to do is we just send exact we just trigger

35:30 Triggering Deployment (Second Attempt)

35:31 the pipeline once again. Awesome. And hopefully, this time, we will fetch the correct metrics. Yes. So now that we have... But it's great to see how it failed. Yeah, yeah, you know, it's nice seeing the bridge showing you how this failed and then even getting that error to say, oh, we had no Prometheus. Like, we didn't actually really have to debug anything there, we just had to go, oh, we don't have what we expect. I missed the command, we run the command, and now we should see this work through properly. Hopefully this time.

36:03 Nice. Yeah. We give it a try. Yeah. It'll be fine. I've got confidence. We have another hello from Mark. Hey Mark. Hope you're enjoying your week off. Hi Mark. So I guess now this is gonna take another two minutes roughly to run through the entire pipeline. Yes. Okay. Most of the time is spent on the actual deployment and then the execution of the tests. In dev, it's not a lot of tests. We just ping to check whether the endpoint is available. In staging, we execute

36:15 Monitoring Deployment and Tests

36:48 a little bit more but it should come up just in a second. Alright. Thanks. And then what we're doing next is to deploy a rather broken version. So 0.11.2, which should fail our quality checks one more time. Is that right? Yes. Exactly. The point there is that this version should not make it to production. So this version, we will see it in dev. We should also see it for a couple of minutes in staging. In staging, we are using a blue-green deployment. So we're first moving it to staging. Maybe just wait for this, we will wait

37:28 for this to finish. Also the same is true here. We are moving the one version. Right now it's the same version, but we are moving one version as a blue or green version into staging. We are executing the tests against this version and then we are triggering the Keptn quality gate. Based on the outcome, if the quality gate is fine, we keep this version. Otherwise, we roll it back to the previous version. If we keep this version, Keptn will also tell the next stage to start deploying or actually trigger the approval process in the

38:04 next stage. Yeah. So production, it's not yet there because Yeah. We have to wait for the tests. Yeah. And then we should see that roll into our production namespace. So this is now running. Okay. So this has done the auto approval to staging. We have to wait for staging to finish and then the JMeter test will be kicked off. Exactly. And for this automated approval, yeah, deployment finished so now the tests will be executed. For this approval that's also quite nice. Here we have it automated just for the sake of the demo that we don't have to

38:43 click here and wait. But our friends from Citrix, they have built an integration to Keptn with a Slackbot. So whenever there's an open approval, they get this message directly into Slack and can click either accept or reject and they will also get all the data from the quality evaluation from the previous stage. So they will see what is the quality of the service and then based on this, they can make the decision and they are they don't have to leave Slack. They just work in Slack. You could have the same in Microsoft Teams, I guess, but

39:18 Whatever you're using. All of the above to be honest. So yeah, I can see there's a lot of value in that. You know, getting that kind of information, and, you know, we're all using Slack these days, right? We spend a lot of our time there communicating with our colleagues. So it's nice that it has that integration and we can see the performance results and approve it directly from Slack. I can see a lot of value there. It passed. Oh yeah. And now I have to Oh no, the

39:50 approval is finished. So we should There we go. We now have production pods. Finally. Yep. And then if I come over here, not quite yet. I guess there'll be probes on that. So the service won't be passing the endpoints yet. So we just have to wait a little bit longer. Describe service carts. Yeah. No endpoints yet. So we'll just wait. Patience is not my virtue half the time. I will be honest. Like, I don't want quality gates, just deploy everything to prod as soon as it's ready. My customers are my quality gates. No. That's a terrible attitude. I shouldn't promote

40:45 that advice. That's terrible advice. Yeah. Sometimes you just want to deploy something. What we see with feature flags is the same: you cannot test all the combinations of your feature flags in pre-prod or hardening, staging, whatever you call it. So basically your customers are your testers, and this is one of the ideas that we also built into Keptn by combining it with feature flags. We now have an integration, for example, to the Unleash feature flagging framework, where we can just reach out to it and toggle a

41:27 feature flag if there's something not not working. In our case, everything's working in production. That's cool. So then we have to break it. Right? Is that that's what we're doing? Just Yep. We can just exactly. So now we are deploying a rather slow version of our shopping cart. Again it's so once you're using Keptn it's, there is some initial setup with creating the project, with onboarding the services, with adding your test information, with adding your quality gates and all the SLI information, how to retrieve the data from Prometheus or Dynatrace or Datadog, New Relic, whatever you want

42:17 to use. And then most of the time you're using keptn add-resource to rewrite your quality gates or to add more test instructions, or you just use keptn send event new-artifact, which basically tells Keptn to kick off a new delivery workflow. We also have integrations in, I think it's Azure DevOps, where one guy from the Keptn community with the name of Bert Van Dyke, he's from The Netherlands or from Belgium, I'm not yet sure, built an integration into Azure DevOps where the whole quality gate part is Keptn

43:03 and the whole other part with the delivery is Azure DevOps. So you can think of this that you can just use parts of Keptn in your existing environment if you already have it. Yeah. I think what's really cool about this is that, you know, more and more organizations and teams are trying to codify their SRE initiatives, you know, codify their SLIs and their SLOs, and that's what Keptn is doing here. We have this YAML syntax that's hopefully intuitive and easy for people to pick up and then commit that to Git and then have

43:20 Codifying SRE Practices with Keptn

43:37 all of these checks built around that automated for them. Such a powerful set of tools. Let's see what's happened here. We've got our evaluation is now gone to staging. So we're just maybe a minute out then. And what we should see is we've already seen it twice because of my little mistake, but a promotion to production being blocked by the quality gate. So I guess we'll have this time what we'll see instead of an error message that just says, you know, I can't speak to Prometheus is that we'll actually see the response time fail of the

44:21 required values. What we should already see is how it's deployed into dev, you know, into our dev environment. So if you can open up the dev environment again, we should already see the yeah. That's kind of the broken version indicated by the red background, kind of, yeah, obvious. In staging, it should also be deployed already. I think it says in the bridge that it was deployed, but not sure. Not yet. Okay. No. So Yeah. So it's approved in staging, but it's not actually deployed just yet. It's about to be deployed, but traffic has not been shifted yet to

44:50 Observing Staging Deployment and Quality Gate Result

45:06 the new version. We will get the information here in the bridge once traffic is shifted and deployment has finished. Then we will be able to see this version also in the browser and it will be there. I can already spoil a little bit here. It will be there for a couple of minutes while the tests are running. And after the tests, we will have another evaluation of the Keptn quality gate. And so here we are. And the Keptn quality gate, if it's failing, it will roll it back to version number one, to the green version.
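
The keep-or-roll-back decision described in this part of the stream can be sketched in a few lines of Python. This is only an illustration of the branching, not Keptn's actual implementation; the function name `decide_after_gate`, the dictionary keys, and the label `version-1` are made up for the example.

```python
def decide_after_gate(gate_passed: bool, candidate: str, previous: str) -> dict:
    """Illustrative only: pick the version that keeps serving staging traffic
    and say whether the next stage should be triggered."""
    if gate_passed:
        # Gate is fine: keep the candidate and hand over to the next stage.
        return {"staging_version": candidate, "promote_to_next_stage": True}
    # Gate failed: shift traffic back to the previous version and stop here.
    return {"staging_version": previous, "promote_to_next_stage": False}


# The slow build fails the gate, so staging goes back to the earlier version.
print(decide_after_gate(False, candidate="0.11.2", previous="version-1"))
```

The point of the sketch is that promotion is a consequence of the gate result, so a failing build never reaches the next stage on its own.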

45:39 If the Keptn quality gate is not failing, then we will keep this version here and we will move it also to production. So hopefully, we don't move this version to production. It's kind of the slow version. Unfortunately, this will take a couple of minutes because it's the slow version, and in the tests we send about 5,000 requests against the service. So the requests will take more time and it will take, in my experience, about five to ten minutes. I guess we can just move on. Yeah, we can talk about our next steps. We don't have to execute it already

46:15 but this is what we will see. It will not be promoted to the next stage because of the blocking quality gate. And again in the demo, the quality gate is very, very simple. We just care about the response time in this demo. But again you can build your quality gates having all different kinds of metrics and data in it. For example, most often you want to care about the error rate. So you want to make sure that errors are below, I don't know, 2% or not increasing by more than 5% to the previous runs. You want to make sure that throughput is

46:56 high. You want to make sure that response time is low. So all of these things you can build in your Keptn quality gates and if you don't know the exact values yet then we see a lot of Keptn customers or Keptn users actually. They are using relative values. So they are using just, let's say, response time is not allowed to increase by 10%. Throughput is not allowed to decrease by 10% to the previous runs or 5%. So you can build your relative thresholds and once you have you know what your absolute threshold should be, then you can fill it in or improve

47:38 your quality gates or however you want to do this. Okay. Can I ask a question while we're waiting? So assume we know that that's broken and, you know, maybe we're just adopting Keptn and, you know, the automation is really cool, but at the same time, we're still checking the metrics ourselves manually during the deploys. Like, what happens if I trigger this new artifact now while we're still in the middle of another currently processing deployment? Does it get queued up afterwards? Does it start the rollout to dev just now? What would go on there?

48:09 So in the current version of Keptn, this is actually a problem because you would just kick off another workflow. It will move it to dev and it will also move it to staging. So right now there is no blocking mechanism. This is also what we're going to fix in the next version. The reason why it was built like this is that actually most of the time the artifacts are coming out from some kind of CI system. And after the CI system you decide, is it a version that I really want to deploy into dev or staging or whatever? Is

48:49 it the version that I already want to deploy or was it just like a test build? And once you decide you want to deploy this version, there should be kind of a blocking mechanism so that you cannot deploy anything into staging while there are currently deployments or tests going on. But in the current version it would not be blocked by Keptn. Okay. So do you see, like going back to the Currently we are using Keptn in a multi-namespace approach. So we have a namespace for dev, a namespace for staging and a

49:24 namespace for production. But you also said that, you know, in the next version we can expect to be able to have a dev cluster, a staging cluster, a production cluster. Do you see Keptn as something that runs on like a management control plane cluster outside of those other clusters, or with Keptn running in my dev cluster and then promoting to other clusters? Like, what do you think the ideal setup would be there? That's a good question. I think it really depends on the use case, where we also see that Keptn is just running on k3s,

49:58 on a k3s installation where you just run the Keptn control plane on a small Kubernetes distribution. You don't want to run a full Kubernetes cluster because you're only using parts of Keptn. So you run it on a small Kubernetes distribution. If you're going full Keptn, then you might want to run it on a stable environment, maybe not in the dev environment, because if it's one that you tear down and recreate a couple of times, you don't want to have to install Keptn all the time again. So the Keptn control plane might be living on a production-related cluster. And then you still

50:32 have Keptn services running on different clusters because they have to be able to execute some actions on these clusters. Can Keptn as a control plane run outside of Kubernetes? Could I just run it as a daemon process on Linux, or does it require those Kubernetes APIs to be available? It's not so much about the yeah, it's a little bit about the Kubernetes APIs. It's just built in a way that we need some kind of service discovery, we need some kind of orchestration. So you can run it on a smaller Kubernetes distribution

50:40 Keptn Deployment Architectures (Multi-cluster, K3S)

51:07 but it's built in a way that there has to be some Kubernetes runtime below Keptn. Yeah. I think that's a really good idea actually, like you mentioned, maybe using like a single-node k3s cluster, running Keptn there as a control plane and speaking to other, you know, federated clusters in some fashion. That would probably be a really good pattern I think for this kind of thing. This is what we've seen with Keptn users also having the quality gates with more monolithic applications. So they just want to do a quality gate check every now and then

51:46 or regularly. And they are not running yet on Kubernetes. So they just install Keptn on a VM with k3s and then trigger Keptn via the API or via the CLI. No need to go full Kubernetes if you are not yet ready, and you don't have to re-architect your application if you just want to use parts of Keptn like the quality gates. You have to have some kind of monitoring because you need some data for the quality gates to evaluate, but there are no restrictions on where the data can

52:27 come from. Right now we have integrations with a couple of tools. It's open source so everyone is more than welcome to provide more integrations with more tools. Excellent. So we're sitting around eight minutes. I'm expecting us to fail any second now. Our staging is red. And this just takes longer because of that arbitrary injected latency which affects the 5,000 test runs. Exactly. So we're gonna be moving So the next part, yeah, we can actually deploy the regular carts version. We can actually skip this part. Why we can skip it is if the Keptn quality gate is working correctly

53:10 Quality Gate Fails as Expected (Response Time Issue)

53:21 and I'm pretty positive that it will, then we will not move the broken version into production. And for our final part of the stream, we will do the self-healing part, actually I want to call it auto remediation. It's not self-healing in a way that will fix your code, but it will remediate the actual issue. So in the next part, we can use the version that's already running in production, in our production namespace, because version number three and version number one, it's actually the same version just with a different

53:58 background. There is no latency introduced. There is nothing else introduced. What both versions can do is, first, they have a feature flag implemented with Unleash, so we can change the configuration from outside, and we can react on this so that Keptn can reach out again to Unleash, to the feature toggle framework, and turn off a feature flag that we turned on. In this tutorial, we do it a little bit differently. We will be just adding a couple of other items to our shopping cart that will incur the

54:38 decrease of no, an increase of the response time and we'll just initiate a scale up. So you can also do this with autoscaling, but we are doing this with Keptn so that we also have all the configuration of how many replicas are actually running. We also have this versioned and stored in our Git repository. This is one of the reasons. Hopefully, by now, the tests have already been finished. I'm actually not sure. Yep. Looks good. So let's take a look at the quality gate evaluation. Yep. Okay. It says the

55:19 response time is way too high. It's more than one second. And so tests were running for eight minutes and thirty-eight seconds. Response time is failing. We don't have the error rate here, but I did a couple of test runs before and the error rate is always stable, close to zero. So it's not about errors. It's really about the response time. We can also see it's already rolled back to the previous version. Yep. And if we go to production, we can take a look also in production. It should still be

55:50 Setting up Auto Remediation for Production

55:52 also version number one running, but it's the version number one from previously, from a couple of minutes ago. Yep. Cool. It's healthy. So the next step then is we're gonna add our new resource. And this time, we have a self-healing SLI resource. Yes. So as I said, we can have different ways to retrieve our service level indicators, and this one is prepared to retrieve the right response times for the self-healing use case. So we're just overriding this actually, or we are

56:38 adding this to our production environment. Because in our previous runs in production we never did the quality check. We only did the quality check in staging, and then if the quality was good in staging we moved it to production. In production we are not executing any tests. As David said, we have our end users doing the tests in production, so we are not executing more load or JMeter tests in production. What we also want to do is this time we are adding an SLO file, so a service level objective in production that says if we execute a

57:13 quality gate in production it has to satisfy these criteria. Yeah, we should see it here. It's actually very similar. We also see the SLO file here, just added sixteen seconds ago. Very similar to the previous one. And what we also want to do is another keptn configure monitoring, because now that we have SLOs and SLIs in production, we can also configure Prometheus and the Prometheus Alertmanager for production. So we actually create alerts for our production environment. Here it says Prometheus successfully configured. Rule created.
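
The absolute and relative thresholds discussed earlier in the stream (response time below a fixed value, or not more than 10% worse than the previous run) can be illustrated with a small evaluator. This is a sketch in Python, loosely modelled on criterion strings such as `<1000` and `<=+10%`; it is not Keptn's evaluation code, the function names are hypothetical, the numbers are invented, and requiring every criterion to hold is a simplifying assumption.

```python
import operator
import re
from typing import List, Optional

# Comparison operators accepted in a criterion string.
_OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
        ">": operator.gt, "=": operator.eq}

# Operator, optional sign (marks a relative criterion), number, optional percent sign.
_PATTERN = re.compile(r"^(<=|>=|<|>|=)([+-]?)(\d+(?:\.\d+)?)(%?)$")


def criterion_holds(criterion: str, value: float,
                    previous: Optional[float] = None) -> bool:
    """Check one criterion against a measured value (illustrative semantics)."""
    match = _PATTERN.match(criterion.replace(" ", ""))
    if match is None:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    op, sign, number, percent = match.groups()
    amount = float(number)
    if sign:
        # Relative criterion: the threshold is derived from the previous run.
        if previous is None:
            raise ValueError("relative criterion needs a previous result")
        delta = previous * amount / 100 if percent else amount
        threshold = previous + delta if sign == "+" else previous - delta
    elif percent:
        raise ValueError("a percentage only makes sense with a + or - sign")
    else:
        # Absolute criterion: the number itself is the threshold.
        threshold = amount
    return _OPS[op](value, threshold)


def objective_passes(criteria: List[str], value: float,
                     previous: Optional[float] = None) -> bool:
    """Assumption for this sketch: every listed criterion must hold."""
    return all(criterion_holds(c, value, previous) for c in criteria)


# Invented response times in milliseconds: a slow run, then a healthy one.
print(objective_passes(["<1000", "<=+10%"], value=1350.0, previous=420.0))  # False
print(objective_passes(["<1000", "<=+10%"], value=440.0, previous=420.0))   # True
```

Starting with only the relative criterion and adding the absolute one later matches the advice given in the stream for teams that do not yet know their exact target values.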

58:00 Adding Remediation Instructions File

58:06 And what is the next step? Now it's getting blurry again. Oh, the remediation instructions. Yes. So we're adding let's should we take a look at that file, I guess? Yeah. We can take a look. Remediation. It's basically a very simple instruction here. It's the remediation file looks very much like a custom resource definition. It's not really a custom resource definition. It's just a YAML. It's a clear description for Keptn how to do remediations. And it has a list of problem types and whenever this problem type is coming in via CloudEvent, Keptn will execute all these actions that are

58:55 counter actions for this problem type: execute one, evaluate the quality gate again, execute another. In this case we have the same actions for different types of problems. The one is just called Response time degradation. That would be actually the problem type that's coming in from a commercial monitoring solution called Dynatrace. For this tutorial we could use the same file. The problem type would be called response time degradation. In Prometheus, the problem type is called response time p90. That's just how we set it up, because the integration between Prometheus and Keptn, it's using the SLO

59:43 or yeah. It's using the name of the SLOs for the problem type and it's kind of inverting the SLO, and whenever the SLO is then breached, it will kick off the alerting in the Alertmanager. And the action here is to scale, scale up by a value of one. So that's what we should see. That means we should see a replica set of two when everything's fine. Now we want to generate some load to cause that remediation step to be executed. Exactly. So this version, all three versions actually, they have some faulty

1:00:10 Generating Load to Trigger Remediation

1:00:28 item in it and we will apply a load generator that will add all those faulty items into a shopping cart, which will slow it down quite drastically. And then we should be able to see that the response time is degrading and Keptn will kick off the Okay. So right now we have one pod running for our carts-db. So if we apply this load generator, we should see that increase. Yes. So this will take a while. We can take a look in Prometheus how

1:00:50 Observing Response Time Increase in Prometheus

1:01:20 this affects our response time. Does that have a special name, or is it just There is one query that we provide that you can just take, and it gives you the exact graph. Yeah. This one. Oh, good. I thought I was gonna have to remember PromQL there. Okay. Yeah. It's always giving me a hard time. Alright. So hopefully, we see something. Yeah. Oh, look at that. There we go. We have a spike. Okay. Yeah. It's already increasing. Let's just jump down to five minutes. Yep. One minute. Yep. There it goes. So yeah. Or is this gonna look similar to

1:02:04 that? Yeah. We can watch the self healing in action. And we can take a look. We should see already that there is an alerting rule set up in the alert manager. So this is what we can take a look at also in Prometheus. Yes. And what does it say here? Yeah, unfortunately, in our demo it says it has to break this rule for ten minutes. So it's only after ten minutes. It's not great for these kind of demos. So we actually have to decrease this time. Otherwise, it's not really useful if we have to wait ten minutes all the time for

1:02:10 Keptn Remediation Process Explained

1:02:55 to kick in. But what will happen after ten minutes is that there will be an alert sent from Prometheus to the Prometheus integration in Keptn, which will translate this alert into a cloud event. And this cloud event is then forwarded to Keptn. Keptn will forward it to the remediation service. This is also a little bit the part where we need Kubernetes because Keptn itself consists of a couple of different services. So it will forward it to the remediation service. And we have different as you see in the remediation YAML, there are different actions that you can define.
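
The remediation flow described here (a problem type arrives, its actions run one at a time, and the quality gate is evaluated again after each) could be sketched as below. This is a Python illustration, not the remediation service itself: the `REMEDIATIONS` dictionary is a hypothetical in-memory stand-in for the remediation file, the spelling `response_time_p90` is an assumption based on the spoken name, and `gate_passes` is a made-up callback standing in for a real quality gate evaluation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the remediation file: problem type -> ordered actions.
# Both problem types map to the same action, as described in the stream.
REMEDIATIONS: Dict[str, List[dict]] = {
    "response_time_p90": [{"action": "scaling", "value": 1}],
    "Response time degradation": [{"action": "scaling", "value": 1}],
}


def remediate(problem_type: str, replicas: int,
              gate_passes: Callable[[int], bool]) -> int:
    """Run the actions for a problem type one by one and return the replica count.

    After each action the quality gate is checked again; the loop stops as
    soon as the gate passes, otherwise the next action is executed.
    """
    for step in REMEDIATIONS.get(problem_type, []):
        if step["action"] == "scaling":
            # Scale up by the configured value.
            replicas += step["value"]
        if gate_passes(replicas):
            break
    return replicas


# Toy gate for the demo: assume two replicas are enough to meet the SLO.
print(remediate("response_time_p90", replicas=1, gate_passes=lambda n: n >= 2))  # 2
```

Keeping the actions as an ordered list is what makes it possible to start with a small step and fall through to another action only if the gate still fails.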

1:03:32 And one action is scaling. So there we have a specific component that can remediate scaling issues or that can execute the scaling action. Yeah, exactly here. So the action is scaling. And as we're also doing deployment with Helm, we will also do the scaling with Helm. We will rewrite the value of the replica set in the Helm chart and then use Helm again to deploy this. And we will also see it in the git repository of Keptn that this value will be changed. So we will scale this version or scale the replica set. We also have other integrations, for example, for

1:03:45 Observing Remediation Action: Scaling Up Pods

1:04:14 the feature flags that I mentioned earlier. So for the feature flags, we have the integration of Unleash which will then use or which will retrieve the information from Keptn, will translate this into an API call for Unleash and would turn off or turn on a feature flag. Yes. So we have a couple of integrations here. One that we have the does it say here the unleash? I I don't see unleash. No. I don't think it's listed on this page. Oh, we missed it. We have to add it here. But we have, for example, the I just

1:04:53 saw it. Yeah. Keptn unleash. Self healing. This is for the previous release, but we we can see that it's here working. We can also see how how it works here. There's actually also we did a webinar kind of with the with our friends from Unleash even. So glad that they they have not bugged us yet that we forgot to to to list their integration. We we can see the remediation objects here. So, you know, action is feature toggle, and then the the value of the toggles are turned on or off. So nice and self explanatory, you know, pretty intuitive.

1:05:34 Yeah. Yeah. So there there are different tools that you can integrate. Unleash is the one that's, yeah, pretty pretty good and it's working great for the feature flags. It's really a great framework. It's also open source by the way. And the because you you just had the integrations page up here, one of the last integrations that we did was together with Litmus, which is also one part that I really like. Litmus Chaos is a chaos engineering engine built on Kubernetes and the cloud native technologies. And the really cool thing here is that usually when we talk about executing tests we talk

1:06:00 Litmus Chaos Engineering Integration Example

1:06:25 about functional tests or performance tests, load tests, but what we built with Litmus or how we combined Litmus is then whenever we enter a testing stage in Keptn we don't want to execute only performance and we don't want to just trigger some and hit our services with some load. But we also want to introduce chaos to our services because you might know that in production it's not always everything is not always sunny weather in production, let's say like this. So there might be some problems in production and we can already simulate these problems in shifting left and doing this in previous phases.

1:07:04 So by introducing either a chaos stage in Keptn or just doing it in a preproduction stage, adding the Litmus service, you can execute chaos, for example removing a pod or introducing network latency, while you also have some performance tests running. And once both are finished you can use Keptn quality gates again, or Keptn will trigger the quality gates automatically. And then you evaluate the quality of a service while it was under heavy load and while it was under chaos testing. And with this, you can evaluate what is actually the resilience of

1:07:46 my application. We just had, in the Keptn user group, Adrian, who is an SRE at the in building exactly this use case and evaluating the resiliency by either having no chaos, light chaos, or heavy chaos. And their requirement for the application is that no chaos and light chaos have to basically behave the same, which I think is pretty interesting, that they say our application has to be resilient in a way that introducing light chaos to the application should not change anything in our quality gates. We have to be able to survive some

1:08:30 light chaos by just removing one pod of a replica set or by introducing network latency of a couple of microseconds. That it has to survive, which is pretty interesting. Yeah. I think Litmus Chaos is one of those things that I keep meaning to play with and I haven't got around to yet. But, you know, now that you've described the concept of heavy chaos, I feel like I'm gonna have to deploy that to all of my environments now to see how things weather that storm, and I love that you can integrate that

1:09:00 with Keptn as well to kind of build in those remediations and stuff like that. So it's cool. So there's quite a few integrations here then, so I mean that's cool. The Slack one, the ServiceNow, we got notifications, of course is Dynatrace as a whole tower. Lots of stuff for people to experiment with. And what's involved in writing an integration? Say I think, oh, I actually wanna build like maybe Discord notifications or Twitter or whatever. Yeah. Some of that. What's involved in writing one of those integrations for Keptn? We usually recommend to get started with the

1:09:35 Keptn service template. We call our integrations a service also. It's a Keptn service. And we provide a service template in Go. We might also provide it in other programming languages. But since Keptn core is written in Go, we also provided the first template in Go. You can use this template, and basically what you have to do is, there are a couple of event handlers. And you define in the deployment file of your service which events you're interested in. I explained in the beginning, Keptn is event based and data

1:10:00 Building Keptn Integrations (Developer Template)

1:10:14 driven. So there are a couple of different events like deployment finished event, test start event, test finished event, configuration changed event, remediation start event. So there are a couple of events, and you pick whatever events you're interested in. In the example of the Litmus service, we are interested in test start events because then we want to start tests, and we will send a test finished event so that Keptn knows chaos tests have been finished. If you are a deployment tool, you would react on different events. A Discord service might want to listen to all the events

1:10:52 because it might just want to notify on Discord about everything that's going on, similar to the Keptn Bridge, for example. So you basically decide which events you want to subscribe to. And then you go ahead and implement your logic in the different event handlers. And depending on the integration, you might want to send an event back to Keptn or you don't want to. It just depends. For a notification service, you usually don't send an event back to Keptn that you sent the notification. It's not like, okay,

1:11:28 you did it, great. But for a test integration, it's important that you inform Keptn that tests have been triggered and tests have been finished. So we provide the whole template here. It already has all the event handlers, which are basically just stubs where you write your business logic. And the first place where we ask the Keptn community to provide their integrations is the Keptn sandbox. This is where we start developing new integrations. We can already see the Monaco service. We can

1:12:08 see the Locust service. I just initiated some discussions in our Keptn Slack. It doesn't have any resources right now here, but we found a couple of folks that are interested in working together on the Locust service and we will define what it has to do, how we want it to behave. I heard there are different ways to execute Locust tests. One is via the CLI, the other is via the API. So we would just kind of discuss how we want to build it and this is the first place to put it. And once they are

1:12:43 used and tested, then we can move it to keptn-contrib, which is the place where also the Keptn core community is more involved in maintaining the integrations. Yeah. I think Alexa notification service. I've just had an idea for like my next Friday hackathon project, which is to have my failing quality gates turn the room color red, I think, or maybe green on a pass. So I know when my deployments have broken. So like some sort of Philips Hue integration to change my lights. I think that would be good fun. That would be cool. That would be so

1:13:17 cool. Yeah. I would love to see them. Yeah. I I would've worked on that for sure definitely. Alright. Let's see where we are with our query here. I still have that in my buffer. I do. Let's graph that over ten minutes. Okay. I do it my way. Alright. We're almost there. I think we're just we're approaching the ten minute So what we should see is if we run I'm in the wrong direction. If I run get pods, this should start to scale up. Oh, it has. Oh, yeah. We've got two of them now a minute ago. So

1:13:59 Oh, it already did. Cool. Yeah. So, nine seconds ago. Yeah. So is there a visual way of seeing that in the Bridge? Will I see this anywhere? Okay. So it's also here in the sequence screen. We can see that there is a problem opened. We can see everything that was going on. It's actually a couple of events that are sent through Keptn: first a problem open event coming in, then remediation triggered, remediation status changed, an action has been triggered, an action has been started. So it's a lot of things again that you can subscribe to. You can add your services

1:14:37 to where we can see which services are responsible. For the actual action, I explained earlier that the action itself, the scaling action, is done by the Helm service. If you have your own scaling, or if you want to, I don't know, increase the brightness of the Hue lights in your room whenever there is a scaling, you could also listen to these scaling events in your service. And it says action started, action finished. So we know that it already did the scaling, and in Kubernetes we already saw that there are now two pods. Excellent. If it's not yet

1:15:16 remediating the response time, then it will evaluate the Keptn quality gate again, and if it needs another scale-up to meet the quality gate, it will do another scale-up. So you can start small: let's say you don't have to scale up by 100% but maybe just by 5%, and then if it's already meeting the SLOs again, it's fine. If not, then it will execute it another time, or you can execute another action. So, as I've shown in the slides, you can mix and match the

1:15:55 remediation actions. Cool. Awesome. So we've already seen this work. We've seen it in the Bridge, and I guess if we leave that running long enough, we'll see the response time drop. Yeah. And we should also see that the alert will be closed again. So first the alert is pending, then it will be firing, and then it will go away and go back to green. But it takes a while for us to see everything. So I think the most important part here was that we saw how the remediation is actually kicked off

1:16:38 and which files are included. Everything is again based on the definition files, the declarative definition files of Keptn, based on the shipyard file. The SLO file is a very strong concept and one of the most important files in Keptn, as it defines the quality gates. The SLI file defines how to query all the data that's needed for the quality gates. The remediation file defines which actions are kicked off whenever there are alerts or a problem coming into Keptn. Excellent. I think we've covered an awful lot, and I think we've seen a lot of really valuable

1:17:13 use cases for what Keptn can offer for, you know, Kubernetes deployments. I think the quality gates are just so valuable for anyone who wants to ship anything through multiple stages, and then being able to build your own event-driven remediation loops, I think, is just wildly cool as well. I'm assuming any integration that I want to build has access to all of those events emitted by any other plugin within the Keptn system. Is that right? So I just say, hey, I want to listen for this and I want to start all I

1:17:43 mean it was other stuff on top of it. I think just the power of that event-driven kind of loop opens up so many different possibilities for the way that we deploy our applications to Kubernetes. So very, very cool. I'm very excited about that. Is there anything else? Yeah. It's really exciting. Sorry. On you go. We are, yep. I can just, oh, sorry. Okay. So we are also really excited to share soon our next iteration of the shipyard file. The file that we saw, or maybe if you could scroll up here on this screen

1:18:18 a little bit, in the first part I think we saw the shipyard file. And yes, so this was the file that we are also using in the demo or in the tutorial. If you want to walk through the tutorial, it's tutorials.keptn.sh. But we got some requests from the Keptn community, and in the next iteration we're doing a new and improved version of the shipyard file. We already have the Keptn 0.8 alpha out there, and I'm pretty sure that we can deliver the stable and final release of 0.8

1:18:58 pretty soon, and it will be even more mature and there will be more use cases possible, also to have parallel stages, to have multi-cluster support. So there's a lot of things to come. Also, by the end of this week, hopefully, we will be releasing a new version of the Keptn website. So we are in exciting times with this project. And if you can think of any integrations that you want to see or you want to start, please also reach out to us on slack.keptn.sh, because we are happy to provide you the Git repository

1:19:00 Future Keptn Developments (0.8, Shipyard, Multi-cluster)

1:19:34 in the Keptn sandbox, so it's kind of the place where your integration is also visible and you can provide it to the Keptn community. If you want to start in your own repo, we can always merge it later. But we're always happy to see new integrations in this organization too, so that they can be shared by everyone. Awesome. So yeah, everyone, join the Keptn Slack. Start writing your own integrations. I'm gonna start my Hue one very, very soon. Is the Slack the best way to get engaged with the community? Do you

1:19:50 Engaging with the Keptn Community (Slack, Calls)

1:20:07 have weekly calls, office hours, anything like that where people can also get help if they do want to start making their own integrations? Yes. So we have the Slack, where the Keptn core team and the Keptn community are online every day. Then we have the Keptn developer calls every Thursday at 5 PM Central European Time. So the next one is tomorrow. Then we also have the Keptn User Groups. We run these monthly, and these are more about sharing use cases, as I briefly described. Last time we had Adrian from Keptn sharing his story around Litmus testing

1:20:51 and chaos testing. We also had Sumit from Intuit sharing his story on integrating different performance tests and then Keptn evaluations. So we have different formats, but for fast response times around development, I think the best is to join the Keptn Slack and also to join us on the Keptn developer calls every Thursday. Awesome. Alright. Well, everyone should install Keptn, improve their continuous delivery, take advantage of quality gates and remediation, and build their own integrations. Join the Slack and join the developer calls. Juergen, thank you. That was really great to see that from start to finish. I'm so

1:21:15 Conclusion and Farewell

1:21:30 glad we came back for part two. Thank you. Even though I broke it, I'm glad we got through, and we've got everything working. I'm really happy with that. So have a great day, thank you again for joining me, and I'll hopefully speak to you again soon. Thank you, David, have fun. Thanks. Bye.
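
Editor's sketch: the remediation loop described around 1:15:16 (scale up by a small step, re-evaluate the quality gate, and repeat only if the SLO is still missed) could be modeled roughly as below. This is a toy simulation in plain Python, not Keptn's implementation or API. The `Service` class, the even-load response-time model, and the names `meets_slo`, `remediate`, `scale_step`, and `max_actions` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy stand-in for a deployed service; not a Keptn or Kubernetes object."""
    replicas: int
    total_load_ms: float  # hypothetical: response time if one pod took all traffic

    def response_time_ms(self) -> float:
        # Assumed toy model: load is spread evenly across replicas.
        return self.total_load_ms / self.replicas


def meets_slo(service: Service, max_response_ms: float) -> bool:
    """Stand-in for a quality gate evaluation: pass if within the target."""
    return service.response_time_ms() <= max_response_ms


def remediate(service: Service, max_response_ms: float,
              scale_step: int = 1, max_actions: int = 5) -> list:
    """Scale up in small steps, re-evaluating before each further action.

    Returns the replica counts reached after each scaling action.
    """
    history = []
    for _ in range(max_actions):
        if meets_slo(service, max_response_ms):
            break  # SLO met again, no further action needed
        service.replicas += scale_step
        history.append(service.replicas)
    return history


svc = Service(replicas=1, total_load_ms=1500.0)
history = remediate(svc, max_response_ms=600.0)
print(history, svc.replicas)  # prints [2, 3] 3
```

Starting small and re-evaluating, as in the transcript, means the loop stops as soon as the target is met instead of over-provisioning in a single step; `max_actions` is an added safeguard so the sketch cannot scale indefinitely.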
