About this video
What You'll Learn
- Install Keptn and Istio on Kubernetes, configure the CLI, and bootstrap a GitOps-controlled Keptn project.
- Define a Keptn shipyard and onboard services packaged as Helm charts, then add JMeter tests.
- Set up Prometheus quality gates, trigger a deployment, and observe automated staging rollback on SLO failure.
Jurgen Etzlstorfer introduces Keptn, a CNCF sandbox control plane for delivery and operations. We install Istio and Keptn on Kubernetes, examine a shipyard, onboard a Helm-packaged service, and let Prometheus-backed SLO quality gates decide whether the staging rollout is promoted or rolled back.
Jump to a chapter
- 0:00 Holding screen
- 0:15 Introductions
- 0:55 Introduction
- 1:21 Introducing Keptn & Guest
- 1:58 Guest & Project Introduction
- 2:22 What is Keptn?
- 3:36 Keptn Project Overview (Slides)
- 4:17 What Keptn Does: Control Plane, Orchestration & Integrations
- 6:15 Why Keptn? Solving Manual & Pipeline Issues
- 7:57 Use Case: Application Onboarding & Monitoring Config
- 9:00 Use Case: Progressive Delivery & Quality Gates (SLIs/SLOs)
- 12:02 Use Case: Automated Operations & Remediation
- 15:48 Event-Based Architecture & Tool Integrations
- 20:11 Starting the Live Demo
- 21:03 Demo Step: Install Istio (Dependency for CD)
- 21:50 Installing Istio
- 25:08 Demo Step: Get Keptn CLI & Install Keptn Control Plane
- 25:45 Installing Keptn
- 30:10 Demo Step: Configure Keptn CLI & Networking
- 31:31 Demo Step: Setting up the Keptn Project (GitOps)
- 32:00 Setting up our Keptn repository
- 35:05 Examining the Shipyard File (Pipeline Definition)
- 39:51 Q&A: Keptn Metrics
- 41:00 Project Definition Complete, No Deployment Yet
- 41:40 Demo Step: Onboarding Services (Cart & Cache DB)
- 43:20 Onboarding our first service
- 46:36 Demo Step: Adding JMeter Test Files
- 48:18 Q&A: Edge Deployments
- 50:28 Q&A: Ingress & Load Balancer Setup
- 51:00 Deploying our first service
- 55:05 Demo Step: Triggering Initial Deployment (Send New Artifact Event)
- 58:08 Debugging: Deployment Failure (Insufficient CPU)
- 59:19 Modifying Configuration & Retrying Deployment (GitOps Workflow)
- 1:02:03 Q&A: CLI Authentication
- 1:02:52 Observing Progressive Rollout (Dev -> Staging)
- 1:04:38 Resizing Cluster
- 1:10:39 Observing Dev Deployment & Promotion to Staging
- 1:10:53 Viewing Deployed App UI
- 1:12:00 Adding Prometheus
- 1:12:10 Demo Step: Configure Monitoring (Prometheus)
- 1:14:44 Demo Step: Adding Quality Gate (SLO File)
- 1:15:24 Checking Deployment Status in Bridge
- 1:16:06 Observing Staging Rollout & Test Execution (Slow Version)
- 1:16:40 Progressive delivery
- 1:17:12 Quality Gate Evaluation Process Explained
- 1:18:38 Re-examining Shipyard File Details
- 1:20:25 Waiting for Staging Test Results
- 1:22:25 Observing Staging Quality Gate Failure (FAIL)
- 1:22:41 Confirming Rollback to Previous Version
- 1:22:48 Demo Recap: Progressive Delivery & Rollback Success
- 1:33:10 Tutorial Structure & Future Plans
- 1:48:01 Guest's Final Thoughts
- 1:49:24 Closing Remarks & Thanks
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:55 Introduction
0:55 Hello, and welcome to today's episode of Rawkode Live. I'm your host, Rawkode. Before we get started, I just wanna say thank you to my employer, Equinix Metal. They provide the time and the resources for me to invest in the show and make sure we all have great cloud native content to learn together. If you wanna check out Equinix Metal, there is a code, rawkode-live, you can use at metal.equinix.com. Check it out. Have some fun. Today, we're gonna take a look at the Keptn project. I have a great guest, Jurgen Etzlstorfer, who is an engineer at Dynatrace and a
1:21 Introducing Keptn & Guest
1:31 maintainer of the Keptn project. Hello, Jurgen. How are you? Hi, David. Hi, everyone. Thanks. I'm good. How are you? Yeah, I'm very well. Thank you. I'm looking forward to today's session and learning how I can use this really cool project in my Kubernetes delivery pipeline. I think the best way to get started is, why don't you take just thirty seconds or a minute, or take as long as you want, and tell us a little bit about you and about the project, and then we'll move on from there. Okay. Cool. Yeah. So my name is Jurgen. I work for
1:58 Guest & Project Introduction
2:02 a company called Dynatrace, and my role at Dynatrace, maybe I start from there, is officially technology strategist. I joined the company about three years ago, and I have been a member of the Dynatrace Innovation Lab since then. And around two years ago, we started this project. It was not called Keptn back then, but we found that we have a lot of best practices that we are using when it comes to cloud native technologies, when it comes to delivery, when it comes to quality of software
2:22 What is Keptn?
2:38 that we are already using, and we try to basically provide everything also to other organizations and to our customers. And this was the basic idea: let's start an open source project, let's make everything that we already know available for others, and let's make this a platform that everyone can use. And with this idea, the Keptn project was born. It's now a CNCF sandbox project, and we are pretty excited to have it. And I'm also pretty excited to show it or to take a
3:21 look at it today with you. Awesome. Nice. So I believe you've got a few slides you wanna run us through first to give us a little bit of a flavor, and then we're gonna get hands-on and kick the tires on this. So let's pop that up. There we go. Cool. Okay. So let me just briefly introduce the Keptn project, which use cases it targets and which problems it solves. If you want to take a look, it's at keptn.sh. In Austria, or in German-speaking countries, we say Keptn, like
3:36 Keptn Project Overview (Slides)
3:56 the captain of a ship. That was the main idea behind it: the captain ships applications. So we will hear a couple of these nautical terms, like ship and shipyard, in the rest of the show. And, yeah, we also have Twitter. So what is it actually, and what does it do? We came up with this figure here. We actually target a couple of different use cases, from progressive delivery, to quality gates as one part of progressive delivery that you can use
4:17 What Keptn Does: Control Plane, Orchestration & Integrations
4:37 maybe in other delivery frameworks that you might already have in your organization, where you maybe don't want to use Keptn, to auto-remediation when it comes to operating your microservice applications. These are the use cases. And everything is built on a GitOps approach where you just bring your configuration files, and Keptn will manage these configuration files and will also connect to your tools. So, as David already mentioned, it's a control plane, a cloud native control plane for continuous delivery and automated operations, and it's basically solving the problem of application life cycle orchestration and management.
5:18 So if you have Helm charts and you use Helm for deployment, you can do this with Keptn as well. Keptn can use your Helm charts and deploy your microservices for you. You can also connect it to Argo. You can connect it to Jenkins. We have other integrations for monitoring tools like Prometheus, where you can use this data for your quality checks. You can connect it to JMeter for load generation and performance testing. You can connect it to LitmusChaos for chaos engineering. So you can connect it to a lot of different tools. Everything is event based.
5:53 And with this, we are very open to broadening this ecosystem and to opening up for tool integrations. Keptn really brings together the different tools that you already have in your tool belt, let's say, and manages continuous or progressive delivery and automated operations for you. So why did we actually start? I said this briefly in the beginning: we saw that a lot of different organizations that we are working with, or that we have heard from, are doing a lot of manual tasks when it comes to delivery of applications and new releases.
6:15 Why Keptn? Solving Manual & Pipeline Issues
6:38 And these manual tasks are either in evaluating test performance, evaluating if the new build or the new release is actually better than the last one, or in writing a lot of those pipelines. And these pipelines have grown pretty large. This is one issue that we are also solving with Keptn. There is no pipeline code anymore. Everything is event based, you just connect your different tools, and you have very small configuration files that each serve a very specific purpose. And with this, it's very easy to change them, update them, maintain them.
7:15 And another part is that a lot of organizations spend a lot of time remediating issues. They might already have some playbooks in place, say Ansible playbooks, or it can be shell scripts that just remediate an issue, but, anyway, they have to be executed. And here, we also bridge this gap: from, let's say, a Prometheus alert, we can trigger an Ansible playbook or execute shell scripts or Python scripts that try to remediate issues. So with this event-based methodology, let's say, we can bridge these gaps.
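To make the event-based idea above a little more concrete, here is a minimal Python sketch of tools subscribing to event types instead of being wired together in pipeline code. It is not Keptn's implementation or API: the `EventBus` class, the `problem.open` event type, the `carts` service name, and both handlers are hypothetical names chosen for illustration, and a real integration would call Ansible or a chat tool where the handlers only return a string.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class EventBus:
    """Toy stand-in for an event-based control plane (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        # A tool integration registers interest in one event type.
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> List[str]:
        # Every subscriber of this type receives the event, in subscription
        # order; an event type nobody listens to reaches no handler.
        return [handler(data) for handler in self._handlers.get(event_type, [])]


def run_playbook(event: dict) -> str:
    # Hypothetical integration: a real one would start an Ansible playbook.
    return f"playbook started for {event['service']}"


def notify_chat(event: dict) -> str:
    # Hypothetical integration: a real one would post to a chat tool.
    return f"notified chat about {event['problem']}"


bus = EventBus()
bus.subscribe("problem.open", run_playbook)
bus.subscribe("problem.open", notify_chat)

results = bus.publish(
    "problem.open",
    {"service": "carts", "problem": "response_time_degradation"},
)
print(results)
```

Swapping one tool for another in this model means replacing a subscriber; the publisher and the event itself stay the same.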
7:57 Use Case: Application Onboarding & Monitoring Config
7:57 So the first use case, yeah, thanks, the first use case where I think Keptn already brings some value is when you start to manage your applications with Keptn. Because Keptn can already configure your monitoring: let's say for your Prometheus, create the scrape jobs, create alerting rules in the Alertmanager, create dashboards in Grafana. We have not even deployed the application yet, but when we onboard the application, that's how we call it, it's a Keptn onboarding, we have a lot of these terms. But when you onboard an application and you are using the Keptn CLI,
8:40 then you can already use it to configure your monitoring solutions, set up alerts, create dashboards, set up scrape jobs, these kinds of things. And as said, we have not even started to deploy the applications. Yeah, I think we'll see this later on, so I'll just skip this one. And the second part, where we are really bringing a big benefit, is that we have based our delivery pipelines on SRE concepts, SLIs and SLOs. SLO is short for service level objective, and you, or the user of Keptn, just
9:00 Use Case: Progressive Delivery & Quality Gates (SLIs/SLOs)
9:21 defines a service level objective that can be based on response time, on the error rate, CPU consumption, whatever. Everything that's queryable, or that's a metric, let's say, and you can query this, for example, from Prometheus. You define some service level objectives, and Keptn will set up a multistage delivery pipeline for you. And between each stage, Keptn makes sure that it evaluates the quality of your microservice based on the service level objectives that you have defined. You just define it in a YAML file, and the deployment and the triggering of the tests and
10:00 then the evaluation, this is taken care of by Keptn. So for example, here, we have the shipyard file, the SLO file, and the Helm chart for deployment, and we can see that the shipyard file, actually, that's the complete shipyard file for a multistage pipeline. We have dev, staging, and production with the quality gate. That's basically the whole quality gate: it's the SLO file, and you will get these automated quality gate evaluations for each deployment that you're triggering with Keptn. And the data is gathered, in this example, from Prometheus, and the tests have been executed by JMeter,
10:39 for example. And this is how it works in a little bit more detail. We have these SLO files and SLI files. You can think of the SLI file as a library of indicators and how you can fetch this data with a PromQL query. You define your PromQL, and this is then executed when Keptn evaluates the quality gate. And you can just reuse, let's say, the error rate here. We just reuse the error rate in the objective file, and we specify that the error rate has to stay lower than or equal to 1%.
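The split described above, an indicator library plus objectives that reference indicators by name, can be sketched in Python as follows. This is a simplified illustration, not Keptn's file format or scoring algorithm: the `Objective` class, the `evaluate` function, the placeholder query strings, the `response_time_p95` indicator, and the rule "score is the percentage of objectives met, pass at 90 or above" are all assumptions made for the example. Only the "error rate at or below 1%" objective comes from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Objective:
    sli: str          # name of an indicator defined in the SLI library
    max_value: float  # the measured value must stay at or below this


# SLI library: indicator name -> query text. The query strings are
# placeholders, not real PromQL.
SLI_QUERIES: Dict[str, str] = {
    "error_rate": "<PromQL returning the error rate in percent>",
    "response_time_p95": "<PromQL returning p95 response time in ms>",
}


def evaluate(objectives: List[Objective], measured: Dict[str, float],
             pass_threshold: float = 90.0) -> dict:
    """Score the measured values against the objectives.

    Assumed scoring: score is the percentage of objectives met, and the
    result is "pass" when the score reaches pass_threshold. Expects at
    least one objective.
    """
    met: List[str] = []
    for objective in objectives:
        if objective.sli not in SLI_QUERIES:
            raise KeyError(f"unknown indicator: {objective.sli}")
        if measured[objective.sli] <= objective.max_value:
            met.append(objective.sli)
    score = 100.0 * len(met) / len(objectives)
    result = "pass" if score >= pass_threshold else "fail"
    return {"score": score, "result": result, "met": met}


objectives = [Objective("error_rate", 1.0), Objective("response_time_p95", 600.0)]

# Values a monitoring backend might have returned for two builds (made up).
good = evaluate(objectives, {"error_rate": 0.4, "response_time_p95": 420.0})
slow = evaluate(objectives, {"error_rate": 0.4, "response_time_p95": 1300.0})
print(good["result"], slow["result"])
```

Because objectives only name an indicator, the same objective list can be evaluated against a different monitoring backend by changing the indicator library alone.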
11:15 And the time frame for the evaluation, that's actually calculated by Keptn, since Keptn can also trigger the tests. Keptn will trigger the tests, let's say JMeter tests or NeoLoad tests or BlazeMeter, whatever you're using. And once the tests are finished, Keptn will be informed that the tests are finished, and it can evaluate for this exact time frame of the tests. We can evaluate the error rate, response time, number of database calls. And from this file, it will generate the total score. Start the evaluation. It will generate the total score by evaluating
11:51 this. And based on the total score, it can then decide if it should be promoted to the next stage or if it should be rolled back, for example, to the previous version if you have a blue-green deployment. Okay. And for the third major use case of Keptn, this is where we automate part of your operations. We call it closed-loop remediation. And with closed-loop remediation, what we really mean is that we are constantly evaluating if the remediation was actually successful. So we've seen that organizations, and this is also where we started
12:02 Use Case: Automated Operations & Remediation
12:29 a couple of years ago. Once there is a problem or an alert coming in, this can also be sent to Keptn as a cloud event. Keptn can execute remediation actions and trigger other tools, let's say toggling a feature flag or triggering the rollback of a deployment or executing some Ansible Tower playbooks. So Keptn will trigger, let's say, an Ansible Tower playbook. Once it's finished, Keptn will evaluate if this action of Ansible Tower did actually resolve the issue by re-evaluating the quality gates, reusing the SLO file that you have already defined. And if it's remediated, the job is basically done and the alert
13:14 can be resolved, or the problem can be resolved. And if it's not remediated, Keptn can take the next action, and it can be a sequence of actions that are taken. And all these actions are actually evaluated for whether they remediate the issue or not. So for example, we have one more demo application where we send an alert via the Prometheus Alertmanager. And in the remediation file, we define two different remediations for one specific problem, and the problem in this case is a response time degradation. That's the name of the problem. This will map
13:55 to the alert of Prometheus. So these two basically have to match. And we'll then have a number of actions, in this case two actions. The first one that will be executed is scaling. Scaling is a built-in functionality of Keptn, so we can basically change the values of the Helm chart. We'll increase the replica values of the Helm chart, in this case by plus one, or you can say by plus 10% or whatever might be the right action. We'll then re-evaluate the SLO file.
14:33 We'll do this evaluation of the quality gate. If it's remediated, everything's fine. If not, it will go to the next action. In this case, it's the toggling of a feature flag. So in this case, we actually want to disable a campaign with a specific campaign ID. We can think of this like a promotional campaign that has been initiated. Maybe that's generating a lot of problems for our response time. Maybe it has to do too many, I don't know, database calls, for example. So this can also be triggered. We don't define the tooling in these files.
15:10 The tooling will just be listening for these events. So if you want to move from one feature flagging framework to another, you can still use the same remediation files, because what you will do is just change the part which listens for events from Keptn. And in this case, it might be resolved or not. We can then escalate it. We can send everything that's going on, all these cloud events, also to Slack, for example, or just send it to Slack when it escalates and we could not resolve it. Yeah. That's the main or the
15:46 major three use cases. And as said in the beginning, Keptn is built as an event-based control plane that lives in your Kubernetes cluster, and you can send cloud events to this control plane, and the control plane can also send cloud events. So it receives and sends cloud events, and this is how the tool integrations are actually done. We already have a couple of tool integrations, and we see this growing as more and more organizations are contributing. For example, we received a contribution for a Slack bot where you
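The closed-loop behaviour described above (run one action, re-check the objective, move on to the next action only if the problem persists) can be sketched in Python like this. It is a simulation under stated assumptions rather than Keptn's remediation engine: the 500 ms objective, the numeric effect of each action, and the names `remediate`, `scale_up`, `disable_campaign`, `scaling`, and `featuretoggle` are all invented for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Action = Tuple[str, Callable[[State], None]]

RESPONSE_TIME_SLO_MS = 500.0  # assumed objective for this simulation


def scale_up(state: State) -> None:
    # First action: add one replica. The latency effect is made up.
    state["replicas"] += 1
    state["response_time_ms"] -= 100.0


def disable_campaign(state: State) -> None:
    # Second action: switch off a feature flag. The latency effect is made up.
    state["campaign_enabled"] = 0.0
    state["response_time_ms"] -= 400.0


def slo_met(state: State) -> bool:
    # Stand-in for re-evaluating the quality gate after an action.
    return state["response_time_ms"] <= RESPONSE_TIME_SLO_MS


def remediate(state: State, actions: List[Action]) -> dict:
    """Run actions in order and stop as soon as the objective is met."""
    executed: List[str] = []
    for name, action in actions:
        action(state)
        executed.append(name)
        if slo_met(state):
            return {"resolved": True, "executed": executed}
    # Every action ran and the problem persists: the caller can escalate.
    return {"resolved": False, "executed": executed}


state = {"replicas": 1.0, "response_time_ms": 900.0, "campaign_enabled": 1.0}
outcome = remediate(state, [("scaling", scale_up), ("featuretoggle", disable_campaign)])
print(outcome)
```

In this run, scaling alone leaves the simulated response time above the objective, so the loop continues to the feature toggle and then reports the problem as resolved.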
15:48 Event-Based Architecture & Tool Integrations
16:22 can control continuous delivery workflows with a Slack bot. So you get the whole evaluation of the quality gate directly into Slack, and you can just accept or reject if this build should move on, let's say, from your dev environment to preprod or even from preprod to prod. So that's pretty cool. We have a couple of integrations here contributed to the Keptn ecosystem. Yeah. Those are our latest integrations, I would say. And, yeah, maybe we already take it from there and take a look at Keptn. If you want to, afterwards, maybe take
17:06 a look at Keptn, it's keptn.sh. That's our main website. And, yeah, I think we're good to go. Okay. So it does a lot. There's a lot of things here, right? So, I mean, I'm very excited now. I have a couple of questions. Before we get to them, we got one from the chat. So Adam has asked, is it possible to use Keptn outside of Kubernetes? Yes. So when I read this question, I can definitely answer with yes. The applications that are managed by Keptn don't have to run in Kubernetes. But Keptn itself is built to run in Kubernetes.
17:51 But we've also built Keptn on k3s, a very small single-binary Kubernetes distribution that you can just run on your small EC2 instance, and you just install Keptn on k3s there. And then the control plane acts on this EC2 instance, and you can send the cloud events to Keptn. It's just built as a couple of microservices, and they need some kind of orchestration, and we decided to go for Kubernetes. But, yes, you can also use Keptn for traditional applications. I would assume,
18:26 or I would just argue, it works best with microservice architectures. But we also see monolithic applications being quality gated by Keptn. Alright. Nice. So on your slide, in fact, let me just pull that back up. Can you go back? Was it one slide, two slides maybe? There. This one. Okay. So I'm trying to think, and, you know, we'll get hands-on in just a moment, and I'm sure all my questions will be answered there, but I'll throw them out anyway. It's like, Keptn runs as a control plane to monitor pretty much everything from after I get merged into
19:03 my main branch to the application running in production. Now you talk about environments and how it can handle progressive delivery across those environments. Now the example here is namespaces. Could those be separate Kubernetes clusters, one each for dev, staging, and production? Yes, but only in the next version of Keptn. So we are working on this. Today, we have to use Keptn 0.7.3. That's the latest version. And we are working on 0.8. And with 0.8, it will be possible to manage different clusters, even in
19:42 different clouds. So then it doesn't really matter where the cluster lives. But right now, the control plane can manage one cluster, and we are doing the separation between the control plane and the execution plane, and then it will be easier. Alright. Sweet. Okay. Let's get started then. So if I pull up my screen, there we go. We have the Keptn website. People can check that out at keptn.sh, and there's documentation, tutorials, etcetera. The tutorials, I believe, are what we're gonna get started with today. So are these just guides that anyone can follow along
20:11 Starting the Live Demo
20:24 at their own time, click start, and then you're just guided through step by step. Right? So are we doing the Keptn full tour? The full tour on Prometheus one, right? That's the one we're gonna check out. Yep. That sounds good. And the other ones are basically for different use cases. For example, Keptn in a Box will use MicroK8s. We also have the guide for the Keptn installation on k3s, or if you want to connect it to Argo CD. So we decided to go for tutorials. We also have documentation,
20:57 but tutorials are more about guiding you through the concepts of Keptn and how to use them best. So here's the branch I'd want then for my multi-cluster approach. Okay. Exactly. We won't do that today. I won't put pressure on you. So we get this kind of, alright, here we go. We get step-by-step guidance through the full tour. So when it says the full tour of Keptn, that's just going through all of the components that you listed on those slides. Right? Exactly. It will explain how to install it.
21:03 Demo Step: Install Istio (Dependency for CD)
21:35 It will explain how to do delivery with Keptn, quality gates with Keptn, and automated operations, or let's call it self-healing, with Keptn. And it will also bring all the tools that you will need. So in this case, yeah, it will ask you to install Istio, because that's what we are using for the route generation and the traffic shifting between the blue-green versions. So this is why we need Istio in this case. Alright. So this is recommending that I run Istio 1.5. I'll just check what I actually have.
21:50 Installing Istio
22:17 Okay. So should I install this version? Does the version matter? I have not tried with the latest versions. We just had that the previous versions did not work very well, so we decided to go for this version. This is the most stable version for our demos here. Alright. That's fine. Why did that not work? Oh, because I put the path after. Let's just do this. Much better. Okay. So we have istioctl 1.6.5. We're just gonna run an install against our cluster. Default profile, is that the correct one? Yep.
23:13 The default. Yes. That's cool. It's just one of those things that I've been putting off playing with for so long that I'm really gonna have to start experimenting with it eventually. Actually, when you're using Keptn, you won't see a lot of Istio, because Keptn is generating the virtual services and all the traffic routes that you will need for Istio, the ingress, these kinds of things. So we won't see a lot of Istio. It's just used behind the scenes. But it's definitely a very interesting project. Is that something where Keptn could, you know,
23:53 have a dependency on the Service Mesh Interface rather than Istio, or is there stuff in Istio specifically that's needed for Keptn to work beyond the Service Mesh Interface? So if we're not doing continuous delivery with Keptn, if we are using, let's say, Argo CD for continuous delivery and we are just using Keptn for the quality gate part and the automated remediation part, then we won't need Istio. It's when we are going for the full installation and we want to do continuous delivery that we need Istio.
24:26 We were also thinking about moving to the Service Mesh Interface and then also supporting Linkerd and other service meshes. But it was never the right time to broaden up for this, because what we see is that reusing already existing CD tools like Argo CD, Argo Rollouts, and Spinnaker, and actually providing integrations with them, seems like the better approach right now. Alright. Makes sense. Alright. So, yeah, we've got installation complete. We hit next, and I'm curling for the Keptn CLI. Here we go. Okay. That was fast.
25:08 Demo Step: Get Keptn CLI & Install Keptn Control Plane
25:28 My connection's pretty good. Well, when it wants to be, of course. There are definitely days where it's not good, but today it's been pretty good. So we have the Keptn CLI. Nice. Okay. That's pretty good. So yeah. And now with the Keptn CLI, we can install Keptn in our Kubernetes cluster. And as said, for our installation type, we are going for this use case, continuous delivery. If we don't decide for this use case, then it will be a smaller installation of Keptn without the services to do the full continuous delivery, but
25:45 Installing Keptn
26:03 then we have to take care of doing continuous delivery by ourselves. So we just go for the full installation here. Okay. So do you want me to change continuous delivery to full, or is continuous delivery the full installation? So we just leave it like it is. Yeah. Alright. Okay. Yeah. That's, let's say, the full installation with the continuous delivery use case included. It just asks us if everything is correct. Yeah. I'm just gonna keep clicking buttons when things are presented. If I'm wrong, just feel free to yell at
26:39 me. But so this is gonna, oh, so it's installing Keptn. So I could have done this with the Helm chart as well. I'm assuming the CLI is just a wrapper around that and provides some values for the use cases that are selected. Is that roughly right? Yes. Yes. Exactly. We had it differently before: we had a job that was really executing the Keptn installation step by step, but we thought it's best to move to a Helm chart, since it also provides the possibility to install it via a Helm chart without
27:11 the installer. That's what we also see: sometimes it's not allowed to install any software with an installer script, but just with Helm charts, so that organizations have full control and can take a look at what is actually installed. In this case, it's basically applying the Helm chart, and we will see a couple of pods spinning up in the keptn namespace. So we install everything in one namespace, and that's the keptn namespace. If you want, we can try kubectl get pods -n keptn. We will see what it's actually creating here. Yeah. We'll
27:50 see. That's basically the core of Keptn. A couple of different services that talk to each other. Okay. So we've got NGINX for the gateway. We've got bridge. I don't know what that is yet. We've got config, and we've got an event broker. Is that NATS, or are you using something else there for the event broker? It's actually, yeah, it's NATS. So you can also see the Keptn NATS cluster that's deployed. Oh, yeah. Yeah. Then the bridge, Keptn's Bridge, is, let's say, the control center of Keptn. That's our UI.
28:24 We will see the UI later. The lighthouse is the one component that is responsible for reaching out to all these SLI providers and doing the evaluation. Yeah. The shipyard is responsible for creating the environments. Is gatekeeper Open Policy Agent or something else? It's our own internal gatekeeper. It's the one component that decides if one version of a microservice is allowed to move on to the next stage or not. It's doing this decision making. Right. I'm not seeing any puns here for walking the plank. So I think we need a new service for that, for maybe infecting
29:04 other apps. Alright. I like the nautical theme. It's always amusing. Okay. So let's see what's next. So we have to configure Istio and Keptn. I guess I just trust this and run it. Yeah, actually this one is just downloading, and then we will just run the next command. So we can also take a look here. What it's doing is it will create the ingress, and it will also create a gateway, and it will restart one service of Keptn so that
29:50 it can fetch this configuration. Alright. K. So we're just gonna run it. Parameters. Nope. Okay. Good. Alright. Let's give that a second. Alright. And then it tells you what it's doing anyway. Right. Cool. Got it. And now we need to configure our CLI to speak to the control plane. Exactly. Verify this. Okay. So we already see where your Kubernetes cluster lives. So we can take a look at the API endpoint. And there is also the bridge that will be exposed here. I will just warn you. So whenever we are doing the, yeah, that's the API of Keptn. It
30:10 Demo Step: Configure Keptn CLI & Networking
30:50 can be controlled via the CLI or via the API. And, yeah, you can imagine that when Citrix built the integration with the Slack bot, they were using the API more. Today, we will be using the CLI more, since it's easier than writing all these cloud events and then sending them to the API. Yeah. So with this, we have already installed Keptn. We have access to the Keptn API, so we know it's up and running. And we could also already take a look at Keptn's Bridge, but it will be empty anyway. So we can go ahead and create
31:25 our first project, onboard our first services, and then take a look at the bridge. Okay. So we're gonna clone. Okay. So it's just an example repository. And then we're gonna take a look at onboarding carts. I should just pop it open. Okay. Nice. So I, oh, wait. So I still have to create the project. Okay. I've just cloned it and opened it, but it's... Yes. And here, if you want, we can link it to a public Git repository in GitHub or GitLab or wherever you might wanna run it. So we can also take a look at all the
32:00 Setting up our Keptn repository
32:12 different projects and all the on everything that has been created by kpten, or we can just assume that everything's fine and we just trust that the kptn repository that is managed by kptn in its in its in the cluster is will be fine. So we can go with the git upstream or we can also link it afterwards. So right now, we can also go with the installation without the Git upstream, and we can do the link ups the the Git upstream afterwards if we want. So that's just that's open to you. Alright. So let's see.
32:51 So I I could just create a an empty repository and then fill in those variables. Is that Yeah. So we we could you can even create an empty organization, and you can have it there. So you afterwards, you can delete the whole organization if you want. It they're also for free organizations. Okay. Let's call Rawkode captain. I own email address. Oh oh, alright. Skip. Okay. So I have an organization. There we go. Yes. And we just need one project in this organization. Yep. Yep. And that. Since we are going to onboard let's call it sock shop since we are going to
33:41 onboard a shopping cart. I think for captain, please make it sock shop without the dash because in previous versions, we had issues with the dash and just make sure everything's fine here. Yeah. There we go. Okay. So then here, what we need is just to expose these. K. So let me just modify them here. So is that my organization or my actual username? That's your actual username. Then k. Yeah, we would need the Rawkode. Or you can just copy the can you I'm not sure if it would would work in this way. Maybe we just need the,
34:34 the URL that is given by GitHub. Oh, it's actually exactly this. Okay. Sorry. Then it's, is it? Okay. I always go for the HTTPS version, so this is why I'm confused. Alright. Right. Yeah. Maybe we can go for the HTTPS version. I think that is actually the one that we want. I actually need a Git token? Yeah. And we might need this token, but I'm not sure if we can share it here on the screen, or you would go for a different display or monitor where you're going to set this. Yeah. I'm just
35:05 Examining the Shipyard File (Pipeline Definition)
35:23 getting one over here. So let me get a token, personal access. Any specific permissions it needs? Read and write on the project will be just fine. So I think that's the first two permissions. K. Let's see what happens once that's done. So what I'm gonna do is just move this over here. And that, handling secrets on the stream, is always the fun bit. Okay. So I can pull this back and pull my terminal back. Yeah. I should have warned you earlier, but I totally forgot. Oh, yes. I need to export that variable. So
36:30 Git token. See what happens. And the documentation just wants me to run this command here. Just didn't copy. Alright. So we get... So what's going on here? Do you wanna fill this in a little bit? So what Keptn will do here is, it will start to create the project and it will create different stages. And these stages will be managed by branch. So it is going to create the project sockshop. That also means creating the folder in the Git repository and creating the branches in the Git repository, and it should already be synced to your Git upstream. So
37:40 it's always doing this, that it will sync it to a Git upstream. So if you reload it here, we can already see the shipyard file, and you will also see three different branches for dev, staging, and production. So with this, we have already created, let's say, the environment of a multistage pipeline: dev, staging, production. And if we take a look in the shipyard file, we will see how this is configured. So in our dev stage, we have a deployment strategy which is direct, so no blue-green deployments for our dev stage.
38:20 We just deploy directly into the dev stage, and we execute some functional tests. That is the instruction for how this stage should be set up, and is set up, by Keptn. For staging, we have a deployment strategy which is blue-green on the service level, and the test strategy is performance tests. So the test integrations get the, let's say, instructions to execute performance tests, and the approval strategy is set to automatic for everything. That means it doesn't really matter if the quality gate gives us a warning or a full pass,
38:57 we promote it automatically into this stage. For production, the approval strategy is a little bit different. If it's a full pass, we deploy it automatically into production. If we get a warning, we would have to approve it manually, either by clicking in the Keptn Bridge or doing this with the Slack bot or with an API call. For just approving this, you can have an integration with other tools as well. And we also have a remediation strategy for production. So that means if there's any issue, any alert sent from, let's say, the Prometheus Alertmanager in production,
39:34 Keptn will trigger a remediation. For other stages, Keptn won't trigger remediation. This is basically everything that we have defined in this, and this is how Keptn will behave in these different stages. Alright. That makes sense. Okay. So we have a question as well, which we can tackle just now before we move on. A viewer is asking, does Keptn expose any metrics that can be consumed? Keptn does not, right now, directly expose its own metrics about the performance of Keptn, if this is the question. Keptn can collect different metrics from SLI providers.
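Editor's sketch: the shipyard walkthrough above can be summarized as a small data model. This is a minimal Python illustration of the stage rules as they were described in the session, not the shipyard file format itself. The class name, field names, and helper functions are hypothetical, and any setting that was not mentioned on stream is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One stage as described in the session (field names are hypothetical)."""
    name: str
    deployment: str                         # "direct" or "blue_green"
    test: Optional[str] = None              # "functional" or "performance"
    approval_pass: Optional[str] = None     # "automatic" or "manual"
    approval_warning: Optional[str] = None  # "automatic" or "manual"
    remediation: bool = False


# Values taken from the spoken walkthrough; None means "not stated on stream".
SHIPYARD: List[Stage] = [
    Stage("dev", "direct", test="functional"),
    Stage("staging", "blue_green", test="performance",
          approval_pass="automatic", approval_warning="automatic"),
    Stage("production", "blue_green",
          approval_pass="automatic", approval_warning="manual",
          remediation=True),
]


def needs_manual_approval(stage: Stage, result: str) -> bool:
    """Return True if a quality gate result requires a human to approve."""
    if result not in ("pass", "warning"):
        raise ValueError(f"unexpected quality gate result: {result}")
    strategy = stage.approval_pass if result == "pass" else stage.approval_warning
    return strategy == "manual"


def remediation_stages(shipyard: List[Stage]) -> List[str]:
    """Names of the stages where alerts would trigger remediation."""
    return [stage.name for stage in shipyard if stage.remediation]


if __name__ == "__main__":
    for stage in SHIPYARD:
        print(stage.name, "manual on warning:", needs_manual_approval(stage, "warning"))
    print("remediation in:", remediation_stages(SHIPYARD))
```

With this model, only a warning in production asks for manual approval, and only production reacts to alerts, which matches the behavior described in the walkthrough.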
39:51 Q&A: Keptn Metrics
40:21 And which metrics, like response time or total number of calls or whatever you can think of, these are just defined in the SLI file. So you would just map between the name of the SLI and the PromQL query, for example, or, if you're using other monitoring tools, the name of the SLI and, let's say, an API call for how to fetch this metric. Okay. Thank you. So let me just make sure I understand what's happened right now. So we have this repository created. I mean, if I click on dev here, we've just got some metadata. So nothing's really
40:58 being deployed to our cluster yet. Right? Exactly. We just defined that there is a project sockshop, but we have not onboarded the microservices yet. You can think of it as: we have defined the whole application, so we have a container for this application, a logical container, let's say. And now we will onboard services. So we could onboard a front end, a database, a back end. And what we will do is, the demo application comes with two microservices. One is the database. It's not really a microservice. It's a database, and the other one
41:00 Project Definition Complete, No Deployment Yet
41:34 is the microservice. And we will then do different deployments with this demo application. Okay. Let's jump back over to our tutorials. So we have, oh, that's the alternative. Okay. Yes. Okay. So if you don't want to have this Git upstream, if you just want to do it locally, then it's totally fine. It's also working. Okay. So now we can take a look at the Bridge. This one will print a token for the Bridge. Since it's just a demo environment, if you're fine with the token being exposed on the stream, and if you delete the installation afterwards, it's
41:40 Demo Step: Onboarding Services (Cart & Cache DB)
42:15 totally fine. But just... Okay. Yeah. That's fine. And then we're going to run this. What's that doing? Let's see. Echo, get ingress. Alright. Okay. So this is gonna print out a URL, which we can click on. Of course, it went to the wrong tab. Yeah. Just copy that. There we go. And now I need our password here. So let's see. And go. Yeah. Alright. So this is the user interface. I can select the application. I can see my three environments. Not much going on just yet. Services and, okay, some integrations. Cool. Let's see what's next.
43:17 Now we have to onboard our first service. Exactly. And in the tutorial, we have two services that we want to onboard. The first one is carts. It's actually the application of the shopping cart. And in step two, what we also do directly here is add test instructions for our different stages. So for dev, we will add test instructions for JMeter, which is just a functional test, basically checking if the API of the carts microservice is available. And for the performance test, it runs a couple of thousand requests against the service. So
43:20 Onboarding our first service
43:56 this is everything that we can see here: it is creating the application and actually onboarding the Helm chart. So this command will also upload the Helm chart that lives in the carts folder. It will upload this to the Git repository. Is that a typo, carts? No. That's actually correct. It's a little misleading because it's the carts. It's a cart chart. And it's a chart of the cart. Alright. So let's onboard that then. Let's pop over here after onboarding. Okay. Yeah. Let's run that. Give that a second. And the cool thing here is that you just have to define your Helm chart once, and
44:51 then you basically pass it on to Keptn. And from this point, you don't have to touch the Helm chart. You can do most of the things with CloudEvents or with the Keptn CLI, and you don't have to maintain the charts yourself. They will be managed in the Keptn Git repository, depending on the stage. So what we've seen is that for the staging and the production environment, we have defined a blue-green deployment. So there will be different Kubernetes manifests created for these different stages, because for blue-green, we need virtual services for Istio.
45:32 For direct deployments, we don't need this. We can go with the standard Kubernetes services. So this is just how it's done. Actually, it should return a little bit faster. Maybe that's the Git sync that might take a while, but, yeah, we'll see. So should we see anything here? Oh, yeah. It's had a push less than a minute ago. Okay. So we can see that it's, yeah, it's added all the chart stuff. Okay. And it's getting there. Injecting Istio into the thing. My copy-paste is a bit temperamental today. Let's just try again.
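Editor's sketch: the point about different manifests per stage can be shown with a tiny lookup. This is an illustration under assumptions, not Keptn's actual chart generation. The function name and the strategy strings are hypothetical, and the only resource kinds listed are the ones named in the conversation (standard Kubernetes services, plus Istio virtual services for blue-green).

```python
from typing import Dict, List

# Deployment strategy per stage, as defined at the project level in the session.
STAGE_STRATEGY: Dict[str, str] = {
    "dev": "direct",
    "staging": "blue_green",
    "production": "blue_green",
}


def routing_resources(strategy: str) -> List[str]:
    """Resource kinds needed to route traffic for a given deployment strategy.

    Assumption for illustration: blue-green still uses a Kubernetes Service
    and adds an Istio VirtualService for traffic shifting.
    """
    if strategy == "direct":
        return ["Service"]
    if strategy == "blue_green":
        return ["Service", "VirtualService"]
    raise ValueError(f"unknown deployment strategy: {strategy}")


if __name__ == "__main__":
    for stage, strategy in STAGE_STRATEGY.items():
        print(stage, "->", routing_resources(strategy))
```

The design point is that the chart is defined once and the per-stage differences come from the strategy, so nobody maintains three hand-edited copies.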
46:36 Demo Step: Adding JMeter Test Files
46:41 What does it mean by resource here? It means it will upload the basiccheck.jmx. It's an instruction for a JMeter execution, and it will just add this resource to the JMeter folder for the service carts, so for the shopping cart in our dev stage. And for staging, you've also just uploaded a file. So if you go, for example, yeah, in dev, we will see in carts two folders, one for helm and one for jmeter. And here, we just have the instructions for which tests to execute. It is one service where, when you install
47:26 Keptn with the continuous delivery use case, JMeter is installed with Keptn by default. You can, of course, remove it afterwards, but it's just one of those services that we call batteries included, since we see that it's widely adopted, that organizations have JMeter files and use JMeter for their performance and load tests. And so with this, you can just add the JMeter file, and the JMeter service will then pick up those files and execute them. We'll see this in just a second. So, you know, if I wanted to, could I omit the keptn add-resource command and just
48:06 do the git add and push myself? Would that still work, or is there gonna be something missing from that? You could actually, yeah. You can also do it with git add and git push. That works, since Keptn always first takes a look at the remote branch, syncs it to its internal Git repository, and then executes. So whenever there are new changes on the remote branch, Keptn will pick them up. K. So now we're gonna onboard our database. Right? Power safety. For this, we can also see it here as the last parameter in this line
48:18 Q&A: Edge Deployments
48:50 that we are overriding the deployment strategy here. So we have defined the deployment strategy on the level of the project. We defined that in dev it's direct, in staging it's blue-green, and in production it's also blue-green. For the database, we don't do blue-green deployments in this case, so we just go for a direct deployment for the database. We won't do a lot of deployments with the database anyway, so it doesn't really matter. Alright. While that finishes, let's just tackle a couple of questions we've received. So Bella has asked, would Keptn be suitable for deploying on
49:28 edge devices? Yes. So, as I said, right now Keptn is doing deployments on the same cluster where the Keptn control plane is also installed. So I would not go for edge devices here right now. But with the next version of Keptn, it's totally possible to do deployments also to other clusters or to edge devices. That's totally possible. Yep. Okay. I mean, I guess it depends on the edge setup. But if those edge devices are all part of a Kubernetes cluster with a centrally managed control plane,
50:13 I mean, Keptn could also just run there and still do deployments to those nodes even though they're distributed, I would imagine. And you also mentioned at the start of the stream that Keptn runs on K3s, which I guess may be a good fit for those edge scenarios as well. Okay. One more question. Some pass is asking, is the ingress running on the local host for the Keptn dashboard? The ingress is running on our Kubernetes cluster, the same one as Keptn. I don't know. Have you provided a few... I don't know how to answer that just
50:28 Q&A: Ingress & Load Balancer Setup
50:46 now. I'm not sure if you do, Jurgen, but maybe, Some path, you could provide a few extra details there. Okay. We've onboarded our carts DB. So now it's telling me, hey, go look at the UI. Let's do that. Alright. We can see we have two services. Not deployed yet, though. Exactly. So everything that we have done until this point is basically a one-time setup. So we have created the project. We've onboarded our services. We've not yet deployed them, but they already know which tests they have to execute once we deploy them. They know if they
51:00 Deploying our first service
51:27 should be deployed in a blue-green fashion or if remediation should be executed for them. So we have already provided all these instructions. And in the next part, we now deploy the carts database and the carts microservice. So in the first part, we deploy the database. And here, we can also see a little bit of the Keptn insights and details, because the command is keptn send event new-artifact. That's a very strong hint that Keptn is event based. It's not an operator living in a Kubernetes cluster and doing things and looking for
52:08 changes on the Git repository, but it's really event based. So every time there is a new event, Keptn will act upon this event. And you send these events as a CloudEvent to the Keptn control plane. So is this something I would use with Jenkins or GitHub Actions? Every time I have a successful build, I would maybe send a Keptn event there. Exactly. That's exactly what people that are using Keptn are doing. Once you have a container image, then you trigger Keptn. K. Let's just get both of these going
52:45 then. Well, we've told it that the carts-db service has an image. So it's just gonna pull it from Docker Hub. So that's very synchronous right now with the keptn send event. Is there a way I can just say send the event and don't give me any feedback? Like, just, you know, go and do it? That's basically what you will be doing with the API. The API will just respond 200 if the CloudEvent is okay. This one is always opening up the WebSocket. You can also omit the WebSocket. I
53:31 think it's --no-websocket or --suppress-websocket. If you go to the keptn help, then you will find it. Suppress WebSocket. Okay. Suppress WebSocket. Yeah. So then it won't do the WebSocket communication. And, oh, it says unknown flag project. No. This is a global flag. Let me move that. Oh, yeah. Did I copy it wrong? No. I think it looks good. Maybe we can just put the suppress WebSocket at the very end. Maybe. Is this a Cobra application? Yeah. It's always weird. Oh, no. Let's just remove it. Oh, artifact is wrong. Artifact, it says
54:34 here. Always my fault. It always is my fault. There we go. Artifact. User error. Okay. And that was very quick. So I sent the event while suppressing the WebSocket. I'm saying, just go do whatever you have to do in the background. So... Exactly. But what you always get is the Keptn context ID. And with this Keptn context ID, you can always query the events that belong to this Keptn context. So in this case, Keptn would start the deployment. It will now do the deployment in the Kubernetes cluster. It will trigger the tests.
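Editor's sketch: since the discussion above is about sending a CloudEvent to the control plane and then using a context ID to find related events, here is a minimal Python illustration of that idea. The top-level attribute names `specversion`, `id`, `source`, and `type` are the required CloudEvents 1.0 context attributes. Everything else is an assumption for illustration: the event type string, the source value, the `keptncontext` field name, and the payload layout are hypothetical and are not taken from Keptn's API.

```python
import json
import uuid
from typing import Any, Dict, List


def new_artifact_event(project: str, service: str, image: str, tag: str) -> Dict[str, Any]:
    """Build a CloudEvents-style envelope announcing a new container image."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "example/ci-pipeline",        # hypothetical sender name
        "type": "example.keptn.new-artifact",   # hypothetical type string
        "datacontenttype": "application/json",
        "keptncontext": str(uuid.uuid4()),      # hypothetical correlation field
        "data": {
            "project": project,
            "service": service,
            "image": image,
            "tag": tag,
        },
    }


def events_for_context(events: List[Dict[str, Any]], context_id: str) -> List[Dict[str, Any]]:
    """Return only the events that share the given context ID."""
    return [event for event in events if event.get("keptncontext") == context_id]


if __name__ == "__main__":
    event = new_artifact_event("sockshop", "carts-db", "docker.io/mongo", "4.2.2")
    # A CI job (Jenkins, GitHub Actions, and so on) would POST this JSON to the
    # control plane API; the HTTP call is omitted here on purpose.
    print(json.dumps(event, indent=2))
```

The context ID is what ties the deployment, test, and evaluation events of one run together, so keeping it from the response is what makes later querying possible.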
55:05 Demo Step: Triggering Initial Deployment (Send New Artifact Event)
55:11 It will do the evaluation after the tests have finished, and it will decide if it should be promoted to the next stage or if it should stop. So everything is done now, starting from here. And in the Keptn Bridge, you can follow the sequence of events. And this demo application is, I think, a Spring Boot application. It's not tweaked, let's say, in any way, so it will just take a while to be deployed. Here, we should actually already see the carts DB. Yes. So carts DB already says it's
55:51 Mongo 4.2.2, and we also have the configuration for the first service. Yeah. It's not yet finished with deployment, but it's not Keptn's problem that the deployment is slow. It's just that the demo application, this carts application, takes a while to come up and for the first readiness probe to be finished. Awesome. Okay. So the viewer has given us a little bit of extra detail on their question. So what they're saying is, when we ran keptn configure bridge --output, does that command create a public load balancer?
56:34 I think I know the answer to this, so feel free to correct me if I'm wrong. But I don't think it does. I think it's just querying the IP addresses from the services inside the cluster. When we installed Istio, it deployed the Istio gateway, which then requested a load balancer from the provider. So that was all set up with Istio. There is no ingress set up with that command, as far as I know. Is that right? Yes. So the ingress was really part of the Istio part.
57:06 Yeah. K. It was. Cool. And the configure bridge, I think, was just getting us a URL that we can use. So... Exactly. And if you're not familiar with nip.io, it's just a free service where you can put any IP address in their format. Like, you can see here, if I, let me zoom in on that a little bit. Really cool when you don't wanna have to deal with DNS records at all, and it just maps back to the IP address, which is part of the actual domain name. A really cool service. Okay. So let's see what's coming up next. This is
57:39 fifty-eight minutes remaining. There must be quite a lot in this tutorial. Yeah. It is. It is. But actually, this fifty-eight minutes is just a wild guess, so we can always be a little bit faster. But let's see if it's already... in sockshop-dev, carts is still... We have something pending. So let's just take a look at that and see if we've had any sort of challenge. Insufficient CPU. Okay. Okay. We've got six nodes. I mean, that should be alright. I think that's just a problem that, in the demo
58:08 Debugging: Deployment Failure (Insufficient CPU)
58:26 application, we have not tweaked the resource requests. So maybe it's just asking for a little bit too much. Alright. Let's take a look at our resources. Okay. That's the limit, and then this is the request. What if I modify that resource? Yeah. Actually, I think the easiest is that we modify this in the public repository that we set up in the beginning. Ah, right. Right. Okay. So back... Because that will be synced anyway, so we should be able to see this. But it will give us a hard time
59:16 because the Helm chart, okay, here comes the thing. The Helm chart, you now have it in dev, in staging, and in production. So we would need to change it three times, because Keptn is taking this Helm chart and replicating it for each stage, so you can have it managed per stage. So we can try to tweak it here a bit and go down with the... Alright. I mean, we can just remove it. Right? We can just say we don't really care. Yep. Where is the save button? I know I should enter a better commit
59:19 Modifying Configuration & Retrying Deployment (GitOps Workflow)
59:58 message, oh, well. So what happens when I do that? Does anything happen? Right now, nothing happens, because we have not yet sent an event to Keptn. Yeah. So we've just changed the deployment file, and what we should do is send a new event to Keptn to do the deployment. So, basically, the same command that we had earlier. And we trigger a new deployment, and Keptn will fetch this again. K. Let's give that a minute, I guess. And you're saying I should modify this in staging and production as well? Yes. But can you take a look in
1:00:47 the helm folder? There is a carts and a carts-generated folder. And here, we actually have to take a look in carts-generated, since carts is the original one that we onboarded, and carts-generated is the one where we also have the Istio resources that have been created by Keptn. I don't see anything. And in the staging branch, sorry... Yep. We should see them. No. Nothing. Okay. Then we have to take a look in
1:01:36 the carts. Yeah. Then maybe it's not in a generated one. Then I'm totally wrong. So for dev, we already fixed it. That's fine. We have to take a look in the... Doesn't seem to be doing anything. Okay. So let me try and understand what's going on here. Is there a way for me to subscribe to the Keptn events? Can I see what events have been happening? How would I debug this situation? We can take a look in the Keptn Bridge at what is going on. So we would see that there is a new service sent or a new deployment
1:02:03 Q&A: CLI Authentication
1:02:12 sent in dev. Oh, did I send it to the right... yeah. Yeah. There's no environment there. So that command is okay, isn't it? Yeah. That's fine. We can just remove the suppress WebSocket so we can also see the logging output, because that's actually what we don't see right now. But we'll see: start updating the chart carts of the stage dev. There we go. And here we go. Has no deployed releases. Okay. It failed the first time, so the Helm release is actually not working. So now it's basically in an undefined state with
1:02:52 Observing Progressive Rollout (Dev -> Staging)
1:02:55 Helm. I've had this so many times with Helm, where the first release fails and then you get stuck in this limbo land where you need to actually remove the old release. So, I mean, will my standard Helm tooling work here? Oh. Yeah. It should work, actually, because helm should also point to the same cluster. And... There we go. Okay. So... Yeah. It's sockshop-dev. This is how the... Delete, and the release name. That's, no. That's d p. The... I don't see it. The event probably had it. Didn't say, yeah.
1:03:46 Okay. This one. There we go. Okay. So we should be able to trigger that new artifact again, and it should start from a clean slate, maybe? Yeah. But is there any chance that we can, like, bump the size of the cluster a little bit? Because it will do the same in staging and in production. But in staging and production, we will do a blue-green deployment. So we have two versions of carts running. So we might run out of space also in staging and production because, like, we are adding five more containers
1:04:28 to the show. Yep. No problem. Okay. Go. Alright. I've increased the size. There's a new node pool being added now, which should give us five more nodes. So let's see what the status of that is. Did I do that right? And I think that gives me the chance to rethink our demo application, since it's a little bit too big for a very, very small shopping cart, and it just takes too many resources. It's, yeah, it's just that we're going for a database plus a small UI and a little bit of... yeah. It's just a small microservice.
1:04:38 Resizing Cluster
1:05:25 Maybe we should go with two small microservices that talk to each other and are very, very small, maybe just pinging each other, and then we can also showcase the capabilities. Alright. Let's just see how far along that is. It shouldn't take too long. Let's see. One, two, three, four, five, six. I can see if it's adding my new node pool. Okay. I'm just gonna increase the size of that. So I may lose the API server for a few minutes, but we can deal with that. The joys of a livestream. Right? K. Let's see. Exactly. Yeah. Maybe we have more questions coming
1:06:14 in, so that we can also do a couple of questions. Let's see. Happy now. Yeah. Okay. And that should be spinning up now. So let's see what happens. We'll give that a second. Okay. We do have one question just now we can tackle, and then we'll cover a few other things. So Bella asked, does the user have to manually update Istio to keep up with their releases? Yes. So Istio is not managed by Keptn. It's needed by Keptn right now just for blue-green deployments, but it's not managed. So you would just install it yourself and
1:07:02 take care of the latest versions of Istio. Maybe in this context, I can add that upgrading Keptn is usually done with the Keptn CLI. So we release a new Keptn version approximately every four to six weeks. And there, usually, it's just a keptn upgrade, and it will upgrade your Keptn installation and move all the services to their latest version. If there are new versions of other dependent services released, then you would just have to take care of this yourself. Cool. Thank you. Alright. Can I trigger keptn send event again, and we should see
1:07:54 that work this time? I won't suppress the WebSocket. Let's follow along and see what happens. Let's try. So I guess that also explains the weird situation then, because you were expecting to see the generated YAML here, which I'm assuming isn't there maybe just because that release wasn't successful the first time. Is that right? Yes. Yep. There are a couple of things going on. The Helm chart will be generated, and the Helm chart will be applied. And the CLI will wait for the Helm chart apply, for the helm upgrade (actually, I think that's the command), to be finished.
1:08:38 So in this case, it takes a while, since just the start of the container takes a couple of seconds. So, normally, this command... I have a very small Go application that I'm using sometimes. It's called podtato-head. It's a CNCF demo project. And with podtato-head, this command finishes within, let's say, five seconds, because it's very small. Like, the startup is really fast. With the shopping cart, the startup is a little bit longer and slower. So, also, this command just waits for the first successful deployment in the
1:09:12 first stage. So it won't wait until it's deployed into production, but it will just give you the feedback on whether it could be deployed in the first stage. So it finished upgrading the chart. That should be fine. And, yeah, now it's writing the generated ones, and now it's all good. There we go. There we go. Finally. Alright. So what it turns out, then, is that with my wild idea of suppressing the WebSocket, we actually managed to time that right. I did it on the request that failed. So go me. Thanks. Alright. Let's go back to our tutorial.
1:09:50 So we have done both of those, and we should be able to verify all this now. So let's try that one more time. Here we go. That looks much better. So we've got staging. And now we can also see it's starting up already in staging. There is also a difference when we take a look at the pods and the containers that are running in the pods. For dev, there is only one container running. For staging and production, there will be two containers because we will also have the Istio sidecar. But for dev, we decided not to go
1:10:25 for blue-green deployment. So there is no need to inject the Istio sidecar, because we don't need this traffic shifting between blue and green versions, and we're only going for this in staging and in production. If we take a look in the Keptn Bridge, there should already be the link to the service available, so we can also take a look at the UI of this very beautiful shopping cart. Actually, yeah. It's dev. Yeah. Here we go. And that's our shopping cart. Yeah. So it took a while to see this very nice
1:10:53 Viewing Deployed App UI
1:11:06 UI. Cool. So I'm just curious then. If I come in here and take a look at this, did this one get the change that I made in dev? Right, this is something I would have to manually update here. Looks like we even got away with it, with those extra nodes. Okay. So that's cool. Nice. So we've already looked at that. I'm happy with that. We go to next. Now it wants to generate some traffic. Yep. Do we want to do that or will we... We can also skip this part, as I think what's really important to
1:11:53 where we... sorry. I think, did I just lose you for a second? Oh, no. Okay. So I think what will be interesting to take a look at is how we can connect Keptn to Prometheus and also run a quality gate evaluation with data from Prometheus, maybe. I think we still have a couple more minutes left in the stream. Or, David, how long is the stream scheduled for, another couple of minutes? I mean, it's scheduled for another fifteen minutes. If you wanna go over that, I'm fine. As long as you've got the time, we can
1:12:10 Demo Step: Configure Monitoring (Prometheus)
1:12:28 go until we get the Prometheus stuff done. Yeah. I think if we just execute the installation of Prometheus and do the configuration, it should be quite fast, and we will get the data. Yeah. But now that you've said that, something random is gonna happen. Right? So we're deploying Prometheus. We are gonna configure Keptn. Okay. We're just telling it that Prometheus, yeah, keptn configure monitoring prometheus. So this is a command that, do you wanna tell me what that does? Yeah. It also says it here. So what it does is it will set up
1:13:05 Prometheus, and it will also configure Prometheus in a way that it will generate the scrape jobs for carts in dev, staging, and production, so that we can use the data afterwards. And if we had already set up the SLO files, the quality gate files, it would already set up the Alertmanager rules. So right now, we have not added the SLO files, so there is no alerting yet, since we have not defined the quality gate yet, but this will be the next part. Okay. Got it. Yeah. So this is using my local shipyard, right, to kind of piece this together? When I run
1:13:43 this keptn configure monitoring, it's looking in the shipyard file here? Or no, it's sending the event to the control plane? It's sending the event to the control plane. It's taking the shipyard file from the Git repository. Got it. Okay. That makes sense. We don't need to browse to Prometheus. So... Yeah. We can go to the next. And here, sometimes I need to give a little bit of explanation, because there's the Prometheus service. That's the one responsible for configuring Prometheus. If you don't want to configure Prometheus with Keptn, then you just need the Prometheus SLI service, SLI for service level indicators.
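Editor's sketch: earlier in the session, the SLI file was described as a mapping between the name of an SLI and a Prometheus query. The Python below illustrates that mapping idea only. The metric name, the job label convention, and the placeholder names (`$SERVICE`, `$PROJECT`, `$STAGE`, `$DURATION_SECONDS`) are assumptions made for this example and should not be read as the exact format used by the Prometheus SLI service.

```python
from string import Template
from typing import Dict

# SLI name -> query template. The metric name and label layout are hypothetical.
SLI_QUERIES: Dict[str, str] = {
    "response_time_p95": (
        "histogram_quantile(0.95, sum(rate("
        'http_response_time_milliseconds_bucket{job="$SERVICE-$PROJECT-$STAGE"}'
        "[$DURATION_SECONDS])) by (le))"
    ),
}


def resolve_query(sli_name: str, project: str, stage: str, service: str,
                  duration_seconds: int) -> str:
    """Fill the placeholders of the query registered for an SLI name."""
    template = Template(SLI_QUERIES[sli_name])
    return template.substitute(
        PROJECT=project,
        STAGE=stage,
        SERVICE=service,
        DURATION_SECONDS=f"{duration_seconds}s",
    )


if __name__ == "__main__":
    # Query covering a two minute test run of carts in staging.
    print(resolve_query("response_time_p95", "sockshop", "staging", "carts", 120))
```

Keeping the query behind a stable SLI name is what lets the SLO file refer to `response_time_p95` without knowing which monitoring tool answers it.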
1:14:17 In our case, we want to do both, so we have the Prometheus service and the Prometheus SLI service. So this service we also need. It's a service that we also deploy into our Keptn namespace, and it's just responsible for querying Prometheus and delivering the data from Prometheus to Keptn. And with that, we have already set up Prometheus and the integration. And now we need the quality gate, and the quality gate is our SLO file. It's written here, and we can just use it as it is. And it will basically check, I think, for the response time. We
1:14:44 Demo Step: Adding Quality Gate (SLO File)
1:14:59 are just checking for the response time of our service. Okay. Yeah. So it's an average across response time, p95. Okay. And it's got some criteria. Excellent. Exactly. Okay. So I can do this just from the Bridge. Right? Yes. So we can take a look if it's already deployed in staging or in production. Both services. Oh, environment staging. Carts. It's here. That looks good. I think in production, I have not seen it yet. Carts is not yet deployed in production. The DB is there. Carts is not yet there. Maybe we can take a look in the Bridge.
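Editor's sketch: the SLO file was described as a response time objective with some criteria, and the quality gate was described earlier as returning a full pass or a warning. The Python below shows one way such an evaluation could work. The thresholds, the weighting (full points for a pass, half for a warning), and the 90 and 75 percent cut-offs are assumptions chosen for this illustration, not values read from the tutorial's SLO file.

```python
from typing import List, Tuple


def evaluate_objective(value: float, pass_max: float, warning_max: float) -> str:
    """Classify one measured SLI value against upper-bound criteria."""
    if value <= pass_max:
        return "pass"
    if value <= warning_max:
        return "warning"
    return "fail"


def total_score(results: List[Tuple[str, float]]) -> float:
    """Combine (result, weight) pairs into a percentage score.

    Assumed scheme: pass earns the full weight, warning earns half, fail earns nothing.
    """
    points = {"pass": 1.0, "warning": 0.5, "fail": 0.0}
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("at least one objective with positive weight is required")
    earned = sum(points[result] * weight for result, weight in results)
    return 100.0 * earned / total_weight


def overall_result(score: float, pass_pct: float = 90.0, warning_pct: float = 75.0) -> str:
    """Turn the percentage score into the overall quality gate result."""
    if score >= pass_pct:
        return "pass"
    if score >= warning_pct:
        return "warning"
    return "fail"


if __name__ == "__main__":
    # Hypothetical measurement: p95 response time of 700 ms against 600/800 ms limits.
    result = evaluate_objective(700.0, pass_max=600.0, warning_max=800.0)
    score = total_score([(result, 1.0)])
    print(result, score, overall_result(score))
```

Combined with the approval strategy from the shipyard, an overall warning would promote automatically in staging but wait for a manual approval before production.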
1:15:24 Checking Deployment Status in Bridge
1:16:02 Normally, in the Bridge, if we go to services on the left side, then we can take a look directly at the carts service. And if we click on the last configuration change, we can see... the configuration change is usually the part... Oh, it's already trying to fetch the SLIs. Actually, it's trying to fetch the SLIs. And at the same time, we just deployed Prometheus. So I think there is a kind of race condition here, because this part would be skipped if Prometheus were not installed. So normally, you can
1:16:40 Progressive delivery
1:16:45 do it without monitoring, and then it will just be skipped. This time, we deployed Prometheus, but we had not yet added the other parts. So what we can do is just trigger a new deployment of carts to rerun the whole thing with the same version, and it will then fetch the data. Okay. So I just wanna make sure I understood that correctly. So our production approval strategy has some sort of performance monitoring before it does an automatic rollout, and it just
1:17:12 Quality Gate Evaluation Process Explained
1:17:20 managed to capture Prometheus existing in time for it to try and check that the performance was adequate, roughly? Yes. So it works this way: with this command, we are deploying carts in the dev namespace. Then Keptn will trigger the tests. We have added a JMeter file. This will be triggered. Afterwards, there will be an evaluation based on the SLO file. We have not added the SLO file for our dev environment. We only added it for staging. So in dev, it will just skip the quality evaluation. For staging, it will deploy. It will trigger
1:18:00 the tests. We have added the load JMX file, so we will run some performance tests, that is, execute a couple of thousand requests against this shopping cart, and then it will do the evaluation. We have defined the SLO file, and we have defined the Prometheus SLI provider. So this will actually query Prometheus for the data of the testing time and then come back with the data and provide it to Keptn. Keptn will do the evaluation based on the SLO file and then come up with a total score. That is actually everything that's going on in the background when
1:18:35 you just send this one command to Keptn. And I think at the beginning, I mentioned that Keptn acts as this control plane and the orchestrator of these tools. So it connects JMeter and Prometheus. Later on, it can connect Unleash or Ansible Tower to trigger remediation actions. So there are a lot of things that will receive events or send events. So this is what's going on here in the background. Yeah. We can see in this case, there is not a lot of evaluation because we don't have the SLO file. But if we add an SLO file
1:18:38 Re-examining Shipyard File Details
1:19:17 to our dev stage, the dev environment, we would see it. Actually, right now, it's quite fast. The gatekeeper already decided that, let's say, if there is no SLO file, Keptn will treat it as a passed evaluation. So it moved on and decided to promote it to the next stage. For staging, we have automated deployment activated, so it already started to deploy it in staging. And maybe we already see that tests are executed in staging. But, yeah, we just need to wait a couple of seconds or minutes for the
1:20:02 tests to be finished. Right. Okay. Gotcha. Yeah. So we can see it here. Deployment finished, then Keptn was triggering the tests. Keptn was then also fetching the SLIs, evaluating the SLOs, and then deciding whether to promote it or not. So this is how it's going. Alright. Nice. I like it. So that's just gonna take another minute. Why don't we tackle another quick question? So the question is, is the Keptn CLI using its own service account and tokens when talking to the control plane? I'm assuming it may be just using my kube
1:20:25 Waiting for Staging Test Results
1:20:45 context. I'm not sure. Yeah. We are using our own service account. In the beginning, we were just using the default service account, but this was, I think, changed two or three versions ago. We changed, and now we're using the Keptn service account. So you have a little bit more flexibility and control over what Keptn is actually allowed to do. Alright. Nice. Okay. So should I refresh this? It looks like it's rolled out to dev and staging, but I don't think we've had a production deploy. Still, maybe we can take a look in staging
1:21:26 if the deployment has already finished. If the deployment is already finished, then we know that the tests are also executed in the background. Oh, we've had a CPU thing again. Okay. So we are hitting another CPU limit. Yeah. It's just, I think, the problem that we... Sorry. Let's just remove it, and then I'll trigger another event, and we'll see if we can get that to run through this time. So we can just adjust the CPU setting. I'll just commit that to prod before doing the new artifact as well.
1:22:09 Edit. Yeah. This cluster has a lot of nodes, but it's relatively small hardware. I normally over-provision when testing as well, like a billion cores, but today I was a bit more sensible. So I'm gonna trigger this new event based on those changes inside of the repository. So yeah. Let's see what happens. I'm curious. So would you say the bridge is the best way to follow along a progressive rollout of this application? Would you use the CLI? I mean, what's your preference as someone that uses this day in, day out? I think the bridge gives the most details
1:22:48 Demo Recap: Progressive Delivery & Rollback Success
1:22:51 because, for example, if we take a look in the bridge, let's go for a previous run maybe, that way we can see a little bit more... okay. Maybe we don't see a lot of data. Anyway, go for any event. On the very right-hand side, there is this computer icon or laptop icon. And here, it gives you the full CloudEvent that was sent to Keptn. And if you have a quality evaluation or an SLI retrieval,
1:23:35 then these CloudEvents carry a lot of information. So usually, you have all this information. Also, if there is an error message, you should be able to see it there if the SLI provider sends it. Since the reason that we are running out of CPU happens after the Helm upgrade, we don't see this exactly, because that would mean that we send another event after we do the Helm upgrade. But if, for example, there are any tests that are failing or there is some problem fetching service level indicators, then we should be
1:24:13 able to see all this data here. Probably, if we close this window and go to the tests finished in our dev, maybe in the previous run at 03:20, there is a tests finished event in the dev stage. So it's a little bit toward the top. And maybe we can see here... yeah. It just says that it's coming from the JMeter service. And, yeah, it doesn't bring a lot of information. We just know that the JMeter service sent a pass for the tests. It just executed one test.
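For readers who have not looked at a raw event in the bridge, the following Python sketch shows the general shape of a CloudEvents-style envelope like the one a test service could send after a run. The envelope attributes (`specversion`, `id`, `type`, `source`, `time`, `datacontenttype`, `data`) follow the CloudEvents specification; the event type string, the source name, and every field inside `data` are hypothetical placeholders, not the real Keptn event schema.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type, source, data):
    """Wrap a payload in a CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "type": event_type,
        "source": source,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


# Hypothetical type, source, and payload fields for illustration only.
event = make_event(
    "example.tests.finished",
    "jmeter-service",
    {"project": "demo-project", "stage": "dev", "service": "carts", "result": "pass"},
)
print(json.dumps(event, indent=2))
```

A payload this small matches what is described for the dev stage: the event says little more than where it came from and that the tests passed.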
1:24:58 What we will see is more in staging. We have added more tests, so we should be able to see this in staging. Okay. So the staging rollout is just about completed. Looks like it's just waiting for the Istio sidecar maybe. Well, you know, one container pending here. Yeah. Well, it should pop up in a second. I'm so impatient. I should just try to watch. That's good. So one is coming up, and the other one is being terminated since we don't need it anymore. That also means that now the tests... we have not made it yet to production
1:25:40 because the quality gate in staging was not successful yet. So in this case, we cannot ship anything to production as long as it's not validated in staging. Okay. So what we'll see is the deployment finished. And after the deployment is finished, right now, the JMeter tests are being executed. There is one improvement coming in the bridge: we will also indicate that tests are being executed. This information is a little bit missing here, because we just see the deployment is finished, and it needs some background in Keptn to know, okay.
1:26:20 Every time the deployment is finished, Keptn will trigger the tests. This information, unfortunately, is missing in the bridge. Yeah. We somehow missed putting this information in the bridge, but we are working on improving it. Okay. So we just need to wait for those tests. So does that take long? Usually, about two minutes. About two minutes. Couple of... yeah. And then I will just reach out to move a couple of things. Oh, have you got another call scheduled there? Yeah. I think I can move it. So I'm really excited here to
1:27:08 get this into production. You know, that's all we want with Keptn: shipping applications to production without breaking anything, and we are good at breaking here. So... No. Look at this. It's starting to click with me now. This UI kinda threw me at first. But what we can see is this config changed event here, and then... I don't know how I missed it, but we have the first event here
1:27:36 and then this is, okay, for the development environment. We're gonna do a deploy. We're gonna run our tests. You know, we get the SLI stuff going on here. It does some evaluation and you can actually see it move on to staging. Like, I can actually start to see how all this flows together now. It's starting to make much more sense to me. In fact, there we go. There's our production now. Events kicking off. So I've had the light bulb moment. I now understand what's going on. This means if I pop over here... okay. So we could see, yeah. And the evaluation,
1:28:11 what did it say? Did we get a good score in the evaluation? That's pretty, pretty good. So we got basically a pass with 90%, which is... no. Actually, with 100%. So what Keptn has done here is it was triggering the tests. The tests have been finished. It was reaching out to query the data for the exact testing time frame, and now we can see that the response time is meeting our requirements. So that's pretty cool. And now we're moving on to production. And in production, it's doing the
1:28:56 same game with deploying, trying to execute tests. Actually, in production, we have not defined any tests. Usually, it's our end users doing the testing, let's call it, for us in production. But, yeah, this is what we'll see. And with the blue-green deployment, we have prepared a second version of this microservice, which is a very slow version that will break the quality gate and will not be allowed to move on from staging to production. I can already spoil this a little bit because the tests will be running for a couple of more minutes since
1:29:37 we are sending the same amount of requests. I think it's 5,000 or 10,000 requests against the service. And instead of two minutes, they will take about eight minutes. And I'm not sure if we want to wait these eight minutes to see this evaluation. It will be red, and it will indicate that the response time for the ninety-fifth percentile is not within the boundaries that we defined. So this is what we will see. Okay. So does that change mean that we have this deployed everywhere? I mean, if
1:30:10 I run get pods in production, that's it. Oh, no. So what's this primary pod that's running? Is this more tests? No. Actually, since we're doing this blue-green deployment, we are spinning up the first version, and we are also spinning up another version. So we prepare the blue-green traffic shift already. This time, it's the first deployment, so we are bringing up the same version twice, and we cannot really roll back because there is no previous version. But if we do more of those blue-green deployments, if we send
1:30:51 the shopping cart in version 0.11.2 or 0.11.3, which are already prepared, then we can see the primary is the one that's receiving the traffic, and the other one is the one that will go away. So in staging, there is only one left, and in production, there are right now two. But after some evaluation period, if there is no rollback indicated, then the other one will just be removed by Keptn. Okay. I understand. Nice. So is that my full rollout of carts finished? In this case, yep. We are finished.
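The blue-green behaviour described here can be pictured with a small Python state model. This is only an illustrative sketch, not how Keptn or Istio implement it: the class name, method names, and version labels are all hypothetical. It captures three points from the conversation: the new version takes the traffic once it is deployed, the previous version is kept around so a rollback can move traffic back, and a first deployment has nothing to roll back to.

```python
class BlueGreenService:
    """Toy model of one service in one stage (names are hypothetical)."""

    def __init__(self, version):
        self.live = version      # version currently receiving all traffic
        self.previous = None     # version kept around for a possible rollback

    def deploy(self, version):
        """Bring up a new version and shift traffic to it."""
        self.previous, self.live = self.live, version

    def rollback(self):
        """Shift traffic back to the previous version, then drop the new one."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.live, self.previous = self.previous, None

    def finalize(self):
        """No rollback was needed: remove the version that is no longer used."""
        self.previous = None


svc = BlueGreenService("v1")
svc.deploy("v2-slow")   # new version now receives the traffic
svc.rollback()          # quality gate failed, traffic goes back to v1
print(svc.live, svc.previous)
```

In the model, traffic is always assigned to exactly one version before the other is removed, which corresponds to the no-downtime property discussed next.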
1:31:34 It would have been possible to just bring up carts in the browser and take a look when the traffic actually switches from one version to the other. We are doing the blue-green deployments with Istio. So that means we are deploying the new version. Once it's ready, we are moving the traffic to this new version. And if we're going back, then we are first moving the traffic to the previous version and then removing the new one. So there's no downtime or anything. That's actually what we already know from Kubernetes,
1:32:10 but in this way we are basically implementing a little bit by ourselves, and we made sure that there is no downtime when switching from one version to the other. Alright. Cool. Is there anything else you'd like to show just now? Pardon? Is there anything else you want to go over? So what we've seen right now is blue-green deployment, the multistage deployment with Keptn, with Helm, using CloudEvents, with Istio, with Prometheus. So I think we have installed a lot and had a lot of technology. Although the parts
1:32:54 that once set up, it's basically just the Keptn CLI that you need, with keptn send event new-artifact to trigger a new deployment, or configure monitoring to set this all up. As said, it just takes a while for the second version to run all its tests, but we have also added the JMeter tests. I'm happy to stay and move through all the other parts of the tutorial. But, yeah, if you want to wrap up here, that's also totally fine. And maybe we can do a follow-up, taking a look maybe
1:33:10 Tutorial Structure & Future Plans
1:33:29 at the auto-remediation parts of Keptn later on because, basically, once we have things up and running in production, it's another use case of Keptn to trigger remediation actions and to evaluate these remediation actions. Okay. So let's... I have no idea. So let's just say what we wanna do. What I wanna be careful with is I don't want to rush anything that we show such that people don't get a good understanding of it. But at the same time, I'm happy to make sure that we, you know, finish
1:34:08 off what we started now. So we can definitely schedule something else and we can cover more of the Keptn features. You're saying if we deploy the slow one, what that's gonna do is show us that the tests fail in the staging phase, which will block the deployment to production. So why don't we trigger that now? We'll see if anyone has any questions, and we can do a little bit of a summary while we wait for that to happen. And then we can schedule a follow-up, and we can dive even deeper into anything that we haven't covered in
1:34:33 particular detail. Sounds good. Yeah. Let's deploy the slow version because it will be quite nice to see what is going to happen: that Keptn will deploy it first, will trigger the tests, then come back with, like, the rollback of this version. Yeah. So that's what would typically happen in my CI system. It says, hey, we have a new artifact. We just built this new release. We tagged it with 0.11.2. We send the event to Keptn. What's happening here is that because Keptn is an event-driven system, it detects this change. So let's... yep. There we go. We can
1:35:08 see here. And that's now gonna start to prepare the deployment to our development environment. And I just wanna make sure we tie this together. If anyone has questions, please drop them in the chat and we'll tackle them. But if we come over to here... oops. Let's go back to the directories. That's right. There we go. This is like the entry point. This describes all of our environments, the shipyard.yaml. And I just wanna make sure we cover this again. So we have a development environment which has a deployment strategy of direct
1:35:47 and a test strategy of functional. So what that means is the deployment strategy will always deploy whenever there's a new artifact. And the testing strategy means it's gonna run the functional tests. And if they pass, it won't roll back the deployment. Is my understanding of that correct? Yes. Okay. So if my functional tests failed when I built that new artifact, it would revert the deployment back to the last working version. If I get it wrong, just interject. In this case, no. So if you do a direct deployment, it won't roll back anything. What it will do is it will block
1:36:26 it from moving on to the next stages, obviously. So if the functional tests fail, or if some quality gate fails, maybe one checking the response time or the throughput or the number of outgoing service calls or whatever, it will block it from moving to the next stage. With a direct deployment, there is no rollback; rollback is only supported in blue-green right now. Alright. Okay. Thank you. That makes much more sense. Then we have our staging environment. And this time, we have an approval strategy of both
1:37:01 automatic on a pass and automatic on a warning. So regardless, it's gonna do its best. As long as the tests pass, it's gonna promote that to the next stage. And our testing strategy here isn't functional, but performance. Now, are these test strategies related to the JMX configurations that we applied earlier? Yes. It's also related to the JMeter service. So it's basically an indicator of which kind of test to execute. The JMeter service can also have a configuration file, but it's not needed. There is no additional configuration file needed. So we only added the JMX file,
1:37:41 which was basically already the test instructions. You can always add a little bit more configuration to configure the number of concurrent users and the number of requests to be sent. But the test strategy already indicates which tests to execute: just functional tests, which are usually a little bit faster, or performance tests that put on more load. Yeah. Okay. So these deployment strategy values, they seem kinda special. So we've got direct. We've got blue-green service. Is there, like, a canary? Are there any other deployment strategies that we
1:38:17 haven't really looked at? Canary is not yet supported. Everything is prepared in the background. So with Istio, it will be very easy to shift the traffic not only from 100% to 0%, but doing 80/20, 70/30, 50/50, whatever you want to do. It's not yet implemented, but this is something that's also coming in the next stages. It's interesting also that implementing canary is not the most common request that we get. It's more the multi-cluster support, actually, that
1:39:00 we already had in the stream that people were asking about. Yeah. Yeah. I can imagine why. Okay. And then the last part of our shipyard: we define our production environment, which has a slightly different approval strategy of automatic/manual, and then the remediation. So one of the things we had at the start, which was really subtle, but I think it's also important... I'd better find the command. But when we onboarded, did we override the deployment strategy on one of these? Yeah. We did here. Right? Yeah. Yeah. Yep.
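The three stages walked through above can be summarised as data. The Python sketch below models the shipyard as a list of dictionaries and encodes the two rules stated in the conversation: a failed evaluation blocks promotion under a direct deployment and triggers a rollback under blue-green, and each stage can name an approval mode per evaluation result. The key names and the exact strategy strings are assumptions that mimic the walkthrough, and the functions are illustrative rather than Keptn's logic.

```python
# Assumed representation of the shipyard discussed in the walkthrough.
SHIPYARD = [
    {"name": "dev", "deployment_strategy": "direct",
     "test_strategy": "functional"},
    {"name": "staging", "deployment_strategy": "blue_green_service",
     "test_strategy": "performance",
     "approval_strategy": {"pass": "automatic", "warning": "automatic"}},
    {"name": "production", "deployment_strategy": "blue_green_service",
     "approval_strategy": {"pass": "automatic", "warning": "manual"},
     "remediation_strategy": "automated"},
]


def on_failed_evaluation(stage):
    """A failure always stops promotion; only blue-green also rolls back."""
    if stage["deployment_strategy"] == "blue_green_service":
        return "rollback"
    return "block"


def approval_mode(stage, result):
    """Look up the approval mode a stage declares for 'pass' or 'warning'.

    Stages without an approval strategy are treated as automatic (assumption).
    """
    return stage.get("approval_strategy", {}).get(result, "automatic")


stages = {s["name"]: s for s in SHIPYARD}
print(on_failed_evaluation(stages["dev"]))
print(on_failed_evaluation(stages["staging"]))
print(approval_mode(stages["production"], "warning"))
```

A per-service override of the deployment strategy, such as the database case discussed next, would amount to replacing `deployment_strategy` for that one service while leaving the rest of the stage untouched.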
1:39:38 So for individual services, I can actually say, you know, I don't want a blue-green deployment of this. Like, I guess the database is a really good example and probably why it's included in this tutorial. You don't want to run two different versions of it at the same time. You may just want to go ahead and pay the bill and get that running as fast as possible. So is that reflected in the Git repository somewhere? If I take a look at something? Mhmm. So if I go to... oh, that won't exist. And put in dev.
1:40:19 There we go. And take a look at carts-db. Is it in the metadata? No. It's actually in the way that Keptn stores it in the Helm chart. So Keptn won't create the virtual services for Istio if we don't need a blue-green deployment. This is how we can see if it's a blue-green deployment or not. In the metadata file, you won't really see it. We are planning to also include this when you're doing a keptn get services or keptn get projects in the CLI, so that we're also including everything that's really
1:40:59 running in the cluster, so that we provide all this information directly with the CLI or with the API. With the API, you can already query all the events that are running through Keptn and also stored in Keptn. With the CLI, we have a couple of convenience functions, like keptn get projects, which already lists the projects in a human-readable way instead of giving you the whole CloudEvent with all the projects. So yeah. Cool. Alright. So it's getting there. It's all coming together now. Let's see what's going on over here then.
1:41:36 So in our dev environment, the tests finished; we only run functional. We don't run performance, which is why we're getting a green evaluation. This is now being promoted to staging, and it's not run the tests yet. So that's running our performance tests now against this red version here. Exactly. So we know the red version is already deployed. Yep. And we know that the tests are running. As said, I would assume they still take a couple of minutes. So normally, it's around five to ten minutes. Normally, it's eight minutes. So in a couple of minutes, we should
1:42:13 see if the tests have finished successfully. So that's also one of the nice parts of this integration: the JMeter service also indicates whether the tests have finished successfully. It might happen that the tests are breaking, and we don't even get, let's say, a successful test result from JMeter because maybe the service is not available, maybe the service was crashing. And this is also taken into consideration by Keptn, not only the SLO file. So if the JMeter file or NeoLoad... sorry. If the JMeter tests or the NeoLoad
1:42:52 tests or whatever test integration you're using is failing, then Keptn will get this information and will also mark the quality gate evaluation as failed since it could not even execute the tests. If the tests have been executed successfully, then Keptn will go ahead and evaluate the quality gate. And our quality gate is very, very small. We only have the response time in the ninety-fifth percentile. You can add more and more SLIs. So we have seen quality gates with 50 SLIs that are combined together, and a total score is generated. And they range from the number of outgoing service
1:43:32 calls to memory consumption, CPU saturation, response time, throughput; all these different metrics are used. And you just define them once in the SLO file. And the SLO file that we've seen earlier doesn't say which service it belongs to. So if you have an SLO file that is like your standard configuration of your quality objectives, then you can reuse it for all of your microservices. And reusing is easy in this way since you can define absolute thresholds, but you can also define relative ones. And I always like to have a combination
1:44:14 of both, so that I say, okay, the error rate has to stay lower than 5%, but it's also not allowed to increase more than 25% compared to the previous builds. So an increase of the error rate by 25% is, like, rather frustrating, let's say. So I can combine the relative and the absolute values, and the same for all the other metrics. And in this way, I think it's very easy to reuse it also for other services since they don't have the name of the microservice or the stage in the SLO file itself. It's just
1:44:57 how you add them to Keptn via the CLI or via the API. Alright. Nice. So this test has been running for seven minutes. So we're expecting this to come back and crash and burn very, very soon. Right? Yeah. Exactly. The next event should be this SLI retrieval. And the SLI retrieval, as we know from Prometheus, is very fast. Usually, you get this data close to real time from Prometheus. Other solutions take a little bit more time to digest and to operate on this data. But if we ask Prometheus for
1:45:40 data, we retrieve it pretty fast, and then we should see the SLI retrieval done and the evaluation of the quality gate pretty soon. And we should then also see the action that is taken by Keptn upon this quality gate evaluation: whether Keptn allows you to promote it to the next stage or whether it will be rolled back to the previous version, and we remember that the previous version was the green version. So we will see what happens. And can I just... I guess if I pull up the pods in the staging
1:46:16 namespace, will that show me the tests? You won't see the tests here because they are executed by the JMeter service that is running anyway. Right. Right. Right. Okay. In previous versions of Keptn, we built upon Knative to allow scaling down to zero for a couple of different Keptn services and only react when there are new events coming in. But, actually, the memory footprint of Knative was already quite demanding, so we decided to remove Knative and just go for Kubernetes and allow all the services to just keep running.
1:46:58 Right. We have our failure. That's actually what we wanted. Yeah. That's good. Yes. So it says here that our SLI retrieval started. It's done. We do our evaluation. We can see that we have quite a large amount of red here, and we can see the result was a fail, which means that this change will not be propagated to production, which is what we want. Which is pretty cool. Maybe we can take a look if it actually did the rollback. There is the service in the deployment finished event. We should see the blue arrow where we can open up the microservice, and
1:47:36 we can take a look, and it's rolled back to green. Nice. There we go. We made it. Awesome. Very, very cool. I like that. We got there. Alright. Nice. Yeah. We did it. Awesome. Yeah. There's just so much going on there, but I think the flexibility and the way that we can encode, essentially, you know, playbooks to a certain degree of how we want our applications to handle failures, how we wanna build and test them across environments, and have Keptn orchestrate all of that for us is a
1:48:01 Guest's Final Thoughts
1:48:12 really, really cool thing. Call me pretty much impressed with what we've seen here today. And based on what we have in this tutorial, you know... let me pull that back up again. It's my stupid testing. We still have forty minutes' worth of other things we could be looking at as part of the tutorial. So maybe if you could just take, you know, a minute or, in fact, maybe we should just reschedule something and not do that. Yeah. Let's do that. I think we've covered so much today. Are there any final thoughts you wanna leave us with? And we'll come back to this another
1:48:42 day. I think it was a very amazing journey: we actually did a little bit of debugging already, and we took a look at the DevOps, sorry, the GitOps approach of Keptn. I think that was very fascinating, also for me, to see how fast we made these changes for the small hiccups in the Helm charts. And, yeah, it was really a pleasure to walk through Keptn here with you and also to see this from the other perspective. Normally, I'm sitting at the keyboard
1:49:18 doing all this. Awesome. Well, thank you for... Sorry, Nico. So I would be happy to join you again and to take a look at the auto-remediation part of Keptn. And once we have everything up and running, we could just do one deployment, send it all the way through to production, and then play a little bit with the auto-remediation parts where Keptn is triggering Ansible playbooks or feature-flagging frameworks and these kinds of things, to also see how Keptn can help once it's in production. Yeah. Definitely. We'll start after this.
1:49:24 Closing Remarks & Thanks
1:49:56 But I think what we've shown today, just the progressive delivery stuff, is really good. I hope people get an opportunity to play with this, and we'll come back with more episodes then, and we'll take a look at all those other features as well. So thank you, Juergen, for joining me today. It was really fun to kinda walk through this and see how it all clicks together and how it works. It's really cool software, and thanks for your efforts. Keep it up. I will speak to you soon. Thanks, David. Have a nice day. Thanks, everyone.