Overview

About this video

What You'll Learn

  1. Show how probes validate steady-state hypotheses before faults are injected.
  2. Build chaos workflows from the public or private ChaosHub.
  3. Track resilience scores from GitOps-synced runs across repeated experiments.

Litmus maintainer Karthik Satchitanand previews Litmus 2.0: the new ChaosCenter portal, public and private ChaosHubs, resilience scoring, GitOps sync, and probes. The demos chain chaos workflows against Bank of Anthos on Kubernetes and inject an EC2 instance failure against Sock Shop.

Chapters

  1. 0:00 Holding screen
  2. 1:00 Introductions
  3. 1:01 Introduction & Welcome
  4. 2:23 Introducing Karthik & Litmus Overview
  5. 3:11 What Litmus is & Chaos Engineering Principles
  6. 4:26 Preview of Litmus 2.0
  7. 5:25 What's new in Litmus 2.0?
  8. 6:40 Evolution of Litmus: Kubernetes Native & Early Features
  9. 7:46 Community Feedback & New Requirements
  10. 10:08 The Need for Probes and Steady State Validation
  11. 14:04 Why Litmus 2.0? (Major Version Upgrade)
  12. 15:00 Demo
  13. 15:15 Hands-on Demo: Litmus Portal Overview
  14. 16:08 Litmus 2.0 Architecture (Portal & Agent Components)
  15. 21:20 Litmus 2.0 Feature: Chaos Hub (Public & Private)
  16. 24:12 Litmus 2.0 Feature: Analytics Dashboard (Resilience Score, Application Monitoring)
  17. 27:35 Litmus 2.0 Feature: Teams & Collaboration
  18. 28:28 Litmus 2.0 Feature: GitOps Integration
  19. 31:47 Litmus 2.0 Feature: Docker Registry Customization
  20. 32:25 Litmus 2.0 Feature: Usage Statistics
  21. 33:01 Litmus 2.0 Feature: API Documentation
  22. 33:25 Q&A: Centralized vs Standalone & Getting Started
  23. 39:14 Q&A: Supported Data Sources (Prometheus)
  24. 42:29 How to Contribute to Litmus
  25. 43:07 Workflow Creation Demo: Bank of Anthos Network Chaos Setup
  26. 45:25 Creating Workflow from Chaos Hub (Selecting Experiments & Tuning)
  27. 49:40 Adding Probes for Validation (Initially Skipping)
  28. 50:57 Explaining Resilience Score Calculation
  29. 53:53 Scheduling Workflows
  30. 54:07 Viewing the Workflow YAML
  31. 55:17 Executing Bank of Anthos Workflow & Observing Impact
  32. 58:59 Viewing Workflow Results & Logs
  33. 1:00:55 Need for Workflows (Chained Failures)
  34. 1:01:22 Other Workflow Creation Methods (Cloning, Importing YAML)
  35. 1:02:08 Git Syncing Workflows
  36. 1:02:58 The Litmus Chaos Center (Centralized Management)
  37. 1:03:49 Chaos Against Non-Kubernetes Entities (AWS, GCP, Azure, VMware)
  38. 1:07:08 EC2 Instance Failure Demo Setup (Weaveworks Sock Shop)
  39. 1:07:56 Steady State Hypothesis with Probes (HTTP, Performance Checks)
  40. 1:08:55 Creating EC2 Failure Workflow (Importing YAML, Tuning)
  41. 1:11:59 Executing EC2 Instance Failure Workflow
  42. 1:12:33 Observing EC2 Impact & Application Metrics
  43. 1:16:44 Recap of Chaos Principles & Litmus 2.0 Capabilities
  44. 1:24:26 Future Directions: Security Chaos
  45. 1:24:50 Community & Contributions in CNCF
  46. 1:25:39 Conclusion & Thank You
Transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:01 Introduction & Welcome

1:01 Hello, and welcome to today's episode of Rawkode Live on the Rawkode Academy. I'm your host, Rawkode. Today, we're gonna be taking a look at Litmus, a chaos engineering project for Kubernetes. Litmus has a 2.x release coming up, and we're gonna see some of the new features available from Karthik, the maintainer of the project. Before we dive into that, there's just a little bit of housekeeping. First, if you've not subscribed to the YouTube channel, now would be the best time to do it. Click that subscribe button, and feel free to tick that bell. That just means that you will get notifications for

1:36 all new episodes of Rawkode Live, and I will do my best to make sure we cover as many cloud native technologies as possible to guide you and help you on your journey into this vast, vast ecosystem. If you wanna come and chat with other cloud native enthusiasts about Kubernetes, cloud native, and anything in between, we also have a Discord server available at Rawkode.chat. Come in there, say hello, and I look forward to meeting you. I also wanna thank all of the people that have now become members of the Rawkode Academy on YouTube. You can do so for

2:06 as little as $1. This just supports the channel. There are also other tiers, which will get you interactive courses, which launched this week, and a very special tier just for the corporations out there that also wanna say thank you. So check that out. I very much appreciate it. Now we're gonna move on to introducing Karthik, a maintainer of the Litmus project. Hey there, Karthik. Welcome back. Hi, everyone. It's great to be back on Rawkode. I think it was an awesome experience the first time, when we talked about chaos engineering and the evolution of cloud native chaos engineering,

2:23 Introducing Karthik & Litmus Overview

2:44 and I'm really happy to be back here. Thank you for having me and giving me this opportunity to talk about the upcoming 2.0. Yeah, it's my pleasure. I had an absolute blast learning about Litmus the first time, and I'm really excited to see what new things you have coming from the project. But just before we dive in, for anyone who didn't meet you the first time around, can you just give us a quick introduction to yourself, who you are, and what you do? Sure. Thank you, and it is wonderful again to be back here. I'm Karthik,

3:11 What Litmus is & Chaos Engineering Principles

3:20 one of the maintainers of the Litmus Chaos project. It is a CNCF sandbox project. And Litmus is a cloud native chaos engineering platform, I would say. It helps you to practice chaos engineering on Kubernetes as well as on the pre-Kubernetes infrastructure that you would have. That's how we're calling it today: EC2 instances, basically, Azure VMs, or your VMware VMs in your data center. You can go ahead and practice chaos engineering. Chaos engineering is not just about injecting faults. It's a lot about hypothesis, what you expect in the system when you're injecting a fault,

4:05 and being able to derive meaning out of an experiment, learn new behavior about your systems, and not recklessly breaking it. So we talked in the last episode on Rawkode about what chaos engineering is and what it is not. We will develop that a little bit today, and I'm very happy to introduce the newer version of the Litmus platform. We are aiming to launch the next release, 2.0, by August at the latest, and this would be some kind of technical preview of some of the features there. Thank you to a lot of the members

4:26 Preview of Litmus 2.0

4:43 in the community who have been actively testing it and providing your feedback on the beta that we have out there. So, yeah, looking forward to interacting with you all. Awesome. Just to clarify there, this is a preview of Litmus 2.0. It's not out quite yet, but it's gonna give people a good taster of all the awesome new things that are coming down the pipe. Alright. Awesome. Okay. So do you wanna talk about some of these new features first, or would you rather just show us in action? What would you prefer, Karthik?

5:18 Yeah. I'll probably give you a couple of pointers. I think as we go along doing the demo, I'll talk about the history of how each feature came about and what requests in the community actually gave rise to that feature. Everything is a resource in Kubernetes, and you have controllers to reconcile those resources. People are used to that model, or that particular paradigm, of running their operations on Kubernetes. All your policies, all your resource details, the applications and their life cycle management, infrastructure, everything is declarative, managed in Git, etcetera. So when we wanted to do chaos

6:40 Evolution of Litmus: Kubernetes Native & Early Features

7:10 in a similar way, we didn't have the solution right at that point of time, and we're talking about sometime in the late 2017 to mid 2019 period. So we came up with Litmus, and the intent of Litmus was to provide chaos experiments, or simple fault injections, which you could define in YAML and which could be carried out by Kubernetes or, in other words, a controller. And after we achieved that, we got to know that there's much more to chaos engineering than just being able to define a fault. You must be able to define the

7:46 Community Feedback & New Requirements

7:54 blast radius for your experiment. What is it that you want to impact? What is it that you do not want to touch? How restricted should your experiment be? And then we came across the necessity for automating these experiments in a good way, or scheduling them. Sometimes you want chaos to run in a random fashion. Sometimes you want it to run in a strictly scheduled or sequential way. And one of the things we learned as we went along is that, while chaos engineering as a discipline has its very solid principles, or, as you can say, principles

8:42 tested over time by all the SREs, that you need to be able to run in production, that you need to be able to run various events, etcetera, and while that is true, we found that people practicing chaos often tend to use the tool in mixed ways. They use it as a testing tool for carrying out simple failure testing with very, very defined expectations on what happens when you inject a fault. But they can also use it to run things in a freestyle, exploratory way where you're learning about the system. You basically don't have a set

9:20 expectation. You're just trying to see what is happening. Of course, you have some hypothesis, but then you're okay to learn about it, and you're doing it in an environment that is very close to production but not production itself. Yeah. All this evolution around shifting right and testing in production is great, but after people started adopting the cloud native paradigm, started rearchitecting their applications, and started trying out Kubernetes as the deployment environment of choice, there were too many things happening at once, and people and organizations were not confident to, you know, carry out chaos only in

9:58 production. But they still need chaos to test out the efficacy of what they're trying to build. So they want to do it in pre-production environments. And when they are doing that, they wanted some aids, things like probes, for example. You're carrying out an experiment. You're injecting a fault. You also want to see what is happening as the fault is getting injected, and the "what is happening" part might not be manual all the time. We have talked about observability here. People might be peering into their observability dashboards, getting information. But sometimes, they're just running it

10:08 The Need for Probes and Steady State Validation

10:38 as a background service, injecting faults at random times or on a schedule on their clusters. They also want to factor in what's happening in their environment even as they inject that fault. So that needed to be built into the experiment itself: the steady state hypothesis validation. So that was another interesting thing. So we've talked about blast radius. We've talked about the steady state hypothesis coming into the experiments. Then the need came up to schedule it and run it as a background service of sorts. So these are new things we learned and new requirements we got as Litmus was being adopted

11:15 in the community. Then we had a very interesting requirement that came after that too. People basically said, okay, I have a fault, but the fault is very complex, or rather, I wouldn't call it a fault. Let's say it is a situation that I've encountered in my environment, and that situation came about as a result of some failures happening over a period of time. It's, like, built up over a period of time. You can call it chained failures. Maybe you had a node that sort of went into a degraded state because of some maintenance activity. Then you have the other node,

11:54 which is probably getting more load, or your pods got rescheduled there. And then you're hitting a resource exhaustion problem on that node. And you have fewer nodes to work with. You already have one in a degraded state. So your applications will take time to get scheduled. There will be evictions. All these kinds of problems sort of get built up over time. So let's say I'm simulating those kinds of scenarios; how can I do it? With Litmus, when we started out, we were able to inject faults, standalone faults, at a given point of time. But now we needed the ability to string together

12:29 multiple issues or faults to give a larger scenario that you would experience, and then find out how your application has behaved. That gave rise to what we call workflows. We did get a glimpse of it in the last Rawkode discussion as well, but we will still elaborate a little bit more, especially on the different ways of creating new workflows. Then people said, all this was great. Now I have decided to practice chaos. I understood what it is and all that, but there are a lot of clusters that I have, a lot of test or staging grounds,

13:09 as you can call it, dev clusters. And I find it tedious to go and install a set of microservices in each of those clusters and then carry out chaos individually and monitor them. I would need something simpler to do it. That's when we brought in the chaos center, or the Litmus portal as we call it, where you can attach other clusters or register them into the portal, and you have some agents residing in those clusters to carry out the chaos process and give you some visibility in a centralized location. So these are requirements that came in organically,

13:47 sort of not exactly in the sequence that I'm talking about right now, but these were things that we gathered from the community and built in. And since there were a lot of changes that went in, we thought it's worth doing a major version upgrade. Litmus was 1.2, 1.3, and so on, and we hit 1.13. And then we thought, okay, there are a lot of changes coming in, so we should move it to the next version and do a major version upgrade. So the version upgrade sort of reflects

14:04 Why Litmus 2.0? (Major Version Upgrade)

14:23 an addition of features rather than being a completely new version that doesn't work with what was there. It's completely backward compatible. People can continue to use Litmus with purely the operator and the chaos resources as they were. But the portal is probably something that we will invest time on to make it easier for folks to carry out chaos engineering in a simpler way, a scalable way, etcetera. And of all the new features that I talked about, some of them are directly implemented within the portal, and some of them are not really

15:00 Demo

15:01 portal related. They just went into the infrastructure itself, the back-end orchestration infrastructure of chaos itself. So these are the things that we will try to go through today in some detail. And I've just pulled up my demo environment, which I created some time back. So there might be some real chaos that we might see naturally happening. So I hope the demo gods are with me today. So with that bit of introduction, let me share my screen. Yeah. Awesome. Alright. So this is the Litmus portal, and I will just log out to start

15:15 Hands-on Demo: Litmus Portal Overview

15:50 afresh. And let me pull up my Lens IDE, which I'm using. I have added these clusters into Lens just to show you the pods that are running, the microservices that we are running as part of Litmus, or, what do we call it, the chaos center. You can create different users now. And this dashboard that you see is the first page that you're going to come to, the home page, which tells you some past history, some past workflow runs, what agents have connected here, etcetera. It also gives you an option to

16:08 Litmus 2.0 Architecture (Portal & Agent Components)

16:52 schedule new workflows. And I think my Lens is up right now. So let me go ahead and show you the pods that are running. I have the Litmus namespace here. You can see these pods. Out of these, the Litmus portal server, the front end, and MongoDB form the control plane. So this is what is powering this portal. The front end is the UI component; we just call it the portal. The server is a GraphQL server, and it also has an auth server embedded within it. And there is Mongo to store all your chaos

17:35 operation details. And what we have here, the other ones, the exporter, the operator, the event tracker, the subscriber, and the workflow controller, are part of your agent infrastructure, or the deployment that will run in the environment where you want chaos performed. So you could have just these three components with the checkboxes to run the chaos center, and you can go ahead and add agents, which will run the components with unchecked boxes. So the operator is at the core and reconciles the chaos resources, the chaos engine, the chaos experiment, etcetera. So the chaos experiment is a template that describes

18:22 the fault. The chaos engine is the one that maps an application instance or an infra component with a particular fault and also provides scope for defining the steady state validation and other properties. The subscriber is the one that speaks to the portal server and helps track the progress of workflows in the portal. And the workflow controller here is an Argo workflow controller, which is the one that is helping us execute the workflows, or the steps within a workflow. We use Argo because of its immense capabilities in sequencing or ordering actions within a workflow. So you could do

19:07 things in parallel, you can do them in sequence, and you can define a lot of other characteristics. You can do cleanups, conditional retention of pods, a lot of flexibility. So we did not reinvent the wheel, and we've brought on Argo as a workflow engine, of course instrumented with some Litmus images, so it understands the Litmus API. So this is an add-on. There will be other ways of running workflows, which is going to be a native way of doing workflows. And there'll be other tools that one might want to integrate with it over the period of

19:39 time, which we will take on. The architecture is also being modularized to add workflow controllers of choice. So right now, by default, we have Argo coming up. And this exporter is a Prometheus exporter for chaos, to show chaos metrics. The event tracker helps in event-triggered chaos. You could set up some policies on when you want to run chaos experiments, maybe based on some actions on your cluster. Let's say when you upgrade an app, you want some sanity runs, and you would set up an event tracking policy to identify that there has been an app

20:16 upgrade, so it basically goes and launches chaos. What chaos to do there is something that you can describe via annotations. So that's something that we can take a look at later. So Litmus agents are the concept we were talking about. There's something called the self agent here. The self agent represents the agent on the local cluster, that is, the cluster where the chaos center has been deployed. So it automatically registers itself as a chaos environment. So you can start doing chaos against microservices living inside of this particular cluster. Yep. And I have another agent that's added

20:58 here. I'll show you the process of adding an agent. Mhmm. It is another cluster which has only agent components, and there's a Kafka StatefulSet running there. That's the demo that I think we covered last Rawkode episode, so we'll probably not do the same demo. But I'll just show you how you can connect an agent. Now let's go ahead and take a look at some of the other sidebar items and what the improvements have been. So there is a chaos hub. So this is the integrated or embedded chaos hub within the portal. You know about the chaos hub. It is

21:20 Litmus 2.0 Feature: Chaos Hub (Public & Private)

21:34 some kind of marketplace that has the experiments or faults listed, and they are categorized by different categories, different kinds of classification. The generic experiments are the standard Kubernetes pod, resource, network, or other faults. Mhmm. Then there are AWS categories. There are application-specific categories, and we also have VMware experiments that we serve. And I will talk about chaos against non-Kubernetes entities from the portal, which is something that is new, something that we improved upon since last time. So that's one feature. Going back to talking about the hub, you can connect either a public chaos hub or you can

22:30 connect a private one as well. Let's say you have some kind of air-gapped environment you're operating out of; you can connect to local Git repositories. It just needs to have experiment entries within the repository in a particular directory structure, as you can see in this repository called chaos-charts. So chaos-charts here is the canonical back end for the hub, and it has all these experiments listed inside the charts folder. These are not Helm charts. We just call them the Litmus charts, and each of these is a category.
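
Editor's note: the directory structure described here can be sketched in a few lines of Python. This is a minimal illustration only, assuming a hub laid out as `charts/<category>/<experiment>/`; the function name `list_hub_experiments` and the demo category and experiment names are hypothetical, and a real hub repository may hold extra files or folders (metadata, icons) that this sketch does not treat specially.

```python
import tempfile
from pathlib import Path


def list_hub_experiments(hub_root):
    """Return {category: [experiment, ...]} for a hub laid out as charts/<category>/<experiment>/."""
    charts_dir = Path(hub_root) / "charts"
    hub = {}
    for category in sorted(charts_dir.iterdir()):
        if not category.is_dir():
            continue  # skip stray files at the category level
        # Assumption: every sub-directory of a category is one experiment.
        hub[category.name] = sorted(
            entry.name for entry in category.iterdir() if entry.is_dir()
        )
    return hub


# Build a throwaway hub on disk so the sketch runs without a real repository.
# The category and experiment names below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    for category, experiment in [
        ("generic", "pod-delete"),
        ("generic", "node-drain"),
        ("aws", "ec2-terminate"),
    ]:
        (Path(tmp) / "charts" / category / experiment).mkdir(parents=True)
    demo_hub = list_hub_experiments(tmp)

print(demo_hub)
```

Pointing the same function at a local clone of a private hub would give a quick inventory of which categories and experiments that hub exposes, provided the clone follows the assumed layout.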

23:10 And you can go inside that for an individual experiment. Each experiment in each category is defined by some information placed inside of the chart service version YAML, which provides the source for all the information printed on the hub. It is very similar to the way OperatorHub is built. We took inspiration from that and created something that's very similar with chaos-charts. You could fork this, place your own experiments, and connect it as a private chaos hub as well. You would use experiments from here to pick and create new workflows. So that is a new development that we

23:51 did here, and this came about as we were working with enterprises and organizations which prefer their own sources. Of course, they would commit stuff into upstream, and we're comfortable with it, but they would have local sources for experiments. The other capability came in the form of the analytics dashboard. So one of the requirements that we had is to be able to ascertain the progress we are making in our chaos practice. Let's say you had some experiments against some microservices. You got a result, and you would want to compare your results

24:12 Litmus 2.0 Feature: Analytics Dashboard (Resilience Score, Application Monitoring)

24:39 over a period of time. How would you do that? How would you put a metric or a number against the experimentation process that you did? So we have something called the resilience score. I think we've touched upon it in the previous episode. There have been some improvements there, so we'll talk about it when we get to running the workflow. So you can compare workflows here in the workflow dashboard tab of analytics, where you're essentially comparing different workflows that you ran over a period of time and what resilience scores they achieved. And there's also the capability

25:19 to do application monitoring, not so much analyzing results, but looking at how application behavior changes as we go along. And this is something very similar to what you would do with, let's say, Grafana, CloudWatch, or something like that. So you have a data source. We support Prometheus, where you can connect that, and you can create a very simple application dashboard. It uses Blockly, and the JSON structure for the dashboards is very similar to how we would construct a Grafana dashboard. We have some features coming in closer to GA that would help with

25:59 creating new panels or creating new dashboards from the portal, but for now you need to construct and then upload them. So those are the improvements that we've made in the analytics section. And we got that requirement because there are folks who wanted to view what happens to their application within the portal without having to move to, let's say, a different application or a different browser to open up some other app. How much you use this is left to the user because a lot of people already have observability infrastructure in place.

26:43 They already have application dashboards set up, and they would only need metrics from the chaos framework to be able to correlate what's happening to the application when chaos starts. They can consume metrics from the Litmus side. But folks who've not really set up something already can maybe make use of the application dashboards. They also have more intuitive data corresponding to the chaos. They tell you when a chaos experiment started and ended, what the score is, etcetera, when you hover on the graphs of the applications. So that's something that we added in recent

27:30 times and something that I just wanted to report. As far as the settings go, this is a standard accounts page where you set your preferences. The next one is for teams. One of the useful features of the chaos center is how you can invite other folks in your organization or in your team to collaborate with you on chaos. So you might have people looking at results. You might have people constructing new workflows and running them. You might have people just gathering some stats, etcetera. So you can create users and invite them into the team in different roles

27:35 Litmus 2.0 Feature: Teams & Collaboration

28:15 as editors, as users of all types, and you can collaborate with them. Yep. For GA. The other feature here is for GitOps. This is something that we created in Pixabnance. So, Rawkode, this is very interesting, and we sort of wanted to apply the best practices of GitOps that are being used for app delivery to the chaos engineering practice. That is one way of looking at it, one requirement that we encountered. The other one was how we integrate with the standard app delivery GitOps flow. So you have Argo CD or Flux or

28:28 Litmus 2.0 Feature: GitOps Integration

29:04 a configuration management tool like Spinnaker that you set up. And there's a change in the source of your application, or maybe there's an image that has been pushed. You've set up some automated monitoring and upgrades. So your tool is basically going to go ahead and sync your source with the cluster. It's going to go and update the cluster with the latest version of your app. And when you do that, especially when you do this in staging environments, which is probably where you might first apply the GitOps upgrade flow before you promote it to production,

29:47 you might wanna run some sanity checks on a staging environment which closely mimics production. You might want to see how this new change in your application works, how it holds up against different kinds of failure scenarios. So you could run something like a sanity test, and you could create event tracking policies. I think we mentioned it when looking at these deployments a while back. You can set up some policies to ensure that your application undergoes some chaos, predefined chaos workflows that it can subscribe to via annotations, and you will get a result of that

30:31 which helps you take a better call on what to do next, whether you promote this to production or you go fix something, etcetera. So the first part of the GitOps story that we mentioned is about workflows that you create, which get synced into a Git repository. So I go ahead and enable GitOps, and I basically say, this is the Git repository where I want my chaos artifacts to be placed as a golden copy. So whatever workflows I create here get synced there. And these workflows, which have been synced to

31:08 Git and maintained there, can then be subscribed to by some application residing in a cluster by means of an event tracker policy. Though it's not mandatory that you have the artifacts stored in Git to be able to pull them and run them as part of event-triggered chaos, it is still good to have, so that, let's say, you make changes in Git and the same gets reflected on the portal. The next time the chaos runs in an automated fashion because of the event tracker policies that you set up here, it picks up the latest version

31:44 of it. So these are the various settings, and there's another tab here. It's mostly for day-two purposes. For people who run their own registries, which is most likely the case today, you might want to replace the images within a workflow with images coming from your registry. You can make some custom changes to the workflow just before you run it. So that's to ease up, you know, maintaining your workflows. So these are the various features. There's something called usage statistics here. It's more for

32:25 Litmus 2.0 Feature: Usage Statistics

32:27 a high level overview of how your chaos engineering practice is going: how many users you have on the platform that are participating in chaos, how many projects there are, mapped to what requirements or what teams or what services, and then how many agents you have, how many workflows you've run, how many have succeeded, and how many have failed. A lot of those statistics can be viewed and downloaded as reports to give you some kind of information on how things are going. And yeah. So we have API documentation, which is basically going to help you to utilize this platform

33:01 Litmus 2.0 Feature: API Documentation

33:09 without necessarily using the dashboard. So that's the refurbished portal dashboard for you that's coming up. At this point, I'd like to see if there are any questions before I talk about the next phase, running workflows. We will have some sample applications, some sample microservices apps that we'll use to run these workflows. So are there any questions at this point? Yeah. I've got a couple of questions, and we've one in the chat that's about getting started with chaos engineering, so we can tackle that as well. But we'll start with the changes to Litmus

33:25 Q&A: Centralized vs Standalone & Getting Started

33:54 first. So this new ability to control agents in remote clusters, do you see this being the standard way now for people to really adopt chaos engineering, by having, like, a centralized chaos center and reaching out to those clusters? Or do you see a world where it's really just personal choice and people, you know, will still prefer to go down and deploy each chaos center into their own clusters? I mean, what's your opinion on that? Is there a better way there? It's a great question. I think, from the way our community interaction has gone,

34:32 we see that there are different personas using chaos. Chaos engineering has become really popular in the last, let's say, year or so. That's not to say that it was not being practiced in the earlier days, but somehow there was this perception that it is done by this, you know, dedicated group of experts sitting somewhere carrying out a game day, the SREs, who really are the ones doing chaos engineering. From that perspective, it is sort of changing. There are more people getting involved at various levels of the organization, or various levels of app delivery, I would say. There are these SREs

35:13 who are doing it in a very controlled way. The game day tends to be a very popular way of doing it. Then there are other folks, let's say, engineers, who are looking at chaos and sort of adopting it as the de facto way to do exploratory failure testing, if you will. So the definition of chaos engineering is becoming a little more broad, I would say. It's more like the principles of chaos are being applied at different stages. And, therefore, the consumers of the tool that provides you the chaos engineering capabilities are manifold. And some of them like

35:54 a centralized approach to managing chaos. So they have this mandate and they have this onus, and they say: you have these clusters. These are all staging environments. You are a group of folks who ensure that all services getting on to staging need to be good, need to be validated, and we need to be able to derive some meaning out of that validation. And there's a single place where we need to collect all that info. I think for those kinds of folks, the ChaosCenter is useful. But like you said, there might be practitioners of chaos, more advanced practitioners of chaos,

36:31 who might think: okay, I'm comfortable doing things as Helm charts and templates. I manage everything by myself. I choose to do it on specific clusters as a standalone deployment or installation, and I'm able to make observations myself and probably have my own tooling to glean results from whatever resources or other data Litmus is putting out, and it need not be packaged a certain way. I don't need any help in inferencing data. That person also exists. And the way we see the portal, or ChaosCenter, is: of course, it has a UI component, the dashboard,

37:12 but there are APIs. There's an API server which is pretty useful and becoming more powerful in time. So you can do a lot of things using a CLI or the API directly. You can have a GraphQL client in Python or Golang that you make use of to construct workflows, run things, maybe consume analytics data, etcetera. So to answer that question, to summarize, I think both will coexist for some time: the consumers of the ChaosCenter are one persona, and the others are those who go directly with a per-cluster installation and manage things their own way. I think both will

37:58 continue to exist for some time. We had a lot of people asking for this capability, and we sort of thought this could cater to a specific audience and help them get started with chaos. Maybe this is an easy way to sort of break that barrier and start doing experiments, because it feels pretty simple to construct workflows, run them, view what's happening, and things like that. Then once you have that familiarity, you'll probably no longer need this, or you'll probably just use the API server directly without having to use the

38:34 dashboard. So, yeah, I think both exist. Alright. Thank you for that. I think we're seeing this pattern across the cloud native landscape. I know a lot of GitOps tooling now is starting to do that idea of remote management and centralizing your Git operations as well. So I think this ties in really nicely with that approach, as, you know, we're starting to turn the corner on organizations having lots of small clusters instead of massive large clusters. And the remote management aspect there is appealing and just really gathering a lot of steam.
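
Coming back to the API route mentioned in the answer above (a GraphQL client in Python or Golang instead of the dashboard), here is a minimal sketch of what preparing such a call could look like. Only the general GraphQL-over-HTTP shape (a JSON body with `query` and `variables`, sent as a POST) is standard. The endpoint URL, the bearer-token header, and the `listWorkflows` query with its `projectID` argument are hypothetical placeholders, not taken from the Litmus API documentation, so check the real schema before using anything like this.

```python
import json
import urllib.request


def build_graphql_request(endpoint, query, variables, token):
    """Build (but do not send) an HTTP POST carrying a GraphQL query.

    The JSON body with "query" and "variables" keys is the usual
    GraphQL-over-HTTP convention. The Authorization header format is
    an assumption for illustration.
    """
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed auth scheme
        },
        method="POST",
    )


# Hypothetical query: the operation and field names are placeholders.
LIST_WORKFLOWS = """
query ListWorkflows($projectID: String!) {
  listWorkflows(projectID: $projectID) { name }
}
"""

if __name__ == "__main__":
    request = build_graphql_request(
        "http://localhost:9002/query",   # hypothetical endpoint
        LIST_WORKFLOWS,
        {"projectID": "demo-project"},   # hypothetical project id
        "example-token",
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test before pointing it at a real API server.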

39:09 Great answer. Thank you. Okay. I have... And on the other question... Yeah. Go ahead. No. It's okay. Go for it. The other question, on getting started. So the product docs of Litmus are here. This helps you to get started with Litmus 1.x, purely based on the chaos operator, with the chaos CRs, and executing it in individual clusters, etcetera. If you are happy being beta testers for Litmus 2.0, you can take a look at the Litmus docs that we have on Netlify. That is the Litmus docs beta Netlify app. Then

39:14 Q&A: Supported Data Sources (Prometheus)

39:58 go to the master version. You can see a lot of information that's being put here. And this is one of the things that we are at right now: trying to improve the documentation and put in more details of how you can use Litmus 2.0. Sorry about that. So this is a good place to start, and I would recommend you take a look at it if you want to get started. And in case you just want to run the bare minimum experimentation with the chaos operator, the main Litmus docs site is the one.

40:40 Awesome. Thank you for that as well. I have one more question, and then I'll let you jump on to the workflow thing that you said you were going to cover. With regards to the data sources, what kind of remote data sources are supported there? As of today, it is Prometheus. Okay. That's what we support. There is a Prometheus exporter from Litmus that has been in use for some time. So you could add a Prometheus data source, start creating panels to view your application behavior, then you could sort of get your chaos metrics into the same

41:19 data source and superimpose them against your app metrics to see what's happening to your app during chaos. Yeah. I think it would be really interesting if we... here's my feature request for you now, Karthik. But, you know, when we do chaos... I speak from my own experience. I won't say we. When I do chaos engineering, you know, I'm generally trying to see how my app is for customers during times of turbulence and chaos. And I wonder if we could have Google Analytics as a data source and be able to actually see, you know, JavaScript rendering time and, you know,

41:52 time to the page being available, the data coming down and all that. I think that would be a really interesting insight into how the chaos affects my customer's point of view. That's a great feature request. I think it makes a lot of sense. Yes. I think that's something that we would be very happy to take up. Thanks for that feedback. Yes. I think... Yeah. I'll make sure I open an issue on the GitHub for that, so don't worry. Okay. Do you wanna take us away with the workflow thing? Oh. Oh, sorry. On you go.

42:28 On that note, this is where you can create issues on Litmus. So litmuschaos/litmus is the repository. Please feel free to create issues here, and we will definitely get back on them. We'd be happy to consider them, so feel free to put your thoughts there on those issues. Thank you. And, of course, pull requests welcome. Right? Yeah. So you could start off a new... I mean, you could participate in a discussion, start up a new discussion thread. That works as well. Nice. Alright. So let's go to the workflow creation part. And before we do that, let me show you

43:07 Workflow Creation Demo: Bank of Anthos Network Chaos Setup

43:17 what I have. There is an application called Bank of Anthos, which you might be familiar with, provided by Google Cloud Platform. It's a very interesting microservices app: it has different kinds of services, with different platforms being used to create these microservices, Python, Java, etcetera. And it has all the services to give you the experience of a banking app. So we have Bank of Anthos with the balance and your transactions, and it has options for you to deposit funds or make payments, etcetera. So it's a cool example to get started with for chaos, to see what happens.

44:07 So what we're going to do is we're going to inhibit network traffic on the balance reader app. So that's one of the applications, one of the sub-pieces. In all this detail, I'm able to read this balance because of the balance reader. And also, when we make payments, it needs to see how much balance we have in our account before it makes the payment. So the payment service, or whatever, tries to speak to the balance service. So what we'll do is create a 100% packet drop. You could call this, like, a

44:47 black hole. Basically, nothing goes through to that service, and we see how that impacts our experience, how it causes a degradation in the user experience. Let's say I'm a customer of Bank of Anthos: how does it impact me? But with this, we will also see... the intent of this part of the demonstration is to see how you can construct a workflow by picking experiments from the integrated ChaosHub, and how you can tune it for how long you want to run it, etcetera. So let me go ahead and click on

45:25 Creating Workflow from Chaos Hub (Selecting Experiments & Tuning)

45:26 schedule workflow. I'm going to select the self agent because Bank of Anthos resides in the same cluster where my portal is installed. So it happens to be the self agent. So let me go ahead and click next. So there are different options here. You might have taken a look at some of them before, but they've been improved, especially the section here on creating your workflow from the hub. So in this drop down, I have the ChaosHub, the single public ChaosHub that's embedded here. So I basically go ahead and select this, and I give it a name,

46:07 Rawkode Anthos black. And I have the option of going and selecting a new experiment. Let me add a new experiment. I'm interested in network loss, so let me select that from this category. There's a filter that you can use. So once I select this, I can go and tune this experiment. So, basically, I click here and I can see details that show which hub I've taken the experiment from, what the name of the experiment is, etcetera. When I say next, I have the option of indicating the application against which we will do

46:56 this fault. Now this is something like asset discovery, we can call it. In other words, this is discovering the microservices on your Kubernetes cluster. The agent performs this task, the subscriber, to be more precise. So I'm looking at an application in the default namespace, which is where my Bank of Anthos resides. There are other namespaces too, as you can see. And the balance reader is of kind deployment, and it has an application label, app equals balancereader. That's the one that I'm going to target. Annotation check here is a way to further

47:37 do filtering of applications or to increase blast radius control. If you have multiple applications that share a common label, because that's how you deploy them, and you want to select just one of them for some reason, you can annotate it with a specific annotation, and then you can force Litmus to check for the deployment which carries the annotation before it does the chaos. But it's not mandatory. Mind you, I just have one instance of the balance reader, so I set the annotation check to false. There's a cleanup policy here, which asks whether I want to keep

48:17 my pods or I want to clean them. I want to keep them because that's how I can see logs on the portal. The workflow visualization graph in the portal gives you the opportunity to pull the logs for the experiment that has just executed, and it will be able to pick logs from a live or an existing pod on the cluster. We do not store the logs in the ChaosCenter. We just retrieve them on demand. So I would like to keep those pods in order to show you those logs. There are also options to set node selectors. For

48:52 example, like you may all know, when Litmus executes a fault, it runs it as a Kubernetes job. And you might have preferences on how you want to run this job, where you want to run it, etcetera. If you have a specific node... there is this practice where people put business applications on certain nodes. They have dedicated some nodes to run their main application services, either by making use of affinity policies or tolerations and things like that. And they have another node dedicated to running some third party services or tooling, so to say. In case you're just

49:31 doing it like that, you can provide the node selector for those details. So let me click next. So at this point, I have the option of adding a probe. A probe is a way to validate the steady state hypothesis, and there are different kinds of probes. We will come to that in the next workflow that I run, which is against another service. For now, let me not do any validation. I'm just more interested in doing the workflow and seeing the fault. So let me click next. And this is where you get to tune the duration of your chaos

49:40 Adding Probes for Validation (Initially Skipping)

50:10 and add other variables. There are different tunables that a particular experiment supports, which you can see in the experiment documentation. So let's say, for network loss, you can see that there are different tunables here that you can provide in case you're interested in doing that. Many of them are optional. So you can provide them by adding a new key. And I'm interested in just taking the defaults and running with them. So click finish. Revert schedule is to keep the chaos resource and not clean it up. This is for the ChaosEngine resource, and it is set to false.

50:53 And when I go ahead and say next, this is the step which you might already be aware of: providing the criticality or weight of a particular experiment within a workflow, and this is going to influence the final resilience score. The points that we provide here are multiplied by the success factor of the experiment. The success factor is retrieved as the execution completes. So you have a percentage success for that fault depending on how many probes were successful or how many native checks within the experiment were successful. When I say native check, let's say you have a pre-chaos check.

50:57 Explaining Resilience Score Calculation
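
The scoring rule explained in this chapter (each experiment's points multiplied by its probe success percentage, summed over the workflow, then divided by the total points) can be written down as a short sketch. The class and function names are illustrative and are not Litmus code. The 10-point, 100% case matches the Bank of Anthos run in this demo, while the second, mixed example uses made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class ExperimentOutcome:
    name: str                  # label for the experiment (illustrative)
    weight: int                # points assigned to it in the workflow
    probe_success_pct: float   # success factor, between 0 and 100


def resilience_score(outcomes):
    """Weighted average of probe success percentages, in percent."""
    total_points = sum(o.weight for o in outcomes)
    if total_points == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    earned = sum(o.weight * o.probe_success_pct for o in outcomes)
    return earned / total_points


if __name__ == "__main__":
    # Single experiment, all points, every check passed.
    print(resilience_score([ExperimentOutcome("network-loss", 10, 100.0)]))

    # Made-up two-experiment workflow: (10*100 + 5*40) / 15
    mixed = [
        ExperimentOutcome("network-loss", 10, 100.0),
        ExperimentOutcome("second-experiment", 5, 40.0),
    ]
    print(resilience_score(mixed))
```

Because the result is a weighted average, a failing experiment with a high weight pulls the score down more than a failing low-weight one, which is the point of assigning criticality per experiment.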

51:32 I want my services to be in such-and-such state before I carry out the chaos, because I don't want to do chaos against an already degraded system. And then after you run the fault, maybe there is some particular state you're looking for, and that's a check that you can do. These are pre- and post-chaos checks. They're also part of the steady state hypothesis validation. So depending upon how these checks went and how the probes went even as your fault happens... there's something called a continuous mode in the probe, which happens in parallel as you do the

52:09 fault. So it checks in real time how your app behaves, not just the recovery as in the native post-chaos checks. So all those checks contribute to you getting a success factor. We call it the probe success percentage. It's anywhere between zero and 100%. And that is multiplied by the points that you gave for a given experiment. And the summation of this for all the experiments you have within a workflow, divided by the total points, gives you the resilience score. It's a metric that helps you to understand where you stand with respect to resilience for a particular scenario or workflow

52:51 for a given app. So this is the one that connects an application or an infrastructure component with a scenario and then tries to figure out the score that you have for it. And that is something you can use because it is quantified. You will be able to compare it over time to see how it has improved or maybe degraded, or even, if you want, to see how it changes across infrastructure. So when you run the same fault against an app deployed, let's say it's a Kafka application deployed with a particular storage class, on a particular CNI

53:26 or a particular type of node, versus when you repeat the same thing in some other, completely different environment, you will be able to see some differences sometimes, which are very instructive. So you're learning more about the system that way. So this is going to help in that process. So I'm going to give all the points because there's no other experiment in this workflow, just this one. So I just give it all the points. I click next. At this point, I can schedule it once or I can schedule it repeatedly. This is standard cron, where you go ahead and keep running those

53:53 Scheduling Workflows

54:03 workflows anyway. I'm going to just run it once. So at this point, I can view the YAML to see a summary of what is there. The YAML may be familiar to you. It might look very much like an Argo one, just of kind Workflow. And we have two steps. The first step is to install the template for a fault. This describes the fault itself, and it is pulled from the hub that is integrated into the portal. And the next step is to launch the engine, which connects your fault with your application.

54:07 Viewing the Workflow YAML

54:41 So you have something here of interest. We can see there is balance reader in the default namespace, and we have the duration here. And we have the network packet loss percentage. By default, we assume that the cluster is using Docker and this is the socket path, but that can be different. You can change it to containerd and give a different socket path, etcetera. It makes use of the runtime APIs to inject the network fault in this case. Let me finish the flow and go to the workflow. So this is going

55:17 Executing Bank of Anthos Workflow & Observing Impact

55:19 to run. And there will be some pods, transient pods, created in your litmus namespace, which are now doing these tasks of pulling the experiment template from the hub and installing it, and then launching the ChaosEngine to trigger the actual fault injection. So that can be seen in this particular visualization. There's also a table view that you can use. And as the experiment gets underway, you will be able to see the impact caused on Bank of Anthos. So right now, we're just getting initialized with the actual fault. So, back to the app. This looks good. You can see balances.

56:07 Alright. We're able to read it. We're able to deposit funds and do all these activities. Very soon, I will not be able to read this balance as the... yeah. You can see I made a deposit right now, and I cannot see how much it increased. So that's not good. You would not like to see these kinds of things happen in real time, especially with banks. So the next step: let's say you make a payment. You won't be able to make it because you're not able to read the balance. So this has been deliberately set up, and

56:49 then we do not have any... yeah. You can see the payment failed. It was not able to proceed. In fact, there's a bigger exception that you can see in the logs, and this is something that you would really want to avoid for your applications. So we've deliberately set it up without the mitigation, so you will see the error. It just illustrates the fault in progress. So we are going to do this. We are going to keep at it for some time, sixty seconds, as you saw in the engine manifest. And once that is over,

57:30 the balance will be available again. We will stop the chaos and revert it, and you will be able to see that things are restored. And the experiment makes a check to see if the app is restored, if it is healthy, before it actually concludes. If it finds that, post the injection, the app was not able to recover, or let's say the self healing in your infrastructure is not working, then that's an alarm bell, and you will have the notifications to go ahead and check. You can send notifications based on these metrics, but you can have other means of verifying

58:08 this if you have your own tooling. We have some alerts coming into the portal, but that's going to be post 2.0. So this is about how your application came up. So let me go ahead and reload the app once again, and I think I should be able to read my balance because the duration has elapsed. Yeah. This looks great. And this is just going to finish its post-chaos checks and complete the experiment. You will see a pass here in this experiment because it actually ran successfully. The fault ran successfully and the

58:56 pre- and post-chaos checks were good, but we did not really validate the behavior of Bank of Anthos as the fault was happening, which I think is pretty important. So in the next workflow, which we run using a different technique, we will upload the YAML file instead of constructing the workflow through the hub integrated into the portal, with validation intent also added. You can see the logs here. We retained those pods to be able to see the logs. So you will find those details here. The ChaosResult is the resource that carries details of the chaos. It says probe success percentage

58:59 Viewing Workflow Results & Logs

59:35 is hundred. And since we gave 10 points and this was the only experiment, that is the score as well. But this doesn't have any validation in it. But you can see some details. There's a history of the past runs. If we repeat this, there will be some historical information. And we will also have details of, let me just show you that on my console, the successful injection and the revert status of your chaos. Let's see. This is the ChaosResult. Looking at the litmus namespace. Let's see, there's some information here. This was the application against which we did

1:00:35 the chaos, and the status of chaos says reverted, and the kind of target that we actually injected the fault into happens to be a pod, which most likely it is. It could be a disk or an instance, an EC2 instance, whatever. You'll be able to see the current status, which is generally useful. And there are some events, as we can see here, and also events on the ChaosEngine resource too. So we have some kind of observability around Kubernetes events to be able to track the progress of the experiments that way as

1:00:55 Need for Workflows (Chained Failures)

1:01:14 well. So that is one way of creating workflows. The other way is to create a template. So I could go ahead and schedule a workflow, select the agent against which you want to do chaos, and then you could create a new workflow by cloning an existing template. You could go to the individual workflows, the schedules tab, and take a look at all that you ran. And then you go ahead and create a template out of one and provide some name and description. Next time you want to run this, pull the

1:01:22 Other Workflow Creation Methods (Cloning, Importing YAML)

1:02:00 previous template, edit it for your needs and proceed with chaos. That is one simple way of running it. But let's say you want to store it in your Git repository; you could do that. So let me say I set up a Git repository, and here, I'm just going to provide a simple repository on my side. And now I have this connection that I will make to this repository. I can provide an access token or I can set up an SSH key. Again, I can provide this to my repository. I can just go ahead and provide my SSH keys,

1:02:58 The Litmus Chaos Center (Centralized Management)

1:03:13 and you will be able to go ahead and sync workflows here whenever you construct them from the portal. So that is one quick way of ensuring that there's a golden copy of your workflows in your repository. So you can go ahead and maintain it there. You can change it and it gets synced in the portal. So in the interest of time, let me go ahead to the next workflow creation type. So here, I'm just going to select the agent, and you can import the workflow from the YAML that you might have in your workspace

1:03:49 Chaos Against Non-Kubernetes Entities (AWS, GCP, Azure, VMware)

1:03:56 on your laptop or at some such location. Or you have these machines; you can just pull that and import your YAML. In this case, I'm going to select a chaos workflow that I'm going to use to create chaos for an EC2 instance. So this is one of the things that I wanted to sort of highlight in our chaos journey. So a lot of organizations have hybrid environments. Either they are in the process of adopting Kubernetes, migrating to Kubernetes, or they have already adopted it for some services while they continue to operate some other

1:04:42 services. And they would want to do chaos against those components as well, not just targeting pods, doing pod-level and node-level failures for Kubernetes nodes and pods, but also act against what you could call vanilla instances, if you may. So there are different faults you can do on that. You can do things like taking down the instance or disconnecting disks attached to it. Or you can cause some CPU burn or memory burn inside the VM. You can kill processes. You can inject latencies against the networks there. All sorts of things: you can do

1:05:39 the chaos for services running inside. There are a lot of things that you would want to do there as well, and they would want to have the same tooling, the same location from where you run all that, the same experience in running chaos against non-Kubernetes entities while you still run the application business logic in Kubernetes. So it still runs as a Kubernetes microservices app. So if you take a look at... if I have... so here it is. Let me go ahead and try to show you some slides. Yes. Here it is. You can take a look

1:06:24 at this. It means the experiment is going to run on Kubernetes, but it is going to make use of the provider API to inject faults against some cloud infrastructure. As long as it is accessible and you have set up the access for it using secrets, you will be able to leverage the APIs provided by these cloud platforms. Many of them have very well defined SDKs, which you can make use of to launch the chaos. You can either construct a new experiment that way or you can make use of existing experiments to do that. This is one such. So we're going to do chaos against one

1:07:06 of these instances. So I have some EC2 instances running some application, the Weaveworks Sock Shop microservices application. And that is basically hosted on one of these worker nodes. And we're going to bring that down and see what happens to the performance of my Sock Shop. So this is what's happening on my dashboard. Yeah. It's a funny Grafana theme. So let me go ahead and show the last five minutes. So, you mentioned the steady state hypothesis validation that you'd like to do. So I have the front end service, which is showing me some

1:07:56 Steady State Hypothesis with Probes (HTTP, Performance Checks)

1:08:07 queries per second, some transactions, 200 or so per second. Maybe this is going to go down when I do the chaos, but I expect it to come back up within a specific period of time. Sometimes you sort of resign yourself to the fact that, okay, there is going to be a difference in performance when we do this, but I'm going to test the mean time to recovery. I'm going to test how quickly my system recovers, how quickly it comes back to a good operational state and the metrics become optimal once again. So we have that kind of a

1:08:43 validation to do, along with an availability check to see if the service in question is always available. So those things we want to build into the experiment. And for that, let me just take you to this create workflow flow. So this was another means of running the workflow, or rather constructing it. I have it already, so I've just imported it from my workspace. And then you can go ahead and tune it the same way that you tuned before. And if at all you want to tune something here, you can do it here. You can do it by editing the YAML.

1:08:55 Creating EC2 Failure Workflow (Importing YAML, Tuning)

1:09:24 And in case you change something, let's say you want to keep the duration of this not just at thirty seconds but slightly longer, I can choose to set it to thirty five. And I want to see what the latest URL of the Sock Shop app is. And for that, let me try and see what my IP is from the EC2 instances. And so if I look at the detail here, this is my public address. So I'm just going to copy it. I'm going to put it in the probes that we have. There's an HTTP probe,

1:10:40 which is going to continuously verify, every two seconds, whether I can see a 200 OK on this endpoint. And I'm also doing a performance recovery check within a period of, let's say, two retries with a one second interval. So let's say within a two second period. There is also a timeout here, so each try is going to wait for some time if the response is not found. So you want to specify a recovery period. Within the recovery period, you want the performance to come back to some number. Right? It may or may not.

1:11:25 You can see there's a Prometheus endpoint, there's a query that we gave, and then there is a comparator that we've defined. So this you can think of as some kind of service level indicator, and the cutoff against that can be your service level objective. Sometimes service level objectives might not be as simple as a one-microservice metric. It might be a larger data point that you're looking for from your platform. But you would be able to sort of represent that intention, that check, as well using the probes.
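
The probe behaviour described here (sample a value, compare it against a cutoff with a comparator, and retry a few times at a fixed interval before giving up) can be sketched in a few lines. This illustrates the idea only and is not how Litmus implements probes. The function names, the comparator table, and the stubbed samplers are assumptions made for the example.

```python
import operator
import time

# Comparator symbols mapped to Python operators (illustrative set).
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def meets_criteria(observed, criteria, threshold):
    """Compare an observed indicator (SLI) against its cutoff (SLO)."""
    return COMPARATORS[criteria](observed, threshold)


def run_check(sample, criteria, threshold, retries, interval_s, sleep=time.sleep):
    """Return True if any of `retries` attempts meets the criteria.

    `sample` is any zero-argument callable returning a number, such as
    an HTTP status code or the result of a metrics query. A sampler
    that raises counts as a failed attempt.
    """
    for attempt in range(retries):
        try:
            if meets_criteria(sample(), criteria, threshold):
                return True
        except Exception:
            pass  # endpoint unreachable or query failed: try again
        if attempt < retries - 1:
            sleep(interval_s)
    return False


if __name__ == "__main__":
    # Stubbed availability check: the endpoint answers 200.
    print(run_check(lambda: 200, "==", 200, retries=2, interval_s=1))

    # Stubbed recovery check: throughput recovers on the second try.
    readings = iter([150.0, 210.0])
    print(run_check(lambda: next(readings), ">=", 200.0, retries=2, interval_s=1))
```

Passing the sampler and the sleep function in as arguments keeps the check independent of any particular HTTP or Prometheus client, so the same loop covers both the availability check and the performance recovery check.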

1:11:59 Executing EC2 Instance Failure Workflow

1:12:04 So now that we have all this information, let me go ahead and save whatever changes I made here. And then we provide the criticality and just go ahead and execute it. So when we do this, you will see the same couple of steps being performed. You will see the installation of the experiment, and then the ChaosEngine will be launched to trigger the instance stop. So here, you will be able to see worker one stopping and eventually going to the stopped state and staying there for thirty five seconds or whatever we set it to.

1:12:33 Observing EC2 Impact & Application Metrics

1:12:47 And you'll also see, in the process, the application is going to go down, and the metrics are going to drop. And we will be able to make these checks from within the experiment and give you a verdict and a probe success percentage. Let me go ahead and... yeah. You can see that this is stopping, and it's eventually going to go to the stopped state. So this is one way of acting upon, you can call it an out-of-band way of acting upon, cloud resources. But you might also want to do experiments

1:13:35 to eat up resources or cause network faults for services running within VMs. And that's something that will be added on. That's also an improvement from last time. We have new experiment categories which make use of agents running within the VM to carry out faults there and give you those capabilities in addition to what we already had with, let's say, instance failures or disk failures and things like that. So, yeah, you can see that it has stopped. I will not be able to load Sock Shop anymore. And you can do this not just against an EC2 instance that is contributing to a Kubernetes

1:14:19 cluster. You can do this on any EC2 instance. You can do things by tag. And if you're doing it by tag and you have multiple instances that are identified by the tag, you can provide a percentage of instances that you want to bring down, by instance affected percentage. And you can also, basically, provide a flag for whether the instances that you are killing are part of a managed node group or not. If they are, then the recovery or the post-chaos health checks that we perform will be slightly different. You can see that this is done,

1:15:00 and the application has gone down. So this red area that you're seeing here is coming because of the Litmus annotation that we have: the Litmus metric, and it has a Grafana annotation to tell you this is when the chaos is actually happening. So this is the front end ops going down and coming up, etcetera. And you will end up seeing the results of this validation. Sometimes, your application recovery might be fast. Sometimes, it might be slow. It really depends. And finally, you will have the details presented inside the ChaosResult: what happened to the individual

1:15:41 probes, whether they succeeded or failed, and you will be able to get a metric. So there are different kinds of Litmus metrics. The awaited experiments one is a useful one to show you the period when the chaos ran, and it has the injection time and further details. There are a lot of labels; you can see there's something called context. This is basically details or some metadata about why you're running this particular experiment. You can basically add new labels just to give you an overall idea. So this is how you go ahead and, yeah, you can see that this experiment failed.
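
The two calculations mentioned in this part of the demo, picking a percentage of tagged instances and turning probe outcomes into a success percentage and verdict, can be sketched roughly as below. This is only an illustration and not Litmus code: the function names, the instance IDs, the probe names, the round-down-with-a-minimum-of-one rule, and the "every probe must pass" verdict rule are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


def pick_targets(instance_ids, affected_perc, rng=None):
    """Choose which tagged instances to stop.

    Assumption: the percentage is rounded down, but at least one
    instance is always targeted when any instance matches the tag.
    """
    if not instance_ids:
        return []
    rng = rng or random.Random()
    count = max(1, int(len(instance_ids) * affected_perc / 100))
    count = min(count, len(instance_ids))
    return rng.sample(instance_ids, count)


@dataclass
class ProbeResult:
    name: str      # hypothetical probe name
    passed: bool   # whether the probe's check succeeded


def probe_success_percentage(results):
    """Share of probes that passed, as a percentage (assumed formula)."""
    if not results:
        return 100.0  # assumption: no probes means nothing failed
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def verdict(results):
    """Assumed rule: the experiment passes only if every probe passed."""
    return "Pass" if all(r.passed for r in results) else "Fail"


# Hypothetical run: five instances share a tag, half should be stopped,
# and both probes fail while the instances are down.
tagged = ["i-aaa", "i-bbb", "i-ccc", "i-ddd", "i-eee"]
targets = pick_targets(tagged, 50, random.Random(7))
results = [
    ProbeResult("availability-check", False),
    ProbeResult("performance-check", False),
]
print(len(targets), probe_success_percentage(results), verdict(results))
# prints: 2 0.0 Fail
```

With five tagged instances and 50 percent, this sketch stops two of them; a percentage of zero still stops one, which keeps the experiment from silently doing nothing.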

1:16:44 Recap of Chaos Principles & Litmus 2.0 Capabilities

1:16:51 As anticipated, the availability check failed, and so did the performance checks. So that's something that was expected. So this is a new capability: doing chaos against non-Kubernetes entities, and also being able to do that for both kinds. For example, there is a workflow that you could use to do CPU or memory hog on your VMs, not so much on pods or Kubernetes nodes. So this is something that can be done as well. You can see there's a revert chaos step in addition to what we had in other experiments

1:17:36 to clean up. Then save the changes and go ahead and run this. So all that I'm showing you through the dashboard, all these experiments, can be done by directly invoking the ChaosEngine, the 1.x model of running it with your chaos operator. Or if you choose to run with the ChaosCenter, which is what is recommended, you can make use of the APIs to perform these operations. There's a node here that we are working on called worker three. It has a node exporter set up, and you will see some utilization spike on the workers here

1:18:45 for one of these nodes. That's a pretty cool new feature. I really like that. So it's nice that you can target those different AWS, you know, kinds of vectors rather than sticking purely within the Kubernetes thing. I assume this is something that will be expanded over time to more providers, more clouds, more external resources. Like, really, there's no limit to what you can add here. Right? That's true. You can actually take a look at some of the experiments which are in tech preview mode. So you'll find them inside the ChaosHub. And you can see there are

1:19:47 similar experiments for GCP and Azure as well, and for VMware. And like I said, it's not just the instance; there are other services that you can target. So, yes, like I said, there are a lot of things that we could add here to make it easy. One of the advantages of it being purely community collaborated and open source is that chaos means different things to different people. So they come with their viewpoints, contribute, and really enrich the library in terms of what scenarios can be done and what are

1:20:34 the steady-state validations that can be done. The hypothesis is something that, again, is very diverse. It is, yep, you can see some spike happening here. It could be the value in a particular field inside of a custom resource. Maybe you're running a Postgres operator, and you want to monitor the health of that deployment, and that is reflected in some status field on your CR. And you're interested in getting that before you do some other things as part of the experiment. That's part of the hypothesis. So there are different ways of doing checks, and we're also trying to work

1:21:36 on adding these probes, different kinds of probes that you might want to do. Some of them, like the Prom probe, will be purely about consuming metrics from Prometheus. There will be other providers that we would like to integrate with to show the interleaving as well as to do validation. But there are other probes that we can build. So there are all these different modes of doing them, when you do them, etcetera. So if you look at the principles of chaos here, there are a lot of things which are true irrespective of the model or the usage

1:22:21 in general, I mean, whether it's cloud native or not cloud native. It's great to be able to do validations of steady state. Either you do it manually or you build it into the automation, into the experiment. Vary real-world events with different kinds of faults. There are so many kinds of faults possible today, so try and bring them in. Run in production when you're confident, you know the recovery procedures well, and you have a sort of sign-off from all the stakeholders. And run continuously, because this especially holds true for the cloud native world, because there's a constant churn to

1:22:59 provide that end user experience. You have your services that you roll out as an application service provider. And then there are so many things you're borrowing from the CNCF landscape, from observability, from storage, service meshes, all sorts of things which are giving you that end experience to the user, or which help your SRE team to maintain that end user experience by maintaining the health of your deployments. All those things keep changing continuously, so you keep upgrading them. Kubernetes itself, you keep upgrading. So you would like to know how things

1:23:36 change. So it's important to run these experiments continuously and have a framework to automate them, set up scheduled chaos maybe. And then, minimization of blast radius: select the right service, like, component, for the right duration, in the right namespace, etcetera, and have the ability to specify all that. So we're looking at taking these principles and bringing them on to Litmus along with a few other things, like GitOps and workflows to stitch together scenarios, and a few other settings, to give you an end-to-end platform for chaos engineering. But there's a lot of improvement that can be made yet

1:24:21 in terms of features and in terms of capabilities, because chaos, as we speak, is evolving. For example, people are even getting into security chaos today to find out vulnerabilities. So, in fact, being able to run some services successfully without the right restrictions being placed on running them is also a chaos test. So there are different ways of looking at it. So the scope really, like you mentioned, is endless. There are a lot of capabilities that we brought in, and we're looking forward to the feedback and contributions pouring in from the community

1:24:50 Community & Contributions in CNCF

1:25:03 and being a CNCF project helps. There's a lot of good feedback that we received over time, which has helped us to make these improvements. And a lot of good contributions are coming in from the community. A lot of the probes work was contributed by folks from Red Hat, which is awesome. And really looking forward to continuing this journey of improving the experiments and the capabilities, and hope to be on another discussion on Rawkode with more improvements as we go. Awesome. I think that brings us to a close on what I had in mind to discuss today.

1:25:39 Conclusion & Thank You

1:25:57 That was perfect. There are a lot of really cool updates to love there with Litmus. I'm really looking forward to the new version and playing with some of those features. And I love that you almost tied it all together at the end there with the mention of the CNCF and contributions, because we will be doing a "contributing to Litmus" session on Cloud Native TV a week on Friday with one of your colleagues, I believe. So people should check that out if they want to know how to contribute to the Litmus project as well.

1:26:27 Alright, Karthik. We are at time. I just wanna say thank you. That was a really nice look into some of those new features. I think the remote agents are really cool. The workflow improvements, all of that, it's just super cool. And adding new chaos experiments that go beyond where we already were just opens up so many different potential avenues for breaking Kubernetes, which to me is very exciting. I'm now trying to work out if I can use Litmus for some Klustered fun. So lots of stuff for me to experiment with

1:26:57 there. Any last words before I let you go for today? Thank you for this opportunity once again. Really enjoyed being on this show and talking about this. Looking forward to breaking stuff on Kubernetes. Awesome. Alright. Well, check out Principles of Chaos. Check out Litmus. Look forward to 2.0. Karthik, thank you again. Have a wonderful day, and I'll see you all soon. Thanks.
