Litmus

License: Apache-2.0

CNCF Project

Cloud Native Computing Foundation

Accepted: 2020-06-25

Incubating: 2022-01-11

Litmus (also known as LitmusChaos) is an open-source, cloud-native chaos engineering platform that helps Site Reliability Engineers (SREs) and developers find weaknesses in Kubernetes applications and infrastructure. It injects controlled faults that simulate real-world failures and observes how the system behaves under stress, so teams can uncover and fix problems before they affect users. The aim is greater confidence that a system can withstand unexpected disruptions, with better uptime and lower risk as a result.

Key Features

Cloud-Native Chaos Experiments: Provides a wide range of pre-defined chaos experiments (e.g., pod delete, network latency, CPU hog) that can be run on Kubernetes.
Kubernetes-Native: Chaos experiments are defined as Custom Resources (CRs) in Kubernetes, making them easy to manage and integrate with existing Kubernetes tools.
Chaos Workflows: Orchestrate complex chaos experiments across multiple applications and infrastructure components using chaos workflows.
Chaos Control Plane: A central component for managing chaos experiments, visualizing results, and defining chaos schedules.
Observability Integration: Integrates with popular monitoring and observability tools to help analyze the impact of chaos experiments and validate resilience.
Experiment Customization: Allows users to create custom chaos experiments tailored to their specific applications and infrastructure.
Automated Verification: Define probes to automatically verify the system’s health and application performance during and after chaos injection.

How it Works

LitmusChaos operates in two main planes:

Chaos Control Plane: Resides in the Kubernetes cluster (or a separate management cluster) and handles the scheduling, management, and monitoring of chaos experiments.
Chaos Execution Plane: Runs inside the target Kubernetes cluster(s), where chaos agents and operators inject the faults and observe their effects.

Users declare faults (e.g., pod delete, network latency) as Kubernetes custom resources: a ChaosExperiment describes the fault, and a ChaosEngine binds it to a target application. Litmus picks up these resources and executes the fault injection.
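As a rough illustration of what such a custom resource looks like, the sketch below builds a ChaosEngine manifest in Python. A ChaosEngine is the Litmus resource that points a fault at a target application. The helper name build_chaos_engine, the resource name nginx-chaos, the label app=nginx, and the service account pod-delete-sa are all hypothetical; field names follow the litmuschaos.io/v1alpha1 schema as commonly documented and may differ between Litmus versions.

```python
import json


def build_chaos_engine(name, namespace, app_label, app_kind="deployment",
                       service_account="pod-delete-sa",
                       duration_s=30, interval_s=10):
    """Return a ChaosEngine manifest (as a dict) that runs the pod-delete fault.

    All argument values are illustrative; check the field names against the
    Litmus version you actually run.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which workload the fault targets.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": app_kind,
            },
            # Service account with permission to run the fault (assumed name).
            "chaosServiceAccount": service_account,
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # Tunables are passed as string-valued environment variables.
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                    {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                ]}},
            }],
        },
    }


if __name__ == "__main__":
    manifest = build_chaos_engine("nginx-chaos", "default", "app=nginx")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with kubectl apply -f, provided the Litmus CRDs and the pod-delete experiment are already installed in the cluster.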

Benefits

Proactive Weakness Identification: Discover system weaknesses and potential points of failure before they cause outages.
Improved System Resilience: Build more robust and fault-tolerant applications by systematically testing their behavior under adverse conditions.
Reduced Downtime: Minimize the impact of real-world failures by understanding and mitigating their effects in advance.
Increased Confidence: Gain confidence in your system’s ability to withstand unexpected disruptions.
Enhanced Observability: Improve understanding of system behavior and dependencies under stress.
Shift Left Chaos Engineering: Integrate chaos testing into development and CI/CD pipelines, making resilience a continuous practice.
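
One way to picture the CI/CD integration mentioned above is a pipeline step that fails the build unless a chaos run passed. The sketch below assumes the run's ChaosResult resource has already been fetched as JSON and that its verdict lives at status.experimentStatus.verdict, which matches the ChaosResult schema as commonly documented but should be verified for your Litmus version. The function name gate_on_verdict and the sample data are hypothetical.

```python
def gate_on_verdict(chaos_result: dict) -> int:
    """Map a ChaosResult document to a process exit code.

    Returns 0 only when the verdict is "Pass". Any other verdict, or a
    missing verdict, is treated as a failure so the pipeline stops.
    """
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    verdict = status.get("verdict", "Awaited")
    return 0 if verdict == "Pass" else 1


# Hand-written sample for illustration, not real Litmus output.
sample = {"status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}}}
exit_code = gate_on_verdict(sample)
print(exit_code)
```

In a real pipeline the dictionary would come from something like kubectl get chaosresult <name> -o json, and the returned code would be passed to sys.exit so that a failed or unfinished run blocks the deployment.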