
Chaos Engineering

Nubenetes V2 Elite Portal


Architectural Context

Detailed reference for Chaos Engineering in the context of Platform & Site Reliability.

Resilience

Chaos Engineering (1)

Cloud Architecture

  • (2021) aws.amazon.com: Chaos Engineering with LitmusChaos on Amazon EKS [ADVANCED LEVEL] [COMMUNITY-TOOL] - Provides an architectural blueprint for integrating LitmusChaos with Amazon EKS. Walks through installing custom resources, setting up experiment workflows for container and node disruptions, and verifying application resilience with AWS native CloudWatch metrics.
  • (2021) Azure Chaos Studio [COMMUNITY-TOOL] - Provides an overview of Azure Chaos Studio, Microsoft's managed chaos orchestration platform. Explains how to configure fault injection pipelines against virtual machines, AKS clusters, and key-value stores directly inside the Azure portal.

Continuous Integration

  • (2022) thenewstack.io: Operationalizing Chaos Engineering with GitOps [ADVANCED LEVEL] [COMMUNITY-TOOL] - Proposes GitOps-driven chaos engineering. Declaring chaos configurations alongside standard application manifests in Git gives engineering teams auditability, versioning, automated cleanup, and predictable pipeline integration.
  • (2021) pingcap.com: chaos-mesh-action: Integrate Chaos Engineering into Your CI [COMMUNITY-TOOL] - Demonstrates running Chaos Mesh in GitHub Actions CI/CD workflows with chaos-mesh-action. Developers can check the resilience of each code change by spinning up a test cluster, injecting faults, and validating the output on pull requests.

Curated Resources

Enterprise Platforms

Kubernetes Tools

  • (2025) chaosblade ⭐ 6352 [GO CONTENT] [ADVANCED LEVEL] 🌟🌟🌟🌟🌟 [DE FACTO STANDARD] - Alibaba's multi-platform chaos engineering tool for injecting faults at several levels of a system. Targets OS resource exhaustion, network degradation, disk I/O bottlenecks, and deep application-layer faults for languages like Java, Go, and C++.
  • (2024) GitHub: kube-monkey ⭐ 3064 [GO CONTENT] 🌟🌟🌟🌟🌟 [DE FACTO STANDARD] - A Go-based Kubernetes implementation of Netflix's Chaos Monkey. kube-monkey runs inside Kubernetes clusters, where it schedules and deletes random Pods within designated namespaces, forcing development teams to architect highly redundant and self-healing services.
  • (2024) PowerfulSeal ⭐ 1975 [PYTHON CONTENT] 🌟🌟🌟🌟🌟 [DE FACTO STANDARD] - A Python-based chaos engine designed for Kubernetes. PowerfulSeal runs interactively or from declarative policy configurations to delete pods, shut down infrastructure nodes, and disrupt networking, revealing platform design flaws.
  • (2020) openshift.com: Introduction to Kraken, a Chaos Tool for OpenShift/Kubernetes [COMMUNITY-TOOL] - An introduction to Kraken, Red Hat's open-source chaos engineering engine tailored for OpenShift and Kubernetes. Kraken enables automated node disruptions, namespace-level resource starvation, and API-level faults to locate architecture bottlenecks.
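
The random pod deletion that the kube-monkey and PowerfulSeal entries describe comes down to a victim-selection step. The sketch below is only an illustration of that idea in plain Python, not either tool's code or configuration: the `Pod` class, the `opted_in` flag, and `pick_victims` are hypothetical names, and nothing here talks to a Kubernetes API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical stand-in for a pod record; not a Kubernetes client type."""
    name: str
    namespace: str
    opted_in: bool  # assumed opt-in marker, e.g. derived from a label


def pick_victims(pods, allowed_namespaces, max_kills, rng):
    """Pick up to max_kills random pods that are eligible for termination.

    A pod is eligible only if it opted in and lives in an allowed namespace,
    which keeps the blast radius limited to designated workloads.
    """
    eligible = [
        p for p in pods
        if p.opted_in and p.namespace in allowed_namespaces
    ]
    # random.Random.sample raises if k > len(population), so clamp k first.
    return rng.sample(eligible, min(max_kills, len(eligible)))


if __name__ == "__main__":
    inventory = [
        Pod("web-1", "shop", True),
        Pod("web-2", "shop", True),
        Pod("db-0", "shop", False),       # never selected: not opted in
        Pod("dns-1", "kube-system", True),  # never selected: namespace not allowed
    ]
    for victim in pick_victims(inventory, {"shop"}, 1, random.Random()):
        print(f"would delete {victim.namespace}/{victim.name}")
```

Passing the random generator in explicitly makes a run reproducible with a fixed seed, which helps when a chaos run needs to be replayed.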

Operations Strategy

  • (2021) thenewstack.io: Chaos Engineering Progressively Moves to Production [ADVANCED LEVEL] [COMMUNITY-TOOL] - Discusses moving chaos testing progressively into production clusters. Focuses on minimizing the blast radius with progressive deployment gates, automated circuit breakers, canary releases, and deep observability, so that real-world dependency issues can be captured safely.
  • (2021) opensource.com: 5 lessons I learned about chaos engineering for Kubernetes [COMMUNITY-TOOL] - Distills key lessons from running chaos experiments on live Kubernetes clusters. Covers container restart policies, the impact of DNS connection caching, resource limit thresholds, and handling false alerts.
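
One way to picture the "automated circuit breaker" idea from the entries above is a guard that watches the error rate during an experiment and stops it once a threshold is crossed. This is a minimal sketch under stated assumptions: `AbortGuard`, `run_experiment`, and the threshold values are hypothetical and not taken from either article, and the outcomes are plain booleans rather than real telemetry.

```python
from dataclasses import dataclass


@dataclass
class AbortGuard:
    """Tracks request outcomes and decides when an experiment should stop."""
    max_error_rate: float  # abort once the observed rate exceeds this
    min_requests: int = 20  # avoid aborting on a tiny, noisy sample
    requests: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def should_abort(self) -> bool:
        return (
            self.requests >= self.min_requests
            and self.error_rate > self.max_error_rate
        )


def run_experiment(outcomes, guard):
    """Feed outcomes to the guard; return (observations processed, aborted)."""
    processed = 0
    for ok in outcomes:
        guard.record(ok)
        processed += 1
        if guard.should_abort():
            # In a real setup this is where the fault would be rolled back.
            return processed, True
    return processed, False
```

The `min_requests` floor is a design choice: without it, a single early failure would yield a 100% error rate and end the experiment before it produced any signal.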

Serverless Systems

  • (2021) thenewstack.io: Breaking Serverless on Purpose with Chaos Engineering [ADVANCED LEVEL] [COMMUNITY-TOOL] - Explores the challenges and methods of injecting faults into ephemeral, serverless environments (e.g., AWS Lambda). Discusses techniques like wrapper-based latency injection, API response mocking, and runtime variable tampering to validate failure paths.
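
Wrapper-based latency injection, as mentioned in the entry above, can be sketched as a decorator around a function handler. The example assumes the `(event, context)` handler signature used by Python functions on AWS Lambda; the `inject_latency` decorator, its parameters, and the sample handler are hypothetical and are not taken from the article or from any chaos library.

```python
import functools
import random
import time


def inject_latency(delay_seconds, probability=1.0, rng=None, sleep=time.sleep):
    """Return a decorator that delays a handler on a fraction of invocations.

    rng and sleep are injectable so the behaviour can be tested without
    real randomness or real waiting.
    """
    rng = rng or random.Random()

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            # random() is in [0.0, 1.0), so probability=1.0 always delays
            # and probability=0.0 never does.
            if rng.random() < probability:
                sleep(delay_seconds)
            return handler(event, context)
        return wrapper

    return decorator


@inject_latency(delay_seconds=0.2, probability=0.5)
def handler(event, context):
    """Example handler: delayed by 200 ms on roughly half of the calls."""
    return {"statusCode": 200, "body": "ok"}
```

In practice the delay and probability would likely come from configuration that can be switched off quickly, so the experiment can be ended without a redeploy.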

Stateful Systems

Telemetry Systems

