Site Reliability Engineering (SRE)¶
Nubenetes V2 Elite Portal
You are browsing the AI-Curated V2 Elite Edition. Looking for the exhaustive list of references? Check out the V1 Historical Archive.
Architectural Context
Detailed reference for Site Reliability Engineering (SRE) in the context of Platform & Site Reliability.
Continuous Delivery¶
Feature Management¶
Reliability Engineering¶
- (2021) devops.com: How SREs Benefit From Feature Flags [COMMUNITY-TOOL] โ Investigates the role of feature flags in risk mitigation. Demonstrates how isolating application deployment from runtime feature activation enables SREs to instantly disable buggy paths without executing a full application rollback.
SLO Validation¶
REST APIs¶
- (2022) thenewstack.io: Validate Service-Level Objectives of REST APIs Using Iter8 [ADVANCED LEVEL] [COMMUNITY-TOOL] โ Demonstrates utilizing Iter8 to execute programmatic SLO assertions on REST endpoints. Ideal for modern CI/CD patterns validating performance bounds before traffic transitions to production.
Observability¶
Monitoring¶
SRE Fundamentals¶
- (2021) circonus.com: Monitoring for Success: What All SREs Need to Know [COMMUNITY-TOOL] โ Examines monitoring frameworks focused on business outcomes rather than raw resource metrics. Guides SREs on setting up context-rich telemetry pipelines that prioritize application-level user experience over physical host utilization.
Service Level Objectives¶
Community Events¶
- (2021) SLOconf [COMMUNITY-TOOL] โ Official site of SLOconf, the premier developer conference and community space focused entirely on Service Level Objectives, error budgets, and reliability-driven service platforms.
Declarative Standards¶
- (2021) OpenSLO specification ๐ โญ 1496 [YAML CONTENT] [ADVANCED LEVEL] ๐๐๐๐๐ [DE FACTO STANDARD] โ The vendor-agnostic OpenSLO specification defines standard YAML schemas for declaring SLOs, SLIs, and error budgets. In 2026, it remains the standard for orchestrating declarative system health models inside GitOps automation.
GitOps¶
- (2022) thenewstack.io: Automate User Satisfaction with This GitOps-Friendly Spec for Service Level Objectives [ADVANCED LEVEL] [COMMUNITY-TOOL] โ Explores using GitOps mechanics to track reliability definitions. Outlines managing SLO declarations as declarative code in Git repository workflows to programmatically configure observability tools.
Google Best Practices¶
- (2020) sre.google: The Art of SLOs [ADVANCED LEVEL] [DOCUMENTATION] [COMMUNITY-TOOL] โ Google's structural guide to defining user-centric Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Features processes to prevent metric fatigue and map indicators directly to consumer utility.
Site Reliability Engineering¶
Root Cause Analysis¶
- (2021) youtube: Platform9โs Madhura Maskasky says observability is also essential for diagnosing and debugging in order for SREs to "get to the root cause quickly enough so that you can feed that back to the development teams." ๐ [COMMUNITY-TOOL] โ Video analysis featuring Platform9 discussing the strategic role of observability in debugging distributed cloud-native infrastructures. Emphasizes establishing tight engineering feedback loops between SREs and application developer teams to dramatically reduce Mean Time to Resolution (MTTR).
Operations¶
Platform Engineering¶
Organizational Design¶
- (2022) devops.com: SRE Vs. Platform Engineering: Whatโs the Difference? [COMMUNITY-TOOL] โ Differentiates the core focuses of Platform Engineering and SRE. Analyzes how SRE teams protect application reliability, while Platform Engineering optimizes developer velocity by packaging tools into Internal Developer Platforms (IDPs).
Strategic Alignment¶
- (2022) thenewstack.io: SRE vs. DevOps? Successful Platform Engineering Needs Both [COMMUNITY-TOOL] โ Argues that successful modern Platform Engineering requires marrying DevOps velocity patterns with SRE reliability practices inside the developer platform tooling layer.
SRE vs DevOps¶
Tooling¶
- (2021) youtube: Viktor Farcic - What is the difference between SRE and DevOps? [COMMUNITY-TOOL] โ Technical video analysis contrasting the conceptual responsibilities of DevOps and SRE. Explores modern architectural pillars including service meshes, declarative GitOps workflows, DevSecOps integrations, and platform-driven self-healing mechanisms.
Site Reliability Engineering (1)¶
Best Practices¶
- (2022) infracloud.io: Site Reliability Engineering (SRE) Best Practices [COMMUNITY-TOOL] โ Compiles architectural and cultural patterns for establishing cloud-native SRE. Key focuses include automated toil elimination, disaster recovery automation, chaos engineering practices, and objective-based release gates.
Cloud Native Ecosystem¶
- (2022) thenewstack.io: How the SRE Experience Is Changing with Cloud Native ๐ [ADVANCED LEVEL] [COMMUNITY-TOOL] โ Discusses how the shift to highly dynamic microservices and declarative Kubernetes operators has changed SRE workloads. SREs now prioritize building internal control planes and standardizing platform telemetry over manual server patching.
Enterprise Architecture¶
- (2021) thenewstack.io: Google SRE: Site Reliability Engineering at a Global Scale [ADVANCED LEVEL] [COMMUNITY-TOOL] โ Examines Google's massive-scale SRE methodologies. Detail-oriented profile covering how Google applies engineering to operations, manages error budgets across global infrastructures, and structures on-call shifts to protect engineering sanity.
Google Best Practices (1)¶
- (2020) cloud.google.com: SRE at Google: Our complete list of CRE life lessons ๐ [ADVANCED LEVEL] [DOCUMENTATION] [COMMUNITY-TOOL] โ A comprehensive aggregation of operational lessons compiled by Google's Customer Reliability Engineering (CRE) team. This resource serves as a core blueprint for implementing user-focused SLOs, identifying operational anti-patterns, and structuring incident communication paths.
Google SRE Book¶
- (2016) sre.google: sre-book - The Evolving SRE Engagement Model [ADVANCED LEVEL] [DOCUMENTATION] [COMMUNITY-TOOL] โ Core chapter from Google's SRE Book defining the Service Engagement Model. Outlines operational boundaries, launch-readiness reviews, service onboarding criteria, and the programmatic handback of unstable systems to development teams.
Incident Management¶
- (2021) infoq.com: Observing and Understanding Failures: SRE Apprentices [COMMUNITY-TOOL] โ InfoQ presentation highlighting pedagogical methods for onboarding junior SREs. Focuses on developing systemic mental models for debugging multi-tier failures and safely operating complex, distributed live environments.
Podcasts¶
- (2021) sre.google/prodcast [COMMUNITY-TOOL] โ Google's 'Prodcast' SRE series. Features in-depth architectural and organizational conversations with Google operations engineers regarding multi-region failovers, disaster planning, and fleet-wide maintenance.
Tooling (1)¶
- (2022) getcortexapp.com: A guide to the best SRE tools [COMMUNITY-TOOL] โ Cortex's architectural guide evaluating modern SRE tooling stacks. Highlights service catalogs, central tracking, alert deduplication, automated runbooks, and performance benchmarking.
- (2021) thenewstack.io: The Site Reliability Engineering Tool Stack [COMMUNITY-TOOL] โ Explores the fundamental tooling categories necessary for robust SRE practices, spanning declarative Infrastructure as Code (IaC), performance profiling, on-call management, and automated monitoring integrations.
- (2021) thenewstack.io: The Best Site Reliability Engineering Tools in 2021 [COMMUNITY-TOOL] โ An evaluative overview of top-tier reliability platforms. Focuses on tools addressing continuous anomaly detection, SLO dashboards, chaos engineering, and incident management pipelines.
Software Engineering¶
Professional Development¶
Core Architectures¶
- (2025) Skills for Real Engineers โญ 128202 [MARKDOWN CONTENT] [ADVANCED LEVEL] ๐๐๐๐๐ [DE FACTO STANDARD] โ An exceptionally popular repository detailing the foundational principles, design philosophies, and architectural protocols required for master-level software delivery. While the curator focuses on career advancement, live engineering practice indicates that mastering these fundamentals is vital to surviving rapid AI development shifts. It represents an elite reference for engineering standardizations.
๐ก Explore Related: DevOps | Test Automation Frameworks | Performance Testing With Jenkins And Jmeter