Kubernetes Troubleshooting
- Introduction
- Kubernetes Events
- Kubernetes Network Troubleshooting
- Exit Codes in Containers and Kubernetes
- ImagePullBackOff
- CrashLoopBackOff
- Failed to Create Pod Sandbox
- Terminated with exit code 1 error
- Pod in Terminating or Unknown Status
- OOM Kills
- Pause Container
- Preempted Pod
- Evicted Pods
- Stuck Namespace
- Access PVC Data without the POD
- CoreDNS issues
- Debugging Techniques and Strategies. Debugging with ephemeral containers
- Troubleshooting Tools
- Slides
- Images
- Tweets
Introduction
- learnk8s.io: A visual guide on troubleshooting Kubernetes deployments
- medium: 5 tips for troubleshooting apps on Kubernetes
- managedkube.com: Troubleshooting a Kubernetes ingress
- veducate.co.uk: How to fix in Kubernetes - Deleting a PVC stuck in status "Terminating"
- thenewstack.io: 5 Best Practices to Back up Kubernetes
- tennexas.com: Kubernetes Troubleshooting Examples
- levelup.gitconnected.com: 5 tips for troubleshooting apps on Kubernetes
- medium: Common Kubernetes Errors Made by Beginners [2021]
- cloud.redhat.com: Troubleshooting Sandboxed Containers Operator
- andydote.co.uk: The Problem with CPUs and Kubernetes
- medium: Better Debugging Environment for your Micro-Services
- thenewstack.io: 6 Kubernetes Best Practices to Empower Devs to Troubleshoot
- youtube: 3 Ways to Detect Evil “Latest” Image Tags in Kubernetes - Kubevious The “latest” image tag is a disaster waiting to happen. In this video, you will learn how to detect usage of the latest images using 3 different methods.
- thenewstack.io: Living with Kubernetes: Debug Clusters in 8 Commands
- dzone.com: The Three Pillars of Kubernetes Troubleshooting [ARCHIVED] A look at the three pillars of Kubernetes troubleshooting (understanding, managing, and preventing) and how they help frame what is needed to properly troubleshoot the real-world Kubernetes stacks that are the hallmark of complex, distributed systems.
- freecodecamp.org: How to Simplify Kubernetes Troubleshooting
- itnext.io: Distroless Container Debugging on K8s/OpenShift
- As people focus more on container security, distroless-based images are frequently used to reduce the attack surface. In these images, the package manager, non-essential modules and libraries, and even the shells are stripped off; only the app and its required dependencies are kept. For a statically linked executable, produced by golang for example, we can even use "scratch" as the base.
- The potential exploit of vulnerability is therefore greatly reduced. But, on the other hand, it is difficult to troubleshoot the application if even the shell is not available, leaving only the logs from the app.
- In this paper, we will explore different options to facilitate debugging by bringing back the shell.
- speakerdeck.com/mhausenblas (redhat): Troubleshooting Kubernetes apps
- medium.com/@andrewachraf: Detect crashes in your Kubernetes cluster using kwatch and Slack Monitor all changes in your Kubernetes (K8s) cluster and detect crashes in your running apps in real time
- research.nccgroup.com: Detection Engineering for Kubernetes clusters In this article you will learn how to detect anomalies in your cluster using Kubernetes Audit logs and Anomalies Detection Engineering.
- pauldally.medium.com: Kubernetes - Debugging NetworkPolicy (Part 1)
- medium.com/geekculture: Common Pod Errors in Kubernetes to Watch Out For
- faun.pub: Kubernetes - Debugging NetworkPolicy (Part 1) For something as important as NetworkPolicy, debugging is surprisingly painful. In this article you will learn a few practical tips on how to debug your network policies
- tratnayake.dev: Oncall Adventures - When your Prometheus-Server mounted to GCE Persistent Disk on K8s is Full In this article, you will follow Thilina’s journey on debugging a failing Prometheus server on Kubernetes. The story starts with a wake-up call at 3.30 am
- sysdig.com: Understanding Kubernetes pod pending problems
- blog.alexellis.io: How to Troubleshoot Applications on Kubernetes In this article, you will learn a practical framework to troubleshoot applications deployed on Kubernetes:
- Is it there?
- Why isn’t it working?
- It starts, but doesn’t work
- There are too many pods!
- But can you curl it?
- blog.devgenius.io: All You Need to Know about Debugging Kubernetes Cronjob Walkthrough tools & configs & knowledge used in Kubernetes cronjob/deployment debug. In this article, you will create and deploy a (broken) CronJob. Then you will debug it and in the process learn about environment variables, RBAC, pod resource configuration, logging, and more.
- saiteja313.medium.com: Tracing DNS issues in Kubernetes
- medium.com/@jasonmfehr: Kubernetes Informers: Opening the Mystery Box In this article, you will learn how the team at Cloudera found a performance issue with Kubernetes informers and how they managed to rectify the issue
- maxilect-company.medium.com: Graceful shutdown in a cloud environment (the example of Kubernetes + Spring Boot) In this article, you’ll learn why it is crucial to think about graceful shutdown in Kubernetes and how you can approach this task. Many people think about starting an application in the cloud but rarely pay attention to how it ends. Once, we caught quite a few errors explicitly related to pods stopping. For example, we saw that Kubernetes occasionally kills our application before it releases resources, although it seems that this should not happen. It was impossible to reproduce the problem immediately, and we wondered what was happening under the hood?
- martinheinz.dev: Backup-and-Restore of Containers with Kubernetes Checkpointing API Kubernetes v1.25 introduced Container Checkpointing API as an alpha feature. This provides a way to backup-and-restore containers running in Pods, without ever stopping them. This feature is primarily aimed at forensic analysis, but general backup-and-restore is something any Kubernetes user can take advantage of. So, let’s take a look at this brand-new feature and see how we can enable it in our clusters and leverage it for backup-and-restore or forensic analysis.
- madeeshafernando.medium.com: Capturing Heap Dumps of stateless Kubernetes pods before container termination and export to AWS S3
- faun.pub: Troubleshooting Kubernetes nodes storage space shortage on Aliyun (Alibaba Cloud) In this article, you will follow Stephen’s journey to identifying the root cause for cluster nodes running out of space on the Aliyun cloud
- thenewstack.io: What David Flanagan Learned Fixing Kubernetes Clusters David Flanagan has fixed 50+ Kubernetes clusters as part of his YouTube series, ‘Klustered.’ He shared what he learned at Civo Navigate.
- github.com/metaleapca: metaleap-k8s-troubleshooting.pdf
- nicolasbarlatier.hashnode.dev: .NET Core Tip 2: How to troubleshoot Memory Leaks within a .NET Console application running in a Linux Docker Container in Kubernetes In this step-by-step guide, you will learn how to troubleshoot a memory leak in a .Net Core application running within a Kubernetes cluster.
- dzone.com: Tackling the Top 5 Kubernetes Debugging Challenges Bugs are inevitable and typically occur as a result of an error or oversight. Learn five Kubernetes debugging challenges and how to tackle them.
- levelup.gitconnected.com: Access Kubernetes Objects Data From /Proc Directory The /proc directory is a special directory that holds all the details about our Linux system, such as the kernel, processes, and configuration parameters. In this article, you will learn how to explore the directory in a Kubernetes cluster
- learnitguide.net: How To Troubleshoot Kubernetes Pods
- learnitguide.net: How to Check Memory Usage of a Pod in Kubernetes?
- alexsniffin.medium.com: Debugging Remotely with Go in Kubernetes In this tutorial, you will learn how to debug an application deployed in Kubernetes remotely using VS Code and Delve
- thenewstack.io: Kubernetes Troubleshooting Primer A quick methodology for overcoming common error messages, with examples of commands to help; useful for both the administrator and developer alike.
- devzero.io: Kubernetes Debugging Tips
- vik-y.medium.com: An easier way to auto-remediate memory leaks on Kubernetes!
- medium.com/@yusufkaratoprak: Advanced Troubleshooting Techniques in Kubernetes Pods
Kubernetes Events
- CPU Limits in Kubernetes: Deep Dive into Pod Throttling and Kernel Interactions - This article provides an in-depth explanation of how CPU limits in Kubernetes function, detailing the underlying mechanisms involving the Linux Kernel and cgroups v2. It addresses the common issue of pods being throttled even when idle, exploring the complex interactions between Kubernetes, container runtimes, and the host operating system to shed light on performance impacts.
- groundcover.com: Failure Is an Option: How to Stay on Top of K8s Container Events Gain a deep understanding of how Kubernetes tracks container and Pod status, how it reports error information and how you can collect all of the above in an efficient way
- decisivedevops.com: Kubernetes Events - News feed of your cluster Understand Kubernetes Events and learn to use kubectl events to monitor and troubleshoot your cluster's issues effectively.
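
The throttling behaviour described in the CPU limits article listed above comes down to simple arithmetic. The following is a minimal sketch, not the kernel's actual accounting: it assumes the default 100 ms CFS period, a single-threaded workload, and a burst of work that starts exactly at a period boundary. The function name `wall_time_ms` is hypothetical.

```python
import math

CFS_PERIOD_MS = 100  # default CFS bandwidth period (assumed unchanged on the node)


def wall_time_ms(cpu_needed_ms: float, cpu_limit_cores: float) -> float:
    """Estimate wall-clock time for a single-threaded burst under a CPU limit.

    Assumes the burst starts at the beginning of a CFS period.
    """
    # Quota is the CPU time the container may consume in each period.
    quota_ms = cpu_limit_cores * CFS_PERIOD_MS
    # A single thread cannot use more than one period of CPU per period.
    quota_ms = min(quota_ms, CFS_PERIOD_MS)
    if cpu_needed_ms <= quota_ms:
        return cpu_needed_ms  # finishes before the quota runs out
    # Every exhausted period costs a full 100 ms of wall time.
    full_periods = math.ceil(cpu_needed_ms / quota_ms) - 1
    remaining_ms = cpu_needed_ms - full_periods * quota_ms
    return full_periods * CFS_PERIOD_MS + remaining_ms


# A request that needs 200 ms of CPU under a 500m limit
print(wall_time_ms(200, 0.5))
```

Under these assumptions the container runs for 50 ms and is then throttled for the rest of each period, so a short burst of work can take noticeably longer than the CPU time it needs even though average utilisation looks low.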
Kubernetes Network Troubleshooting
- hwchiu.medium.com: Kubernetes Network Troubleshooting Approach
- itnext.io: Tracing Pod2Pod Network Traffic in Kubernetes | Daniele Polencic
Exit Codes in Containers and Kubernetes
- komodor.com: Exit Codes In Containers & Kubernetes - The Complete Guide In this article, you will learn everything there is to know about exit codes used by container engines to indicate reasons for container termination.
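
As a quick illustration of how such exit codes are usually read, here is a minimal sketch based on the common convention that a code above 128 means the process was terminated by signal number (code - 128). The function name `describe_exit_code` is hypothetical, and signal names are taken from the platform where the script runs (Linux is assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a rough, convention-based reading of a container exit code."""
    if code == 0:
        return "exited successfully"
    if 128 < code < 256:
        signum = code - 128  # 128 + N means "terminated by signal N"
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a named signal on this platform
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"application error (exit code {code})"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

Exit code 137 (128 + 9, SIGKILL) is the one typically seen alongside an OOMKilled status, while 143 (128 + 15, SIGTERM) usually indicates a normal termination request.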
ImagePullBackOff
- 10 Real-World Kubernetes Troubleshooting Scenarios and Solutions - This article provides practical, hands-on solutions for common Kubernetes production issues. It details 10 real-world scenarios, including ImagePullBackOff due to private registry authentication failure, and offers exact kubectl commands and steps for diagnosis and resolution. It also touches upon cloud-managed Kubernetes solutions and IAM roles for registry authentication.
- blog.ediri.io: Kubernetes: ImagePullBackOff! How to keep your calm and fix this like a pro!
CrashLoopBackOff
- medium.com: Kubernetes Tip: How To Disambiguate A Pod Crash To Application Or To Kubernetes Platform? (CrashLoopBackOff)
- devtron.ai: Troubleshoot: Pod Crashloopbackoff
- erkanerol.github.io: I wish pods were fully restartable Why are Pods not fully restartable in Kubernetes? Why is Kubernetes not restarting the Pod in CrashLoopBackOff?
- pauldally.medium.com: Why Leaving Pods in CrashLoopBackOff Can Have a Bigger Impact Than You Might Think
- sysdig.com: What is Kubernetes CrashLoopBackOff? And how to fix it CrashLoopBackOff is a Kubernetes state representing a restart loop that is happening in a Pod: a container in the Pod is started but crashes and is then restarted over and over again. Learn what it is and how to fix it in this article
- komodor.com: Kubernetes CrashLoopBackOff Error: What It Is and How to Fix It
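
The "BackOff" part of CrashLoopBackOff refers to the growing delay the kubelet inserts between restarts. The sketch below assumes the documented defaults (a 10 s initial delay that doubles after each crash and is capped at 300 s); the function name `crashloop_delays` is hypothetical, and the sketch ignores the reset that happens after a container has run cleanly for a while.

```python
def crashloop_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> list[int]:
    """Return the assumed back-off delay (in seconds) before each restart."""
    # Delay doubles with every consecutive crash until it hits the cap.
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


print(crashloop_delays(7))
```

This is why a Pod that has been crashing for a while appears to "do nothing" for up to five minutes between attempts.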
Failed to Create Pod Sandbox
Terminated with exit code 1 error
Pod in Terminating or Unknown Status
- tonylixu.medium.com: K8s Troubleshooting - Pod in Terminating or Unknown Status K8s Troubleshooting handbook
- blog.devgenius.io: K8s Troubleshooting - Pod in Terminating or Unknown Status
OOM Kills
- OOMKilled in Kubernetes: Understanding and Preventing Hidden Memory Leaks - This article explains the ‘OOMKilled’ status in Kubernetes, detailing how the Linux kernel’s Out-Of-Memory (OOM) Killer terminates pods when memory limits are exceeded. It covers common triggers such as incorrect resource limits, application memory leaks, traffic spikes, and resource competition among containers. The content also delves into the OOM Killer’s scoring mechanism and provides insights into identifying and resolving these issues to prevent production environment disruptions.
- medium.com/@reefland: Tracking Down "Invisible" OOM Kills in Kubernetes An "Invisible" OOM Kill happens when a child process in a container is killed, not the init process. It is "invisible" to Kubernetes and not detected. What is OOM? Well, not a good thing.
- baykara.medium.com: A Gentle Inspection of OOMKilled in Kubernetes Quality of Service in Kubernetes
- cloudyuga.guru: How does Kubernetes assign QoS class to pods through OOM score? This article discusses how to handle OOMKilled errors and how to configure Pod QoS to avoid them
- sysdig.com: Kubernetes OOM and CPU Throttling Troubleshooting Memory and CPU problems. Do you know how memory and CPU usage can affect your cloud applications? In this article, you will discuss Out of Memory (OOM) and Throttling in Kubernetes.
- medium.com/@bm54cloud: Stressing a Kubernetes Pod to Induce an OOMKilled Error Learn about memory requests and limits, and what happens when those limits are exceeded
- itnext.io: Kubernetes Silent Pod Killer Tracking down invisible OOM Kills in Kubernetes
- This article delves into the issue of “Invisible OOM Kills” in Kubernetes, where child processes getting OOM Killed go unnoticed.
- An "Invisible" OOM Kill occurs when a child process in a container (any process which is not the main process, PID 1) gets OOM Killed. In that scenario, the OOM Kill that occurred is "invisible" to Kubernetes, and as users we wouldn't be aware of it.
- The Solution: The entire scenario changes with Kubernetes version 1.28. Starting from that version, Kubernetes enables, by default, a cgroup v2 feature known as "cgroup grouping."
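
Several of the articles above tie OOM kills to a Pod's QoS class. The following is a simplified sketch of how that class is derived from container resources. It considers only `cpu` and `memory`, compares quantities as raw strings (so `"500m"` and `"0.5"` would be treated as different), and uses a hypothetical function name `qos_class` with a plain-dict input shape chosen for illustration.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' resources.

    Each container is a dict like
    {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m", "memory": "128Mi"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            # Guaranteed needs a limit for both resources, equal to the request.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "500m", "memory": "128Mi"}}]))
```

Under node memory pressure, BestEffort Pods are generally the first candidates to be killed and Guaranteed Pods the last, which is why setting requests equal to limits is a common recommendation for critical workloads.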
Pause Container
- blog.devgenius.io: K8s - pause container Why do we have a pause container in a K8s pod?
Preempted Pod
- blog.kumomind.com: What You Need To Know To Debug A Preempted Pod On Kubernetes The purpose of this post is to share some thoughts on the management of a Kubernetes platform in production. The idea is to focus on a major problem that many beginners encounter: the management of preempted pods.
Evicted Pods
- sysdig.com: Understanding Kubernetes Evicted Pods What does it mean that Kubernetes Pods are evicted? They are terminated, usually due to a lack of resources. But why does this happen?
Stuck Namespace
- blog.ediri.io: How to remove a stuck namespace With the help of the Kubernetes API
- medium.com/@it-craftsman: How to fix Kubernetes namespaces stuck in terminating state
Access PVC Data without the POD
- medium.com/@reefland: Access PVC Data without the POD; troubleshooting Kubernetes. I recently had a situation where Prometheus was stuck in a crash loop and unable to start. The solution is to delete a file within the Persistent Volume Claim (PVC). Seemed simple enough, however with the pod in a crash loop the PVC was not mounted within the Prometheus container. How can I delete the file?
CoreDNS issues
Debugging Techniques and Strategies. Debugging with ephemeral containers
- The Hidden CPU Throttling Crisis in Kubernetes Clusters - This article explains how Kubernetes CPU throttling, governed by the Linux kernel’s CFS scheduler with a 100ms time slice, can silently degrade application performance even when resource usage appears low. It highlights the disconnect between Kubernetes limits and typical monitoring timescales, leading to unexpected slowdowns and impacting user experience.
- Kubernetes Troubleshooting: A Step-by-Step Guide - A comprehensive, step-by-step guide to effectively troubleshoot issues within a Kubernetes environment. This resource likely covers common problems, diagnostic tools, and methodologies for resolving them.
- Awesome Chaos Engineering - (Related to chaos-engineering topic)
- Kubernetes Troubleshooting Guide: Common Pitfalls and Solutions - A comprehensive guide to common Kubernetes troubleshooting scenarios, offering practical advice and solutions for developers and operators facing issues with pods, deployments, services, and networking.
- loft.sh: Using Kubernetes Ephemeral Containers for Troubleshooting
- How to quarantine pods
- KDBG: Small Kubernetes debugging container KDBG (Kubernetes Debugger) is a small docker container based on the latest Alpine Linux image, used for debugging Kubernetes clusters from inside a pod.
- inspektor-gadget Collection of gadgets for debugging and introspecting Kubernetes applications using BPF
- learnk8s.io: A visual guide on troubleshooting Kubernetes deployments
- StatusBay is a tool that provides the missing visibility into the K8S deployment process. The main goal is to ease the experience of troubleshooting and debugging services in K8S and provide confidence while making changes.
- medium: Better Debugging Environment for your Micro-Services
- codefresh.io: Using Telepresence 2 for Kubernetes debugging and local development
- towardsdatascience.com: The Easiest Way to Debug Kubernetes Workloads The fastest and easiest way to debug and troubleshoot any application running on Kubernetes
- tetrate.io: How to debug microservices in Kubernetes with proxy, sidecar or service mesh?
- rookout.com: The Definitive Guide To Kubernetes Application Debugging
- thorsten-hans.com: Debugging apps in Kubernetes with Bridge Bridge to Kubernetes simplifies and streamlines the process of debugging applications running in Kubernetes. Debug any language using the tools you prefer and love.
- marketplace.visualstudio.com: Bridge to Kubernetes (VSCode)
- marketplace.visualstudio.com: Bridge to Kubernetes (Visual Studio) Bridge to Kubernetes for Visual Studio 2019
- thenewstack.io: Living with Kubernetes: 12 Commands to Debug Your Workloads Kubernetes can’t fix broken code. But if your container won’t start or the app gets intermittent errors, here’s how you can start debugging it. Most of the commands presented in the article will use kubectl or plugins which you can install via krew.
- opensource.googleblog.com: Introducing Ephemeral Containers Ephemeral containers are a new type of container that are part of the Kubernetes core API. An Ephemeral Container may be added to an existing Pod for administrative actions like debugging, it runs until it exits, and it won’t be restarted. An ephemeral container runs within the Pod’s existing resource allocation and shares common container namespaces.
- linkedin.com: Kubernetes Ephemeral Containers | Bibin Wilson Ephemeral Containers is one of the k8s beta features. The following command will add the debug-image container to the running frontend pod and take an exec session for debugging:
  kubectl debug -it pods/frontend --image=debug-image
- sumanthkumarc.medium.com: Debugging namespace deletion issue in Kubernetes
- medium.com/linux-shots: Debug Kubernetes Pods Using Ephemeral Container
- medium.com/@blgreco72: Debugging Kubernetes Services Locally There are various approaches for debugging Microservices hosted within Kubernetes. The approach used here does not alter the Kubernetes cluster in any way to support developing Microservices, running external to the cluster, within the developer's IDE. This is accomplished by mapping ports on the developer's workstation to services that are normally only accessible from containers running within the cluster.
- zendesk.engineering: Debugging containerd
- heka-ai.medium.com: Introduction to Debugging: locally and live on Kubernetes with VSCode In this article, you’ll learn how to debug your code in real-time on a Pod running on Kubernetes using VS Code
- iximiuz.com: Kubernetes Ephemeral Containers and kubectl debug Command Learn how to use Ephemeral Containers to debug Kubernetes workloads with and without the kubectl debug command
- eminaktas.medium.com: Debug Containerd in Production In this article, you will learn how you can debug containerd with VSCode in a remote production environment.
- medium.com/@alex.ivenin: Exploring ephemeral containers in kubernetes Ephemeral containers were introduced in Kubernetes 1.16 as an alpha feature, advanced to beta in version 1.23, and graduated to stable in Kubernetes 1.25. This capability provides an easy and safe way to debug running containers in a pod, without requiring full access to the underlying node.
- labs.iximiuz.com: How to work with container images using ctr ctr is a command-line client shipped as part of the containerd project. If you have containerd running on a machine, chances are the ctr binary is also present there.
- medium.com/@danielepolencic: Isolating kubernetes pods for debugging This article introduces a technique for debugging running Pods in production: by changing labels, you can detach Pods from the Service (so they receive no traffic) and troubleshoot them live
- medium.com/adaltas: Kubernetes: debugging with ephemeral containers In this article, you will learn how to debug pods using kubectl debug and ephemeral containers
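
The label-based isolation technique mentioned in the list above works because a Service only routes to Pods whose labels match its selector. The following is a toy model of that matching, not a client for a real cluster: it assumes a non-empty, equality-based selector, and the names `selects`, `endpoints`, `web-1`, and `web-2` are hypothetical.

```python
def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present on the Pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Names of the Pods that would receive traffic for this selector."""
    return [name for name, labels in pods.items() if selects(selector, labels)]


service_selector = {"app": "web"}
pods = {"web-1": {"app": "web"}, "web-2": {"app": "web"}}
print(endpoints(service_selector, pods))

# Relabel the misbehaving Pod: it keeps running but no longer matches.
pods["web-2"]["app"] = "web-debug"
print(endpoints(service_selector, pods))
```

If the changed label is also part of the owning ReplicaSet's selector, the controller would normally create a replacement Pod, so the quarantined Pod can be inspected without reducing serving capacity.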
Troubleshooting Tools
- The Definitive Guide to Importing Your Cloud Resources into IaC - (Related to iac topic)
- RKE2 Standalone Disaster Recovery Guide - (Related to kubernetes-backup-migrations topic)
- KubeUI: A Desktop Kubernetes Client - (Related to kubernetes-tools topic)
- A Complete Guide to Kubectl exec - (Related to kubernetes-tools topic)
- github.com/replicatedhq/troubleshoot Troubleshoot is a framework for collecting and analyzing diagnostic information about a Kubernetes cluster. The framework is customizable and allows third-party application developers to create troubleshoot specs that can be run by cluster operators.
- github.com/airwallex: k8s-pod-restart-info-collector k8s-pod-restart-info-collector is a simple Kubernetes custom controller that watches for Pod changes and sends K8s Pod restart reasons, logs, and events to Slack channels when a Pod restarts
Komodor
- komodor.com Turn troubleshooting chaos into clarity. Komodor is an observability tool that gives you insight into what's happening with your clusters and workloads. It integrates tools that we all use, like Datadog, Okta, LaunchDarkly, and PagerDuty.
- komodor.com: Kubernetes Troubleshooting: The Complete Guide
Palaemon
- palaemon.io Open-source developer tool for monitoring Kubernetes clusters and error analysis
- medium.com/@ospalaemon: Introducing Palaemon, the Savior of Kubernetes Pods!
cdebug and debug-ctr
- iximiuz/cdebug a swiss army knife of container debugging. It’s like “docker exec”, but it works even for containers without a shell (scratch, distroless, slim, etc). The “cdebug exec” command allows you to bring your own toolkit and start a shell inside of a running container.
- felipecruz91/debug-ctr A commandline tool for interactive troubleshooting when a container has crashed or a container image doesn’t include debugging utilities, such as distroless images. Heavily inspired by kubectl debug, but for containers instead of Pods.
kubectl-debug
- github.com/JamesTGrant/kubectl-debug kubectl-debug is a tool that lets you debug a target container in a Kubernetes cluster by automatically creating a new, non-invasive, ‘debug’ container in the same PID, network, user, and IPC namespace as the target container without any disruption
Kubeshark
Slides
Images
Tweets
My top 8 commands and tools for debugging applications running on @kubernetesio
— Daniel Bryant (@danielbryantuk) February 13, 2022
What is your favourite Kubernetes troubleshooting command? Looking for some new ones
— Saiyam Pathak (@SaiyamPathak) April 11, 2022
I made a tool… to debug containers
— Ivan Velichko (@iximiuz) October 23, 2022
It's like "docker exec", but it works even for containers without a shell (scratch, distroless, slim, etc).
The "cdebug exec" command allows you to bring your own toolkit and start a shell inside of a running container.
A short demo pic.twitter.com/82m4vzPYJr
There is a Kubernetes deployment which processes items from a queue. Most items are very small and completed immediately. Occasionally a whopping big item comes along and causes an OOMKill. Retries don't help for obvious reasons.
— Natan Yellin (@aantn) November 29, 2022
How would you solve it?
How does Pod to Pod communication work in Kubernetes?
— Daniele Polencic, @danielepolencic@hachyderm.io (@danielepolencic) May 8, 2023
How does the traffic reach the pod?
Let's dive into how low-level networking works in Kubernetes. pic.twitter.com/K8bBT8YiOf
- Debugging Kubernetes Systems: Practical Advice with Quality Telemetry - Adnan Rahic shares practical advice for debugging Kubernetes systems, highlighting the importance of quality telemetry.
