Kubernetes Troubleshooting
- Introduction
- Kubernetes Events
- Kubernetes Network Troubleshooting
- Exit Codes in Containers and Kubernetes
- ImagePullBackOff
- CrashLoopBackOff
- Failed to Create Pod Sandbox
- Terminated with exit code 1 error
- Pod in Terminating or Unknown Status
- OOM Kills
- Pause Container
- Preempted Pod
- Evicted Pods
- Stuck Namespace
- Access PVC Data without the POD
- CoreDNS issues
- Debugging Techniques and Strategies. Debugging with ephemeral containers
- Troubleshooting Tools
- Slides
- Images
- Tweets
Introduction
- learnk8s.io: A visual guide on troubleshooting Kubernetes deployments 🌟
- nigelpoulton.com: Troubleshooting kubernetes service discovery - Part 1
- medium: 5 tips for troubleshooting apps on Kubernetes
- managedkube.com: Troubleshooting a Kubernetes ingress
- veducate.co.uk: How to fix in Kubernetes – Deleting a PVC stuck in status “Terminating”
- thenewstack.io: 5 Best Practices to Back up Kubernetes
- tennexas.com: Kubernetes Troubleshooting Examples
- levelup.gitconnected.com: 5 tips for troubleshooting apps on Kubernetes
- medium: Common Kubernetes Errors Made by Beginners [2021] 🌟
- cloud.redhat.com: Troubleshooting Sandboxed Containers Operator
- andydote.co.uk: The Problem with CPUs and Kubernetes
- kinvolk.io: Investigating Kubernetes performance issues with BPF
- medium: Better Debugging Environment for your Micro-Services
- thenewstack.io: 6 Kubernetes Best Practices to Empower Devs to Troubleshoot
- youtube: 3 Ways to Detect Evil “Latest” Image Tags in Kubernetes - Kubevious The “latest” image tag is a disaster waiting to happen. In this video, you will learn how to detect usage of the latest images using 3 different methods.
- thenewstack.io: Living with Kubernetes: Debug Clusters in 8 Commands 🌟
- dzone.com: The Three Pillars of Kubernetes Troubleshooting 🌟 A look at the three pillars of Kubernetes troubleshooting (understanding, managing, and preventing) and how they help frame what is needed to properly troubleshoot real-world Kubernetes stacks, which are complex, distributed systems.
- freecodecamp.org: How to Simplify Kubernetes Troubleshooting
- itnext.io: Distroless Container Debugging on K8s/OpenShift
- As people focus more on container security, distroless images are frequently used to reduce the attack surface. In these images, the package manager, unneeded modules and libraries, and even the shell are stripped out; only the app and its required dependencies are kept. For a statically linked executable, such as one produced by Go, we can even use “scratch” as the base.
- The potential for exploiting vulnerabilities is therefore greatly reduced. On the other hand, it is difficult to troubleshoot the application when even the shell is unavailable, leaving only the logs from the app.
- In this paper, we will explore different options to facilitate debugging by bringing back the shell.
- speakerdeck.com/mhausenblas (redhat): Troubleshooting Kubernetes apps
- containiq.com: Debugging Your Kubernetes Nodes in the ‘Not Ready’ State Kubernetes clusters typically run on multiple “nodes”, each having its own state. In this article, you’ll learn a few possible reasons a node might enter the NotReady state and how you can debug it.
- containiq.com: Troubleshooting Kubernetes FailedAttachVolume and FailedMount When working with Persistent Volumes in Kubernetes, you might run into the FailedAttachVolume or FailedMount error. In this tutorial, we’ll show you how to troubleshoot these errors and find the root cause and fix them.
- medium.com/@andrewachraf: Detect crashes in your Kubernetes cluster using kwatch and Slack 🌟 Monitor all changes in your Kubernetes (K8s) cluster and detect crashes in your running apps in real time
- research.nccgroup.com: Detection Engineering for Kubernetes clusters In this article you will learn how to detect anomalies in your cluster using Kubernetes Audit logs and Anomalies Detection Engineering.
- pauldally.medium.com: Kubernetes — Debugging NetworkPolicy (Part 1)
- medium.com/@tina168wong: Kubernetes Ingress and Services troubleshooting In this article, you will find some useful tips for troubleshooting the traffic flow in your cluster: from the Ingress to your Pods.
- medium.com/geekculture: Common Pod Errors in Kubernetes to Watch Out For
- faun.pub: Kubernetes — Debugging NetworkPolicy (Part 1) For something as important as NetworkPolicy, debugging is surprisingly painful. In this article you will learn a few practical tips on how to debug your network policies
- tratnayake.dev: Oncall Adventures - When your Prometheus-Server mounted to GCE Persistent Disk on K8s is Full In this article, you will follow Thilina’s journey on debugging a failing Prometheus server on Kubernetes. The story starts with a wake-up call at 3.30 am 😅
- sysdig.com: Understanding Kubernetes pod pending problems
- containiq.com: Kubernetes Node Disk Pressure | Troubleshooting w/ Example In this article, you’ll learn more about Kubernetes nodes experiencing disk pressure, including causes of disk pressure and a step-by-step guide to troubleshooting the error.
- blog.alexellis.io: How to Troubleshoot Applications on Kubernetes 🌟 In this article, you will learn a practical framework to troubleshoot applications deployed on Kubernetes:
- Is it there?
- Why isn’t it working?
- It starts, but doesn’t work
- There are too many pods!
- But can you curl it?
- blog.devgenius.io: All You Need to Know about Debugging Kubernetes Cronjob Walkthrough tools & configs & knowledge used in Kubernetes cronjob/deployment debug. In this article, you will create and deploy a (broken) CronJob. Then you will debug it and in the process learn about environment variables, RBAC, pod resource configuration, logging, and more.
- saiteja313.medium.com: Tracing DNS issues in Kubernetes
- medium.com/@jasonmfehr: Kubernetes Informers: Opening the Mystery Box In this article, you will learn how the team at Cloudera found a performance issue with Kubernetes informers and how they managed to rectify the issue
- maxilect-company.medium.com: Graceful shutdown in a cloud environment (the example of Kubernetes + Spring Boot) 🌟 In this article, you’ll learn why it is crucial to think about graceful shutdown in Kubernetes and how you can approach this task. Many people think about starting an application in the cloud but rarely pay attention to how it ends. Once, we caught quite a few errors explicitly related to pods stopping. For example, we saw that Kubernetes occasionally kills our application before it releases resources, although it seems that this should not happen. It was impossible to reproduce the problem immediately, and we wondered what was happening under the hood?
- martinheinz.dev: Backup-and-Restore of Containers with Kubernetes Checkpointing API Kubernetes v1.25 introduced Container Checkpointing API as an alpha feature. This provides a way to backup-and-restore containers running in Pods, without ever stopping them. This feature is primarily aimed at forensic analysis, but general backup-and-restore is something any Kubernetes user can take advantage of. So, let’s take a look at this brand-new feature and see how we can enable it in our clusters and leverage it for backup-and-restore or forensic analysis.
- madeeshafernando.medium.com: Capturing Heap Dumps of stateless Kubernetes pods before container termination and export to AWS S3
- faun.pub: Troubleshooting Kubernetes nodes storage space shortage on Aliyun (Alibaba Cloud) In this article, you will follow Stephen’s journey to identifying the root cause for cluster nodes running out of space on the Aliyun cloud
- thenewstack.io: What David Flanagan Learned Fixing Kubernetes Clusters David Flanagan has fixed 50+ Kubernetes clusters as part of his YouTube series, ‘Klustered.’ He shared what he learned at Civo Navigate.
- github.com/metaleapca: metaleap-k8s-troubleshooting.pdf 🌟🌟🌟
- nicolasbarlatier.hashnode.dev: .NET Core Tip 2: How to troubleshoot Memory Leaks within a .NET Console application running in a Linux Docker Container in Kubernetes In this step-by-step guide, you will learn how to troubleshoot a memory leak in a .NET Core application running within a Kubernetes cluster.
- dzone.com: Tackling the Top 5 Kubernetes Debugging Challenges Bugs are inevitable and typically occur as a result of an error or oversight. Learn five Kubernetes debugging challenges and how to tackle them.
- levelup.gitconnected.com: Access Kubernetes Objects Data From /Proc Directory 🌟 The /proc directory is a special directory that holds all the details about our Linux system, such as the kernel, processes, and configuration parameters. In this article, you will learn how to explore the directory in a Kubernetes cluster
- learnitguide.net: How To Troubleshoot Kubernetes Pods
- learnitguide.net: How to Check Memory Usage of a Pod in Kubernetes?
- alexsniffin.medium.com: Debugging Remotely with Go in Kubernetes In this tutorial, you will learn how to debug an application deployed in Kubernetes remotely using VS Code and Delve
- thenewstack.io: Kubernetes Troubleshooting Primer A quick methodology for overcoming common error messages with examples of commands to help — useful for both the administrator and developer alike.
- devzero.io: Kubernetes Debugging Tips
- vik-y.medium.com: An easier way to auto-remediate memory leaks on Kubernetes!
- medium.com/@yusufkaratoprak: Advanced Troubleshooting Techniques in Kubernetes Pods
Kubernetes Events
- Understanding Kubernetes cluster events
- containiq.com: Kubernetes Events: In-Depth Guide & Examples 🌟 Kubernetes events help you understand how Kubernetes resource decisions are made and they can be helpful for debugging. Learn more about k8s events in this in-depth guide.
- groundcover.com: Failure Is an Option: How to Stay on Top of K8s Container Events Gain a deep understanding of how Kubernetes tracks container and Pod status, how it reports error information and how you can collect all of the above in an efficient way
- decisivedevops.com: Kubernetes Events — News feed of your cluster Understand Kubernetes Events and learn to use kubectl events to monitor and troubleshoot your cluster’s issues effectively.
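As a rough illustration of using events for troubleshooting, the sketch below filters Warning events out of JSON shaped like the output of `kubectl get events -o json`. The sample payload and the `warning_events` helper are invented for this example; only the field names (`type`, `reason`, `message`, `involvedObject`) are taken from the core/v1 Event object.

```python
import json

# Illustrative sample shaped like `kubectl get events -o json` output.
# The pod names and messages are made up for this sketch.
SAMPLE = """
{
  "items": [
    {"type": "Normal", "reason": "Scheduled",
     "message": "Successfully assigned default/web-1 to node-a",
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
    {"type": "Warning", "reason": "BackOff",
     "message": "Back-off restarting failed container",
     "involvedObject": {"kind": "Pod", "name": "web-2"}}
  ]
}
"""


def warning_events(raw_json):
    """Return (kind/name, reason, message) tuples for every Warning event."""
    events = json.loads(raw_json).get("items", [])
    result = []
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events are usually noise when hunting a failure
        obj = event.get("involvedObject", {})
        target = f"{obj.get('kind', '?')}/{obj.get('name', '?')}"
        result.append((target, event.get("reason", ""), event.get("message", "")))
    return result


if __name__ == "__main__":
    for target, reason, message in warning_events(SAMPLE):
        print(f"{target}: {reason}: {message}")
```

Against a live cluster, the same function could be fed the text produced by `kubectl get events -A -o json` instead of the sample string.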
Kubernetes Network Troubleshooting
- hwchiu.medium.com: Kubernetes Network Troubleshooting Approach 🌟
- itnext.io: Tracing Pod2Pod Network Traffic in Kubernetes | Daniele Polencic
Exit Codes in Containers and Kubernetes
- komodor.com: Exit Codes In Containers & Kubernetes – The Complete Guide 🌟 In this article, you will learn everything there is to know about exit codes used by container engines to indicate reasons for container termination.
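To make the idea of exit codes concrete, here is a minimal sketch that turns a container exit code into a hint. It assumes a Linux host and the common shell convention that codes above 128 mean "128 + signal number" (so 137 is SIGKILL, which is often what an OOM kill looks like, and 143 is SIGTERM). The `describe_exit_code` helper and its wording are hypothetical and not taken from the linked guide.

```python
import signal

# Conventional meanings of a few low exit codes.
WELL_KNOWN = {
    0: "completed successfully",
    1: "application error",
    126: "command found but not executable",
    127: "command not found",
}


def describe_exit_code(code):
    """Return a short human-readable hint for a container exit code."""
    if code in WELL_KNOWN:
        return WELL_KNOWN[code]
    # Codes in 129..192 are read as 128 + signal number (signals 1..64).
    if 128 < code <= 128 + 64:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not every number maps to a named signal on every platform.
            name = f"signal {number}"
        return f"terminated by {name}"
    return "application-defined exit code"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, describe_exit_code(code))
```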
ImagePullBackOff
- containiq.com: Kubernetes ImagePullBackOff: Troubleshooting With Examples If you’ve worked with Kubernetes for a while, chances are good that you have experienced the ImagePullBackOff status. This issue can be frustrating if you are unfamiliar with it, so this guide walks you through how to troubleshoot it, what some common causes are, and where to start if you encounter this problem.
- blog.ediri.io: Kubernetes: ImagePullBackOff! How to keep your calm and fix this like a pro!
CrashLoopBackOff
- medium.com: Kubernetes Tip: How To Disambiguate A Pod Crash To Application Or To Kubernetes Platform? (CrashLoopBackOff)
- devtron.ai: Troubleshoot: Pod Crashloopbackoff
- erkanerol.github.io: I wish pods were fully restartable Why are Pods not fully restartable in Kubernetes? Why is Kubernetes not restarting the Pod in CrashLoopBackOff?
- pauldally.medium.com: Why Leaving Pods in CrashLoopBackOff Can Have a Bigger Impact Than You Might Think
- sysdig.com: What is Kubernetes CrashLoopBackOff? And how to fix it 🌟 CrashLoopBackOff is a Kubernetes state representing a restart loop that is happening in a Pod: a container in the Pod is started but crashes and is then restarted over and over again. Learn what it is and how to fix it in this article
- komodor.com: Kubernetes CrashLoopBackOff Error: What It Is and How to Fix It
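The restart loop behind CrashLoopBackOff can be sketched with a few lines of arithmetic. The Kubernetes documentation describes the kubelet restarting a crashed container with an exponential back-off (10s, 20s, 40s, and so on) capped at five minutes; the exact values may differ between versions and configurations, so `base` and `cap` below are assumptions you can change, and `crashloop_delays` is a hypothetical helper name.

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Back-off delay in seconds before each of the first `restarts` restarts.

    The delay doubles after every crash and stops growing once it reaches `cap`.
    """
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


if __name__ == "__main__":
    delays = crashloop_delays(7)
    print(delays)                      # per-restart waits
    print(sum(delays), "seconds spent waiting in total")
```

This is why a Pod that has been crashing for a while appears to "do nothing" for minutes at a time: it is waiting out the capped delay rather than being ignored.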
Failed to Create Pod Sandbox
- containiq.com: Troubleshooting the “Failed to Create Pod Sandbox” Error The “failed to create pod sandbox” error is a common problem when you’re trying to create a pod in Kubernetes. This article will explain the possible causes of the problem as well as how to fix it.
Terminated with exit code 1 error
- containiq.com: Troubleshooting ‘terminated with exit code 1’ error Sometimes Kubernetes pods die, leaving behind only cryptic messages such as “terminated with exit code 1”. In this guide, you’ll learn what this error indicates and how to troubleshoot it.
Pod in Terminating or Unknown Status
- tonylixu.medium.com: K8s Troubleshooting — Pod in Terminating or Unknown Status K8s Troubleshooting handbook
- blog.devgenius.io: K8s Troubleshooting — Pod in Terminating or Unknown Status
OOM Kills
- medium.com/@reefland: Tracking Down “Invisible” OOM Kills in Kubernetes An “Invisible” OOM Kill happens when a child process in a container is killed, not the init process. It is “invisible” to Kubernetes and not detected. What is OOM? well.. not a good thing.
- baykara.medium.com: A Gentle Inspection of OOMKilled in Kubernetes Quality of Service in Kubernetes
- cloudyuga.guru: How does Kubernetes assign QoS class to pods through OOM score? This article discusses how to handle OOMKilled errors and how to configure Pod QoS to avoid them
- sysdig.com: Kubernetes OOM and CPU Throttling Troubleshooting Memory and CPU problems. Do you know how memory and CPU usage can affect your cloud applications? This article discusses Out of Memory (OOM) kills and CPU throttling in Kubernetes.
- medium.com/@bm54cloud: Stressing a Kubernetes Pod to Induce an OOMKilled Error Learn about memory requests and limits, and what happens when those limits are exceeded
- itnext.io: Kubernetes Silent Pod Killer Tracking down invisible OOM Kills in Kubernetes
- This article delves into the issue of “Invisible OOM Kills” in Kubernetes, where child processes getting OOM Killed go unnoticed.
- An “Invisible” OOM Kill occurs when a child process in a container (any process which is not the main process, PID 1) gets OOM Killed. In that scenario, the OOM Kill that occurred is “invisible” to Kubernetes, and as users we wouldn’t be aware of it.
- The Solution: The entire scenario changes with Kubernetes version 1.28. Starting from that version, Kubernetes enables, by default, a cgroup v2 feature known as “cgroup grouping.”
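One way to notice the "invisible" kills described above is to watch the `oom_kill` counter that cgroup v2 exposes in the `memory.events` file. The sketch below is an assumption-laden illustration, not the method from the linked articles: it assumes cgroup v2, that the file is readable (typically at `/sys/fs/cgroup/memory.events` inside the container), and it uses invented snapshot contents. The helper names are hypothetical.

```python
# Two illustrative snapshots of a cgroup v2 memory.events file.
# Real files contain the same flat "key value" lines; these numbers are made up.
BEFORE = "low 0\nhigh 0\nmax 12\noom 0\noom_kill 0\n"
AFTER = "low 0\nhigh 0\nmax 57\noom 1\noom_kill 1\n"


def parse_memory_events(text):
    """Parse the flat 'key value' lines of a memory.events file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kills_since(previous, current):
    """Number of processes OOM-killed in the cgroup between two snapshots."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)


if __name__ == "__main__":
    delta = oom_kills_since(parse_memory_events(BEFORE), parse_memory_events(AFTER))
    print(f"{delta} new OOM kill(s) in this cgroup")
```

Because the counter covers every process in the cgroup, it also increases when a child process is killed while PID 1 keeps running, which is the case the Pod status does not report.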
Pause Container
- blog.devgenius.io: K8s — pause container Why do we have a pause container in a K8s pod?
Preempted Pod
- blog.kumomind.com: What You Need To Know To Debug A Preempted Pod On Kubernetes The purpose of this post is to share some thoughts on the management of a Kubernetes platform in production. The idea is to focus on a major problem that many beginners encounter: the management of preempted pods.
Evicted Pods
- sysdig.com: Understanding Kubernetes Evicted Pods What does it mean that Kubernetes Pods are evicted? They are terminated, usually due to a lack of resources. But why does this happen?
Stuck Namespace
- blog.ediri.io: How to remove a stuck namespace With the help of the Kubernetes API
- medium.com/@it-craftsman: How to fix Kubernetes namespaces stuck in terminating state
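A commonly used API-based approach for a namespace stuck in Terminating is to submit the Namespace object with an empty `spec.finalizers` list to its `finalize` subresource. The sketch below only prepares that request body and path; it does not talk to a cluster, the sample object is invented, and the helper names are hypothetical. It is not necessarily the exact procedure the linked articles follow, and clearing finalizers can leave orphaned resources behind, so treat it as a last resort.

```python
import copy
import json

# Illustrative Namespace object, shaped like `kubectl get namespace stuck-ns -o json`.
STUCK = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "stuck-ns"},
    "spec": {"finalizers": ["kubernetes"]},
    "status": {"phase": "Terminating"},
}


def without_finalizers(namespace):
    """Return a copy of a Namespace object with spec.finalizers emptied."""
    cleaned = copy.deepcopy(namespace)  # leave the caller's object untouched
    cleaned.setdefault("spec", {})["finalizers"] = []
    return cleaned


def finalize_path(name):
    """API path of the namespace's finalize subresource."""
    return f"/api/v1/namespaces/{name}/finalize"


if __name__ == "__main__":
    body = without_finalizers(STUCK)
    print("PUT", finalize_path(body["metadata"]["name"]))
    print(json.dumps(body, indent=2))
```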
Access PVC Data without the POD
- medium.com/@reefland: Access PVC Data without the POD; troubleshooting Kubernetes. I recently had a situation where Prometheus was stuck in a crash loop and unable to start. The solution is to delete a file within the Persistent Volume Claim (PVC). Seemed simple enough, however with the pod in a crash loop the PVC was not mounted within the Prometheus container. How can I delete the file?
CoreDNS issues
Debugging Techniques and Strategies. Debugging with ephemeral containers
- kubectl-debug
- loft.sh: Using Kubernetes Ephemeral Containers for Troubleshooting
- kubesandclouds.com: Debugging with ephemeral containers in K8s (v1.18+)
- How to quarantine pods
- KDBG: Small Kubernetes debugging container KDBG (Kubernetes Debugger) is a small docker container based on the latest Alpine Linux image, used for debugging Kubernetes clusters from inside a pod.
- inspektor-gadget Collection of gadgets for debugging and introspecting Kubernetes applications using BPF
- learnk8s.io: A visual guide on troubleshooting Kubernetes deployments
- StatusBay is a tool that provides the missing visibility into the K8S deployment process. The main goal is to ease the experience of troubleshooting and debugging services in K8S and provide confidence while making changes.
- medium: Better Debugging Environment for your Micro-Services
- codefresh.io: Using Telepresence 2 for Kubernetes debugging and local development
- towardsdatascience.com: The Easiest Way to Debug Kubernetes Workloads The fastest and easiest way to debug and troubleshoot any application running on Kubernetes
- tetrate.io: How to debug microservices in Kubernetes with proxy, sidecar or service mesh?
- rookout.com: The Definitive Guide To Kubernetes Application Debugging
- thorsten-hans.com: Debugging apps in Kubernetes with Bridge Bridge to Kubernetes simplifies and streamlines the process of debugging applications running in Kubernetes. Debug any language using the tools you prefer and love.
- marketplace.visualstudio.com: Bridge to Kubernetes (VSCode)
- marketplace.visualstudio.com: Bridge to Kubernetes (Visual Studio) Bridge to Kubernetes for Visual Studio 2019
- thenewstack.io: Living with Kubernetes: 12 Commands to Debug Your Workloads 🌟 Kubernetes can’t fix broken code. But if your container won’t start or the app gets intermittent errors, here’s how you can start debugging it. Most of the commands presented in the article will use kubectl or plugins which you can install via krew.
- levelup.gitconnected.com: De-Mystifying Kubernetes Debugging How to debug your microservice in VS Code with Bridge to Kubernetes
- opensource.googleblog.com: Introducing Ephemeral Containers Ephemeral containers are a new type of container that are part of the Kubernetes core API. An Ephemeral Container may be added to an existing Pod for administrative actions like debugging, it runs until it exits, and it won’t be restarted. An ephemeral container runs within the Pod’s existing resource allocation and shares common container namespaces.
- linkedin.com: Kubernetes Ephemeral Containers | Bibin Wilson Ephemeral Containers is one of the k8s beta features. The following command will add the debug-image container to the running frontend pod and take an exec session for debugging:
kubectl debug -it pods/frontend --image=debug-image
- sumanthkumarc.medium.com: Debugging namespace deletion issue in Kubernetes
- medium.com/linux-shots: Debug Kubernetes Pods Using Ephemeral Container
- medium.com/@blgreco72: Debugging Kubernetes Services Locally 🌟 There are various approaches for debugging Microservices hosted within Kubernetes. The approach used here does not alter the Kubernetes cluster in any way to support developing Microservices, running external to the cluster, within the developer’s IDE. This is accomplished by mapping ports on the developer’s workstation to services that are normally only accessible from containers running within the cluster.
- zendesk.engineering: Debugging containerd
- heka-ai.medium.com: Introduction to Debugging: locally and live on Kubernetes with VSCode 🌟 In this article, you’ll learn how to debug your code in real-time on a Pod running on Kubernetes using VS Code
- iximiuz.com: Kubernetes Ephemeral Containers and kubectl debug Command 🌟 Learn how to use Ephemeral Containers to debug Kubernetes workloads with and without the kubectl debug command
- eminaktas.medium.com: Debug Containerd in Production In this article, you will learn how you can debug containerd with VSCode in a remote production environment.
- medium.com/@alex.ivenin: Exploring ephemeral containers in kubernetes 🌟 Ephemeral containers were introduced in Kubernetes 1.16 as an alpha feature, advanced to beta in version 1.23, and graduated to stable in Kubernetes 1.25. This capability provides an easy and safe way to debug running containers in a pod, without requiring full access to the underlying node.
- labs.iximiuz.com: How to work with container images using ctr ctr is a command-line client shipped as part of the containerd project. If you have containerd running on a machine, chances are the ctr binary is also present there.
- medium.com/@danielepolencic: Isolating kubernetes pods for debugging This article introduces a technique for debugging running Pods in production: by changing labels, you can detach Pods from the Service (so they receive no traffic) and troubleshoot them live
- medium.com/adaltas: Kubernetes: debugging with ephemeral containers In this article, you will learn how to debug pods using kubectl debug and ephemeral containers
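The label-based isolation technique mentioned in this list can be sketched as follows. A Service routes traffic to Pods whose labels contain every key/value pair in its selector, so changing one of those labels takes the Pod out of rotation while leaving it running. The function names, the `-debug` suffix, and the sample labels are all hypothetical choices for this illustration.

```python
def selects(selector, labels):
    """True if every key/value pair of a Service selector is present in the Pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def quarantine_patch(selector, suffix="-debug"):
    """Build a merge patch that rewrites one selector label so the Pod stops matching."""
    if not selector:
        raise ValueError("an empty selector cannot be used to detach a Pod")
    key = sorted(selector)[0]  # pick a key deterministically
    return {"metadata": {"labels": {key: selector[key] + suffix}}}


def apply_label_patch(labels, patch):
    """Simulate the effect of the patch on a Pod's label set."""
    merged = dict(labels)
    merged.update(patch["metadata"]["labels"])
    return merged


if __name__ == "__main__":
    selector = {"app": "frontend"}
    pod_labels = {"app": "frontend", "pod-template-hash": "abc123"}
    patch = quarantine_patch(selector)
    print("before:", selects(selector, pod_labels))
    print("after: ", selects(selector, apply_label_patch(pod_labels, patch)))
```

On a real cluster the equivalent change could be made with something like `kubectl label pod <name> app=frontend-debug --overwrite`. If the changed label is also part of the owning ReplicaSet's selector, the controller will usually create a replacement Pod, which is often the desired outcome when quarantining.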
Troubleshooting Tools
- github.com/replicatedhq/troubleshoot Troubleshoot is a framework for collecting and analyzing diagnostic information about a Kubernetes cluster. The framework is customizable and allows third-party application developers to create troubleshoot specs that can be run by cluster operators.
- github.com/airwallex: k8s-pod-restart-info-collector k8s-pod-restart-info-collector is a simple Kubernetes custom controller that watches for Pod changes and sends Pod restart reasons, logs, and events to Slack channels when a Pod restarts
Komodor
- komodor.com Turn troubleshooting chaos into clarity. Komodor is an observability tool that gives you insight into what’s happening with your clusters and workloads. It integrates tools that we all use, like Datadog, Okta, LaunchDarkly, and PagerDuty.
- komodor.com: Kubernetes Troubleshooting: The Complete Guide 🌟
Palaemon
- palaemon.io Open-source developer tool for monitoring Kubernetes clusters and error analysis
- medium.com/@ospalaemon: Introducing Palaemon, the Savior of Kubernetes Pods!
cdebug and debug-ctr
- iximiuz/cdebug a swiss army knife of container debugging. It’s like “docker exec”, but it works even for containers without a shell (scratch, distroless, slim, etc). The “cdebug exec” command allows you to bring your own toolkit and start a shell inside of a running container.
- felipecruz91/debug-ctr A commandline tool for interactive troubleshooting when a container has crashed or a container image doesn’t include debugging utilities, such as distroless images. Heavily inspired by kubectl debug, but for containers instead of Pods.
kubectl-debug
- github.com/JamesTGrant/kubectl-debug kubectl-debug is a tool that lets you debug a target container in a Kubernetes cluster by automatically creating a new, non-invasive, ‘debug’ container in the same PID, network, user, and IPC namespace as the target container without any disruption
Kubeshark
Slides
Images
Tweets
My top 8 commands and tools for debugging applications running on @kubernetesio 🧵👇
— Daniel Bryant (@danielbryantuk) February 13, 2022
What is your favourite Kubernetes troubleshooting command? Looking for some new ones 😉
— Saiyam Pathak (@SaiyamPathak) April 11, 2022
I made a tool… to debug containers 🧙‍♂️
It's like "docker exec", but it works even for containers without a shell (scratch, distroless, slim, etc).
The "cdebug exec" command allows you to bring your own toolkit and start a shell inside of a running container.
A short demo 👇 pic.twitter.com/82m4vzPYJr
— Ivan Velichko (@iximiuz) October 23, 2022
There is a Kubernetes deployment which processes items from a queue. Most items are very small and completed immediately. Occasionally a whopping big item comes along and causes an OOMKill. Retries don't help for obvious reasons.
How would you solve it?
— Natan Yellin (@aantn) November 29, 2022
How does Pod to Pod communication work in Kubernetes?
How does the traffic reach the pod?
Let's dive into how low-level networking works in Kubernetes. pic.twitter.com/K8bBT8YiOf
— Daniele Polencic — @danielepolencic@hachyderm.io (@danielepolencic) May 8, 2023