The commands, thought process, and order of operations I use when something breaks in a production cluster.
The first few minutes of a Kubernetes incident are the ones that matter most. If you spend them trying to remember what commands to run, or second-guessing which namespace to look in, you lose time that has a direct cost — in downtime, in user impact, in compounding failures. Having a practiced mental runbook doesn't mean you follow it rigidly in every incident. It means you know the starting points well enough that you can adapt without losing your footing.
This is the order of operations I follow when something breaks in a production cluster.
Start with scope. Before you look at any logs, run kubectl get pods -n <namespace> and kubectl get events -n <namespace> --sort-by='.lastTimestamp'. The pod list tells you what's running and what's not. The events list tells you what Kubernetes has been doing recently — scheduling failures, image pull errors, OOM kills, probe failures. Events are time-ordered and surprisingly informative, especially for issues that started minutes or hours ago rather than just now.
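Concretely, with <namespace> as a placeholder for wherever the affected workload lives:

# what's running, what's not
kubectl get pods -n <namespace>

# what Kubernetes has been doing recently, oldest first
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Adding -o wide to the pod listing also shows which node each pod landed on, which is handy if you already suspect a node-level problem.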
Next, narrow to the failing component. If a specific pod is in CrashLoopBackOff, kubectl describe pod <pod-name> is your next move. The describe output shows the container's last exit code, the reason for the most recent restart, and the full event timeline for that pod. Exit code 137 means the container was killed by SIGKILL, which usually indicates an OOM kill; check your memory requests and limits. Exit code 1 is a generic application crash; you need the logs. Exit code 0 means the container exited cleanly but unexpectedly, which usually points to a bug in the startup logic.
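If you want just the termination details without scanning the full describe output, a jsonpath query works too. A sketch, assuming a single-container pod (the [0] index changes for multi-container pods):

# full picture: exit code, restart reason, pod-level events
kubectl describe pod <pod-name>

# just the last terminated state: exit code, reason, timestamps
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'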
For logs, kubectl logs <pod-name> --previous gets you the logs from the last failed container, not the currently running one. This is the flag most people forget. If you're looking at a pod in a restart loop and you run kubectl logs without --previous, you'll see the logs from the current (brief) startup, not the crash that triggered the restart.
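In practice, with <pod-name> and <container-name> as placeholders:

# logs from the container that crashed, not the current restart
kubectl logs <pod-name> --previous

# multi-container pods need the container named explicitly
kubectl logs <pod-name> -c <container-name> --previous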
If the issue isn't obvious from the pod and its logs, widen the scope back out. Check whether it's infrastructure: kubectl top nodes shows whether any nodes are under memory or CPU pressure. Check whether it's networking: use kubectl exec to run a curl or nc from inside a running pod against the failing service. And check whether the service and its endpoints are correctly wired with kubectl get endpoints <service-name>.
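The widening pass might look like this; the pod and service names are placeholders, and the curl assumes the image in that pod actually ships curl:

# node-level resource pressure (requires metrics-server)
kubectl top nodes

# can an existing pod reach the failing service?
kubectl exec -it <some-running-pod> -- curl -v http://<service-name>:<port>/

# does the service have healthy backends at all?
kubectl get endpoints <service-name>

An ENDPOINTS column showing <none> usually means the service selector doesn't match any ready pods: either a label mismatch or failing readiness probes.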
For incidents involving a rollout, kubectl rollout status deployment/<name> and kubectl rollout history deployment/<name> are your first stops. If the new version is bad, kubectl rollout undo deployment/<name> will revert to the previous ReplicaSet. The rollout undo is one of the most useful commands in a real incident — it's fast, it's reliable, and it works even when the deployment is stuck.
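The rollout sequence, with deployment/<name> as the placeholder:

# is the rollout progressing, stuck, or done?
kubectl rollout status deployment/<name>

# which revisions exist, and what changed
kubectl rollout history deployment/<name>

# revert to the previous ReplicaSet
kubectl rollout undo deployment/<name>

# or target a specific known-good revision
kubectl rollout undo deployment/<name> --to-revision=<n>

Annotating deployments with kubernetes.io/change-cause makes the history output far more useful when you're picking a revision under pressure.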
The last thing, and the one most people skip during an incident: once it's resolved, write down what you did and in what order. Not a formal post-mortem necessarily, but enough that you and your team can reconstruct the timeline. The pattern of incidents in a system tells you more about what needs to be fixed structurally than any single event does. That information only accumulates if you capture it.