What can The Last of Us teach us about software development?
4 critical System Design lessons about zombie processes.
Some say the dead can’t hurt you.
But try telling that to the engineer who couldn’t fork() a new process in production — all because no one cleaned up a zombie process from 3 hours ago.
“Zombie process” isn’t just Halloween filler. It’s a real term and a very real problem. In today’s container-heavy, shell-happy environments, zombie processes are incredibly common.
...and they can break your systems silently.
Today we’ll talk about what zombie processes are, why they happen, how to prevent them, and what they reveal about resilient System Design.
Let’s dig in. 🧟
What is a zombie process?
A zombie process is a process that has finished executing but still lingers in the OS process table because its parent has not reaped it yet. Until that happens, the child sticks around in a defunct state (a.k.a. a zombie).
Not familiar with reaping? Here’s some context:
When a child process exits, the OS keeps a small record (including its exit code).
The parent process is responsible for collecting that record by calling wait() or waitpid(), often from a SIGCHLD handler that fires when a child exits.
That collection step is called reaping. It tells the OS, “Okay, this one’s done, you can clean it up now.”
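To make the reaping step concrete, here is a minimal Python sketch for Unix-like systems. The function name spawn_and_reap is hypothetical, chosen only for illustration. It forks a child that exits immediately, collects the exit record with waitpid(), and then confirms the record is gone by checking that a second waitpid() on the same PID fails.

```python
import os


def spawn_and_reap(exit_code=7):
    """Fork a short-lived child, reap it, and report what the parent collected.

    Returns (exit_code_seen_by_parent, record_gone).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit right away. From this moment until the parent
        # calls waitpid(), the child exists only as a zombie entry.
        os._exit(exit_code)

    # Parent: block until the child has exited, then collect its record.
    # This call is the "reaping" step.
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status)

    # After reaping, the kernel has discarded the record, so waiting on
    # the same PID again fails with ECHILD (ChildProcessError in Python).
    try:
        os.waitpid(pid, 0)
        record_gone = False
    except ChildProcessError:
        record_gone = True
    return code, record_gone
```

If the parent never made that first waitpid() call, the child would sit in the process table as a zombie for as long as the parent kept running.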
Note: Zombie processes are specific to Unix-like systems (Linux, macOS, BSD). If you’re on Windows, this is less likely to affect you (unless you’re running containers, WSL, or cross-platform runtimes).
Why you should care
Zombie processes don’t use CPU and hold almost no memory, but each one still occupies a process table entry and a PID.
A few are fine. But if you’ve got a horde, you’ve got a problem.
Left unchecked, zombie processes can exhaust the process table or hit a PID limit (such as kernel.pid_max or a container’s pids cgroup limit).
That means:
New subprocesses can’t start
Cron jobs and workers silently fail
Containers freeze
SSH sessions hang
Health checks time out
Autoscaling mysteriously breaks
And much like their horror-movie namesakes, zombie processes often strike quietly, and their impact spreads fast.
Who’s at risk?
Most of you are.
If your code spawns other processes, you’re in the danger zone. Especially if you:
Use subprocess, exec.Command, Popen(), or fork() and don’t explicitly wait or reap
Run services in containers (where your app runs as PID 1)
Build job runners, daemons, schedulers, or orchestration layers
Handle infrastructure, CI/CD, or container platforms
Work on long-running processes (think: workers, servers, pipelines)
If your code forks, it must reap. No exceptions.
And if you’re on-call? These zombies are your problem too.
How to prevent a zombie apocalypse in your infra
Step 1: Reap what you spawn
Zombie prevention starts with proper process lifecycle management.
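In practice, that means every process you start gets waited on, including the ones that hang. Here is one way this could look in Python. It is a sketch rather than a definitive pattern, and run_and_reap and its timeout parameter are hypothetical names for illustration.

```python
import subprocess


def run_and_reap(cmd, timeout=10.0):
    """Start cmd, wait for it, and always reap it, even if it hangs.

    Returns the child's return code. On POSIX, a negative value means
    the child was terminated by that signal number.
    """
    proc = subprocess.Popen(cmd)
    try:
        # wait() reaps the child as soon as it exits.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child overstayed its welcome: kill it, then wait again.
        # Killing alone is not enough; the second wait() is what
        # removes the zombie entry.
        proc.kill()
        return proc.wait()
```

For simple cases, subprocess.run() already waits for the child, so the explicit version above matters most when you manage Popen objects yourself.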
Step 2: If you’re in a container, use an init
Containers make things trickier because:
Your app often runs as PID 1
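A process running as PID 1 gets no default signal handling (signals such as SIGTERM are ignored unless it installs a handler), and it inherits orphaned child processes that it must reap itself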
To fix this:
Use a minimal init system like tini or dumb-init (with Docker, docker run --init adds one for you)
Or implement signal handling and reaping logic in your app
Kubernetes won’t save you here; it assumes your container knows how to clean up after itself.
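If you go the do-it-yourself route, the reaping half might look like the Python sketch below. The names reap_children and install_reaper are hypothetical, and this covers only child cleanup. A real PID 1 would also need to handle or forward signals such as SIGTERM.

```python
import os
import signal


def reap_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, raw_wait_status) tuples.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG means "don't block".
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append((pid, status))
    return reaped


def install_reaper():
    """Run reap_children() whenever the kernel reports a child exit.

    Must be called from the main thread. The loop inside reap_children()
    matters because several exits can be delivered as a single SIGCHLD.
    """
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_children())
```

One caveat: a blanket waitpid(-1) can collect a child before other code (for example, a subprocess.Popen object) gets to read its exit status, which is one reason a dedicated init like tini is often the simpler choice.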
Step 3: Monitor it
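You can check for the undead on any Linux/Unix system with ps aux | grep 'Z', or more precisely with ps -A -o pid,ppid,stat,comm | awk '$3 ~ /^Z/' so that only the state column is matched.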
The STAT column will show Z for zombie.
If you see more than a few? Something’s not reaping.
You can also:
Monitor total process count
Alert on fork failures (EAGAIN)
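Export process metrics with Prometheus: node_procs_running and node_procs_blocked track runnable and blocked processes, while node_exporter’s optional processes collector (node_processes_state) breaks counts down by state, zombies included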
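If you want to turn the ps check into something a script or alert can use, a small parser is enough. The Python sketch below assumes ps is installed and accepts the -A and -o options. The names find_zombies and list_zombies are hypothetical.

```python
import subprocess


def find_zombies(ps_output):
    """Parse headerless `ps` output with pid, ppid, stat, comm columns.

    Returns a list of (pid, ppid, command) for processes whose state
    starts with Z. Only the state column is checked, so a command
    name that happens to contain a "Z" is not a false positive.
    """
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        pid, ppid, stat = parts[0], parts[1], parts[2]
        comm = parts[3].strip() if len(parts) > 3 else ""
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm))
    return zombies


def list_zombies():
    """Run ps and return the zombies currently on this host."""
    # A trailing "=" on each column suppresses the header row.
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "ppid=", "-o", "stat=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

The parent PID is the useful part of the output: zombies are fixed by fixing (or restarting) whichever parent is failing to reap, not by trying to kill the zombie itself.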
4 System Design lessons zombies teach us
Zombie processes may seem like an OS-level edge case, but they point to bigger truths about building reliable systems:
You own what you spawn: Whether it’s a process, thread, goroutine, or cloud resource — if you create it, you’re responsible for cleaning it up.
Cleanup is part of the contract: Designing a system that starts things is only half the job. It also needs to know how to end them predictably, safely, and fully.
Invisible failures are still real failures: Zombies don’t spike CPU or throw exceptions. They rot your infra quietly. Great systems have observability for the quiet stuff too.
Abstractions leak: Neither serverless, Kubernetes, nor containers can save you from the kernel. Your platform can’t reap what your app refuses to.
System Design isn’t just about architecture diagrams and scaling. It’s also about knowing what happens to your subprocesses when they die.
The dead don’t clean up after themselves
Zombie processes are a perfect reminder that neglected cleanup becomes production debt — the kind that doesn’t alert, but still breaks things.
Even if you forget everything else, remember this:
If you spawn it, reap it.
If you’re on-call, monitor it.
If you want to keep the learning going, check out our System Design hub for hands-on System Design courses at every experience level.
Interested in other courses, projects, or AI Mock Interviews? For a limited time, you can access them at 50% off with an Educative subscription.
🙈 Got an interview horror story? Hit reply and share it. I may anonymously feature it in next week’s email.
Until then, don’t let the dead processes bite.
Happy learning!
- Fahim




