Stateful distributed systems reward careful separation. A cluster can look unhealthy because scheduling is stuck, because storage is hot, because regions are imbalanced, because compaction is hurting latency, or because a dependency is slow.

The investigation improves when each layer gets its own hypothesis. What is the resource manager doing? What is storage doing? Are clients retrying? Are a few nodes carrying too much load? Did a maintenance event change the shape of the system?

Debugging is less about knowing every command and more about preserving the map in your head while the incident tries to erase it.