Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
---
name: dynamo-troubleshoot
description: Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
license: Apache-2.0
metadata:
author: Dan Gil <dagil@nvidia.com>
tags:
- dynamo
- kubernetes
- troubleshooting
- day-2
---
# Dynamo Troubleshoot
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
## Purpose
Turn a Dynamo failure into a clear problem class, strongest signal, and next
action. Start with read-only evidence, avoid secrets, and fix one layer at a
time.
## Prerequisites
- Python 3.10+ on the operator machine.
- `kubectl` configured with read access to the target namespace.
- Permission to read pods, events, jobs, PVCs, and `DynamoGraphDeployment` resources (NOT secrets).
- Network reachability to the cluster API server.
## Instructions
### 1. Collect A Read-Only Bundle
Run:
```bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"
```
If the user names a deployment, include it:
```bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>
```
Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
### 2. Classify The Failure
Use `references/failure-decision-tree.md` and classify into one primary bucket:
- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job
### 3. Debug Top Down
Check in this order:
1. namespace, storage class, GPU nodes, and HF secret existence
2. PVC and model-download job
3. `DynamoGraphDeployment` status and events
4. pod status, `describe pod`, and container logs
5. frontend service and port-forward
6. `/v1/models`
7. `/v1/chat/completions`
8. benchmark job only after endpoint smoke test passes
### 4. Fix One Layer At A Time
Prefer the smallest reversible change:
- create missing namespace or HF secret
- patch `storageClassName`
- patch image tag or image pull secret
- reduce GPU request only if the recipe can still be valid
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config
After each fix, rerun the relevant readiness check before moving deeper.
## Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name`, `--output-dir` |
Invoke via the agentskills.io `run_script()` protocol:
```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
```
## Examples
Collect everything in a namespace for triage:
```bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
```
Scope to a single failing deployment:
```bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disagg
```
Equivalent through the agent protocol:
```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
```
## Output Contract
Return:
- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking
## Limitations
- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with `--deployment-name`.
- Does not validate disagg transport — use `dynamo-interconnect-check` for that.
## Troubleshooting
| Symptom | Likely cause | Next step |
|---|---|---|
| `kubectl` returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing `DynamoGraphDeployment` status | Operator not installed or different namespace | Verify `dynamo-platform` operator is installed and watching the namespace |
| Model-download job in `Pending` | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods `CrashLoopBackOff` | Image/runtime mismatch or GPU not available | Inspect container logs; check `nvidia.com/gpu` allocatable on nodes |
## Benchmark
See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.
## References
- Read `references/failure-decision-tree.md` for bucket-specific checks.
- Use `scripts/collect_dynamo_debug_bundle.py` for read-only bundle collection.
Creator's repository · nvidia/skills
License: Apache-2.0