dynamo-troubleshoot

Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.

Skill file

Preview skill file↓↑

---
name: dynamo-troubleshoot
description: Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
license: Apache-2.0
metadata:
  author: Dan Gil <dagil@nvidia.com>
  tags:
    - dynamo
    - kubernetes
    - troubleshooting
    - day-2
---

# Dynamo Troubleshoot

<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->

## Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next
action. Start with read-only evidence, avoid secrets, and fix one layer at a
time.

## Prerequisites

- Python 3.10+ on the operator machine.
- `kubectl` configured with read access to the target namespace.
- Permission to read pods, events, jobs, PVCs, and `DynamoGraphDeployment` resources (NOT secrets).
- Network reachability to the cluster API server.

## Instructions

### 1. Collect A Read-Only Bundle

Run:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"
```

If the user names a deployment, include it:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>
```

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.

### 2. Classify The Failure

Use `references/failure-decision-tree.md` and classify into one primary bucket:

- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job

### 3. Debug Top Down

Check in this order:

1. namespace, storage class, GPU nodes, and HF secret existence
2. PVC and model-download job
3. `DynamoGraphDeployment` status and events
4. pod status, `describe pod`, and container logs
5. frontend service and port-forward
6. `/v1/models`
7. `/v1/chat/completions`
8. benchmark job only after endpoint smoke test passes

### 4. Fix One Layer At A Time

Prefer the smallest reversible change:

- create missing namespace or HF secret
- patch `storageClassName`
- patch image tag or image pull secret
- reduce GPU request only if the recipe can still be valid
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config

After each fix, rerun the relevant readiness check before moving deeper.

## Available Scripts

| Script | Purpose | Arguments |
|---|---|---|
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name`, `--output-dir` |

Invoke via the agentskills.io `run_script()` protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
```

## Examples

Collect everything in a namespace for triage:

```bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
```

Scope to a single failing deployment:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg
```

Equivalent through the agent protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
```

## Output Contract

Return:

- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking

## Limitations

- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with `--deployment-name`.
- Does not validate disagg transport — use `dynamo-interconnect-check` for that.

## Troubleshooting

| Symptom | Likely cause | Next step |
|---|---|---|
| `kubectl` returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing `DynamoGraphDeployment` status | Operator not installed or different namespace | Verify `dynamo-platform` operator is installed and watching the namespace |
| Model-download job in `Pending` | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods `CrashLoopBackOff` | Image/runtime mismatch or GPU not available | Inspect container logs; check `nvidia.com/gpu` allocatable on nodes |

## Benchmark

See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.

## References

- Read `references/failure-decision-tree.md` for bucket-specific checks.
- Use `scripts/collect_dynamo_debug_bundle.py` for read-only bundle collection.

Source

Creator's repository · nvidia/skills

View on GitHub ↗

License: Apache-2.0

Security

Security checks in progress

Results will appear here once audits complete

What this skill can do

Reads your filesConnects to the internetRuns code on your machine

Checked by 3 independent security firms

Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub

Does it sneak in hidden code?Not yet checkedPending · Socket

Does it have known bugs?Not yet checkedPending · Snyk