Dynamo Troubleshoot

Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail. For normal bring-up, use recipe-runner/router-starter before this skill.

Published by @NVIDIA

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

  • Python 3.10+ on the operator machine.
  • kubectl configured with read access to the target namespace.
  • Permission to read pods, events, jobs, PVCs, and DynamoGraphDeployment resources (NOT secrets).
  • Network reachability to the cluster API server.
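
The read-access prerequisite can be checked before collecting anything. The following is a minimal sketch, assuming `kubectl auth can-i` is available on the operator machine; the resource list is an illustration, and the `dynamographdeployments` plural in particular is an assumption to confirm with `kubectl api-resources`.

```python
import subprocess

# Illustrative resource names (assumptions, not taken from the bundle script).
READ_RESOURCES = ["pods", "events", "jobs", "persistentvolumeclaims",
                  "dynamographdeployments"]


def can_i_command(resource: str, namespace: str) -> list[str]:
    """Build a `kubectl auth can-i` command; it only queries RBAC."""
    return ["kubectl", "auth", "can-i", "get", resource,
            "--namespace", namespace]


def missing_read_access(namespace: str, runner=subprocess.run) -> list[str]:
    """Return the resources the current identity cannot read in `namespace`."""
    denied = []
    for resource in READ_RESOURCES:
        result = runner(can_i_command(resource, namespace),
                        capture_output=True, text=True, check=False)
        # `kubectl auth can-i` prints "yes" or "no" on stdout.
        if result.stdout.strip() != "yes":
            denied.append(resource)
    return denied
```

Calling `missing_read_access("dynamo-demo")` returns an empty list when every listed resource is readable. Secrets are left out of the list on purpose.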

Instructions

1. Collect A Read-Only Bundle

Run:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"

If the user names a deployment, include it:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
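
As an extra guard when pasting logs or bundle excerpts into a report, token-like strings can be masked first. This is a minimal sketch, assuming Hugging Face tokens appear as `hf_` followed by a long alphanumeric run; whether the bundle script already redacts its output is not stated here, and `redact_hf_tokens` is a hypothetical helper name.

```python
import re

# Assumption: a token looks like "hf_" plus at least 20 alphanumeric characters.
HF_TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{20,}")


def redact_hf_tokens(text: str) -> str:
    """Replace anything that looks like a Hugging Face token with a marker."""
    return HF_TOKEN_PATTERN.sub("hf_[REDACTED]", text)
```

The pattern is deliberately narrow, so ordinary identifiers that merely contain `hf_` followed by a short suffix are left untouched.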

2. Classify The Failure

Use references/failure-decision-tree.md and classify into one primary bucket:

  • cluster/platform
  • namespace/secret
  • model cache/PVC/download
  • image pull/runtime image
  • GPU scheduling/resources
  • operator/DynamoGraphDeployment reconciliation
  • frontend/router
  • worker/backend
  • endpoint/API
  • benchmark/perf job
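
To illustrate how pod-level evidence might map onto these buckets, here is a rough sketch that inspects one pod object as returned by `kubectl get pod <name> -o json`. The mapping rules are assumptions for illustration only and do not reproduce references/failure-decision-tree.md; `classify_pod` is a hypothetical helper.

```python
IMAGE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def classify_pod(pod: dict) -> str:
    """Guess a primary failure bucket from a single pod's status."""
    status = pod.get("status", {})
    containers = (status.get("initContainerStatuses", [])
                  + status.get("containerStatuses", []))
    for cs in containers:
        reason = (cs.get("state", {}).get("waiting") or {}).get("reason", "")
        if reason in IMAGE_REASONS:
            return "image pull/runtime image"
        if reason == "CreateContainerConfigError":
            # Often a referenced secret or config key does not exist.
            return "namespace/secret"
        if reason == "CrashLoopBackOff":
            # Simplification: a crashing frontend pod would belong elsewhere.
            return "worker/backend"
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            message = cond.get("message", "")
            if "nvidia.com/gpu" in message:
                return "GPU scheduling/resources"
            if "persistentvolumeclaim" in message.lower():
                return "model cache/PVC/download"
            return "cluster/platform"
    return "unclassified"
```

A result of `unclassified` means the pod status alone is not enough, and events or the DynamoGraphDeployment status should be consulted next.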

3. Debug Top Down

Check in this order:

  1. namespace, storage class, GPU nodes, and HF secret existence
  2. PVC and model-download job
  3. DynamoGraphDeployment status and events
  4. pod status, describe pod, and container logs
  5. frontend service and port-forward
  6. /v1/models
  7. /v1/chat/completions
  8. benchmark job only after endpoint smoke test passes

4. Fix One Layer At A Time

Prefer the smallest reversible change:

  • create missing namespace or HF secret
  • patch storageClassName
  • patch image tag or image pull secret
  • reduce the GPU request only if the recipe remains valid
  • switch KV router to approximate mode only if workers do not publish events
  • restart failed jobs after fixing the underlying config

After each fix, rerun the relevant readiness check before moving deeper.
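
One possible readiness check is to list pods that do not report a Ready condition. The sketch below assumes its input is the parsed output of `kubectl get pods -n <namespace> -o json`; `not_ready_pods` is a hypothetical helper, not part of the bundled script.

```python
def not_ready_pods(pod_list: dict) -> list[str]:
    """Return names of pods whose Ready condition is not True."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            # Completed Job pods (e.g. a finished download) are never Ready.
            continue
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names
```

An empty result after a fix suggests the layer has recovered and it is reasonable to move to the next check.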

Available Scripts

Script | Purpose | Arguments
scripts/collect_dynamo_debug_bundle.py | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | --namespace, --deployment-name, --output-dir

Invoke via the agentskills.io run_script() protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])

Examples

Collect everything in a namespace for triage:

python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo

Scope to a single failing deployment:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg

Equivalent through the agent protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])

Output Contract

Return:

  • problem class
  • evidence checked
  • strongest signal
  • likely cause
  • exact next command or patch
  • what was ruled out
  • whether it is safe to continue deployment or benchmarking
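
One way to keep the returned fields consistent is a small record type. This is a sketch only: the `TriageReport` class, its field names, and the rendered layout are assumptions, not a format the skill prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class TriageReport:
    """Holds the items listed in the output contract."""
    problem_class: str
    strongest_signal: str
    likely_cause: str
    next_action: str          # exact next command or patch
    safe_to_continue: bool    # deployment or benchmarking may proceed
    evidence_checked: list[str] = field(default_factory=list)
    ruled_out: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, one contract item per line."""
        lines = [
            f"problem class: {self.problem_class}",
            f"evidence checked: {', '.join(self.evidence_checked) or 'none'}",
            f"strongest signal: {self.strongest_signal}",
            f"likely cause: {self.likely_cause}",
            f"next command or patch: {self.next_action}",
            f"ruled out: {', '.join(self.ruled_out) or 'none'}",
            f"safe to continue: {'yes' if self.safe_to_continue else 'no'}",
        ]
        return "\n".join(lines)
```

Making `safe_to_continue` a required boolean forces an explicit decision on whether deployment or benchmarking should proceed.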

Limitations

  • Read-only. Never mutates the cluster; remediation commands are returned, not executed.
  • Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
  • Bundle size grows with deployment size; in very large namespaces, scope collection with --deployment-name.
  • Does not validate disaggregated (disagg) transport; use dynamo-interconnect-check for that.

Troubleshooting

Symptom | Likely cause | Next step
kubectl returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace
Bundle missing DynamoGraphDeployment status | Operator not installed or different namespace | Verify dynamo-platform operator is installed and watching the namespace
Model-download job in Pending | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job
Worker pods CrashLoopBackOff | Image/runtime mismatch or GPU not available | Inspect container logs; check nvidia.com/gpu allocatable on nodes
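
For the GPU check in the last row, the allocatable count can be read from node objects. The sketch below assumes its input is the parsed output of `kubectl get nodes -o json`; `gpu_allocatable` is a hypothetical helper.

```python
def gpu_allocatable(node_list: dict) -> dict[str, int]:
    """Map each node name to its allocatable nvidia.com/gpu count."""
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports quantities as strings, e.g. "8".
        counts[node["metadata"]["name"]] = int(
            allocatable.get("nvidia.com/gpu", "0"))
    return counts
```

If every node reports zero, the likely problem is the GPU device plugin or node pool rather than the worker image.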

Benchmark

See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.

References

  • Read references/failure-decision-tree.md for bucket-specific checks.
  • Use scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.
