Dynamo Troubleshoot

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

Python 3.10+ on the operator machine.
kubectl configured with read access to the target namespace.
Permission to read pods, events, jobs, PVCs, and DynamoGraphDeployment resources (NOT secrets).
Network reachability to the cluster API server.

Instructions

1. Collect A Read-Only Bundle

Run:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"

If the user names a deployment, include it:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
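
If bundle output or logs are echoed back to the user, it can help to scrub anything that looks like a token first. The following is a minimal sketch, not part of collect_dynamo_debug_bundle.py. It assumes Hugging Face tokens start with `hf_` followed by alphanumeric characters; the helper name `redact_tokens` is hypothetical.

```python
import re

# Assumption: Hugging Face access tokens look like "hf_" + alphanumerics.
# Adjust the pattern if your tokens use a different shape.
HF_TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{8,}")


def redact_tokens(text: str) -> str:
    """Replace anything that looks like a Hugging Face token with a placeholder."""
    return HF_TOKEN_PATTERN.sub("hf_<redacted>", text)


# The variable name HF_TOKEN is left alone; only the token value is replaced.
sample = "env HF_TOKEN=hf_abcDEF1234567890 was set"
print(redact_tokens(sample))
```

Pattern-based redaction is a safety net only. The primary control is still not collecting secrets in the first place.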

2. Classify The Failure

Use references/failure-decision-tree.md and classify into one primary bucket:

cluster/platform
namespace/secret
model cache/PVC/download
image pull/runtime image
GPU scheduling/resources
operator/DynamoGraphDeployment reconciliation
frontend/router
worker/backend
endpoint/API
benchmark/perf job
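
As a rough illustration of how evidence text might map to one primary bucket, here is a small sketch. The substrings are illustrative hints based on common Kubernetes event messages, not the rules in references/failure-decision-tree.md, and it covers only four of the buckets. `BUCKET_HINTS` and `classify` are hypothetical names.

```python
from typing import Optional

# Hypothetical hints; the authoritative rules live in
# references/failure-decision-tree.md. Earlier entries win on ties.
BUCKET_HINTS = [
    ("image pull/runtime image", ("imagepullbackoff", "errimagepull")),
    ("model cache/PVC/download", ("unbound immediate persistentvolumeclaims",)),
    ("GPU scheduling/resources", ("insufficient nvidia.com/gpu",)),
    ("namespace/secret", ('secret "',)),
]


def classify(evidence: str) -> Optional[str]:
    """Return the first bucket whose hint appears in the evidence, else None."""
    text = evidence.lower()
    for bucket, needles in BUCKET_HINTS:
        if any(needle in text for needle in needles):
            return bucket
    return None


print(classify("0/3 nodes are available: 3 Insufficient nvidia.com/gpu."))
```

A result of `None` means no hint matched, so fall back to the decision tree rather than guessing.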

3. Debug Top Down

Check in this order:

namespace, storage class, GPU nodes, and HF secret existence (confirm the secret is present by name; never read its contents)
PVC and model-download job
DynamoGraphDeployment status and events
pod status, describe pod, and container logs
frontend service and port-forward
/v1/models
/v1/chat/completions
benchmark job only after endpoint smoke test passes
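
The top-down order can be expressed as a list of checks that stops at the first failure, so deeper layers are not examined until the shallower ones pass. This is a sketch under assumptions: `first_failing_layer` is a hypothetical helper, and the lambdas stand in for real checks that would wrap kubectl commands or HTTP requests.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (layer name, zero-argument function returning True on success).
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name  # Stop here; deeper layers are not meaningful yet.
    return None


# Stand-in results for illustration; replace each lambda with a real probe.
checks: List[Check] = [
    ("namespace and storage class", lambda: True),
    ("PVC and model-download job", lambda: True),
    ("DynamoGraphDeployment status", lambda: False),
    ("pod status and logs", lambda: True),
    ("/v1/models", lambda: True),
]

print(first_failing_layer(checks))
```

Returning only the first failure keeps the focus on one layer at a time; later checks never run, so their results cannot distract from the root cause.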

4. Fix One Layer At A Time

Prefer the smallest reversible change:

create missing namespace or HF secret
patch storageClassName
patch image tag or image pull secret
reduce the GPU request only if the recipe remains valid with fewer GPUs
switch KV router to approximate mode only if workers do not publish events
restart failed jobs after fixing the underlying config

After each fix, rerun the relevant readiness check before moving deeper.

Available Scripts

Script	Purpose	Arguments
`scripts/collect_dynamo_debug_bundle.py`	Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status)	`--namespace`, `--deployment-name`, `--output-dir`

Invoke via the agentskills.io run_script() protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])

Examples

Collect everything in a namespace for triage:

python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo

Scope to a single failing deployment:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg

Equivalent through the agent protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])

Output Contract

Return:

problem class
evidence checked
strongest signal
likely cause
exact next command or patch
what was ruled out
whether it is safe to continue deployment or benchmarking
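
One way to keep the returned fields consistent is a small record type. The sketch below is an assumption about shape, not a schema defined by this skill; `TriageReport` and its field names are hypothetical, and the sample values (including the PVC name) are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TriageReport:
    """One field per item in the output contract."""
    problem_class: str
    evidence_checked: List[str]
    strongest_signal: str
    likely_cause: str
    next_action: str  # exact next command or patch
    ruled_out: List[str] = field(default_factory=list)
    safe_to_continue: bool = False  # default to "not safe" until proven otherwise


# Illustrative values only; "model-cache" is a hypothetical PVC name.
report = TriageReport(
    problem_class="model cache/PVC/download",
    evidence_checked=["events", "PVC status", "model-download job"],
    strongest_signal="PVC is unbound",
    likely_cause="storageClassName does not exist in the cluster",
    next_action="kubectl describe pvc model-cache -n dynamo-demo",
    ruled_out=["image pull/runtime image"],
)

print(json.dumps(asdict(report), indent=2))
```

Defaulting `safe_to_continue` to `False` means a report must explicitly assert safety before deployment or benchmarking proceeds.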

Limitations

Read-only. Never mutates the cluster; remediation commands are returned, not executed.
Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
Bundle size grows with deployment size; on very large namespaces, scope with --deployment-name.
Does not validate disaggregated transport; use dynamo-interconnect-check for that.

Troubleshooting

Symptom	Likely cause	Next step
`kubectl` returns Forbidden on events/pods	Service account lacks read RBAC	Ask the cluster administrator for a read-only role binding on the namespace
Bundle missing `DynamoGraphDeployment` status	Operator not installed, or watching a different namespace	Verify the `dynamo-platform` operator is installed and watching the namespace
Model-download job pod stuck in `Pending`	PVC unbound or HF secret missing	Fix PVC binding or create the named HF secret, then rerun the job
Worker pods `CrashLoopBackOff`	Image/runtime mismatch or GPU not available	Inspect container logs; check `nvidia.com/gpu` allocatable on nodes
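
For the GPU check in the last row, allocatable counts can be read from the JSON that `kubectl get nodes -o json` prints. The sketch below parses that structure from a string so it can be tried without a cluster. `gpu_allocatable` is a hypothetical helper, and the sample node names are made up.

```python
import json
from typing import Dict


def gpu_allocatable(nodes_json: str) -> Dict[str, int]:
    """Map each node name to its allocatable nvidia.com/gpu count (0 if absent)."""
    nodes = json.loads(nodes_json)
    result: Dict[str, int] = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "8".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Trimmed sample in the shape of `kubectl get nodes -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
})

print(gpu_allocatable(SAMPLE))
```

If every node reports 0, the usual suspects are a missing GPU device plugin or a cluster with no GPU nodes, rather than the worker image.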

Benchmark

See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.

References

Read references/failure-decision-tree.md for bucket-specific checks.
Use scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.
