cuPyNumeric Migration Readiness
Purpose
Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.
This is a static, read-only assessment. Inspect the user's source with Read, Grep, and Glob. Do not execute the user's code, modify or write files, or print environment variables or secrets. The legate and cuPyNumeric Doctor commands shown below are suggestions for the user to run; they are not actions this skill performs.
If you have not used this skill before, read references/getting-started.md first.
When to use this skill
Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.
Decline and redirect when the request is not a pre-migration assessment:
- Post-migration performance / profiling ("already ported, why is it slow?") → point to `legate --profile` and the upstream profiling and debugging walkthrough.
- Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")
A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.
Instructions
Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.
Step 1 — Gather context
Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.
- Source location. Default to the current working directory when no path is given.
- Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65K per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
- Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
- Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.
State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.
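As a rough illustration of the size mapping in Step 1, the sketch below sorts a hot-path array size into the Gate 2 tiers. The helper name `size_tier` and the tier labels are hypothetical. Treating the 65K floor as a per-GPU figure and the 10M / 100M marks as total element counts is an assumption read from the wording above, not a rule quoted from the decision framework.

```python
# Thresholds taken from Step 1; how they combine is an assumption of this sketch.
PER_GPU_FLOOR = 65_536             # below this per GPU, Gate 2 fails
SINGLE_GPU_SPEEDUP = 10_000_000    # "10M+ for real single-GPU speedup"
MULTI_GPU_SPEEDUP = 100_000_000    # "100M+ for multi-GPU"


def size_tier(total_elements: int, gpus: int = 1) -> str:
    """Hypothetical helper: label a hot-path array size with a Gate 2 tier."""
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    per_gpu = total_elements // gpus
    if per_gpu < PER_GPU_FLOOR:
        return "below-floor"
    if total_elements >= MULTI_GPU_SPEEDUP:
        return "multi-gpu"
    if total_elements >= SINGLE_GPU_SPEEDUP:
        return "single-gpu"
    return "marginal"
```

With the Step 1 defaults (30–50 million elements, 1–4 GPUs) this sketch lands in the single-GPU tier, which is worth stating at the top of the assessment alongside the defaults themselves.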
Step 2 — Load the API support manifest
Read assets/api-support.md, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:
- `✓✓ numpy.X`: implemented and works on multi-GPU (the best path).
- `✓ numpy.X`: implemented but single-GPU/CPU only (a caveat for multi-node runs).
- `🟡 numpy.X — <note>`: partial support; read the note.
- `✗ numpy.X`: not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception); either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host NumPy.
If the Fetched: line is more than ~90 days old, refresh the snapshot — see the Available Scripts section.
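The manifest lookup and the staleness check can be sketched as below. This is only an illustration: the function names are hypothetical, and the exact line layout of assets/api-support.md (glyph, a space, the API name, an optional dash-separated note, and an ISO date after `Fetched:`) is assumed from the legend above rather than confirmed against the file.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# Longer glyph first so a double check mark is not read as a single one.
GLYPHS = [
    ("✓✓", "multi-gpu"),
    ("✓", "single-gpu-or-cpu"),
    ("🟡", "partial"),
    ("✗", "not-implemented"),
]

NOTE_SEPARATOR = " \u2014 "  # assumed separator between API name and note


def parse_manifest_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (status, api, note) for a glyph line, or None for any other line."""
    text = line.strip()
    for glyph, status in GLYPHS:
        if text.startswith(glyph):
            rest = text[len(glyph):].strip()
            api, _, note = rest.partition(NOTE_SEPARATOR)
            return status, api.strip(), note.strip()
    return None


def is_stale(fetched_line: str, today: date, max_age_days: int = 90) -> bool:
    """Assumes a line shaped like 'Fetched: 2025-01-15'."""
    stamp = fetched_line.split(":", 1)[1].strip()
    fetched = datetime.strptime(stamp[:10], "%Y-%m-%d").date()
    return (today - fetched).days > max_age_days
```

A line that returns `None` is not a glyph line; an API that never appears in the manifest is a separate case from one marked not implemented.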
Step 3 — Read the code semantically
Walk the user's files with Read and Grep and classify each region of array math against references/idioms-that-scale.md and references/idioms-that-block.md (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm arr traces back to a cupynumeric array (or np.* aliased to it) and check whether the access sits inside a hot loop. Apply these rules:
- Flag element loops (`for i in range(n): arr[i] = ...`) as blockers; treat an epoch/step/file loop with a vectorized body as fine. Distinguish the two.
- Flag scalar sync: `.item()` / `float()` / `int()` / `bool()` / `complex()` on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
- Flag reducing conditions: `if` / `while` over an array reduction (`while np.max(err) > tol:`) syncs every iteration.
- Flag hoistable allocation in a loop as a fixable inefficiency.
- Flag `mpi4py` in runtime code that partitions/communicates array data alongside `cupynumeric` (R108), but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
- Flag `order=` on `reshape` / `asarray` / `flatten` as R109, always, regardless of whether the version warns or silently no-ops.
- Always cite R304 in INFO for `np.random.*` under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (`--gpus N` / `LEGATE_GPUS` is the Legate launcher arg).
- Flag Python builtins on arrays (`sum` / `max` / `min` / `any` / `iter(arr)`): host-iteration fallback (R110; upstream best practices). Allow `len(arr)` (shape lookup; prefer `arr.shape[0]` / `arr.size` for 0-d safety).
- Flag `cupy` mixed with `cupynumeric` in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
- Look up every NumPy API the code calls in `assets/api-support.md` (glyph legend in Step 2).
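To make "inside a hot loop" concrete, here is a minimal sketch of the scalar-sync rule written against Python's standard `ast` module. It is an illustration of the reasoning only, not a tool this skill runs: the class and function names are hypothetical, and it yields candidates that still need the manual check that the value traces back to a cuPyNumeric array.

```python
import ast
from typing import List, Tuple

SYNC_METHODS = {"item"}
SYNC_BUILTINS = {"float", "int", "bool", "complex"}


class LoopSyncFinder(ast.NodeVisitor):
    """Collect scalar-conversion calls that execute once per loop iteration."""

    def __init__(self) -> None:
        self.loop_depth = 0
        self.hits: List[Tuple[int, str]] = []

    def _visit_in_loop(self, nodes) -> None:
        self.loop_depth += 1
        for node in nodes:
            self.visit(node)
        self.loop_depth -= 1

    def visit_For(self, node: ast.For) -> None:
        self.visit(node.iter)              # iterable is evaluated once
        self._visit_in_loop(node.body)
        for stmt in node.orelse:           # else-branch runs at most once
            self.visit(stmt)

    def visit_While(self, node: ast.While) -> None:
        # The condition is re-evaluated on every iteration, so it counts.
        self._visit_in_loop([node.test, *node.body])
        for stmt in node.orelse:
            self.visit(stmt)

    def visit_Call(self, node: ast.Call) -> None:
        if self.loop_depth > 0:
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr in SYNC_METHODS:
                self.hits.append((node.lineno, f".{func.attr}()"))
            elif isinstance(func, ast.Name) and func.id in SYNC_BUILTINS:
                self.hits.append((node.lineno, f"{func.id}()"))
        self.generic_visit(node)


def find_loop_sync_candidates(source: str) -> List[Tuple[int, str]]:
    """Return sorted (line, call) pairs for conversions found inside loops."""
    finder = LoopSyncFinder()
    finder.visit(ast.parse(source))
    return sorted(finder.hits)
```

The sketch treats a conversion at module level (the boundary) as fine and a conversion in a `while` condition as per-iteration, which mirrors the scalar-sync and reducing-condition rules above.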
For the deep "why," read references/gpu-stack.md (memory, SM, communication, dispatch) and references/execution-model.md (lazy execution, sync points, mapper).
Step 4 — Produce a structured assessment
Deliver the report in this order. Cite file:line for every finding so the user can navigate.
- Verdict in one sentence: see "Verdict framework" below.
- What works (SCALES findings): quote representative lines so the user sees what will speed up after the import swap.
- What blocks (BLOCKS findings): each tied to `idioms-that-block.md` and a recipe in `refactor-recipes.md`.
- What's fixable (REFACTOR findings): group by recipe; one recipe often fixes many sites.
- Compatibility / cost notes (INFO findings): SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs `--gpus N`.
- API support gaps: APIs the code calls that are unimplemented or single-GPU only per the manifest.
- Decision-framework summary: Gates 1–6 from `references/decision-framework.md`, marked pass / fail / uncertain.
- Recommended next steps: which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.
All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line — do NOT omit the heading; the headings are the structural contract the report is graded on. See assets/sample_report.md for worked reports.
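The "every heading, always" contract can be sketched as a small renderer. This is a hypothetical helper for illustration: the heading strings are shortened from the list above, and the real layout to follow is the one in assets/sample_report.md.

```python
from typing import Dict, List

# Section order follows Step 4; the exact heading text is an assumption.
SECTIONS = [
    "Verdict",
    "What works (SCALES findings)",
    "What blocks (BLOCKS findings)",
    "What's fixable (REFACTOR findings)",
    "Compatibility / cost notes (INFO findings)",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
]


def render_report(findings: Dict[str, List[str]]) -> str:
    """Emit all eight headings in order, filling empty ones with a placeholder."""
    unknown = set(findings) - set(SECTIONS)
    if unknown:
        raise KeyError(f"unknown sections: {sorted(unknown)}")
    lines: List[str] = []
    for heading in SECTIONS:
        lines.append(heading)
        entries = findings.get(heading, [])
        if entries:
            lines.extend(f"- {entry}" for entry in entries)
        else:
            lines.append("None for this code")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Driving the output from a fixed list of headings, rather than from whatever findings exist, is what keeps an empty section from being dropped.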
Step 5 — Hand off to cuPyNumeric Doctor for runtime validation
Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:
CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py
cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, nonzero misuse, mpi4py import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."
Verdict framework
Assign the verdict qualitatively, from the kinds of findings, not a score:
| Verdict | When | Action |
|---|---|---|
| READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark |
| LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from refactor-recipes.md; re-walk to READY |
| SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 (mpi4py) — rewrites, not disqualifications | Real project; budget 1–3 engineer-weeks per module |
| NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime |
Apply these in order; the first match wins:
- Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
- Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
- Any R108 (`mpi4py`) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
- Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
- One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
- Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
- No BLOCKS, no REFACTOR → READY.
- APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.
Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s — it destroys the scaling the R001s would have delivered. Conversely a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED — the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: references/decision-framework.md.
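The first-match-wins ordering can be sketched as a single function. Everything here beyond the rule order is an assumption made for illustration: the function name and arguments are hypothetical, "multiple BLOCKS" is read as three or more because "one or two" maps to LIGHT REFACTOR, and "demote one tier" is read as moving one step toward more work, capped at SIGNIFICANT REFACTOR.

```python
from typing import Iterable

TIERS = ["READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR"]


def _num(code: str) -> int:
    """'R104' -> 104."""
    return int(code.lstrip("R"))


def assign_verdict(
    codes: Iterable[str],
    gate4_fails: bool = False,
    gate2_fails: bool = False,
    missing_hot_path_api: bool = False,
) -> str:
    """Apply the verdict rules in order; the first match wins."""
    # Only a size or compute-pattern failure can produce NOT RECOMMENDED.
    if gate4_fails or gate2_fails:
        return "NOT RECOMMENDED"

    found = list(codes)
    blocks = [c for c in found if 101 <= _num(c) <= 111]
    refactors = [c for c in found if 201 <= _num(c) <= 206]

    if "R108" in blocks or len(blocks) >= 3:   # assumed cutoff for "multiple"
        tier = 2
    elif blocks or refactors:
        tier = 1
    else:
        tier = 0

    # Hot-path API missing from the manifest: one tier worse, never past SIGNIFICANT.
    if missing_hot_path_api:
        tier = min(tier + 1, 2)
    return TIERS[tier]
```

Note that the number of BLOCKS findings never pushes the result past SIGNIFICANT REFACTOR, which matches the point that the tiers measure engineering cost.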
What scales vs what blocks (at-a-glance)
- SCALES (keep as-is): vectorized elementwise, reductions, matmul / einsum, `np.where`, large-per-GPU stencil slicing `arr[1:-1, 1:-1]`, `out=`, boolean-mask indexing.
- BLOCKS (remove before migration): element loops, `np.vectorize`, `for row in arr`, `.item()` / `.tolist()` / `bool(arr)` in a hot loop, reducing `if` / `while` in a loop, `arr[::2]`, `dtype=object`, `mpi4py`, `order=`, `min` / `max` / `sum(arr)`.
- REFACTOR (apply a recipe): alloc in a loop, `x = x + y` rebind in a loop, `vstack` / `hstack` / `concatenate` in a loop, `np.nonzero()` + indexing, view-mutation of `diag` / `flip` / `flatten`, `reshape` in a hot loop.
- INFO (cost note, not a blocker): SciPy imports, single-device `linalg.qr` / `svd`, single-transform `fft.*`, size-thresholded `linalg.solve` / `cholesky`.
Full taxonomy in idioms-that-scale.md and idioms-that-block.md. Pass over silently any API the manifest doesn't list (out of scope of the upstream table — flagging it would be noise).
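A small side-by-side of two of the categories above, written with plain NumPy so it runs anywhere; under cuPyNumeric the only intended change would be the import. The function names are made up for this sketch, and the pairs are illustrations, not the recipes from refactor-recipes.md.

```python
import numpy as np  # after migration this would be: import cupynumeric as np


def smooth_loop(a):
    """BLOCKS: element loop, one tiny operation per index."""
    out = a.copy()
    for i in range(1, a.shape[0] - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def smooth_sliced(a):
    """SCALES: the same 1-D stencil expressed as whole-array slices."""
    out = a.copy()
    out[1:-1] = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return out


def total_builtin(a):
    """BLOCKS: Python's built-in sum() walks the array element by element."""
    return sum(a)


def total_reduction(a):
    """SCALES: a single array reduction."""
    return a.sum()
```

Each pair computes the same values; what differs is whether the work is expressed as one array operation or as many per-element ones.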
Reading order
The canonical, read-in-order guide lives in references/getting-started.md — read it once for orientation.
For a non-trivial assessment the must-reads are idioms-that-block.md, refactor-recipes.md, and decision-framework.md; the rest (idioms-that-scale.md, gpu-stack.md, execution-model.md, partitioning-and-balance.md, case-studies.md) are read on demand.
Limitations
- Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
- Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
- Does not profile the workload. For runtime measurement use `legate.timing.time()` and the upstream profiling and debugging guide.
- Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide `.tolist()`, and runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.
Examples
A worked assessment of the bundled assets/examples/ fixtures (an example, not a template):
Verdict: LIGHT REFACTOR. `scales_well.py` translates cleanly; `needs_refactor.py` needs one allocation hoisted; `blocks_scaling.py` syncs every iteration via `.item()`.

What works: `scales_well.py:23-31` (stencil R005), `:40-44` (reduction R002), `:18-22` (elementwise R001).

What blocks: `blocks_scaling.py:51-58` (R104: `.item()` in hot loop) → RR-sync.

What's fixable: `needs_refactor.py:21-28` (R201: alloc in loop) → RR-alloc.

Next: apply the recipes; re-walk to READY; enable `CUPYNUMERIC_DOCTOR=1` on the first real run.
The full worked report is in assets/sample_report.md.
Authoritative upstream references
- Comparison table (source for `assets/api-support.md`): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) / `.../latest/api/comparison.html` on docs.nvidia.com (canonical)
- Best practices, Doctor, profiling, differences with NumPy, Legate launcher: under https://docs.nvidia.com/cupynumeric/latest/ (`user/practices.html`, `user/doctor.html`, `user/profiling_debugging.html`, `user/differences.html`) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
- Source: https://github.com/nv-legate/cupynumeric
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| `scripts/fetch_api_support.py` | Scrape the upstream comparison table into `assets/api-support.md`. Python stdlib only; standalone. | `--default-path` (write the committed `assets/api-support.md`); `--docs-nvidia-url` (use canonical docs.nvidia.com instead of the default GitHub Pages mirror) |
The user runs this to refresh the manifest (python scripts/fetch_api_support.py --default-path).
Bundled references and assets
The references/ files are enumerated under "Reading order" above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets: assets/api-support.md (committed API snapshot, loaded in Step 2), assets/sample_report.md and assets/examples/*.py (worked report and fixtures).
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `Fetched:` line in the manifest > ~90 days old | Stale snapshot | Run `fetch_api_support.py --default-path` (user-run) |
| Manifest missing or scraper fails | Upstream HTML changed | WebFetch the comparison table for that assessment |
| NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count |
| Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see "When to use") — no verdict |