Name: Cupynumeric Migration Readiness
Author: NVIDIA

cuPyNumeric Migration Readiness

Purpose

Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.

This is a static, read-only assessment. Inspect the user's source with Read, Grep, and Glob. Do not execute the user's code, modify or write files, or print environment variables or secrets. The legate, and cuPyNumeric Doctor commands shown below are suggestions for the user to run — not actions this skill performs.

If this skill has never been seen before, head to references/getting-started.md first.

When to use this skill

Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.

Decline and redirect when the request is not a pre-migration assessment:

Post-migration performance / profiling ("already ported, why is it slow?") → point to legate --profile and the upstream profiling and debugging walkthrough.
Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")

A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.

Instructions

Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.

Step 1 — Gather context

Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.

Source location. Default to the current working directory when no path is given.
Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65K per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.

State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.

Step 2 — Load the API support manifest

Read assets/api-support.md, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:

✓✓ numpy.X — implemented and works on multi-GPU (the best path).
✓ numpy.X — implemented but single-GPU/CPU only (caveats multi-node).
🟡 numpy.X — <note> — partial support; read the note.
✗ numpy.X — not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception) — either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host-NumPy.

If the Fetched: line is more than ~90 days old, refresh the snapshot — see the Available Scripts section.

Step 3 — Read the code semantically

Walk the user's files with Read and Grep and classify each region of array math against references/idioms-that-scale.md and references/idioms-that-block.md (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm arr traces back to a cupynumeric array (or np.* aliased to it) and check whether the access sits inside a hot loop. Apply these rules:

Flag element loops (for i in range(n): arr[i] = ...) as blockers; treat an epoch/step/file loop with a vectorized body as fine — distinguish the two.
Flag scalar sync — .item() / float() / int() / bool() / complex() on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
Flag reducing conditions — if/while over an array reduction (while np.max(err) > tol:) syncs every iteration.
Flag hoistable allocation in a loop as a fixable inefficiency.
Flag mpi4py in runtime code that partitions/communicates array data alongside cupynumeric (R108) — but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
Flag order= on reshape / asarray / flatten as R109 — always, regardless of whether the version warns or silently no-ops.
Always cite R304 in INFO for np.random.* under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (--gpus N / LEGATE_GPUS is the Legate launcher arg).
Flag Python builtins on arrays (sum/max/min/any/iter(arr)) — host-iteration fallback (R110; upstream best practices). Allow len(arr) (shape lookup; prefer arr.shape[0] / arr.size for 0-d safety).
Flag cupy mixed with cupynumeric in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
Look up every NumPy API the code calls in assets/api-support.md (glyph legend in Step 2).

For the deep "why," read references/gpu-stack.md (memory, SM, communication, dispatch) and references/execution-model.md (lazy execution, sync points, mapper).

Step 4 — Produce a structured assessment

Deliver the report in this order. Cite file:line for every finding so the user can navigate.

Verdict in one sentence — see "Verdict framework" below.
What works (SCALES findings) — quote representative lines so the user sees what will speed up after the import swap.
What blocks (BLOCKS findings) — each tied to idioms-that-block.md and a recipe in refactor-recipes.md.
What's fixable (REFACTOR findings) — group by recipe; one recipe often fixes many sites.
Compatibility / cost notes (INFO findings) — SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs --gpus N.
API support gaps — APIs the code calls that are unimplemented or single-GPU only per the manifest.
Decision-framework summary — Gates 1–6 from references/decision-framework.md, marked pass / fail / uncertain.
Recommended next steps — which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.

All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line — do NOT omit the heading; the headings are the structural contract the report is graded on. See assets/sample_report.md for worked reports.

Step 5 — Hand off to cuPyNumeric Doctor for runtime validation

Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:

CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py

cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, nonzero misuse, mpi4py import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."

Verdict framework

Assign the verdict qualitatively, from the kinds of findings, not a score:

Verdict	When	Action
READY	No BLOCKS; few/no REFACTOR	Swap the import; benchmark
LIGHT REFACTOR	A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS	Apply 1–3 recipes from `refactor-recipes.md`; re-walk to READY
SIGNIFICANT REFACTOR	Multiple BLOCKS in hot paths, or any R108 (`mpi4py`) — rewrites, not disqualifications	Real project; budget 1–3 engineer-weeks per module
NOT RECOMMENDED	Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here	Restructure first or use a different runtime

Apply these in order; the first match wins:

Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
Any R108 (mpi4py) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this — each BLOCKS has a documented recipe).
One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
No BLOCKS, no REFACTOR → READY.
APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.

Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s — it destroys the scaling the R001s would have delivered. Conversely a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED — the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: references/decision-framework.md.

What scales vs what blocks (at-a-glance)

SCALES (keep as-is) — vectorized elementwise, reductions, matmul / einsum, np.where, large-per-GPU stencil slicing arr[1:-1, 1:-1], out=, boolean-mask indexing.
BLOCKS (remove before migration) — element loops, np.vectorize, for row in arr, .item()/.tolist()/bool(arr) in a hot loop, reducing if/while in a loop, arr[::2], dtype=object, mpi4py, order=, min/max/sum(arr).
REFACTOR (apply a recipe) — alloc in a loop, x = x + y rebind in a loop, vstack/hstack/concatenate in a loop, np.nonzero() + indexing, view-mutation of diag/flip/flatten, reshape in a hot loop.
INFO (cost note, not a blocker) — SciPy imports, single-device linalg.qr/svd, single-transform fft.*, size-thresholded linalg.solve/cholesky.

Full taxonomy in idioms-that-scale.md and idioms-that-block.md. Pass over silently any API the manifest doesn't list (out of scope of the upstream table — flagging it would be noise).

Reading order

The canonical, read-in-order guide lives in references/getting-started.md — read it once for orientation.

For a non-trivial assessment the must-reads are idioms-that-block.md, refactor-recipes.md, and decision-framework.md; the rest (idioms-that-scale.md, gpu-stack.md, execution-model.md, partitioning-and-balance.md, case-studies.md) are read on demand.

Limitations

Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
Does not profile the workload. For runtime measurement use legate.timing.time() and the upstream profiling and debugging guide.
Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide .tolist(), runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.

Examples

A worked assessment of the bundled assets/examples/ fixtures (an example, not a template):

Verdict: LIGHT REFACTOR. scales_well.py translates cleanly; needs_refactor.py needs one allocation hoisted; blocks_scaling.py syncs every iteration via .item().

What works: scales_well.py:23-31 (stencil R005), :40-44 (reduction R002), :18-22 (elementwise R001). What blocks: blocks_scaling.py:51-58 (R104 — .item() in hot loop) → RR-sync. What's fixable: needs_refactor.py:21-28 (R201 — alloc in loop) → RR-alloc. Next: apply the recipes; re-walk to READY; enable CUPYNUMERIC_DOCTOR=1 on the first real run.

The full worked report is in assets/sample_report.md.

Authoritative upstream references

Comparison table (source for assets/api-support.md): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) / .../latest/api/comparison.html on docs.nvidia.com (canonical)
Best practices, Doctor, profiling, differences with NumPy, Legate launcher — under https://docs.nvidia.com/cupynumeric/latest/ (user/practices.html, user/doctor.html, user/profiling_debugging.html, user/differences.html) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
Source: https://github.com/nv-legate/cupynumeric

Available Scripts

Script	Purpose	Arguments
`scripts/fetch_api_support.py`	Scrape the upstream comparison table into `assets/api-support.md`. Python stdlib only; standalone.	`--default-path` (write the committed `assets/api-support.md`); `--docs-nvidia-url` (use canonical `docs.nvidia.com` instead of the default GitHub Pages mirror)

The user runs this to refresh the manifest (python scripts/fetch_api_support.py --default-path).

Bundled references and assets

The references/ files are enumerated under Required reading order above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets: assets/api-support.md (committed API snapshot, load in Step 2), assets/sample_report.md and assets/examples/*.py (worked report and fixtures).

Troubleshooting

Symptom	Cause	Fix
`Fetched:` line in the manifest > ~90 days old	Stale snapshot	Run `fetch_api_support.py --default-path` (user-run)
Manifest missing or scraper fails	Upstream HTML changed	`WebFetch` the comparison table for that assessment
NOT RECOMMENDED for many fixable BLOCKS	Heuristics applied out of order	Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count
Kernel authoring or post-migration profiling	Out of scope	Decline and redirect (see "When to use") — no verdict

Cupynumeric Migration Readiness