cuPyNumeric HDF5 I/O
Purpose
Use legate.io.hdf5 to read and write cuPyNumeric arrays as HDF5 files. Reach for it whenever a cuPyNumeric array must land in — or load from — an .h5/.hdf5 file: every rank reads and writes its own tile in parallel, so never funnel a large array through a single process.
Answer inline. Treat the snippets and rules below as complete and verified — answer save / load / stream / fence / bridge questions directly, without opening the assets/ scripts or reading the installed legate source. Reach for the assets only to run a verification.
Activate
Activate when the user asks about: saving a cuPyNumeric array to an .h5 / .hdf5 file, loading an HDF5 dataset into a cuPyNumeric array, reading a large HDF5 dataset in chunks, producing a single file for an HPC post-processing pipeline, or speeding up HDF5 disk I/O with GPUDirect Storage.
When NOT to use
Redirect these requests elsewhere instead of reaching for legate.io.hdf5:
- Route Parquet / Arrow / cuDF, raw-binary, or sharded / custom on-disk layouts to the cupynumeric-parallel-data-load skill, which owns cuPyNumeric's no-built-in-loader paths; legate.io.hdf5 covers single-file HDF5 only.
- Answer pure array compute with cuPyNumeric ops (FFT, matmul, reductions, slicing, linear algebra); this skill covers disk I/O only.
- Send chunked or object-store (S3) output to a chunked format such as Zarr, not single-file HDF5.
- Load .npz or pickled archives with NumPy (np.load), then bridge with cn.asarray(...): legate.io.hdf5 reads HDF5 only, and cupynumeric.load reads single .npy files only.
- Use h5py directly for plain HDF5 reads with no cuPyNumeric/Legate: with h5py.File(path, "r") as f: arr = f["dataset"][:].
Prerequisites
Install h5py before importing anything from legate.io.hdf5:
conda install -c conda-forge h5py # required; legate/io/hdf5.py imports it at load
Expect from legate.io.hdf5 import ... to raise ModuleNotFoundError until you do — the module imports h5py at load time. (h5py · conda-forge build)
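A pre-flight check can surface the missing dependency before the import fails. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and is not part of Legate or h5py.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical pre-flight step, run before `from legate.io.hdf5 import ...`.
    missing = missing_modules(["h5py"])
    if missing:
        print("install first:", ", ".join(missing))
    else:
        print("h5py is importable")
```

Pass only top-level module names: for a dotted name, find_spec imports the parent package first and can raise instead of returning None.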
API
| Function | Signature | Purpose |
|---|---|---|
| to_file | to_file(array, path, dataset_name) | Write a cuPyNumeric array / LogicalArray to one HDF5 file as a virtual dataset (VDS); each rank writes its own tile. |
| from_file | from_file(path, dataset_name) -> LogicalArray | Read one HDF5 dataset into a distributed array. |
| from_file_batched | from_file_batched(path, dataset_name, chunk_size) -> Iterator[(LogicalArray, offsets)] | Read a dataset in chunks; this chunks the file read, not the assembled array. |
Import all three from legate.io.hdf5. Always pass dataset_name as the full path to a single array inside the file (e.g. "/data" or "/group/x"), never a group.
Examples
Round trip
import cupynumeric as cn
from legate.core import get_legate_runtime
from legate.io.hdf5 import from_file, to_file
a = cn.arange(64, dtype=cn.float32).reshape(8, 8)
# Write: pass the cuPyNumeric ndarray straight in - no manual conversion.
to_file(array=a, path="out.h5", dataset_name="/data")
get_legate_runtime().issue_execution_fence(block=True) # needed before any external reader
# Read: from_file returns a legate LogicalArray; cn.asarray bridges it back.
b = cn.asarray(from_file("out.h5", dataset_name="/data"))
assert cn.array_equal(a, b)
Run assets/hdf5_roundtrip.py to verify (optional — not needed to answer).
Read a large file in chunks
Use from_file_batched to read the source file in chunks instead of pulling it into host memory all at once. It yields one LogicalArray per chunk plus that chunk's offsets in the global shape. Expect clipped boundary chunks (an axis of length 5 with chunk_size=2 yields 2, 2, 1), so place each chunk by its actual shape, not the requested chunk_size. Note that this chunks the file read, not the result — the assembled array (out) still has to fit in distributed memory:
import h5py
import cupynumeric as cn
from legate.core import get_legate_runtime
from legate.io.hdf5 import from_file_batched
with h5py.File("big.h5", "r") as f:  # read shape/dtype without loading data
    shape, dtype = f["data"].shape, f["data"].dtype
out = cn.empty(shape, dtype=dtype)
for chunk, (r0, c0) in from_file_batched("big.h5", "data", chunk_size=(4096, 4096)):
    # Place each chunk by its actual (possibly clipped) shape.
    out[r0:r0 + chunk.shape[0], c0:c0 + chunk.shape[1]] = cn.asarray(chunk)
get_legate_runtime().issue_execution_fence(block=True)
Keep every chunk_size entry positive and its length equal to the dataset's rank, or from_file_batched raises ValueError. Run assets/hdf5_batched_read.py to verify (optional).
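The clipping rule and the chunk_size validation can be modelled in pure Python. This is a sketch of the arithmetic only, not Legate's implementation: the helper clipped_chunks is hypothetical, and its row-major iteration order is an assumption that may not match the order in which from_file_batched yields chunks.

```python
from itertools import product


def clipped_chunks(shape, chunk_size):
    """Yield (offsets, chunk_shape) pairs that tile `shape`, clipping at the edges."""
    if len(chunk_size) != len(shape):
        raise ValueError("chunk_size needs one entry per dataset dimension")
    if any(c <= 0 for c in chunk_size):
        raise ValueError("every chunk_size entry must be positive")
    # Start offsets along each axis: 0, c, 2c, ... below the axis length.
    starts = [range(0, n, c) for n, c in zip(shape, chunk_size)]
    for offsets in product(*starts):
        # A boundary chunk is shortened so it never runs past the end of its axis.
        extent = tuple(min(c, n - o) for o, c, n in zip(offsets, chunk_size, shape))
        yield offsets, extent


if __name__ == "__main__":
    # An axis of length 5 with chunk_size=2 gives extents 2, 2, 1.
    for offsets, extent in clipped_chunks((5,), (2,)):
        print(offsets, extent)
```

Each (offsets, extent) pair maps to the slice offsets[i]:offsets[i] + extent[i] on axis i, which is why the loop above indexes by the chunk's actual shape.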
Instructions
- Pass the cuPyNumeric ndarray directly to to_file. It implements __legate_data_interface__, which to_file accepts as LogicalArrayLike. Skip any np.array(...) round-trip.
- Bridge results back with cn.asarray(...). from_file and each from_file_batched chunk return a Legate LogicalArray; wrap it with cn.asarray(la) to get a cuPyNumeric ndarray (zero-copy, no host bounce).
- Fence before any external reader. Legate I/O is asynchronous: to_file only queues the write. Insert get_legate_runtime().issue_execution_fence(block=True) before h5py, a subprocess, or another tool opens the file. Skip the fence for a from_file issued later in the same Legate program, because the runtime preserves that ordering.
- Run from outside the cuPyNumeric source tree (e.g. cd /tmp). Python puts the cwd first on sys.path, so an in-tree cupynumeric/ directory shadows the installed package (ModuleNotFoundError: cupynumeric.install_info).
- Give every rank the same path. The program runs on every rank (SPMD), so pass to_file / from_file an identical path on each; a per-rank tempfile.mkstemp() name breaks the collective I/O. When the program creates the file itself, write it with the collective to_file, not a per-rank h5py write.
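The source-tree shadowing rule can be checked before launching. Below is a minimal standard-library sketch; the helper shadows_installed_package is hypothetical, and it only looks for a cupynumeric/ directory in the given location without inspecting sys.path itself.

```python
import os


def shadows_installed_package(directory, package="cupynumeric"):
    """True if `directory` holds a `package/` folder that could shadow the install."""
    return os.path.isdir(os.path.join(directory, package))


if __name__ == "__main__":
    cwd = os.getcwd()
    if shadows_installed_package(cwd):
        # Launch from a neutral directory (e.g. /tmp) instead.
        print("warning:", cwd, "contains cupynumeric/; run from outside the source tree")
    else:
        print("ok:", cwd)
```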
to_file behavior to plan around
- Expect an HDF5 virtual dataset (VDS): each rank writes its own tile and the file presents them as one logical dataset.
- Treat to_file as destructive: it overwrites path if it already exists, so guard any file you must not clobber.
- Let to_file create missing parent directories; do not pre-create them.
- Give path a file name (/path/to/file.h5), never a directory; a directory raises ValueError.
- Pass a bound array (one with a known shape); to_file raises ValueError on an unbound array, that is, a Legate array created without a shape (e.g. create_array(dtype, ndim=n)) whose extent a producing task fills in later. cuPyNumeric ndarrays are always bound, even lazy/deferred ones, so this only affects raw LogicalArrays.
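The path rules above can be enforced with a small guard that runs before to_file. This is a sketch under stated assumptions: check_output_path and its overwrite flag are hypothetical names, not part of legate.io.hdf5, and the guard deliberately leaves parent directories alone because to_file creates them.

```python
from pathlib import Path


def check_output_path(path, overwrite=False):
    """Validate a to_file destination; return it as a Path or raise."""
    target = Path(path)
    if target.is_dir():
        # A directory would make to_file raise ValueError; fail early with a hint.
        raise ValueError(f"{target} is a directory; pass a file name such as {target / 'data.h5'}")
    if target.exists() and not overwrite:
        # to_file would replace this file, so require an explicit opt-in.
        raise FileExistsError(f"{target} already exists; pass overwrite=True to replace it")
    return target
```

A call such as check_output_path("results/data.h5") returns the path unchanged when it is safe to write. Because every rank runs the same program, each rank performs the same check on the same path.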
GPUDirect Storage (GDS)
Always set LEGATE_IO_USE_VFD_GDS=1 for runs that read HDF5 into GPU memory — whether or not the cluster has GPUDirect-capable storage:
export LEGATE_IO_USE_VFD_GDS=1 # set before launching
# or, with the legate driver:
legate --io-use-vfd-gds my_script.py
- Read into the GPU through the GDS VFD, not the default path. The default (POSIX) VFD stages each GPU read through zero-copy memory (ZCMEM), of which Legate reserves only 128 MB — so a GPU read of an array larger than ~128 MB aborts. The GDS VFD removes that staging buffer.
- Leave it unset when reading into host (CPU) memory — the VFD GDS plugin is unnecessary there and only adds overhead.
- Keep LEGATE_IO_USE_VFD_GDS=1 even without GPUDirect-capable storage: cuFile falls back to compatibility mode automatically (set export CUFILE_ALLOW_COMPAT_MODE=true if it is not already on), and the setting still avoids the ZCMEM abort.
- Attribute it correctly: the GDS VFD is the nv-legate/vfd-gds plugin over NVIDIA cuFile, not KvikIO (KvikIO backs Legate's Zarr/tile I/O, not HDF5). Confirm it engaged by grepping the run log for H5FD__gds_open: Successfully opened file w/GDS VFD.
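The environment rules and the log check above can be wrapped in two small helpers. This is a sketch that assumes the launcher builds the child environment in Python; gds_launch_env and gds_engaged are hypothetical names, and only the variable names and the log line come from this section.

```python
import os

GDS_SUCCESS_MARKER = "H5FD__gds_open: Successfully opened file w/GDS VFD"


def gds_launch_env(base_env, reads_into_gpu):
    """Return a copy of `base_env` set up for GPU or host HDF5 reads."""
    env = dict(base_env)
    if reads_into_gpu:
        env["LEGATE_IO_USE_VFD_GDS"] = "1"
        # Keep any value the caller already chose; otherwise allow compat mode.
        env.setdefault("CUFILE_ALLOW_COMPAT_MODE", "true")
    else:
        # Host (CPU) reads: leave the GDS VFD switched off.
        env.pop("LEGATE_IO_USE_VFD_GDS", None)
    return env


def gds_engaged(log_text):
    """True if the run log contains the GDS VFD success line."""
    return GDS_SUCCESS_MARKER in log_text


if __name__ == "__main__":
    env = gds_launch_env(os.environ, reads_into_gpu=True)
    print("LEGATE_IO_USE_VFD_GDS =", env["LEGATE_IO_USE_VFD_GDS"])
```

The returned dictionary can be passed as env= to subprocess.run when launching the script, and gds_engaged can then be applied to the captured log text.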
Troubleshooting
| Symptom | Cause and fix |
|---|---|
| ModuleNotFoundError: No module named 'h5py' on import | h5py is missing: run conda install -c conda-forge h5py. |
| File looks empty/truncated to h5py right after to_file | The async write hasn't landed: add get_legate_runtime().issue_execution_fence(block=True) before the external read. |
| ValueError from to_file | path is a directory (pass a file path such as results/data.h5), or the array is unbound (pass an array with a known shape). |
| ModuleNotFoundError: No module named 'cupynumeric.install_info' | Running inside the source tree: cd /tmp (any directory outside the repo). |
| Abort/crash reading a GPU array ≳128 MB | Default 128 MB ZCMEM staging buffer: set LEGATE_IO_USE_VFD_GDS=1 for GPU reads. |
| from_file returned LogicalArray(...) | Expected: wrap it with cn.asarray(...). |
Limitations & version notes
- Import from legate.io.hdf5 (Legate 26.01+); rewrite any legate.core.io.hdf5 import left over from the 25.03 line (e.g. the 25.03 launch blog still shows the old path).
- Install h5py explicitly; it ships in no default cuPyNumeric env.
- Point dataset_name at a single array, never a group; traverse groups with h5py first to discover dataset paths.
- On GPU, always read with LEGATE_IO_USE_VFD_GDS=1 (see GPUDirect Storage); the default path aborts on GPU arrays larger than the 128 MB ZCMEM buffer. Leave it unset for CPU reads.
Verify
cd /tmp # outside the cupynumeric source tree
conda install -c conda-forge h5py # one-time, if not already present
LEGATE_CONFIG="--cpus 4" LEGATE_AUTO_CONFIG=0 python <skill>/assets/hdf5_roundtrip.py
LEGATE_CONFIG="--cpus 4" LEGATE_AUTO_CONFIG=0 python <skill>/assets/hdf5_batched_read.py
Expect HDF5 ROUND TRIP OK and HDF5 BATCHED READ OK. Add --gpus 1 (and LEGATE_IO_USE_VFD_GDS=1) to exercise the GPU / GDS path.