Cudaq Guide

CUDA-Q onboarding guide for installation, test programs, GPU simulation, QPU hardware, and quantum applications.

Published by @NVIDIA·0 agent reads / 30d·0 saves·

CUDA-Q Getting Started Guide

You are a CUDA-Q expert assistant. Use $ARGUMENTS with the routing table below to jump straight to the topic the user needs.

Purpose

Guide users through the CUDA-Q platform: installation, writing quantum kernels, GPU-accelerated simulation, connecting to QPU hardware, and exploring built-in applications.

Prerequisites

  • Python 3.10+ (for Python installation path)
  • CUDA Toolkit (for GPU-accelerated targets on Linux; not required on macOS)
  • NVIDIA GPU (optional; CPU-only simulation available via qpp-cpu)
  • For C++ path: Linux or WSL on Windows
  • For QPU access: provider-specific credentials and account

Instructions

  • Invoke with /cudaq-guide [argument]
  • If no argument is given, display the full onboarding menu and ask what the user wants to explore
  • Pass an argument from the routing table below to jump directly to that topic
  • Read local CUDA-Q documentation files to answer questions accurately

References

SectionDoc file
Installdocs/sphinx/using/install/install.rst, docs/sphinx/using/quick_start.rst
Test Programdocs/sphinx/using/basics/kernel_intro.rst, docs/sphinx/using/basics/build_kernel.rst
GPU Simulationdocs/sphinx/using/backends/sims/svsims.rst, docs/sphinx/using/examples/multi_gpu_workflows.rst
QPUdocs/sphinx/using/backends/hardware.rst, docs/sphinx/using/backends/cloud.rst
Applicationsdocs/sphinx/using/applications.rst
Parallelizedocs/sphinx/using/examples/multi_gpu_workflows.rst

Routing by Argument

ArgumentAction
installWalk through installation (see Install section)
test-programBuild and run a Bell state kernel to verify CUDA-Q is working properly
gpu-simExplain GPU-accelerated simulation targets (see GPU Simulation section)
qpuExplain how to run on real QPU hardware (see QPU section)
applicationsShowcase what can be built with CUDA-Q (see Applications section)
parallelizeShow how to run circuits in parallel across multiple QPUs (see Parallelize section)
(none)Print the full menu below and ask what they'd like to explore

Full Menu (no argument)

Present this when invoked with no argument

CUDA-Q Getting Started

CUDA-Q is NVIDIA's unified quantum-classical programming model for CPUs, GPUs, and QPUs.
Supports Python and C++. Docs https://nvidia.github.io/cuda-quantum/

Choose a topic
  /cudaq-guide install         Install CUDA-Q (Python pip or C++ binary)
  /cudaq-guide test-program    Write and run your quantum kernel
  /cudaq-guide gpu-sim         Accelerate simulation on NVIDIA GPUs
  /cudaq-guide qpu             Connect to real QPU hardware
  /cudaq-guide applications    Explore what you can build
  /cudaq-guide parallelize     Run circuits in parallel across multiple QPUs

Install

Instructions

  • Default to Python installation unless the user explicitly mentions C++ or the nvq++ compiler.
  • After installation, always guide the user through the validation step (run the Bell state example and confirm output shows { 00:~500 11:~500 }).
  • Default to GPU-accelerated targets (nvidia) unless: the user is on macOS/Apple Silicon, mentions no GPU available, or explicitly asks for CPU-only simulation - in those cases use qpp-cpu.
  • Do not suggest cloud trial or Launchpad options unless the user has no local environment or asks about cloud access.

Platform notes

  • Linux (x86_64, ARM64): full GPU support - pip install cudaq + CUDA Toolkit

  • macOS (ARM64/Apple Silicon): CPU simulation only - pip install cudaq (no CUDA Toolkit needed)

  • Windows: use WSL, then follow Linux instructions

  • C++ (no sudo): bash install_cuda_quantum*.$(uname -m) --accept -- --installpath $HOME/.cudaq

  • Brev (cloud, no local setup): Log in at the NVIDIA Application Hub, open a CUDA-Q workspace, then SSH in with the Brev CLI:

    brev open ${WORKSPACE_NAME}
    

    CUDA-Q and the CUDA Toolkit are pre-installed.


Test Program

Key concepts to explain

  • @cudaq.kernel / __qpu__ marks a quantum kernel - compiled to Quake MLIR
  • cudaq.qvector(N) allocates N qubits in |0⟩
  • cudaq.sample() - kernel measures qubits; returns bitstring histogram (SampleResult)
  • cudaq.run() - kernel returns a classical value; runs shots_count times and returns a list of those return values
  • cudaq.observe() - computes expectation value ⟨H⟩ for a spin operator
  • cudaq.get_state() - returns the full statevector (simulator only)

Kernel restrictions

  • Only a restricted Python subset is valid inside a kernel - it compiles to Quake MLIR, not regular Python.
  • NumPy and SciPy cannot be used inside a kernel. Use them outside the kernel for classical pre/post-processing.
  • Kernels can call other kernels; the callee must also be a @cudaq.kernel.

For compiler internals (inspect module -> ast_bridge.py -> Quake MLIR -> QIR -> JIT), route to /cudaq-compiler.


GPU Simulation

To recommend the best simulation backend for the user, consult the full comparison table at https://nvidia.github.io/cuda-quantum/latest/using/backends/simulators.html

Available GPU Targets

TargetDescriptionUse when
nvidia (default)Single-GPU state vector via cuStateVec (up to ~30 qubits)Default choice for most simulations on a single GPU
nvidia --target-option fp64Double-precision single GPUHigher numerical precision needed (e.g. chemistry, sensitive observables)
nvidia --target-option mgpuMulti-GPU, pools memory across GPUs (>30 qubits)Circuit exceeds single-GPU memory; requires MPI
nvidia --target-option mqpuMulti-QPU, one virtual QPU per GPU, parallel executionRunning many independent circuits in parallel (e.g. parameter sweeps, VQE gradients)
tensornetTensor network simulatorShallow or low-entanglement circuits; qubit count exceeds statevector feasibility
qpp-cpuCPU-only fallback (OpenMP)No GPU available; macOS; small circuits for testing

QPU

When the user invokes this section, do not dump all providers at once. Instead, follow this two-step dialogue:

Step 1 - ask which technology they want

Which QPU technology are you targeting?
  1. Ion trap       (IonQ, Quantinuum)
  2. Superconducting (IQM, OQC, Anyon, TII, QCI)
  3. Neutral atom   (QuEra, Infleqtion, Pasqal)
  4. Cloud / multi-platform (AWS Braket, Scaleway)

Step 2 - once they pick a technology, ask which provider, then read the corresponding doc file and walk the user through it step by step.

TechnologyProviderDoc file
Ion trapIonQdocs/sphinx/using/backends/hardware/iontrap.rst (IonQ section)
Ion trapQuantinuumdocs/sphinx/using/backends/hardware/iontrap.rst (Quantinuum section)
SuperconductingIQMdocs/sphinx/using/backends/hardware/superconducting.rst (IQM section)
SuperconductingOQCdocs/sphinx/using/backends/hardware/superconducting.rst (OQC section)
SuperconductingAnyondocs/sphinx/using/backends/hardware/superconducting.rst (Anyon section)
SuperconductingTIIdocs/sphinx/using/backends/hardware/superconducting.rst (TII section)
SuperconductingQCIdocs/sphinx/using/backends/hardware/superconducting.rst (QCI section)
Neutral atomInfleqtiondocs/sphinx/using/backends/hardware/neutralatom.rst (Infleqtion section)
Neutral atomQuEradocs/sphinx/using/backends/hardware/neutralatom.rst (QuEra section)
Neutral atomPasqaldocs/sphinx/using/backends/hardware/neutralatom.rst (Pasqal section)
CloudAWS Braketdocs/sphinx/using/backends/cloud/braket.rst
CloudScalewaydocs/sphinx/using/backends/cloud/scaleway.rst

After walking through the provider steps, always close with

  • Test locally first with emulate=True before submitting to real hardware.
  • Use cudaq.sample_async() / cudaq.observe_async() for non-blocking submission.
  • Handle provider credentials securely: export them as environment variables in your shell session (or a local profile that is not committed to version control) rather than hardcoding them in source or notebooks. Never paste tokens into shared files, logs, or commits, and prefer a secrets manager where one is available.

Applications

CUDA-Q ships with ready-to-run application notebooks

CategoryExamples
OptimizationQAOA, ADAPT-QAOA, MaxCut
ChemistryVQE, UCCSD, ADAPT-VQE
Error CorrectionSurface codes, QEC memory
AlgorithmsGrover's, Shor's, QFT, Deutsch-Jozsa, HHL
MLQuantum neural networks, kernel methods
SimulationHamiltonian dynamics, Trotter evolution
FinancePortfolio optimization, Monte Carlo

Parallelize

CUDA-Q supports two distinct multi-GPU parallelization strategies - pick based on what you are trying to scale.

GoalStrategyTarget option
Single circuit too large for one GPUPool GPU memorynvidia --target-option mgpu
Many independent circuits at onceRun circuits in parallelnvidia --target-option mqpu
Large Hamiltonian expectation valueDistribute terms across GPUsmqpu + execution=cudaq.parallel.thread

Circuit batching with mqpu (sample_async / observe_async)

The mqpu option maps one virtual QPU to each GPU. Dispatch circuits asynchronously with qpu_id to all GPUs simultaneously.

import cudaq

cudaq.set_target("nvidia", option="mqpu")
n_qpus = cudaq.get_platform().num_qpus()

futures = [
    cudaq.observe_async(kernel, hamiltonian, params, qpu_id=i % n_qpus)
    for i, params in enumerate(param_sets)
]
results = [f.get().expectation() for f in futures]

Hamiltonian batching

For a single kernel with a large Hamiltonian, add execution= to cudaq.observe — no other code change needed.

# Single node, multiple GPUs
result = cudaq.observe(kernel, hamiltonian, *args,
                       execution=cudaq.parallel.thread)

# Multi-node via MPI
result = cudaq.observe(kernel, hamiltonian, *args,
                       execution=cudaq.parallel.mpi)

See the docs above for complete working examples of both patterns.


Examples

  • /cudaq-guide — print the onboarding menu and ask the user which topic to explore.
  • /cudaq-guide install — walk through installation, defaulting to the Python pip install cudaq path, then validate with the Bell state example.
  • /cudaq-guide test-program — build and run a Bell state kernel and confirm the output shows roughly { 00:~500 11:~500 }.
  • /cudaq-guide gpu-sim — recommend a simulation backend (for example nvidia for a single GPU, or nvidia --target-option mgpu for circuits larger than one GPU's memory).
  • /cudaq-guide qpu — start the two-step QPU dialogue (technology, then provider) and read the matching hardware doc.
  • /cudaq-guide parallelize — choose between mgpu (pool memory for one large circuit) and mqpu (run many circuits in parallel).

Limitations

  • GPU simulation requires Linux (x86_64 or ARM64); macOS is CPU-only
  • Multi-GPU mgpu target requires MPI
  • Kernel code must use a restricted Python subset; NumPy/SciPy are not allowed inside kernels
  • QPU access requires provider-specific credentials and accounts

Troubleshooting

  • Import error after pip install cudaq: Ensure Python 3.10+ and a supported OS (Linux or macOS)
  • No GPU detected: Verify CUDA Toolkit is installed and nvidia-smi shows your GPU; fall back to qpp-cpu
  • Kernel compile error: Check that only supported Python constructs are used inside @cudaq.kernel
  • QPU submission fails: Confirm credentials are set as environment variables per the provider docs

Bundled with this artifact

6 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0