RAG-Perf — config-driven perf benchmark CLI
Purpose
Drive a deployed NVIDIA RAG Blueprint server with a YAML config, run a server-side profiling pass (per-stage timing, citation quality, bottleneck inference) and an optional aiperf load test (TTFT / E2E / token & request throughput / error rate), and write a unified report. The CLI is intentionally minimal: rag-perf -c <config> plus --help / --version. Behaviour is fully config-driven; field variations belong in YAML.
Scope
- Accuracy / RAGAS scoring of answer quality → use the rag-eval skill.
- Deploying, repairing, or configuring services (compose, helm, NIM env vars) → use the rag-blueprint skill.
- Production monitoring / alerting — rag-perf is a one-shot benchmark tool.
- Runtime requirement: a deployed RAG server reachable on the network.
Prerequisites
- Repo cloned; run commands from the repo root (config paths in the presets are repo-root-relative).
- Python 3.11+ and uv on PATH.
- Install rag-perf into its own uv-managed venv: uv sync --project scripts/rag-perf.
- For unit tests: install dev extras as well with uv sync --project scripts/rag-perf --extra dev (otherwise pytest-asyncio is missing and async tests error out at collection time).
- A reachable RAG server (default http://localhost:8081). For the aiperf phase, the bundled nvidia_rag endpoint plugin must be installed; pip install -e ./scripts/rag-perf registers it via the aiperf.plugins entry point.
- For synthetic queries: an OpenAI-compatible chat-completions endpoint reachable at synthetic.llm_url (default http://localhost:8999/v1/chat/completions).
- rag-perf itself runs without NVIDIA_API_KEY (unlike rag-eval). The synthetic LLM endpoint may require its own auth; that is the deployment's concern.
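The two local prerequisites above (Python 3.11+ and uv on PATH) can be checked before syncing. The following is a minimal sketch, not part of rag-perf; the function name check_prerequisites and its injectable parameters are hypothetical and exist only to keep the check testable.

```python
import shutil
import sys


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    version_info = sys.version_info if version_info is None else version_info
    major_minor = tuple(version_info[:2])
    problems = []
    # Python 3.11+ is required.
    if major_minor < (3, 11):
        problems.append("Python 3.11+ required, found %d.%d" % major_minor)
    # uv must be resolvable on PATH.
    if which("uv") is None:
        problems.append("uv not found on PATH")
    return problems
```

Calling check_prerequisites() with no arguments inspects the current interpreter and PATH. It does not probe the RAG server or the synthetic LLM endpoint.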
Instructions
- Pick a preset. The three under scripts/rag-perf/configs/ are:
  - quick_profile.yaml: profile-only, ~30 s. Skips load test. For fast iteration on retrieval / reranker tuning.
  - single_run.yaml: one concurrency level, profiling + aiperf, ~2 min. Regression checks.
  - sweep.yaml: multi-axis sweep. load.concurrency, rag.vdb_top_k, rag.reranker_top_k are all int | list[int]; any of them as a list becomes a sweep axis (Cartesian product).
- Edit the preset. Required: replace rag.collection_names: ["<collection_name>"] with a real collection on the deployed ingestor server. Verify the collection exists via GET /v1/collections on the ingestor. The placeholder <collection_name> validates fine but every request will fail at retrieval. Use a copied YAML preset for variants; the CLI surface is intentionally config-only.
- Run. From repo root:
  uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml
  Same form for the other presets. The CLI accepts only -c / --config (required), --help, --version.
- Read stdout. Every invocation prints, in order: a startup banner, a one-line summary, the fully resolved config as YAML (so the run is reproducible from terminal output), per-grid-point progress with the shlex-joined aiperf command in copy-pastable form, a rich per-point summary table (stage breakdown with bars, citation quality, bottleneck, load-test block), and finally a side-by-side comparison table auto-labelled by whichever axis varied. See references/output-and-analysis.md.
- Inspect artifacts. Layout depends on run shape: flat for single-point + iterations=1, nested under iter_<i>/<point>/... otherwise. See references/output-and-analysis.md for the full directory tree, file purposes, and how to parse results.json / results.csv / report.md.
- Summarise for the user. When reporting back, follow the playbook in references/output-and-analysis.md#summarising-results-to-the-user: pick the canonical result file for the run shape, build a headline table (concurrency × top-k axes × TTFT × throughput × bottleneck × citation quality), compute scaling efficiency on sweeps, always flag zero citations / non-zero error rate / suspect llm_ttft_ms / small-sample p99, and propose a concrete next-experiment YAML.
- Tune. Schema is fully documented in docs/performance-benchmarking.md and the deeper-dive references below. Common knobs: set aiperf.enabled: false for profile-only mode, increase load.iterations for variance estimation, set load.sleep_between_points_s: 60 for overnight Cartesian sweeps.
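To see how list-valued axes turn into sweep points, the Cartesian-product rule from the preset step can be sketched as below. This illustrates the rule only and is not rag-perf's code: as_list and expand_grid are hypothetical names, and the order in which rag-perf visits points is not specified here.

```python
from itertools import product


def as_list(value):
    """Normalise a scalar-or-list field (int | list[int]) to a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def expand_grid(concurrency, vdb_top_k, reranker_top_k):
    """Return one dict per sweep point (Cartesian product of the three axes)."""
    axes = (as_list(concurrency), as_list(vdb_top_k), as_list(reranker_top_k))
    return [
        {"concurrency": c, "vdb_top_k": v, "reranker_top_k": r}
        for c, v, r in product(*axes)
    ]


# Two list-valued axes and one scalar: 3 * 2 * 1 = 6 points.
points = expand_grid(concurrency=[1, 4, 16], vdb_top_k=[10, 20], reranker_top_k=4)
print(len(points))
```

A scalar contributes a single value to the product, so a config with no lists yields exactly one point.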
Examples
Profile-only (quickest signal on retrieval / reranker tuning):
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml
Output: rag-perf-results/quick_profile/run_<ts>/{profile_report.md, profile_results.json, profiling/}. The aiperf_rag_on/ directory is omitted. Filenames are profile_* because aiperf.enabled: false.
Single benchmark point with full report:
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml
Output: flat run_<ts>/{report.md, results.json, results.csv, profiling/, aiperf_rag_on/}.
Concurrency sweep:
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml
Output: nested run_<ts>/iter_1/<CR:_VDB-K:_RERANKER-K:_…>/{profiling,aiperf_rag_on}/ per point, plus aggregate report.md / results.json / results.csv at the run root.
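The flat-versus-nested layout shown in these examples can be summarised in a few lines. This is a sketch under stated assumptions: point_dirs is a hypothetical helper, point names are treated as opaque strings, and iteration numbering is assumed to start at 1 because the sweep output above shows iter_1.

```python
from pathlib import PurePosixPath


def point_dirs(run_root, point_names, iterations=1):
    """Return the directory that holds each (iteration, point) result."""
    root = PurePosixPath(run_root)
    if len(point_names) == 1 and iterations == 1:
        return [root]  # flat layout: artifacts sit directly under the run root
    return [
        root / f"iter_{i}" / name
        for i in range(1, iterations + 1)  # assumption: 1-based, as in iter_1
        for name in point_names
    ]


# Placeholder names; real point directories encode the swept parameters.
dirs = point_dirs("run_X", ["pointA", "pointB"], iterations=2)
print(len(dirs))  # 2 points * 2 iterations = 4
```

The length of the returned list matches the aggregate CSV row count (n_points × iterations) for nested runs.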
Run unit tests:
uv sync --project scripts/rag-perf --extra dev # one-time, installs pytest-asyncio
uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/
Limitations
- The CLI is config-only: author or copy YAML to vary a parameter.
- load.concurrency / rag.vdb_top_k / rag.reranker_top_k accept int | list[int]; the validator requires unique list values because each value names a unique point dir.
- input.file and input.synthetic follow an XOR rule: setting both fails validation. When neither is set, synthetic auto-fills with defaults so a bare config still validates.
- File-based input format is inferred from extension only (.jsonl or .csv); other extensions are rejected.
- Synthetic generation streams each query to disk as it completes (failure-resilient) but fails fast on the first LLM error; partial JSONL is preserved. Re-run after fixing the endpoint.
- Reasoning models (Nemotron Omni, Qwen-Reasoning) require synthetic.disable_thinking: true (the default). Without it the model exhausts the token budget on chain-of-thought and content returns empty; the generator now raises with a clear message instead of substituting reasoning_content for the answer.
- aiperf-specific knobs outside the YAML surface (request rate distribution, GPU telemetry config, etc.) require editing AiperfRunner._base_aiperf_cmd in scripts/rag-perf/rag_perf/runner.py.
- Procedural detail lives under references/ to keep this file concise.
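The validation rules listed above (unique sweep values, the input XOR rule, extension-based format detection) can be expressed compactly. The sketch below works on plain nested dicts and is not the real validator, which lives in scripts/rag-perf/rag_perf/config.py. The validate function is hypothetical, and the wording of the XOR message and of the rag.* duplicate messages is assumed, not copied from rag-perf.

```python
from pathlib import PurePosixPath

SWEEP_AXES = (("load", "concurrency"), ("rag", "vdb_top_k"), ("rag", "reranker_top_k"))


def validate(cfg):
    """Return a list of error strings for a config given as nested dicts."""
    errors = []
    # Sweep axes: list values must be unique (each value names a point dir).
    for section, key in SWEEP_AXES:
        value = cfg.get(section, {}).get(key)
        if isinstance(value, list) and len(set(value)) != len(value):
            errors.append(f"{section}.{key} has duplicate values")
    # input.file and input.synthetic are mutually exclusive.
    inp = cfg.get("input", {})
    file_path, synthetic = inp.get("file"), inp.get("synthetic")
    if file_path is not None and synthetic is not None:
        errors.append("input: set either file or synthetic, not both")  # assumed wording
    # File format is inferred from the extension only.
    if file_path is not None and PurePosixPath(file_path).suffix not in (".jsonl", ".csv"):
        errors.append("input.file must end in .jsonl or .csv")
    return errors
```

A config that sets neither input.file nor input.synthetic produces no error here, mirroring the rule that synthetic auto-fills with defaults.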
Troubleshooting
| Error / signal | Likely cause | What to do |
|---|---|---|
| Configuration errors in <yaml>: • input — ... XOR rule | Both input.file and input.synthetic set | Pick one. The XOR validator runs at YAML load time. |
| input.file must end in .jsonl or .csv | Extension other than .jsonl / .csv | Rename or convert. |
| load.concurrency has duplicate values | e.g. [2, 2, 4] | Each concurrency maps to a unique point dir; dedupe. |
| warmup_requests must be >= 1 | YAML had warmup_requests: 0 | aiperf rejects warmup=0; minimum is 1. |
| LLM returned empty content (reasoning_content was populated — model exhausted its budget on chain-of-thought; raise min_query_tokens or set synthetic.disable_thinking=true). | Reasoning model used CoT and ran out of tokens | Set synthetic.disable_thinking: true (the default) or raise min_query_tokens. |
| ✗ All N profiling requests failed across M point(s). + exit 1 | Bad URL, server down, wrong collection | Verify target.url, rag.collection_names (the <collection_name> placeholder will hit this). |
| Per-iteration ⚠ N profiling requests failed warning, run continues | Some requests timed out / errored mid-run | Check rag-server logs, raise target.timeout_s, drop concurrency. |
| RuntimeError: Random synthetic query generation failed at query N: ... | LLM endpoint rejected a request mid-generation | Partial JSONL is at synthetic.jsonl_output_path; fix endpoint and re-run with reduced num_queries, or point input.file at the partial file. |
| Citation count (mean): 0 and Citation relevance score: N/A for a non-empty deployment | Collection mismatch between rag.collection_names and what's actually ingested | Run curl -s http://<ingestor>:8082/v1/collections to list real collections. |
| Tests error with ModuleNotFoundError: No module named 'pytest_asyncio' | Dev extras missing | uv sync --project scripts/rag-perf --extra dev. |
| CI: ModuleNotFoundError: No module named 'ruamel' from tests/unit/test_rag_perf/ | rag-perf package missing from CI venv | Add uv pip install -e ./scripts/rag-perf after the top-level install in the unit-tests job. |
Gotchas
- Run from repo root. Preset configs reference scripts/rag-perf/examples/queries.jsonl and scripts/rag-perf/prompts/default_prompts.yaml with repo-root-relative paths. Running from inside scripts/rag-perf/ will fail those file lookups.
- CLI is config-only. Edit the YAML or copy a preset for URL, concurrency, collection, and similar fields.
- Always edit rag.collection_names before the first run. The presets ship with ["<collection_name>"] as a deliberate placeholder. Validation passes, but retrieval fails silently for every request; this manifests as Citation count (mean): 0 everywhere.
- load.concurrency_list, rag.vdb_top_k_list, rag.reranker_top_k_list are read-only properties that normalise scalar-or-list to a list. Use them when reasoning about the grid; the underlying YAML field is whatever the user wrote.
- aiperf.enabled: false changes filenames. The top-level outputs become profile_report.md / profile_results.json / profile_results.csv. The aggregate sweep table also suppresses load-test rows and the "Optimal throughput" footer.
- Resolved-config dump is verbose (50+ lines); this is expected. It's what makes terminal output a self-contained reproducer; don't filter it out in scripts.
- The aiperf shell command is logged before each subprocess. Look for \n $ python -m aiperf profile -m ... --endpoint-type nvidia_rag ... in stdout; it is copy-paste runnable for reproducing a single point outside rag-perf.
- --endpoint-type nvidia_rag comes from the bundled plugin at scripts/rag-perf/rag_perf/plugin/nvidia_rag.py. It teaches aiperf about the RAG /v1/generate request shape and parses citations + per-stage metrics out of the SSE stream. If aiperf can't resolve nvidia_rag, rag-perf needs editable installation in the venv; re-run uv sync --project scripts/rag-perf (or uv pip install -e ./scripts/rag-perf).
- Sweep-mode point-name collision. When two points differ only in concurrency (e.g. [1, 4] × single vdb_top_k), the dir name encodes everything: CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:.... Cluster / GPU / experiment_name (output.cluster, output.gpu, output.experiment_name) are appended too, which is useful for diff-friendly artifact paths across machines.
- load.iterations > 1 repeats the entire grid. Each repetition writes to its own iter_<i>/. Aggregate CSV row count = n_points × iterations.
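Because the shipped placeholder passes validation and only shows up later as zero citations, a small pre-flight check on the collection list can save a wasted run. This is a minimal sketch and not part of rag-perf: unresolved_collections is a hypothetical helper, and my_docs in the usage note is a made-up collection name.

```python
PLACEHOLDER = "<collection_name>"


def unresolved_collections(collection_names):
    """Return the entries that are still the shipped placeholder or blank."""
    return [name for name in collection_names if name.strip() in ("", PLACEHOLDER)]
```

For example, unresolved_collections(["<collection_name>"]) returns the placeholder entry, while unresolved_collections(["my_docs"]) returns an empty list. The check cannot tell whether a real-looking name actually exists on the ingestor; GET /v1/collections remains the way to confirm that.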
Source of truth
| Piece | Location |
|---|---|
| Driver | scripts/rag-perf/rag_perf/cli.py (main is the single Click command) |
| Schema | scripts/rag-perf/rag_perf/config.py (RunConfig and sub-models) |
| Orchestrator | scripts/rag-perf/rag_perf/runner.py (BenchmarkRunner.run, RagProfiler, AiperfRunner) |
| aiperf plugin | scripts/rag-perf/rag_perf/plugin/nvidia_rag.py |
| User-facing doc | docs/performance-benchmarking.md |
| Presets | scripts/rag-perf/configs/{quick_profile,single_run,sweep}.yaml |
| Sample queries | scripts/rag-perf/examples/queries.jsonl |
| Synthetic prompts | scripts/rag-perf/prompts/default_prompts.yaml |
| Config schema details | references/config-schema.md |
| Synthetic-query generation | references/synthetic-generation.md |
| Output layout & metric semantics | references/output-and-analysis.md |
Agent playbook
- Sync deps: uv sync --project scripts/rag-perf (one-time per checkout).
- Pick & customise a preset: copy scripts/rag-perf/configs/<preset>.yaml if you want a variant; always set rag.collection_names to a real collection.
- Run: uv run --project scripts/rag-perf rag-perf -c <config> from repo root.
- Read the per-point + aggregate tables on stdout. Bottleneck inference is in the per-point profiling section; comparison across points is the final aggregate table.
- Parse artifacts under output.dir/run_<ts>/; see references/output-and-analysis.md. For multi-point runs, results.csv has one row per (point × iteration).
- Summarise for the user using the playbook in references/output-and-analysis.md#summarising-results-to-the-user: headline table, scaling-efficiency math for sweeps, mandatory flags for zero citations / non-zero errors / suspect llm_ttft_ms / low sample size, and a concrete next-experiment YAML.
- Tune retrieval / reranker: flip to quick_profile.yaml or aiperf.enabled: false for fast iteration, then return to single_run.yaml / sweep.yaml when characterising under load.
- Triage failures: see Troubleshooting above and references/output-and-analysis.md for empty-citation / bottleneck=N/A patterns.
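The playbook's scaling-efficiency step is defined in references/output-and-analysis.md. As an illustration only, one common definition divides the throughput gain by the concurrency gain, both measured against the lowest concurrency in the sweep. The formula, the scaling_efficiency name, and the numbers below are assumptions for the sketch, not values or code taken from rag-perf.

```python
def scaling_efficiency(throughput_by_concurrency):
    """Map concurrency -> efficiency relative to the lowest concurrency measured.

    1.0 means throughput grew in proportion to concurrency; lower values
    indicate saturation. Assumes the baseline throughput is non-zero.
    """
    base_c = min(throughput_by_concurrency)
    base_t = throughput_by_concurrency[base_c]
    return {
        c: (t / base_t) / (c / base_c)
        for c, t in sorted(throughput_by_concurrency.items())
    }


# Made-up throughput numbers (requests/s per concurrency level), illustration only.
eff = scaling_efficiency({1: 2.0, 4: 6.0, 16: 12.0})
print(eff)
```

If the reference playbook defines efficiency differently, follow the playbook; this sketch only shows the shape of the calculation.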