Name: Rag Eval
Author: NVIDIA

On-disk RAG evaluation (`corpus/` + `train.json`)

Purpose

Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing corpus/ and train.json, running scripts/eval/evaluate_rag.py, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (scripts/rag-perf, docs/performance-benchmarking.md) — not this skill.

When not to use

Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the corpus/ + train.json layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).

Prerequisites

Repo cloned; run commands from repo root (imports and paths assume this).
Python 3.11+ and uv; eval deps: uv sync --project scripts/eval.
Reachable RAG server and ingestor (defaults often localhost:8081 / 8082).
NVIDIA_API_KEY for RAGAS (see credential hygiene); optional RAG_EVAL_JUDGE_MODEL.
Dataset roots passed to --dataset-paths each contain corpus/ and train.json.
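The layout requirement above can be checked before starting a run. The following is a minimal sketch, not part of the repository: `check_dataset_root` is a hypothetical helper, and the "corpus contains no files" check is an extra sanity test assumed here rather than a documented rule.

```python
import json
from pathlib import Path


def check_dataset_root(root: str) -> list[str]:
    """Return a list of layout problems for one dataset root (empty if none)."""
    problems = []
    base = Path(root)
    corpus = base / "corpus"
    train = base / "train.json"

    if not corpus.is_dir():
        problems.append(f"{corpus} is missing or not a directory")
    elif not any(p.is_file() for p in corpus.rglob("*")):
        # Assumption: an empty corpus is almost certainly a mistake.
        problems.append(f"{corpus} contains no files")

    if not train.is_file():
        problems.append(f"{train} is missing")
        return problems

    try:
        data = json.loads(train.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{train} is not valid JSON: {exc}")
        return problems

    # train.json must be a top-level array of objects.
    if not isinstance(data, list):
        problems.append(f"{train} must be a JSON array")
    elif not all(isinstance(row, dict) for row in data):
        problems.append(f"{train} must contain only JSON objects")
    return problems
```

Calling `check_dataset_root("/path/to/my_dataset")` returns an empty list when the layout looks right. The per-row field rules in references/dataset-and-conversion.md are not checked here.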

Instructions

Prepare data — Ensure each dataset directory matches the layout and train.json rules in references/dataset-and-conversion.md. When sources arrive as public links (sites or dataset pages), materialize documents under corpus/—prefer PDF for multimodal content so images stay embedded; convert CSV/JSONL/etc. using the patterns there.
Run eval — uv run --project scripts/eval python scripts/eval/evaluate_rag.py with --dataset-paths, --host, and --port. See references/benchmark-execution.md for command examples, outputs, and errors. Use references/evaluate-rag-cli.md for flag-level detail.
Tune quality — Adjust --top_k / --vdb_top_k, reranker and query-rewriting toggles, and generation overrides (--temperature, --top-p, --max-tokens) as documented in references/benchmark-execution.md when comparing retrieval/generation configs for RAGAS scores.
Analyze results — Use references/result-analysis.md for scripts; scan rag_*_evaluation_summary.json for headline RAGAS metrics.
Triage errors — Use the error signal table and the Troubleshooting section below.
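For the run and tune steps above, one way to keep a quality comparison reproducible is to generate the commands instead of typing them. This sketch only builds and prints command lines. `build_eval_command` is a hypothetical helper, the sweep values and collection names are made up, and it assumes `--top_k` and `--collection` each take a single value (confirm with `--help` or references/evaluate-rag-cli.md).

```python
import shlex


def build_eval_command(dataset_path, host="localhost", port=8081, overrides=None):
    """Return the argv list for one evaluate_rag.py run."""
    cmd = [
        "uv", "run", "--project", "scripts/eval",
        "python", "scripts/eval/evaluate_rag.py",
        "--dataset-paths", str(dataset_path),
        "--host", host,
        "--port", str(port),
    ]
    # overrides maps a flag name (e.g. "--top_k") to its value.
    for flag, value in (overrides or {}).items():
        cmd += [flag, str(value)]
    return cmd


# Hypothetical sweep: one command per top_k value, each in its own collection
# so that runs do not share ingested data.
for top_k in (3, 5, 10):
    cmd = build_eval_command(
        "/path/to/my_dataset",
        overrides={"--top_k": top_k, "--collection": f"sweep_topk_{top_k}"},
    )
    print(shlex.join(cmd))
```

The printed lines can be reviewed and pasted into a shell at the repo root, or passed to `subprocess.run(cmd, check=True)` once the flags are confirmed.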

Examples

Set the API key without putting secrets in shell history: load it from a gitignored env file or a secrets manager, never commit .env, and rotate the key if it is exposed. Details: references/benchmark-execution.md#credential-hygiene-nvidia_api_key.
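When the evaluator is launched from a Python wrapper, the env-file pattern might look like the sketch below. `load_env_file` is a hypothetical helper and the simple `KEY=VALUE` format is an assumption. It is not how the repository loads credentials.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Copy KEY=VALUE pairs from a file into os.environ; return the keys set."""
    loaded = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop surrounding quotes; the value itself is never printed.
        value = value.strip().strip("'\"")
        os.environ[key] = value
        loaded.append(key)
    return loaded
```

A wrapper would call `load_env_file(".env.eval")` (a hypothetical, gitignored file name), confirm that `"NVIDIA_API_KEY" in os.environ`, and then start the evaluator as a child process so it inherits the variable.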

Minimal eval (key already in environment):

uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
  --dataset-paths /path/to/my_dataset \
  --host localhost \
  --port 8081

Pretty-print summary JSON:

python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
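To pull only the numbers out of the summary, a small script can walk the JSON instead of pretty-printing all of it. The summary schema is not described here, so this sketch does not assume any key names. It lists every numeric value found in nested objects and ignores arrays.

```python
import json
from pathlib import Path


def numeric_fields(obj, prefix=""):
    """Yield (dotted_key, value) for every number found in nested dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from numeric_fields(value, f"{prefix}{key}.")
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        yield prefix.rstrip("."), obj


def print_summary(path):
    """Print one 'key: value' line per numeric field in a summary file."""
    summary = json.loads(Path(path).read_text(encoding="utf-8"))
    for key, value in numeric_fields(summary):
        print(f"{key}: {value}")
```

For example, `print_summary("results/my_dataset/rag_my_dataset_evaluation_summary.json")` prints one line per numeric field, which makes two runs easy to diff.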

More examples (skip ingestion, quality sweeps): references/benchmark-execution.md.

Limitations

Evaluator behavior is fixed to the filesystem contract and evaluate_rag.py; it does not substitute for custom offline judges or non-RAG benchmarks.
Vector DB / embedding choices follow deployed ingestor and RAG env — not overridden by this CLI alone.
Scores depend on retrieval quality, judge model availability, and NVIDIA_API_KEY; empty contexts yield partial RAGAS metrics (see references).
Detailed procedures live under references/ to keep this skill concise; read those files when the user needs step-by-step conversion, full flags, or error tables.

Troubleshooting

Error / signal	Likely cause	What to do
Immediate exit mentioning `NVIDIA_API_KEY`	Missing or invalid key	Set key via secure channel; see credential hygiene in `references/benchmark-execution.md`.
`train.json must be a JSON array`	Wrong JSON shape	Top-level array of objects; validate per `references/dataset-and-conversion.md`.
Fewer rows in `evaluation_data.json` than `train.json`	Per-query failures	Check stderr: network or stream JSON errors; see error table in benchmark-execution.
Empty `generated_contexts` everywhere	Retrieval gap	Verify collection, ingestion, `top_k` / `vdb_top_k`, and `ingestor_server_url` without `/v1` suffix.
Ingestor 404 on upload	Bad ingestor base URL	Pass `http://host:port` only — code appends `/v1/`.

Full signal table: references/benchmark-execution.md#common-error-cases-and-signals.
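Two of the signals above, dropped rows and empty `generated_contexts`, can be counted mechanically. This sketch assumes `evaluation_data.json` is a top-level array of objects, each with a `generated_contexts` field. That shape is an assumption, and `triage` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_rows(path):
    """Load a JSON file and insist on a top-level array."""
    rows = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(rows, list):
        raise ValueError(f"{path} must be a JSON array")
    return rows


def triage(train_path, eval_path):
    """Report dropped queries and rows with empty generated_contexts."""
    train_rows = load_rows(train_path)
    eval_rows = load_rows(eval_path)
    # A row counts as empty when the field is missing, null, or an empty list.
    empty = sum(
        1 for row in eval_rows
        if isinstance(row, dict) and not row.get("generated_contexts")
    )
    return {
        "train_rows": len(train_rows),
        "evaluated_rows": len(eval_rows),
        "dropped": len(train_rows) - len(eval_rows),
        "empty_contexts": empty,
    }
```

A positive `dropped` count points to per-query failures in stderr. An `empty_contexts` count equal to `evaluated_rows` points to a retrieval gap.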

Gotchas

Run from repo root: paths and imports in scripts/eval/evaluate_rag.py assume this; a wrong directory silently breaks imports.
--ingestor_server_url: pass http://host:port without /v1—the code appends /v1/ automatically. Including /v1 causes 404s on ingestor calls.
Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. APP_VECTORSTORE_URL, embedding model).
--model / --llm_endpoint: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
Stale collections: a previous run's ingested data persists unless you use --force_ingestion. Use --collection with a unique name when comparing quality across isolated runs.
Empty context metrics: if all generated_contexts are empty, RAGAS scores only nv_accuracy and leaves the other two metrics blank—this is not a silent success.
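Two of the gotchas above, the `/v1` suffix and stale collections, can be guarded against in a wrapper script. Both helpers are hypothetical. The timestamped naming scheme is an assumption, and whether the resulting name is valid depends on the deployed vector DB.

```python
import re
import time


def normalize_ingestor_url(url):
    """Strip a trailing /v1 (with or without slash) so the CLI can append its own."""
    return re.sub(r"/v1/?$", "", url.strip()).rstrip("/")


def unique_collection(prefix="eval"):
    """Build a timestamped collection name (local time, one-second resolution)."""
    return f"{prefix}_{time.strftime('%Y%m%d_%H%M%S')}"


print(normalize_ingestor_url("http://localhost:8082/v1/"))
```

The result of `normalize_ingestor_url` is suitable for `--ingestor_server_url`, and `unique_collection()` for `--collection`. Two runs started within the same second would still collide, so add a suffix when launching runs in parallel.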

Source of truth

Piece	Location
Driver	`scripts/eval/evaluate_rag.py` (`CORPUS_DIRECTORY` = `corpus`, `EVAL_DATA` = `train.json`)
Human README (always in-repo)	`scripts/eval/README.md`
Full CLI (flags, defaults)	`scripts/eval/evaluate_rag.py --help`; `references/evaluate-rag-cli.md`
Dataset / conversion	`references/dataset-and-conversion.md`
Runs, outputs, errors	`references/benchmark-execution.md`
Result analysis scripts	`references/result-analysis.md`
Latency / throughput	rag-perf skill, `docs/performance-benchmarking.md`

Agent playbook

Run eval — uv sync --project scripts/eval then uv run --project scripts/eval python scripts/eval/evaluate_rag.py with required --dataset-paths, --host, and --port (and env NVIDIA_API_KEY). Argument --ingestor_server_url is optional (defaults to http://localhost:8082); pass it only when overriding the ingestor endpoint.
Quality tuning — See references/benchmark-execution.md: --top_k/--vdb_top_k, reranker and query-rewriting toggles, --temperature, --top-p, --max-tokens.
Data conversion — Follow references/dataset-and-conversion.md.
Analyze results — references/result-analysis.md; quick scan: python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json.
Error triage — references/benchmark-execution.md#common-error-cases-and-signals.
