On-disk RAG evaluation (corpus/ + train.json)
Purpose
Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing corpus/ and train.json, running scripts/eval/evaluate_rag.py, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).
For latency, throughput, and load testing, use the rag-perf skill (scripts/rag-perf, docs/performance-benchmarking.md) — not this skill.
When not to use
Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the corpus/ + train.json layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).
Prerequisites
- Repo cloned; run commands from repo root (imports and paths assume this).
- Python 3.11+ and uv; install eval deps with uv sync --project scripts/eval.
- Reachable RAG server and ingestor (defaults are often localhost:8081 for the RAG server and localhost:8082 for the ingestor).
- NVIDIA_API_KEY for RAGAS (see credential hygiene); optional RAG_EVAL_JUDGE_MODEL.
- Dataset roots passed to --dataset-paths each contain corpus/ and train.json.
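The dataset prerequisite can be checked before a run. The following is a minimal sketch, not part of the repo: the helper name check_dataset is hypothetical, and it only verifies the two facts stated in this document (a corpus/ directory exists and train.json is a top-level JSON array). The full rules live in references/dataset-and-conversion.md.

```python
import json
import sys
from pathlib import Path


def check_dataset(root: str) -> list[str]:
    """Return a list of layout problems for one dataset root (empty if none found)."""
    problems = []
    base = Path(root)
    if not (base / "corpus").is_dir():
        problems.append("missing corpus/ directory")
    train = base / "train.json"
    if not train.is_file():
        problems.append("missing train.json")
        return problems
    try:
        data = json.loads(train.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"train.json is not valid JSON: {exc}")
        return problems
    if not isinstance(data, list):
        # Same wording as the evaluator signal listed under Troubleshooting.
        problems.append("train.json must be a JSON array")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_dataset.py /path/to/my_dataset [/path/to/other_dataset ...]
    for dataset_root in sys.argv[1:]:
        issues = check_dataset(dataset_root)
        print(dataset_root, "OK" if not issues else "; ".join(issues))
```

An empty result only means the two layout checks passed; it says nothing about the per-row fields the evaluator expects.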
Instructions
- Prepare data: Ensure each dataset directory matches the layout and train.json rules in references/dataset-and-conversion.md. When sources arrive as public links (sites or dataset pages), materialize documents under corpus/; prefer PDF for multimodal content so images stay embedded, and convert CSV/JSONL/etc. using the patterns there.
- Run eval: Invoke uv run --project scripts/eval python scripts/eval/evaluate_rag.py with --dataset-paths, --host, and --port. See references/benchmark-execution.md for command examples, outputs, and errors. Use references/evaluate-rag-cli.md for flag-level detail.
- Tune quality: Adjust --top_k / --vdb_top_k, reranker and query-rewriting toggles, and generation overrides (--temperature, --top-p, --max-tokens) as documented in references/benchmark-execution.md when comparing retrieval/generation configs for RAGAS scores.
- Analyze results: Use references/result-analysis.md for scripts; scan rag_*_evaluation_summary.json for headline RAGAS metrics.
- Triage errors: Use the error signal table and the Troubleshooting section below.
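For the quality-tuning step, a retrieval sweep can be laid out as a list of command lines before anything is executed. This is a rough sketch under stated assumptions: the function name sweep_commands and the collection name sweep_demo are hypothetical, and it assumes --top_k, --vdb_top_k, and --collection each take a single value after the flag. Confirm the syntax with scripts/eval/evaluate_rag.py --help before running anything it prints.

```python
import itertools
import shlex

# Base invocation as given in this document; run from the repo root.
BASE_CMD = [
    "uv", "run", "--project", "scripts/eval",
    "python", "scripts/eval/evaluate_rag.py",
]


def sweep_commands(dataset_path, host, port, collection, top_ks, vdb_top_ks):
    """Build one argv list per (top_k, vdb_top_k) combination."""
    commands = []
    for top_k, vdb_top_k in itertools.product(top_ks, vdb_top_ks):
        commands.append(BASE_CMD + [
            "--dataset-paths", dataset_path,
            "--host", host,
            "--port", str(port),
            # Assumption: each of these flags is followed by one value.
            "--collection", collection,
            "--top_k", str(top_k),
            "--vdb_top_k", str(vdb_top_k),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in sweep_commands(
        "/path/to/my_dataset", "localhost", 8081, "sweep_demo", [5, 10], [20, 100]
    ):
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sweep reviewable; check references/benchmark-execution.md for how outputs are named before running several configurations back to back.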
Examples
Set API key without putting secrets in shell history (preferred patterns): load from a gitignored env file or secrets manager; avoid committing .env; rotate keys if exposed. Details: references/benchmark-execution.md#credential-hygiene-nvidia_api_key.
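One way to follow the env-file pattern from Python is sketched below. It is an illustration only: load_env_file is a hypothetical helper, the file is assumed to hold plain KEY=VALUE lines (optionally prefixed with export), and the path passed on the command line is whatever gitignored file you keep the key in.

```python
import os
import sys
from pathlib import Path


def load_env_file(path, environ=None):
    """Copy KEY=VALUE pairs from a file into environ; return the key names only."""
    environ = os.environ if environ is None else environ
    loaded = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        environ[key] = value.strip().strip("\"'")
        loaded.append(key)
    return loaded


if __name__ == "__main__" and len(sys.argv) > 1:
    # Report which keys were loaded, never their values.
    print("loaded:", ", ".join(load_env_file(sys.argv[1])))
```

The variables only exist in the Python process that loaded them, so an eval launched from that process (for example via subprocess.run) inherits the key, while the parent shell does not.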
Minimal eval (key already in environment):
uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
--dataset-paths /path/to/my_dataset \
--host localhost \
--port 8081
Pretty-print summary JSON:
python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
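When the pretty-printed file is long, a small script can list just the numeric values. The sketch below makes no assumption about the summary schema: the hypothetical helper numeric_leaves walks whatever JSON it is given and reports every number with its path.

```python
import json
import sys


def numeric_leaves(node, prefix=""):
    """Yield (path, value) for every int or float found in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from numeric_leaves(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        yield prefix, node


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python numeric_leaves.py results/my_dataset/rag_my_dataset_evaluation_summary.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        summary = json.load(handle)
    for path, value in numeric_leaves(summary):
        print(f"{path}: {value}")
```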
More examples (skip ingestion, quality sweeps): references/benchmark-execution.md.
Limitations
- Evaluator behavior is fixed to the filesystem contract and evaluate_rag.py; it does not substitute for custom offline judges or non-RAG benchmarks.
- Vector DB / embedding choices follow the deployed ingestor and RAG env; this CLI alone does not override them.
- Scores depend on retrieval quality, judge model availability, and NVIDIA_API_KEY; empty contexts yield partial RAGAS metrics (see references).
- Large procedural detail lives under references/ to keep routing concise; read those files when the user needs step-by-step conversion, full flags, or error tables.
Troubleshooting
| Error / signal | Likely cause | What to do |
|---|---|---|
| Immediate exit mentioning NVIDIA_API_KEY | Missing or invalid key | Set key via secure channel; see credential hygiene in references/benchmark-execution.md. |
| train.json must be a JSON array | Wrong JSON shape | Top-level array of objects; validate per references/dataset-and-conversion.md. |
| Fewer rows in evaluation_data.json than train.json | Per-query failures | Check stderr: network or stream JSON errors; see error table in benchmark-execution. |
| Empty generated_contexts everywhere | Retrieval gap | Verify collection, ingestion, top_k / vdb_top_k, and ingestor_server_url without /v1 suffix. |
| Ingestor 404 on upload | Bad ingestor base URL | Pass http://host:port only — code appends /v1/. |
Full signal table: references/benchmark-execution.md#common-error-cases-and-signals.
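Two of the signals in the table (dropped rows and empty contexts) can be counted mechanically. The sketch below is hedged: triage and load_rows are hypothetical helpers, and it assumes evaluation_data.json is a JSON array of objects carrying a generated_contexts field, which this document implies but does not spell out. Both file paths are supplied by the caller, since the output location is not stated here.

```python
import json
import sys


def load_rows(path):
    """Load a JSON file and insist on a top-level array."""
    with open(path, encoding="utf-8") as handle:
        rows = json.load(handle)
    if not isinstance(rows, list):
        raise ValueError(f"{path} must be a JSON array")
    return rows


def triage(train_rows, eval_rows, context_key="generated_contexts"):
    """Count dropped rows and evaluated rows whose context list is missing or empty."""
    empty = sum(
        1 for row in eval_rows
        if not (isinstance(row, dict) and row.get(context_key))
    )
    return {
        "train_rows": len(train_rows),
        "evaluated_rows": len(eval_rows),
        "dropped_rows": len(train_rows) - len(eval_rows),
        "empty_context_rows": empty,
    }


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python triage.py /path/to/train.json /path/to/evaluation_data.json
    report = triage(load_rows(sys.argv[1]), load_rows(sys.argv[2]))
    for name, count in report.items():
        print(f"{name}: {count}")
```

A non-zero dropped_rows points at per-query failures in stderr; empty_context_rows equal to evaluated_rows matches the retrieval-gap row of the table.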
Gotchas
- Run from repo root: paths and imports in scripts/eval/evaluate_rag.py assume this; a wrong directory silently breaks imports.
- --ingestor_server_url: pass http://host:port without /v1; the code appends /v1/ automatically. Including /v1 causes 404s on ingestor calls.
- Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. APP_VECTORSTORE_URL, embedding model).
- --model / --llm_endpoint: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
- Stale collections: a previous run's ingested data persists unless you use --force_ingestion. Use --collection with a unique name when comparing quality across isolated runs.
- Empty context metrics: if all generated_contexts are empty, RAGAS scores only nv_accuracy and leaves the other two metrics blank; this is not a silent success.
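The ingestor URL gotcha is easy to guard against in wrapper scripts. A minimal sketch, assuming a hypothetical helper named normalize_ingestor_url that is not part of the evaluator:

```python
def normalize_ingestor_url(url: str) -> str:
    """Drop trailing slashes and a trailing /v1 so the evaluator can append /v1/ itself."""
    cleaned = url.strip().rstrip("/")
    if cleaned.endswith("/v1"):
        cleaned = cleaned[: -len("/v1")]
    return cleaned.rstrip("/")


if __name__ == "__main__":
    for candidate in ("http://localhost:8082", "http://localhost:8082/v1/"):
        print(candidate, "->", normalize_ingestor_url(candidate))
```

Pass the normalized value to --ingestor_server_url; a URL that already has the http://host:port form comes back unchanged.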
Source of truth
| Piece | Location |
|---|---|
| Driver | scripts/eval/evaluate_rag.py (CORPUS_DIRECTORY = corpus, EVAL_DATA = train.json) |
| Human README (always in-repo) | scripts/eval/README.md |
| Full CLI (flags, defaults) | scripts/eval/evaluate_rag.py --help; references/evaluate-rag-cli.md |
| Dataset / conversion | references/dataset-and-conversion.md |
| Runs, outputs, errors | references/benchmark-execution.md |
| Result analysis scripts | references/result-analysis.md |
| Latency / throughput | rag-perf skill, docs/performance-benchmarking.md |
Agent playbook
- Run eval: uv sync --project scripts/eval, then uv run --project scripts/eval python scripts/eval/evaluate_rag.py with required --dataset-paths, --host, and --port (and env NVIDIA_API_KEY). Argument --ingestor_server_url is optional (defaults to http://localhost:8082); pass it only when overriding the ingestor endpoint.
- Quality tuning: see references/benchmark-execution.md for --top_k / --vdb_top_k, reranker and query-rewriting toggles, --temperature, --top-p, --max-tokens.
- Data conversion: follow references/dataset-and-conversion.md.
- Analyze results: references/result-analysis.md; quick scan: python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json.
- Error triage: references/benchmark-execution.md#common-error-cases-and-signals.