Rag Eval

Filesystem RAG benchmarks: corpus/, train.json, evaluate_rag.py (RAGAS quality). Not for prod monitoring, latency/throughput benchmarking (use rag-perf), or evals outside this repo layout.

Published by @NVIDIA

On-disk RAG evaluation (corpus/ + train.json)

Purpose

Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing corpus/ and train.json, running scripts/eval/evaluate_rag.py, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (scripts/rag-perf, docs/performance-benchmarking.md) — not this skill.

When not to use

Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the corpus/ + train.json layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).

Prerequisites

  • Repo cloned; run commands from repo root (imports and paths assume this).
  • Python 3.11+ and uv; eval deps: uv sync --project scripts/eval.
  • Reachable RAG server and ingestor (defaults often localhost:8081 / 8082).
  • NVIDIA_API_KEY for RAGAS (see credential hygiene); optional RAG_EVAL_JUDGE_MODEL.
  • Dataset roots passed to --dataset-paths each contain corpus/ and train.json.
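The dataset-root requirement above can be checked before launching a run. The following is a minimal sketch, not part of the repo: `check_dataset_root` is a hypothetical helper, and it checks only the two facts stated here (a corpus/ directory exists, and train.json parses as a top-level JSON array). The full train.json rules live in references/dataset-and-conversion.md.

```python
import json
from pathlib import Path


def check_dataset_root(root: Path) -> list[str]:
    """Return a list of layout problems; an empty list means the root looks usable.

    Hypothetical pre-flight helper; it does not validate per-row fields.
    """
    problems = []
    if not (root / "corpus").is_dir():
        problems.append("missing corpus/ directory")
    train = root / "train.json"
    if not train.is_file():
        problems.append("missing train.json")
        return problems
    try:
        data = json.loads(train.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"train.json is not valid JSON: {exc}")
        return problems
    if not isinstance(data, list):
        problems.append("train.json must be a JSON array")
    return problems
```

Calling `check_dataset_root(Path("/path/to/my_dataset"))` for each root passed to --dataset-paths gives a quick pass/fail signal before any server calls are made.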

Instructions

  1. Prepare data: Ensure each dataset directory matches the layout and train.json rules in references/dataset-and-conversion.md. When sources arrive as public links (sites or dataset pages), materialize documents under corpus/, preferring PDF for multimodal content so images stay embedded, and convert CSV, JSONL, and similar formats using the patterns in that reference.
  2. Run eval: uv run --project scripts/eval python scripts/eval/evaluate_rag.py with --dataset-paths, --host, and --port. See references/benchmark-execution.md for command examples, outputs, and errors. Use references/evaluate-rag-cli.md for flag-level detail.
  3. Tune quality: Adjust --top_k / --vdb_top_k, reranker and query-rewriting toggles, and generation overrides (--temperature, --top-p, --max-tokens) as documented in references/benchmark-execution.md when comparing retrieval/generation configs for RAGAS scores.
  4. Analyze results: Use references/result-analysis.md for scripts; scan rag_*_evaluation_summary.json for headline RAGAS metrics.
  5. Triage errors: Use the error signal table and the Troubleshooting section below.
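For the tuning step, a small sweep over retrieval settings can be generated rather than typed by hand. This is a rough sketch under assumptions: `top_k_sweep` is a hypothetical helper, and it assumes --top_k accepts an integer value, which should be confirmed against references/evaluate-rag-cli.md before use. It only builds and prints the command lines; it does not execute them.

```python
import shlex

# Base invocation as given in this skill; run from the repo root.
EVAL_CMD = [
    "uv", "run", "--project", "scripts/eval",
    "python", "scripts/eval/evaluate_rag.py",
]


def top_k_sweep(dataset_path: str, host: str, port: int,
                top_k_values: list[int]) -> list[list[str]]:
    """Build one argv list per top_k value (assumes --top_k takes an integer)."""
    commands = []
    for k in top_k_values:
        commands.append(EVAL_CMD + [
            "--dataset-paths", dataset_path,
            "--host", host,
            "--port", str(port),
            "--top_k", str(k),
        ])
    return commands


# Print the commands for review before running any of them.
for cmd in top_k_sweep("/path/to/my_dataset", "localhost", 8081, [5, 10]):
    print(shlex.join(cmd))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` once the flags have been verified. Keeping everything except the swept flag fixed makes the resulting RAGAS scores easier to compare.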

Examples

Set API key without putting secrets in shell history (preferred patterns): load from a gitignored env file or secrets manager; avoid committing .env; rotate keys if exposed. Details: references/benchmark-execution.md#credential-hygiene-nvidia_api_key.
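One way to follow the env-file pattern is to load the key inside a small Python wrapper, so that it never appears on a command line. The sketch below is illustrative only: `load_env_file` and the file name in the usage note are hypothetical, and the parser handles only simple KEY=VALUE lines with optional surrounding quotes.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> list[str]:
    """Load simple KEY=VALUE lines into os.environ.

    Returns the names that were newly set (never the values). Variables
    already present in the environment are left untouched.
    """
    newly_set = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key not in os.environ:
            os.environ[key] = value
            newly_set.append(key)
    return newly_set
```

For example, `load_env_file(Path("eval.env"))` with a gitignored eval.env containing an NVIDIA_API_KEY line sets the variable for the current process and any child process it launches. Log only the returned names, not the values.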

Minimal eval (key already in environment):

uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
  --dataset-paths /path/to/my_dataset \
  --host localhost \
  --port 8081

Pretty-print summary JSON:

python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
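When only the headline numbers are needed, the summary can be filtered instead of pretty-printed. This is a minimal sketch with hypothetical helper names. It assumes the summary file is a JSON object whose headline metrics are top-level numeric values; the actual schema is described in references/benchmark-execution.md and may differ.

```python
import json
from pathlib import Path


def summary_path(results_dir: Path, dataset: str) -> Path:
    """Path pattern used in this skill: results/<dataset>/rag_<dataset>_evaluation_summary.json."""
    return results_dir / dataset / f"rag_{dataset}_evaluation_summary.json"


def numeric_metrics(summary: dict) -> dict[str, float]:
    """Keep only top-level numeric entries (assumed to be the metrics)."""
    return {
        key: float(value)
        for key, value in summary.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def load_metrics(results_dir: Path, dataset: str) -> dict[str, float]:
    """Read the summary file for one dataset and return its numeric entries."""
    text = summary_path(results_dir, dataset).read_text(encoding="utf-8")
    return numeric_metrics(json.loads(text))
```

`load_metrics(Path("results"), "my_dataset")` returns a flat mapping that is easy to compare across runs.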

More examples (skip ingestion, quality sweeps): references/benchmark-execution.md.

Limitations

  • Evaluator behavior is fixed to the filesystem contract and evaluate_rag.py; it does not substitute for custom offline judges or non-RAG benchmarks.
  • Vector DB / embedding choices follow deployed ingestor and RAG env — not overridden by this CLI alone.
  • Scores depend on retrieval quality, judge model availability, and NVIDIA_API_KEY; empty contexts yield partial RAGAS metrics (see references).
  • Large procedural detail lives under references/ to keep routing concise; read those files when the user needs step-by-step conversion, full flags, or error tables.
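Because empty contexts change which metrics are scored, it can help to measure how many rows came back without retrieved context before reading the scores. The sketch below assumes each evaluation row is a JSON object with a generated_contexts field holding a list; that row shape is an assumption, and `empty_context_rate` is a hypothetical helper.

```python
def empty_context_rate(rows: list[dict]) -> float:
    """Fraction of rows whose generated_contexts is missing or empty.

    Assumes each row is a dict; a missing key counts as empty.
    """
    if not rows:
        return 0.0
    empty = sum(1 for row in rows if not row.get("generated_contexts"))
    return empty / len(rows)
```

A rate of 1.0 corresponds to the "empty everywhere" case described in Troubleshooting; a rate between 0 and 1 points to a partial retrieval gap worth checking per query.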

Troubleshooting

Error / signal | Likely cause | What to do
Immediate exit mentioning NVIDIA_API_KEY | Missing or invalid key | Set key via secure channel; see credential hygiene in references/benchmark-execution.md.
train.json must be a JSON array | Wrong JSON shape | Top-level array of objects; validate per references/dataset-and-conversion.md.
Fewer rows in evaluation_data.json than train.json | Per-query failures | Check stderr: network or stream JSON errors; see error table in references/benchmark-execution.md.
Empty generated_contexts everywhere | Retrieval gap | Verify collection, ingestion, top_k / vdb_top_k, and ingestor_server_url without /v1 suffix.
Ingestor 404 on upload | Bad ingestor base URL | Pass http://host:port only; the code appends /v1/.

Full signal table: references/benchmark-execution.md#common-error-cases-and-signals.
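The "fewer rows" signal can be checked mechanically by comparing the two files. This is a sketch under assumptions: it assumes evaluation_data.json is, like train.json, a top-level JSON array with one entry per successfully evaluated query. Both paths are supplied by the caller because the output location is documented in references/benchmark-execution.md rather than here. The helper names are hypothetical.

```python
import json
from pathlib import Path


def count_rows(path: Path) -> int:
    """Number of entries in a file expected to hold a top-level JSON array."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path} must be a JSON array")
    return len(data)


def dropped_queries(train_json: Path, evaluation_data_json: Path) -> int:
    """Rows present in train.json but absent from evaluation_data.json."""
    return count_rows(train_json) - count_rows(evaluation_data_json)
```

A positive result means some queries failed during the run; stderr from that run is the place to look for the network or stream JSON errors behind them.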

Gotchas

  • Run from repo root: paths and imports in scripts/eval/evaluate_rag.py assume this; a wrong directory silently breaks imports.
  • --ingestor_server_url: pass http://host:port without /v1—the code appends /v1/ automatically. Including /v1 causes 404s on ingestor calls.
  • Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. APP_VECTORSTORE_URL, embedding model).
  • --model / --llm_endpoint: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
  • Stale collections: a previous run's ingested data persists unless you use --force_ingestion. Use --collection with a unique name when comparing quality across isolated runs.
  • Empty context metrics: if all generated_contexts are empty, RAGAS scores only nv_accuracy and leaves the other two metrics blank—this is not a silent success.
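The /v1 gotcha above is easy to guard against when the ingestor URL comes from configuration or user input. A minimal sketch follows; `ingestor_base_url` is a hypothetical helper, not something the repo provides.

```python
def ingestor_base_url(url: str) -> str:
    """Strip trailing slashes and a trailing /v1 so the value suits --ingestor_server_url."""
    base = url.rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return base
```

For instance, `ingestor_base_url("http://localhost:8082/v1/")` returns `"http://localhost:8082"`, the form the evaluator expects before it appends /v1/ itself.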

Source of truth

Piece | Location
Driver | scripts/eval/evaluate_rag.py (CORPUS_DIRECTORY = corpus, EVAL_DATA = train.json)
Human README (always in-repo) | scripts/eval/README.md
Full CLI (flags, defaults) | scripts/eval/evaluate_rag.py --help; references/evaluate-rag-cli.md
Dataset / conversion | references/dataset-and-conversion.md
Runs, outputs, errors | references/benchmark-execution.md
Result analysis scripts | references/result-analysis.md
Latency / throughput | rag-perf skill, docs/performance-benchmarking.md

Agent playbook

  1. Run eval: uv sync --project scripts/eval, then uv run --project scripts/eval python scripts/eval/evaluate_rag.py with required --dataset-paths, --host, and --port (and env NVIDIA_API_KEY). The --ingestor_server_url argument is optional (defaults to http://localhost:8082); pass it only when overriding the ingestor endpoint.
  2. Quality tuning: See references/benchmark-execution.md for --top_k/--vdb_top_k, reranker and query-rewriting toggles, --temperature, --top-p, and --max-tokens.
  3. Data conversion: Follow references/dataset-and-conversion.md.
  4. Analyze results: references/result-analysis.md; quick scan: python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json.
  5. Error triage: references/benchmark-execution.md#common-error-cases-and-signals.

Bundled with this artifact

9 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
