Rag Architect

Use when the user asks to design a RAG pipeline, choose a chunking strategy or embedding model, pick a vector database, or evaluate retrieval quality (precision@k, recall@k, NDCG). Examples: 'design a RAG system for our docs', 'what chunk size should I use for this corpus', 'evaluate my retriever against ground truth'. NOT for general LLM cost tuning (use llm-cost-optimizer) or agent loops over retrieval (use agenthub).

Published by @Alireza Rezvani·0 agent reads / 30d·0 saves·

RAG Architect

Design, tune, and evaluate production RAG pipelines with three deterministic tools. Run the tools against the actual corpus and requirements — do not pick chunk sizes or databases by intuition.

Hard rules

  1. Never present model names or vendor prices as current facts. Embedding models and vector-DB pricing rot in months. Recommend a tier (see table below), name a current-generation candidate, and tell the user to verify against the provider's live pricing page.
  2. Every design ends with an evaluation run. A RAG design without retrieval_evaluator.py numbers is a hypothesis, not a deliverable.
  3. Chunking is corpus-driven. Run chunking_optimizer.py on the real documents before choosing a strategy.

Embedding model tiers (pattern, not price list)

TierCurrent-generation examples (verify before use)When
Fast / self-hostedall-MiniLM-L6-v2, bge-smallCost-sensitive, small scale, real-time
Balanced openall-mpnet-base-v2, bge-large, e5-largeQuality without API dependency
Quality APItext-embedding-3-large, voyage-3-largeAccuracy-priority general retrieval
Codevoyage-code-3, CodeBERT-familyCode search corpora

Pricing discipline: build the cost model with a placeholder table — columns model | $/1M tokens (verify) | dims | as-of date — and have the user fill in live numbers. Same for vector DBs (Pinecone/Weaviate/Qdrant/Chroma/pgvector): the selection criteria (managed vs self-hosted, scale, filtering, existing Postgres) are durable; the dollar figures are not.

Workflow

All paths relative to this skill folder. Outputs chain: corpus analysis → design → evaluation.

1. Analyze the corpus and pick chunking

python3 chunking_optimizer.py /path/to/docs --extensions .md .txt -o chunking.json

Emits chunking.json with corpus_info, per-strategy strategy_results, a recommendation, and sample_chunks. Use recommendation.strategy and its config; show the user 2-3 sample_chunks so they can sanity-check boundaries.

2. Design the pipeline from requirements

Write a requirements JSON with these keys (all required): document_types[], document_count, avg_document_size (chars), queries_per_day, query_patterns[], latency_requirement, budget_monthly, accuracy_priority (0-1), cost_priority (0-1), maintenance_complexity.

python3 rag_pipeline_designer.py requirements.json -o design.json

Emits design.json with chunking, embedding, vector_db, retrieval, reranking, evaluation, total_cost, architecture_diagram (mermaid), and config_templates. Present the diagram; label every cost_monthly figure as an estimate to verify (rule 1).

3. Evaluate retrieval quality

Prepare queries.json (list of {id, text} or {"queries": [...]}) and ground_truth.json ({query_id: [relevant_doc_ids]}), then:

python3 retrieval_evaluator.py queries.json /path/to/docs ground_truth.json --k-values 3 5 10 -o eval.json

Reports precision@k, recall@k, MRR, NDCG@k, plus poor_precision_examples / poor_recall_examples for failure analysis.

4. Verification loop

The design is done only when:

  1. eval.json meets targets — typical floors: precision@5 ≥ 0.8, recall@10 ≥ 0.85 (set per use case with the user).
  2. If below target: inspect the poor-example lists, then change one variable (chunking strategy → re-run step 1; embedding tier; add reranking; hybrid retrieval) and re-run step 3. Repeat.
  3. Every recommended model/price in the deliverable carries a "verify current pricing/model availability" note with an as-of date.

References

  • references/chunking_strategies_comparison.md — strategy trade-offs the optimizer implements
  • references/embedding_model_benchmark.md — benchmark methodology (dated snapshot; staleness warning at top)
  • references/rag_evaluation_framework.md — metric definitions (faithfulness, relevance, precision/recall/NDCG)

Bundled with this artifact

7 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0