LLM-OPS -- IA de Producao

Overview

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, prompt engineering avancado, custos de LLM, evals de qualidade e arquiteturas de IA para producao. Ativar para: implementar RAG, criar pipeline de embeddings, Pinecone/Chroma/pgvector, fine-tuning, prompt engineering, reducao de custos de LLM, evals, cache semantico, streaming, agents.

When to Use This Skill

When you need specialized assistance with this domain

Do Not Use This Skill When

The task is unrelated to llm ops
A simpler, more specific tool can handle the request
The user needs general-purpose assistance without domain expertise

How It Works

A diferenca entre um prototipo de IA e um produto de IA e operabilidade. LLM-Ops e a engenharia que torna IA confiavel, escalavel e economica.

Arquitetura Rag Completa

[Documentos] -> [Chunking] -> [Embeddings] -> [Vector DB] | [Query] -> [Embed query] -> [Semantic Search] -> [Top K chunks] | [LLM + Context] -> [Resposta]

Pipeline De Indexacao

from anthropic import Anthropic import chromadb

client = Anthropic()
chroma = chromadb.PersistentClient(path="./chroma_db")

def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk: chunks.append(chunk)
    return chunks

def index_document(doc_id, content_text, metadata=None):
    chunks = chunk_text(content_text)
    ids = [f"{doc_id}_chunk_{i}" for i in range(len(chunks))]
    collection.upsert(ids=ids, documents=chunks)
    return len(chunks)

Pipeline De Query Com Rag

def rag_query(query, top_k=5, system=None): results = collection.query( query_texts=[query], n_results=top_k, include=["documents", "metadatas", "distances"]) context_parts = [] for doc, meta, dist in zip(results["documents"][0], results["metadatas"][0], results["distances"][0]): if dist < 1.5: src = meta.get("source", "doc") context_parts.append(f"[Fonte: {src}] {doc}") context = "

".join(context_parts) response = client.messages.create( model="claude-opus-4-20250805", max_tokens=1024, system=system or "Responda baseado no contexto.", messages=[{"role": "user", "content": f"Contexto: {context}

{query}"}]) return response.content[0].text

Escolha Do Vector Db

DB	Melhor Para	Hosting	Custo
Chroma	Desenvolvimento, local	Self-hosted	Gratis
pgvector	Ja usa PostgreSQL	Self/Cloud	Gratis
Pinecone	Producao gerenciada	Cloud	USD 70+/mes
Weaviate	Multi-modal	Self/Cloud	Gratis+
Qdrant	Alta performance	Self/Cloud	Gratis+

Pgvector

CREATE EXTENSION IF NOT EXISTS vector; CREATE TABLE knowledge_embeddings ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), content TEXT NOT NULL, embedding vector(1536), metadata JSONB, created_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100); SELECT content, 1 - (embedding <=> QUERY_VECTOR) AS similarity FROM knowledge_embeddings ORDER BY similarity DESC LIMIT 5;

Estrutura De Prompt De Elite

Componentes do system prompt Auri:

Identidade: Nome (Auri), Tom (Natural, caloroso, direto), Plataforma (Amazon Alexa)
Regras: Maximo 3 paragrafos curtos, sem markdown, linguagem conversacional
Capacidades: analise de negocios, conselho baseado em dados, criatividade
Limitacoes: sem internet tempo real, sem transacoes financeiras
Personalizacao: {user_name}, {user_preferences}, {relevant_history}

Chain-Of-Thought

def cot_analysis(problem: str) -> str: steps = [ "1. O que exatamente esta sendo pedido?", "2. Que informacoes sao criticas para resolver?", "3. Quais abordagens possiveis existem?", "4. Qual abordagem e melhor e por que?", "5. Quais riscos ou limitacoes existem?", ] prompt = f"Analise passo a passo:

PROBLEMA: {problem}

" prompt += " ".join(steps) + "

Resposta final (concisa, para voz):" return call_claude(prompt)

Cache Semantico

class SemanticCache: def init(self, similarity_threshold=0.95): self.threshold = similarity_threshold self.cache = {}

    def get_cached(self, query, embedding):
        for cached_emb, (response, _) in self.cache.items():
            if cosine_similarity(embedding, cached_emb) >= self.threshold:
                return response
        return None

    def set_cache(self, query, embedding, response):
        self.cache[tuple(embedding)] = (response, query)

Estimativa De Custos Claude

PRICING = { "claude-opus-4-20250805": {"input": 15.00, "output": 75.00}, "claude-sonnet-4-5": {"input": 3.00, "output": 15.00}, "claude-haiku-3-5": {"input": 0.80, "output": 4.00}, }

def estimate_monthly_cost(model, avg_input, avg_output, req_per_day):
    p = PRICING[model]
    daily = (avg_input + avg_output) * req_per_day / 1e6
    monthly = daily * p["input"] * 30
    return {"model": model, "monthly_cost": "USD %.2f" % monthly}

Framework De Avaliacao

from anthropic import Anthropic client = Anthropic()

def evaluate_response(question, expected, actual, criteria):
    criteria_text = "

".join(f"- {c}" for c in criteria) eval_prompt = ( f"Avalie a resposta do assistente de IA.

" f"PERGUNTA: {question} RESPOSTA ESPERADA: {expected} " f"RESPOSTA ATUAL: {actual}

Criterios: {criteria_text}

" "Nota 0-10 e justificativa para cada criterio. Formato JSON." ) response = client.messages.create( model="claude-haiku-3-5", max_tokens=1024, messages=[{"role": "user", "content": eval_prompt}] ) import json return json.loads(response.content[0].text)

AURI_EVALS = [
    {
        "question": "Quais sao os principais riscos de abrir startup agora?",
        "criteria": ["precisao_factual", "relevancia", "clareza_para_voz"]
    },
]

6. Comandos

Comando	Acao
/rag-setup	Configura pipeline RAG completo
/embed-docs	Indexa documentos no vector DB
/prompt-optimize	Otimiza prompt para qualidade e custo
/cost-estimate	Estima custo mensal do LLM
/eval-run	Roda suite de evals de qualidade
/cache-setup	Configura cache semantico
/model-select	Escolhe modelo ideal para o caso de uso

Best Practices

Provide clear, specific context about your project and requirements
Review all suggestions before applying them to production code
Combine with other complementary skills for comprehensive analysis

Common Pitfalls

Using this skill for tasks outside its domain expertise
Applying recommendations without understanding your specific context
Not providing enough project context for accurate analysis

Limitations

Use this skill only when the task clearly matches the scope described above.
Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

LLM Ops