Autoresearch

Run Karpathy-style autoresearch optimization on any content. Generates 50+ variants, scores with a 5-expert simulated panel, evolves winners through multiple rounds, outputs optimized version + full experiment log. Use when optimizing landing pages, email sequences, ad copy, headlines, form pages, CTA text, or any conversion-focused content. Triggers on "optimize this page", "run autoresearch", "score these variants", "A/B test this copy".

Published by @ericosiu·0 agent reads / 30d·0 saves·

Autoresearch Skill

Karpathy-style optimization loops for any conversion-focused content. No traffic needed. Simulated expert panel. Minutes, not weeks.

When to use this: Pre-launch content optimization. Generate 50+ variants, score with 5 simulated experts, evolve winners, output the best version + full experiment log.

When NOT to use this: Post-launch real-traffic A/B testing — that requires real analytics, not simulated scoring.

The sequence: Run autoresearch FIRST to hit 85+ simulated score. Then deploy. Then validate with real traffic.


What You'll Produce

Every run outputs 3 files:

FilePurpose
{name}-optimized.{ext}The winning optimized content
data/{name}-experiments.jsonFull experiment log — all variants + all scores
data/{name}-optimization-report.mdHuman-readable summary with winner rationale

Expert Panel (5 Personas)

Score every variant against all 5. Batch all variants into a single API call per round.

#PersonaScoring Lens
1CMO at a mid-market B2B company (50M+ revenue)"Would this make me stop and engage?"
2Skeptical founder"Do I believe this? Would I trust this company?"
3Conversion rate optimizer"Is this clear, specific, and action-driving?"
4Senior copywriter"Is this compelling, differentiated, and well-crafted?"
5Your CEO/founder"Direct, ROI-obsessed, no BS. Would I put this on my site?"

Customization: Replace persona #5 with your own CEO/founder voice. Define their priorities and communication style in a references/founder-voice.md file.

Each judge scores 0–100. Final score = average across all 5 judges.


Round Structure (Per Content Element)

Round 1:
  → Generate 10 variants of the element
  → Batch-score all 10 with the 5-expert panel (1 API call)
  → Rank by average score
  → Keep top 3

Round 2 (Evolution):
  → Analyze what the top 3 did right
  → Generate 10 new variants that push those winning patterns further
  → Batch-score all 10 (1 API call)
  → Keep top 3

Round 3 (If score < threshold):
  → Identify weakest scoring dimension
  → Generate 10 variants optimized for that dimension
  → Batch-score → keep top 1

Multi-element cross-breeding:
  → Take top 1 winner from each element
  → Generate 5 combinations that mix winning elements
  → Score holistically as complete units
  → Output the single best combination

Stop condition: Top variant hits minimum score threshold (default: 80) OR 3 rounds complete.


Content Types & Score Dimensions

Landing Pages

Elements to optimize: Hero headline, subheadline, CTA text, problem section, social proof

Score dimensions:

  • first_impression — Does it grab immediately?
  • clarity — Is the offer instantly understood?
  • trust — Does it feel credible?
  • urgency — Is there a reason to act now?
  • would_convert — Would the judge actually click?

Email Sequences

Elements to optimize: Subject line, opening line, body copy, CTA, PS line

Score dimensions:

  • would_open — Subject line pass rate
  • would_read — Does the opening hook?
  • would_click — Is the CTA compelling?
  • would_reply — Does it feel personal enough to respond to?
  • spam_risk — Does it feel spammy? (lower = better; invert for final score)

Ad Copy

Elements to optimize: Headline, description, CTA

Score dimensions:

  • scroll_stopping — Does it interrupt the scroll?
  • clarity — Is the value prop clear in 3 seconds?
  • click_worthiness — Does the judge want to click?
  • relevance — Does it match likely audience intent?
  • differentiation — Does it stand out from competitors?

Form Pages

Elements to optimize: Headline, subtext, value prop bullets, button text, field order, thank-you copy

Score dimensions:

  • first_impression — Does it feel worth filling out?
  • trust — Do they believe their info is safe and the offer is real?
  • completion_likelihood — Would the judge start filling it out?
  • lead_quality — Would this attract serious prospects (not tire-kickers)?
  • would_fill_out — Final gut check: would they submit?

Step-by-Step Execution Protocol

Step 1: Intake & Parse

Read the source content. Identify content type automatically or confirm with user:

  • HTML file → landing page or form page
  • Markdown / plain text → email or ad copy
  • If ambiguous, ask: "Is this a landing page, email sequence, ad copy, or form page?"

Extract all optimizable elements. List them back to user:

Found 5 elements to optimize:
1. Hero headline: "We help B2B companies grow"
2. Subheadline: "Full-service digital marketing..."
3. CTA: "Get Started"
4. Problem statement: [excerpt]
5. Social proof: [excerpt]

Optimizing: all | Variants per round: 10 | Min score: 80

Step 2: Get API Key

Check for Anthropic API key: $ANTHROPIC_API_KEY environment variable.

export ANTHROPIC_API_KEY="your-api-key-here"

Step 3: Run Optimization Rounds

For each element, run the round structure above.

Critical API efficiency rule: ALWAYS batch all variants into a single prompt. Never call the API once per variant. A round with 10 variants = 1 API call.

Model preference (in order):

  1. claude-sonnet-4-5 (preferred — fast + smart)
  2. claude-opus-4 (if highest quality needed)
  3. Any claude-3.5+ model if the above aren't available

Step 4: Cross-Breed (Multi-Element)

After all elements have winners:

  1. Assemble the top winner from each element into a complete unit
  2. Generate 5 holistic variants that naturally combine the winning elements
  3. Score the complete units (not just individual parts)
  4. Pick the winner with the highest holistic score

Step 5: Write Output Files

# Create output directory
mkdir -p data

# Write optimized content
# Write experiments JSON
# Write optimization report

Experiments JSON structure:

{
  "run_id": "autoresearch-{name}-{timestamp}",
  "content_type": "landing_page",
  "source_file": "path/to/original",
  "min_score_threshold": 80,
  "rounds": [
    {
      "round": 1,
      "element": "hero_headline",
      "variants": [
        {
          "id": 1,
          "text": "...",
          "scores": {
            "cmo": 72,
            "skeptical_founder": 68,
            "cro": 75,
            "copywriter": 70,
            "founder": 65
          },
          "avg_score": 70
        }
      ],
      "top_3": [1, 4, 7],
      "winner_score": 82
    }
  ],
  "final_winner": {
    "hero_headline": "...",
    "subheadline": "...",
    "cta": "...",
    "holistic_score": 87
  }
}

Step 6: Report Back

Summarize results to user:

  • Final winning score
  • Biggest score jump (which element improved most)
  • Top 2 runner-up alternatives (in case winner doesn't feel right)
  • Path to all 3 output files
  • Clear next step

User Options

OptionDefaultDescription
elementsallWhich elements to optimize
variants_per_round10How many variants to generate per round
min_score80Stop when this score is hit
rounds3Max rounds before stopping
auto_applyfalseWhether to overwrite the source file with winners
content_typeauto-detectForce a content type if auto-detect is wrong

Quality Gates

  • < 70: Don't ship. Something fundamental is broken.
  • 70-79: Marginal. One more round targeting the lowest-scoring dimension.
  • 80-84: Good. Shippable. Validate with real traffic.
  • 85-89: Strong. Ship with confidence.
  • 90+: Rare. Ship immediately.

Anti-Patterns to Avoid

  • Never call the API once per variant. Always batch. A 10-variant round = 1 call.
  • Don't over-optimize for one dimension. If you're hitting 95 on clarity but 45 on trust, the overall score is misleading.
  • Don't run more than 5 rounds. If you're not hitting 80 after 3 rounds, the problem is strategic (wrong positioning), not tactical (wrong words).
  • Don't cross-breed until each element has its own winner. Premature cross-breeding creates incoherent combinations.

Bundled with this artifact

3 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Draft Outreach

Research a prospect then draft personalized outreach. Uses web research by default, supercharged with enrichment and CRM. Trigger with "draft outreach to [person/company]", "write cold email to [prospect]", "reach out to [name]".

sales-gtm-revops+1
0
SKILL0

Social Media Strategy

Develop comprehensive social media strategies across multiple platforms. Create content calendars, establish posting schedules, design engagement tactics, plan influencer outreach, and structure paid social campaigns with ROI measurement.

marketing-growth-copy+1
0
SKILL0

Competitive Ads Extractor

Extracts and analyzes competitors' ads from ad libraries (Facebook, LinkedIn, etc.) to understand what messaging, problems, and creative approaches are working. Helps inspire and improve your own ad campaigns.

marketing-growth-copy+1
0