Top Models

Source: data/models.json · Auto-scraped from artificialanalysis.ai/models
# Name Creator Released In $/M Out $/M Blended Intelligence Coding Agentic GPQA % HLE % IFBench % τ²-Bench % LCR % GDPval % SciCode % TerminalBench % Omniscience % Non-Halluc % Context Value/$ Safety Spd Lat Frontier Slug

Recommendations by Use Case

Ranked by fit. Premium = OpenAI/Anthropic/Google, used directly via their own API. Standard/Budget below are available via OpenRouter.

Premium: OpenAI / Anthropic / Google - direct API only · v4.1 per-task metrics

🧠 Deepest Reasoning / Highest Stakes

Architecture, hard debugging, anything worth paying premium for

  1. Claude Opus 4.8 - $1.80/task · intel 56 · 64% non-halluc · best balance of smart + trustworthy
  2. Claude Fable 5 - $2.75/task · highest intel (60) · only 45% non-halluc: verify unattended output, don't trust it blindly

🖥️ Best Coding

Complex refactors, unfamiliar codebases, architecture-level code review

  1. Claude Fable 5 - $2.75/task · coding 76 · highest coding score tracked · pair with a second opinion given the non-halluc caveat above
  2. GPT-5.5 (xhigh) - $1.03/task · coding 75 · 14% non-halluc: interactive/reviewed use only

Standard: Best quality, cost secondary · v4.1 per-task metrics (AA Index v4.1, 2026-06-16)

🧠 Coaching & Daily Ops

Check-ins, briefings, reminders, conversational coaching

  1. MiniMax M3 - $0.18/task · 7.1 min/task · intel 44 · 84% non-halluc · your current cron default
  2. Qwen3.7 Plus - (paid-tier stats pending) · intel 39 · strong all-rounder with vision

🖥️ Server & Coding Ops

Debugging, scripts, codebase navigation, agent workflows

  1. MiMo-V2.5-Pro - $0.06/task · 7.7 min/task · intel 42 · 75% non-halluc · 1M context · safe for unattended
  2. DeepSeek V4 Pro - $0.06/task · 7.2 min/task · intel 44 · 73% τ³-Banking · interactive only (6% non-halluc disqualifies for unattended)

Budget: Cheapest that still works · v4.1 per-task metrics

🧠 Coaching & Daily Ops

Check-ins, briefings, reminders, conversational coaching

MiMo-V2.5 - $0.14/$0.28 per M (free-tier cost equivalent) · (paid-tier per-task pending) · cheapest safe option for unattended coaching

🖥️ Server & Coding Ops

Debugging, scripts, codebase navigation, agent workflows

  1. MiMo-V2.5 - $0.14/$0.28 per M · (per-task pending) · safe default for unattended + interactive
  2. DeepSeek V4 Flash - $0.14/$0.28 per M · 95% τ²-Bench · interactive only (feedback loop catches errors)

Validated by 4-LLM cross-check (GPT-5.5 xhigh, Sonnet 4.6 max, Gemini 3.1 Pro, M3): Standard Coaching = M3 (unanimous). Standard Coding = MiMo-Pro (safer Non-Halluc than V4 Pro for unattended). Budget = MiMo family across the board. Don't use DeepSeek for coaching: V4 Pro Non-Halluc 6.0% and V4 Flash 10.3% are disqualifying for trusted unattended flows.
How these were chosen (4-LLM cross-validation method)
Each pick follows this method: (1) define the use case, (2) apply Non-Hallucination < 20% as a hard disqualifier for any unattended flow, (3) compare same-price models head-to-head, (4) distinguish unattended vs interactive workloads. Cross-validated against 4 external LLMs to surface blind spots. Models recommending themselves were discounted.
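Steps (2) to (4) of the method can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the `Model` class, the `eligible` and `pick` helpers, and the tie-break order (cheapest first, then highest intel) are all assumptions made for the example. Only the 20% Non-Hallucination floor and the two sample rows come from the text above.

```python
from dataclasses import dataclass

# Hard disqualifier from step (2): below this Non-Hallucination rate,
# a model is ruled out for unattended flows.
NON_HALLUC_FLOOR = 20.0


@dataclass(frozen=True)
class Model:
    """Hypothetical record type; field names are assumptions."""
    name: str
    cost_per_task: float  # USD per task
    intel: int            # intelligence index
    non_halluc: float     # Non-Hallucination rate, in percent


def eligible(model: Model, unattended: bool) -> bool:
    """Steps (2) and (4): the floor applies only to unattended workloads."""
    return not unattended or model.non_halluc >= NON_HALLUC_FLOOR


def pick(models: list[Model], unattended: bool) -> list[Model]:
    """Step (3): rank eligible models head-to-head.

    Sorting by cost, then by descending intel, is an assumed tie-break
    chosen for illustration only.
    """
    candidates = [m for m in models if eligible(m, unattended)]
    return sorted(candidates, key=lambda m: (m.cost_per_task, -m.intel))


# Figures copied from the Standard "Server & Coding Ops" entries.
CODING_OPS = [
    Model("MiMo-V2.5-Pro", 0.06, 42, 75.0),
    Model("DeepSeek V4 Pro", 0.06, 44, 6.0),
]

if __name__ == "__main__":
    print([m.name for m in pick(CODING_OPS, unattended=True)])
    print([m.name for m in pick(CODING_OPS, unattended=False)])
```

With these two rows, the unattended call keeps only MiMo-V2.5-Pro, because DeepSeek V4 Pro's 6% Non-Hallucination rate falls under the floor. The interactive call keeps both models.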

Premium Chat Models - Direct API Access

Questions that don't need tools, files, or automation → use the free web interface directly and skip API costs.

Quick factual / "what is X?" → Gemini 3.5 Flash - free, fastest
Deep reasoning / "explain why" → Claude Sonnet 4.6 - best nuanced analysis
Code review / "review this PR" → GPT-5.5 (xhigh) - highest coding score (59.1)
Creative writing / brainstorming → Claude Sonnet 4.6 - most natural prose
Real-time news / opinions → Grok - least filtered, fastest trending access
Image / video / multimodal → Gemini 3.1 Pro - best multimodal reasoning
Quick coding help / "fix this bug" → Gemini 3.5 Flash - free, fast, good enough