Top Models — Evaluation Shortlist

#		Name	Creator	Released	In $/M	Out $/M	Blended	Intelligence	Coding	Agentic	GPQA %	HLE %	IFBench %	τ²-Bench %	LCR %	GDPval %	SciCode %	TerminalBench %	Omniscience %	Non-Halluc %	Context	Value/$	Safety	Spd	Lat	Frontier	Slug

Recommendations by Use Case

Ranked by fit. Premium = OpenAI/Anthropic/Google, used directly via their own API. Standard/Budget below are available via OpenRouter.

Premium: OpenAI / Anthropic / Google, direct API only · v4.1 per-task metrics

🧠 Deepest Reasoning / Highest Stakes

Architecture, hard debugging, anything worth paying premium for

Claude Opus 4.8 — $1.80/task · intel 56 · 64% non-halluc · best balance of smart + trustworthy
Claude Fable 5 — $2.75/task · highest intel (60) · only 45% non-halluc — verify unattended output, don't trust blind

🖥️ Best Coding

Complex refactors, unfamiliar codebases, architecture-level code review

Claude Fable 5 — $2.75/task · coding 76 · highest coding score tracked · pair with a second opinion given the non-halluc caveat above
GPT-5.5 (xhigh) — $1.03/task · coding 75 · 14% non-halluc — interactive/reviewed use only

Standard: Best quality, cost secondary · v4.1 per-task metrics (AA Index v4.1, 2026-06-16)

🧠 Coaching & Daily Ops

Check-ins, briefings, reminders, conversational coaching

MiniMax M3 — $0.18/task · 7.1 min/task · intel 44 · 84% non-halluc · your current cron default
Qwen3.7 Plus — (paid-tier stats pending) · intel 39 · strong all-rounder with vision

🖥️ Server & Coding Ops

Debugging, scripts, codebase navigation, agent workflows

MiMo-V2.5-Pro — $0.06/task · 7.7 min/task · intel 42 · 75% non-halluc · 1M context · safe for unattended
DeepSeek V4 Pro — $0.06/task · 7.2 min/task · intel 44 · 73% τ³-Banking · interactive only (6% non-halluc disqualifies for unattended)

Budget: Cheapest that still works · v4.1 per-task metrics

🧠 Coaching & Daily Ops

Check-ins, briefings, reminders, conversational coaching

MiMo-V2.5 — $0.14/$0.28 per M (free-tier cost equivalent) · (paid-tier per-task pending) · cheapest safe option for unattended coaching

🖥️ Server & Coding Ops

Debugging, scripts, codebase navigation, agent workflows

MiMo-V2.5 — $0.14/$0.28 per M · (per-task pending) · safe default for unattended + interactive
DeepSeek V4 Flash — $0.14/$0.28 per M · 95% τ²-Bench · interactive only (feedback loop catches errors)
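
Where a per-task figure is still pending, a rough estimate can be derived from the per-million-token prices. The sketch below is only an illustration: the function name `task_cost_usd` and the token counts (200k input, 20k output per task) are hypothetical assumptions, not measured values from the index. Only the $0.14/$0.28 prices come from the entries above.

```python
def task_cost_usd(in_tokens: int, out_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate the cost of one task from token counts and per-million-token prices."""
    input_cost = (in_tokens / 1_000_000) * in_price_per_m
    output_cost = (out_tokens / 1_000_000) * out_price_per_m
    return input_cost + output_cost


# Hypothetical workload: token counts are assumed, not taken from the index.
cost = task_cost_usd(200_000, 20_000, in_price_per_m=0.14, out_price_per_m=0.28)
print(f"${cost:.4f} per task")
```

Real per-task costs depend heavily on how many tokens an agentic run actually consumes, so an estimate like this is a placeholder until the paid-tier stats land.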

Validated by 4-LLM cross-check (GPT-5.5 xhigh, Sonnet 4.6 max, Gemini 3.1 Pro, M3): Standard Coaching = M3 (unanimous). Standard Coding = MiMo-Pro (safer Non-Halluc than V4 Pro for unattended). Budget = MiMo family across the board. Don't use DeepSeek for coaching — V4 Pro Non-Halluc 6.0% and V4 Flash 10.3% are disqualifying for trusted unattended flows.
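
One way such a cross-check could be tallied is sketched below. It assumes each validator casts one vote per use case and that a validator's vote for itself is dropped before counting. The ballots are an illustrative reading of "unanimous" for Standard Coaching, not a record of the actual responses. Mapping the validator "M3" to "MiniMax M3" is also an assumption, and `tally` is a hypothetical helper.

```python
from collections import Counter


def tally(ballots: dict[str, str]) -> Counter:
    """Count votes per candidate, dropping any validator's vote for itself."""
    return Counter(pick for voter, pick in ballots.items() if voter != pick)


# Illustrative ballots for Standard Coaching (assumed, not recorded data).
ballots = {
    "GPT-5.5 (xhigh)": "MiniMax M3",
    "Sonnet 4.6 (max)": "MiniMax M3",
    "Gemini 3.1 Pro": "MiniMax M3",
    "MiniMax M3": "MiniMax M3",  # self-vote, discounted by tally()
}

counts = tally(ballots)
winner, votes = counts.most_common(1)[0]
print(winner, votes)  # MiniMax M3 3
```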

How these were chosen (4-LLM cross-validation method)

Each pick follows the same method: (1) define the use case, (2) treat a Non-Hallucination score below 20% as a hard disqualifier for any unattended flow, (3) compare same-price models head-to-head, (4) distinguish unattended from interactive workloads. The picks were then cross-checked against 4 external LLMs to surface blind spots, and any model's recommendation of itself was discounted.
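
The disqualifier in step (2) can be sketched as a simple filter. This is a minimal illustration, not the tooling actually used: the `Model` class, `allowed_modes`, and `NON_HALLUC_FLOOR` are hypothetical names. The Non-Hallucination percentages are the ones quoted elsewhere on this page.

```python
from dataclasses import dataclass

# Percent; a model scoring below this is barred from unattended flows.
NON_HALLUC_FLOOR = 20.0


@dataclass(frozen=True)
class Model:
    name: str
    non_halluc_pct: float


def allowed_modes(model: Model) -> list[str]:
    """Interactive use is always allowed; unattended use requires clearing the floor."""
    modes = ["interactive"]
    if model.non_halluc_pct >= NON_HALLUC_FLOOR:
        modes.append("unattended")
    return modes


MODELS = [
    Model("MiniMax M3", 84.0),
    Model("MiMo-V2.5-Pro", 75.0),
    Model("GPT-5.5 (xhigh)", 14.0),
    Model("DeepSeek V4 Flash", 10.3),
    Model("DeepSeek V4 Pro", 6.0),
]

unattended_ok = [m.name for m in MODELS if "unattended" in allowed_modes(m)]
print(unattended_ok)  # ['MiniMax M3', 'MiMo-V2.5-Pro']
```

Clearing the floor is necessary but not sufficient: a model that passes can still warrant output verification, as noted for some Premium picks.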

Premium Chat Models — Direct API Access

Questions that don't need tools, files, or automation → use the free web interface directly and skip API costs.

Quick factual / "what is X?" → Gemini 3.5 Flash — free, fastest

Deep reasoning / "explain why" → Claude Sonnet 4.6 — best nuanced analysis

Code review / "review this PR" → GPT-5.5 (xhigh): coding 75, second only to Claude Fable 5 (76) in the Premium picks above

Creative writing / brainstorming → Claude Sonnet 4.6 — most natural prose

Real-time news / opinions → Grok — least filtered, fastest trending access

Image / video / multimodal → Gemini 3.1 Pro — best multimodal reasoning

Quick coding help / "fix this bug" → Gemini 3.5 Flash — free, fast, good enough