Which AI? · the scores behind the opinions

The AI Council, benchmarked — every number links to its source.

Every check runs on a council of independent AI models, all on AWS Bedrock and all inside the BAA boundary. Below is how each seat scores on MedQA (the US medical-licensing-style exam) and LegalBench (Stanford's legal-reasoning benchmark), next to the human reference points. Scores are independent published evals of the exact model versions where available; where only a sibling model has a published score, it is labeled as such, and no score is invented.

Medicine — MedQA (USMLE-style)

The medical checks run on models benchmarked on MedQA, a question set in the style of the US medical licensing exam. The human passing threshold is ≈60% and expert physicians score ≈87% on the same questions; the two lead models on this council score 92.1–92.5%. These are still AI opinions, which is exactly why everything stays triage-only: a literature-based discussion to take to your own physician, never a prescription.

Who	Role	MedQA score	Source	Note
Expert physicians, same questions	human reference	87%	Liévin et al. ↗
Human passing threshold (USMLE-style)	human reference	≈60%	PLOS Digit. Health ↗
Claude Opus 4.1	Anthropic · chairs the consensus	92.5%	vals.ai ↗
Claude Sonnet 4.6	Anthropic · senior seat	92.1%	vals.ai ↗
Amazon Nova Pro	Amazon · document analysis	81.1%	Stanford HELM ↗
Claude Haiku 4.5	Anthropic · fast seat	79.6%	vals.ai ↗
Llama 4 Maverick	Meta · open-weight diversity	43.3%	vals.ai ↗	anomalously low vs its own model family on this harness; its legal score is 77.8%
Amazon Nova Lite	Amazon · fast cross-check seat (cross-checks, never chairs)	—		no published score for this exact model
Llama 3.3 70B	Meta · independent reasoning seat	—	vals.ai ↗	closest published sibling (Llama 3.1 70B): 84.8%
Mistral Pixtral Large	Mistral (EU) · European lab · vision seat	—	vals.ai ↗	its text backbone (Mistral Large 24.11): 76.2%
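
The comparison against the human reference points can be reproduced in a few lines. This is a minimal sketch, not part of the product: the names `MEDQA_SCORES`, `EXPERT_PHYSICIANS`, `PASSING_THRESHOLD` and `seats_at_or_above` are hypothetical, and the numbers are copied from the MedQA table on this page. Seats without a published score for the exact model are left out.

```python
# Published MedQA scores copied from the table on this page (percent correct).
# Seats with no published score for the exact model are omitted on purpose.
MEDQA_SCORES = {
    "Claude Opus 4.1": 92.5,
    "Claude Sonnet 4.6": 92.1,
    "Amazon Nova Pro": 81.1,
    "Claude Haiku 4.5": 79.6,
    "Llama 4 Maverick": 43.3,
}

# Human reference points from the same table (hypothetical constant names).
EXPERT_PHYSICIANS = 87.0
PASSING_THRESHOLD = 60.0  # approximate, as stated in the table


def seats_at_or_above(reference: float) -> list[str]:
    """Return the seats whose published score meets or beats the reference."""
    return [name for name, score in MEDQA_SCORES.items() if score >= reference]


print("At or above expert physicians:", seats_at_or_above(EXPERT_PHYSICIANS))
print("At or above passing threshold:", seats_at_or_above(PASSING_THRESHOLD))
```

On these numbers, two seats sit above the expert-physician reference, four clear the passing threshold, and Llama 4 Maverick falls below both.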

Law — LegalBench (legal reasoning)

Legal triage runs on models independently scored on LegalBench, Stanford's legal-reasoning benchmark, where the lead seats score 82–84%. For context, passing the bar exam's MBE takes ≈58–62% (a different test, shown only for context, and linked to its source like everything else). These are still AI opinions, which is exactly why this stays triage-only, never legal advice.

Who	Role	LegalBench score	Source	Note
Passing-lawyer threshold (bar exam MBE)	human reference	≈60%	NCBE ↗	a different test, shown for context
Claude Opus 4.1	Anthropic · chairs the consensus	83.5%	vals.ai ↗
Claude Sonnet 4.6	Anthropic · senior seat	82.1%	vals.ai ↗
Claude Haiku 4.5	Anthropic · fast seat	81.2%	vals.ai ↗
Llama 4 Maverick	Meta · open-weight diversity	77.8%	vals.ai ↗
Llama 3.3 70B	Meta · independent reasoning seat	77.2%	vals.ai ↗
Amazon Nova Pro	Amazon · document analysis	73.6%	Stanford HELM ↗
Amazon Nova Lite	Amazon · fast cross-check seat (cross-checks, never chairs)	—		no published score for this exact model
Mistral Pixtral Large	Mistral (EU) · European lab · vision seat	—		no published score for this exact model

What the same questions cost with humans

Two worked examples from the live checks — the legal side and the medical side.

Legal — what the same questions cost per side, with humans

Contested divorce, lawyer fees	$15,000 – $50,000+ per side
Human-led expert panel (family lawyer + forensic accountant + mediator)	$10,000 – $30,000+
Family-lawyer retainer, just to start	$3,000 – $10,000
One consultation hour	$300 – $500
DivorceCheck AI Opinion	$25 – $249

Medical — what this costs with humans

Human-led longevity expert panel (MD + specialists, concierge)	$10,000 – $100,000 / year
Longevity-clinic membership	$2,500 – $15,000+ / year
Executive health assessment	$2,000 – $5,000
One functional-medicine consult	$300 – $600
ImmortalityCheck AI Opinion & Discussion	$25 – $249

Our system doesn't replace any of them — but it sure helps to educate you before you pay them.

Pick your council size

2 models — $25 · 5 — $49 · 10 — $99 · 12 (full bench) — $249

More seats = more independent opinions debating your case before the chair writes the consensus. RandomCheck runs the live council seats on every $5 question.
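
To compare the council sizes on a per-seat basis, here is a minimal sketch of the arithmetic. The tier prices are copied from the list above; the names `TIERS` and `price_per_seat` are hypothetical and exist only for this illustration.

```python
# Council tiers copied from the pricing list above: seats -> price in USD.
TIERS = {2: 25, 5: 49, 10: 99, 12: 249}


def price_per_seat(seats: int) -> float:
    """Price of one tier divided by the number of seats it includes."""
    return TIERS[seats] / seats


for seats, price in TIERS.items():
    print(f"{seats:>2} seats: ${price} total, ${price_per_seat(seats):.2f} per seat")
```

On these prices the 5-seat council works out cheapest per seat ($9.80) and the 12-seat full bench is the most expensive per seat ($20.75).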

Ask the AI Council →

Sources

Scores are independent published evals of the exact model versions where available, with siblings labeled as such and no score invented. Re-verified 2026-07-03; scores move as models ship.
Official benchmarks: MedQA paper · MedQA data · LegalBench · LegalBench paper.
Independent leaderboards: vals.ai MedQA · vals.ai LegalBench · Stanford HELM · NCBE MBE.