Benchmark · MMLU-Pro

We put the consensus to the test.

Name: consens.io consensus benchmark: MMLU-Pro pooled snapshot
Creator: consens.io
License: https://creativecommons.org/licenses/by/4.0/

consens.io sends one question to six leading models and fuses their answers into a single consensus. So we asked the obvious question back: does that consensus actually beat the models it's built from? Here's the run, graded against known-correct answers, nothing hidden.

314 questions · 6 models + consensus · MMLU-Pro · Closed-book, no tools · Consensus by Gemini 3.5 Flash

When the models disagree

Accuracy on the questions where they split · MMLU-Pro

Bars, highest to lowest: 72.1%, 70.6%, 69.1%, 64.7%, 58.8%, 44.1%, 39.7%

The blue bar is the consens.io consensus, first on the questions where the models can't agree.

How it ran

One question, six answers, one consensus.

MMLU-Pro is a hard, expert-level multiple-choice exam. Every question runs closed-book: no web search, no tools, no outside sources, just what each model knows.

01 · Ask

Six models, independently

The same question goes to all six consensus models in parallel. Each answers on its own, with tools and web access switched off.

02 · Fuse

The consensus is built

The exact same consensus logic the app uses combines the six answers into one final answer, with Gemini 3.5 Flash as the synthesizing model, the same path a real prompt takes.

03 · Grade

Checked against the truth

Every answer (each model, a plain majority vote, and the consensus) is scored against the dataset's known-correct option.
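The grading step can be sketched in a few lines of Python. This is a minimal illustration, not consens.io's code: the function names, the sample answers, and the tie-break rule (the first answer seen wins) are all assumptions. It covers only the plurality-vote baseline and the scoring, not the consensus synthesis itself.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the option chosen most often by the models.

    Assumption: ties go to the option that appears first in the list.
    The source does not say how ties are broken.
    """
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, ground_truth):
    """Share of questions where the prediction matches the known-correct option."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Hypothetical data: three questions, six model answers each.
model_answers = [
    ["A", "A", "A", "A", "A", "A"],  # unanimous
    ["C", "C", "B", "C", "D", "C"],  # split, plurality is right
    ["F", "F", "F", "G", "G", "H"],  # split, plurality is wrong
]
ground_truth = ["A", "C", "G"]

votes = [plurality_vote(answers) for answers in model_answers]
vote_accuracy = accuracy(votes, ground_truth)
print(votes, round(vote_accuracy, 3))
```

The same accuracy function would score each single model and the consensus, since all of them are graded against the same known-correct option.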

The headline number

Accuracy across all 314 questions.

The consensus comes out on top, but only just. At the top of the field the gaps are tiny, and the 95% confidence intervals (the whiskers) overlap heavily. Read honestly, the leading systems are in a statistical tie. The takeaway isn't "we win every question." It's that the consensus reliably tracks the best models instead of dragging toward the average.

Overall accuracy on MMLU-Pro

Share of 314 questions answered correctly · whiskers show the 95% confidence interval

Accuracy (%), highest to lowest:

consens.io consensus: 91.1%
Claude Opus 4.8: 90.8%
GPT-5.5: 90.5%
Majority vote (baseline): 89.8%
Gemini 3.5 Flash: 89.5%
Grok 4.3: 88.2%
DeepSeek V4 Pro: 85.0%
Mistral Medium 3.5: 84.1%

Where it gets interesting

When the models disagree, the consensus pulls clear.

On most questions the models already agree, so combining them changes little. The real test is the 68 questions where they split. Here, picking the wrong model costs you, and the gap opens up. The consensus lands on top again, well above a naive majority vote. The whiskers are wide (only 68 questions), so treat this as a strong signal, not a settled fact.

Accuracy on the 68 disagreement questions

Only questions where the six models did not unanimously agree · whiskers show the 95% confidence interval

Accuracy (%), highest to lowest:

consens.io consensus: 72.1%
Claude Opus 4.8: 70.6%
GPT-5.5: 69.1%
Majority vote (baseline): 66.2%
Gemini 3.5 Flash: 64.7%
Grok 4.3: 58.8%
DeepSeek V4 Pro: 44.1%
Mistral Medium 3.5: 39.7%
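Picking out the disagreement subset is a simple filter: keep every question where the six model answers are not all identical, then score a system on that subset only. The sketch below assumes a hypothetical row layout (`model_answers`, `consensus`, `truth`) and made-up sample rows. It is not the benchmark's actual data format.

```python
def is_disagreement(answers):
    """True when the model answers are not unanimous."""
    return len(set(answers)) > 1


def accuracy_on_disagreements(rows, system_key):
    """Accuracy of one system, restricted to non-unanimous questions.

    Returns (accuracy, number of disagreement questions).
    """
    split = [row for row in rows if is_disagreement(row["model_answers"])]
    if not split:
        raise ValueError("no disagreement questions in this sample")
    correct = sum(row[system_key] == row["truth"] for row in split)
    return correct / len(split), len(split)


# Hypothetical rows: one unanimous question and three splits.
rows = [
    {"model_answers": ["A"] * 6, "consensus": "A", "truth": "A"},
    {"model_answers": ["B", "B", "C", "B", "B", "D"], "consensus": "B", "truth": "B"},
    {"model_answers": ["E", "F", "E", "F", "G", "E"], "consensus": "F", "truth": "F"},
    {"model_answers": ["H", "H", "I", "I", "I", "H"], "consensus": "H", "truth": "I"},
]

share, n_split = accuracy_on_disagreements(rows, "consensus")
print(n_split, round(share, 3))
```

The unanimous first row is dropped before scoring, which is why accuracy on this subset is lower than overall accuracy for every system: the easy, agreed-upon questions no longer count.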

What it means

Three honest reads.

#1 · The consensus ranks first overall and on the hard questions, though the top of the field is within the margin of error.

+32 pts · On disagreement questions the consensus (72.1%) sits far above the weakest model (39.7%): picking wrong is expensive.

0 · Errors and abstentions for the consensus: it produced a clean, parsable answer on every single question.

Never the weak link

It tracks the best, not the average

Fusing six answers doesn't regress to the mean. The consensus stays at the top of the pack instead of being pulled down by weaker models.

Better than a vote

Reading answers beats counting them

The consensus reasons over the six answers and beats a plain majority vote, most clearly when the models split.

No overclaiming

The gaps at the top aren't significant

With 314 questions the confidence intervals overlap. We're not claiming a knockout, just that combining beats betting on one model.

The fine print

Exactly how this was measured.

Setup

Dataset: 314 questions from MMLU-Pro, an expert-level multiple-choice benchmark, pooled from three runs that used different sampling strategies (disagreement-enriched, category-balanced, and random).
Closed-book: web search, retrieval and tools were disabled for every model. Each answered from its own knowledge only.
Models: GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, Mistral Medium 3.5, DeepSeek V4 Pro and Grok 4.3, the same six the app uses.
Consensus: built with the app's production consensus logic; the synthesizing model for this run was Gemini 3.5 Flash.
Grading: one attempt per question, scored against the dataset's ground-truth option. Majority vote is a simple plurality of the six model answers.

Reading the numbers

Whiskers are 95% confidence intervals. Where they overlap heavily, the data do not support calling the difference statistically significant, and at the top of the overall chart they overlap a lot.
The disagreement subset is only 68 questions, so its intervals are wide. The ordering is a strong, consistent signal rather than a proven result.
This is a snapshot of one frozen prompt and model line-up. We're keeping the prompt fixed so the run keeps collecting comparable, unbiased data over time.
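To see why 68 questions give wide whiskers, here is one common way to compute a 95% interval for a proportion, the Wilson score interval. The page does not say which interval method was used, so Wilson is an assumption. The counts 49 and 45 are also inferred, from 72.1% and 66.2% of 68, rather than taken from published raw data.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Inferred counts on the 68 disagreement questions (assumption, see above).
consensus_ci = wilson_interval(49, 68)  # consensus, about 72.1%
majority_ci = wilson_interval(45, 68)   # majority vote, about 66.2%

# Two intervals overlap when each one starts before the other ends.
overlap = consensus_ci[0] <= majority_ci[1] and majority_ci[0] <= consensus_ci[1]

print([round(x, 3) for x in consensus_ci])
print([round(x, 3) for x in majority_ci])
print(overlap)
```

Under these assumptions each interval spans roughly twenty percentage points, so the two overlap even though the point estimates are about six points apart.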

Try it yourself

Ask once. Compare every answer. Keep the consensus.
