Benchmark · MMLU-Pro
consens.io sends one question to six leading models and fuses their answers into a single consensus. So we asked the obvious question back: does that consensus actually beat the models it's built from? Here's the run, graded against known-correct answers, nothing hidden.
When the models disagree
Accuracy on the questions where they split · MMLU-Pro
The blue bar is the consens.io consensus, which ranks first on the questions where the models can't agree.
How it ran
MMLU-Pro is a hard, expert-level multiple-choice exam. Every question runs closed-book: no web search, no tools, no outside sources, just what each model knows.
01 · Ask
The same question goes to all six consensus models in parallel. Each answers on its own, with tools and web access switched off.
02 · Fuse
The same consensus logic the app uses combines the six answers into one final answer, with Gemini 3.5 Flash as the synthesizing model. This is the same path a real prompt takes.
03 · Grade
Every answer (each model, a plain majority vote, and the consensus) is scored against the dataset's known-correct option.
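The ask and grade steps above can be sketched roughly as follows. This is a minimal illustration, not the consens.io implementation: `ask_model`, the model names, and the canned answers are hypothetical stand-ins, and the fuse step is left out because the consensus logic itself is not described here. It shows the parallel fan-out, a plain majority vote baseline, and grading against the known-correct option.

```python
import asyncio
from collections import Counter

# Hypothetical model names; the real six are not listed here.
MODELS = ["model_a", "model_b", "model_c", "model_d", "model_e", "model_f"]

# Canned answers so the sketch runs without any API access (illustrative only).
CANNED = {
    "q1": {"model_a": "B", "model_b": "B", "model_c": "B",
           "model_d": "B", "model_e": "B", "model_f": "B"},
    "q2": {"model_a": "A", "model_b": "C", "model_c": "C",
           "model_d": "A", "model_e": "C", "model_f": "D"},
}
GOLD = {"q1": "B", "q2": "A"}


async def ask_model(model: str, question_id: str) -> str:
    """Hypothetical stand-in for one closed-book model call."""
    await asyncio.sleep(0)  # placeholder for network latency
    return CANNED[question_id][model]


async def ask_all(question_id: str) -> dict[str, str]:
    """Ask: send the same question to every model concurrently."""
    answers = await asyncio.gather(*(ask_model(m, question_id) for m in MODELS))
    return dict(zip(MODELS, answers))


def majority_vote(answers: dict[str, str]) -> str:
    """Baseline: the most common option; ties go to the option seen first."""
    return Counter(answers.values()).most_common(1)[0][0]


def grade(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Grade: share of questions whose prediction matches the gold option."""
    correct = sum(predictions[q] == gold[q] for q in gold)
    return correct / len(gold)


async def run() -> dict[str, float]:
    per_question = {q: await ask_all(q) for q in GOLD}
    scores = {
        m: grade({q: a[m] for q, a in per_question.items()}, GOLD)
        for m in MODELS
    }
    scores["majority_vote"] = grade(
        {q: majority_vote(a) for q, a in per_question.items()}, GOLD
    )
    return scores


scores = asyncio.run(run())
```

In this toy data the majority vote gets the split question wrong even though two models had it right, which is the kind of case where a fusion step that reasons over the answers has room to do better than a vote.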
The headline number
The consensus comes out on top, but only just. At the top of the field the gaps are tiny, and the 95% confidence intervals (the whiskers) overlap heavily. Read honestly, the leading systems are in a statistical tie. The takeaway isn't "we win every question." It's that the consensus reliably tracks the best models instead of dragging toward the average.
Overall accuracy on MMLU-Pro
Share of 314 questions answered correctly · whiskers show the 95% confidence interval
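To give a sense of how wide the whiskers are at this sample size, here is a minimal sketch of one common way to compute a 95% interval for a proportion, the Wilson score interval. The page does not say which interval method was used, so the choice of Wilson is an assumption, and the count of 270 correct out of 314 is purely illustrative, not a reported result.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative count only; not a figure from the benchmark.
low, high = wilson_interval(270, 314)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With a few hundred questions the interval spans several percentage points, so two systems that differ by a point or two are not clearly separated.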
Where it gets interesting
On most questions the models already agree, so combining them changes little. The real test is the 68 questions where they split. Here, picking the wrong model costs you, and the gap opens up. The consensus lands on top again, well above a naive majority vote. The whiskers are wide because there are only 68 questions, so treat this as a strong signal, not a settled fact.
Accuracy on the 68 disagreement questions
Only questions where the six models did not unanimously agree · whiskers show the 95% confidence interval
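Selecting that subset is simple to express. The sketch below assumes answers are stored as a mapping from question id to each model's chosen option. The data layout, names, and sample answers are hypothetical, and only three models are shown for brevity.

```python
# Hypothetical sample: question id -> option chosen by each model.
per_question = {
    "q1": {"model_a": "B", "model_b": "B", "model_c": "B"},
    "q2": {"model_a": "A", "model_b": "C", "model_c": "C"},
    "q3": {"model_a": "D", "model_b": "D", "model_c": "E"},
}
gold = {"q1": "B", "q2": "A", "q3": "D"}


def disagreement_subset(answers_by_question):
    """Return ids of questions where the models were not unanimous."""
    return [
        qid for qid, answers in answers_by_question.items()
        if len(set(answers.values())) > 1
    ]


def accuracy_on(question_ids, predictions, gold_answers):
    """Share of the given questions that one system answered correctly."""
    if not question_ids:
        raise ValueError("no questions to score")
    hits = sum(predictions[qid] == gold_answers[qid] for qid in question_ids)
    return hits / len(question_ids)


split = disagreement_subset(per_question)
model_a_predictions = {qid: a["model_a"] for qid, a in per_question.items()}
model_a_split_accuracy = accuracy_on(split, model_a_predictions, gold)
```

Scoring every system on the same `split` list keeps the comparison like for like: each model, the majority vote, and the consensus are all measured on exactly the questions where at least one model dissented.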
What it means
Never the weak link
Fusing six answers doesn't regress to the mean. The consensus stays at the top of the pack instead of being pulled down by weaker models.
Better than a vote
The consensus reasons over the six answers and beats a plain majority vote, most clearly when the models split.
No overclaiming
With 314 questions the confidence intervals overlap. We're not claiming a knockout, just that combining beats betting on one model.