I want to find the most interesting papers from ~200 recent AI publications. “Interesting” sounds subjective, but I gave the model specific criteria: prioritize counter-intuitive findings and emergent behaviors, downweight incremental benchmarks, and so on. The question: can I trust a single response, or do I need to ask 30 times and vote?
Below, I compare single-shot (one call, take what you get) against consensus (30 runs, keep the papers that appear in more than 50% of them). The answer depends on how hard the model thinks.
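The consensus rule above can be sketched in a few lines. This is a minimal illustration, not the exact pipeline: `runs` stands in for the model's per-call paper selections, and the helper name `consensus_select` is mine.

```python
from collections import Counter

def consensus_select(runs: list[list[str]], threshold: float = 0.5) -> list[str]:
    """Keep papers selected in more than `threshold` of the runs."""
    # set(run) guards against a paper being listed twice in one run
    counts = Counter(paper for run in runs for paper in set(run))
    n = len(runs)
    return sorted(p for p, c in counts.items() if c / n > threshold)

# Toy example: 4 runs over hypothetical paper IDs
runs = [["A", "B"], ["A", "C"], ["A", "B"], ["B", "D"]]
print(consensus_select(runs))  # A and B each appear in 3/4 runs -> ['A', 'B']
```

With 30 runs and a 0.5 threshold, a paper needs at least 16 appearances to survive, which is what filters out papers the model picks only occasionally.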
[Interactive results widget: single-shot (papers from one sampled run) vs. consensus (top 5 papers by selection frequency across the 30 runs; entries below the 50% threshold shown muted), followed by a full selection table marking which papers appear in the top 5 and in the single-shot run.]