
Voice AI is moving faster than the tools we use to measure it. Every major AI lab — OpenAI, Google DeepMind, Anthropic, xAI — is racing to ship voice models capable of natural, real-time conversation.
But the benchmarks used to evaluate these models still mostly rely on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to the way people actually speak.
Scale AI, the company that pioneered big-data annotation and whose founder was poached by Meta last year to lead its Superintelligence Labs, is still going strong and taking on the problem: today it launches Voice Showdown, what it calls the first global preference-based arena designed to benchmark voice AI through the lens of real human interaction.
The pitch to users is straightforward: free access to the world's leading frontier models. Through Scale's ChatLab platform, users can talk to high-end models that would normally require multiple $20-per-month subscriptions, at no cost. In return, users occasionally take part in blind, head-to-head "battles," choosing which of two anonymous voice models offers the better experience; those choices become the input for what Scale bills as the industry's most authentic, human-preference-driven ranking of voice AI models.
"Voice AI is truly the fastest-moving frontier in AI right now," said Janie Gu, product manager for Showdown at Scale AI. "But the way we evaluate voice models hasn't kept up."
Results from thousands of spontaneous voice conversations in more than 60 languages reveal capability gaps that other benchmarks consistently miss.
How Scale’s Voice Showdown works
Voice Showdown is built on ChatLab, Scale's model-agnostic chat platform, where users can interact with a frontier AI model of their choice at no cost. The platform has been rolled out to Scale's global community of more than 500,000 annotators, nearly 300,000 of whom have submitted at least one preference vote. Scale is opening the platform to a public waitlist today.
The scoring mechanism is elegant in its simplicity: as the user holds a natural voice conversation with the model, the system occasionally (on fewer than 5% of voice prompts) triggers a side-by-side comparison. The same request is sent to a second, anonymous model, and the user chooses which response they prefer.
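Scale reports its rankings as Elo scores. The company hasn't published its exact rating math, so the sketch below assumes the standard Elo update most preference arenas use (the K-factor of 32 is a conventional default, not a published Scale parameter); it shows how a single blind vote would move two models' scores.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Apply one blind preference vote to two models' Elo ratings.

    Standard Elo: model A's expected win probability is logistic in the
    rating gap; the winner gains (and the loser loses) K times the
    "surprise" of the outcome.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Example: an upset win against a higher-rated model moves both scores
# more than a win by the favorite would.
new_a, new_b = elo_update(1019.0, 1073.0, a_won=True)
```

The upshot of this update rule is that upsets move ratings sharply while expected wins barely register, which is why arena-style rankings need large vote volumes before they stabilize.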
This battle design solves three problems plaguing current audio benchmarks.
First, each prompt comes from real human speech with accents, background noise, incomplete sentences, and conversational filler, not synthesized audio from text.
Second, the platform covers more than 60 languages across 6 continents, with more than a third of matches taking place in non-English languages, including Spanish, Arabic, Japanese, Portuguese, Hindi and French.
Third, because battles take place in users’ actual daily conversations, 81% of prompts are conversational or open-ended—questions with no single correct answer. This rules out automated scoring and makes human preference the only reliable signal.
Voice Showdown currently operates two evaluation modes: Dictate (users speak, the model responds in text) and Speech-to-Speech, or S2S (users speak, the model responds in voice). A third mode, Full Duplex, which captures real-time, interruptible conversation, is under development.
Incentivized voting
One design detail sets Voice Showdown apart from its closest text-based analogue, LM Arena's Chatbot Arena. Critics have noted that LM Arena users sometimes cast throwaway votes with little stake in the result. Voice Showdown addresses this directly: once a user votes for their preferred model, the app switches them to that model for the rest of the conversation. If you voted for GPT-4o Audio over Gemini, you're now talking to GPT-4o Audio. Tying the outcome to the user's own experience discourages random or dishonest voting.
The system also controls for confounds that could distort comparisons: both model responses start playing at the same time (eliminating speed bias), voice gender is matched across the two options (eliminating gender-preference bias), and no model is identified by name during voting.
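ChatLab's internals aren't public, so the following is a minimal sketch under stated assumptions, not Scale's implementation: it wires together the mechanics described above (rare, randomly triggered battles; an anonymous, gender-matched challenger; simultaneous playback; and the post-vote model switch). The model interface with respond() and voice_gender is hypothetical.

```python
import random

class BattleSession:
    """Illustrative sketch of a Voice Showdown battle flow; not Scale's API."""

    BATTLE_RATE = 0.05  # battles trigger on fewer than 5% of voice prompts

    def __init__(self, active_model, candidate_pool):
        self.active_model = active_model
        self.candidate_pool = candidate_pool

    def handle_prompt(self, prompt, get_user_vote):
        if random.random() >= self.BATTLE_RATE:
            return self.active_model.respond(prompt)

        # Anonymous challenger with a gender-matched voice (controls for
        # gender-preference bias; neither model is named to the user).
        rivals = [m for m in self.candidate_pool
                  if m is not self.active_model
                  and m.voice_gender == self.active_model.voice_gender]
        if not rivals:
            return self.active_model.respond(prompt)
        challenger = random.choice(rivals)

        # Generate both responses before playback so both start at the
        # same moment, eliminating latency as a confound.
        response_a = self.active_model.respond(prompt)
        response_b = challenger.respond(prompt)

        winner = get_user_vote(response_a, response_b)  # returns "a" or "b"
        # Incentivized voting: the conversation continues with whichever
        # model the user actually preferred.
        if winner == "b":
            self.active_model = challenger
        return response_b if winner == "b" else response_a
```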
The new voice AI leaderboard every enterprise decision-maker should be watching
Voice Showdown debuts on March 18, 2026, with 11 frontier models evaluated across 52 model-voice pairs. Not all models support both modes: the Dictate leaderboard includes 8 models, and the S2S leaderboard includes 6.
Dictate Leaderboard (Speech In, Text Out)
In this mode, users submit a spoken request and rate two side-by-side text responses. Here are the baseline Elo scores:
- Gemini 3 Pro (1073)
- Gemini 3 Flash (1068)
- GPT-4o Audio (1019)
- Qwen 3 Omni (1000)
- Voxtral Small (925)
- Gemma 3n (918)
- GPT Realtime (875)
- Phi-4 Multimodal (729)
Note: Gemini 3 Pro and Gemini 3 Flash are statistically tied for the top spot.
Speech-to-Speech (S2S) Leaderboard
In this mode, users talk to the model and rate two competing voice responses. Here are the baseline Elo scores:
- Gemini 2.5 Flash Audio (1060)
- GPT-4o Audio (1059)
- Grok Voice (1024)
- Qwen 3 Omni (1000)
- GPT Realtime (962)
- GPT Realtime 1.5 (920)
Note: Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied for the top spot in early evaluations.
The Dictate rankings are led by Google's Gemini 3 Pro and Gemini 3 Flash, statistically tied for No. 1 with Elo scores of 1,043-1,044 after style control.
GPT-4o Audio takes a clear third place. Lightweight models, including Gemma 3n, Voxtral Small, and Phi-4 Multimodal, trail behind.
Speech-to-Speech (S2S) ratings show tighter competition at the top, with Gemini 2.5 Flash Audio and GPT-4o Audio statistically tied for No. 1 in the baseline.
After adjusting for response length and format, factors that can inflate perceived quality, GPT-4o Audio comes out ahead (1,102 Elo vs. 1,075 for Gemini 2.5 Flash Audio).
Grok Voice jumps to a close second at 1,093 under style control, suggesting its baseline No. 3 ranking understates its actual performance.
Qwen 3 Omni, an open-weight model from Alibaba's Qwen team, fares better than its name recognition would suggest: it ranks fourth in both modes, ahead of several higher-profile names.
"When people come in, they go for the big names," Gu noted. "But on preference, lesser-known models like Qwen are actually ahead."
Surprises revealed by real-world preference data
Ratings aside, Voice Showdown's real value lies in diagnosing failures, and those diagnoses paint a more complicated picture of voice AI than most leaderboards reveal.
The multilingual gap is worse than you think
Language strength is the sharpest differentiator between models. In Dictate, the Gemini 3 models lead in every language tested.
The winner in S2S depends very much on which language is spoken: GPT-4o Audio leads in Arabic and Turkish; Gemini 2.5 Flash Audio is strongest in French; Grok Voice is competitive in Japanese and Portuguese.
A more alarming finding is how often some models fail to respond in the user's language at all.
GPT Realtime 1.5, OpenAI's newer real-time voice model, responds to non-English queries in English about 20% of the time, even in high-resource, officially supported languages like Hindi, Spanish, and Turkish.
Its predecessor, GPT Realtime, does so about half as often (~10%). Gemini 2.5 Flash Audio and GPT-4o Audio sit at roughly 7%.
The failure cuts more than one way: some models translate a previously non-English conversation into English, while others simply mishear a prompt and generate an unrelated response in the wrong language entirely.
User comments from the platform reflect the frustration: "I told it I had an interview with Quest Management today, and instead of answering it told me about 'Risk Management'."
"GPT Realtime 1.5 thought I was speaking gibberish and recommended mental health help, while Qwen 3 Omni correctly identified that I was speaking a local Nigerian language."
The reason current benchmarks miss this is that they rely on synthetic speech recorded in clean acoustic conditions and are rarely multilingual. Real speakers in real environments, with background noise, short phrases, and regional accents, break speech understanding in ways laboratory conditions never surface.
Voice selection is about more than aesthetics
Voice Showdown evaluates not only at the model level but also at the individual voice level, and the differences within a single model's voice catalog are striking.
For one unnamed model in the study, the best-performing voice won 30 percent more often than the worst-performing voice from the same base model. Both voices share the same underlying reasoning and generation pipeline; the difference lies entirely in the audio presentation.
Top-performing voices tend to win or lose on audio intelligibility and content completeness, that is, whether the model hears you correctly and responds fully. But speech quality remains a deciding factor at the voice level, especially when models are otherwise comparable. "Voice directly shapes how users evaluate interactions," Gu said.
Models degrade over long conversations
Most benchmarks test a single turn. Voice Showdown tests how models hold up across extended conversations, and the results are not encouraging.
At turn one, content quality accounts for 23% of model failures. By turn 11 and beyond, it becomes the primary failure mode at 43%. Most models see their win rates decline as conversations drag on, struggling to maintain coherence across many exchanges.
The GPT Realtime variants are an exception, improving marginally in later turns, consistent with their strength on longer contexts and their documented weakness on the short, noisy phrases that dominate early interactions.
Query length shows a complementary pattern: short prompts (under 10 seconds) are dominated by audio intelligibility failures (38%), while long prompts (over 40 seconds) shift the dominant failure to content quality (31%). Shorter audio gives models less acoustic context to work with; longer queries are easier to understand but harder to answer well.
Why do some voice AI models lose?
After each S2S comparison, users indicate why they preferred one response over the other along three axes: audio intelligibility, content quality, and speech output. Failure signatures differ markedly by model.
Qwen 3 Omni's losses center on speech generation: its reasoning is competitive, but users dislike how it sounds. GPT Realtime 1.5's losses are dominated by audio comprehension failures (51%), consistent with its language-switching behavior on difficult prompts. Grok Voice's losses are spread more evenly across all three axes, showing no dominant weakness but no particular strength either.
What’s next
The current leaderboards cover turn-based interaction: you talk, the model responds, repeat. But real voice conversations don't work like that. People interrupt, change direction mid-sentence, and talk over each other.
Full Duplex evaluation, designed to capture these real-time dynamics through human preferences rather than scripted scenarios or automated metrics, is coming in the next iteration of Showdown, Scale says. No existing benchmark captures full-duplex interaction through organic human preference data.
The leaderboard is live at scale.com/showdown. A public waitlist to join ChatLab and vote on comparisons opens today, with users getting free access to frontier voice models including GPT-4o, Gemini, and Grok in exchange for occasional preference votes.




