
Test-time scaling (TTS) has emerged as a proven method for improving the performance of large language models in real-world applications by giving them additional compute during inference. However, TTS strategies have historically been hand-crafted, relying heavily on human intuition to dictate the rules that govern the model’s reasoning.
To overcome this bottleneck, researchers from Meta, Google and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to optimize compute allocation dynamically without manually tuning heuristics.
By implementing the strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs associated with deploying advanced reasoning models in production environments. In experimental tests, AutoTTS managed inference budgets efficiently, reducing token consumption by up to 69.5% without sacrificing accuracy.
The manual bottleneck in test-time scaling
Test-time scaling improves LLMs by giving them additional compute when generating answers. This additional computation allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at the final answer.
A key challenge in designing TTS strategies is determining how to allocate this additional computation optimally. Historically, researchers have developed these strategies by hand, relying on assumptions to build heuristics. Engineers must guess at the rules and thresholds that determine when a model should branch out into new lines of reasoning, explore an existing path more deeply, prune unpromising branches, or stop reasoning altogether.
Because this manual tuning process is limited by human intuition, many possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computational cost.
Current TTS algorithms can be mapped onto a width-depth control space: "width" is the number of reasoning branches explored, and "depth" is how far each one is developed. Self-consistency (SC) samples a fixed number of trajectories and picks the answer by majority vote. Adaptive self-consistency (ASC) saves compute by stopping early once a confidence threshold is reached. Parallel-Probe takes a more granular approach, pruning unpromising branches and deepening the rest. All three are hand-designed, and that is the limitation AutoTTS is designed to break.
Although some more advanced methods use richer structures, such as tree search or external verifiers, they all share one key feature: they are carefully handcrafted. This manual approach limits the scope of strategy discovery, leaving a large portion of the possible resource allocation space unexplored.
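To make the width-only baselines concrete, here is a minimal Python sketch of self-consistency and an early-stopping variant. It is an illustration under stated assumptions, not the published algorithms: `sample` is a hypothetical callable that returns one final answer per call, and the vote-share threshold is a simplified stand-in for the statistical stopping rule that adaptive self-consistency uses in practice.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and its share of the votes."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Fixed width: always draw n trajectories, then take the majority answer."""
    return majority_vote([sample() for _ in range(n)])[0]


def adaptive_self_consistency(
    sample: Callable[[], str],
    max_n: int,
    threshold: float = 0.8,  # assumed vote-share threshold (hypothetical)
    min_n: int = 3,          # assumed minimum samples before stopping
) -> Tuple[str, int]:
    """Draw trajectories one at a time and stop once the leader is dominant."""
    answers: List[str] = []
    for _ in range(max_n):
        answers.append(sample())
        if len(answers) >= min_n and majority_vote(answers)[1] >= threshold:
            break
    return majority_vote(answers)[0], len(answers)
```

The second function returns the number of samples it actually drew, which is where the compute savings over a fixed-width vote come from.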
Automating strategy discovery with AutoTTS
AutoTTS reframes how test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem in a controlled environment.
This framework redefines the roles of both the human engineer and the AI model. Instead of manually writing specific rules for when the LLM should branch, prune, or stop reasoning, the engineer’s role shifts to building the discovery environment. Humans define the boundaries: the control space of states and actions, the optimization objectives that balance accuracy against cost, and the specific feedback mechanisms.
An explorer LLM, such as Claude Code, formulates the strategy. This explorer acts as an autonomous agent that proposes TTS “controllers”. These controllers are policies or algorithms defined in code that dictate how the AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers the optimal resource allocation policy.
To make this automated search computationally affordable, AutoTTS relies on an “offline replay environment”. If the explorer LLM had to query the underlying reasoning model to generate new tokens every time it tested a new strategy, the computational cost would be astronomical. Instead, it relies on thousands of reasoning trajectories pre-collected from the base LLM. These trajectories include "probe signals": intermediate answers that help the controller assess progress along different lines of reasoning.
During the discovery cycle, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it allocated computation over time. By analyzing these traces, the agent can diagnose specific failure modes, such as noting that the controller pruned branches too aggressively in a specific scenario. This provides an advantage over looking only at the end result. The agent then rewrites its code to improve the accuracy-to-cost ratio.
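The offline replay idea can be sketched in a few lines of Python. Everything below is hypothetical scaffolding rather than the AutoTTS codebase: the `Trajectory` and `Problem` types, the constant `tokens_per_step`, and the stop-only controller interface are assumptions chosen to keep the example small. The point is that a candidate controller is scored against stored probe answers, so no new tokens are generated during the search.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    probes: List[str]      # intermediate answer recorded at each checkpoint
    tokens_per_step: int   # tokens spent between checkpoints (assumed constant)


@dataclass
class Problem:
    trajectories: List[Trajectory]
    gold: str              # reference answer


# Hypothetical interface: current probe answers in, "stop now?" out.
Controller = Callable[[List[str]], bool]


def replay(problem: Problem, controller: Controller) -> Tuple[bool, int]:
    """Step through the stored trajectories in lockstep until the controller stops."""
    depth = max(len(t.probes) for t in problem.trajectories)
    tokens, current = 0, []
    for step in range(depth):
        # A trajectory that has already finished keeps reporting its last probe.
        current = [t.probes[min(step, len(t.probes) - 1)] for t in problem.trajectories]
        tokens += sum(t.tokens_per_step for t in problem.trajectories if step < len(t.probes))
        if controller(current):
            break
    answer = Counter(current).most_common(1)[0][0]
    return answer == problem.gold, tokens


def evaluate(problems: List[Problem], controller: Controller) -> Tuple[float, float]:
    """Return (accuracy, average tokens) for a controller over the replay set."""
    results = [replay(p, controller) for p in problems]
    accuracy = sum(correct for correct, _ in results) / len(results)
    avg_tokens = sum(tokens for _, tokens in results) / len(results)
    return accuracy, avg_tokens
```

An explorer agent would call something like `evaluate` on each candidate controller and compare the resulting accuracy and token cost, alongside the per-step traces.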
Inside the AI-designed controller
Because the explorer agent is not limited by human intuition, it can discover highly coordinated, complex rules that a human engineer could never code by hand. An optimal controller discovered by AutoTTS, called the Confidence Momentum Controller (CMC), uses several non-obvious mechanisms to manage compute:
- Trend-based stopping: Manual strategies often instruct the model to stop reasoning once a certain confidence threshold is reached. The AutoTTS agent discovered that instantaneous confidence can be misleading because of temporary spikes. Instead, the controller tracks the exponential moving average (EMA) of confidence and stops only when the overall confidence level is high and the trend is not actively declining.
- Coupled width-depth control: Manually developed algorithms usually treat "widening" into new lines of reasoning and "deepening" the current paths as separate decisions. AutoTTS discovered a closed feedback loop in which the two actions are linked. If the confidence of the current branches stalls or decreases, the controller automatically spawns new branches.
- Alignment-based depth allocation: Instead of giving an equal compute budget to all active reasoning branches, the controller dynamically determines which branches agree with the current leading answer. It then gives those branches priority, allocating them extra bursts of computation. This concentrates the compute budget on the emerging consensus to quickly check whether it is correct.
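The three mechanisms above can be illustrated with a short Python sketch. This is not the released Confidence Momentum Controller: the class and function names, the smoothing factor `alpha`, the `stop_level` threshold, and the use of vote share as the confidence signal are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class MomentumState:
    alpha: float = 0.5       # EMA smoothing factor (assumed value)
    stop_level: float = 0.8  # EMA level required to stop (assumed value)
    ema: float = 0.0
    trend: float = 0.0
    started: bool = False

    def update(self, confidence: float) -> None:
        """Fold a new confidence reading into the EMA and record its change."""
        if not self.started:
            self.ema, self.trend, self.started = confidence, 0.0, True
            return
        new_ema = self.alpha * confidence + (1 - self.alpha) * self.ema
        self.trend = new_ema - self.ema
        self.ema = new_ema


def decide(state: MomentumState, answers: List[str]) -> str:
    """Return 'stop', 'widen', or 'deepen' from the branches' current answers."""
    _, votes = Counter(answers).most_common(1)[0]
    state.update(votes / len(answers))  # vote share as a confidence proxy
    if state.ema >= state.stop_level and state.trend >= 0:
        return "stop"    # high smoothed confidence, not declining
    if state.trend <= 0:
        return "widen"   # confidence stalled or fell: spawn new branches
    return "deepen"      # confidence is rising: keep extending branches


def branches_to_deepen(answers: List[str]) -> List[int]:
    """Indices of branches that agree with the current leading answer."""
    leader = Counter(answers).most_common(1)[0][0]
    return [i for i, a in enumerate(answers) if a == leader]
```

Because the stop rule reads the smoothed value rather than the latest reading, a single step where every branch suddenly agrees is not enough to end the run if the earlier readings were low.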
Cost savings and accuracy gains in real-world benchmarks
To test whether AI can independently discover a better test-time scaling strategy, the researchers set up a rigorous evaluation framework. The main experiments were performed on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system’s ability to generalize on a distilled 8B version of the DeepSeek-R1 model.
The explorer agent was first tasked with discovering the optimal strategy using the AIME24 mathematical reasoning benchmark. The discovered strategy was then tested on two other mathematics benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond.
The controller discovered by AutoTTS was pitted against four manually developed test-time scaling algorithms used in the industry. These baselines include self-consistency with 64 parallel reasoning paths (SC@64), adaptive self-consistency (ASC), Parallel-Probe, and early-stopping self-consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when the answer appears stable.
When put into a balanced, cost-aware mode, the controller discovered by AutoTTS reduced total token consumption by about 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across the four Qwen models. When the inference budget was raised, AutoTTS outperformed all hand-crafted baselines in five of the eight test cases, achieving the highest accuracy.
This efficiency carried over to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant reduced the inference token cost from 510K tokens to just 151K tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark, while cutting token costs by nearly half.
For practitioners building enterprise AI applications, these results highlight two key operational benefits:
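As a quick check on the GPQA-Diamond figures, the relative saving implied by the two reported token counts can be computed directly. The helper name below is illustrative.

```python
def token_reduction(baseline_tokens: float, new_tokens: float) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1.0 - new_tokens / baseline_tokens


# Token counts reported for GPQA-Diamond: 510K before, 151K after.
gpqa_saving = token_reduction(510_000, 151_000)
print(f"{gpqa_saving:.1%}")  # prints 70.4%
```

That works out to roughly a 70% reduction, in line with the 69.5% figure reported against SC@64 on the Qwen models.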
- Higher peak performance: AutoTTS does more than reduce token consumption. It actively raises the achievable peak performance of the base model. The AI-engineered controller is extremely good at quickly detecting noisy or unproductive reasoning branches and redirecting the compute budget to the branches that consistently produce the most useful reasoning signals.
- Cost-effective custom development: Because the framework is built on an offline replay environment, the entire discovery process cost only $39.90 and took 160 minutes. For enterprise teams, this means that optimized reasoning strategies tailored to specific models and internal tasks are now within reach without a dedicated research budget.
The AutoTTS framework and the Confidence Momentum Controller (CMC) are also available on GitHub; CMC can be used as a drop-in replacement for other TTS controllers.





