
Building a foundation LLM from scratch costs millions and requires internet-scale data, so most businesses don’t bother. Sapient thinks there is a cheaper way.
To overcome this brute-force scaling dogma, Sapient’s researchers developed HRM-Text, an architecture that replaces standard Transformers with a highly sample-efficient Hierarchical Reasoning Model (HRM), first introduced last year.
HRM splits computation into a slow-moving strategic level and a fast-moving execution level. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This mirrors real enterprise settings, where users typically expect a targeted response to a specific task.
The researchers were able to train the 1B-parameter HRM-Text from scratch at a fraction of the cost and data of standard LLMs. The model performed competitively with larger open models on key industry benchmarks.
For real-world AI applications, this means that foundation-model training is no longer limited to well-resourced institutions. With HRM-Text, organizations can pre-train their own high-performance reasoning models from scratch and integrate them with external knowledge stores.
Training bottleneck
When we train an LLM, we don’t care if it remembers the exact order of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, grounded understanding of human language, logic, facts, and reasoning.
The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model develops a working internal model of the world.
In effect, we spend millions of dollars of compute forcing models to memorize everything scraped from the web so that they learn to reason only as an indirect side effect. For example, standard decoder-only models spend valuable compute calculating the loss for reconstructing the query itself, even though the user’s query is already known and provided at inference time.
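The wasted computation described above can be illustrated with loss masking, a common technique in which only response tokens contribute to the training loss. The following is a minimal sketch, not Sapient's training code: the function name `masked_cross_entropy`, the token probabilities, and the 4-token instruction / 2-token response split are all hypothetical values chosen for illustration.

```python
import math

def masked_cross_entropy(token_log_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1."""
    kept = [-lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)

# Hypothetical per-token log-probabilities for a 6-token sequence:
# the first 4 tokens are the instruction, the last 2 are the response.
log_probs = [
    math.log(0.10), math.log(0.20), math.log(0.05), math.log(0.10),  # instruction
    math.log(0.50), math.log(0.25),                                  # response
]

full_mask = [1, 1, 1, 1, 1, 1]      # standard objective: every token is scored
response_mask = [0, 0, 0, 0, 1, 1]  # masked objective: only the response is scored

full_loss = masked_cross_entropy(log_probs, full_mask)
response_loss = masked_cross_entropy(log_probs, response_mask)

print(f"loss over all tokens:      {full_loss:.4f}")
print(f"loss over response tokens: {response_loss:.4f}")
```

With the full mask, four of the six scored positions are instruction tokens the user will always supply at inference time. The response-only mask directs the training signal to the part of the sequence the model actually has to produce.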
Instead of seeing this as just a computational hurdle, the industry should see it as a serious business constraint. In comments to VentureBeat, Sapient Intelligence CEO Guan Wang framed this as an issue of "economy of iteration."
"Enterprises today face three complex challenges: training is expensive, infrastructure is heavy, and testing cycles are too slow," Wang said. "The industry’s scale dependency says: ‘When the model fails, scale it up. Add more data. Add more GPUs.’ It’s worked, but it’s reaching a point of diminishing returns. More scale often means more storage, more latency, more infrastructure, and more vendor dependencies. It doesn’t necessarily give the enterprise a better thinking engine."
These architectural and computational inefficiencies are precisely why fine-tuning existing Transformers is not always a silver bullet for businesses. Fine-tuning often requires mixing in general-purpose data to preserve the model’s overall capabilities, which makes the process computationally heavy and difficult to manage.
"Imagine a hedge fund, insurer or bank with highly proprietary data: internal research notes, transaction logic, compliance rules, analyst notes, risk models, portfolio constraints," Wang said. "They may not want to send this data to an external frontier model, and they may not need a giant general-purpose model that remembers the internet. What they need is a compact reasoning core that can learn task structure, reason across rules and numbers, and work in a controlled environment."
Because HRM-Text focuses its compute strictly on task performance and implicit reasoning, it allows businesses to start with a smaller, smarter model and adapt it to their proprietary domain with less infrastructure.
Rethinking architecture with HRM-Text
Introduced in 2025, HRM represents a major departure from traditional Transformer models. To build a more sample-efficient engine, HRM separates computation into a slow-moving strategic level and a fast-moving execution level. The fast L-module performs local iterative refinement, while the slow H-module maintains a stable semantic context between cycles. Processing consists of two high-level cycles, each of which performs three fast L-module updates followed by a single slow H-module update.
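The update schedule described above (two high-level cycles, each with three L-module updates and one H-module update) can be sketched as a nested loop. This is only an illustration of the control flow: the functions `l_update` and `h_update` and their fixed coefficients are hypothetical stand-ins for learned neural modules, and nothing here reflects Sapient's actual implementation.

```python
import math

def l_update(z_l, z_h, x):
    """Fast module (toy stand-in): refine the low-level state from the input and slow state."""
    return [math.tanh(0.5 * a + 0.3 * b + 0.2 * c) for a, b, c in zip(z_l, z_h, x)]

def h_update(z_h, z_l):
    """Slow module (toy stand-in): fold the refined low-level state into the slow context."""
    return [math.tanh(0.7 * b + 0.3 * a) for a, b in zip(z_l, z_h)]

def hrm_forward(x, n_cycles=2, l_steps=3):
    """Run n_cycles high-level cycles; each does l_steps fast updates, then one slow update."""
    z_l = [0.0] * len(x)
    z_h = [0.0] * len(x)
    counts = {"L": 0, "H": 0}
    for _ in range(n_cycles):
        for _ in range(l_steps):
            z_l = l_update(z_l, z_h, x)  # z_h stays fixed during the fast steps
            counts["L"] += 1
        z_h = h_update(z_h, z_l)         # slow state changes once per cycle
        counts["H"] += 1
    return z_h, counts

state, counts = hrm_forward([1.0, -0.5, 0.25])
print(counts)
```

Note that the slow state is held constant while the fast module iterates, which is one way to read the article's description of the H-module as a stable context between cycles.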
Recursive architectures with a standard shared-weight setup (e.g., Samsung’s TRM) can sometimes handle small logic puzzles, but Sapient researchers found them to be quite unstable when scaled up to 1 billion parameters for language tasks. The separation between HRM’s slow H-module and fast L-module is not merely an aesthetic choice, but a mathematical necessity. As Wang says: "For logic puzzles, you can sometimes get away with a small recursive mechanism because the world is clean and bounded. Language is not like that. Language needs both fast local refinement and slow semantic stability."
Although the original HRM proved highly effective for controlled, symbolic reasoning problems, researchers hit a wall when applying it to the massive, open-ended complexity of general language modeling. While HRM’s loops make it an incredibly efficient reasoner, those same loops make it mathematically difficult to train on the messy diversity of human language. Running recurrent loops over language creates severe numerical instabilities, especially exploding or vanishing gradients.
To tame this feedback loop in the neural network, the researchers introduced two major architectural innovations in HRM-Text. First, they developed MagicNorm, a normalization technique specifically designed to keep internal signals stable regardless of how many times the model loops through its reasoning process.
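The article does not describe how MagicNorm works, so the sketch below does not attempt to reproduce it. It only illustrates the general problem using plain RMS normalization as a swapped-in, generic technique: a signal passed repeatedly through the same scaling either blows up or shrinks toward zero, while renormalizing on every pass keeps its magnitude fixed. The gains, step count, and starting vector are hypothetical.

```python
import math

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def rms_normalize(v, eps=1e-8):
    """Rescale a vector to (approximately) unit RMS. Generic technique, not MagicNorm."""
    scale = rms(v) + eps
    return [x / scale for x in v]

def run_loop(v, gain, steps, normalize):
    """Apply the same scaling `steps` times, optionally renormalizing after each pass."""
    for _ in range(steps):
        v = [gain * x for x in v]
        if normalize:
            v = rms_normalize(v)
    return rms(v)

start = [1.0, -2.0, 0.5, 1.5]
exploding = run_loop(start, gain=1.5, steps=20, normalize=False)
vanishing = run_loop(start, gain=0.5, steps=20, normalize=False)
stable_hi = run_loop(start, gain=1.5, steps=20, normalize=True)
stable_lo = run_loop(start, gain=0.5, steps=20, normalize=True)

print(f"gain 1.5, no norm:   {exploding:.3e}")
print(f"gain 0.5, no norm:   {vanishing:.3e}")
print(f"gain 1.5, with norm: {stable_hi:.3f}")
print(f"gain 0.5, with norm: {stable_lo:.3f}")
```

This toy tracks forward signal magnitude only. Gradients in a real recurrent network compound through repeated loops in a similar multiplicative way, which is why loop depth and normalization interact so strongly.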
Second, they developed a warm-up method to stabilize training. During early training, the model runs only short, shallow reasoning loops. As training progresses, the system warms up, gradually giving the model deeper and longer reasoning sequences.
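One simple way to picture such a warm-up is a schedule that maps the training step to a loop depth. The sketch below assumes a linear ramp; the function name `loop_depth`, the 1,000-step warm-up window, and the depth range of 1 to 4 are hypothetical, since the article does not give the actual schedule.

```python
def loop_depth(step, warmup_steps, min_cycles=1, max_cycles=4):
    """Ramp the number of reasoning cycles from min_cycles to max_cycles during warm-up.

    Hypothetical linear schedule: the warm-up window is split into equal
    segments, one per depth level; after warm-up the depth stays at max_cycles.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_cycles
    levels = max_cycles - min_cycles + 1
    # Integer arithmetic avoids floating-point edge cases at segment boundaries.
    return min_cycles + (max(step, 0) * levels) // warmup_steps

schedule = [loop_depth(s, warmup_steps=1000) for s in (0, 250, 500, 750, 1000)]
print(schedule)
```

A training loop would call `loop_depth(step, ...)` at each step and pass the result as the number of cycles to run, so early updates see shallow recursion and later updates see the full depth.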
They also shifted the training objective from next-token prediction to task completion, where the model is rewarded only for the complete response rather than for the individual tokens it generates. To support this objective, they changed HRM-Text’s training data from raw text to instruction-response pairs only.
HRM-Text in action
The researchers built a highly compact 1-billion-parameter HRM-Text model. Instead of using a standard multi-stage pipeline that requires crunching trillions of words of raw internet text, they trained it from scratch on a carefully curated dataset of just 40 billion tokens. The training data consisted entirely of instruction-response pairs covering general instructions, mathematics, symbolic logic, textbook exercises, and rewritten knowledge.
They trained the model using the task-completion objective. They explicitly removed "thinking" tokens from the training data to force the model to rely on its internal hierarchical architecture rather than spelling out its reasoning step by step.
The model was evaluated on a variety of standard AI benchmarks that broadly cover knowledge, reasoning, logic, mathematics, and comprehension. The researchers tested HRM-Text against both small models and well-resourced open-weight and fully open models.
The results show a significant shift in the compute-performance frontier. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. This performance is highly competitive with (and in several cases exceeds) the 2B to 7B parameter foundation models tested.
For enterprise audiences, the most important takeaways are the efficiency statistics and their practical implications. Building a foundation model from scratch is a multimillion-dollar undertaking typically reserved for tech giants. HRM-Text was trained on a cluster of 16 GPUs in just 1.9 days. The total estimated cost was about $1,500. Compared to models like Qwen, Gemma, and Llama, HRM-Text achieved its competitive scores using 100–900 times fewer training tokens and 96–432 times less compute.
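The figures above can be cross-checked with some back-of-the-envelope arithmetic. The sketch below derives the GPU-hours and the hourly rate implied by the reported numbers, plus the token budgets implied by the 100–900x range. It assumes all 16 GPUs ran for the full 1.9 days and that the $1,500 estimate covers compute only; neither assumption is stated in the article.

```python
# Figures reported in the article.
gpus = 16
days = 1.9
total_cost_usd = 1500
train_tokens = 40e9

# Derived quantities (assumes every GPU was busy for the whole run).
gpu_hours = gpus * days * 24
implied_rate_usd = total_cost_usd / gpu_hours

# Token budgets implied for the comparison models by the 100x-900x range.
baseline_tokens_low = train_tokens * 100
baseline_tokens_high = train_tokens * 900

print(f"GPU-hours:             {gpu_hours:.1f}")
print(f"Implied $ / GPU-hour:  {implied_rate_usd:.2f}")
print(f"Implied baseline data: {baseline_tokens_low:.1e} to {baseline_tokens_high:.1e} tokens")
```

Under these assumptions the run comes to roughly 730 GPU-hours at about $2 per GPU-hour, and the comparison models would have been trained on roughly 4 trillion to 36 trillion tokens.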
Another important point is the separation of reasoning from memorization of knowledge. Despite HRM-Text’s small 40B-token training diet, its success on reasoning-intensive tasks shows that a model does not need to memorize the entire web to be an intelligent reasoning engine.
For enterprise applications, this behavior is a feature, not a bug. The researchers propose a future in which businesses deploy highly compact, incredibly cheap recurrent models as a "basis of reasoning" specialized for business logic. Instead of forcing the model to memorize company databases during initial training, the model acts as a reasoning engine, relying on external retrieval systems to supply factual knowledge.
Critics have noted that comparing a model trained on instruction-response pairs with models trained on raw text is an "apples and oranges" scenario. Wang pushes back against this framing, pointing out that every serious modern LLM sees instruction-response data during training or adaptation. "So the comparison is not apples-to-oranges. It is closer to apple cores and apples. We started directly from the basic task format, because this is how people actually use models: they give instructions and expect a useful response," he said.
The researchers also ran rigorous contamination tests to make sure the model wasn’t just memorizing benchmark answers. On DROP, a benchmark that showed a marginal contamination signal under one setting, HRM-Text still scored an impressive 81.1% on the strictly clean, 0% contamination subset.
Finally, Wang argues that for businesses, "the right evaluation is not trivia recall. It is workflow evaluation… Give HRM-Text a task such as multi-step financial reasoning, compliance logic, scientific workflow automation, or structured extraction followed by reasoning."
The practical application and future of enterprise AI
While the efficiency and affordability draw attention, Sapient is clear about the model’s current limits. The initial release is treated as a proof of concept, similar to early GPT releases, designed to demonstrate the architecture’s unique advantages.
"To be honest, HRM-Text is not yet a plug-and-play ChatGPT replacement," Wang said. "This is a compact foundational language reasoning model. For the enterprise engineering team, operational work revolves around templates, mode selection, attention masking, and alignment."
For AI engineering teams looking to get started, a specific but standard text-generation discipline is required. The model lists native support in the Transformers library (requires transformers >= 5.9.0), and serving paths for vLLM and SGLang are in active development. A major engineering challenge involves managing the PrefixLM design: putting multi-turn chat applications into production will require careful KV-cache logic to ensure that user prompts receive full bidirectional attention while the assistant’s outputs remain causal.
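The PrefixLM attention pattern mentioned above can be made concrete with a small mask builder. This is a generic single-turn sketch under stated assumptions, not HRM-Text’s implementation: the function name `prefix_lm_mask` is hypothetical, and the multi-turn and KV-cache handling the article flags as the hard part is deliberately left out.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Return mask[i][j] = True when position i may attend to position j.

    Positions [0, prefix_len) are the user prompt; the rest are the response.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            if j < prefix_len:
                # Every position sees the whole prompt, so attention
                # inside the prompt is fully bidirectional.
                row.append(True)
            else:
                # Response positions are causal; prompt positions
                # never look ahead into the response.
                row.append(i >= prefix_len and j <= i)
        mask.append(row)
    return mask

mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the top-left 3x3 block is fully filled (bidirectional prompt), while the bottom-right block is lower-triangular (causal response). In a multi-turn conversation each new user message would need its own bidirectional block, which is where the cache management becomes delicate.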
"When the cost of developing a capable reasoning model drops to about $1,500, AI ceases to be just an infrastructure question and becomes a strategy question," Wang said. "A Fortune 500 company has already asked “Can we afford a foundation model?” should not ask the question. He asked, “What should our model know about our business and what rationale should it be optimized for?”"





