
Most verticals are not clean, well-oiled SaaS databases; the reality is messy documentation, proprietary schemas, hidden workflows, and long-running tasks that most general-purpose models struggle with.
This prompted construction project management company Trunk Tools to build a three-layered architecture—perception, semantics, agents—based on highly detailed data to support high-precision, highly connected industrial automation.
Trunk says their purpose-built stack has reduced review cycles from months to days, preventing costly field errors and enabling autonomous agents to think through millions of pages of documentation.
“We really set out to take data from fragmented systems, preprocess it, structure it, go from our ontology to a knowledge graph, and then train AI models,” said Sarah Buchner, founder and CEO of Trunk and a former carpenter.
For builders in other verticals, Trunk’s approach can serve as a blueprint for turning data chaos into agent-ready, industry-specific workflows.
Where general-purpose LLMs break down on industry data
Foundation LLMs, while powerful, are always optimized for breadth, not depth.
“General-purpose LLMs are trained to be good at everything, so they’re bad at anything,” said Kriti Faujdar, senior product manager for AI infrastructure, agent AI, security and LLM platforms. They stumble on rare terms, domain-specific judgment calls and the unspoken context that any practitioner “just knows.”
Web, app and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on domain data, which is “jargon-dense, abbreviation-heavy and format-specific.”
“A GPT-4 class model may understand the French legal contract, but it will violate the specific article references that practitioners need to refer to,” he said.
In addition, the most valuable enterprise data never makes it into pre-training, Faujdar noted. It sits in embedded systems and proprietary formats. “RAG [retrieval-augmented generation] helps a bit,” he said. “But it still gives better facts to a model that can’t think properly in the domain.”
Pre-training on domain data is essential; businesses should then fine-tune and ground their evaluations in high-quality examples from practitioners. “A few thousand samples from real practitioners are better than millions of scratchy, noisy samples,” Faujdar said.
Mixture-of-experts (MoE) architectures can provide specialization without a matching rise in inference cost. Pairing RAG with fine-tuning also works well: RAG handles the factual long tail, while fine-tuning fixes vocabulary and reasoning.
De Bollivier pointed out the advantage of hybrid stacks: a general-purpose model for judgment and orchestration, a smaller fine-tuned model for domain-specific extraction (or intensive search over a selected corpus). He recommended: “Fine-tune the model to make it ‘smarter’ about the domain, fine-tune it to make it more reliable in the specific output format required by the workflow.”
Commerce and construction are among the industries seeing traction with these techniques, as are law and healthcare, De Bollivier said. These verticals have “standard document formats that equate to high risks for errors and clear domain training ROI.”
Faujdar offered a candid warning: Specialized models often degrade outside their domain, so they are of little use beyond their area of expertise unless they are retrained.
Perception, semantics, agents: Inside Trunk’s three-layer stack
In highly specialized domains like construction, simply feeding data into large language models (LLMs) doesn’t cut it, said Amrish Kapoor, CTO of Trunk. This is because most transformers are probabilistic models: Given a picture, they say it is “probably” a tree, or “probably” a child playing by a tree.
That is not sufficient for highly accurate symbolic interpretation. For example, a symbol just 2 millimeters wide in a construction document can have a very different meaning depending on where it is placed.
Furthermore, probabilistic models bound by context limits struggle with long-term project memory. “I don’t mean the context window of a few signs,” Kapoor said. “I’m talking long-term memory over months and years, because that’s how long some of these projects are.”
Instead, Trunk’s three-tiered system divides the workflow into:
- Perception (reading and extracting information from mixed documents such as PDFs, images or scans).
- Semantic/graph layer (making sense of that information and understanding its relationships).
- LLMs and agents on top.
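The division of labor between the three layers can be illustrated with a toy pipeline. This is a minimal sketch, not Trunk’s implementation: the `Extracted` record, the functions `perceive`, `build_graph` and `answer`, the file names and the door tag `D-101` are all hypothetical, and the perception step assumes documents have already been parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extracted:
    """One fact pulled out of a document (hypothetical shape)."""
    doc_id: str
    kind: str  # e.g. "door" or "spec"
    ref: str   # identifier found in the document, e.g. "D-101"


def perceive(raw_docs: dict[str, list[tuple[str, str]]]) -> list[Extracted]:
    """Perception: turn raw documents into structured facts.

    A real system would run OCR or vision models; this input is pre-parsed.
    """
    return [
        Extracted(doc_id, kind, ref)
        for doc_id, items in raw_docs.items()
        for kind, ref in items
    ]


def build_graph(facts: list[Extracted]) -> dict[str, set[str]]:
    """Semantics: link each reference to every document that mentions it."""
    graph: dict[str, set[str]] = {}
    for fact in facts:
        graph.setdefault(fact.ref, set()).add(fact.doc_id)
    return graph


def answer(graph: dict[str, set[str]], ref: str) -> list[str]:
    """Agent stand-in: list the documents relevant to one reference."""
    return sorted(graph.get(ref, set()))


# Hypothetical sample: a drawing and a specification mention the same door.
raw = {"A-201.pdf": [("door", "D-101")], "spec-08.pdf": [("spec", "D-101")]}
graph = build_graph(perceive(raw))
related_docs = answer(graph, "D-101")
```

Keeping the layers as separate functions means the extraction step can be swapped out without touching the graph or the agent logic.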
Buchner says that construction drawings are usually symbolic. A door is not always labeled as a “door.” Sometimes it’s just an arc on the wall that the trained eye learns to read through years of experience.
“The perception layer is what teaches the AI to read that language,” she said. The semantic layer then gives meaning to that information; for example, joining a drawing detailing the door, the specification governing it and the trade installing it. This helps answer critical questions from project engineers: not “Is there a door here?” but “Does this door cause problems?”
This shift matters especially in construction, because the cost of a problem compounds over time. “Resolving a conflict caught during design is relatively low cost,” Buchner said, “whereas the same problem caught in the field can cost tens of thousands of dollars.”
At a high level, the system identifies the type of document and starts extracting information based on the content (image, tables, paragraph text). This data is then “transformed and expanded” on the platform, which triggers agent workflows such as knowledge graph connections and end-user workflows.
For example, an agent might review an architectural bulletin and create a visual overlay comparing the older version with the newer one (noting additions and deletions), then write narratives that describe in plain language what those changes are. This helps users understand what has changed and coordinate with trade partners on updated pricing and change orders.
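The text half of such a comparison can be approximated with Python’s standard `difflib` module. This is a sketch only, assuming each version has already been extracted to a list of note lines; the `summarize_changes` helper and the sample notes are hypothetical, and the visual overlay is out of scope.

```python
import difflib


def summarize_changes(old: list[str], new: list[str]) -> dict[str, list[str]]:
    """Return the lines added to and deleted from a document revision."""
    added: list[str] = []
    deleted: list[str] = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" means old lines were removed and new ones put in their place.
        if tag in ("replace", "delete"):
            deleted.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return {"added": added, "deleted": deleted}


# Hypothetical notes from two versions of the same bulletin.
old_notes = ["Door D-101: 36 in wide", "Beam B-4: W12x26"]
new_notes = ["Door D-101: 36 in wide", "Beam B-4: W14x30", "Panel EP-2 added"]
changes = summarize_changes(old_notes, new_notes)
```

The resulting lists could then be handed to a language model to draft the plain-language summary.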
The scale of the building information problem
Buchner said construction workflows are “designed with implicit assumptions and connections between data from countless sources.” And it is “humanly impossible” to process or make sense of that amount of unstructured data.
Buchner estimates that the average high-rise building generates about 3.6 million pages of relevant documents. “If you printed it on a stack of paper, it would be as tall as the building itself.”
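As a rough sanity check on that comparison, the height of the stack can be estimated. The sheet thickness of 100 micrometers (0.1 mm) is an assumption typical of ordinary office paper, not a figure from Trunk.

```python
PAGES = 3_600_000          # Buchner's estimate for an average high-rise
SHEET_THICKNESS_UM = 100   # assumption: about 0.1 mm per sheet of office paper

# One page per sheet (single-sided printing).
single_sided_m = PAGES * SHEET_THICKNESS_UM / 1_000_000
# Two pages per sheet (double-sided printing).
double_sided_m = single_sided_m / 2

print(f"Single-sided stack: {single_sided_m:.0f} m")
print(f"Double-sided stack: {double_sided_m:.0f} m")
```

Under that assumption the stack comes to roughly 180 to 360 meters, which is on the scale of a tall high-rise.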
All three layers of the Trunk stack (perception, semantic and LLM) are trained on “very specific data sets” of customers with “open permissions” and auto-tagging/IP, Kapoor explained. Customers who do not want Trunk training on their data can opt out.
Data is de-identified and aggregated, and Trunk also collects “tons more” labeled data through other pipelines, such as 3D building information modeling (BIM).
Trunk says it only ships agents that achieve about 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also use an LLM as a judge.
“The concept of LLM as a judge is to assess how good you are, both subjectively and objectively,” said Kapoor. Objective checks can be a simple “true” or “false,” but subjective ones require more nuance.
For example, when the output is an email, a narrative or an explanation, an LLM-as-a-judge framework can produce an aggregated score or numerical value that combines different metrics and gauges the model’s performance or risk.
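The two kinds of evaluation can be sketched in a few lines. This is an illustration under stated assumptions, not Trunk’s pipeline: the metric names, weights and function names are hypothetical, the per-metric scores stand in for values a judge model would return, and the 0.95 default only mirrors the “about 95%” figure above.

```python
def objective_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Objective check: share of predictions that exactly match ground truth."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def judged_score(metric_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Subjective check: weighted mean of per-metric judge scores in [0, 1]."""
    missing = weights.keys() - metric_scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(metric_scores[m] * w for m, w in weights.items()) / total_weight


def ready_to_ship(accuracy: float, threshold: float = 0.95) -> bool:
    """Release gate: only pass agents at or above the accuracy threshold."""
    return accuracy >= threshold


# Hypothetical evaluation run: 19 of 20 answers match the expert labels.
truth = ["ok"] * 20
preds = ["ok"] * 19 + ["wrong"]
accuracy = objective_accuracy(preds, truth)

# Hypothetical judge output for one generated summary.
summary_score = judged_score(
    {"factuality": 1.0, "clarity": 0.5},
    {"factuality": 3.0, "clarity": 1.0},
)
```

Weighting factuality more heavily than clarity is one possible design choice; the right weights depend on which errors are most costly in the workflow.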
Latency in particular can be a problem, Buchner noted: As the reasoning capacity of the underlying models increases, so does the risk of latency. Trunk maintains a set of benchmarks to objectively measure latency whenever changes are made to the underlying infrastructure, agents, and API calls.
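A latency benchmark of this kind can be as simple as timing repeated calls and comparing against a stored baseline. The sketch below uses only the Python standard library; the `benchmark` and `regressed` helpers, the 10% tolerance and the stand-in workload are assumptions for illustration.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], runs: int = 20) -> dict[str, float]:
    """Time repeated calls to fn and report median and worst-case latency."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}


def regressed(baseline: dict[str, float], candidate: dict[str, float],
              tolerance: float = 0.10) -> bool:
    """Flag a change whose median latency exceeds baseline by more than tolerance."""
    return candidate["median_s"] > baseline["median_s"] * (1 + tolerance)


# Stand-in workload; a real benchmark would call the agent or API under test.
result = benchmark(lambda: sum(range(1000)), runs=5)
```

Running the same benchmark before and after an infrastructure change gives a like-for-like comparison.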
Then, “Before we deliver to customers, we ensure that marginal changes to the end-user experience are worth the performance improvements,” Buchner said.
60 days to 10: measurable returns
The Trunk platform powers seven AI agents dedicated to construction, handling tasks such as analyzing request for information (RFI) responses, reviewing proposals, or reviewing drawings and submittals.
The submittal agent notes missing, conflicting, or inconsistent information in, for example, product specifications and RFIs. While this is an important step in the construction process, “it’s a very tedious workflow,” Buchner said, because human reviewers must compare documents “with many other parts of the document.”
But the agent can do it in seconds, and Trunk says it has cut review cycles from 50 to 60 days down to 10 days, “which has huge schedule and financial implications.”
Trunk is now at the point where these agents communicate directly with each other, which is “pretty exciting,” Buchner said. So, for example, an agent will review an architectural drawing for accuracy, then hand it off to agents who independently handle RFIs and ask additional questions.
“If there are problems with the drawings, the RFI agent takes over and proactively seeks clarification,” Buchner explained.
Trunk says it saves its customers 20 to 40 minutes per question. Buchner said users in the field know better than anyone how “time-sucking” it is to go back and forth from office trailers, dig through project documents in scattered systems or printed PDFs, iron out inconsistencies and go back to coordinate with trade partners.
Trunk says customers report these additional results:
- Average time savings of 8 minutes per single document search (status check, spatial searches, quantity queries).
- Average time savings of 20 minutes for a standard reference (cross-referencing 2-3 specific passages to generate an answer).
- Average time savings of 40 minutes for multi-document research (listing and filtering queries, mapping relationships, analyzing RFIs and submittals on 4-6 documents).
- Average time savings of 75 minutes for complex tasks (creating RFIs and other communication materials, in-depth cross-referencing between documents, tracking changes).
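To see how these per-query figures add up, the reported averages can be multiplied by query volumes. The minutes below come from the list above; the weekly query counts and the category keys are hypothetical and chosen only to illustrate the arithmetic.

```python
# Average minutes saved per query type, as reported by Trunk's customers.
MINUTES_SAVED = {
    "single_document": 8,
    "standard_reference": 20,
    "multi_document": 40,
    "complex_task": 75,
}


def total_hours_saved(query_counts: dict[str, int]) -> float:
    """Convert query counts per type into total hours saved."""
    minutes = sum(MINUTES_SAVED[kind] * count for kind, count in query_counts.items())
    return minutes / 60


# Hypothetical weekly usage for one project team.
weekly_queries = {
    "single_document": 30,
    "standard_reference": 12,
    "multi_document": 6,
    "complex_task": 4,
}
weekly_hours = total_hours_saved(weekly_queries)
```

With those assumed counts, the team would save 1,020 minutes, or 17 hours, in a week.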
In one case, Trunk’s drawing-review agent noted that a structural beam had been raised 8.5 inches, a change the architect had not documented. If it hadn’t been caught, the project manager likely would have had to remove the beam and install a correctly sized one, Buchner said. That rework would add $10,000 or more to the budget and “certainly have schedule implications.”
Buchner pointed to other examples in which agents flagged $60,000 in inflated, unjustified pricing from landscaping subcontractors; identified a fireplace that needed to be sealed before drywall was installed, saving nearly $100,000 in labor, materials and delays; and noted that a panel required for an electric door was not included in the electrical drawings.
Lessons for other industries
Trunk’s approach to construction agents applies to any vertical with unstructured, high-volume industry data. Builders working in specific verticals must understand the industry-specific data challenges faced by their end users and build a technical infrastructure that can transform unstructured data into something an “LLM can go through and understand,” Buchner said. “Only then can you build connections between data points that enable agent workflows.”
A lot of money is being invested in foundation models, so businesses should build modular systems that can take advantage of the strengths of different models as they continue to evolve, Buchner advised. Then, she said, “build a technical advantage where generic models don’t invest and don’t do well.”





