
Data teams building AI agents keep running into the same failure mode. Questions that require combining structured data with unstructured content, such as sales figures alongside customer reviews, or academic papers alongside citation counts, break single-loop RAG systems.
New research from Databricks puts a number on this failure gap. The company's AI research team tested its multi-step agent approach against state-of-the-art single-loop RAG baselines on nine enterprise knowledge tasks, posting gains of 20% or more on Stanford's STaRK benchmark suite and consistent improvements on Databricks' internal KARLBench evaluation framework. The results suggest that the performance gap between single-loop RAG and multi-step agents on hybrid data tasks is an architectural problem rather than a model quality problem.
The work builds on previous Databricks research on trained retrievers, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research addresses a class of questions that enterprises cannot answer with most current agent architectures by bringing structured data sources, relational tables and SQL warehouses, into the same reasoning loop.
"RAG works but doesn’t scale," Michael Benderski, director of research at Databricks, told VentureBeat. "If you want to improve your agent and understand why your sales are declining, now you need to help the agent see charts and view sales data. Your RAG pipeline will be incompetent at this."
A single-turn search cannot encode structural constraints
A key finding is that standard RAG systems fail when a query mixes a fine-grained filter with an open-ended semantic search.
Think of a question like: "Which of our products have seen a decline in sales over the past three months, and what potentially related issues are coming up in customer reviews on various vendor sites?" The sales data lives in the warehouse. The review sentiment lives in unstructured documents on vendor sites. A single-loop RAG system cannot split this request, route each half to the correct data source, and combine the results.
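At its core this is a routing problem: each half of the question belongs to a different backend. A toy sketch illustrates the idea; the keyword heuristic and backend names are invented here, and a real agent would use the LLM itself to decompose and route:

```python
# Toy router: assign each half of a hybrid question to a backend.
# The keyword sets and backend names are hypothetical stand-ins.

STRUCTURED_HINTS = {"sales", "revenue", "decline", "months", "figures"}
UNSTRUCTURED_HINTS = {"reviews", "sentiment", "complaints", "issues"}

def route(sub_question: str) -> str:
    """Classify one sub-question by keyword overlap; a real agent
    would let an LLM do both the decomposition and the routing."""
    words = set(sub_question.lower().replace("?", "").split())
    structured = len(words & STRUCTURED_HINTS)
    unstructured = len(words & UNSTRUCTURED_HINTS)
    return "sql_warehouse" if structured >= unstructured else "vector_search"

halves = [
    "Which products have seen a decline in sales over the past three months?",
    "What related issues are coming up in customer reviews?",
]
print([route(h) for h in halves])  # → ['sql_warehouse', 'vector_search']
```

A single-loop RAG system has no such routing step; it sends the whole question to one retriever and hopes for the best.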
To confirm that this is an architecture issue and not a model quality issue, Databricks re-ran the STaRK baselines using a current state-of-the-art foundation model. The stronger model still trailed the multi-step agent by 21% in the academic domain and 38% in the biomedical domain.
STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base.
How the Supervisor Agent handles what RAG cannot
Databricks built the Supervisor Agent as a production implementation of this research, and its architecture shows why the gains hold across task types. The approach involves three main steps:
Parallel tool decomposition. Instead of issuing one broad query and hoping the results cover both structured and unstructured needs, the agent runs SQL and vector search calls simultaneously, then reasons over the combined results before deciding what to do next. This parallel step is what lets the agent handle queries that cross data-type boundaries without requiring the data to be normalized first.
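The parallel step can be sketched with standard Python concurrency. The two backends below are stubs standing in for a SQL warehouse and a vector index; the function names and return values are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sql(query: str) -> list:
    """Stub for a SQL warehouse call (e.g. finding declining products)."""
    return [("prod_42", -0.18)]

def run_vector_search(query: str) -> list:
    """Stub for a vector index lookup over unstructured review documents."""
    return ["review mentioning prod_42 battery complaints"]

def parallel_retrieve(sql_q: str, vec_q: str) -> dict:
    """Issue both retrieval calls at once, then hand the combined
    results to the reasoning step instead of forcing one path."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sql_future = pool.submit(run_sql, sql_q)
        vec_future = pool.submit(run_vector_search, vec_q)
        return {"sql": sql_future.result(), "vector": vec_future.result()}

results = parallel_retrieve("SELECT product_id FROM sales ...",
                            "complaints about prod_42")
```

The point is not the concurrency itself but that both result sets arrive together, so the agent can reason about them jointly rather than sequentially.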
Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On one STaRK benchmark task, which requires finding a paper by an author with exactly 115 prior publications on a given topic, the agent first queries SQL and vector search in parallel. If the two result sets do not agree, it issues an SQL JOIN that enforces both constraints, then calls vector search to verify the result before returning an answer.
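The detect-and-retry loop can be sketched as follows. Here `search` and `reformulate` are stubs; in the real system an LLM judges whether an attempt failed and rewrites the query:

```python
def search(query: str) -> list:
    """Stub backend: only a query carrying the publication-count
    constraint returns a hit."""
    return ["matching paper"] if "115 publications" in query else []

def reformulate(query: str) -> str:
    """Stub rewrite step: tighten the query with the missed constraint."""
    return query + " by an author with exactly 115 publications"

def self_correcting_search(query: str, max_attempts: int = 3) -> list:
    """Retry with a reformulated query whenever an attempt dead-ends."""
    for _ in range(max_attempts):
        results = search(query)
        if results:                  # verification: did we find anything?
            return results
        query = reformulate(query)   # dead end: try a different path
    return []

print(self_correcting_search("papers on topic X"))  # → ['matching paper']
```

The loop terminates either on a verified hit or after a bounded number of attempts, which is what keeps self-correction from running away.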
Declarative configuration. The agent is not hand-tuned for any particular database or task. Connecting it to a new data source means describing in plain language what that source contains and what questions it should answer. No custom code is required.
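A declarative source description might look like the sketch below. The schema is invented for illustration and is not Databricks' actual configuration format:

```python
# Hypothetical source descriptions: plain-language contents plus the
# kinds of questions each source should answer. No retrieval code here.
data_sources = [
    {
        "name": "sales_warehouse",
        "kind": "sql",
        "description": (
            "Monthly sales figures per product. "
            "Use for questions about revenue, units sold and trends."
        ),
    },
    {
        "name": "vendor_reviews",
        "kind": "vector_search",
        "description": (
            "Customer reviews collected from vendor sites. "
            "Use for sentiment, complaints and product feedback."
        ),
    },
]
```

Under this framing, onboarding another source is one more entry in the list, not another pipeline.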
"An agent can do things like split a query into an SQL query and an out-of-the-box search query." Bendersky said. "It can combine SQL and RAG results, reason about those results, issue follow-up queries, and then reason about whether the final answer was actually found."
It’s not just about hybrid search
Extracting information from both structured and unstructured data is not an entirely new concept.
LlamaIndex, LangChain and Microsoft Fabric agents all offer some form of hybrid search. What differs, Bendersky said, is how the Databricks approach frames the problem architecturally.
"We hardly see it as a hybrid search where you combine inputs and search results or inputs and tables." he said. "We see it more as an agent with access to many tools."
A practical consequence of this framing is that adding a new data source means connecting it to the agent and writing a description of its contents. The agent handles routing and orchestration without additional code.
Custom RAG pipelines require data to be converted into a retriever-friendly format, typically text chunks with embeddings. SQL tables must be flattened, JSON normalized. Each new data source added to the pipeline means more conversion work. The Databricks research argues that as enterprise data spans more source types, this overhead makes custom pipelines increasingly untenable compared with an agent that queries each source in its native format.
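The conversion overhead is easy to see in miniature: every source format needs its own adapter before a text-chunk pipeline can index it. A toy sketch, with illustrative function names:

```python
def table_to_chunks(rows: list[dict]) -> list[str]:
    """Flatten SQL rows into text chunks a text retriever can embed."""
    return [", ".join(f"{k}={v}" for k, v in row.items()) for row in rows]

def json_to_chunks(doc: dict) -> list[str]:
    """Normalize a nested JSON feed into flat text chunks."""
    chunks: list[str] = []

    def walk(node, path=""):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}{key}.")
        else:
            chunks.append(f"{path.rstrip('.')}: {node}")

    walk(doc)
    return chunks

# Each new source type means another adapter like these; the agent
# approach instead queries every source in its native format.
print(table_to_chunks([{"product": "p42", "units": 100}]))
```

Two source types already mean two adapters to write and maintain; a dozen source types mean a dozen.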
"Just bring the agent to the information," Bendersky said. "You give the agent more resources and it will learn to use them quite well."
What this means for businesses
For data engineers weighing whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers clear direction: if the workload involves queries that span structured and unstructured data, custom retrieval is the harder path. Across all tested tasks, the study found that the only things that differed between deployments were the instructions and tool descriptions. The agent handled the rest.
The practical limitations are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without working out which sources are complementary rather than conflicting, makes the agent slower and less reliable. Bendersky recommends scaling up gradually and validating results at each step rather than wiring in all available data upfront.
Data quality is a prerequisite. The agent can query across incompatible formats, SQL sales tables alongside JSON review feeds, without requiring normalization. It cannot correct source data that is factually wrong. Adding a plain-language description of each data source at onboarding time helps the agent route queries correctly from the start.
The researchers see this as an early step in a longer trajectory. As enterprise AI workloads mature, agents are expected to orchestrate across dozens of source types, including dashboards, code repositories and external data feeds. The declarative approach, the study argues, is what makes that scaling tractable: adding a new source remains a configuration problem rather than an engineering one.
"It’s like a ladder" Bendersky said. "The agent will slowly acquire more information and then gradually improve overall."




