How MassMutual and Mass General Brigham turned AI pilot sprawl into production results

Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in unmanaged pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap and what results look like when discipline replaces sprawl.

At MassMutual, the results are concrete: a 30% gain in developer productivity, IT help desk resolution times cut from a baseline of 11 minutes, and customer service calls shortened from 15 minutes to one or two.

“Why are we always interested in this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve a problem, how will we know we’ve solved it? And what’s the value of doing it?”

Defining metrics, creating strong feedback loops

MassMutual, a 175-year-old company that serves millions of policyholders and customers, has put artificial intelligence into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims and more.

Merritt said his team follows the scientific method, starting with a hypothesis and testing whether the outcome would materially advance the business. Some ideas are great, but they may be impractical for the business because of factors such as a lack of information or access, or regulatory constraints.

“Until we’re clear on how we’re going to measure and define success, we’re not going to move forward with an idea,” Merritt said.

Ultimately, it’s up to the individual departments and their leaders to define what quality means: they choose a metric and set a minimum level of quality before a tool is put into the hands of teams and partners.

This starting point creates a rapid feedback loop. “What’s holding us back is where there’s a lack of shared clarity about the outcome we’re trying to achieve,” Merritt said. “We don’t go into production until we have a business partner who says, ‘Yes, it works.’”

His team is strategic in evaluating emerging tools and is “extremely rigorous” in testing and measuring what “good” means. For example, they perform confidence estimation to reduce hallucination rates, set thresholds and evaluation criteria, and monitor feature and output drift.
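To make those two ideas concrete, here is a minimal sketch of a confidence gate and a drift check. It is an illustration only, not MassMutual's implementation: the threshold values, the `ESCALATE_TO_HUMAN` marker, and the choice of the Population Stability Index (PSI) as the drift measure are all assumptions made for the example.

```python
import bisect
import math
from typing import Sequence

# Hypothetical values for illustration; the article gives no actual thresholds.
CONFIDENCE_THRESHOLD = 0.8
PSI_ALERT_THRESHOLD = 0.2  # a commonly cited rule of thumb, not a standard


def gate_answer(answer: str, confidence: float,
                threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the answer only if its confidence estimate clears the threshold."""
    if confidence >= threshold:
        return answer
    return "ESCALATE_TO_HUMAN"


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> list[float]:
    """Fraction of values falling in each bin defined by sorted edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        i = min(max(i, 0), len(counts) - 1)  # clamp out-of-range values into the end bins
        counts[i] += 1
    return [c / len(values) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a current sample."""
    p = bin_fractions(baseline, edges)
    q = bin_fractions(current, edges)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # avoid log(0) for empty bins
        total += (qi - pi) * math.log(qi / pi)
    return total
```

A monitoring job could compute `psi` on model output scores (or on an input feature) each day against a fixed baseline window and raise an alert when the value exceeds `PSI_ALERT_THRESHOLD`.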

Merritt also operates with a no-commitment policy, meaning the company is not locked into any particular model. He described an “incredibly heterogeneous” technology environment that combines best-of-breed models with core systems that run on COBOL. This flexibility is not accidental. His team built common service layers, microservices, and APIs that sit between the AI layer and everything below it, so swapping in a better model when one comes along doesn’t mean starting over.

Because, Merritt explained, “the best breed today could be the worst breed tomorrow, and we don’t want to be left behind.”
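The service-layer idea can be sketched in a few lines. This is a hypothetical illustration, not MassMutual's architecture: `TextModel`, `VendorAModel`, `VendorBModel`, and `SummaryService` are invented names, and the adapters return canned strings where a real one would call a vendor SDK.

```python
from typing import Protocol


class TextModel(Protocol):
    """Interface the business code depends on; any vendor adapter can satisfy it."""

    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    """Hypothetical adapter; a real one would call the vendor's SDK here."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class VendorBModel:
    """A second hypothetical adapter with the same interface."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"


class SummaryService:
    """Common service layer: callers never see which model sits behind it."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def swap_model(self, model: TextModel) -> None:
        """Replace the underlying model without touching any caller."""
        self._model = model

    def summarize_claim(self, claim_text: str) -> str:
        return self._model.complete(f"Summarize: {claim_text}")
```

Because downstream systems depend only on `SummaryService`, replacing `VendorAModel` with `VendorBModel` is a one-line change at the point where the service is constructed or reconfigured.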

Weeding instead of letting a thousand flowers bloom

Mass General Brigham (MGB), for its part, initially took a more spray-and-pray approach.

About 15,000 researchers in the nonprofit healthcare system have been using AI, ML and deep learning for the past 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.

But last year, he made a bold choice: His team shut down most of the unmanaged AI pilots. In the beginning, “we followed the blooming (methodology) of a thousand flowers, but we didn’t have a thousand roses, we were probably trying to bloom a few dozen,” he said.

Like Merritt’s team at MassMutual, MGB moved to a more holistic view, examining the specific tools it was building for specific parts of its workflows. The team asked what capabilities it wanted and needed, and what investments those required.

Sriraman’s team also discussed roadmaps with its major platform providers: Epic, Workday, ServiceNow and Microsoft. He noted that this was a “pivotal point,” because the team realized it was building internal tools that those vendors already provide (or plan to ship).

As Sriraman said: “Why build it ourselves? We already have the platform. It will be in the workflow. Use it.”

However, the market is still emerging, which can make decisions difficult. “My analogy is that six blind men were asked to touch an elephant, what does it look like?” Sriraman said. “You’ll get six different answers.”

He noted that there was nothing wrong with that; it’s just that everyone is discovering and experimenting as the landscape changes.

Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, using a “small landing zone” where they can safely test more complex products and control token usage.

They have also begun to “consciously deploy AI champions” among business teams. “It is the reflection of a thousand flowers blooming, carefully planted and nurtured,” Sriraman said.

Observability is another important consideration; he described a real-time dashboard that tracks model drift and security, allowing IT teams to govern AI “a little more pragmatically.” Health monitoring is important with AI systems, he noted, and his team has established principles and policies around AI use, including access privileges.

Guardrails are a must in clinical settings: AI systems never make the final decision. “There will always be a physician or physician assistant to close the decision,” Sriraman said. He cited radiology report generation as an area where AI is heavily used, but where the radiologist always signs off.

Sriraman was equally clear on the “don’ts”: Don’t disclose PHI (protected health information). “Simple as that, right?”

And most importantly, there should be safety mechanisms. “We need a big red button, kill it,” Sriraman said. “Without it, we don’t put anything into operating conditions.”
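The two guardrails described here, mandatory clinician sign-off and a kill switch, can be sketched together. This is an assumption-laden illustration rather than MGB's system: `KillSwitch`, `draft_report`, `sign_off`, and `release` are hypothetical names, and the AI call is replaced by a placeholder string.

```python
import threading


class KillSwitch:
    """A process-wide 'big red button' that disables AI drafting when tripped."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def trip(self) -> None:
        self._stopped.set()

    def is_tripped(self) -> bool:
        return self._stopped.is_set()


class AISystemDisabled(RuntimeError):
    """Raised when AI functionality is requested after the switch was tripped."""


def draft_report(findings: str, kill_switch: KillSwitch) -> dict:
    """Produce an AI draft (placeholder text here) unless the switch is tripped."""
    if kill_switch.is_tripped():
        raise AISystemDisabled("AI drafting is disabled")
    return {"text": f"DRAFT: {findings}", "status": "pending_review", "signed_by": None}


def sign_off(report: dict, clinician: str) -> dict:
    """Record the human decision; only a named clinician can finalize a report."""
    return {**report, "status": "final", "signed_by": clinician}


def release(report: dict) -> str:
    """Refuse to release anything a clinician has not signed."""
    if report["status"] != "final" or not report["signed_by"]:
        raise PermissionError("report requires clinician sign-off before release")
    return report["text"]
```

The design point is that the AI output is only ever a draft: `release` fails unless `sign_off` has been called, and `draft_report` fails as soon as an operator calls `trip()`.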

Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be drastically different. “There is nothing new about this,” Sriraman said. “You can substitute AI for BPM (business process management) in the 90s and 2000s. The same concepts apply.”
