
Large language models suffer from limitations in domains that require understanding the physical world, from robotics to autonomous driving to manufacturing. That limitation has not discouraged investors from backing world models: AMI Labs raised a $1.03 billion seed round shortly after World Labs secured $1 billion.
Large language models (LLMs) excel at processing abstract knowledge by predicting the next token, but they lack a grounding in physical causality: they cannot reliably predict the physical consequences of real-world actions.
AI researchers and thought leaders are increasingly vocal about these limitations as the industry tries to push AI out of web browsers and into physical spaces. In an interview with podcaster Dwarkesh Patel, Turing Award winner Richard Sutton warned that LLMs imitate what people say rather than modeling the world, limiting their ability to learn from experience and adapt to change.
As a result, models built on LLMs, including vision language models (VLMs), exhibit brittle behavior and can break under very small changes in their inputs.
Google DeepMind CEO Demis Hassabis echoed this point in another interview, observing that today’s AI models suffer from “fragmented intelligence”: they can solve Olympiad-level math problems yet fail at basic physics, because they lack an understanding of real-world dynamics.
To solve this problem, researchers are building world models that act as internal simulators, allowing AI systems to safely test hypotheses before taking physical action. However, “world models” is an umbrella term covering several different architectural approaches.
The field has settled into three distinct architectural approaches, each with its own advantages.
JEPA: built for real-time
The first major approach focuses on learning hidden representations instead of trying to predict the dynamics of the world at the pixel level. AMI Labs champions this method, known as the Joint Embedding Predictive Architecture (JEPA).
JEPA models try to mimic how people make sense of the world. When we observe a scene, we don’t remember every pixel or trivial detail. If you watch a car moving down the street, you track its trajectory and speed; you don’t calculate the light reflecting off each leaf of the trees in the background.
JEPA models replicate this cognitive shortcut. Instead of forcing the neural network to predict exactly what the next video frame will look like, the model learns a smaller set of abstract or “hidden” features. It discards unimportant details and concentrates on the basic rules governing how elements in the scene interact. This makes the model robust to background noise and the small input changes that break other models.
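To make the contrast concrete, here is a toy numerical sketch (not AMI Labs’ code; the encoder and dynamics are hand-crafted stand-ins for what a real JEPA learns). It shows why a latent-space objective ignores the background noise that dominates a pixel-space objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 8x8 frames where a bright object sits in row 3 and moves
# one column to the right each step, over a noisy background.
def make_frame(t, noise=0.2):
    frame = rng.normal(0.0, noise, size=(8, 8))
    frame[3, t % 8] = 1.0  # the moving object
    return frame

# Encoder: compress a frame to a low-dimensional latent. Hand-crafted
# here (the object's column); a real JEPA learns this mapping.
def encode(frame):
    return np.array([float(np.argmax(frame[3]))])

# Predictor: advance the latent one step (the learned scene dynamics).
def predict_next(latent):
    return (latent + 1.0) % 8

frame_t, frame_t1 = make_frame(0), make_frame(1)

# Latent-space objective: compare the predicted latent with the encoding
# of the real next frame. Background noise never enters the loss.
latent_loss = float(np.sum((predict_next(encode(frame_t)) - encode(frame_t1)) ** 2))

# Pixel-space objective (using the previous frame as a naive prediction):
# dominated by unpredictable background noise even though the object's
# dynamics are simple and fully predictable.
pixel_loss = float(np.sum((frame_t - frame_t1) ** 2))
```

With the object’s motion captured perfectly, the latent loss is zero while the pixel loss stays large, driven entirely by noise the model could never predict.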
This architecture is computationally and memory efficient. By ignoring irrelevant details, it requires fewer training samples and operates with significantly lower latency. That makes it suitable for applications such as robotics, self-driving cars and high-stakes enterprise workflows, where efficiency and real-time output are non-negotiable.
For example, AMI is partnering with healthcare company Nabla to use this architecture to simulate operational complexity and reduce cognitive load in fast-moving healthcare systems.
In an interview with Newsweek, JEPA pioneer and AMI co-founder Yann LeCun explained that JEPA-based world models are designed to be "controllable in the sense that you can give them goals, and all they can do is achieve those goals."
Gaussian splats: built for space
The second approach relies on generative models to create a complete spatial environment from scratch. Adopted by companies such as World Labs, this method takes an initial prompt (an image or text description) and uses a generative model to produce a 3D Gaussian splat scene. Gaussian splatting is a technique for representing 3D scenes with millions of tiny mathematical particles that define geometry and lighting. Unlike straight video generation, these 3D scenes can be imported directly into standard physics and 3D engines such as Unreal Engine, where users and other AI agents can move freely and interact with them from any angle.
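As a rough illustration of the underlying data structure (a simplified 2D stand-in, not World Labs’ implementation), a splat scene is just an array of Gaussian primitives that a renderer alpha-composites into pixels:

```python
import numpy as np

# A scene as a list of Gaussian primitives: each splat has a 2D mean,
# an isotropic scale, a grayscale color and an opacity. Real splats use
# full 3D covariances, view-dependent color, and depth sorting before
# compositing; this toy keeps only the compositing idea.
splats = [
    {"mean": np.array([4.0, 4.0]), "scale": 1.5, "color": 1.0, "alpha": 0.8},
    {"mean": np.array([10.0, 6.0]), "scale": 2.0, "color": 0.5, "alpha": 0.6},
]

def render(width=16, height=12):
    image = np.zeros((height, width))
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(float)  # (x, y) per pixel
    transmittance = np.ones((height, width))  # light not yet absorbed
    for s in splats:  # front-to-back alpha compositing
        d2 = np.sum((pixels - s["mean"]) ** 2, axis=-1)
        weight = s["alpha"] * np.exp(-0.5 * d2 / s["scale"] ** 2)
        image += transmittance * weight * s["color"]
        transmittance *= 1.0 - weight
    return image

img = render()
```

Because the scene is explicit geometry rather than generated pixels, the same splat array can be rendered from any viewpoint or handed to an external engine, which is what makes the format portable.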
The main benefit here is a dramatic reduction in the time and one-off production cost of building complex interactive 3D environments. This addresses the exact problem outlined by World Labs founder Fei-Fei Li, who has observed that LLMs are ultimately “masters of words in the dark”: fluent with language but lacking spatial intelligence and physical experience. World Labs’ Marble model gives AI that spatial awareness.
Although this approach is not designed for split-second, real-time execution, it has great potential for building static learning environments for spatial computing, interactive entertainment, industrial design and robotics. The enterprise value is evident in Autodesk’s support for World Labs to integrate these models into industrial design applications.
End-to-end generation: built for scale
A third approach uses a single end-to-end generative model to process prompts and user actions, generating the scene, its physical dynamics and its reactions on the fly. Instead of exporting a static 3D file to an external physics engine, the model itself acts as the engine. It takes an initial prompt along with a continuous stream of user actions and generates successive frames of the environment in real time, computing physics, lighting and object reactions internally.
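The interaction loop can be sketched as follows (a toy stand-in: the hand-written step function below plays the role that a large generative network plays in systems like Genie 3 or Cosmos):

```python
import numpy as np

# Toy end-to-end world model: one function consumes the current frame
# plus a user action and emits the next frame, acting as renderer and
# physics engine at once.
def world_model_step(frame, action):
    # "Physics": gravity pulls the object (the brightest pixel) down one
    # row per step; the action shifts it left (-1), right (+1) or not (0).
    h, w = frame.shape
    y, x = np.unravel_index(np.argmax(frame), frame.shape)
    y = min(y + 1, h - 1)   # gravity, clamped at the floor
    x = (x + action) % w    # user control, wrapping at the edges
    nxt = np.zeros_like(frame)
    nxt[y, x] = 1.0
    return nxt

# Interactive loop: stream actions in, receive frames out, one per step.
frame = np.zeros((6, 8))
frame[0, 3] = 1.0  # initial scene: object at row 0, column 3
trajectory = [frame]
for action in [1, 1, 0, -1]:
    frame = world_model_step(frame, action)
    trajectory.append(frame)
```

The key architectural point is that nothing outside the loop computes dynamics: every frame, including all physical consequences of the user’s actions, comes from the model itself.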
DeepMind’s Genie 3 and Nvidia’s Cosmos belong to this category. These models provide an extremely simple interface for creating open-ended interactive experiences and large volumes of synthetic data. DeepMind demonstrated this with Genie 3, showing how the model maintains strict object persistence and consistent physics at 24 frames per second without relying on a separate memory module.
This approach translates directly into large-scale synthetic data factories. Nvidia Cosmos uses this architecture to scale synthetic data and physical AI reasoning, enabling autonomous vehicle and robotics developers to synthesize rare, dangerous edge cases without the cost and risk of physical testing. Waymo, an Alphabet subsidiary, based its world model on Genie 3 and adapted it to train its self-driving cars.
The disadvantage of this end-to-end generative method is the heavy computational cost of continuously rendering physics and pixels simultaneously. That cost, however, may be the price of achieving Hassabis’s vision: a deep, intrinsic understanding of physical causality, which he argues current AI lacks and needs in order to operate safely in the real world.
What comes next: hybrid architectures
LLMs will continue to serve as reasoning and communication interfaces, but world models are positioning themselves as the foundational infrastructure for physical and spatial data pipelines. As the underlying models mature, expect hybrid architectures to emerge that combine the strengths of each approach.
For example, cybersecurity startup DeepTempo recently developed LogLM, a model that combines elements of LLMs and JEPA to detect anomalies and cyber threats in security and network logs.




