
Artificial intelligence that can see and understand what’s happening in video (especially live streams) is an attractive product for many businesses and organizations. Apart from acting as a security "watchdog" across sites and facilities, such an AI model can be used to cut out the most interesting parts of marketing videos and repurpose them for social media, identify inconsistencies and gaps in videos and flag them for removal, and analyze the body language and movements of participants in controlled studies or candidates applying for new roles.
While some AI models offer this type of functionality today, it is far from a mainstream capability. Two-year-old startup Perceptron Inc. is trying to change that. Today it announced the release of its flagship proprietary video analysis reasoning model, Mk1 (short for "Mark One"). At $0.15 per million input tokens and $1.50 per million output tokens through its application programming interface (API), it costs roughly 80-90% less than other leading proprietary competitors, namely Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.
The company, led by co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, spent 16 months developing a "multimodal recipe" from scratch to address the complexities of the physical world.
This release marks a new era in which models are expected to understand causation, object dynamics, and the laws of physics with the same fluency they once applied to grammar.
Interested users and potential enterprise customers can try it out for themselves on Perceptron’s public demo site.
Performance on spatial and video benchmarks
The model’s performance is supported by a set of industry-standard benchmarks aimed at spatial and video understanding.
In spatial reasoning (ER Benchmarks), the Mk1 scored 85.1 on EmbSpatialBench, beating Google’s Robotics-ER 1.5 (78.4) and Alibaba’s Q3.5-27B (approx. 84.5).
The Mk1’s score of 72.4 on the specialized RefSpatialBench represents a huge leap over rivals such as GPT-5m (9.0) and Sonnet 4.5 (2.2) and highlights a significant advantage in spatial referring tasks.
Video benchmarks show a similar advantage: on the EgoSchema "Hard Subset," where inferring from the first and last frames is not enough, the Mk1 scored 41.4, matching Alibaba’s Q3.5-27B and significantly beating Gemini 3.1 Flash-Lite (25.0).
In VSI-Bench, the Mk1 reached 88.5, the highest score recorded among the models compared, further confirming its ability to perform actual temporal reasoning tasks.
Market positioning and the efficiency frontier
Perceptron has clearly taken aim at the "Efficiency Frontier," a metric that plots average scores across video and reasoning benchmarks against blended cost per million tokens.
Benchmark data reveals that the Mk1 occupies a unique position: it matches or exceeds the performance of "frontier" models like GPT-5 and Gemini 3.1 Pro while maintaining a cost profile closer to "Lite" or "Flash" versions.
Specifically, the Perceptron Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. In comparison, the "Efficiency Frontier" chart shows GPT-5 at a significantly higher blended cost (closer to $2.00) and Gemini 3.1 Pro at around $3.00, while the Mk1 sits at the $0.30 blended cost mark with superior reasoning scores.
This aggressive pricing strategy is intended to make high-end physical AI available for large-scale industrial use rather than just experimental research.
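The $0.30 blended figure depends on how input and output prices are weighted, which the chart does not spell out. As a rough sketch, the Python below assumes a simple token-weighted average; under that assumption, an 8:1 input-to-output ratio reproduces $0.30 from the listed prices. The function name and the ratio are illustrative assumptions, not Perceptron’s stated methodology.

```python
def blended_cost(input_price: float, output_price: float,
                 input_ratio: float, output_ratio: float = 1.0) -> float:
    """Token-weighted average price per million tokens (assumed weighting)."""
    total = input_ratio + output_ratio
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    return (input_price * input_ratio + output_price * output_ratio) / total


# List prices quoted in the article, in dollars per million tokens.
MK1_INPUT = 0.15
MK1_OUTPUT = 1.50

# Assumption: 8 input tokens for every output token.
mk1_blended = blended_cost(MK1_INPUT, MK1_OUTPUT, input_ratio=8)
print(f"{mk1_blended:.2f}")  # prints 0.30
```

A more output-heavy workload would push the blended price up toward $1.50, so the comparison holds only for input-dominated use such as video analysis.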
Architecture and temporal continuity
The technical core of the Perceptron Mk1 is its ability to process native video at 2 frames per second (FPS) within a substantial 32K-token context window.
Unlike traditional vision-language models (VLMs), which often treat video as discrete sequences of still images, Mk1 is designed for temporal continuity.
This architecture allows the model to "watch" extended streams, following objects even through occlusions and preserving their identity, a critical requirement for robotics and surveillance applications.
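At a fixed sampling rate, frame count grows linearly with clip length, so the context window bounds how much video fits in one request. The sketch below shows only that arithmetic. The article does not say how many tokens a frame consumes or how much room the prompt and answer need, so `tokens_per_frame` and `reserved_tokens` are hypothetical placeholders, and "32K" is read as 32,000.

```python
FPS = 2                   # sampling rate stated in the article
CONTEXT_TOKENS = 32_000   # "32K" read as 32,000; assumption, could be 32,768


def max_clip_seconds(tokens_per_frame: int, reserved_tokens: int = 0) -> float:
    """Longest clip (in seconds) whose sampled frames fit in the window.

    Both arguments are hypothetical inputs, not published Mk1 figures.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    usable = CONTEXT_TOKENS - reserved_tokens
    if usable < 0:
        raise ValueError("reserved_tokens exceeds the context window")
    frames = usable // tokens_per_frame
    return frames / FPS


# Placeholder values, chosen only to exercise the arithmetic:
# 30,000 usable tokens / 100 per frame = 300 frames = 150.0 seconds.
seconds = max_clip_seconds(tokens_per_frame=100, reserved_tokens=2_000)
print(seconds)
```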
Developers can query the model for specific moments in a long stream and receive structured timecodes in return, simplifying the process of video clipping and event detection.
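To illustrate how structured timecodes simplify clipping, here is a minimal sketch that pads detected moments, merges overlapping ranges, and formats the result. The `Moment` shape and both helper functions are assumptions made for illustration; the article does not describe the schema Mk1 actually returns.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Moment:
    """One detected event; this shape is assumed, not Perceptron's schema."""
    label: str
    start_s: float
    end_s: float


def to_timecode(seconds: float) -> str:
    """Format non-negative seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def clip_ranges(moments: Iterable[Moment], pad_s: float = 1.0,
                duration_s: Optional[float] = None) -> List[Tuple[float, float]]:
    """Pad each moment, clamp to the video bounds, and merge overlaps."""
    padded = []
    for m in moments:
        start = max(0.0, m.start_s - pad_s)
        end = m.end_s + pad_s
        if duration_s is not None:
            end = min(duration_s, end)
        padded.append((start, end))
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(padded):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous clip: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up events standing in for a model response.
moments = [
    Moment("goal", 12.5, 15.0),
    Moment("replay", 15.5, 20.0),
    Moment("foul", 61.0, 63.5),
]
clips = clip_ranges(moments, pad_s=1.0, duration_s=90.0)
for start, end in clips:
    print(to_timecode(start), "->", to_timecode(end))
# 00:00:11.500 -> 00:00:21.000
# 00:01:00.000 -> 00:01:04.500
```

Merging after padding keeps adjacent events, such as a goal and its replay, in one continuous clip rather than two overlapping ones.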
Reasoning with the laws of physics
The main differentiator for the Mk1 is its "physical reasoning" ability. Perceptron defines this as high-precision spatial awareness that enables a model to understand object dynamics and physical interactions in real-world settings.
For example, the model can analyze a scene to determine whether a basketball shot was taken before or after the buzzer by jointly reasoning about the ball’s position in the air and the shot clock reading.
This requires more than pattern recognition; it requires an understanding of how objects move through space and time.
The model is capable of "pixel-precise" pointing and counting into the hundreds within dense, complex scenes. It can also read analog gauges and clocks, which have historically been difficult for purely digital vision systems to interpret with high reliability.
It also seems to have strong general world and historical knowledge. In my brief test, I uploaded a public domain 1906 film of skyscraper construction in New York from the US Library of Congress, and Mk1 not only correctly described the content of the footage, including the strange, atypical scenes of workers hanging from ropes, but also quickly dated it (early 1900s) from the look of the footage alone.
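The basketball example combines two readings per frame: whether the ball has left the shooter’s hand and what the clock shows. The toy sketch below shows that joint decision over sampled frames. It is not how Mk1 works internally; the `FrameObs` shape and `judge_shot` function are hypothetical, and the per-frame facts are assumed to come from some upstream vision model.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FrameObs:
    """Per-frame facts a vision model might report (hypothetical shape)."""
    t: float            # seconds since clip start
    clock_s: float      # time remaining on the clock, as read from the frame
    ball_in_hand: bool  # whether the shooter still holds the ball


def judge_shot(frames: Sequence[FrameObs]) -> str:
    """Return 'before', 'after', 'unclear', or 'no_release'."""
    ordered = sorted(frames, key=lambda f: f.t)
    for i, obs in enumerate(ordered):
        if obs.ball_in_hand:
            continue
        if obs.clock_s > 0.0:
            return "before"   # ball already in the air with time remaining
        prev = ordered[i - 1] if i > 0 else None
        if prev is not None and prev.clock_s <= 0.0:
            return "after"    # still held when the clock had already expired
        return "unclear"      # release and buzzer fall between two samples
    return "no_release"


# Made-up observations sampled at 2 FPS.
in_time = [FrameObs(0.0, 0.8, True), FrameObs(0.5, 0.3, False)]
late = [FrameObs(0.0, 0.4, True), FrameObs(0.5, 0.0, True),
        FrameObs(1.0, 0.0, False)]
print(judge_shot(in_time), judge_shot(late))  # before after
```

The "unclear" branch reflects a limit of sparse sampling: at 2 FPS, a release and a buzzer that both occur within the same half-second gap cannot be ordered from these two facts alone.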
A developer platform for physical AI
Accompanying the model release is an extensible developer platform designed to turn these high-level perceptual capabilities into functional applications with minimal code.
It provides several specialized functions, accessible through the Perceptron SDK for Python, including "Focus," "Counting," and "In-Context Learning."
The focus function allows users to automatically zoom in on and crop specific regions of the frame based on a natural language command, such as detecting and locating personal protective equipment (PPE) on a construction site. The counting function is optimized for busy scenes, such as identifying and pointing to each puppy in a group or each individual item among a set of products.
In addition, the platform supports in-context learning, allowing developers to adapt Mk1 to specific tasks by providing a few examples, such as showing the model a picture of an apple and instructing it to label each instance as Category 1 in a new scene.
Licensing strategies and the Isaac series
Perceptron uses a dual path strategy for model weights and licensing. The flagship Perceptron Mk1 is a closed-source model available through the API, designed for enterprise-class performance and security.
However, the company also maintains the "Isaac" series, which started with the release of Isaac 0.1 in September 2025, as an open-weights alternative. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion-parameter vision-language model with reasoning capabilities designed for edge and low-latency deployments.
Although weights for the Isaac models are openly available on the popular AI code sharing community Hugging Face, Perceptron offers commercial licenses for companies requiring maximum control over the weights or on-premises deployment.
This approach allows the company to support both the open source community and specialized industry partners who need proprietary flexibility. The documentation notes that Isaac 0.2 models are specifically optimized for sub-200ms latency, making them well suited to real-time edge devices.
Background on Perceptron’s founding and focus
Perceptron AI is a Bellevue, Washington-based physical artificial intelligence startup founded by Aghajanyan and Akshat Shrivastava, former research scientists at Meta’s Facebook AI Research (FAIR) lab.
Public records show the company was incorporated by November 2024. Washington corporate filings for Perceptron.ai Inc. list an application for foreign registration dated October 9, 2024, with Shrivastava and Aghajanyan named as governors.
In the founders’ launch announcements from late 2024, Aghajanyan said he had left Meta after about six years and “joined forces” with Shrivastava to create artificial intelligence for the physical world, an effort Shrivastava said grew out of his work on efficiency, multimodality and new model architectures.
The approach appears to follow directly from the pair’s work on multimodal foundation models at Meta. In May 2024, Meta researchers published Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, work that Perceptron later described as part of the lineage behind its models.
A follow-up paper from July 2024, MoMa, investigated more efficient early-fusion training for mixed-modal models and listed both Shrivastava and Aghajanyan among the authors. Perceptron’s stated thesis extends that line of research to “physical artificial intelligence”: models that can process real-world video and other sensory streams for use cases such as robotics, manufacturing, geospatial analysis, security and content moderation.
Partner ecosystems and vision for the future
The real-world impact of the Mk1 is already being demonstrated through Perceptron’s partner network. Early adopters are using the model for a variety of applications, such as automatically cutting highlights from live sports, using the model’s temporal understanding to identify key plays without human intervention.
In the robotics sector, partners transform teleoperation episodes into training data, effectively automating the data tagging and cleaning process for robotic arms and mobile devices.
Other use cases include multimodal quality control agents that can detect defects on production lines and inspect assembly steps in real-time, and wearable assistants in smart glasses that provide context-aware assistance to users.
Aghajanyan said these releases are the culmination of research aimed at making artificial intelligence function at its best in the physical world, with the goal of making "physical AI" as ubiquitous as digital artificial intelligence.





