Google’s Gemini Omni Flash turns enterprise video production into a conversation with API access



For most businesses, a 90-second training video or product explainer has never been an easy ask. It means a carefully planned brief, an in-house crew or an outside vendor, a shoot, an edit and then more editing. Change one line of on-screen text after legal review, and the whole chain starts again. Cost and long turnaround times are why so many internal videos never get made.

It’s this equation that Google wants to rewrite. Gemini Omni Flash, the first model in the new "Omni" family, debuted to consumers at I/O 2026 and is now available to developers and enterprise customers via API. Google frames the family’s ambition as creating anything "from any input," starting with video. But the headline feature isn’t just sharper text-to-video. It is the ability to edit a finished clip through chat.

When the model launched in May, VentureBeat’s analysis noted the catch: with no programmatic interface, Omni was a consumer tool, not a production one. The API release changes that. It puts conversational editing in front of the marketing and learning-and-development teams that make most of the video inside an organization.

The pitch: a five-tool pipeline collapses into one conversation

Until now, many teams have assembled AI-generated video the hard way, chaining an LLM for the script, a text-to-image model, an image-to-video model, a separate lip-sync tool and an audio generator, each with its own contract, quirks and data path.

Omni’s enterprise argument is consolidation: one model that takes in text, images and video and returns a finished clip with synchronized audio.

This simplification is the part decision makers should weigh first. Consolidating multiple point tools into one model means fewer vendors and one place to control output and enforce data-handling rules. The equation changes for an organization that has avoided generative video because gluing the tools together was not worth the cost.

With a conversational editor, each instruction builds on the last, so a marketer can relight, reframe or change the wardrobe in a product shot without rebuilding from scratch and losing the pieces that already worked. This is the difference between ordering a reshoot and sending a note.

A physics engine for multimodal references and brand assets

Omni accepts more than a text prompt. Along with words describing what you want, you can feed it several reference images and existing video clips, and it picks up their features. Give it a photo of a specific object, ask the model to place that object in the scene, and it reproduces the color and rough shape of the real thing instead of inventing a generic stand-in. While the match isn’t pixel-perfect, it’s recognizably close. This reference-based control is what makes it commercially interesting: a photo of a product, a brand logo or a specific location can go into the prompt as a shown ingredient rather than one the model has to guess at.

Of the four strengths Google highlights, two bear directly on enterprise use. The first is the world model, the system’s understanding of how physical scenes behave. Add light rain and puddles to existing footage, and it renders the reflections of people and objects on the wet pavement, the kind of physical consistency that separates realistic footage from obvious AI video.

The second is text and logo insertion. Point it at a scene full of signs, and you can have it rewrite those signs in a different language or for a brand of your choice, and even drop in a company logo. The results aren’t flawless: in testing, tracking of the signs wasn’t always perfect in complex scenes, and some text reverted to the original language between frames. For training videos that need on-screen labels or ads that need a logo placed in the scene, this is a capability worth a closer look, and a reminder that the output still needs human review before it ships.

The Interactions API and where limitations still bite

Under the hood, it runs on Google’s new Interactions API, a stateful interface built for multi-turn tasks rather than open-ended chat. Each turn carries forward the previous video and its references, allowing edits to be stacked sequentially. Developers can chain generations: take a clip, turn a cat into a cougar kitten, convert the video to an 8-bit retro style, then a watercolor look, and keep each version to branch from later.
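To make the chaining and branching idea concrete, here is a minimal sketch of how a team might track an edit history on its own side. It does not call any Google API: `ClipVersion`, `EditSession` and the opening prompt are hypothetical names invented for illustration, and a real client would replace the body of `generate` with an actual model call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClipVersion:
    """One generated clip in an edit history (hypothetical structure)."""
    version_id: int
    instruction: str
    parent: Optional["ClipVersion"] = None

    def lineage(self) -> list[str]:
        """Return every instruction from the first generation to this one."""
        steps = []
        node = self
        while node is not None:
            steps.append(node.instruction)
            node = node.parent
        return list(reversed(steps))


@dataclass
class EditSession:
    """Keeps every version so any earlier clip can be branched from later."""
    versions: list[ClipVersion] = field(default_factory=list)

    def generate(self, instruction: str,
                 parent: Optional[ClipVersion] = None) -> ClipVersion:
        # A real client would call the model here; this sketch only records the turn.
        version = ClipVersion(len(self.versions), instruction, parent)
        self.versions.append(version)
        return version


session = EditSession()
base = session.generate("a cat walking across a kitchen floor")  # assumed prompt
cougar = session.generate("turn the cat into a cougar kitten", parent=base)
retro = session.generate("convert the video to an 8-bit retro style", parent=cougar)
# Branch from the cougar version instead of stacking on the retro one.
watercolor = session.generate("give it a watercolor look", parent=cougar)

print(retro.lineage())
print(watercolor.lineage())
```

Keeping the parent pointer on each version is what lets the watercolor look start from the cougar clip without inheriting the retro pass.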

The limitations are real and worth budgeting around. Clips are currently capped at 10 seconds, according to the model’s published model card. To make something longer, you generate segments and edit them together. Uploaded footage can also be edited, as long as it runs 10 seconds or less and the user holds the rights to it. Google’s own model card is clear that consistency between edits and accurate text rendering remain open issues.
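As a rough illustration of what budgeting around the cap looks like, the sketch below splits a target runtime into clip lengths. It assumes whole-second durations and uses the 3-to-10-second range reported for the model; `plan_segments` is a hypothetical helper, not part of any SDK.

```python
MIN_CLIP_SECONDS = 3   # shortest clip length reported for the model
MAX_CLIP_SECONDS = 10  # current per-clip cap from the model card


def plan_segments(total_seconds: int) -> list[int]:
    """Split a target runtime into clip lengths between 3 and 10 seconds."""
    if total_seconds < MIN_CLIP_SECONDS:
        raise ValueError("target is shorter than the minimum clip length")
    full, remainder = divmod(total_seconds, MAX_CLIP_SECONDS)
    segments = [MAX_CLIP_SECONDS] * full
    if remainder == 0:
        return segments
    if remainder >= MIN_CLIP_SECONDS:
        segments.append(remainder)
        return segments
    # Remainder is too short to stand alone: borrow seconds from the last full clip.
    shortfall = MIN_CLIP_SECONDS - remainder
    segments[-1] -= shortfall
    segments.append(MIN_CLIP_SECONDS)
    return segments


print(plan_segments(90))  # nine full-length clips
print(plan_segments(92))  # the last two clips are rebalanced
```

A 90-second explainer therefore needs at least nine separate generations before any editing pass, plus the work of stitching them together.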

Guardrails, watermarks and the line Google won’t cross

For a CISO, the demos matter less than the provenance tooling shipped alongside the model. Every Omni clip carries Google’s SynthID watermark, Google is extending C2PA Content Credentials across its generative tools, and it has launched an AI Content Detection API that flags AI-generated media from both Google and other vendors.

Google also drew a deliberate line. The model will not take a photo and an audio clip of a person and synchronize them into speech. It will, however, take footage of someone speaking and translate it into another language, which is a useful way to localize global learning content. For regulated businesses, these restrictions and the baked-in provenance are features, not friction.

The numbers: cheap, 720p only and (early on) top-ranked

Pricing landed alongside the API and is aggressive. Omni Flash costs $0.10 per second of 720p video generated, which puts a ten-second clip at about a dollar. That matches Veo 3.1 Fast at the same resolution, is double Veo 3.1 Lite and undercuts standard Veo 3.1 by three quarters.

| Price per second (USD) | Gemini Omni Flash | Veo 3.1 Lite | Veo 3.1 Fast | Veo 3.1 |
| --- | --- | --- | --- | --- |
| 720p | $0.10 | $0.05 | $0.10 | $0.40 |
| 1080p | n/a | $0.08 | $0.12 | $0.40 |
| 4K | n/a | n/a | $0.30 | $0.60 |

The table also exposes the catch. Omni Flash only produces 720p. There’s no 1080p or 4K option, while the Veo tiers go up to 4K. For internal training and most social video, 720p is fine. For a premium brand spot designed for a large screen, this is a real ceiling, and Veo 3.1 still has a job to do.
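The per-clip arithmetic behind that comparison can be sketched in a few lines. The rates are copied from the pricing table; the dictionary layout and the `clip_cost` helper are illustrative assumptions, and `None` marks a resolution a model does not offer.

```python
from typing import Optional

# USD per second of generated video, from the pricing table; None means not offered.
PRICE_PER_SECOND = {
    "Gemini Omni Flash": {"720p": 0.10, "1080p": None, "4K": None},
    "Veo 3.1 Lite": {"720p": 0.05, "1080p": 0.08, "4K": None},
    "Veo 3.1 Fast": {"720p": 0.10, "1080p": 0.12, "4K": 0.30},
    "Veo 3.1": {"720p": 0.40, "1080p": 0.40, "4K": 0.60},
}


def clip_cost(model: str, resolution: str, seconds: float) -> Optional[float]:
    """Cost of one clip in USD, or None if the model lacks that resolution."""
    rate = PRICE_PER_SECOND[model][resolution]
    return None if rate is None else round(rate * seconds, 2)


# Compare a ten-second clip at 720p and at 4K across the four models.
for model in PRICE_PER_SECOND:
    print(model, clip_cost(model, "720p", 10), clip_cost(model, "4K", 10))
```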

Clips run 3 to 10 seconds at 720p in landscape (16:9) or portrait (9:16). As reference inputs, the model accepts up to seven images and up to three video clips of three seconds or less. Although it generates audio along with the video, it does not yet accept audio as input. The output is standard MP4, and each clip ships with a SynthID watermark and C2PA credentials.
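A team wiring this into a pipeline may want to catch out-of-range requests before paying for a generation. The following is a minimal sketch of such a pre-flight check built only from the limits listed above; `ClipRequest`, its field names and `validate` are hypothetical and do not reflect Google’s actual request schema.

```python
from dataclasses import dataclass, field

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}


@dataclass
class ClipRequest:
    """Hypothetical request shape; field names are illustrative, not Google's."""
    prompt: str
    duration_seconds: float
    aspect_ratio: str = "16:9"
    reference_images: list[str] = field(default_factory=list)
    reference_video_seconds: list[float] = field(default_factory=list)
    has_audio_input: bool = False


def validate(request: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not 3 <= request.duration_seconds <= 10:
        problems.append("duration must be between 3 and 10 seconds")
    if request.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect ratio must be 16:9 or 9:16")
    if len(request.reference_images) > 7:
        problems.append("at most seven reference images")
    if len(request.reference_video_seconds) > 3:
        problems.append("at most three reference video clips")
    if any(seconds > 3 for seconds in request.reference_video_seconds):
        problems.append("each reference video must be three seconds or less")
    if request.has_audio_input:
        problems.append("audio is not accepted as input")
    return problems


# Example: too long and in an unsupported aspect ratio.
request = ClipRequest(prompt="product shot on a wet street",
                      duration_seconds=12, aspect_ratio="1:1")
print(validate(request))
```

Returning a list of problems rather than raising on the first one lets a reviewer fix a brief in a single pass.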

When it comes to quality, the early signal is strong. On LMArena’s Text-to-Video Arena leaderboard, where people vote on head-to-head outputs from competing models, Omni Flash topped the list with a score of 1,527.

What it means for budgets and what’s still missing

With a real price in hand, the iteration story becomes concrete. Every conversational edit is another generation you pay for, so at about a dollar per ten-second pass at 720p, a heavy editing session adds up. What the model changes is not the cost of an edit but the number of wasted ones: because context carries over between turns, those generations mostly go toward refining a working shot rather than restarting from an empty prompt and hoping the next attempt lands.
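For a sense of scale, here is a back-of-the-envelope estimate. It assumes every edit pass is billed as a full regeneration of the clip at the quoted 720p rate; the workload (nine clips, six passes each) is an invented example, and `session_cost` is a hypothetical helper.

```python
OMNI_FLASH_USD_PER_SECOND = 0.10  # 720p rate quoted for Omni Flash


def session_cost(clip_seconds: float, passes_per_clip: int, clips: int = 1) -> float:
    """Total spend in USD when every edit pass regenerates the full clip."""
    total = OMNI_FLASH_USD_PER_SECOND * clip_seconds * passes_per_clip * clips
    return round(total, 2)


# A 90-second explainer built from nine 10-second clips, six passes each (assumed).
print(session_cost(clip_seconds=10, passes_per_clip=6, clips=9))
```

Under those assumptions the session lands in the tens of dollars, which is why the share of passes that actually improve the shot matters more than the per-second rate.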

Omni is not alone in this space. Veo 3.1 remains Google’s production-grade choice when you need higher resolution, and rivals from ByteDance, Alibaba and OpenAI are chasing the same budgets. What Omni adds is editability itself: the ability to treat video as a live document instead of a one-shot output.


