
Every GPU cluster has dead time. Training jobs end, workloads shift, and hardware goes dark while power and cooling costs keep running. For neocloud operators, those idle periods are lost margin.
The obvious solution is spot GPU markets – renting spare capacity to whoever needs it. But spot instances still leave the cloud vendor collecting rent on raw hardware, and the engineers who buy that capacity are still paying for raw compute, not results.
FriendliAI’s answer is different: serve inference directly on the idle hardware, optimize for token throughput, and share the revenue with the operator. FriendliAI was founded by researcher Byung-Gon Chun, whose work on continuous batching laid the foundation for vLLM, an open-source inference engine used in most production deployments today.
As a professor at Seoul National University for more than a decade, Chun studied the efficient execution of machine learning models at scale. That research culminated in a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically instead of waiting for a fixed batch to fill before executing. It is now an industry standard and a core mechanism inside vLLM.
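The idea behind continuous batching can be shown in a few lines. This is a minimal sketch of iteration-level scheduling – not FriendliAI’s or vLLM’s implementation – with made-up class and field names, and a counter standing in for actual token generation:

```python
from collections import deque

class ContinuousBatcher:
    """Toy iteration-level scheduler: requests join and leave the batch
    per decode step instead of waiting for a fixed batch to drain."""

    def __init__(self, max_batch=4):
        self.max_batch = max_batch
        self.queue = deque()   # requests waiting for a slot
        self.running = []      # requests in the current batch

    def submit(self, request_id, tokens_to_generate):
        self.queue.append({"id": request_id, "remaining": tokens_to_generate})

    def step(self):
        """One decode iteration: admit waiting requests into free slots,
        generate one token per running request, retire finished ones."""
        while self.queue and len(self.running) < self.max_batch:
            self.running.append(self.queue.popleft())
        finished = [r["id"] for r in self.running if r["remaining"] == 1]
        for r in self.running:
            r["remaining"] -= 1    # stand-in for one decode step
        self.running = [r for r in self.running if r["remaining"] > 0]
        return finished
```

The payoff is that a short request finishing early immediately frees its slot for a queued request, so the GPU never idles waiting for the longest request in a fixed batch.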
This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a share of the token revenue. The operator’s own jobs always take priority – when the operator’s scheduler reclaims a GPU, InferenceSense steps aside.
"What we’re suggesting is that instead of letting GPUs sit idle, operators can monetize those idle GPUs by serving inference," Chun told VentureBeat.
How the Seoul National University lab built the engine within vLLM
Chun founded FriendliAI in 2021, before most of the industry shifted its focus from training to inference. The company’s main product is a managed inference service for AI startups and enterprises working with open-weight models. FriendliAI also appears as a deployment option on Hugging Face alongside Azure, AWS, and GCP, and currently supports over 500,000 open models from the platform.
InferenceSense now extends that inference engine to the idle-capacity problem GPU operators face between workloads.
How it works
InferenceSense runs on top of Kubernetes, which most neocloud operators already use for resource orchestration. The operator allocates a GPU pool to a Kubernetes cluster managed by FriendliAI – declaring which nodes are available and under which conditions they can be reclaimed. Idle detection goes through Kubernetes itself.
"We have our own orchestrator running on the GPUs of these neocloud or cloud providers," Chun said. "We definitely use Kubernetes, but the application running on top is really a highly optimized inference stack."
When GPUs are not in use, InferenceSense spins up isolated containers serving paid inference workloads on open-weight models including DeepSeek, Qwen, Kimi, GLM, and MiniMax. When the operator’s scheduler needs the hardware back, the inference workloads are preempted and the GPUs are returned. FriendliAI says the handover happens within seconds.
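The serve-until-reclaimed flow described above can be sketched as a preemptible worker loop. This is an illustrative model only – the class, method names, and the use of a simple event flag are assumptions, not FriendliAI’s actual orchestration code:

```python
import threading

class PreemptibleWorker:
    """Toy model of an idle-GPU inference worker: serve paid requests
    until the operator's scheduler reclaims the hardware."""

    def __init__(self):
        self.reclaimed = threading.Event()  # set when the GPU is reclaimed
        self.served_tokens = 0

    def reclaim(self):
        """Called when the operator's own jobs need the GPU back."""
        self.reclaimed.set()

    def serve(self, max_steps):
        """Run inference steps; bail out as soon as a reclaim arrives."""
        for _ in range(max_steps):
            if self.reclaimed.is_set():
                return "preempted"   # drain the container, hand back the GPU
            self.served_tokens += 1  # stand-in for one decode iteration
        return "drained"
```

Checking the reclaim flag at step granularity is what makes a seconds-level handover plausible: the worker never has to finish a long batch before yielding the hardware.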
Demand is aggregated through FriendliAI’s direct customers and inference aggregators such as OpenRouter. The operator provides the capacity; FriendliAI manages the demand pipeline, model optimization, and serving stack. There are no upfront payments and no minimum commitments. A real-time dashboard shows operators which models are running, how many tokens are being processed, and how much revenue is being collected.
Why token throughput beats raw capacity rental
Spot GPU marketplaces from providers such as CoreWeave, Lambda Labs, and RunPod involve a cloud vendor leasing its hardware to a third party. InferenceSense runs on hardware the neocloud operator already owns: the operator decides which nodes participate and negotiates reclaim terms with FriendliAI in advance. The distinction matters – spot markets monetize capacity; InferenceSense monetizes tokens.
Token throughput per GPU-hour determines how much InferenceSense can earn during an idle window. FriendliAI claims its engine delivers two to three times the performance of a standard vLLM deployment, though Chun notes the number varies by workload. Most competing inference stacks are built on Python-based open-source frameworks; FriendliAI’s engine is written in C++ and uses custom GPU kernels rather than Nvidia’s cuDNN library. The company built its own model-representation layer to partition and execute models on hardware, with its own implementations of speculative decoding, quantization, and KV-cache management.
Because FriendliAI’s engine processes more tokens per GPU-hour than a standard vLLM stack, operators should see more revenue per idle cycle than they would running their own inference service on the same hardware.
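The economics are simple arithmetic. In this back-of-the-envelope sketch, every number – throughput, token price, idle hours, and revenue share – is a made-up assumption for illustration, not a FriendliAI figure; only the 2–3x throughput multiple comes from the company’s claim:

```python
def idle_revenue(tokens_per_gpu_hour, price_per_million_tokens,
                 idle_hours, operator_share):
    """Operator revenue from one GPU's idle window.

    All inputs are hypothetical; this only shows that revenue scales
    linearly with tokens served per GPU-hour.
    """
    tokens = tokens_per_gpu_hour * idle_hours
    gross = tokens / 1_000_000 * price_per_million_tokens
    return gross * operator_share

# Illustrative numbers: $0.50 per million tokens, 100 idle GPU-hours,
# a 50% operator share, and a 2.5x engine-throughput advantage.
baseline  = idle_revenue(1_000_000, 0.50, 100, 0.5)  # standard stack
optimized = idle_revenue(2_500_000, 0.50, 100, 0.5)  # 2.5x throughput
```

Under these assumptions the optimized engine earns exactly 2.5x the baseline in the same idle window, which is the whole argument for selling tokens rather than renting raw capacity.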
What AI engineers evaluating inference costs should consider
For AI engineers deciding where to run inference workloads, the neocloud vs. hyperscaler decision usually comes down to cost and availability.
InferenceSense adds a new consideration: if neoclouds can monetize idle capacity by serving inference, they have a stronger economic incentive to keep token prices competitive.
This is not a reason to change infrastructure decisions today – it is too early. But engineers tracking overall inference costs should watch whether neocloud adoption of platforms like InferenceSense puts downward pressure on API prices for models like DeepSeek and Qwen over the next 12 months.
"When we have more efficient suppliers, overall costs will go down," Chun said. "With InferenceSense, we can contribute to making these models cheaper."
