NVIDIA released Nemotron 3 Nano Omni on April 28, an open multimodal model that processes video, audio, images, and text in a single system. It uses a 30B-parameter hybrid mixture-of-experts architecture with roughly 3B active parameters per token and a 256K context window, and it adds Conv3D and event-based vision sampling for video. NVIDIA published weights on Hugging Face and made the model available through OpenRouter, build.nvidia.com, AWS SageMaker JumpStart, and 25+ partner platforms.

The technical claim is efficiency, not headline accuracy. NVIDIA reports roughly 9x higher throughput than other open omni models at equivalent interactivity, and the model tops six leaderboards covering document intelligence and video and audio understanding. The MoE design routes each token through only a small fraction of the network, roughly 3B of the 30B parameters. That is what lets a 30B model run with the latency profile of a much smaller dense one, which matters when the workload is an agent making many sequential calls.
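To make the routing idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not NVIDIA's implementation: the expert count, the value of k, the toy experts, and the names route_top_k and moe_forward are all hypothetical, chosen only to show that each token touches a few experts rather than the whole network. The only figures taken from the text are the 3B active and 30B total parameters.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Renormalize over the selected experts only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and return their weighted sum."""
    chosen = route_top_k(router_logits, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]


# Hypothetical toy setup: 8 experts, 2 active per token (not Nemotron's values).
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, s=s: s * x for s in range(1, NUM_EXPERTS + 1)]

# Stand-in for the router's scores for one token; a real router computes
# these from the token's hidden state.
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, router_logits, TOP_K)
print(f"experts run for this token: {sorted(used)} of {NUM_EXPERTS}")

# Active fraction implied by the figures quoted in the text.
active_fraction = 3e9 / 30e9
print(f"active parameter fraction: {active_fraction:.0%}")
```

The unselected experts are never called, so per-token compute scales with the active parameters rather than the total. The full set of weights still has to be held in memory, which is why the benefit shows up in throughput and latency rather than in model footprint.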

Open multimodal weights at this capability tier are still uncommon. Most agent stacks today either glue a closed multimodal API to a workflow engine, or chain a text LLM to separate vision and speech models. A single open model that handles text, vision, and audio with long context cuts orchestration cost and lets teams fine-tune the whole pipeline. Adopters listed at launch include Foxconn, Palantir, Eka Care, and Aible, with Dell, Oracle, Docusign, and Infosys evaluating.

Takeaway for learners: when you read a model release, separate the accuracy story from the efficiency story. Nemotron 3 Nano Omni is not claiming to be the smartest model; it is claiming to be the cheapest fast multimodal one. For agent workloads, that distinction is often what decides which model ships into production.