OBLAIDISH NEWS
Nvidia releases 2.6B-parameter SANA-WM for 1-minute 720p video generation

Nvidia's SANA-WM, a 2.6B-parameter open-source world model built on a transformer architecture, generates 1-minute 720p video and reports gains on generative video benchmarks [NVLabs]. It is available for research use.

Nvidia has released SANA-WM, a 2.6B-parameter open-source world model that generates 1-minute 720p video sequences [NVLabs]. Built on a transformer architecture, the model achieves state-of-the-art performance in autoregressive video generation, outperforming prior models in frame consistency and scene complexity at 720p resolution.

── What shipped ──

SANA-WM generates 720p video up to 60 seconds long. It was trained on a large-scale dataset of internet-sourced videos and uses a tokenized video approach, in which visual frames are encoded into discrete tokens and predicted autoregressively. At inference, it maintains temporal coherence better than previous open models, with less flicker and structural drift over long sequences. Nvidia released the model weights, training code, and inference pipeline on GitHub under a research license.
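
To make the "discrete tokens, predicted autoregressively" idea concrete, here is a minimal toy sketch in plain Python. It is not SANA-WM's code: the nearest-neighbor quantizer, the four-entry codebook, and the `toy_predictor` function are all hypothetical stand-ins for the learned video tokenizer and the transformer, which the release notes above do not describe in detail.

```python
from typing import Callable, List, Sequence


def quantize_frame(frame: Sequence[float], codebook: Sequence[float]) -> List[int]:
    """Map each value of a flattened frame to the index of its nearest codebook entry.

    Assumption: a real tokenizer would be a learned encoder; this is a toy stand-in.
    """
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))
        for value in frame
    ]


def generate(
    prompt_tokens: Sequence[int],
    predict_next: Callable[[List[int]], int],
    num_new_tokens: int,
) -> List[int]:
    """Autoregressive loop: each new token is predicted from all tokens so far."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        tokens.append(predict_next(tokens))
    return tokens


def toy_predictor(context: List[int], vocab_size: int = 4) -> int:
    """Hypothetical placeholder for a trained model: cycles through the vocabulary."""
    return (context[-1] + 1) % vocab_size


# Encode one tiny "frame" into tokens, then extend the sequence token by token.
codebook = [0.0, 0.33, 0.66, 1.0]
first_frame = quantize_frame([0.1, 0.9, 0.4, 0.6], codebook)
sequence = generate(first_frame, toy_predictor, num_new_tokens=4)
```

The point of the sketch is the shape of the loop: every generated token is appended to the context before the next prediction, which is why errors can accumulate as drift over a 60-second sequence.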

── Why it matters ──

SANA-WM pushes the boundary for open, high-resolution video generation. Most prior open models cap out at 10-30 seconds or run at lower resolution, which limits practical use. By extending to 60 seconds at 720p, SANA-WM enables longer-form content creation for research in simulation, storytelling, and synthetic data generation.

Open access allows developers to audit, fine-tune, and benchmark the model, all of which are critical for reproducibility. Unlike closed systems such as OpenAI’s Sora, SANA-WM permits modification and local deployment, which should accelerate community-driven improvements in video modeling.

The transformer backbone is further evidence that the architecture scales beyond language and image tasks. SANA-WM’s design draws from vision transformers and LLMs, using causal attention across spatiotemporal tokens, a method that may influence future multimodal models.
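
As a rough illustration of causal attention over spatiotemporal tokens, the sketch below builds a boolean attention mask for a video flattened in frame-major order. The function name `causal_mask`, the `frame_level` switch, and the flattening order are assumptions made for this example; the article does not say whether SANA-WM masks per token or per frame.

```python
from typing import List


def causal_mask(
    num_frames: int, tokens_per_frame: int, frame_level: bool = False
) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Tokens are assumed to be flattened frame by frame. With frame_level=False,
    each token sees itself and every earlier token. With frame_level=True, each
    token sees all tokens in its own frame and in earlier frames (block-causal).
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if frame_level:
                # Compare frame indices rather than token positions.
                row.append(k // tokens_per_frame <= q // tokens_per_frame)
            else:
                row.append(k <= q)
        mask.append(row)
    return mask


# Two frames of two tokens each, under both masking variants.
token_mask = causal_mask(num_frames=2, tokens_per_frame=2)
frame_mask = causal_mask(num_frames=2, tokens_per_frame=2, frame_level=True)
```

In both variants no token can attend to a future frame, which is the property that lets the model generate video one step at a time.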

Editor's take: This isn’t just a technical upgrade—it’s a strategic move to set the open standard for long-horizon video generation. By open-sourcing a high-capacity model now, Nvidia positions itself as a foundational player before commercial rollouts dominate the space.
