Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-Ω, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol.
We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-Ω uses only ∼30% of the GPU memory of its predecessor, which allows us to train it with 15× more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-Ω achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, e.g., improving camera estimation accuracy on Sintel by 77% over the previous best method. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding.
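To make the register-attention idea concrete, here is a minimal sketch of how a cross-frame layer might be restricted to register tokens. All names (RegisterAttention, num_registers, the token layout) are illustrative assumptions, not the authors' implementation; the point is only that the expensive step attends over F·R register tokens rather than all F·(R+P) patch tokens.

```python
# Minimal sketch (assumed, not the official implementation): inter-frame
# information exchange happens only between a few register tokens per frame,
# partially replacing full global attention over all patch tokens.
import torch
import torch.nn as nn


class RegisterAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, num_registers: int) -> torch.Tensor:
        # tokens: (B, F, R + P, D) -- B scenes, F frames, R register tokens
        # followed by P patch tokens per frame, D channels.
        B, F, N, D = tokens.shape
        registers = tokens[:, :, :num_registers]                   # (B, F, R, D)
        registers = registers.reshape(B, F * num_registers, D)     # (B, F*R, D)
        # Registers from all frames attend to each other; patch tokens are
        # excluded from this cross-frame step, which is what keeps it cheap.
        mixed, _ = self.attn(registers, registers, registers)
        out = tokens.clone()
        out[:, :, :num_registers] = mixed.reshape(B, F, num_registers, D)
        return out
```

In such a design, per-frame attention layers (not shown) would still let patch tokens exchange information with their frame's registers, so scene-level context can propagate through the registers even though only they communicate across frames.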
Reconstructions across static landmarks, dynamic action, aerial / FPV flights, indoor captures, and underwater scenes. Each row shows the input frames and an orbit render side-by-side on the left; on the right, an interactive viewer of the VGGT-Ω 4D reconstruction.
@article{wang2026vggtomega,
title = {VGGT-{$\Omega$}: Scaling Feed-Forward 3D Reconstruction for Static and Dynamic Scenes},
author = {Wang, Jianyuan and Chen, Minghao and Zhang, Shangzhan and Karaev, Nikita and Sch\"onberger, Johannes and Labatut, Patrick and Bojanowski, Piotr and Novotny, David and Vedaldi, Andrea and Rupprecht, Christian},
journal = {arXiv preprint},
year = {2026}
}
We greatly appreciate the support of the many people who have helped shape VGGT-Ω. We are currently preparing the full acknowledgements and will update this section soon.