WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

University of Virginia

Overview

TL;DR

Our self-supervised WildRayZer learns to render static novel views from dynamic input images without any 3D or ground-truth mask supervision. It extends RayZer, a state-of-the-art self-supervised large view synthesis model, to dynamic environments by adding a learned motion-mask estimator and a masked 3D scene encoder.

Abstract

We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
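As a rough sketch of the analysis-by-synthesis test described above (not the released implementation), the snippet below derives a soft pseudo motion mask by thresholding the photometric residual between the camera-only static rendering and the observed target frame; the function name, threshold, and smoothing kernel are illustrative assumptions.

import torch
import torch.nn.functional as F

def pseudo_motion_mask(static_render, target, threshold=0.1, kernel=5):
    # Hypothetical sketch: flag pixels the static model cannot explain.
    # static_render, target: (B, 3, H, W) images in [0, 1].
    # Returns a soft mask (B, 1, H, W): ~1 on likely transient content,
    # ~0 on well-explained static background.
    residual = (static_render - target).abs().mean(dim=1, keepdim=True)
    # Smooth the residual so isolated noisy pixels do not trigger the mask.
    residual = F.avg_pool2d(residual, kernel, stride=1, padding=kernel // 2)
    # Soft threshold: large residual -> mask value close to 1.
    return torch.sigmoid((residual - threshold) / 0.01)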

Dynamic RE10K Data Curation

We target novel view synthesis from dynamic, in-the-wild videos where both the camera and scene undergo motion. Rather than relying on controlled captures, we mine diverse handheld footage from public sources.

Data Curation Pipeline: Our pipeline proceeds in three stages: (1) Source identification — querying YouTube channels for real-estate walkthroughs and indoor pet-interaction videos; (2) Image-level filtering — assessing visual quality and removing videos with intrusive overlays; (3) Sequence extraction — detecting scene cuts and subdividing clips based on camera translation to ensure sufficient parallax.
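A minimal sketch of the camera-translation criterion in stage (3), assuming per-frame camera centers are available from an off-the-shelf SfM or odometry tool; the threshold value and helper name are assumptions for illustration.

import numpy as np

def split_by_translation(cam_positions, min_translation=0.5):
    # Hypothetical sketch: cut one scene-cut-free clip into sub-sequences
    # once the camera has translated far enough to provide parallax.
    # cam_positions: (N, 3) array of camera centers; returns (start, end) index pairs.
    segments, start, travelled = [], 0, 0.0
    for i in range(1, len(cam_positions)):
        travelled += np.linalg.norm(cam_positions[i] - cam_positions[i - 1])
        if travelled >= min_translation:
            segments.append((start, i))      # enough parallax accumulated
            start, travelled = i, 0.0
    if start < len(cam_positions) - 1:
        segments.append((start, len(cam_positions) - 1))  # keep the trailing segment
    return segments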

Benchmark Construction: We build two evaluation splits: D-RE10K Motion Mask, which provides motion annotations for 99 Internet video sequences, and D-RE10K-iPhone, a 50-sequence real-world paired transient/clean dataset captured with a tripod-mounted iPhone for sparse-view transient-aware NVS evaluation.

Examples

Sample sequences showcasing diverse indoor scenes with transient objects.

Method

WildRayZer Framework

WildRayZer self-supervised learning framework. (a) Training. A camera-only static renderer explains the rigid background; residuals between its renderings and the target views highlight dynamic regions, which a pseudo-motion-mask constructor sharpens into pseudo motion masks. A motion estimator is distilled from these pseudo-masks and used to gate dynamic image tokens before scene encoding; the same pseudo-masks also gate dynamic pixels in the rendering loss. (b) Inference. Given a set of dynamic input images, the model predicts camera parameters and motion masks in a single feed-forward pass. The motion estimator operates only on input views and masks out dynamic tokens, and the renderer synthesizes transient-free novel views from the inferred static scene.
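To make the gating concrete, here is a minimal sketch (under assumed tensor shapes and module boundaries, not the released code) of how a predicted motion mask could suppress dynamic image tokens before scene encoding and exclude dynamic pixels from the rendering loss.

import torch
import torch.nn.functional as F

def gate_tokens(tokens, motion_mask, patch=16):
    # tokens: (B, N, C) patch tokens from the image encoder.
    # motion_mask: (B, 1, H, W) soft mask, 1 = dynamic, 0 = static.
    # Pool the pixel-level mask to one score per patch token,
    # then zero out tokens whose patch is mostly dynamic.
    patch_scores = F.avg_pool2d(motion_mask, patch).flatten(1)   # (B, N)
    keep = (patch_scores < 0.5).float().unsqueeze(-1)            # (B, N, 1)
    return tokens * keep

def masked_rendering_loss(pred, target, motion_mask, eps=1e-6):
    # Photometric loss restricted to static pixels, so supervision
    # focuses on cross-view background completion.
    static = 1.0 - motion_mask                                   # (B, 1, H, W)
    per_pixel = (pred - target).abs()                            # (B, 3, H, W)
    return (per_pixel * static).sum() / (static.sum() * pred.shape[1] + eps)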

Results

Visual Comparison

Qualitative comparison of novel view synthesis results on D-RE10K and D-RE10K-iPhone.

Additional Results on DAVIS

DAVIS Qualitative Results

Qualitative results on the DAVIS dataset demonstrating transient-aware novel view synthesis.

Citation

@article{chen2026wildrayzerselfsupervisedlargeview,
      title={WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments}, 
      author={Xuweiyi Chen and Wentao Zhou and Zezhou Cheng},
      year={2026},
      eprint={2601.10716},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10716}, 
}

Acknowledgements

The authors acknowledge the Adobe Research Gift, the University of Virginia Research Computing and Data Analytics Center, Advanced Micro Devices AI and HPC Cluster Program, Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, and National Artificial Intelligence Research Resource (NAIRR) Pilot for computational resources, including the Anvil supercomputer (National Science Foundation award OAC 2005632) at Purdue University and the Delta and DeltaAI advanced computing resources (National Science Foundation award OAC 2005572).