On April 21, 2026, researchers at OpenImagingLab published AnyRecon, a 3D reconstruction system that builds explorable 3D scenes from sparse photo captures in 105 seconds. The leading alternative (Difix3D+) requires 1,200 seconds for the same task at lower quality. Code and model checkpoints are already publicly available.

What Happened

AnyRecon takes a handful of photos from different angles and reconstructs a navigable 3D environment, handling the irregular viewpoints and large gaps typical of handheld phone footage. On the DL3DV benchmark, it achieves a PSNR of 20.95 dB across interpolation and extrapolation tasks, outperforming Difix3D+ (17.88 dB), ViewCrafter (15.86 dB), and Uni3C (16.33 dB), while running 11 times faster than Difix3D+.
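
For context on those numbers: PSNR (peak signal-to-noise ratio) measures how closely rendered views match held-out ground-truth photos, and a 3 dB gain corresponds to roughly halving the mean squared error. A minimal sketch of the standard metric (the generic definition, not the benchmark's evaluation harness):

```python
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # images are identical
    return 10.0 * np.log10(max_val**2 / mse)
```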

The system is built on Wan2.1-I2V-14B, an open-source video diffusion model, fine-tuned using LoRA for geometry-controlled reconstruction. Model weights are available on HuggingFace.
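
For readers unfamiliar with LoRA: it freezes the pretrained weights and trains only small low-rank update matrices, which is what makes fine-tuning a 14B-parameter video model tractable. A minimal PyTorch sketch of the idea (the rank and scaling here are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # update starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In practice, adapters like this are attached to the attention projections of the base transformer, so only a small fraction of the 14B parameters is ever trained.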

Why It Matters

High-quality 3D scene reconstruction from casual footage is a long-standing challenge in VFX and game development. Existing tools require either structured capture setups or lengthy processing pipelines that take 20 minutes or more per scene segment. AnyRecon gets this down to under two minutes.

The practical applications are immediate. VFX artists scouting locations can turn phone footage into workable 3D environments for compositing or virtual production. Game developers can capture real-world spaces for reference or direct use. The system handles sequences longer than 200 frames, enough to reconstruct a full room, a building exterior, or a stretch of terrain from a walkthrough video.

Unlike methods that condition on just one or two reference frames, AnyRecon uses all available captures, which dramatically improves consistency in areas only partially visible from any given viewpoint. For a complementary approach to 3D generation from single images, TRELLIS.2 now runs on Apple Silicon and handles image-to-mesh conversion offline.

Key Details

Four innovations drive the speed and quality results:

  • Global Scene Memory: Stores all input reference frames as a persistent key-value cache inside the diffusion transformer, enabling conditioning on any number of views rather than just one or two (sketched in code after this list).
  • Non-Compressive Latent Encoding: Removes temporal compression that causes blurring when input frames span large viewpoint gaps.
  • 4-Step Diffusion Distillation: Compresses inference from 50 steps to 4, reducing generation time from 1,820 to 105 seconds with only a 0.24 dB quality drop (see the sampler sketch below).
  • Geometry-Aware View Selection: Uses 3D spatial overlap to choose which reference frames inform each reconstruction segment, improving accuracy under occlusion.
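
To make the first bullet concrete, here is a hedged PyTorch sketch of attention against a persistent key-value memory, assuming the scene memory behaves like extra cached entries in ordinary attention (the shapes and names are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def attend_with_scene_memory(
    q: torch.Tensor,      # queries for the segment being generated: (B, H, Lq, D)
    k: torch.Tensor,      # keys for the current segment:            (B, H, Lk, D)
    v: torch.Tensor,      # values for the current segment:          (B, H, Lk, D)
    mem_k: torch.Tensor,  # persistent keys cached from ALL reference frames
    mem_v: torch.Tensor,  # persistent values cached from ALL reference frames
) -> torch.Tensor:
    # Every generated token can attend to every cached reference-frame token,
    # so conditioning is not limited to one or two views.
    k_all = torch.cat([mem_k, k], dim=2)
    v_all = torch.cat([mem_v, v], dim=2)
    return F.scaled_dot_product_attention(q, k_all, v_all)
```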

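The third bullet's speedup comes from the sampler rather than the architecture: a distilled student is queried at a handful of noise levels instead of fifty. A generic sketch of such a few-step loop, where `student` is a hypothetical distilled denoiser and the schedule values are made up:

```python
import torch

@torch.no_grad()
def sample_few_step(student, cond, shape, sigmas=(1.0, 0.75, 0.5, 0.25, 0.0)):
    """Generic few-step sampling in the distillation style: four denoiser
    calls instead of fifty. `student(x, sigma, cond)` predicts the clean latent."""
    x = sigmas[0] * torch.randn(shape)  # start from pure noise
    for sigma, next_sigma in zip(sigmas[:-1], sigmas[1:]):
        x0 = student(x, sigma, cond)                # one denoising call
        x = x0 + next_sigma * torch.randn_like(x0)  # re-noise to the next level
    return x
```
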
The system operates as a closed feedback loop: generated views continuously update a shared 3D geometry memory, which then guides generation of the next scene segment, preventing drift across long trajectories.
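
Geometry-aware view selection is the glue in that loop: the shared geometry determines which cached frames best overlap the segment being generated. A toy NumPy version under strong simplifying assumptions (cone-shaped visibility, point-cloud geometry, invented names):

```python
import numpy as np

def select_references(ref_poses, target_pose, points, k=4, cos_half_fov=0.5):
    """Rank candidate reference cameras by how many 3D points they see in
    common with the target view, then keep the top-k. A pose is a
    (position, unit_view_direction) pair; `points` is an (N, 3) array."""
    def visible(pose):
        position, direction = pose
        rays = points - position
        rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)
        return rays @ direction > cos_half_fov  # point lies inside the view cone
    target_visible = visible(target_pose)
    overlap = [int(np.sum(visible(p) & target_visible)) for p in ref_poses]
    return np.argsort(overlap)[::-1][:k]  # indices of the best-overlapping frames
```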

What to Do Next

Code and model weights are publicly available. Visit the project page to see side-by-side video comparisons against competing methods, then check the GitHub repository for setup instructions. The full paper covers the architecture in detail.

Running AnyRecon requires substantial GPU memory to support a Wan2.1-scale video diffusion pipeline. Community-optimized versions with quantized weights are likely to follow. For production-ready 3D spatial capture without the GPU overhead, Sony XYN Spatial Capture offers a hardware-accelerated alternative aimed at virtual production.