Researchers from Google, Cornell University, and Stanford University submitted CityRAG to arXiv on April 21, 2026, demonstrating an AI system that generates minutes-long, navigable video walkthroughs of real city streets from a single input photo. The model uses actual building layouts, street geometry, and turn radii from real-world locations, producing walkthroughs that stay accurate even around corners not visible in the source image.
For the broader landscape, see our complete guide to AI video generation in 2026.
What Happened
CityRAG fine-tunes the Wan 2.1 14-billion-parameter image-to-video model on 5.5 million geo-registered Google Street View panoramas from 10 cities. A user provides a street-level photo and a desired camera trajectory; the system retrieves geo-registered Street View data for that location and generates a photorealistic walkthrough video that faithfully matches the real-world geography, including building facades, road configurations, and traffic infrastructure.
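In code terms, that pipeline amounts to a nearest-neighbor lookup over geo-registered panoramas followed by a conditioned generation call. The sketch below is only an illustration of that flow, not the authors' implementation: every name in it (Panorama, retrieve_panoramas, generate_segment) is hypothetical, and the model call is stubbed out with dummy frames so the script runs end to end.

```python
# Hypothetical sketch of a retrieval-augmented walkthrough pipeline.
# Names are illustrative, not the paper's actual API.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt


@dataclass
class Panorama:
    pano_id: str
    lat: float
    lon: float          # geo-registration: every panorama carries coordinates
    heading_deg: float  # camera heading at capture time


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))


def retrieve_panoramas(index, lat, lon, k=4):
    """Retrieval step: pick the k geo-registered panoramas nearest the query point."""
    return sorted(index, key=lambda p: haversine_m(p.lat, p.lon, lat, lon))[:k]


def generate_segment(input_photo, trajectory, context_panos, num_frames=73):
    """Placeholder for the fine-tuned image-to-video model.

    A real implementation would condition the diffusion model on the input
    photo, the requested camera trajectory, and the retrieved panoramas.
    Here we return dummy frames so the sketch executes.
    """
    return [f"frame_{i}" for i in range(num_frames)]


if __name__ == "__main__":
    # Tiny toy index standing in for the 5.5M-panorama corpus.
    index = [
        Panorama("a", 21.3069, -157.8583, 90.0),   # Honolulu
        Panorama("b", 21.3099, -157.8581, 180.0),
        Panorama("c", 48.8566, 2.3522, 0.0),        # Paris
    ]
    context = retrieve_panoramas(index, lat=21.3070, lon=-157.8584, k=2)
    frames = generate_segment("street_photo.jpg", "walk 50m, turn right", context)
    print([p.pano_id for p in context], len(frames))
```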
The research team tested the system with AI-modified input images, including a snow-covered version of Honolulu streets. CityRAG still rendered the correct building layouts and street configurations of the actual location, even when the input was clearly artificial. Sequences chain autoregressively for 1,000-plus frames, enabling multi-minute city walkthroughs with loop-closure capability for circular camera moves.
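The autoregressive chaining can be pictured as a loop that feeds the last generated frame back in as the conditioning image for the next 73-frame segment until the target length is reached. The sketch below is an assumption-laden illustration of that idea, not the paper's code; the generation call is a stub, and the loop-closure step is simplified to passing the starting view as an extra conditioning anchor for the final segment.

```python
# Hypothetical sketch of autoregressive segment chaining with loop closure.
SEGMENT_LEN = 73        # frames per generated clip, per the paper's setup
TARGET_FRAMES = 1000    # chain until the walkthrough passes ~1,000 frames


def generate_segment(anchor_frame, end_anchor=None, num_frames=SEGMENT_LEN):
    """Stub for the fine-tuned image-to-video model. A real call would condition
    on anchor_frame plus retrieved Street View context for the current position;
    end_anchor stands in for steering the last segment back toward the start view."""
    return [f"{anchor_frame}->{end_anchor or 'next'}:{i}" for i in range(num_frames)]


def chain_walkthrough(start_frame, target_frames=TARGET_FRAMES):
    frames = [start_frame]
    while len(frames) < target_frames:
        # Autoregressive step: the last frame generated so far becomes the
        # conditioning image for the next 73-frame segment.
        frames.extend(generate_segment(anchor_frame=frames[-1]))
    return frames


def close_loop(frames, start_frame):
    """Loop closure (simplified): generate one final segment conditioned on both
    the current last frame and the original starting view, so the camera returns
    to a geometrically consistent starting point for a circular shot."""
    frames.extend(generate_segment(anchor_frame=frames[-1], end_anchor=start_frame))
    return frames


if __name__ == "__main__":
    video = chain_walkthrough("input_photo")
    video = close_loop(video, "input_photo")
    print(len(video), "frames")
```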
Why It Matters
For filmmakers, game designers, and location-dependent visual creators, the persistent challenge is grounding concept art in physical reality. Tools like AnyRecon can already reconstruct static 3D scenes from sparse photos, but CityRAG goes further by generating navigable video grounded in real geography. Knowing what a location looks like on a map is different from being able to visualize a camera moving through it at street level. CityRAG bridges that gap by generating walkthrough video matched to actual built environments across 10 cities including Paris, San Francisco, London, Honolulu, and Athens.
The loop-closure capability is directly relevant to production work: a camera can return to its starting point with geometrically consistent architecture, enabling circular establishing shots or seamless background loops without manually stitching footage.
Key Details
- Base model: Wan 2.1 (14B), fine-tuned on 5.5M Google Street View panoramas
- Output: 480p (832x480), 73 frames per segment, chainable to minutes-long sequences
- Cities covered: Paris, San Francisco, London, Honolulu, Athens, San Juan, Anchorage, Hyderabad, Philadelphia, São Paulo
- Training scale: 32 A100 GPUs, one week (~20,000 iterations)
- Availability: Research paper only. No public tool, API, or code release yet.
What to Do Next
CityRAG is not publicly available. The researchers used Google Street View data under a licensing arrangement that is not redistributable, and no release timeline has been announced. The project page has video demos across all 10 cities, which are worth reviewing if your work involves location-based pre-visualization, virtual production, or world-building for games and film.
Google is also investing in Maps-based AI through Gemini-powered 3D navigation in Google Maps. Watch the arXiv paper (2604.19741) and project page for future code or API releases. Given that the research team includes Google engineers and the project depends on Street View data, a Cloud-integrated or commercial release path is plausible, though none has been announced.