Nvidia LocateAnything: 10x Faster Vision Grounding

Nvidia released LocateAnything, a new vision-language grounding model, on May 27, 2026. The model achieves object detection and visual grounding 10 times faster than Qwen3-VL and 2.5 times faster than Rex-Omni while improving accuracy across GUI grounding, document understanding, and image localization benchmarks. The nvidia/LocateAnything-3B model (4 billion parameters) is available now on HuggingFace with a live demo.

What Happened

Researchers at Nvidia, PolyU, Princeton, NJU, and UIUC published the LocateAnything paper (arXiv:2605.27365) introducing a new approach to visual grounding called Parallel Box Decoding (PBD). The core problem PBD solves: standard vision-language models decode bounding boxes as a sequence of individual tokens, which is slow and loses the geometric relationships between box coordinates.

LocateAnything treats each bounding box as a single atomic unit decoded in one step rather than sequentially. This parallel approach enables both faster inference and higher localization accuracy, particularly at high overlap thresholds where coordinate precision matters most.

Nvidia trained the model on LocateAnything-Data, a custom dataset containing more than 138 million training samples covering diverse grounding tasks: natural images, GUI screenshots, document pages, and OCR localization. The full technical details are available on the Nvidia Research page.

Why It Matters

Visual grounding, the ability to locate specific objects or regions in an image based on text descriptions, is a foundational capability for AI-powered creative workflows. Every time an AI tool selects a subject for background removal, identifies UI elements for automation, or processes a design document, it is doing some form of visual grounding.

LocateAnything is not a consumer product, but its performance profile matters to creators in two ways. First, it sets a new speed-accuracy baseline that downstream tools will adopt. Second, it is already directly usable through HuggingFace for creators building custom AI pipelines, automations, or prototypes.

The 10x throughput advantage over Qwen3-VL is significant for batch workflows: processing hundreds of product images, labeling large design asset libraries, or running real-time GUI automation at scale all become faster and cheaper when the underlying model is faster.

How Parallel Box Decoding Works

Standard vision-language models decode bounding boxes by generating four coordinate tokens sequentially: x1, y1, x2, y2. Each token depends on the previous one, which creates a bottleneck. More critically, sequential decoding can produce geometrically inconsistent boxes: the model may generate a valid x1 and then pick an x2 that does not fit the box's expected aspect ratio.

[Figure: LocateAnything parallel box decoding for vision grounding]

Parallel Box Decoding generates all four coordinates simultaneously as a single unit. This preserves geometric coherence (the model reasons about the full box at once) and enables the hardware-level parallelism that makes inference faster. The result is what the researchers call improvements at the "speed-accuracy frontier" across diverse grounding benchmarks.
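To make the difference in decoding steps concrete, here is a toy sketch in Python. It is not the paper's implementation: the function names are hypothetical, and it assumes one decoder step per coordinate token in the sequential case and one step per box in the parallel case. Real-world speedups depend on much more than step count.

```python
# Toy step-count model for illustration only; not the LocateAnything code.
COORDS_PER_BOX = 4  # x1, y1, x2, y2

Box = tuple[float, float, float, float]


def sequential_steps(num_boxes: int) -> int:
    """Assumption: one decoder step per coordinate token, each waiting on the last."""
    return num_boxes * COORDS_PER_BOX


def parallel_steps(num_boxes: int) -> int:
    """Assumption: all four coordinates of a box are emitted in a single step."""
    return num_boxes


def is_well_formed(box: Box) -> bool:
    """One simple consistency check: the box must have positive width and height."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


if __name__ == "__main__":
    n = 50
    print(sequential_steps(n), parallel_steps(n))  # 200 50
    print(is_well_formed((10, 20, 5, 40)))  # False: x2 is left of x1
```

Under these assumptions, 50 boxes take 200 sequential steps but only 50 parallel ones. The second check shows the kind of inconsistent box that coordinate-by-coordinate decoding can, in principle, emit.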

Performance Benchmarks

Benchmark	LocateAnything	Comparison	Improvement
LVIS (mean F1)	+3.8 over Rex-Omni	Rex-Omni (baseline)	+3.8 F1
LVIS IoU@0.95 (F1)	31.1	20.7 (Rex-Omni)	+10.4 F1
COCO (mean F1)	+1.8 over Rex-Omni	Rex-Omni (baseline)	+1.8 F1
ScreenSpot-Pro (GUI, mean F1)	60.3	Prior SOTA	New state-of-the-art
DocLayNet (documents, mean F1)	76.8	Prior SOTA	New state-of-the-art
Throughput (boxes per second)	12.7 BPS	1.1 BPS (Qwen3-VL)	11.5x faster

[Figure: LocateAnything benchmark performance comparison]

The LVIS IoU@0.95 jump from 20.7 to 31.1 is particularly significant for creators: high IoU thresholds test whether bounding boxes are tight and precise, not just loosely correct. Tight boxes matter for masking, cutout tools, and any workflow where an inaccurate selection creates extra cleanup work.
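For readers who have not worked with IoU (intersection over union) before, the short Python sketch below shows why a 0.95 threshold is so demanding. The boxes are made-up pixel coordinates chosen for illustration, not benchmark data.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    ground_truth = (100, 100, 300, 300)
    loose = (95, 95, 305, 305)   # 5 px too large on every side
    tight = (101, 101, 300, 300)  # 1 px off on two sides

    print(f"{iou(ground_truth, loose):.3f}")  # 0.907
    print(f"{iou(ground_truth, tight):.3f}")  # 0.990
```

A box that is only 5 pixels too generous on each side of a 200-pixel object already drops to about 0.907, so it counts as a hit at IoU 0.5 but a miss at IoU 0.95.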

The ScreenSpot-Pro benchmark covers GUI grounding: locating UI elements (buttons, input fields, icons) from text descriptions. A state-of-the-art 60.3 mean F1 indicates that LocateAnything identifies interface elements more reliably than prior models, even in complex, dense UI layouts.

Creator Workflow Use Cases

LocateAnything is a research model, but its capabilities map directly to three creator workflow problems:

[Figure: LocateAnything object detection for creator workflows]

GUI automation and AI agents: Any AI workflow that involves clicking through design tools, web apps, or creative software requires a model to locate interface elements from natural language descriptions. LocateAnything's SOTA performance on ScreenSpot-Pro suggests it can identify UI targets in tools like Figma, Canva, or Adobe apps with fewer misses and false positives than earlier models. Creators building AI-assisted automation scripts or using computer-use agents benefit from the higher precision at scale.

Document and asset processing: Designers, brand managers, and creative directors work with large libraries of PDFs, brand guidelines, spec sheets, and reference documents. LocateAnything's 76.8 F1 on DocLayNet demonstrates strong document layout understanding: it can locate charts, tables, figures, and text blocks by description. This feeds directly into AI workflows that extract data from design docs or process layout information at scale.

Image editing and masking: The LVIS and COCO improvements, specifically the gains at high IoU thresholds, indicate tighter, more precise object localization. For AI-assisted photo editing, precise localization reduces the manual cleanup required after an AI-generated selection or mask. Tools that use localization models for subject separation, background removal, or region-aware editing will benefit from the accuracy improvements LocateAnything introduces.
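As a small illustration of how a predicted box might feed a cropping or masking step, the Python sketch below converts a box into a clamped pixel rectangle. It assumes the box arrives as normalized (x1, y1, x2, y2) values between 0 and 1, which is an assumption for this example and not a documented LocateAnything output format. The helper name `box_to_pixel_rect` is hypothetical.

```python
import math


def box_to_pixel_rect(box, width, height, pad=0):
    """Convert a normalized (x1, y1, x2, y2) box into a pixel rectangle.

    Assumption: coordinates are fractions of image width/height in [0, 1].
    `pad` adds a safety margin in pixels; the result is clamped to the image.
    """
    x1, y1, x2, y2 = box
    left = max(0, math.floor(x1 * width) - pad)
    top = max(0, math.floor(y1 * height) - pad)
    right = min(width, math.ceil(x2 * width) + pad)
    bottom = min(height, math.ceil(y2 * height) + pad)
    return left, top, right, bottom


if __name__ == "__main__":
    box = (0.25, 0.125, 0.75, 0.875)
    print(box_to_pixel_rect(box, 800, 600))          # (200, 75, 600, 525)
    print(box_to_pixel_rect(box, 800, 600, pad=10))  # (190, 65, 610, 535)
```

The tighter the predicted box, the smaller the padding a pipeline needs before handing the region to a segmentation or cutout step, which is where the cleanup savings come from.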

For related Nvidia AI work relevant to creative workflows, see our deep dive on NVIGI 1.6 and AI-powered NPC workflows covering Nvidia's recent applied AI releases.

How to Try LocateAnything

The model is available immediately through two channels:

HuggingFace demo: The fastest way to test LocateAnything without setup. The nvidia/LocateAnything demo Space accepts an image and a text query and returns bounding box predictions. No API key is required; you only need a HuggingFace login.

Model download: The nvidia/LocateAnything-3B model (4B parameters) is downloadable for local inference or API integration. This is the path for creators building custom pipelines, batch-processing workflows, or integrating localization into their tooling.
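A minimal sketch of the download step in Python, assuming the `huggingface_hub` package is installed and that the repository name matches the one given in this article. The local directory name is arbitrary. This only fetches the files; how to load the model and run inference should be taken from the model card, which this sketch does not try to guess.

```python
REPO_ID = "nvidia/LocateAnything-3B"  # repository name as given in this article


def download_model(local_dir: str = "./locateanything-3b") -> str:
    """Download the model repository and return the local path.

    Requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so the module can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_model()
    print(f"Model files saved to {path}")
```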

The dataset (LocateAnything-Data, 138M+ samples) is listed as incoming on the official research page as of May 28, 2026.

What to Do Next

If you are evaluating LocateAnything for a creative workflow, the HuggingFace demo is the fastest first step. Upload a representative sample from your actual use case (a product photo, a UI screenshot, a document page) and test the grounding quality before committing to integration.

For developers and creators building on top of vision-language models, the arXiv paper includes ablation studies that show which components of PBD contribute most to the accuracy and speed gains. The full paper is at arxiv.org/abs/2605.27365. The Nvidia Research project page includes additional benchmark comparisons and methodology notes.

Frequently Asked Questions

What is Nvidia LocateAnything?

Nvidia LocateAnything is a vision-language model for visual grounding and object detection. It locates specific objects, UI elements, or document regions in images based on text descriptions. The model uses Parallel Box Decoding, a new technique that decodes bounding boxes in a single step rather than sequentially, achieving 10x higher throughput than Qwen3-VL with improved accuracy.

How fast is LocateAnything compared to other models?

LocateAnything achieves 12.7 boxes per second (BPS), compared to 1.1 BPS for Qwen3-VL (11.5x faster) and 5.0 BPS for quantized Rex-Omni (2.5x faster). This throughput advantage translates directly to cost and latency savings for batch processing workflows.
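The arithmetic behind those ratios, and a rough batch estimate, can be sketched in a few lines of Python. The throughput figures are the ones quoted above. The batch size (500 images with 10 boxes each) is a made-up example, and the estimate assumes throughput stays constant and matches the benchmark hardware, which may not hold in practice.

```python
# Boxes per second (BPS) as quoted in this article.
BPS = {
    "LocateAnything": 12.7,
    "Rex-Omni (quantized)": 5.0,
    "Qwen3-VL": 1.1,
}


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two models."""
    return BPS[fast] / BPS[slow]


def batch_seconds(model: str, num_images: int, boxes_per_image: int) -> float:
    """Rough time estimate, assuming constant throughput across the batch."""
    return num_images * boxes_per_image / BPS[model]


if __name__ == "__main__":
    print(round(speedup("LocateAnything", "Qwen3-VL"), 1))              # 11.5
    print(round(speedup("LocateAnything", "Rex-Omni (quantized)"), 1))  # 2.5
    # Hypothetical batch: 500 images, 10 boxes each.
    print(round(batch_seconds("LocateAnything", 500, 10)))  # 394
    print(round(batch_seconds("Qwen3-VL", 500, 10)))        # 4545
```

Under these assumptions, the same batch drops from roughly 76 minutes to under 7 minutes.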

What tasks can LocateAnything perform?

LocateAnything handles four main grounding tasks: natural image object detection, GUI grounding (locating UI elements in screenshots), document layout understanding, and OCR localization. It achieves state-of-the-art results on the ScreenSpot-Pro GUI benchmark (60.3 mean F1) and DocLayNet document benchmark (76.8 mean F1).

Can creators access LocateAnything without coding?

Yes. The HuggingFace demo at huggingface.co/spaces/nvidia/LocateAnything lets you test the model by uploading an image and entering a text query. No setup or API keys are needed beyond a HuggingFace account. Downloading the model for local use or API integration requires more technical setup.

What is Parallel Box Decoding?

Parallel Box Decoding (PBD) is the key innovation in LocateAnything. Standard vision-language models generate bounding box coordinates one token at a time (x1, then y1, then x2, then y2), which is slow and can produce geometrically inconsistent boxes. PBD generates all coordinates simultaneously as a single unit, preserving geometric relationships and enabling faster parallel computation on hardware.

Is LocateAnything suitable for real-time applications?

At 12.7 BPS on standard benchmarks, LocateAnything is significantly faster than prior unified grounding models. For real-time applications, the inference speed depends on hardware and batch size. The HuggingFace model page includes configuration details for optimizing deployment.