LocateAnything-3B
A 3B vision-language grounding model from NVIDIA's LPR group. It introduces Parallel Box Decoding, a non-autoregressive box-output head that improves throughput by roughly 2.5× over standard VLM detection, and it handles object detection, phrase grounding, GUI element grounding, and OCR in a single model.
Trained on 12M images / 785M boxes. Open weights; project page includes demos and benchmarks.
Model Details
Parameters: 3B