Falcon Perception
Early-fusion dense Transformer for open-vocabulary grounding and segmentation from natural-language prompts. Image patches and text tokens share one parameter space from the first layer, under a hybrid attention mask (bidirectional over image tokens, causal over text/task tokens). Decoding uses a "Chain-of-Perception" autoregressive interface: each detected instance emits <coord> → <size> → <seg>, with coordinates encoded via Fourier features and segmentation masks produced from a dot product against upsampled image features.
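The hybrid mask and the Fourier coordinate encoding described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the token ordering (image tokens first, then text/task tokens), and the power-of-two frequency bands are all assumptions made for illustration.

```python
import math


def hybrid_attention_mask(n_image, n_text):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout (not confirmed by the source): image tokens occupy
    positions [0, n_image), text/task tokens follow.
    """
    n = n_image + n_text
    return [
        [
            # Image queries see every image key (bidirectional block);
            # every other pair falls back to the causal rule k <= q.
            (q < n_image and k < n_image) or k <= q
            for k in range(n)
        ]
        for q in range(n)
    ]


def fourier_features(coords, num_bands=4):
    """Encode normalized coordinates in [0, 1] as sin/cos features.

    The frequencies 2**b * pi are a common choice, assumed here; the
    model's actual frequency schedule is not given in the source.
    """
    feats = []
    for c in coords:
        for b in range(num_bands):
            angle = (2 ** b) * math.pi * c
            feats.extend([math.sin(angle), math.cos(angle)])
    return feats


if __name__ == "__main__":
    # Two image tokens followed by two text tokens.
    for row in hybrid_attention_mask(n_image=2, n_text=2):
        print("".join("1" if ok else "0" for ok in row))
    # 1100
    # 1100
    # 1110
    # 1111
    print(len(fourier_features([0.25, 0.75])))  # 2 coords * 4 bands * 2 = 16
```

In the printed mask, the top-left block of ones is the bidirectional image region, while the text rows see all image tokens plus only the text tokens at or before their own position.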
Two open-weight checkpoints: Falcon Perception (0.6B) for full grounding + segmentation, and Falcon OCR (0.3B) specialised for document layout + OCR. Trained on 54M images with 195M positive expressions and 488M hard negatives in three stages, using multi-teacher distillation from DINOv3 + SigLIP2 and ensemble consensus validation (SAM 3 + Qwen3-VL-30B + Moondream3) — 700 GPU-hours total.
Benchmark headline: SA-Co 68.0 Macro-F1 (vs SAM 3's 62.3); on the new PBench diagnostic the lead widens with prompt complexity — +13.4 on OCR-guided, +21.9 on spatial, +15.8 on relations, +14.2 on dense scenes. Falcon OCR reports olmOCR 80.3 and OmniDocBench 88.6; full layout+OCR throughput is 2.9 images/s on H100.
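For readers unfamiliar with the headline metric, the sketch below shows generic macro-averaged F1: compute F1 per category, then take the unweighted mean. The per-category counts are made-up illustration data, and SA-Co's exact matching and thresholding protocol is not described here, so treat this as the textbook definition rather than the benchmark's scoring code.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a category with no predictions and no
    # ground truth scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-category F1; counts maps name -> (tp, fp, fn)."""
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts):
    """F1 over pooled counts, shown only for contrast with the macro average."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


if __name__ == "__main__":
    # Hypothetical counts: one frequent, easy category and one rare, hard one.
    counts = {"frequent": (8, 2, 2), "rare": (1, 1, 3)}
    print(round(macro_f1(counts), 3))  # 0.567
    print(round(micro_f1(counts), 3))  # 0.692
    # Margin over SAM 3 implied by the reported SA-Co scores.
    print(round(68.0 - 62.3, 1))  # 5.7
```

Because every category counts equally, the macro average drops when rare categories are handled poorly, which the pooled (micro) score hides.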
Companion release: PBench, a capability-stratified diagnostic benchmark (levels L0–L4 plus Dense). Docker, vLLM server, and MLX (Apple Silicon) support ship alongside. Licensed under CC-BY 4.0.
Model Details
Benchmark Scores
| Benchmark | Score | Mode |
|---|---|---|
| SA-Co Macro-F1 | 68.0 | — |
| PBench L2 OCR-guided | 38.0 | — |
| PBench L3 spatial | 53.5 | — |
| PBench Dense (100s instances) | 72.6 | — |
| olmOCR (Falcon OCR) | 80.3 | — |
| OmniDocBench (Falcon OCR) | 88.6 | — |
Variants
| Name | Parameters | Notes |
|---|---|---|
| Falcon Perception | 0.6B | Full grounding + segmentation; SA-Co 68.0 Macro-F1 (vs SAM 3's 62.3) |
| Falcon Perception 300M | 0.3B | Detection-only (bounding boxes); no segmentation head |
| Falcon OCR | 0.3B | Document layout + OCR; olmOCR 80.3, OmniDocBench 88.6 |