Transformer in Transformer (TNT)
model, paper
Vision architecture that treats image patches as "visual sentences" and further divides them into smaller sub-patches as "visual words," enabling attention at both granularities. Achieves 81.5% top-1 accuracy on ImageNet, about 1.7% higher than existing vision transformers at a similar computational cost. Published at NeurIPS 2021.
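A minimal sketch of how this two-level attention could be structured, assuming a PyTorch implementation: an inner transformer attends over the "visual word" (sub-patch) embeddings within each patch, its output is projected and added to the corresponding "visual sentence" (patch) embedding, and an outer transformer attends across all patch embeddings. The class name, dimensions, and use of standard nn.TransformerEncoderLayer modules are illustrative assumptions, not the paper's official implementation.

```python
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    """Illustrative TNT-style block: word-level attention inside each patch,
    word-to-sentence fusion, then sentence-level attention across patches."""

    def __init__(self, word_dim=24, sentence_dim=384, words_per_patch=16, num_heads=6):
        super().__init__()
        # Inner transformer: attends over the "visual words" of one patch.
        self.inner = nn.TransformerEncoderLayer(
            d_model=word_dim, nhead=4,
            dim_feedforward=4 * word_dim, batch_first=True,
        )
        # Projects the flattened word embeddings into the sentence dimension.
        self.word_to_sentence = nn.Linear(words_per_patch * word_dim, sentence_dim)
        # Outer transformer: attends over the "visual sentences" (patches).
        self.outer = nn.TransformerEncoderLayer(
            d_model=sentence_dim, nhead=num_heads,
            dim_feedforward=4 * sentence_dim, batch_first=True,
        )

    def forward(self, words, sentences):
        # words:     (batch * num_patches, words_per_patch, word_dim)
        # sentences: (batch, num_patches, sentence_dim)
        b, n, _ = sentences.shape
        words = self.inner(words)                         # word-level attention
        fused = self.word_to_sentence(words.flatten(1))   # (batch * num_patches, sentence_dim)
        sentences = sentences + fused.view(b, n, -1)      # inject word info into sentences
        sentences = self.outer(sentences)                 # sentence-level attention
        return words, sentences


# Toy usage: 4 images, 196 patches, 16 sub-patches ("words") per patch.
if __name__ == "__main__":
    words = torch.randn(4 * 196, 16, 24)
    sentences = torch.randn(4, 196, 384)
    block = TNTBlock()
    words, sentences = block(words, sentences)
    print(sentences.shape)  # torch.Size([4, 196, 384])
```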
Outputs (2)
TNT (model): Transformer in Transformer
Paper: arXiv:2103.00112