BERT
Bidirectional Encoder Representations from Transformers. Pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers via masked language modeling. 110M (Base) and 340M (Large) parameters.
BERT revolutionized NLP, pushing GLUE to 80.5% (a 7.7-point absolute improvement) and SQuAD v1.1 to 93.2 F1. It spawned an entire generation of models (RoBERTa, ALBERT, DeBERTa, XLNet) and became the dominant approach for search, classification, and NER. NAACL 2019. ~100K+ citations. By Devlin, Chang, Lee, and Toutanova. Apache 2.0.
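The masked-language-modeling corruption described above can be sketched in a few lines. This is an illustrative reimplementation of the paper's token-masking scheme (select ~15% of positions; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged), not the authors' code; `mask_tokens` and the toy vocabulary are hypothetical names introduced here.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat"]  # stand-in for a real wordpiece vocab

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Apply BERT-style masked-LM corruption to a token list.

    Returns (corrupted_tokens, targets), where targets is a list of
    (position, original_token) pairs the model must predict.
    """
    rng = rng or random.Random(0)
    out, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:          # select ~15% of positions
            targets.append((i, tok))
            r = rng.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                out[i] = MASK
            elif r < 0.9:                      # 10%: replace with a random token
                out[i] = rng.choice(TOY_VOCAB)
            # remaining 10%: keep the original token (still predicted)
    return out, targets
```

The 10% random / 10% unchanged cases keep the pretraining input distribution closer to fine-tuning inputs, where [MASK] never appears.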
Model Details
Architecture: Dense
Parameters: 340M (Large) / 110M (Base)
Paper
arXiv: 1810.04805
Venue: NAACL 2019