GLADIS: A General and Large Acronym Disambiguation Benchmark - Département Informatique et Réseaux
Conference paper, Year: 2023

GLADIS: A General and Large Acronym Disambiguation Benchmark

Abstract

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.
Main file
2302.01860.pdf (694.67 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04039173, version 1 (21-03-2023)

License

Attribution - NonCommercial - ShareAlike (CC BY-NC-SA)

Identifiers

  • HAL Id: hal-04039173, version 1

Cite

Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek. GLADIS: A General and Large Acronym Disambiguation Benchmark. EACL 2023 - The 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023, Dubrovnik, Croatia. ⟨hal-04039173⟩
