A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages


Wikipedia is one of the most widely used datasets to train standard and contextualized word embeddings. However, even for mid-resource languages, the amount of data found in Wikipedia might be too small to train high-quality embeddings. Moreover, Wikipedia data covers a single specific genre and style. Conversely, Common Crawl is a source of much larger and much more diverse data, although at the expense of some level of noise. We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on part-of-speech tagging and parsing. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They match or improve on the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitute the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.

In The 58th Annual Meeting of the Association for Computational Linguistics
Pedro Javier Ortiz Suárez

I am a PhD student in computer science at Sorbonne Université and in the ALMAnaCH research team at Inria.