Estnltk — open source tools for Estonian natural language processing



Timo Petmanson
University of Tartu

Estnltk is a Python library for Estonian natural language processing (NLP). In recent years, several major NLP components for Estonian have become available under free open source licenses, which is an important milestone for the Estonian NLP domain. The goal of Estnltk is to become the main platform for Estonian NLP by gluing together the existing free components and making them easy to use. Currently, researchers have to write their own interfaces to these tools, which can be very time-consuming. A simple platform is also a great resource for students who are interested in NLP.

Estnltk builds on several existing libraries to provide its NLP functionality. The most important component is vabamorf, a C++ library for morphological analysis, disambiguation and synthesis [1]. For named entity recognition, the Estner library provides the necessary code along with a valuable training dataset, which is used to train the default models shipped with the software [2]. Estnltk also includes a library for tagging temporal expressions (TIMEX) [4]. NLTK, the de facto standard toolkit for NLP in English, is also a dependency [3].

In addition to providing an API that is simple for software developers to use, Estnltk also aims to be useful for language researchers and linguists in general. The library has tools for sentiment analysis, text classification and information extraction, which require no programming knowledge once they are set up. Including more such ready-to-use tools is a major goal for the future of Estnltk.

[1] Kaalep, Heiki-Jaan. "An Estonian morphological analyser and the impact of a corpus on its development." Computers and the Humanities 31, no. 2 (1997): 115-133.

[2] Tkachenko, Alexander; Petmanson, Timo; Laur, Sven. "Named Entity Recognition in Estonian." ACL 2013 (2013): 78.

[3] Bird, Steven. "NLTK: the natural language toolkit." In Proceedings of the COLING/ACL on Interactive presentation sessions, pp. 69-72. Association for Computational Linguistics, 2006.

[4] Orasmaa, Siim. "Automaatne ajaväljendite tuvastamine eestikeelsetes tekstides." Eesti Rakenduslingvistika Ühingu aastaraamat 8 (2012): 153-169.