Named Entity Recognition in Estonian

Alexander Tkachenko
University of Tartu

Knowledge confined within natural language can be made more accessible for machine processing by transforming the text into a structured, normalised database form. Information Extraction aims to do just this: its goal is to automatically extract structured information from unstructured text documents using natural language processing. One basic sub-task in Information Extraction is the recognition of predefined information units such as names of persons, organisations and locations, and numeric expressions including time, date, money and percent expressions. Named Entity Recognition (NER) is the process of identifying these entities in text.

In this work, we discuss common issues related to building a NER system using a supervised learning framework. Specifically, we aim to investigate the effect of language-agnostic and language-specific features on system performance. In NER, language-agnostic features are largely based on the character makeup of words and include information such as prefixes, suffixes and capitalisation. Language-specific features, by contrast, are based on the grammatical and morphological information of words. These include, for instance, a word's lemma, part of speech and case. Although language-specific features have been shown to result in higher performance, they require the availability of sophisticated tools such as a morphological analyser or a part-of-speech tagger, which are not available for many less popular languages.
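To make the notion of language-agnostic features concrete, here is a minimal Python sketch that derives features from the character makeup of a token alone. The function names, the affix length of 3 and the particular feature set are illustrative assumptions, not the feature set used in this work.

```python
from typing import Dict, List, Union

Feature = Union[str, bool]


def language_agnostic_features(token: str, affix_len: int = 3) -> Dict[str, Feature]:
    """Character-level features that need no linguistic tools (hypothetical set)."""
    return {
        "lower": token.lower(),                    # case-normalised word form
        "prefix": token[:affix_len],               # first affix_len characters
        "suffix": token[-affix_len:],              # last affix_len characters
        "is_capitalised": token[:1].isupper(),     # starts with an uppercase letter
        "is_all_caps": token.isupper(),            # every cased character is uppercase
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, Feature]]:
    """Build one feature dictionary per token of a tokenised sentence."""
    return [language_agnostic_features(t) for t in tokens]


features = sentence_features(["Tartu", "Ülikool", "asutati", "1632"])
```

Language-specific features such as lemma, part of speech or case would be added to the same dictionaries, but only after running a morphological analyser or tagger over the sentence, which is the dependency discussed above.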

Additionally, we present our recent findings in using unlabelled text to boost NER performance.

In our experiments, we achieved an overall F1-score of 87%, which is comparable with results reported for similar languages.