Utilizing Linguistic Resources for Historical Text Clustering

Andres Karjus
Liisi Veski
University of Tartu

Text is the prevalent medium and target of study not only in linguistics, but in a broad range of the humanities. When plenty of textual data is at hand, various computational methods developed over the last couple of decades can be used to gain insight into the data, to cluster and group the texts, to model the topics discussed therein, etc. However, when the texts under observation are few and short, state-of-the-art methods appear to perform rather poorly. We propose an idea for supplementing the clustering of small texts by replacing the words they contain using the hyperonymy relations found in a wordnet (a type of lexical database resource, crafted by linguists). The idea of such generalization, or abstraction, is by no means novel in itself (cf. Hovy, Lin 1999; Van Durme et al. 2009). However, what we propose to use this methodology for is directly reducing the "long tail" of words occurring once or twice in a typical distribution of words in a text, and clustering the texts using the vectors of the wordnet-generalized terms.
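The generalization step described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the hypernym map `TOY_HYPERNYMS`, the function names, and the `max_count` threshold are hypothetical, and a real wordnet would supply the hypernym relations. The sketch only shows how replacing rare words with a more general term shrinks the set of words that occur once.

```python
from collections import Counter

# Hypothetical toy hypernym map (word -> more general term).
# An assumption for illustration; a real wordnet would provide these relations.
TOY_HYPERNYMS = {
    "oak": "tree",
    "birch": "tree",
    "pine": "tree",
    "wolf": "animal",
    "elk": "animal",
}


def generalize(tokens, hypernyms, max_count=2):
    """Replace tokens occurring at most max_count times with their hypernym.

    Tokens without a known hypernym, and more frequent tokens, are kept as-is.
    """
    counts = Counter(tokens)
    return [
        hypernyms.get(tok, tok) if counts[tok] <= max_count else tok
        for tok in tokens
    ]


def hapax_count(tokens):
    """Number of distinct tokens that occur exactly once."""
    return sum(1 for c in Counter(tokens).values() if c == 1)


text = "the oak and the birch stood where the wolf and the elk passed the pine".split()
generalized = generalize(text, TOY_HYPERNYMS)

print(hapax_count(text), "single-occurrence words before generalization")
print(hapax_count(generalized), "single-occurrence words after generalization")
```

In this toy example the rare nouns collapse into two shared terms, so fewer dimensions of the resulting term vectors are populated by a single occurrence.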

As a case study, we observe the usage contexts of words meaning 'nation, national' in the essays of Estonian scholars and politicians of the 1930s, a time when such matters were hotly debated across Europe. The results show some improvement over the simple bag-of-words TF-IDF baseline, but not for all texts. We will therefore discuss possible ways to improve the model.
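To make the comparison with a bag-of-words TF-IDF baseline concrete, the following sketch computes cosine similarity between short texts before and after hypernym replacement. Everything in it is an assumption for illustration: the example sentences, the hypothetical `TOY_HYPERNYMS` map, and the particular smoothed IDF weighting are not taken from the study.

```python
import math
from collections import Counter

# Hypothetical hypernym map; an assumption standing in for wordnet lookups.
TOY_HYPERNYMS = {"oak": "tree", "birch": "tree", "wolf": "animal", "elk": "animal"}


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Uses relative term frequency and a smoothed IDF, chosen here for illustration.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "oak grows near wolf".split(),
    "birch grows near elk".split(),
    "ships sail the sea".split(),
]

baseline = tfidf_vectors(docs)
generalized = tfidf_vectors([[TOY_HYPERNYMS.get(t, t) for t in d] for d in docs])

sim_base = cosine(baseline[0], baseline[1])
sim_gen = cosine(generalized[0], generalized[1])
print("baseline similarity of texts 1 and 2:", round(sim_base, 3))
print("generalized similarity of texts 1 and 2:", round(sim_gen, 3))
```

The first two toy texts share no content nouns, so the baseline treats them as only loosely related; after generalization their vectors coincide, while the unrelated third text stays dissimilar. The resulting similarities could then be passed to any standard clustering procedure.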

Hovy, E., Lin, C.-Y., 1999. Automated Text Summarization in SUMMARIST.
Van Durme, B., Michalak, P., Schubert, L. K., 2009. Deriving Generalized Knowledge from Corpora using WordNet Abstraction.