The Austrian Baroque Corpus ABaC:us: What does the linguistic annotation add?



First Author: Claudia Resch
Other Author: Eva Wohlfarter
Austrian Centre for Digital Humanities, Austrian Academy of Sciences

The term "corpus" in linguistics refers to a large and structured set of texts which is usually electronically stored and processed. The purpose of this paper is to introduce the Austrian Baroque Corpus (ABaC:us) which has been built up by an interdisciplinary team since 2010.
ABaC:us consists of text data and images dating from the Baroque era, in particular the years from 1650 to 1750. It includes 17 texts with more than 210,000 running words, of which five texts - attributed to the Augustinian monk Abraham a Sancta Clara (1644-1709) - constitute the core of the corpus. The texts of ABaC:us belong mainly to the so-called Memento Mori genre, that is, to texts associated with death and dying.
The corpus aims to combine traditional philological expertise and up-to-date text technology to preserve the cultural and linguistic heritage embedded in the texts. In order to ensure reusability, well-established text technological standards - XML annotations according to the guidelines of the Text Encoding Initiative (version P5, http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf) - were adopted. The focus of the paper, however, lies on the linguistic annotation: using TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), a widely used tool for part-of-speech tagging, and the Stuttgart-Tübingen-Tagset (http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf), word class and lemma information were automatically added to every word in the five main texts of the corpus. But what does the linguistic annotation add to the value of the corpus? The question is legitimate, as the manual correction of the annotation, which was necessary to obtain high-quality data, was a rather time-consuming process.
The linguistic annotation allows for more complex linguistic research, such as the analysis of stylistic and rhetorical features, recurring patterns and grammatical elements. Can the linguistic analysis of the corpus help us to gain a deeper knowledge of the society of the past? With several examples from ABaC:us, this paper aims to open the debate.
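
To illustrate the kind of query that word class and lemma information makes possible, the following is a minimal sketch in Python. It is not the project's actual tooling. It assumes that each token is encoded as a TEI `w` element carrying a `lemma` attribute and an STTS tag in a `type` attribute; these attribute names, the helper functions and the sample sentence are invented for illustration and are not taken from ABaC:us.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample sentence (NOT from ABaC:us). The use of @lemma and @type
# on <w> is an assumption about how the annotation might be encoded.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="der" type="ART">Der</w>
  <w lemma="bitter" type="ADJA">bittere</w>
  <w lemma="Tod" type="NN">Todt</w>
  <w lemma="verschonen" type="VVFIN">verschont</w>
  <w lemma="kein" type="PIAT">keinen</w>
  <w lemma="Mensch" type="NN">Menschen</w>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, STTS tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.text, w.get("lemma"), w.get("type"))
        for w in root.iter(f"{{{TEI_NS}}}w")
    ]


def adjective_noun_pairs(tokens):
    """Find attributive adjective (ADJA) + common noun (NN) sequences.

    Pairs are reported by lemma, so historical spelling variants of the
    same word (e.g. 'Todt' for 'Tod') are grouped together.
    """
    return [
        (first[1], second[1])
        for first, second in zip(tokens, tokens[1:])
        if first[2] == "ADJA" and second[2] == "NN"
    ]


tokens = read_tokens(SAMPLE)
pos_counts = Counter(tag for _, _, tag in tokens)  # frequency of each word class
pairs = adjective_noun_pairs(tokens)               # recurring ADJA + NN patterns
```

Because the search runs on lemmas and tags rather than on surface forms, a pattern such as "adjective followed by noun" can be retrieved regardless of the spelling variation typical of texts from this period.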


References

Boot, Peter. 2009. Mesotext. Digitised Emblems, Modelled Annotations and Humanities Scholarship. Amsterdam: Pallas Publications, Amsterdam University Press

Czeitschner, Ulrike, Declerck, Thierry, and Resch, Claudia. 2014. Porting Elements of the Austrian Baroque Corpus onto the Linguistic Linked Open Data Format. In: Osenova, Petya, Simov, Kiril, Georgiev, Georgi and Nakov, Preslav (eds.): Proceedings of the Joint Workshop on NLP & LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction associated with the 9th International Conference on Recent Advances in Natural Language Processing (RANLP 2013). Sofia: p. 12-16

Dipper, Stefanie. 2010. POS-Tagging of Historical Language Data: First Experiments. In: Semantic Approaches in Natural Language Processing. Proceedings of the 10th Conference on Natural Language Processing (KONVENS-10). Saarbrücken: p. 117-121

Hinrichs, Erhard and Zastrow, Thomas. 2012. Linguistic Annotations for a Diachronic Corpus of German. In: Linguistic Issues in Language Technology, Volume 7, issue 7, p. 1-16

Kawaguchi, Yuji, Minegishi, Makoto and Viereck, Wolfgang (eds.). 2011. Corpus-based Analysis and Diachronic Linguistics. Amsterdam and Philadelphia: John Benjamins

Moerth, Karlheinz, Resch, Claudia, Declerck, Thierry and Czeitschner, Ulrike. 2012. Linguistic and Semantic Annotation in Religious Memento Mori Literature. In: Atwell, Eric, Brierley, Claire and Sawalha, Majdi (eds.): Proceedings of the LREC 2012 Workshop: Language Resources and Evaluation for Religious Texts. Paris: ELRA, p. 49-52

Resch, Claudia, Declerck, Thierry, Krautgartner, Barbara and Czeitschner, Ulrike. 2014. ABaC:us revisited - Extracting and Linking Lexical Data from a historical Corpus of Sacred Literature. In: Atwell, Eric, Brierley, Claire and Sawalha, Majdi (eds.): Proceedings of the 2nd Workshop on Language Resources and Evaluation for Religious Texts / LREC 2014. Reykjavik: p. 36-41