Corpus Studies of Russian Everyday Speech and Oral Communication

Bogdanova-Beglarian Natalia, Sherstinova Tatiana, Blinova Olga, Martynenko Gregory
St. Petersburg State University

The paper presents the ORD ("One Day of Speech") corpus of Russian everyday speech, which contains long-term audio recordings of daily communication [1]. At present, the ORD corpus is the most representative collection of everyday spoken Russian: it contains more than 1000 hours of recordings gathered from 110 main participants and hundreds of their interlocutors. The speech transcripts currently number about 500,000 words, and there are plans to extend them to 1 million words. Speech is selectively annotated on several levels (phonetic, lexical, grammatical, and pragmatic), and quantitative data processing is carried out for the annotations on each level [2]. The paper gives a brief overview of studies that are being or have been conducted on the ORD data in the following areas: 1) phonetics (reduction; temporal studies; speech patterns; hesitations; etc.); 2) lexical studies (new words; new meanings; frequency word lists; lexical richness and concentration; slang; argot; etc.); 3) morphological studies (POS distribution; frequency lists of grammatical forms; grammatical errors; etc.); 4) syntactic studies (linear word order; syntactic complexity; specific syntactic phenomena of spontaneous speech; etc.); 5) discourse and communication studies (macro- and microstructures of everyday communication; communication scenarios; discourse words and fillers; pragmatic studies; communication with "non-standard" interlocutors; etc.); 6) psycholinguistic studies (dependence of speech characteristics on the speaker's psychological type); and 7) sociolinguistic studies (speech features of different social groups; gender linguistics; styles and registers of spoken Russian; etc.), currently supported by the Russian Scientific Foundation, project No. 14-18-02070 “Everyday Russian Language in Different Social Groups” (cf., for example, [3]). The ORD corpus has a range of interdisciplinary applications, the most important of which will be listed.

1. Asinovsky A., Bogdanova N., Rusakova M., Ryko A., Stepanova S., and Sherstinova T. (2009) The ORD Speech Corpus of Russian Everyday Communication "One Speaker's Day": Creation Principles and Annotation. In: Text, Speech and Dialogue, LNCS/LNAI, vol. 5729, 250–257. Berlin/Heidelberg: Springer-Verlag.
2. Sherstinova T. (2010) Quantitative Data Processing in the ORD Speech Corpus of Russian Everyday Communication. In: Grzybek, P., Kelih, E., and Mačutek, J. (eds.) Text and Language: Structures, Functions, Interrelations, 195–206. Wien: Praesens Verlag.
3. Bogdanova-Beglarian N., Asinovsky A., Blinova O., Markasova E., Ryko A., and Sherstinova T. (2014) Zvukovoj korpus russkogo jazyka: novaja metodologija analiza ustnoj rechi [Sound Corpus of Russian: New Methodology of Oral Speech Analysis]. In: Jazyk i metod: Russkij jazyk v lingvisticheskikh issledovanijakh XXI veka [Language and Method: The Russian Language in Linguistic Studies of the 21st Century]. Krakow: Jagiellonian University (in press).