Social unrest through the prism of language: computational linguistics at the service of sociology

Andrey Kutuzov
National Research University Higher School of Economics
Olga Miryasova
Institute of Sociology, Russian Academy of Sciences

The deep structure of society is manifested in how people speak and write, and this can help sociologists a great deal. However, only recently have linguistics and natural language processing developed to the point where they can offer robust methods for analyzing vast amounts of text and extracting meaningful features. We describe a case in which linguistics substantially helped sociology in studying a particular group of grassroots activists.

The group in question is a parents' movement located mainly in Moscow, Russia. In 2012, its participants united around the issue of poor catering in kindergartens. The group consisted mostly of mothers aged 22 to 45, who organized rallies, met with officials and addressed protest letters to the authorities. The activists communicated via their Internet forum, discussing all aspects of their struggle.

This forum became a perfect source of linguistic data about the activists. For this reason the sociologists, who had initially relied on participant observation, decided to ask a linguist what modern natural language processing could do with this data.

From the point of view of a sociologist, analysis of forum posts has the following advantages in comparison to interviews:

1. There are no interviewer's questions: utterances are made because their producers wanted to make them.
2. The utterances are not as artificial as in an interview (activists often tend to treat a sociologist's interview as if it were an interview for the mass media).
3. The volume of text is much larger than in a typical interview (in this case we analyzed a corpus of almost a million words, 34 thousand posts, dated from January to December 2012).
4. It is possible to collect utterances from many people (807 in our case) without substantial effort.

At the same time, we stumbled upon a few disadvantages:

1. Automatic natural language processing at the level of vocabulary (and even at the level of syntax) often fails to capture ellipsis, irony, extensive use of euphemisms and co-reference: phenomena which are very frequent in informal collective discussions.
2. Qualitative interpretation of the results of quantitative linguistic analysis is often possible only with the help of an informed expert who is able to explain apparent inconsistencies in the data and to point out flaws in the text processing pipeline.

In general, sociologists wanted linguistic help with the following issues:

1. To compare attitudes towards power and the authorities across different groups inside the grassroots movement (using textual data as a source). This is important for deciding whether a particular subgroup perceives its problem as a private one or places it in a wider political context.
2. To refine the exact composition of the subgroups picked by an expert. Expert estimation is important, but it is equally important to support it with statistical data about language use by the groups' representatives.
3. To discover key qualitative differences between the subgroups. These differences turned out to be manifested in the distribution of lexical frequencies in the texts produced by the representatives of the subgroups.
4. To estimate how the activists' stance changed over time. With the traditional methods of sociology, estimating the dynamics of social movement participants' opinions and behavior is so cumbersome that it is hardly ever practiced. Large-scale linguistic analysis of textual data made it possible to describe the evolution of people's behavior.
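
As an illustration of the third issue, differences in lexical frequency distributions between two subgroups can be estimated roughly as in the sketch below. This is a minimal example under our own assumptions and not the pipeline actually used in the study: the function names (`tokenize`, `word_counts`, `log_ratio`) are hypothetical, tokenization is a naive regular expression instead of proper lemmatization, and the score is a simple smoothed log-ratio of relative frequencies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into word tokens.

    A naive stand-in for real tokenization and lemmatization.
    """
    return re.findall(r"\w+", text.lower())


def word_counts(posts):
    """Return per-word counts and the total number of tokens in the posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts, sum(counts.values())


def log_ratio(posts_a, posts_b):
    """Score every word by log2 of its smoothed relative frequency in A over B.

    Positive scores mark words more typical of subgroup A,
    negative scores mark words more typical of subgroup B.
    Add-one smoothing keeps words unseen in one subgroup finite.
    """
    counts_a, total_a = word_counts(posts_a)
    counts_b, total_b = word_counts(posts_b)
    vocabulary = set(counts_a) | set(counts_b)
    size = len(vocabulary)
    scores = {}
    for word in vocabulary:
        p_a = (counts_a[word] + 1) / (total_a + size)
        p_b = (counts_b[word] + 1) / (total_b + size)
        scores[word] = math.log2(p_a / p_b)
    return scores
```

Given the posts of two expert-defined subgroups as lists of strings, sorting the result of `log_ratio` by score yields candidate words that distinguish one subgroup from the other. Applying the same function to posts from two different time periods would give a crude view of how vocabulary shifted over time. In both cases the output still requires qualitative interpretation by an informed expert.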

All these issues were more or less successfully resolved with the help of natural language processing and machine learning methods. Thus, the collaboration turned out to be fruitful: computational linguistics made it possible to confirm the sociologists' hypotheses and led to some unexpected insights.