Linguistic Insights for Building Language Technology



Anna Feldman & Jing Peng

Our area of work is Natural Language Processing (NLP), a field at the intersection of linguistics, computer science, and often several other disciplines. We describe three ongoing projects that draw on insights from linguistics.

Morphological analysis, tagging, and lemmatization are essential for many NLP applications, both practical and theoretical. Modern taggers and analyzers are very accurate. However, the standard way to create them for a particular language requires a substantial amount of expertise, time, and money. A tagger is usually trained on a large corpus (100,000+ words) annotated with correct tags. Morphological analyzers usually rely on large manually created lexicons. As a result, most of the world's languages and dialects have no realistic prospect of obtaining morphological taggers or analyzers created this way. We have been developing a method for creating morphological taggers and analyzers of fusional languages without the need for large-scale, knowledge- and labor-intensive resources for the target language. Instead, we rely on (i) resources available for a related language, (ii) a limited amount of high-impact, low-cost manually created resources, and (iii) linguistic observations about the relationship between the source and target morphologies.
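
To make the idea concrete, here is a deliberately simplified Python sketch that illustrates the general flavor of such cross-language transfer; it is not our actual system. It collects suffix-to-tag statistics from a tagged corpus of a related source language and tags target-language words through a small hand-written suffix correspondence table. The toy Spanish-to-Portuguese data, the suffix mapping, and all function names are invented for illustration.

    # Illustrative sketch only: suffix-based tag transfer from a related language.
    from collections import Counter, defaultdict

    def train_suffix_tagger(source_tagged, max_suffix=3):
        """Collect tag frequencies for word-final character n-grams in the source language."""
        counts = defaultdict(Counter)
        for word, tag in source_tagged:
            for k in range(1, max_suffix + 1):
                counts[word[-k:].lower()][tag] += 1
        return counts

    def tag_target(words, suffix_counts, target_to_source_suffix, default="NOUN"):
        """Tag target-language words via a small hand-written target-to-source suffix map."""
        tagged = []
        for word in words:
            tag = default
            for k in range(len(word), 0, -1):  # prefer the longest matching suffix
                suffix = word[-k:].lower()
                source_suffix = target_to_source_suffix.get(suffix, suffix)
                if source_suffix in suffix_counts:
                    tag = suffix_counts[source_suffix].most_common(1)[0][0]
                    break
            tagged.append((word, tag))
        return tagged

    # Toy data: Spanish as the source language, Portuguese as the target.
    source_tagged = [("cantaba", "VERB"), ("casa", "NOUN"), ("roja", "ADJ")]
    suffix_counts = train_suffix_tagger(source_tagged)
    suffix_map = {"ava": "aba"}  # hypothetical high-impact, low-cost manual resource
    print(tag_target(["cantava"], suffix_counts, suffix_map))  # [('cantava', 'VERB')]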

The main goal of the second project is to develop a language-independent method for automatic idiom recognition. Idiomatic expressions, such as 'a blessing in disguise' and 'kick the bucket', are plentiful in everyday language, yet they remain mysterious: it is not clear exactly how people learn and understand them. There is no single agreed-upon definition of idiom that covers all members of this class, but idioms tend to be relatively fixed in grammatical form and meaning, with little predictability in the relation between the two. In addition, many idiomatic expressions can also be used literally, i.e., with interpretations that are fully predictable from their form -- compare 'The little girl made a face at her mother.' (idiomatic) with 'The little girl made a face on the snowman using a carrot and two buttons.' (literal). As a result, idioms present great challenges for a variety of NLP applications, including machine translation systems, which often fail to detect idiomatic language. To address these challenges, we combine linguistic observations with machine learning. The starting point is that idioms are semantic outliers that violate cohesive structure, especially in local contexts. The following properties are quantified and incorporated into our algorithm: non-compositionality; violation of local cohesive ties; and the fact that idiomaticity is not binary -- idioms fall on a continuum from compositional, to partly unanalyzable, to completely non-compositional.
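
As a rough illustration of the 'semantic outlier' intuition -- not our actual model -- the sketch below scores a candidate expression by how strongly its words cohere with the surrounding context in a word-vector space; the tiny hand-made vectors are invented, and a low score merely suggests a possible idiomatic use.

    # Illustrative sketch only: local cohesion as a cue for idiomatic use.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def cohesion_score(phrase_words, context_words, vectors):
        """Average similarity between the candidate phrase and its local context;
        low cohesion marks the phrase as a semantic outlier (potentially idiomatic)."""
        sims = [cosine(vectors[p], vectors[c])
                for p in phrase_words for c in context_words
                if p in vectors and c in vectors]
        return sum(sims) / len(sims) if sims else 0.0

    # Tiny hand-made vectors, purely for illustration.
    vectors = {
        "kick":   [0.9, 0.1, 0.0],
        "bucket": [0.8, 0.2, 0.1],
        "water":  [0.7, 0.3, 0.0],
        "died":   [0.0, 0.1, 0.9],
    }

    literal = cohesion_score(["kick", "bucket"], ["water"], vectors)
    idiomatic = cohesion_score(["kick", "bucket"], ["died"], vectors)
    print(f"cohesion near 'water': {literal:.2f}")    # high: likely literal
    print(f"cohesion near 'died':  {idiomatic:.2f}")  # low: likely idiomatic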

The role of a learner's native language (L1) in second language (L2) acquisition has been widely discussed in theories of Second Language Acquisition (SLA). The literature suggests that writers' spelling, grammar, and lexicon in a second language are often influenced by patterns in their native language. However, the extent of L1 influence on L2 acquisition still cannot be determined exactly and remains a controversial topic in SLA research. Recently, the availability of learner corpora has provided opportunities for verifying SLA hypotheses. The previous literature suggests that the best-performing features for native language identification are largely those that rely on the content of the data, such as word n-grams, function words, and character n-grams. This means that the applicability of these features is limited to corpus-specific data. The primary goal of our work is to address this problem. We use only non-content-based features: part-of-speech (POS) tags and error tags. Exploring these features is useful for corpus-independent approaches to native language identification. Our secondary goal is to analyze the features that perform best for highly inflectional data. We approach binary classification as the first step in the development of a systematic tool for recognizing a specific L1 from morphologically complex L2 data. We use machine learning techniques to identify features that contribute to the classification between Indo-European (IE) and non-Indo-European (NIE) L1 backgrounds of learners of L2 Czech. The results of the experiments show that non-content-based features, especially error tags, are the strongest indicators of the learner's language background.
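
The schematic sketch below (with invented data, hypothetical tag names, and scikit-learn standing in for whatever toolkit is actually used) shows how POS-tag n-grams and error tags could feed such a binary IE vs. NIE classifier.

    # Illustrative sketch only: classifying L1 background from non-content features.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each "document" is a learner text reduced to non-content features:
    # POS tags plus error-annotation tags (the tag names here are invented).
    docs = [
        "N V ADJ N ERR:agr ERR:case",
        "N N V ERR:case ERR:case",
        "PRON V N ADV ERR:wordorder",
        "PRON V DET N ERR:art ERR:wordorder",
    ]
    labels = ["NIE", "NIE", "IE", "IE"]  # Indo-European vs. non-Indo-European L1

    # Unigrams and bigrams over the tag sequence; the token pattern keeps short tags intact.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(docs, labels)
    print(model.predict(["N V N ERR:case ERR:agr"]))  # expected to lean toward "NIE"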