Why linguistics: from a technological perspective



Toni Badia
Universitat Pompeu Fabra, Barcelona

In a technological environment where engines are built that either perform linguistic acts or help users to perform them, the question of what linguistics is useful for comes up again and again. Very often the “why” is closely connected to the “what” and to the “how”.

The predominant paradigm in Natural Language Processing (NLP) today is data-driven. Since the 1970s, when Artificial Intelligence and NLP were theoretically grounded on logic and linguistic knowledge, a major paradigm shift has occurred within NLP. The initial success of probabilistic models in speech recognition and machine translation (e.g., at IBM labs: Jelinek et al. 1975; Brown et al. 1990), together with the increase in the processing power and storage capacity of computers, has led the field to rely chiefly on data from which models for the tasks at hand are extracted (usually via machine learning techniques). Interestingly, this has brought NLP and corpus linguistics together.

The availability of data is therefore crucial for solving NLP tasks. It has sometimes been argued (Halevy et al. 2009) that a massive amount of data is enough for success in NLP tasks. Irrespective of the general validity of this statement, the fact remains that for a large number of tasks no massive data are available. Models then have to be acquired from less data, and the relative scarcity of data can only be compensated for if the data are richly and carefully annotated, so that satisfactory results can be obtained. Indeed, in many specialised tasks a combination of massive data with carefully annotated data seems to be required (e.g., when combining in-domain with out-of-domain data in machine translation).

Why (do we need linguistics)

Linguistics is required to make sense of linguistic facts, which are scattered and intermixed with other communication factors and means. Linguistics has to be grounded on facts, at all levels of granularity: it has to explain both 1) communication facts taking place at specific time/space coordinates and 2) language as a cognitive phenomenon. Both levels of explanation must be consistent with one another.
- linguistic phenomena are scattered (across multiple linguistic acts performed in a number of different languages) and intertwined with other communication means (gestures, images, intonation...)
- a systematic approach to the correlation between linguistic form and meaning is needed
- linguistic analyses are necessary to provide data that are linguistically structured
- linguistic theories are necessary to provide consistency to the analyses and annotation of data

What (sort of linguistics do we need)

We basically need linguistic theories that are grounded on facts and help in carrying out linguistic annotations of linguistic acts:
- descriptive linguistics must be the core: theory-neutral descriptions that are applicable across languages and directly relate form and meaning
- linguistic theories must be validated against (quantitative) data
- linguistics must become a truly experimental science that accounts both for each minute communication act and for language as a global communication system (one of the complex systems in nature)

How (should linguistic data be annotated)

Language is a means of communication; it is therefore related to any other factor influencing communication (either the channel or the content). Linguistic data have to be treated in parallel with the other elements (image and sound data in audiovisual communication; social data in social media communication...). We need:
- rigorously annotated data (deriving from the categorisation of linguistic phenomena)
- cross-lingual annotation criteria (so that linguistic diversity is addressed)
- quality annotations (simple, fully justified, grounded on facts, contrasted, replicable)
- annotations that are clearly quantifiable
- annotations that are comparable to, and can be integrated with, non-linguistic data (image, sound, social network data...)
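
One way to make the requirements of replicable and quantifiable annotation concrete is to measure how far two annotators agree on the same items. The sketch below is only an illustration under stated assumptions: it uses Cohen's kappa, a common chance-corrected agreement measure that the text above does not itself name, and the function name, the part-of-speech labels and the two annotators are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items.

    Illustrative sketch only: Cohen's kappa is one possible way to quantify
    how replicable an annotation is; the text does not prescribe it.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator assigned labels independently,
    # following their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech annotations of six tokens by two annotators.
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

A value close to 1 would suggest that the annotation criteria can be applied consistently by different annotators, while a value close to 0 would suggest agreement no better than chance and hence criteria that need revising.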

Brown, P.F. et al. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2), pp. 79-85.
Halevy, A. et al. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), pp. 8-12.
Jelinek, F. et al. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, 06/1975.