Statistical methods for particle verb extraction from text corpus


Eleri Aedmaa
University of Tartu

A series of studies has been conducted on using association measures (AMs) to identify lexical association between pairs of words that potentially form a holistic unit, but the question "what is the best AM?" is still difficult to answer. It was unknown how AMs perform on Estonian data and which AMs are most successful for collocation extraction. This study focused on a subtype of collocations, or multi-word expressions, namely particle verbs, a frequent and regular phenomenon in Estonian and a problematic subject in natural language processing. I tried to ascertain the best AM for the extraction of particle verbs by investigating the impact of corpus size on the performance of the symmetrical AMs and by comparing the symmetrical AMs (t-test, mutual information, χ², log-likelihood function and minimum sensitivity) and the asymmetrical AMs (conditional probability and ΔP). The t-test achieved the best precision values, but as the corpus size increased, the performance of χ² and minimum sensitivity improved. In addition, I demonstrated that ΔP is successful for the task of particle verb extraction and provides slightly different and more detailed information about the extracted particle verbs.
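For readers unfamiliar with the measures named above, the following is a minimal sketch of how they can be computed from a 2x2 contingency table of co-occurrence counts. It uses common textbook definitions, which may differ in detail from the exact variants used in this study. The function name, the argument names and the counts in the usage lines are illustrative assumptions, not taken from the study.

```python
import math

def association_measures(o11, o12, o21, o22):
    """Compute common association measures from a 2x2 contingency table.

    Assumed (hypothetical) layout of the observed counts:
      o11: particle and verb co-occur
      o12: particle occurs with a different verb
      o21: verb occurs with a different particle (or with none)
      o22: neither the particle nor the verb occurs
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row totals (particle present / absent)
    c1, c2 = o11 + o21, o12 + o22   # column totals (verb present / absent)

    # Expected counts under the assumption of independence.
    e11, e12 = r1 * c1 / n, r1 * c2 / n
    e21, e22 = r2 * c1 / n, r2 * c2 / n

    # Log-likelihood ratio; cells with zero observations contribute nothing.
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in ((o11, e11), (o12, e12), (o21, e21), (o22, e22))
                 if o > 0)

    return {
        # Symmetrical measures: one score per pair.
        "t_score": (o11 - e11) / math.sqrt(o11),
        "mutual_information": math.log2(o11 / e11),
        "chi_squared": n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * c1 * c2),
        "log_likelihood": g2,
        "min_sensitivity": min(o11 / r1, o11 / c1),
        # Asymmetrical measures: one score per direction.
        "p_verb_given_particle": o11 / r1,
        "p_particle_given_verb": o11 / c1,
        "delta_p_verb_given_particle": o11 / r1 - o21 / r2,
        "delta_p_particle_given_verb": o11 / c1 - o12 / c2,
    }

# Usage with invented counts (not data from the study).
scores = association_measures(10, 10, 10, 70)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The symmetrical measures yield a single score per particle-verb pair, whereas conditional probability and ΔP yield one score for each direction. This is one way to read the observation that ΔP provides slightly different and more detailed information. The sketch assumes that o11 and all row and column totals are non-zero.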