Standard lexical representation of multi-word expressions

    Computational Linguistics & Logic
    Jan Odijk

    Project description / Abstract

    Multi-word expressions are expressions consisting of multiple words for which at least one aspect (syntactic, semantic, pragmatic, translation, etc.) is not predictable from the individual words and their normal mode of combination. Therefore, such expressions and their unpredictable properties must be stored in a lexicon. Examples of multi-word expressions are idioms (e.g. 'to kick the bucket', with the unpredictable meaning 'to die'), support-verb constructions (e.g. Dutch 'een poging wagen', where the selection of the verb 'wagen' is specific to the noun 'poging'), compounds in certain languages (e.g. English), multi-word terms from specific domains (e.g. English 'part of speech', Dutch translation = 'woordsoort'), etc.

    State-of-the-art systems that process natural language automatically ('NLP-systems', e.g. machine translation systems, information retrieval systems, information extraction systems, question-answering systems, etc.) do not deal adequately with large numbers of multi-word expressions. NLP-systems can only be deployed successfully if they can handle such multi-word expressions adequately, and in addition there should be methods to rapidly include new multi-word expressions in an NLP-system. Many NLP-systems (e.g. machine translation systems) operate most successfully if they operate in a restricted domain. But each specific domain has its own multi-word terminology, which is usually not included in generic NLP-systems, so it must be possible to rapidly tune a generic NLP-system to a specific domain.

    The inadequate treatment of multi-word expressions forms a major obstacle for the successful application of NLP technologies (Sag et al., 2001) in domains such as information retrieval (Thurmair, 2003) and summarization, question answering (Dowdall et al., 2003) and machine translation (Odijk, 2000).

    The Dutch NOTaS group (http://www.stichtingnotas.nl/), a conglomerate of commercial and academic NLP-developers, points out that a proper treatment of multi-word expressions (MWEs) is of great importance, e.g. in the processing industries (Akkermans et al., 2004, p.118). Current work at many research institutions focuses too heavily on individual head words only, and on general rather than domain-specific language. The group also points out, in a more general context, the need for standards.

    Van Dale Data also considers this a significant problem, and is prepared to materially support the current proposal by providing data sets to facilitate the research.

    Many (though not all) NLP-systems have methods to deal with multi-word expressions. However, these methods require that very rich and detailed information is included in the lexical item for each multi-word expression. As a consequence, the lexicons of these systems usually contain only a limited number of multi-word expressions. In addition, each system has its own approach to such expressions, uses its own grammatical framework and grammatical theory, and has details of implementation that are specific to that system. This results in the unfortunate situation that a lexical entry for a multi-word expression made for one particular system cannot be directly used in other systems.

    The research proposal focuses on the so-called Equivalence Class Method (ECM) for the standard lexical representation of MWEs, recently developed by Odijk (2003a); see also (Odijk, 2003b; Odijk, 2004a; Odijk, 2004b). The ECM offers an innovative methodology in this domain and contributes to solving this problem (1) by presenting a method for a standardized way of lexically representing multi-word expressions that is technically simple and highly independent of specific grammatical frameworks, grammatical theories, and their implementations; and (2) by presenting a method for deriving a system-specific representation from this standardized representation in a highly automated manner. The combination of these two ingredients makes it possible to describe multi-word expressions without all the rich and detailed information required in any specific system, and it makes it possible to reuse such lexical descriptions in a wide variety of NLP-systems.
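
    The two ingredients can be illustrated with a minimal Python sketch. It is not the ECM as specified in the cited papers: the field names, the class label "V_DET_N", the converter function, and the target format are all hypothetical, chosen only to show how a framework-neutral entry might be mapped to one system's representation by a procedure written once per class.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StandardEntry:
    """Framework-neutral MWE record (all field names are hypothetical)."""
    expression: str         # surface form of the MWE
    equivalence_class: str  # hypothetical label shared by MWEs treated alike
    meaning: str            # the unpredictable property stored in the lexicon


# A converter maps a standard entry to one (imaginary) system's lexical format.
Converter = Callable[[StandardEntry], dict]


def convert_verb_det_noun(entry: StandardEntry) -> dict:
    """Hypothetical converter for three-word verb + determiner + noun MWEs."""
    verb, det, noun = entry.expression.split()
    return {"head": verb, "object": {"det": det, "noun": noun},
            "sense": entry.meaning}


# One converter per class: written once, reused for every entry in the class.
CONVERTERS: Dict[str, Converter] = {"V_DET_N": convert_verb_det_noun}


def to_system_specific(entries: List[StandardEntry],
                       converters: Dict[str, Converter]) -> List[dict]:
    """Derive system-specific entries; fail loudly on an unknown class."""
    converted = []
    for entry in entries:
        if entry.equivalence_class not in converters:
            raise KeyError(f"no converter for class {entry.equivalence_class!r}")
        converted.append(converters[entry.equivalence_class](entry))
    return converted


LEXICON = [
    StandardEntry("kick the bucket", "V_DET_N", "to die"),
    StandardEntry("spill the beans", "V_DET_N", "to reveal a secret"),
]
SYSTEM_LEXICON = to_system_specific(LEXICON, CONVERTERS)
```

    In this sketch, the standard entries carry no system-specific detail; supporting a second system would only require a second converter table, while the entries themselves stay unchanged.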

    In the research project, the proposed method will be further elaborated, empirically tested, and extended to cover types of MWEs other than flexible idioms.

Updated 02-03-2006
UiL-OTS@let.uu.nl