Project Goals

Of the many diverse world languages, very few are within reach of current natural language processing techniques. A key obstacle to the broader application of automatic text processing is that current approaches ignore the deep semantic structure underlying all human languages, instead focusing on shallow processing of large text corpora. Our radically new approach departs from current practice in that it is designed to discover and leverage deep syntactic and semantic structures elicited from human experts. Several key conceptual breakthroughs drive our proposed approach, embodied in techniques that:

(1) amplify hand-crafted semantic knowledge to obtain comprehensive coverage for a range of tasks;

(2) uncover language-specific and language-neutral semantic representations through comparative language analysis, exploiting the diversity of expression exhibited by world languages; and

(3) effectively apply these structures to a variety of linguistic tasks, from machine translation (MT) to information extraction and summarization.

The goal of our research is to dramatically expand the capabilities of automatic systems to process low-resource and typologically diverse languages that are currently beyond the scope of existing approaches.

Key Research Questions

Can deep linguistic representations benefit practical textual analysis systems?

At first glance, this question appears rhetorical: clearly, the answer is yes. Textual analysis (TA) systems with deep knowledge should be more accurate because they "know" more than a shallow system: they possess enough expressive power to capture complex linguistic patterns (e.g., translation correspondences) that shallow systems cannot. Furthermore, deep representations abstract more easily across different surface realizations of text, so they magnify the marginal benefit of additional data. In practice, however, shallow systems predominate, because developing usable deep linguistic representations has proven difficult. For instance, it took more than a decade just to demonstrate the benefits of syntactic information for statistical machine translation. The challenges are threefold: first, defining representations conducive to improved text processing; second, developing algorithms that can handle these complex structures; and third, effectively incorporating them into underlying applications. Moreover, we must address these challenges for languages where relevant annotations are unavailable, making it difficult to construct example structures.

Can we marry learning from data with expert-crafted declarative linguistics?

Classical linguistics expresses knowledge about language in the form of abstract theories (e.g., grammars) that are domain- and task-neutral. In contrast, the dominant assumption of modern computational linguistics is that representations of language can be learned directly from textual data, either in raw or annotated form. Even when such data is plentiful, for many tasks the accuracy of machine learning systems leaves much to be desired. We hypothesize that such systems stand to benefit substantially from declarative expert linguistic knowledge. Leveraging high-quality, readily available linguistic knowledge is particularly essential for effective processing of low-density languages.

Can we uncover underlying linguistic structure through comparative language analysis?

We hypothesize that the connection between languages is a powerful source of information for automatic processing of a low-density language. In fact, linguists routinely utilize comparative analysis for studying low-density languages. Comparative data such as typology and family classification can strongly constrain the space of possible linguistic structures when a target language is analyzed in its broader linguistic context. Moreover, language diversity can shed considerable light on the linguistic representations of individual languages: structures implicit in one language can be straightforward to identify in another.

How can we extend MT-motivated linguistic-core capabilities to related TA tasks?

We hypothesize that analyzing the typology and structure of a language well enough to map into the corresponding structure of a very different language also provides the key to unlocking other analytic functions. For instance, the mapping between argument structure and semantics is key to reliably extracting relations (who did what to whom?) rather than just extracting entities from text. Extracted relations are in turn the key to meaningful summarization.