TechTalks from event: NAACL 2015

5C: Morphology, Syntax, Multilinguality, and Applications

  • Paradigm classification in supervised learning of morphology Authors: Malin Ahlberg, Markus Forsberg, Mans Hulden
    Supervised morphological paradigm learning by identifying and aligning the longest common subsequence found in inflection tables has recently been proposed as a simple yet competitive way to induce morphological patterns. We combine this non-probabilistic strategy of inflection table generalization with a discriminative classifier to permit the reconstruction of complete inflection tables of unseen words. Our system learns morphological paradigms from labeled examples of inflection patterns (inflection tables) and then produces inflection tables from unseen lemmas or base forms. We evaluate the approach on datasets covering 11 different languages and show that this approach results in consistently higher accuracies vis-à-vis other methods on the same task, thus indicating that the general method is a viable approach to quickly creating high-accuracy morphological resources.
  • Shift-Reduce Constituency Parsing with Dynamic Programming and POS Tag Lattice Authors: Haitao Mi and Liang Huang
    We present the first dynamic programming (DP) algorithm for shift-reduce constituency parsing, which extends the DP idea of Huang and Sagae (2010) to context-free grammars. To alleviate the propagation of errors from part-of-speech tagging, we also extend the parser to take a tag lattice instead of a fixed tag sequence. Experiments on both English and Chinese treebanks show that our DP parser significantly improves parsing quality over non-DP baselines, and achieves the best accuracies among empirical linear-time parsers.
  • Unsupervised Code-Switching for Multilingual Historical Document Transcription Authors: Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-Kirkpatrick, Dan Klein
    Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages, a common feature of 16th-century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we enable our system to handle spelling variability, including now-obsolete shorthand systems used by printers. Our results show average relative character error reductions of 14% across a variety of historical texts.
  • Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach Authors: Arman Cohan, Luca Soldaini, Nazli Goharian
    Citation sentences (citances) to a reference article have been extensively studied for summarization tasks. However, citances might not accurately represent the content of the cited article, as they often fail to capture the context of the reported findings and can be affected by epistemic value drift. Following the intuition behind the TAC (Text Analysis Conference) 2014 Biomedical Summarization track, we propose a system that identifies text spans in the reference article that are related to a given citance. We refer to this problem as citance-reference spans matching. We approach the problem as a retrieval task; in this paper, we detail a comparison of different citance reformulation methods and their combinations. While our results show improvement over the baseline (up to 25.9%), their absolute magnitude implies that there is ample room for future improvement.
  • Effective Feature Integration for Automated Short Answer Scoring Authors: Keisuke Sakaguchi, Michael Heilman, Nitin Madnani
    A major opportunity for NLP to have a real-world impact is in helping educators score student writing, particularly content-based writing (i.e., the task of automated short answer scoring). A major challenge in this enterprise is that scored responses to a particular question (i.e., labeled data) are valuable for modeling but limited in quantity. Additional information from the scoring guidelines for humans, such as exemplars for each score level and descriptions of key concepts, can also be used. Here, we explore methods for integrating scoring guidelines and labeled responses, and we find that stacked generalization (Wolpert, 1992) improves performance, especially for small training sets.
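
As a rough illustration of the stacked generalization idea mentioned in the last abstract, the sketch below combines one learner that uses only scoring-guideline key concepts with one that uses only labeled responses, and trains a small linear meta-learner on their out-of-fold predictions. Everything in it is an assumption made for illustration: the toy responses and scores, the key concepts, the two base learners, and the ridge meta-learner are hypothetical and are not the features or models used by the authors.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, float]            # (response text, human score)
Predictor = Callable[[str], float]
Fitter = Callable[[List[Example]], Predictor]

# Hypothetical key concepts from an imagined scoring guideline.
KEY_CONCEPTS = ["evaporation", "condensation"]

def tokens(text: str) -> set:
    return set(text.lower().split())

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fit_guideline(train: List[Example]) -> Predictor:
    # Guideline-only learner: ignores labeled data and returns the
    # fraction of key concepts that the response mentions.
    return lambda text: sum(c in tokens(text) for c in KEY_CONCEPTS) / len(KEY_CONCEPTS)

def fit_nearest(train: List[Example]) -> Predictor:
    # Label-only learner: returns the score of the most similar training response.
    return lambda text: max(train, key=lambda ex: jaccard(tokens(text), tokens(ex[0])))[1]

def out_of_fold(fitters: List[Fitter], data: List[Example], k: int) -> List[List[float]]:
    # Each example is predicted by base models that never saw it in training.
    rows: List[List[float]] = [[] for _ in data]
    for fold in range(k):
        train = [ex for i, ex in enumerate(data) if i % k != fold]
        models = [fit(train) for fit in fitters]
        for i, (text, _) in enumerate(data):
            if i % k == fold:
                rows[i] = [m(text) for m in models]
    return rows

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    # Gauss-Jordan elimination with partial pivoting for a small dense system.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ridge(rows: List[List[float]], targets: List[float], lam: float = 1e-3) -> List[float]:
    # Linear meta-learner with a bias term, fitted via the ridge normal equations.
    xs = [row + [1.0] for row in rows]
    d = len(xs[0])
    a = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    b = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(d)]
    return solve(a, b)

def fit_stack(fitters: List[Fitter], data: List[Example], k: int = 3) -> Predictor:
    weights = fit_ridge(out_of_fold(fitters, data, k), [y for _, y in data])
    models = [fit(data) for fit in fitters]   # refit base learners on all data
    return lambda text: sum(w * f for w, f in zip(weights, [m(text) for m in models] + [1.0]))

# Toy labeled responses (hypothetical), scored 0 to 2.
DATA: List[Example] = [
    ("water rises by evaporation then condensation forms clouds", 2.0),
    ("evaporation lifts water", 1.0),
    ("clouds are made of cotton", 0.0),
    ("condensation forms clouds after evaporation", 2.0),
    ("condensation makes drops", 1.0),
    ("rain comes from the sky", 0.0),
]

predict = fit_stack([fit_guideline, fit_nearest], DATA)
high = predict("evaporation then condensation")
low = predict("cotton candy")
print(round(high, 2), round(low, 2))
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions is the defining step of stacking: it keeps the combiner from over-trusting a base learner that has simply memorized its training responses, which matters most when labeled data is scarce.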