TechTalks from NAACL 2015

3A: Generation and Summarization

  • How to Make a Frenemy: Multitape FSTs for Portmanteau Generation Authors: Aliya Deri and Kevin Knight
    A portmanteau is a type of compound word that fuses the sounds and meanings of two component words; for example, frenemy (friend + enemy) or smog (smoke + fog). We develop a system, including a novel multitape FST, that takes an input of two words and outputs possible portmanteaux. Our system is trained on a list of known portmanteaux and their component words, and achieves 45% exact matches in cross-validated experiments. (A toy sketch of the blending task follows this session's list.)
  • Aligning Sentences from Standard Wikipedia to Simple Wikipedia Authors: William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, Wei Wu
    This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia. We introduce a method that improves over past efforts by using a greedy (vs. ordered) search over the document and a word-level semantic similarity score based on Wiktionary (vs. WordNet) that also accounts for structural similarity through syntactic dependencies. Experiments show improved performance on a hand-aligned set, with the largest gain coming from structural similarity. Resulting datasets of manually and automatically aligned sentence pairs are made available. (A greedy-alignment sketch follows this session's list.)
  • Inducing Lexical Style Properties for Paraphrase and Genre Differentiation Authors: Ellie Pavlick and Ani Nenkova
    We present an intuitive and effective method for inducing style scores on words and phrases. We exploit signal in a phrase's rate of occurrence across stylistically contrasting corpora, making our method simple to implement and efficient to scale. We show strong results both intrinsically, by correlation with human judgements, and extrinsically, in applications to genre analysis and paraphrasing. (A log-ratio sketch of this recipe follows this session's list.)
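
The multitape FST machinery in Deri and Knight's talk is hard to reproduce in a few lines, but the underlying blending task is easy to illustrate. Below is a deliberately naive grapheme-level candidate generator (the function name and heuristics are mine, not the paper's): it splices a prefix of one word onto a suffix of the other and keeps any blend shorter than plain concatenation. A real system like theirs generates candidates over pronunciations and ranks them with a trained model.

```python
def blend_candidates(w1: str, w2: str):
    """All prefix(w1) + suffix(w2) splices that are shorter than the
    plain concatenation and keep at least two characters of each word."""
    out = set()
    for i in range(2, len(w1) + 1):      # prefix of w1, >= 2 chars
        for j in range(0, len(w2) - 1):  # suffix of w2, >= 2 chars
            blend = w1[:i] + w2[j:]
            if len(blend) < len(w1) + len(w2) and blend not in (w1, w2):
                out.add(blend)
    return sorted(out, key=len)

print(blend_candidates("smoke", "fog"))     # candidates include 'smog'
print(blend_candidates("friend", "enemy"))  # candidates include 'frenemy'
```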
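
For the Wikipedia alignment work, the greedy-vs-ordered distinction is the easiest part to sketch. The fragment below (all names are my illustrations) scores every cross-document sentence pair with a crude word-overlap stand-in for the paper's Wiktionary-based semantic score, then greedily commits the best-scoring unused pairs, so alignments need not respect document order.

```python
from itertools import product

def word_overlap(a: str, b: str) -> float:
    """Crude Jaccard word overlap, standing in for the paper's
    Wiktionary-based semantic score with syntactic structure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def greedy_align(standard, simple, threshold=0.3):
    """Greedy search: repeatedly commit the best-scoring unused pair,
    instead of forcing alignments to follow document order."""
    scored = sorted(
        ((word_overlap(s, t), i, j)
         for (i, s), (j, t) in product(enumerate(standard), enumerate(simple))),
        reverse=True)
    used_s, used_t, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i not in used_s and j not in used_t:
            used_s.add(i); used_t.add(j)
            pairs.append((standard[i], simple[j], round(score, 2)))
    return pairs
```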
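
Pavlick and Nenkova's style scores come from occurrence rates across stylistically contrasting corpora. A minimal guess at that recipe (the paper's exact scoring function may differ) is a smoothed log-ratio of relative frequencies:

```python
import math
from collections import Counter

def style_scores(corpus_a, corpus_b, smoothing=1.0):
    """Score each word by the smoothed log-ratio of its relative
    frequency in two stylistically contrasting corpora (token lists).
    Positive leans toward corpus_a's style, negative toward corpus_b's."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    return {
        w: math.log((ca[w] + smoothing) / (na + smoothing * len(vocab)))
         - math.log((cb[w] + smoothing) / (nb + smoothing * len(vocab)))
        for w in vocab
    }
```

Phrases can be scored the same way by counting n-grams instead of single tokens.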

3B: Information Extraction and Question Answering

  • Entity Linking for Spoken Language Authors: Adrian Benton and Mark Dredze
    Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts.
  • Spinning Straw into Gold: Using Free Text to Train Monolingual Alignment Models for Non-factoid Question Answering Authors: Rebecca Sharp, Peter Jansen, Mihai Surdeanu, Peter Clark
    Monolingual alignment models have been shown to boost the performance of question answering systems by "bridging the lexical chasm" between questions and answers. The main limitation of these approaches is that they require semi-structured training data in the form of question-answer pairs, which is difficult to obtain in specialized domains or low-resource languages. We propose two inexpensive methods for training alignment models solely using free text, by generating artificial question-answer pairs from discourse structures. Our approach is driven by two representations of discourse: a shallow sequential representation, and a deep one based on Rhetorical Structure Theory. We evaluate the proposed model on two corpora from different genres and domains (one from Yahoo! Answers and one from biology) and on two types of non-factoid questions: manner and reason. We show that these alignment models, trained directly from discourse structures imposed on free text, improve performance considerably over an information retrieval baseline and a neural network language model trained on the same data. (A marker-based sketch of the pair-generation idea follows this session's list.)
  • Personalized Page Rank for Named Entity Disambiguation Authors: Maria Pershina, Yifan He, Ralph Grishman
    The task of Named Entity Disambiguation is to map entity mentions in the document to their correct entries in some knowledge base. We present a novel graph-based disambiguation approach based on Personalized PageRank (PPR) that combines local and global evidence for disambiguation and effectively filters out noise introduced by incorrect candidates. Experiments show that our method outperforms state-of-the-art approaches, achieving 91.7% micro- and 89.9% macro-accuracy on a dataset of 27.8K named entity mentions. (A PPR sketch follows this session's list.)
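
Sharp et al.'s key move, turning free text into artificial question-answer pairs via discourse, can be sketched with the shallow marker-based representation (the MARKERS table and function below are my illustrations; the paper also uses full RST parses): a sentence of the form "X because Y" yields a reason question built from X and an answer built from Y.

```python
import re

# Hypothetical marker -> wh-word table; the paper derives pairs both from
# this kind of shallow sequential structure and from full RST parses.
MARKERS = {"because": "Why", "in order to": "Why", "by": "How"}

def qa_pairs(sentence: str):
    """Turn 'X <marker> Y' into an artificial (question, answer) pair:
    the question comes from X, the answer from Y."""
    pairs = []
    for marker, wh in MARKERS.items():
        m = re.search(rf"\b{re.escape(marker)}\b", sentence, re.IGNORECASE)
        if m:
            left = sentence[:m.start()].strip(" ,.")
            right = sentence[m.end():].strip(" ,.")
            if left and right:
                pairs.append((f"{wh} {left.lower()}?", right))
    return pairs

print(qa_pairs("Plants appear green because chlorophyll absorbs red light."))
# [('Why plants appear green?', 'chlorophyll absorbs red light')]
```

The generated questions need not be grammatical; they only have to pair question-like with answer-like text so the alignment model can learn lexical bridges.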
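
Personalized PageRank itself is standard and easy to write down. In the disambiguation setting, graph nodes would be candidate knowledge-base entries and the seed distribution encodes the mention's local evidence; the toy graph below is mine, not the paper's setup.

```python
import numpy as np

def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration: a random walk on the graph that, with
    probability alpha, teleports back to the seed distribution."""
    adj = np.asarray(adj, dtype=float)
    out_deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, out_deg, out=np.zeros_like(adj), where=out_deg > 0)
    seed = np.asarray(seed, dtype=float)
    r = seed.copy()
    for _ in range(iters):
        r = alpha * seed + (1 - alpha) * P.T @ r
    return r

# Toy 4-node entity graph; all seed mass on node 0.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
print(personalized_pagerank(A, [1.0, 0.0, 0.0, 0.0]).round(3))
```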

3C: Machine Learning for NLP

  • When and why are log-linear models self-normalizing? Authors: Jacob Andreas and Dan Klein
    Several techniques have recently been proposed for training self-normalized discriminative models. These attempt to find parameter settings for which unnormalized model scores approximate the true label probability. However, the theoretical properties of such techniques (and of self-normalization generally) have not been investigated. This paper examines the conditions under which we can expect self-normalization to work. We characterize a general class of distributions that admit self-normalization, and prove generalization bounds for procedures that minimize empirical normalizer variance. Motivated by these results, we describe a novel variant of an established procedure for training self-normalized models. The new procedure avoids computing normalizers for most training examples, and decreases training time by as much as a factor of ten while preserving model quality. (A sketch of the penalized objective follows this session's list.)
  • Deep Multilingual Correlation for Improved Word Embeddings Authors: Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, Karen Livescu
    Word embeddings have been found useful for many NLP tasks, including part-of-speech tagging, named entity recognition, and parsing. Adding multilingual context when learning embeddings can improve their quality, for example via canonical correlation analysis (CCA) on embeddings from two languages. In this paper, we extend this idea to learn deep non-linear transformations of word embeddings of the two languages, using the recently proposed deep canonical correlation analysis. The resulting embeddings, when evaluated on multiple word and bigram similarity tasks, consistently improve over monolingual embeddings and over embeddings transformed with linear CCA. (A linear-CCA sketch follows this session's list.)
  • Disfluency Detection with a Semi-Markov Model and Prosodic Features Authors: James Ferguson, Greg Durrett, Dan Klein
    We present a discriminative model for detecting disfluencies in spoken language transcripts. Structurally, our model is a semi-Markov conditional random field with features targeting characteristics unique to speech repairs. This gives a significant performance improvement over standard chain-structured CRFs that have been employed in past work. We then incorporate prosodic features over silences and relative word duration into our semi-CRF model, resulting in further performance gains; moreover, these features are not easily replaced by discrete prosodic indicators such as ToBI breaks. Our final system, the semi-CRF with prosodic information, achieves an F-score of 85.4, which is 1.3 F1 better than the best prior reported F-score on this dataset. (A semi-Markov decoding sketch follows this session's list.)
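
For the self-normalization paper, the object of study is easy to state concretely. One established penalized objective (this sketch is the basic variant; the paper analyzes when such training can work and gives a variant that skips most normalizer computations) adds a term pulling log Z(x) toward zero, so unnormalized scores can be read off as probabilities at test time:

```python
import numpy as np

def selfnorm_loss(W, x, y, alpha=0.1):
    """Log-linear negative log-likelihood plus a penalty on log Z(x).
    Driving log Z toward 0 (equivalently, shrinking its variance around
    0) lets exp(W[y] . x) approximate p(y | x) without normalizing."""
    scores = W @ x                       # unnormalized log-scores, one per label
    log_z = np.logaddexp.reduce(scores)  # log normalizer
    return (log_z - scores[y]) + alpha * log_z ** 2

W = np.array([[1.0, -0.5], [0.2, 0.7], [-0.3, 0.1]])  # 3 labels, 2 features
print(selfnorm_loss(W, np.array([0.5, 1.0]), y=1))
```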
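
Lu et al. build on linear CCA, which already captures the "multilingual context" idea: take embeddings for translation-paired words in two languages and learn projections that maximize correlation between the views. A compact linear-CCA sketch follows (the paper replaces the two linear maps with deep networks; the names and the regularizer are mine):

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Project two views (rows = embeddings of translation-paired words)
    into a shared k-dim space that maximizes cross-view correlation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):                     # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return X @ (Wx @ U[:, :k]), Y @ (Wy @ Vt[:k].T)

# Toy usage on random "embeddings" for 1000 paired words.
rng = np.random.default_rng(0)
Xc, Yc = linear_cca(rng.normal(size=(1000, 10)),
                    rng.normal(size=(1000, 8)), k=4)
```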
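
The semi-Markov structure in Ferguson et al. is the part worth a sketch: decoding chooses labeled segments rather than per-token tags, so a feature (or, in the paper, a prosodic cue like a pause) can inspect an entire candidate repair at once. Below is a minimal semi-Markov Viterbi with a toy scoring function; both are my illustrations, not the paper's trained model.

```python
def semi_markov_decode(tokens, score, max_len=4):
    """Viterbi over segments: every span of up to max_len tokens gets a
    label ('F' fluent / 'D' disfluent) and a score from score(span, label)."""
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n   # best[j] = best score covering tokens[:j]
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for label in ("F", "D"):
                s = best[i] + score(tokens[i:j], label)
                if s > best[j]:
                    best[j], back[j] = s, (i, label)
    segments, j = [], n
    while j > 0:                         # walk backpointers to recover spans
        i, label = back[j]
        segments.append((" ".join(tokens[i:j]), label))
        j = i
    return segments[::-1]

def toy_score(span, label):
    # Toy: filled pauses look disfluent; otherwise prefer fluent spans.
    if label == "D":
        return 2.0 if "uh" in span else -1.0
    return 0.5 * len(span)

print(semi_markov_decode("i want uh i need a flight".split(), toy_score))
# [('i want', 'F'), ('uh', 'D'), ('i need a flight', 'F')]
```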