Many important problems in Natural Language Processing (NLP) can be viewed as sequence labeling tasks, such as part-of-speech (PoS) tagging, named-entity recognition (NER), and Information Extraction (IE). As with other machine learning tasks, automatic sequence labeling typically requires annotated corpora on which to train predictive models. While such annotation was traditionally performed by domain experts, crowdsourcing has become a popular means of acquiring large labeled datasets at lower cost, though annotations from laypeople may be of lower quality than those from domain experts. It is therefore essential to model crowdsourced label quality, both to estimate the reliability of individual annotators and to aggregate individual annotations into a single set of reference-standard consensus labels. While many models have been proposed for aggregating crowd labels in binary or multiclass classification problems, far less work has explored crowd-based annotation of sequences. For aggregating crowd labels, we propose a novel Hidden Markov Model (HMM) variant. To predict sequences in unannotated text, we propose a neural approach based on Long Short-Term Memory (LSTM) networks. For evaluation, we consider two practical applications in two text genres: NER in news and IE from medical abstracts. Recognizing named entities such as people, organizations, or locations can be viewed as sequence labeling in which each label specifies whether a word is Inside, Outside, or at the Beginning (IOB) of a named entity. For this task, we consider the English portion of the CoNLL-2003 dataset, using crowd labels collected by Rodrigues et al. (2014). For the IE application, we use a set of biomedical abstracts describing Randomized Controlled Trials (RCTs). The crowdsourced annotations comprise labeled text spans within these abstracts that describe the patient populations enrolled in the corresponding RCTs. For example, an abstract may contain the text: "we recruited and enrolled diabetic patients".
The task is to identify and label such descriptions. Identifying these sequences is useful for downstream systems that process biomedical literature, e.g., clinical search engines. Our experiments both benchmark existing state-of-the-art approaches (sequential and non-sequential) and show that the proposed models achieve best-in-class performance.
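To make the IOB encoding mentioned above concrete, here is a minimal illustrative sketch (not from the report itself): a helper that tags a token span with B-/I- labels and O elsewhere. The sentence, the span choice, and the label name "POP" (for population) are hypothetical examples, not dataset excerpts.

```python
def iob_encode(tokens, span_start, span_end, label):
    """Tag tokens with IOB: B-<label> at the span start, I-<label> inside
    the span, and O (Outside) everywhere else. Span bounds are inclusive."""
    tags = []
    for i in range(len(tokens)):
        if i == span_start:
            tags.append("B-" + label)
        elif span_start < i <= span_end:
            tags.append("I-" + label)
        else:
            tags.append("O")
    return tags

# Hypothetical example: annotators marked "diabetic patients" as the
# population span in the sentence from the abstract above.
tokens = "we recruited and enrolled diabetic patients".split()
print(iob_encode(tokens, 4, 5, "POP"))
# ['O', 'O', 'O', 'O', 'B-POP', 'I-POP']
```

Under this encoding, aggregating crowd annotations reduces to choosing one consensus tag per token, and sequence prediction reduces to per-token tag classification with sequential dependencies.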
Aggregating and Predicting Sequence Labels from Crowd Annotations
Abstract
First Name: An Thanh
Last Name: Nguyen
Industry
Organization
Supervisor
Capstone Type
Date: Spring 2017