Workshop on New Forms of Generalization in Deep Learning and Natural Language Processing
June 5th at NAACL 2018!
TL;DR: We build models that work well on our datasets, but when we play with them we are surprised to find they are brittle and break. Let’s analyze their failings and propose new evaluations and models.
Overview
Deep learning has brought a wealth of state-of-the-art results and new capabilities. Although methods have achieved near human-level performance on many benchmarks, numerous recent studies imply that these benchmarks only weakly test their intended purpose, and that simple examples, produced by either humans or machines, can cause systems to fail spectacularly [1][2][3][4][5][6][7]. For example, a recently released textual entailment demo was criticized on social media for predicting:
“John killed Mary” → “Mary killed John” (entailment, with 92% confidence)
Such surprising failures, combined with the inability to interpret state-of-the-art models, have eroded confidence in our systems. While these systems are not perfect, the deeper flaw lies with our benchmarks, which do not adequately measure a model’s ability to generalize and are thus easily gamed.
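As a rough illustration of how such failures are surfaced, the sketch below probes a model for symmetric entailment predictions. Here `predict_entailment` is a hypothetical stand-in for any pretrained NLI model, not the API of a specific library.

```python
def symmetry_probe(predict_entailment, premise, hypothesis):
    """Query a (hypothetical) NLI model in both directions.

    `predict_entailment` should map (premise, hypothesis) to a
    probability of entailment; any pretrained NLI model can be
    wrapped to fit this signature.
    """
    forward = predict_entailment(premise=premise, hypothesis=hypothesis)
    backward = predict_entailment(premise=hypothesis, hypothesis=premise)
    return forward, backward

# e.g. symmetry_probe(model, "John killed Mary", "Mary killed John")
# Entailment is directional, so a high score in both directions signals
# that the model is exploiting word overlap rather than sentence meaning.
```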
This workshop provides a venue for exploring new approaches for measuring and enforcing generalization in models. We are soliciting work in the following areas:
- Analysis of existing models and their failings
- Creation of new evaluation paradigms,
e.g. zero-shot learning, Winograd schemas, and datasets designed to resist explicit gaming (a minimal example of a zero-shot split is sketched after this list).
- Modeling advances
e.g. regularization, compositionality, interpretability, inductive bias, multi-task learning, and other methods that promote generalization.
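To make the evaluation-paradigm item concrete, here is a minimal sketch of a zero-shot (compositional) split, in which held-out attribute and noun combinations never co-occur in training. The toy examples and field names are illustrative assumptions, not drawn from any released dataset.

```python
def zero_shot_split(examples, held_out_pairs):
    """Place examples with held-out (attribute, noun) pairs in test only."""
    train, test = [], []
    for ex in examples:
        pair = (ex["attribute"], ex["noun"])
        (test if pair in held_out_pairs else train).append(ex)
    return train, test

examples = [
    {"attribute": "red", "noun": "cube"},
    {"attribute": "blue", "noun": "cube"},
    {"attribute": "red", "noun": "sphere"},
    {"attribute": "blue", "noun": "sphere"},
]
# "blue" and "sphere" each appear in training, but never together, so test
# accuracy measures compositional generalization rather than memorization.
train, test = zero_shot_split(examples, held_out_pairs={("blue", "sphere")})
```

Under such a split, a model can only succeed by recombining familiar pieces in novel ways, which is precisely the ability that standard random splits fail to test.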
Some of our goals are similar in spirit to those of the recent “Build It, Break It” shared task [8]. However, we propose going beyond identifying areas of weakness (i.e. “breaking” existing systems) to discussing scalable evaluations that more rigorously test generalization, as well as modeling techniques for enforcing it.
Accepted Papers (to be presented as posters)
Accepted Cross / Non-Archival Submissions (to be presented as posters)
Annotation Artifacts in Natural Language Inference Data. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman and Noah A. Smith.
Stress Test Evaluation for Natural Language Inference. Abhilasha Ravichander, Aakanksha Naik, Norman Sadeh, Carolyn Rose and Graham Neubig.
Deep RNNs Learn Hierarchical Syntax. Terra Blevins, Omer Levy and Luke Zettlemoyer.
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. Mohit Iyyer, John Wieting, Kevin Gimpel and Luke Zettlemoyer.
Evaluating Compositionality in Sentence Embeddings. Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel Gershman and Noah Goodman.
Organizers:
Steering committee:
Program committee:
- Jacob Andreas
UC Berkeley
- Antoine Bosselut
U Washington
- Kai-Wei Chang
UC Los Angeles
- Eunsol Choi
U Washington
- Christos Christodoulopoulos
Amazon, Inc
- Ryan Cotterell
Johns Hopkins U
- Greg Durrett
UT Austin
- Nicholas FitzGerald
U Washington
- Maxwell Forbes
U Washington
- Spandana Gella
Edinburgh U
- Luheng He
U Washington
- Srinivasan Iyer
U Washington
- Mohit Iyyer
UMass Amherst
- Robin Jia
Stanford U
- Ioannis Konstas
Heriot-Watt U
- Jonathan Kummerfeld
U Michigan
- Alice Lai
UI Urbana-Champaign
- Mike Lewis
Facebook AI Research
- Tal Linzen
Johns Hopkins U
- Ishan Misra
Carnegie Mellon U
- Vicente Ordonez
U Virginia
- Siva Reddy
Stanford U
- Alan Ritter
Ohio State U
- Rajhans Samdani
Spoke
- Sameer Singh
UC Irvine
- Alane Suhr
Cornell U
- Chen-Tse Tsai
Bloomberg LP
- Shyam Upadhyay
U Pennsylvania
- Andreas Vlachos
U Sheffield
Citations:
- Levy et al. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? NAACL 2015.
- Moosavi & Strube. Lexical Features in Coreference Resolution: To Be Used With Caution. ACL 2017.
- Agrawal et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset. arXiv 2017.
- Yatskar et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition. CVPR 2017.
- Jia & Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP 2017.
- Levy et al. Zero-Shot Relation Extraction via Reading Comprehension. CoNLL 2017.
- Belinkov & Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. ICLR 2018.
- Ettinger et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. EMNLP 2017 Workshop.