Workshop on New Forms of Generalization in Deep Learning and Natural Language Processing

June 5th at NAACL 2018!

TL;DR: We build models that work well on our datasets, but when we play with them, we are surprised to find that they are brittle and break.
Let’s analyze their failings and propose new evaluations and models.

Schedule
Time	Speaker	Title
09:00 - 09:15	Organizers	Welcome and Introduction
09:15 - 09:50	Yejin Choi	Why NLU doesn’t generalize to NLG
09:50 - 10:25	Dan Roth	Incidental Supervision
10:25 - 10:35	Break
10:35 - 11:10	Percy Liang	Cheap Tricks and the Perils of Machine Learning
11:10 - 11:45	Ndapa Nakashole	Zero-Shot Learning for Word Translation: Successes and Failures
11:45 - 13:00	Lunch
13:00 - 14:30	Poster Session	Details Below
14:45 - 15:20	Sam Bowman	GLUE: Toward Task-Independent Sentence Understanding
15:20 - 15:55	Devi Parikh	Generalization "Opportunities" in Visual Question Answering
16:10 - 17:10	Panel	Twitter transcript
17:10 - 17:15	Organizers	Closing Remarks
Onsite Childcare Available

Overview

Deep learning has brought a wealth of state-of-the-art results and new capabilities. Although methods have achieved near human-level performance on many benchmarks, numerous recent studies imply that these benchmarks only weakly test their intended purpose, and that simple examples, produced either by humans or by machines, cause systems to fail spectacularly [1][2][3][4][5][6][7]. For example, a recently released textual entailment demo was criticized on social media for predicting:

“John killed Mary”	→	“Mary killed John”
Entailing with 92% confidence

Such surprising failures, combined with the inability to interpret state-of-the-art models, have eroded confidence in our systems. While these systems are not perfect, the real flaw lies with our benchmarks, which do not adequately measure a model’s ability to generalize and are thus easily gamed.

This workshop provides a venue for exploring new approaches for measuring and enforcing generalization in models. We are soliciting work in the following areas:

Analysis of existing models and their failings
Creation of new evaluation paradigms, e.g. zero-shot learning, Winograd schemas, and datasets that avoid explicit types of gamification
Modeling advances, e.g. regularization, compositionality, interpretability, inductive bias, multi-task learning, and other methods that promote generalization

Some of our goals are similar in spirit to those of the recent “Build it Break it” shared task [8]. However, we propose going beyond identifying areas of weakness (i.e. “breaking” existing systems), and discussing scalable evaluations that more rigorously test generalization as well as modeling techniques for enforcing it.

Accepted Papers (to be presented as posters)

Commonsense mining as knowledge base completion? A study on the impact of novelty	Stanislaw Jastrzebski, Dzmitry Bahdanau, Seyedarian Hosseini, Michael Noukhovitch, Yoshua Bengio and Jackie Cheung
Deep learning evaluation using deep linguistic processing	Alexander Kuhnle and Ann Copestake
The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models	Noah Weber, Leena Shekhar and Niranjan Balasubramanian
Extrapolation in NLP	Jeff Mitchell, Pontus Stenetorp, Pasquale Minervini and Sebastian Riedel
Towards Inference-Oriented Reading Comprehension: ParallelQA	Soumya Wadhwa, Varsha Embar, Matthias Grabmair and Eric Nyberg

Accepted Cross / Non Archival Submissions (to be presented as posters)

Annotation Artifacts in Natural Language Inference Data	Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman and Noah A. Smith
Stress Test Evaluation for Natural Language Inference	Abhilasha Ravichander, Aakanksha Naik, Norman Sadeh, Carolyn Rose and Graham Neubig
Deep RNNs Learn Hierarchical Syntax	Terra Blevins, Omer Levy and Luke Zettlemoyer
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks	Mohit Iyyer, John Wieting, Kevin Gimpel and Luke Zettlemoyer
Evaluating Compositionality in Sentence Embeddings	Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel Gershman and Noah Goodman

Organizers:

Mark Yatskar
Univ Washington
Yonatan Bisk
Univ Washington
Omer Levy
Univ Washington

Steering committee:

Yejin Choi
University of Washington
Devi Parikh
Georgia Tech / Facebook AI Research
Dan Roth
University of Pennsylvania

Program committee:

Jacob Andreas
UC Berkeley
Antoine Bosselut
U Washington
Kai-Wei Chang
UC Los Angeles
Eunsol Choi
U Washington
Christos Christodoulopoulos
Amazon, Inc
Ryan Cotterell
Johns Hopkins U
Greg Durrett
UT Austin
Nicholas FitzGerald
U Washington
Maxwell Forbes
U Washington
Spandana Gella
Edinburgh U
Luheng He
U Washington
Srinivasan Iyer
U Washington
Mohit Iyyer
UMass Amherst
Robin Jia
Stanford U
Ioannis Konstas
Heriot-Watt U
Jonathan Kummerfeld
U Michigan
Alice Lai
UI Urbana-Champaign
Mike Lewis
Facebook AI Research
Tal Linzen
Johns Hopkins U
Ishan Misra
Carnegie Mellon U
Vicente Ordonez
U Virginia
Siva Reddy
Stanford U
Alan Ritter
Ohio State U
Rajhans Samdani
Spoke
Sameer Singh
UC Irvine
Alane Suhr
Cornell U
Chen-Tse Tsai
Bloomberg LP
Shyam Upadhyay
U Pennsylvania
Andreas Vlachos
U Sheffield

Citations:

[1] Levy et al. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? NAACL 2015.
[2] Moosavi & Strube. Lexical Features in Coreference Resolution: To be Used With Caution. ACL 2017.
[3] Agrawal et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset. 2017.
[4] Yatskar et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition. CVPR 2016.
[5] Jia & Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP 2017.
[6] Levy et al. Zero-Shot Relation Extraction via Reading Comprehension. CoNLL 2017.
[7] Belinkov & Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. ICLR 2018.
[8] Ettinger et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. EMNLP Wksp 2017.