Workshop on New Forms of Generalization
in Deep Learning and Natural Language Processing

June 5th at NAACL 2018!

TL;DR: We build models that work well on our datasets, but when we play with them we are surprised to find that they are brittle and break.
Let’s analyze their failings and propose new evaluations & models.

09:00 - 09:15  Organizers       Welcome and Introduction
09:15 - 09:50  Yejin Choi       Why NLU doesn’t generalize to NLG
09:50 - 10:25  Dan Roth         Incidental Supervision
10:25 - 10:35  Break
10:35 - 11:10  Percy Liang      Cheap Tricks and the Perils of Machine Learning
11:10 - 11:45  Ndapa Nakashole  Zero-Shot Learning for Word Translation: Successes and Failures
11:45 - 13:00  Lunch
13:00 - 14:30  Poster Session   Details Below
14:45 - 15:20  Sam Bowman       GLUE: Toward Task-Independent Sentence Understanding
15:20 - 15:55  Devi Parikh      Generalization "Opportunities" in Visual Question Answering
16:10 - 17:10  Panel            Twitter transcript
17:10 - 17:15  Organizers       Closing Remarks

Onsite Childcare Available


Deep learning has brought a wealth of state-of-the-art results and new capabilities. Although methods have achieved near human-level performance on many benchmarks, numerous recent studies suggest that these benchmarks only weakly test their intended purpose, and that simple examples, produced by either humans or machines, cause systems to fail spectacularly[1][2][3][4][5][6][7]. For example, a recently released textual entailment demo was criticized on social media for predicting:

“John killed Mary”
“Mary killed John”
Entailing with 92% confidence

Such surprising failures, combined with the inability to interpret state-of-the-art models, have eroded confidence in our systems. While these systems are not perfect, the real flaw lies with our benchmarks, which do not adequately measure a model’s ability to generalize and are thus easily gameable.

This workshop provides a venue for exploring new approaches for measuring and enforcing generalization in models. We are soliciting work in the following areas:

  • Analysis of existing models and their failings
  • Creation of new evaluation paradigms,
    e.g. zero-shot learning, Winograd schemas, and datasets that avoid explicit types of gamification.
  • Modeling advances,
    e.g. regularization, compositionality, interpretability, inductive bias, multi-task learning, and other methods that promote generalization.

Some of our goals are similar in spirit to those of the recent “Build it Break it” shared task.[8] However, we propose going beyond identifying areas of weakness (i.e. “breaking” existing systems) to discuss scalable evaluations that more rigorously test generalization, as well as modeling techniques for enforcing it.

Accepted Papers (to be presented as posters)

  • Commonsense mining as knowledge base completion? A study on the impact of novelty
    Stanislaw Jastrzebski, Dzmitry Bahdanau, Seyedarian Hosseini, Michael Noukhovitch, Yoshua Bengio and Jackie Cheung
  • Deep learning evaluation using deep linguistic processing
    Alexander Kuhnle and Ann Copestake
  • The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models
    Noah Weber, Leena Shekhar and Niranjan Balasubramanian
  • Extrapolation in NLP
    Jeff Mitchell, Pontus Stenetorp, Pasquale Minervini and Sebastian Riedel
  • Towards Inference-Oriented Reading Comprehension: ParallelQA
    Soumya Wadhwa, Varsha Embar, Matthias Grabmair and Eric Nyberg

Accepted Cross / Non Archival Submissions (to be presented as posters)

  • Annotation Artifacts in Natural Language Inference Data
    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman and Noah A. Smith
  • Stress Test Evaluation for Natural Language Inference
    Abhilasha Ravichander, Aakanksha Naik, Norman Sadeh, Carolyn Rose and Graham Neubig
  • Deep RNNs Learn Hierarchical Syntax
    Terra Blevins, Omer Levy and Luke Zettlemoyer
  • Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
    Mohit Iyyer, John Wieting, Kevin Gimpel and Luke Zettlemoyer
  • Evaluating Compositionality in Sentence Embeddings
    Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel Gershman and Noah Goodman

Steering committee:
  • Yejin Choi
    University of Washington
  • Devi Parikh
    Georgia Tech / Facebook AI Research
  • Dan Roth
    University of Pennsylvania

Program committee:
  • Jacob Andreas
    UC Berkeley
  • Antoine Bosselut
    U Washington
  • Kai-Wei Chang
    UC Los Angeles
  • Eunsol Choi
    U Washington
  • Christos Christodoulopoulos
    Amazon, Inc
  • Ryan Cotterell
    Johns Hopkins U
  • Greg Durrett
    UT Austin
  • Nicholas FitzGerald
    U Washington
  • Maxwell Forbes
    U Washington
  • Spandana Gella
    Edinburgh U
  • Luheng He
    U Washington
  • Srinivasan Iyer
    U Washington
  • Mohit Iyyer
    UMass Amherst
  • Robin Jia
    Stanford U
  • Ioannis Konstas
    Heriot-Watt U
  • Jonathan Kummerfeld
    U Michigan
  • Alice Lai
    UI Urbana-Champaign
  • Mike Lewis
    Facebook AI Research
  • Tal Linzen
    Johns Hopkins U
  • Ishan Misra
    Carnegie Mellon U
  • Vicente Ordonez
    U Virginia
  • Siva Reddy
    Stanford U
  • Alan Ritter
    Ohio State U
  • Rajhans Samdani
  • Sameer Singh
    UC Irvine
  • Alane Suhr
    Cornell U
  • Chen-Tse Tsai
    Bloomberg LP
  • Shyam Upadhyay
    U Pennsylvania
  • Andreas Vlachos
    U Sheffield

  1. Levy et al. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? NAACL 2015.
  2. Moosavi & Strube. Lexical Features in Coreference Resolution: To be Used With Caution. ACL 2017.
  3. Agrawal et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset. 2017.
  4. Yatskar et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition. CVPR 2016.
  5. Jia & Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP 2017.
  6. Levy et al. Zero-Shot Relation Extraction via Reading Comprehension. CoNLL 2017.
  7. Belinkov & Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. ICLR 2018.
  8. Ettinger et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. EMNLP Wksp 2017.