Workshop on New Forms of Generalization
in Deep Learning and Natural Language Processing

TL;DR: We build models that work well on our datasets, but when we play with them we are surprised to find that they are brittle and break. Let's analyze their failings and propose new evaluations and models.

NAACL 2018!


Deep learning has brought a wealth of state-of-the-art results and new capabilities. Although methods have achieved near human-level performance on many benchmarks, numerous recent studies imply that these benchmarks only weakly test their intended purpose, and that simple examples, produced by either humans or machines, can cause systems to fail spectacularly [1][2][3][4][5][6][7]. For example, a recently released textual entailment demo was criticized on social media for predicting:

Premise: “John killed Mary”
Hypothesis: “Mary killed John”
Prediction: Entailment (92% confidence)

Such surprising failures, combined with the inability to interpret state-of-the-art models, have eroded confidence in our systems. While these systems are not perfect, the deeper flaw lies with our benchmarks, which do not adequately measure a model's ability to generalize and are thus easily gamed.
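To make this failure mode concrete, here is a minimal sketch of how such argument-swap probes can be generated automatically. The code is ours and purely illustrative; predict_entailment is a hypothetical stand-in for whatever model is under test, and the stub below simply reproduces the reported 92% prediction for demonstration.

```python
def predict_entailment(premise: str, hypothesis: str) -> float:
    """Hypothetical placeholder: return the model's P(entailment)."""
    return 0.92  # toy stub mimicking the demo's reported output; swap in a real model

def swap_arguments(sentence: str) -> str:
    """Naively swap subject and object of a simple 'SUBJ VERB OBJ' sentence."""
    subj, verb, obj = sentence.split()
    return f"{obj} {verb} {subj}"

def probe(sentence: str, threshold: float = 0.5) -> None:
    """Flag cases where an asymmetric premise 'entails' its own reversal."""
    swapped = swap_arguments(sentence)
    p = predict_entailment(sentence, swapped)
    status = "FAIL" if p >= threshold else "ok"
    print(f"{status}: '{sentence}' -> '{swapped}' (P(entailment) = {p:.2f})")

probe("John killed Mary")  # a sound model should not predict entailment here
```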

This workshop provides a venue for exploring new approaches for measuring and enforcing generalization in models. We are soliciting work in the following areas:

  • Analysis of existing models and their failings
  • Creation of new evaluation paradigms,
    e.g. zero-shot learning, Winograd schemas, and datasets that avoid explicit types of gamification (a sketch of one such paradigm follows this list).
  • Modeling advances,
    e.g. regularization, compositionality, interpretability, inductive bias, multi-task learning, and other methods that promote generalization.
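As a concrete illustration of the evaluation-paradigm category, here is a minimal sketch of how a Winograd-schema-style benchmark can be scored so that one-sided surface cues do not pay off. The code is ours and purely illustrative; resolve_pronoun is a hypothetical placeholder for the coreference model under test, not any particular system's API.

```python
from dataclasses import dataclass

@dataclass
class WinogradPair:
    """A minimal pair: the two sentences differ in one 'special' word,
    which flips the correct referent of the target pronoun."""
    sentence_a: str
    sentence_b: str
    pronoun: str
    answer_a: str  # correct referent in sentence_a
    answer_b: str  # correct referent in sentence_b

def resolve_pronoun(sentence: str, pronoun: str) -> str:
    """Hypothetical placeholder: return the model's predicted referent."""
    raise NotImplementedError("plug in the model under test here")

def score(pairs: list[WinogradPair]) -> float:
    """Credit an item only if BOTH halves of the pair are resolved correctly,
    so a model cannot do well by exploiting one-sided surface cues."""
    correct = sum(
        1 for p in pairs
        if resolve_pronoun(p.sentence_a, p.pronoun) == p.answer_a
        and resolve_pronoun(p.sentence_b, p.pronoun) == p.answer_b
    )
    return correct / len(pairs)

# The classic example: swapping "big" for "small" flips the referent of "it".
example = WinogradPair(
    sentence_a="The trophy doesn't fit in the suitcase because it is too big.",
    sentence_b="The trophy doesn't fit in the suitcase because it is too small.",
    pronoun="it",
    answer_a="the trophy",
    answer_b="the suitcase",
)
```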

Some of our goals are similar in spirit to those of the recent "Build It, Break It" shared task [8]. However, we propose going beyond identifying areas of weakness (i.e., "breaking" existing systems) to discussing scalable evaluations that more rigorously test generalization, as well as modeling techniques for enforcing it.

Important Dates
  • Deadline for submission: March 16, 2018
  • Notification of acceptance: April 2, 2018
  • Camera-ready deadline: April 16, 2018
  • Workshop date: June 5, 2018
Onsite childcare will be available.

Submission Formats
  1. Adversarial examples / breaking an existing system
     – 2 pages
  2. Archival dataset, benchmark, or modeling papers
     – 4 pages
  3. Non-archival cross submissions
     – 8 pages
Categories (1) and (2) are expected to use the NAACL-HLT 2018 style guides: LaTeX, Word, or Overleaf.

(Non-)Archival: Archival means the work will be included in our proceedings; it must not have been published previously in a peer-reviewed venue and must not be submitted to one in the future. If your work was previously published or is under submission elsewhere, please choose non-archival: you can still present a poster or talk, but we will not publish your paper in the ACL Anthology.

Submission site:

Invited Speakers and Panelists:
Steering committee:
  • Yejin Choi
    University of Washington
  • Devi Parikh
    Georgia Tech / Facebook AI Research
  • Dan Roth
    University of Pennsylvania
Program committee:
  • Jacob Andreas
    UC Berkeley
  • Antoine Bosselut
    U Washington
  • Kai-Wei Chang
    UC Los Angeles
  • Eunsol Choi
    U Washington
  • Christos Christodoulopoulos
    Amazon, Inc
  • Ryan Cotterell
    Johns Hopkins U
  • Greg Durrett
    UT Austin
  • Nicholas FitzGerald
    U Washington
  • Maxwell Forbes
    U Washington
  • Spandana Gella
    Edinburgh U
  • Luheng He
    U Washington
  • Srinivasan Iyer
    U Washington
  • Mohit Iyyer
    UMass Amherst
  • Robin Jia
    Stanford U
  • Ioannis Konstas
    Heriot-Watt U
  • Jonathan Kummerfeld
    U Michigan
  • Alice Lai
    UI Urbana-Champaign
  • Mike Lewis
    Facebook AI Research
  • Tal Linzen
    Johns Hopkins U
  • Ishan Misra
    Carnegie Mellon U
  • Vicente Ordonez
    U Virginia
  • Siva Reddy
    Stanford U
  • Alan Ritter
    Ohio State U
  • Rajhans Samdani
  • Sameer Singh
    UC Irvine
  • Alane Suhr
    Cornell U
  • Chen-Tse Tsai
    Bloomberg LP
  • Shyam Upadhyay
    U Pennsylvania
  • Andreas Vlachos
    U Sheffield
  1. Levy et al. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? NAACL 2015.
  2. Moosavi & Strube. Lexical Features in Coreference Resolution: To Be Used with Caution. ACL 2017.
  3. Agrawal et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset. arXiv 2017.
  4. Yatskar et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition. CVPR 2017.
  5. Jia & Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP 2017.
  6. Levy et al. Zero-Shot Relation Extraction via Reading Comprehension. CoNLL 2017.
  7. Belinkov & Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. ICLR 2018.
  8. Ettinger et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. EMNLP Workshop 2017.