CANARD

A Dataset for Question-in-Context Rewriting

CANARD is a dataset for question-in-context rewriting. It consists of questions, each given in a dialog context, together with a context-independent rewriting of the question. The context of each question is the sequence of dialog utterances that precede it. CANARD can be used to evaluate question rewriting models that handle important linguistic phenomena such as coreference and ellipsis resolution.

CANARD is based on QuAC (Choi et al., 2018), a conversational reading comprehension dataset in which answers are spans selected from a given section of a Wikipedia article. Some questions in QuAC cannot be answered from their given sections. We use the answer 'I don't know.' for such questions.

CANARD is constructed by crowdsourcing question rewritings using Amazon Mechanical Turk. We apply several automatic and manual quality controls to ensure the quality of the data collection process. The dataset consists of 40,527 questions with different context lengths. More details are available in our EMNLP 2019 paper. An example is provided below. The dataset is distributed under the CC BY-SA 4.0 license.

Example Dialog with Rewritten Questions

Article: Morbid Angel

Section: Rise to success (1991-1995)

Question 1: How did they begin their rise to success?

Rewrite: How did Morbid Angel begin their rise to success?

Answer: 1991 saw the release of their second album, Blessed Are the Sick, which was met with widespread critical acclaim

Question 2: Are there any singles from that album?

Rewrite: Are there any singles from Morbid Angel's album Blessed Are the Sick?

Answer: I don't know.

Question 3: What else is notable about Rise to Success?

Rewrite: Other than the release of Morbid Angel's second album, what else is notable about their rise to success?

Answer: Domination, on May 9, 1995, which featured new guitarist Erik Rutan of Ripping Corpse. It proved to be a somewhat controversial album among fans

Question 4: Why was it controversial?

Rewrite: Why was Morbid Angel's 1995 album Domination controversial among fans ?

Answer: It proved to be a somewhat controversial album among fans, featuring a slower, more atmospheric and experimental sound than on previous albums.

Question 5: Did this receive any recognition?

Rewrite: Did Morbid Angel's album Domination receive any recognition?

Answer: Music critic describes the album's sound as more groove-oriented

Question 6: What else did critics say?

Rewrite: Besides being more groove-oriented, what else did music critics say about Morbid Angel's album Domination?

Answer: I don't know.


Dataset Format and Download

We use the rewrites of QuAC’s development set as our test set (5,571 question-in-context and corresponding rewrite pairs) and a 10% sample of QuAC’s training set rewrites as our development set (3,430 pairs); the rest are training data (31,526 pairs). Questions from the same QuAC dialog always end up in the same split (i.e., we sample 10% of the dialogs to create the development set). We also release 100 pairs of rewrites in which the two rewrites of each pair are provided by two different crowd workers. We use that set in our paper to estimate a reference human rewriting accuracy.
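The dialog-level sampling described above can be illustrated with a minimal sketch. This is not the script used to build CANARD; the function name `split_by_dialog`, the `dev_fraction` parameter, and the seed are hypothetical, and the only thing taken from the dataset is the `QuAC_dialog_id` field that identifies the source dialog of each example.

```python
import random
from collections import defaultdict


def split_by_dialog(records, dev_fraction=0.1, seed=0):
    """Split records into (train, dev) so that no dialog is shared between them.

    Illustrative only: sampling is done over dialog ids, not over questions.
    """
    # Group all questions by the QuAC dialog they came from.
    by_dialog = defaultdict(list)
    for record in records:
        by_dialog[record["QuAC_dialog_id"]].append(record)

    # Shuffle dialog ids deterministically, then take a fraction for dev.
    dialog_ids = sorted(by_dialog)
    random.Random(seed).shuffle(dialog_ids)
    n_dev = max(1, round(len(dialog_ids) * dev_fraction))
    dev_ids = set(dialog_ids[:n_dev])

    train = [r for d in dialog_ids if d not in dev_ids for r in by_dialog[d]]
    dev = [r for d in dialog_ids if d in dev_ids for r in by_dialog[d]]
    return train, dev


# Toy data: 20 made-up dialogs with 3 questions each.
toy = [
    {"QuAC_dialog_id": f"dialog_{d}", "Question_no": q}
    for d in range(20)
    for q in range(1, 4)
]
train, dev = split_by_dialog(toy)
```

Because whole dialogs are assigned to a split, the share of questions in the development set only approximates 10% when dialogs differ in length.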

Each JSON file is an array of objects, one per question, holding the question, its context, and its rewrite. Each object has the following fields:

    • History: an array of previous dialog utterances in the same order they appear in the dialog. The first two utterances are always the Wikipedia article title followed by the section title.
    • Question: the target question to be rewritten.
    • Rewrite: reference rewrite.
    • QuAC_dialog_id: the id of QuAC dialog used to generate the example.
    • Question_no: the number of the question as it appears in the full dialog (the first question has Question_no 1).

Example

"History": [
            "Ara Parseghian",
            "First national title",
            "When did ara parseghian win his first title.",
            "In 1966,"
          ],
"Question": "what was their record for that year?",
"Rewrite": "what was Ara Parseghian's record for 1966?",
"QuAC_dialog_id": "C_4ae4e1bbf2534dd18304f05d7f88a440_0",
"Question_no": 2
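
As a rough sketch of how such a record might be consumed, the snippet below parses the example above and separates the article title, the section title, and the earlier dialog turns. The helper names `split_history` and `format_input` and the `" ||| "` separator are hypothetical and not part of the released scripts. The snippet also assumes, based on the example, that the turns after the two titles alternate between question and answer.

```python
import json

# The example record from above, wrapped in an array as in the released files.
SAMPLE = json.loads("""
[{"History": ["Ara Parseghian",
              "First national title",
              "When did ara parseghian win his first title.",
              "In 1966,"],
  "Question": "what was their record for that year?",
  "Rewrite": "what was Ara Parseghian's record for 1966?",
  "QuAC_dialog_id": "C_4ae4e1bbf2534dd18304f05d7f88a440_0",
  "Question_no": 2}]
""")


def split_history(record):
    """Return (article title, section title, list of earlier (question, answer) turns)."""
    history = record["History"]
    title, section = history[0], history[1]
    turns = history[2:]
    # Assumption: turns alternate question, answer, question, answer, ...
    qa_pairs = list(zip(turns[0::2], turns[1::2]))
    return title, section, qa_pairs


def format_input(record, sep=" ||| "):
    """Join the history and the target question into one string (hypothetical format)."""
    return sep.join(record["History"] + [record["Question"]])


record = SAMPLE[0]
title, section, qa_pairs = split_history(record)
model_input = format_input(record)
```

For a downloaded split, the same functions would apply to each element of the array returned by `json.load` on the opened file.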

The scripts used to train the seq2seq rewriting baseline reported in the paper and a trained OpenNMT model are available at https://github.com/aagohary/canard.

For any questions, please contact Ahmed Elgohary <elgohary@cs.umd.edu>.

EMNLP'19 Paper BibTeX:

@inproceedings{Elgohary:Peskov:Boyd-Graber-2019,
  Title = {Can You Unpack That? Learning to Rewrite Questions-in-Context},
  Author = {Ahmed Elgohary and Denis Peskov and Jordan Boyd-Graber},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Year = {2019}
}