GitHub: datasets/natural_questions.py

ir_datasets: Natural Questions

Index
  1. natural-questions
  2. natural-questions/dev
  3. natural-questions/train

"natural-questions"

Google Natural Questions is a Q&A dataset containing long, short, and Yes/No answers from Wikipedia. ir_datasets frames this around an ad-hoc ranking setting by building a collection of all long answer candidate passages. However, short and Yes/No annotations are also available in the qrels, as are the passages presented to the annotators (via scoreddocs).

Importantly, the document collection does not consist of all Wikipedia passages; it is instead the union of the candidate passages presented to the annotators (akin to MS MARCO). dpr-w100/natural-questions/train and dpr-w100/natural-questions/dev contain a filtered set of the questions in this dataset over a full Wikipedia dump (a more realistic retrieval setting).

docs
28M docs

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str
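
The start_byte/end_byte and start_token/end_token fields locate each passage inside the parent Wikipedia page identified by parent_doc_id. A minimal slicing sketch, assuming the byte offsets index into the parent document's HTML bytes as in the original NQ release (the strings and offsets below are sample data, not real NQ content):

```python
# Hypothetical parent-document HTML and passage offsets, illustrating how
# start_byte/end_byte index into the parent page's UTF-8 bytes.
parent_html = "<p>Alpha</p><p>Beta passage text</p>"
start_byte, end_byte = 12, 36  # would come from an NqPassageDoc record

# Slice on bytes (not characters), since the offsets are byte positions.
passage = parent_html.encode("utf-8")[start_byte:end_byte].decode("utf-8")
print(passage)  # <p>Beta passage text</p>
```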

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq}

Bibtex:

@article{Kwiatkowski2019Nq,
  title = {Natural Questions: a Benchmark for Question Answering Research},
  author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
  year = {2019},
  journal = {TACL}
}

"natural-questions/dev"

Official dev set.

queries
7.8K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/dev queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
28M docs

Inherits docs from natural-questions

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/dev docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

qrels
7.7K qrels
Query relevance judgment type:
NqQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. short_answers: List[str]
  5. yes_no_answer: str

Relevance levels

Rel.  Definition                                                       Count  %
1     passage marked by annotator as a "long" answer to the question   7.7K   100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, short_answers, yes_no_answer>

You can find more details about the Python API here.
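
Beyond binary relevance, the short_answers and yes_no_answer fields can be used to build a per-query answer key for downstream QA evaluation. A minimal sketch over hypothetical sample records (the "NONE" sentinel for yes_no_answer is an assumption; check the field values in your copy of the data):

```python
from collections import defaultdict

# Hypothetical records with the NqQrel field order:
# (query_id, doc_id, relevance, short_answers, yes_no_answer).
sample_qrels = [
    ("q1", "d1", 1, ["42"], "NONE"),
    ("q2", "d7", 1, [], "YES"),
    ("q3", "d2", 1, [], "NONE"),  # long answer only, no short/Yes-No annotation
]

answers = defaultdict(list)
for query_id, doc_id, relevance, short_answers, yes_no_answer in sample_qrels:
    if short_answers:              # prefer short extractive answers when present
        answers[query_id].extend(short_answers)
    elif yes_no_answer != "NONE":  # otherwise fall back to a Yes/No annotation
        answers[query_id].append(yes_no_answer)

print(dict(answers))  # {'q1': ['42'], 'q2': ['YES']}
```

With the real data, the same loop would run over dataset.qrels_iter() instead of the sample list.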

CLI
ir_datasets export natural-questions/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [short_answers]    [yes_no_answer]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
973K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
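
Since the scoreddocs are the candidate passages presented to the annotators for each query, they can serve as a first-stage pool for reranking. A minimal sketch of grouping them by query, again over hypothetical sample records rather than the real 973K entries:

```python
from collections import defaultdict

# Hypothetical scoreddoc records (query_id, doc_id, score), as would be
# yielded by dataset.scoreddocs_iter().
sample_scoreddocs = [
    ("q1", "d3", 2.0),
    ("q1", "d9", 1.5),
    ("q2", "d4", 0.7),
]

candidates = defaultdict(list)
for query_id, doc_id, score in sample_scoreddocs:
    candidates[query_id].append((doc_id, score))

# Sort each query's pool by descending score before reranking.
for query_id in candidates:
    candidates[query_id].sort(key=lambda pair: pair[1], reverse=True)

print(candidates["q1"][0])  # ('d3', 2.0)
```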

CLI
ir_datasets export natural-questions/dev scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"natural-questions/train"

Official train set.

queries
307K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
28M docs

Inherits docs from natural-questions

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

qrels
152K qrels
Query relevance judgment type:
NqQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. short_answers: List[str]
  5. yes_no_answer: str

Relevance levels

Rel.  Definition                                                       Count  %
1     passage marked by annotator as a "long" answer to the question   152K   100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, short_answers, yes_no_answer>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [short_answers]    [yes_no_answer]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
40M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier
