Github: datasets/natural_questions.py

ir_datasets: Natural Questions

Index
  1. natural-questions
  2. natural-questions/dev
  3. natural-questions/train

"natural-questions"

Google Natural Questions is a Q&A dataset with long, short, and Yes/No answers drawn from Wikipedia. ir_datasets frames it as an ad-hoc ranking task by building a collection of all long-answer candidate passages. Short and Yes/No annotations are also available in the qrels, as are the passages presented to the annotators (via scoreddocs).

Importantly, the document collection does not consist of all Wikipedia passages, but rather the union of the candidate passages presented to the annotators (akin to MS MARCO). dpr-w100/natural-questions/train and dpr-w100/natural-questions/dev contain a filtered set of this dataset's questions over a full Wikipedia passage dump, which is a more realistic retrieval setting.
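
Because the collection is keyed by passage, any candidate can be fetched directly by its doc_id, and its source Wikipedia page recovered via parent_doc_id. A minimal sketch using the docs_store lookup (the passage is taken from iteration rather than hard-coded):

import ir_datasets
dataset = ir_datasets.load("natural-questions")
docs_store = dataset.docs_store()  # random-access lookup by doc_id

# Take one passage from the collection, then fetch it again by id.
first = next(iter(dataset.docs_iter()))
doc = docs_store.get(first.doc_id)
print(doc.document_title, doc.parent_doc_id)  # source page title / id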

docs
28M docs

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.
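
With 28M passages, a full scan is expensive. docs_iter supports slicing, which is a quick way to sample the collection (a sketch; the slice bounds are arbitrary):

import ir_datasets
dataset = ir_datasets.load("natural-questions")
# Slicing avoids scanning all 28M passages.
for doc in dataset.docs_iter()[:3]:
    print(doc.doc_id, doc.document_title)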

CLI
ir_datasets export natural-questions docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.natural-questions')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq}

Bibtex:

@article{Kwiatkowski2019Nq,
  title = {Natural Questions: a Benchmark for Question Answering Research},
  author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
  year = {2019},
  journal = {TACL}
}

"natural-questions/dev"

Official dev set.

queries
7.8K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/dev queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.natural-questions.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
28M docs

Inherits docs from natural-questions

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/dev docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.natural-questions.dev')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
7.7K qrels
Query relevance judgment type:
NqQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. short_answers: List[str]
  5. yes_no_answer: str

Relevance levels

Rel.  Definition                                                       Count  %
1     passage marked by annotator as a "long" answer to the question   7.7K   100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, short_answers, yes_no_answer>

You can find more details about the Python API here.
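
Since each qrel carries the short-answer spans and the Yes/No label, judgments can be partitioned by answer type. A small sketch, assuming NQ-style YES/NO/NONE values for yes_no_answer:

import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
n_short = n_yes_no = 0
for qrel in dataset.qrels_iter():
    if qrel.short_answers:            # passage also has short-answer spans
        n_short += 1
    if qrel.yes_no_answer != "NONE":  # assumed NQ-style YES/NO/NONE label
        n_yes_no += 1
print(n_short, n_yes_no)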

CLI
ir_datasets export natural-questions/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [short_answers]    [yes_no_answer]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.natural-questions.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
973K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
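
Because the scoreddocs are the passages presented to the annotators, they form a natural candidate set for re-ranking. A sketch that groups them by query (the dev set's ~973K rows fit comfortably in memory):

import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load("natural-questions/dev")
candidates = defaultdict(list)  # query_id -> [doc_id, ...]
for sd in dataset.scoreddocs_iter():
    candidates[sd.query_id].append(sd.doc_id)
# Re-rank candidates[query_id] with your model instead of retrieving
# from all 28M passages.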

CLI
ir_datasets export natural-questions/dev scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/dev')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro  # requires that experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.natural-questions.dev.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun.

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq}

Bibtex:

@article{Kwiatkowski2019Nq,
  title = {Natural Questions: a Benchmark for Question Answering Research},
  author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
  year = {2019},
  journal = {TACL}
}

"natural-questions/train"

Official train set.

queries
307K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.natural-questions.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
28M docs

Inherits docs from natural-questions

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train docs
[doc_id]    [text]    [html]    [start_byte]    [end_byte]    [start_token]    [end_token]    [document_title]    [document_url]    [parent_doc_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
# Index natural-questions
indexer = pt.IterDictIndexer('./indices/natural-questions')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'html', 'document_title', 'document_url', 'parent_doc_id'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.natural-questions.train')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
152K qrels
Query relevance judgment type:
NqQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. short_answers: List[str]
  5. yes_no_answer: str

Relevance levels

Rel.  Definition                                                       Count  %
1     passage marked by annotator as a "long" answer to the question   152K   100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, short_answers, yes_no_answer>

You can find more details about the Python API here.
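
A common use of the train split is mining positive (question, passage) pairs for training a retriever. A minimal sketch combining queries, qrels, and the docs store (it loads all 307K query texts into memory):

import ir_datasets
import itertools
dataset = ir_datasets.load("natural-questions/train")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docs_store = dataset.docs_store()
# Pair each judged long-answer passage with its question text.
for qrel in itertools.islice(dataset.qrels_iter(), 5):
    passage = docs_store.get(qrel.doc_id)
    print(queries[qrel.query_id], '->', passage.text[:80])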

CLI
ir_datasets export natural-questions/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [short_answers]    [yes_no_answer]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
index_ref = pt.IndexRef.of('./indices/natural-questions') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.natural-questions.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
40M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export natural-questions/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:natural-questions/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro  # requires that experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.natural-questions.train.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun.

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq}

Bibtex:

@article{Kwiatkowski2019Nq,
  title = {Natural Questions: a Benchmark for Question Answering Research},
  author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
  year = {2019},
  journal = {TACL}
}