GitHub: datasets/natural_questions.py

ir_datasets: Natural Questions

Index
  1. natural-questions
  2. natural-questions/dev
  3. natural-questions/train

"natural-questions"

Google Natural Questions is a Q&A dataset containing long, short, and Yes/No answers from Wikipedia. ir_datasets frames it as an ad-hoc ranking task by building a collection from all long-answer candidate passages. Short and Yes/No annotations are also available in the qrels, and the passages presented to the annotators are available via scoreddocs.

Importantly, the document collection does not consist of all Wikipedia passages; it is the union of the candidate passages presented to the annotators (akin to MS MARCO). dpr-w100/natural-questions/train and dpr-w100/natural-questions/dev contain a filtered set of the questions in this dataset together with a full Wikipedia dump, which is a more realistic retrieval setting.

docs | Citation

Language: en

Document type:
NqPassageDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. html: str
  4. start_byte: int
  5. end_byte: int
  6. start_token: int
  7. end_token: int
  8. document_title: str
  9. document_url: str
  10. parent_doc_id: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("natural-questions")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>

More details about the Python API are available in the ir_datasets documentation.
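
Each passage records the page it was drawn from in parent_doc_id, so passages can be regrouped by source page. The following is a minimal sketch rather than part of ir_datasets: group_by_parent and demo are hypothetical helper names, and the code assumes only the doc_id and parent_doc_id fields listed above. The ir_datasets call is kept inside demo because iterating the real collection triggers a large download.

```python
from collections import defaultdict
from itertools import islice


def group_by_parent(docs):
    """Map parent_doc_id -> list of passage doc_ids, in iteration order.

    `docs` is any iterable of records exposing `doc_id` and `parent_doc_id`,
    for example the namedtuples yielded by `dataset.docs_iter()`.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.parent_doc_id].append(doc.doc_id)
    return dict(groups)


def demo(limit=1000):
    # Imported lazily so group_by_parent works without ir_datasets installed.
    import ir_datasets

    dataset = ir_datasets.load("natural-questions")
    # Inspect only the first `limit` passages; the full collection is large.
    groups = group_by_parent(islice(dataset.docs_iter(), limit))
    for parent_id, passage_ids in list(groups.items())[:5]:
        print(parent_id, len(passage_ids))
```

Calling demo() prints a few parent page IDs along with how many of the first `limit` passages belong to each.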


"natural-questions/dev"

Official dev set.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

More details about the Python API are available in the ir_datasets documentation.
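
The dev split also provides qrels and scoreddocs (the candidate passages presented to annotators). A hedged sketch of reading both follows. relevant_by_query, candidates_by_query, and demo are hypothetical helper names; the code assumes only the query_id, doc_id, and relevance fields on qrels and the query_id and doc_id fields on scoreddocs. Treating relevance of at least 1 as "relevant" is an assumption exposed through the min_rel parameter.

```python
from collections import defaultdict


def relevant_by_query(qrels, min_rel=1):
    """Map query_id -> set of doc_ids judged at or above `min_rel`."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


def candidates_by_query(scoreddocs):
    """Map query_id -> list of candidate doc_ids, in iteration order."""
    candidates = defaultdict(list)
    for scored in scoreddocs:
        candidates[scored.query_id].append(scored.doc_id)
    return dict(candidates)


def demo():
    # Imported lazily; loading the data triggers a download.
    import ir_datasets

    dataset = ir_datasets.load("natural-questions/dev")
    relevant = relevant_by_query(dataset.qrels_iter())
    candidates = candidates_by_query(dataset.scoreddocs_iter())
    print(len(relevant), "queries with at least one relevant passage")
    print(len(candidates), "queries with candidate passages")
```

Keeping the helpers independent of ir_datasets means they can be exercised on any records with the same field names.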


"natural-questions/train"

Official train set.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

More details about the Python API are available in the ir_datasets documentation.
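
One plausible use of the train split is pairing each question with the text of its judged-relevant passages. The sketch below is illustrative only: positive_pairs and demo are hypothetical names, get_text is any caller-supplied function mapping a doc_id to passage text, and min_rel=1 is an assumed relevance threshold. The demo function uses dataset.docs_store() for lookups by doc_id.

```python
from itertools import islice


def positive_pairs(queries, qrels, get_text, min_rel=1):
    """Yield (query_text, passage_text) for each judged-relevant passage.

    Qrels whose query_id has no matching query are skipped.
    """
    query_text = {query.query_id: query.text for query in queries}
    for qrel in qrels:
        if qrel.relevance >= min_rel and qrel.query_id in query_text:
            yield query_text[qrel.query_id], get_text(qrel.doc_id)


def demo(limit=3):
    # Imported lazily; loading the data triggers a download.
    import ir_datasets

    dataset = ir_datasets.load("natural-questions/train")
    store = dataset.docs_store()
    pairs = positive_pairs(
        dataset.queries_iter(),
        dataset.qrels_iter(),
        lambda doc_id: store.get(doc_id).text,
    )
    for question, passage in islice(pairs, limit):
        print(question, "->", passage[:80])
```

Because positive_pairs is a generator, islice stops after `limit` pairs without scanning every qrel.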