ir_datasets
: Natural QuestionsGoogle Natural Questions is a Q&A dataset containing long, short, and Yes/No answers from Wikipedia. ir_datasets frames this around an ad-hoc ranking setting by building a collection of all long answer candidate passages. However, short and Yes/No annotations are also available in the qrels, as are the passages presented to the annotators (via scoreddocs).
Importantly, the document collection does not consist of all Wikipedia passages, but instead a union of the candidate passages presented to the annotators (akin to MS MARCO). dph-w100/natural-questions/train and dph-w100/natural-questions/dev contain a filtered set of the questions in this dataset and a full Wikipedia dump (which is a more realistic retrieval setting).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("natural-questions")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, html, start_byte, end_byte, start_token, end_token, document_title, document_url, parent_doc_id>
You can find more details about the Python API here.
Official dev set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("natural-questions/dev")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
Official train set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("natural-questions/train")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.