ir_datasets: MSMARCO (document, version 2)
Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, title, headings, body>
You can find more details about the Python API here.
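Because the corpus holds roughly 12M documents, it is usually more practical to inspect a handful of records than to materialize the whole iterator. The following is a minimal sketch, assuming only that docs_iter() behaves like an ordinary Python iterator. The helper name peek and the stand-in Doc records are hypothetical, used so the snippet runs without downloading the corpus.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in that mirrors the field names shown above;
# real records would come from dataset.docs_iter().
Doc = namedtuple("Doc", ["doc_id", "url", "title", "headings", "body"])


def peek(records, n=3):
    """Return the first n items of an iterator, leaving the rest unread."""
    return list(islice(records, n))


# Made-up records for illustration only (not real MS MARCO documents).
sample_docs = (
    Doc(f"doc_{i}", f"https://example.com/{i}", f"Title {i}", "", "text")
    for i in range(10)
)

first = peek(sample_docs, n=3)
for doc in first:
    print(doc.doc_id, doc.title)
```

With the real dataset, the equivalent call would be peek(dataset.docs_iter(), n=3).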
Official dev1 set with 4,552 queries.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Official dev2 set with 5,000 queries.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Official train set with 322,196 queries.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
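The relationship between a full query set and its judged subset can be sketched as a filter over query IDs. This is an illustration only, not necessarily how the library builds the subset. It assumes each relevance judgment exposes a query_id field; the Query and Qrel stand-ins, the helper name judged_only, and all sample values are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-ins; Query mirrors the namedtuple<query_id, text> shape
# shown above, and Qrel is an assumed minimal shape for a relevance judgment.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def judged_only(queries, qrels):
    """Keep only the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up data for illustration only.
queries = [
    Query("1", "first example query"),
    Query("2", "second example query"),
    Query("3", "third example query"),
]
qrels = [
    Qrel("1", "doc_a", 2),
    Qrel("3", "doc_b", 0),  # a judgment of 0 still counts as judged
]

judged = judged_only(queries, qrels)
print([query.query_id for query in judged])
```

Query "2" is dropped because no judgment refers to it, while query "3" is kept even though its only judgment has relevance 0.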
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>