ir_datasets: MSMARCO (passage, version 2)

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
Official dev1 set with 3,903 queries.
Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is then identified.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Official dev2 set with 4,281 queries.
Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is then identified.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Official train set with 277,144 queries.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Official topics for the TREC Deep Learning (DL) 2021 shared task.
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>