Github: datasets/msmarco_passage_v2.py

ir_datasets: MSMARCO (passage, version 2)

Index
  1. msmarco-passage-v2
  2. msmarco-passage-v2/dev1
  3. msmarco-passage-v2/dev2
  4. msmarco-passage-v2/train
  5. msmarco-passage-v2/trec-dl-2021
  6. msmarco-passage-v2/trec-dl-2021/judged

"msmarco-passage-v2"

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

  • Version 1 of dataset: msmarco-passage
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper

Change Log

  • On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have changed results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.
Provides: docs
138M docs

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.
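The spans field records where each passage sits inside its source document (the one named by msmarco_document_id). A minimal sketch of how such offsets can be used, assuming the spans are (start, stop) character ranges into the document's body text; the namedtuple stand-in, document body, and offsets below are made-up sample data, not real corpus content:

```python
from collections import namedtuple

# Illustrative stand-in for ir_datasets' MsMarcoV2Passage namedtuple;
# the document body and offsets below are fabricated sample data.
MsMarcoV2Passage = namedtuple(
    "MsMarcoV2Passage", ["doc_id", "text", "spans", "msmarco_document_id"]
)

doc_body = "The quick brown fox. It jumps over the lazy dog."
passage = MsMarcoV2Passage(
    doc_id="msmarco_passage_00_0",
    text="It jumps over the lazy dog.",
    spans=((21, 48),),
    msmarco_document_id="msmarco_doc_00_0",
)

# Treating each (start, stop) span as a character range into the source
# document's body, slicing and joining reconstructs the passage text.
reconstructed = " ".join(doc_body[start:stop] for start, stop in passage.spans)
print(reconstructed)  # It jumps over the lazy dog.
```

This also shows why the passage corpus can be "linked up" with msmarco-document-v2: each passage carries both the parent document ID and its location within that document.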


"msmarco-passage-v2/dev1"

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
3.9K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
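The official measure for the dev sets is RR@10: the reciprocal rank of the first relevant passage within the top 10, averaged over queries. A minimal sketch of the computation on made-up ranked lists and qrels (the query and doc IDs are fabricated for illustration):

```python
def rr_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant doc within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Made-up run and qrels for two queries; real ones come from the dataset.
runs = {
    "q1": ["d3", "d7", "d1"],  # first relevant doc at rank 2 -> RR 0.5
    "q2": ["d9", "d8", "d2"],  # no relevant doc in top 10    -> RR 0.0
}
qrels = {"q1": {"d7"}, "q2": {"d4"}}

mrr = sum(rr_at_k(runs[q], qrels[q]) for q in runs) / len(runs)
print(mrr)  # 0.25
```

In practice the official numbers come from the task's evaluation tooling; this sketch only illustrates the measure itself.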


"msmarco-passage-v2/dev2"

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
4.3K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage-v2/train"

Official train set with 277,144 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
277K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
477 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
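The TREC DL measures use graded relevance labels (nDCG@10 above; the rel=2 threshold binarizes labels for P@10 and RR). A minimal sketch of nDCG@10 under one common formulation (linear gains, log2 discount); the graded labels below are fabricated for illustration, and the official numbers come from the task's evaluation tooling:

```python
import math

def dcg_at_k(gains, k=10):
    # gains are graded relevance labels in ranked order; log2 rank discount
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    # normalize by the DCG of the ideal (descending-label) ordering
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Made-up graded labels (0-3 scale, as in TREC DL) for one query.
ranked_gains = [3, 0, 2]  # labels of docs in the order the system ranked them
all_gains = [3, 2, 0, 0]  # labels of every judged doc for the query
print(round(ndcg_at_k(ranked_gains, all_gains, k=10), 4))
```

A perfect ranking of the judged docs would score 1.0; here the relevant doc labeled 2 is discounted at rank 3 instead of rank 2, so the score falls just below 1.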


"msmarco-passage-v2/trec-dl-2021/judged"

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Note that at this time, these qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
53 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.