Github: datasets/msmarco_document_v2.py

ir_datasets: MSMARCO (document, version 2)

Index
  1. msmarco-document-v2
  2. msmarco-document-v2/anchor-text
  3. msmarco-document-v2/dev1
  4. msmarco-document-v2/dev2
  5. msmarco-document-v2/train
  6. msmarco-document-v2/trec-dl-2019
  7. msmarco-document-v2/trec-dl-2019/judged
  8. msmarco-document-v2/trec-dl-2020
  9. msmarco-document-v2/trec-dl-2020/judged
  10. msmarco-document-v2/trec-dl-2021
  11. msmarco-document-v2/trec-dl-2021/judged

"msmarco-document-v2"

Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).

  • Version 1 of dataset: msmarco-document
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper
Provides: docs
12M docs

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.
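
A common preprocessing step is to flatten the document fields into a single string for indexing. The sketch below constructs a `MsMarcoV2Document` namedtuple locally to mirror the schema listed above (real records come from `dataset.docs_iter()`; the field order title, headings, body is a common choice, not a requirement of the dataset, and the sample values are made up):

```python
from collections import namedtuple

# Local mirror of the MsMarcoV2Document fields listed above; the real
# objects are produced by dataset.docs_iter().
MsMarcoV2Document = namedtuple(
    "MsMarcoV2Document", ["doc_id", "url", "title", "headings", "body"]
)

def doc_to_text(doc):
    """Concatenate the textual fields into one string, skipping empty fields."""
    return "\n".join(part for part in (doc.title, doc.headings, doc.body) if part)

doc = MsMarcoV2Document(
    doc_id="msmarco_doc_00_0",          # illustrative id
    url="https://example.com/",
    title="Example Title",
    headings="Heading One",
    body="Body text.",
)
print(doc_to_text(doc))  # → Example Title\nHeading One\nBody text.
```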


"msmarco-document-v2/anchor-text"

For version 2 of MS MARCO, the anchor text collection enriches 4,821,244 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (as a result, all anchor text is included for 97% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains them as a list. The raw dataset with additional information (roughly 100GB) is available online.

Provides: docs
4.8M docs

Language: en

Document type:
MsMarcoV2AnchorTextDocument: (namedtuple)
  1. doc_id: str
  2. text: str
  3. anchors: List[str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/anchor-text")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, anchors>

You can find more details about the Python API here.
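
Since the anchors field keeps duplicate anchor texts, counting them gives a crude popularity signal for a document. The sketch below constructs a record locally to mirror the schema above (real records come from `docs_iter()`; the exact separator used to build the concatenated text field is an assumption here, and the anchor strings are made up):

```python
from collections import Counter, namedtuple

# Local mirror of MsMarcoV2AnchorTextDocument; real records come from
# ir_datasets.load("msmarco-document-v2/anchor-text").docs_iter().
MsMarcoV2AnchorTextDocument = namedtuple(
    "MsMarcoV2AnchorTextDocument", ["doc_id", "text", "anchors"]
)

anchors = ["MS MARCO", "document ranking", "MS MARCO"]
doc = MsMarcoV2AnchorTextDocument(
    doc_id="msmarco_doc_00_0",
    text=" ".join(anchors),  # assumed separator, for illustration only
    anchors=anchors,
)

# Duplicates are preserved, so a Counter reveals the most frequent anchor.
counts = Counter(doc.anchors)
print(counts.most_common(1))  # → [('MS MARCO', 2)]
```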


"msmarco-document-v2/dev1"

Official dev1 set with 4,552 queries.

Provides: queries, docs, qrels, scoreddocs
4.6K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
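
This subset also provides qrels, and a common pattern is to group the relevant documents by query before evaluation. The sketch below uses locally constructed records that mirror what `dataset.qrels_iter()` yields (the ids and grades here are made up for illustration):

```python
from collections import defaultdict, namedtuple

# Mirror of the qrels records yielded by dataset.qrels_iter();
# the values below are made up for illustration.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

qrels = [
    TrecQrel("2", "msmarco_doc_01_123", 1, "0"),
    TrecQrel("2", "msmarco_doc_02_456", 0, "0"),
    TrecQrel("1215", "msmarco_doc_03_789", 1, "0"),
]

# Group relevant documents by query for easy lookup during evaluation.
relevant = defaultdict(set)
for qrel in qrels:
    if qrel.relevance > 0:
        relevant[qrel.query_id].add(qrel.doc_id)

print(sorted(relevant))  # → ['1215', '2']
```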


"msmarco-document-v2/dev2"

Official dev2 set with 5,000 queries.

Provides: queries, docs, qrels, scoreddocs
5.0K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document-v2/train"

Official train set with 322,196 queries.

Provides: queries, docs, qrels, scoreddocs
322K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document-v2/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document-v2/trec-dl-2019/judged).

Provides: queries, docs, qrels
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document-v2/trec-dl-2019/judged"

Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.

Provides: queries, docs, qrels
43 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
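
The /judged subsets correspond to keeping only the queries whose query_id appears in the qrels. A minimal sketch of that filtering, with made-up query data (the texts and ids below are illustrative, not drawn from the dataset):

```python
# Illustrative query set: ids mapped to texts, plus the set of ids that
# appear in the qrels. All values here are made up.
queries = {
    "19335": "example judged query one",
    "20455": "example judged query two",
    "999999": "query with no judgments",
}
judged_query_ids = {"19335", "20455"}  # ids appearing in the qrels

# Keep only queries that have at least one relevance judgment.
judged = {qid: text for qid, text in queries.items() if qid in judged_query_ids}
print(len(judged))  # → 2
```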


"msmarco-document-v2/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document-v2/trec-dl-2020/judged).

Provides: queries, docs, qrels
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document-v2/trec-dl-2020/judged"

Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.

Provides: queries, docs, qrels
45 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
477 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
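
Official tooling such as trec_eval or the ir_measures package implements the evaluation measures listed above; the sketch below only illustrates what the rel=2 cutoff in RR(rel=2) means, using made-up judgments (grades 0-3, as in TREC DL):

```python
def reciprocal_rank(ranking, qrels, min_rel=2):
    """RR(rel=min_rel): reciprocal of the rank of the first retrieved
    document whose relevance grade is at least min_rel; 0.0 if none is."""
    for rank, doc_id in enumerate(ranking, start=1):
        if qrels.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0

# Made-up relevance judgments for illustration.
qrels = {"docA": 1, "docB": 3, "docC": 0}

# docA is graded 1 (< 2), so the first qualifying document is docB at rank 2.
print(reciprocal_rank(["docA", "docB", "docC"], qrels))  # → 0.5
```

The rel=2 threshold means grade-1 ("related") documents do not count as hits; only grades 2 and 3 do.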


"msmarco-document-v2/trec-dl-2021/judged"

Subset of msmarco-document-v2/trec-dl-2021, filtered down to the 57 queries with qrels.

Note that at this time, the qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
57 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.