GitHub: datasets/msmarco_document_v2.py

ir_datasets: MS MARCO (document, version 2)

Index
  1. msmarco-document-v2
  2. msmarco-document-v2/dev1
  3. msmarco-document-v2/dev2
  4. msmarco-document-v2/train
  5. msmarco-document-v2/trec-dl-2019
  6. msmarco-document-v2/trec-dl-2019/judged
  7. msmarco-document-v2/trec-dl-2020
  8. msmarco-document-v2/trec-dl-2020/judged

"msmarco-document-v2"

Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).

  • Version 1 of dataset: msmarco-document
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from a query log)
  • Dataset Paper
docs | Citation

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.
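
As a minimal sketch of working with the document fields listed above, the snippet below joins title, headings, and a truncated body into a single string, as one might do before indexing. The namedtuple is a local stand-in that mirrors the documented field names, and `to_index_text`, `max_body_chars`, and the example record are hypothetical; with the real dataset, records would come from `dataset.docs_iter()` instead.

```python
from collections import namedtuple

# Local stand-in mirroring the documented MsMarcoV2Document fields,
# so the sketch runs without downloading the corpus.
MsMarcoV2Document = namedtuple(
    "MsMarcoV2Document", ["doc_id", "url", "title", "headings", "body"]
)


def to_index_text(doc, max_body_chars=1000):
    """Join title, headings, and a truncated body into one string."""
    parts = [doc.title, doc.headings, doc.body[:max_body_chars]]
    # Skip empty fields so the result has no stray blank lines.
    return "\n".join(p.strip() for p in parts if p and p.strip())


# Hypothetical record, not taken from the corpus.
example = MsMarcoV2Document(
    doc_id="doc-0",
    url="https://example.com/page",
    title="Example title",
    headings="",
    body="Example body text.",
)
index_text = to_index_text(example)
print(index_text)
```

Truncating the body is only one possible choice; the field order and the cutoff are assumptions of this sketch rather than anything the dataset prescribes.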


"msmarco-document-v2/dev1"

Official dev1 set with 4,552 queries.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
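
Since this subset also provides qrels, a common next step is to look up the relevant documents for each query. The following is a rough sketch under stated assumptions: `GenericQuery` mirrors the fields documented above, while the `Qrel` stand-in, its field names, the helper `relevant_docs_by_query`, and all records are hypothetical. With the real dataset, the judgments would presumably come from `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple

# Local stand-ins. GenericQuery mirrors the fields documented above;
# the Qrel field names are an assumption made for this sketch.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged >= min_relevance."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


# Hypothetical records, not taken from the dataset.
queries = [
    GenericQuery("q1", "first question"),
    GenericQuery("q2", "second question"),
]
qrels = [
    Qrel("q1", "doc-a", 1),
    Qrel("q1", "doc-b", 0),
    Qrel("q2", "doc-c", 1),
]

relevant = relevant_docs_by_query(qrels)
for query in queries:
    print(query.query_id, sorted(relevant.get(query.query_id, set())))
```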



"msmarco-document-v2/dev2"

Official dev2 set with 5,000 queries.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
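
This subset also lists scoreddocs, which makes it natural to build a candidate list per query for re-ranking. The sketch below keeps the k highest-scoring documents for each query. The `ScoredDoc` stand-in, its field names, the helper `top_k_per_query`, and the records are all assumptions for illustration; with the real dataset, the input would presumably be `dataset.scoreddocs_iter()`.

```python
import heapq
from collections import defaultdict, namedtuple

# Local stand-in; the field names are an assumption made for this sketch.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=2):
    """Return {query_id: [doc_id, ...]} with the k best-scored docs per query."""
    by_query = defaultdict(list)
    for scored in scoreddocs:
        by_query[scored.query_id].append(scored)
    return {
        query_id: [
            s.doc_id for s in heapq.nlargest(k, docs, key=lambda s: s.score)
        ]
        for query_id, docs in by_query.items()
    }


# Hypothetical records, not taken from the dataset.
scoreddocs = [
    ScoredDoc("q1", "doc-a", 1.0),
    ScoredDoc("q1", "doc-b", 3.0),
    ScoredDoc("q1", "doc-c", 2.0),
    ScoredDoc("q2", "doc-d", 0.5),
]
candidates = top_k_per_query(scoreddocs, k=2)
print(candidates)
```

Grouping everything in memory is fine for a sketch; for a full run file, a streaming approach that relies on the input being grouped by query may be preferable.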



"msmarco-document-v2/train"

Official train set with 322,196 queries.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"msmarco-document-v2/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document-v2/trec-dl-2019/judged).

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"msmarco-document-v2/trec-dl-2019/judged"

Subset of msmarco-document-v2/trec-dl-2019 that includes only queries with qrels.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
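
To illustrate what "only including queries with qrels" amounts to, the sketch below filters a query list down to those that have at least one judgment. It is not the library's implementation: the `Qrel` stand-in and its field names, the helper `judged_queries`, and the records are hypothetical, and only `GenericQuery` mirrors the fields documented above.

```python
from collections import namedtuple

# Local stand-ins. GenericQuery mirrors the fields documented above;
# the Qrel field names are an assumption made for this sketch.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def judged_queries(queries, qrels):
    """Keep only queries whose query_id appears in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Hypothetical records, not taken from the dataset.
all_queries = [
    GenericQuery("q1", "judged question"),
    GenericQuery("q2", "unjudged question"),
    GenericQuery("q3", "another judged question"),
]
qrels = [Qrel("q1", "doc-a", 2), Qrel("q3", "doc-b", 0)]

judged = judged_queries(all_queries, qrels)
print([query.query_id for query in judged])
```

Note that a query counts as judged here even if all of its judgments are non-relevant, since it still has qrels.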



"msmarco-document-v2/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document-v2/trec-dl-2020/judged).

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"msmarco-document-v2/trec-dl-2020/judged"

Subset of msmarco-document-v2/trec-dl-2020 that includes only queries with qrels.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
