GitHub: datasets/msmarco_document.py

ir_datasets: MSMARCO (document)

Index
  1. msmarco-document
  2. msmarco-document/anchor-text
  3. msmarco-document/dev
  4. msmarco-document/eval
  5. msmarco-document/orcas
  6. msmarco-document/train
  7. msmarco-document/trec-dl-2019
  8. msmarco-document/trec-dl-2019/judged
  9. msmarco-document/trec-dl-2020
  10. msmarco-document/trec-dl-2020/judged
  11. msmarco-document/trec-dl-hard
  12. msmarco-document/trec-dl-hard/fold1
  13. msmarco-document/trec-dl-hard/fold2
  14. msmarco-document/trec-dl-hard/fold3
  15. msmarco-document/trec-dl-hard/fold4
  16. msmarco-document/trec-dl-hard/fold5

"msmarco-document"

"Based the questions in the [MS-MARCO] Question Answering Dataset and the documents which answered the questions a document ranking task was formulated. There are 3.2 million documents and the goal is to rank based on their relevance. Relevance labels are derived from what passages was marked as having the answer in the QnA dataset."

Provides: docs
3.2M docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.
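
Beyond iteration, individual documents can be fetched by doc_id through the docs_store interface. A minimal sketch; the doc_id shown is only a placeholder, substitute one obtained from docs_iter():

import ir_datasets
dataset = ir_datasets.load("msmarco-document")
docstore = dataset.docs_store()       # builds/loads a local lookup structure on first use
doc = docstore.get("D000000")         # placeholder doc_id
print(doc.title)
print(doc.body[:200])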


"msmarco-document/anchor-text"

For version 1 of MS MARCO, the anchor text collection enriches 1,703,834 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (with this sampling, all anchor text is still included for 94% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100 GB) is available online.

Provides: docs
1.7M docs

Language: en

Document type:
MsMarcoAnchorTextDocument: (namedtuple)
  1. doc_id: str
  2. text: str
  3. anchors: List[str]

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/anchor-text")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, anchors>

You can find more details about the Python API here.
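
Because the anchor-text documents share doc_ids with msmarco-document, they can be joined back to the original collection, e.g. to append anchor text to a document's body. A minimal sketch (islice only keeps the example short):

import itertools
import ir_datasets

anchors = ir_datasets.load("msmarco-document/anchor-text")
docstore = ir_datasets.load("msmarco-document").docs_store()

for anchor_doc in itertools.islice(anchors.docs_iter(), 5):
    orig = docstore.get(anchor_doc.doc_id)              # same doc_id space as msmarco-document
    enriched = orig.body + " " + anchor_doc.text        # body plus concatenated anchor texts
    print(anchor_doc.doc_id, len(anchor_doc.anchors), "anchor texts")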


"msmarco-document/dev"

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
5.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
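
The qrels and scoreddocs can be combined for the re-ranking setting, labelling each top-100 candidate with its (sparse) judgment. A minimal sketch:

import ir_datasets

dataset = ir_datasets.load("msmarco-document/dev")

# (query_id, doc_id) -> relevance; each query has exactly 1 positive judgment
qrels = {(q.query_id, q.doc_id): q.relevance for q in dataset.qrels_iter()}

for sdoc in dataset.scoreddocs_iter():       # top 100 Indri QL results per query
    label = qrels.get((sdoc.query_id, sdoc.doc_id), 0)
    # (sdoc.query_id, sdoc.doc_id, sdoc.score, label) is one re-ranking instance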


"msmarco-document/eval"

Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Official evaluation measures: RR@10

Provides: queries, docs, scoreddocs
5.8K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
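
Since the judgments are hidden, this subset is typically used to produce a leaderboard run by re-ranking the provided candidates. A minimal sketch that collects the top-100 candidates per query (the re-ranking model itself is left out):

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("msmarco-document/eval")

candidates = defaultdict(list)               # query_id -> top-100 Indri QL candidates
for sdoc in dataset.scoreddocs_iter():
    candidates[sdoc.query_id].append((sdoc.doc_id, sdoc.score))

for query in dataset.queries_iter():
    docs = candidates[query.query_id]        # re-rank these with your model, then submit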


"msmarco-document/orcas"

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

  • Queries: From query log
  • Relevance Data: User clicks
  • Scored docs: Indri Query Likelihood model
  • Dataset Paper

Official evaluation measures: RR, nDCG

Provides: queries, docs, qrels, scoreddocs
10M queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
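
The click data is exposed through the standard qrels interface, so query-document click pairs can be read directly. A minimal sketch that counts clicks per document over a prefix of the data (the full set has roughly 18 million pairs):

import itertools
from collections import Counter
import ir_datasets

dataset = ir_datasets.load("msmarco-document/orcas")

clicks_per_doc = Counter()
for qrel in itertools.islice(dataset.qrels_iter(), 100_000):  # limited for the example
    clicks_per_doc[qrel.doc_id] += 1         # each qrel is one query-document click connection

print(clicks_per_doc.most_common(5))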


"msmarco-document/train"

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
367K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
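
A common use of this split is mining training pairs: the single judged document serves as the positive, and other top-100 candidates serve as negatives. A minimal sketch under that assumption (the negative-sampling strategy is illustrative, not prescribed by the dataset, and the candidate lists are materialized in memory):

import itertools
import random
import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("msmarco-document/train")

positives = {q.query_id: q.doc_id for q in dataset.qrels_iter()}  # exactly 1 positive per query

candidates = defaultdict(list)                                    # query_id -> top-100 doc_ids
for sdoc in dataset.scoreddocs_iter():
    candidates[sdoc.query_id].append(sdoc.doc_id)

pairs = []
for query in itertools.islice(dataset.queries_iter(), 1000):      # limited for the example
    pos = positives.get(query.query_id)
    negs = [d for d in candidates[query.query_id] if d != pos]
    if pos and negs:
        pairs.append((query.text, pos, random.choice(negs)))      # (query text, positive, negative)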


"msmarco-document/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document/trec-dl-2019/judged).

Official evaluation measures: nDCG@10, RR, MAP

Provides: queries, docs, qrels, scoreddocs
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
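
The qrels can be passed directly to an evaluation package to compute the official measures. A minimal sketch assuming the companion ir_measures package and a TREC-format run file at a hypothetical path:

import ir_datasets
import ir_measures
from ir_measures import nDCG, RR, AP

dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
run = ir_measures.read_trec_run("my_run.trec")   # hypothetical path to your run file

# Unjudged queries are absent from the qrels and simply do not contribute.
print(ir_measures.calc_aggregate([nDCG@10, RR, AP], dataset.qrels_iter(), run))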


"msmarco-document/trec-dl-2019/judged"

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.

Official evaluation measures: nDCG@10, RR, MAP

Provides: queries, docs, qrels, scoreddocs
43 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document/trec-dl-2020/judged).

Official evaluation measures: nDCG@10, RR, MAP

Provides: queries, docs, qrels, scoreddocs
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-2020/judged"

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

Official evaluation measures: nDCG@10, RR, MAP

Provides: queries, docs, qrels, scoreddocs
45 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-hard"

A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
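
The queries are also partitioned into five folds (fold1 through fold5, listed below), which allows a simple cross-validation loop. A minimal sketch:

import ir_datasets

for i in range(1, 6):
    fold = ir_datasets.load(f"msmarco-document/trec-dl-hard/fold{i}")
    test_queries = list(fold.queries_iter())     # 10 queries held out in this fold
    test_qrels = list(fold.qrels_iter())
    # tune on the other four folds, evaluate on this one
    print(f"fold{i}: {len(test_queries)} queries")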


"msmarco-document/trec-dl-hard/fold1"

Fold 1 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-hard/fold2"

Fold 2 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-hard/fold3"

Fold 3 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-hard/fold4"

Fold 4 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-document/trec-dl-hard/fold5"

Fold 5 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.