Github: datasets/msmarco_document.py

ir_datasets: MSMARCO (document)

Index
  1. msmarco-document
  2. msmarco-document/dev
  3. msmarco-document/eval
  4. msmarco-document/orcas
  5. msmarco-document/train
  6. msmarco-document/trec-dl-2019
  7. msmarco-document/trec-dl-2019/judged
  8. msmarco-document/trec-dl-2020
  9. msmarco-document/trec-dl-2020/judged

"msmarco-document"

"Based the questions in the [MS-MARCO] Question Answering Dataset and the documents which answered the questions a document ranking task was formulated. There are 3.2 million documents and the goal is to rank based on their relevance. Relevance labels are derived from what passages was marked as having the answer in the QnA dataset."

docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
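
Documents can also be looked up by ID without scanning the whole collection. A minimal sketch, assuming the docs_store() lookup helper; the ID is simply taken from the iterator rather than hard-coded:

import ir_datasets
dataset = ir_datasets.load('msmarco-document')
docs_store = dataset.docs_store()                  # builds/uses a local lookup index
some_id = next(iter(dataset.docs_iter())).doc_id   # any valid doc_id
doc = docs_store.get(some_id)                      # random access by doc_id
doc.title, doc.url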
Citation
bibtex: @inproceedings{Bajaj2016MSMA, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-document/dev"

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
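
A typical use of the scoreddocs is re-ranking: for each query, re-score its 100 Indri QL candidates with your own model and sort by the new score. A minimal sketch, where score_fn is a hypothetical placeholder for any scoring model:

import ir_datasets
dataset = ir_datasets.load('msmarco-document/dev')
docs_store = dataset.docs_store()
queries = {q.query_id: q.text for q in dataset.queries_iter()}

def score_fn(query_text, doc):
    return 0.0  # hypothetical stand-in; plug in your own model here

reranked = {}
for sdoc in dataset.scoreddocs_iter():
    doc = docs_store.get(sdoc.doc_id)
    score = score_fn(queries[sdoc.query_id], doc)
    reranked.setdefault(sdoc.query_id, []).append((score, sdoc.doc_id))
for qid in reranked:
    reranked[qid].sort(reverse=True)  # best new score first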

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/dev')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/dev')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
1     Labeled by crowd worker as relevant

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/dev')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/dev')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

"msmarco-document/eval"

Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
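
Since the judgments are hidden, systems are scored by submitting a run over these queries. A minimal sketch that simply re-orders the provided Indri QL candidates and writes them out in TREC run format (the output path and the 'my-run' tag are placeholders; check the leaderboard instructions for the exact submission format it expects):

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load('msmarco-document/eval')

runs = defaultdict(list)
for sdoc in dataset.scoreddocs_iter():
    runs[sdoc.query_id].append((sdoc.score, sdoc.doc_id))

with open('my_run.txt', 'wt') as f:
    for qid, scored in runs.items():
        scored.sort(reverse=True)  # highest score first
        for rank, (score, doc_id) in enumerate(scored, start=1):
            f.write(f'{qid} Q0 {doc_id} {rank} {score} my-run\n')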

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/eval')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/eval')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/eval')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

"msmarco-document/orcas"

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

  • Queries: From query log
  • Relevance Data: User clicks
  • Scored docs: Indri Query Likelihood model
  • Dataset Paper
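
Because the relevance signal is raw user clicks, a common first step is to aggregate the qrels into per-query click sets, e.g. for use as weak supervision. A minimal sketch (the full mapping covers roughly 10 million queries, so in practice you may prefer to stream or sample rather than hold it all in memory):

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load('msmarco-document/orcas')

clicks = defaultdict(set)  # query_id -> clicked doc_ids
for qrel in dataset.qrels_iter():
    clicks[qrel.query_id].add(qrel.doc_id)

# e.g. how many queries have more than one distinct clicked document
multi_click = sum(1 for docs in clicks.values() if len(docs) > 1)
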
queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/orcas')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/orcas')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
1     User click

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/orcas')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/orcas')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
Citation
bibtex: @article{craswell2020orcas, title={ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search}, author={Craswell, Nick and Campos, Daniel and Mitra, Bhaskar and Yilmaz, Emine and Billerbeck, Bodo}, journal={arXiv preprint arXiv:2006.05324}, year={2020} }

"msmarco-document/train"

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
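
For training, the single positive judgment per query is commonly paired with negatives sampled from the Indri QL candidates. A minimal sketch of building (query_id, positive doc_id, negative doc_id) triples under that assumption (the candidate lists are large, so a streaming or sampled approach may be preferable in practice):

import ir_datasets
import random
from collections import defaultdict

dataset = ir_datasets.load('msmarco-document/train')

positives = {qrel.query_id: qrel.doc_id for qrel in dataset.qrels_iter()}

candidates = defaultdict(list)
for sdoc in dataset.scoreddocs_iter():
    candidates[sdoc.query_id].append(sdoc.doc_id)

triples = []
for qid, pos_doc in positives.items():
    negs = [d for d in candidates.get(qid, []) if d != pos_doc]
    if negs:
        triples.append((qid, pos_doc, random.choice(negs)))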

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
1     Labeled by crowd worker as relevant

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/train')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

"msmarco-document/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-document/trec-dl-2019/judged).
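
Since only some of the queries received NIST judgments, evaluation is usually restricted to the judged subset. A minimal sketch of recovering that subset directly from the qrels (equivalent in spirit to what msmarco-document/trec-dl-2019/judged provides):

import ir_datasets

dataset = ir_datasets.load('msmarco-document/trec-dl-2019')

judged_qids = {qrel.query_id for qrel in dataset.qrels_iter()}
judged_queries = [q for q in dataset.queries_iter() if q.query_id in judged_qids]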

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Irrelevant: Document does not provide any useful information about the query.
1     Relevant: Document provides some information relevant to the query, which may be minimal.
2     Highly relevant: The content of this document provides substantial information on the query.
3     Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
Citation
bibtex: @inproceedings{Craswell2020OverviewOT, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }

"msmarco-document/trec-dl-2019/judged"

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019/judged')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019/judged')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Irrelevant: Document does not provide any useful information about the query.
1     Relevant: Document provides some information relevant to the query, which may be minimal.
2     Highly relevant: The content of this document provides substantial information on the query.
3     Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019/judged')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2019/judged')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

"msmarco-document/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-document/trec-dl-2020/judged).
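
Runs on this test collection are typically scored against the graded qrels with standard TREC metrics. A minimal sketch using the separate ir_measures package, which accepts the qrels iterator directly; the run below simply reuses the Indri QL scoreddocs as a stand-in for your own system's output:

import ir_datasets
import ir_measures
from ir_measures import nDCG, RR

dataset = ir_datasets.load('msmarco-document/trec-dl-2020')

run = {}  # query_id -> {doc_id: score}
for sdoc in dataset.scoreddocs_iter():
    run.setdefault(sdoc.query_id, {})[sdoc.doc_id] = sdoc.score

print(ir_measures.calc_aggregate([nDCG@10, RR], dataset.qrels_iter(), run))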

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Irrelevant: Document does not provide any useful information about the query.
1     Relevant: Document provides some information relevant to the query, which may be minimal.
2     Highly relevant: The content of this document provides substantial information on the query.
3     Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
Citation
bibtex: @inproceedings{Craswell2021OverviewOT, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }

"msmarco-document/trec-dl-2020/judged"

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020/judged')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
docs

Language: en

Document type:
MsMarcoDocument: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. body: str

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020/judged')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Irrelevant: Document does not provide any useful information about the query.
1     Relevant: Document provides some information relevant to the query, which may be minimal.
2     Highly relevant: The content of this document provides substantial information on the query.
3     Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020/judged')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Example

import ir_datasets
dataset = ir_datasets.load('msmarco-document/trec-dl-2020/judged')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>