Github: datasets/msmarco_document_v2.py

ir_datasets: MSMARCO (document, version 2)

Index
  1. msmarco-document-v2
  2. msmarco-document-v2/anchor-text
  3. msmarco-document-v2/dev1
  4. msmarco-document-v2/dev2
  5. msmarco-document-v2/train
  6. msmarco-document-v2/trec-dl-2019
  7. msmarco-document-v2/trec-dl-2019/judged
  8. msmarco-document-v2/trec-dl-2020
  9. msmarco-document-v2/trec-dl-2020/judged
  10. msmarco-document-v2/trec-dl-2021
  11. msmarco-document-v2/trec-dl-2021/judged

"msmarco-document-v2"

Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).

  • Version 1 of dataset: msmarco-document
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper
docs
12M docs

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/anchor-text"

For version 2 of MS MARCO, the anchor text collection enriches 4,821,244 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for each document with more than 1,000 anchor texts; as a result, all anchor text is included for 97% of the documents. The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.
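
The relationship between the two fields and the 1,000-anchor cap can be illustrated with a small sketch. This is not the code used to build the collection: uniform random sampling, the fixed seed, the single-space separator, and the names `build_anchor_doc` and `MAX_ANCHORS` are all assumptions made for illustration, and the namedtuple is a local stand-in for the type listed below.

```python
import random
from collections import namedtuple

# Local stand-in mirroring the document type of this dataset.
MsMarcoV2AnchorTextDocument = namedtuple(
    "MsMarcoV2AnchorTextDocument", ["doc_id", "text", "anchors"]
)

MAX_ANCHORS = 1000  # cap stated in the description above


def build_anchor_doc(doc_id, anchor_texts, seed=0, separator=" "):
    """Cap the anchor list at MAX_ANCHORS and derive the concatenated text field.

    Assumptions: uniform sampling without replacement and a space separator.
    """
    anchors = list(anchor_texts)
    if len(anchors) > MAX_ANCHORS:
        anchors = random.Random(seed).sample(anchors, MAX_ANCHORS)
    return MsMarcoV2AnchorTextDocument(doc_id, separator.join(anchors), anchors)


small = build_anchor_doc("doc-0", ["home page", "click here"])
large = build_anchor_doc("doc-1", [f"anchor {i}" for i in range(1500)])
print(len(small.anchors), len(large.anchors))
```

Documents at or below the cap keep every anchor text unchanged, which is the case the description reports for 97% of the documents.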

docs
4.8M docs

Language: en

Document type:
MsMarcoV2AnchorTextDocument: (namedtuple)
  1. doc_id: str
  2. text: str
  3. anchors: List[str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/anchor-text")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, anchors>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/anchor-text docs
[doc_id]    [text]    [anchors]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/anchor-text')
# Index msmarco-document-v2/anchor-text
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2_anchor-text', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Froebe2022Anchors}

Bibtex:

@inproceedings{Froebe2022Anchors,
  address = {Berlin Heidelberg New York},
  author = {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
  booktitle = {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  title = {{The Power of Anchor Text in the Neural Retrieval Era}},
  year = 2022
}

"msmarco-document-v2/dev1"

Official dev1 set with 4,552 queries.

queries
4.6K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev1 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev1')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
4.7K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
1    Document contains a passage labeled as relevant in msmarco-passage    4.7K    100.0%
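
Because every judgment here has relevance 1, qrels are typically consumed as a per-query lookup of relevant documents. A minimal sketch of building such a lookup from an iterable of qrels follows. The `TrecQrel` namedtuple is a local stand-in mirroring the type above, and `qrels_to_dict` and the sample records are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the qrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_dict(qrels):
    """Group qrels as {query_id: {doc_id: relevance}}."""
    lookup = defaultdict(dict)
    for qrel in qrels:
        lookup[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(lookup)


# Made-up records for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]
print(qrels_to_dict(sample_qrels))
```

The same function accepts the records produced by `dataset.qrels_iter()`, since it only relies on the four attribute names.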

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev1 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
455K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev1 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.
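
The exported TSV can be read back into records with a few lines of Python. This sketch assumes one record per line with three tab-separated fields in the order shown above; `parse_scoreddocs_tsv` and the sample lines are hypothetical, and the namedtuple is a local stand-in for `GenericScoredDoc`.

```python
from collections import namedtuple

# Local stand-in mirroring the scored document type listed above.
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def parse_scoreddocs_tsv(lines):
    """Yield one GenericScoredDoc per non-empty, tab-separated line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        query_id, doc_id, score = line.split("\t")
        yield GenericScoredDoc(query_id, doc_id, float(score))


# Made-up lines in the assumed layout.
sample_lines = ["q1\td1\t12.5\n", "\n", "q1\td2\t11.0\n"]
parsed = list(parse_scoreddocs_tsv(sample_lines))
print(parsed)
```

Passing an open file object instead of a list works the same way, because the function only iterates over lines.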

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev1')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/dev2"

Official dev2 set with 5,000 queries.

queries
5.0K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev2 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev2')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
5.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
1    Document contains a passage labeled as relevant in msmarco-passage    5.2K    100.0%
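
With binary judgments like these, a rank-based measure such as mean reciprocal rank is a natural fit. The sketch below is one plain implementation, not the official evaluation script: the function names, the default cutoff `k=100`, and the choice to average over all queries that have qrels are assumptions.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids, k=100):
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels, k=100):
    """Average reciprocal rank over the queries in qrels.

    run:   {query_id: [doc_id, ...]} ordered from best to worst
    qrels: {query_id: {doc_id: relevance}}
    Queries missing from the run contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        total += reciprocal_rank(run.get(query_id, []), relevant, k)
    return total / len(qrels)


# Made-up run and judgments for illustration only.
sample_run = {"q1": ["d3", "d1"], "q2": ["d9"]}
sample_judgments = {"q1": {"d1": 1}, "q2": {"d5": 1}}
print(mean_reciprocal_rank(sample_run, sample_judgments))
```

In the sample, the relevant document for `q1` is at rank 2 and `q2` retrieves nothing relevant, so the two reciprocal ranks being averaged are 0.5 and 0.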

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
500K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float
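
Scored documents arrive as a flat stream of (query, document, score) records, so a re-ranking setup usually groups them by query first. The following sketch keeps the `k` highest-scoring candidates per query. It assumes that a higher score means a better match, and `top_k_per_query`, the sample records, and the local `GenericScoredDoc` stand-in are hypothetical.

```python
import heapq
from collections import defaultdict, namedtuple

# Local stand-in mirroring the scored document type listed above.
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Return {query_id: [scoreddoc, ...]} with at most k entries, best score first."""
    by_query = defaultdict(list)
    for scoreddoc in scoreddocs:
        by_query[scoreddoc.query_id].append(scoreddoc)
    return {
        query_id: heapq.nlargest(k, docs, key=lambda sd: sd.score)
        for query_id, docs in by_query.items()
    }


# Made-up records for illustration only.
sample_scoreddocs = [
    GenericScoredDoc("q1", "d1", 3.0),
    GenericScoredDoc("q1", "d2", 7.5),
    GenericScoredDoc("q1", "d3", 5.0),
    GenericScoredDoc("q2", "d4", 1.0),
]
top2 = top_k_per_query(sample_scoreddocs, k=2)
print([sd.doc_id for sd in top2["q1"]])
```

This version holds all records in memory; for large inputs a bounded heap per query would be a better fit.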

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/dev2 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/dev2')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/train"

Official train set with 322,196 queries.

queries
322K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/train docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/train')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
332K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
1    Document contains a passage labeled as relevant in msmarco-passage    332K    100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
32M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float
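
One common use of training scoreddocs, though not something this dataset prescribes, is to pair each judged positive with highly scored documents that have no positive judgment. The sketch below shows that pattern on in-memory lists. `build_training_triples`, `negatives_per_query`, the sample records, and both namedtuples are hypothetical stand-ins, and an unjudged document is only assumed to be non-relevant.

```python
from collections import defaultdict, namedtuple

# Local stand-ins mirroring the qrel and scored document types of this dataset.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def build_training_triples(qrels, scoreddocs, negatives_per_query=1):
    """Return (query_id, positive_doc_id, negative_doc_id) triples.

    Negatives are the highest-scored candidates without a positive judgment;
    treating them as non-relevant is an assumption, since they are unjudged.
    """
    positives = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            positives[qrel.query_id].add(qrel.doc_id)

    candidates = defaultdict(list)
    for scoreddoc in scoreddocs:
        if scoreddoc.doc_id not in positives.get(scoreddoc.query_id, ()):
            candidates[scoreddoc.query_id].append(scoreddoc)

    triples = []
    for query_id in sorted(positives):
        ranked = sorted(candidates.get(query_id, []), key=lambda sd: sd.score, reverse=True)
        for positive_id in sorted(positives[query_id]):
            for negative in ranked[:negatives_per_query]:
                triples.append((query_id, positive_id, negative.doc_id))
    return triples


# Made-up records for illustration only.
sample_qrels = [TrecQrel("q1", "d1", 1, "0")]
sample_scoreddocs = [
    GenericScoredDoc("q1", "d1", 9.0),
    GenericScoredDoc("q1", "d2", 8.0),
    GenericScoredDoc("q1", "d3", 7.0),
]
triples = build_training_triples(sample_qrels, sample_scoreddocs)
print(triples)
```

The sketch loads everything into memory, which is impractical for the full 32M records; a streaming variant would process one query at a time.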

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2019/judged).

queries
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
14K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    8.2K    59.0%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    4.0K    28.4%
2    Highly relevant: The content of this document provides substantial information on the query.    1.0K    7.2%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    745    5.3%
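
Graded labels like these are what measures such as nDCG are designed for. As an illustration of how the four levels enter the computation, here is a minimal nDCG sketch. It assumes linear gain (the relevance level itself), a log2 rank discount, and a gain of 0 for unjudged documents; evaluation tools may use other conventions, and the function names and sample data are hypothetical.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of a list of relevance levels, cut off at k."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc IDs ordered from best to worst
    judged:         {doc_id: relevance level 0..3}; unjudged docs count as 0
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judged.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments for illustration only.
sample_judged = {"a": 3, "b": 2, "c": 0}
print(ndcg(["a", "b", "c"], sample_judged))
print(ndcg(["c", "a", "b"], sample_judged))
```

The first call ranks the documents in ideal order; the second places an irrelevant document first, so every relevant document is discounted more heavily.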

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Craswell2019TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2019TrecDl,
  title={Overview of the TREC 2019 deep learning track},
  author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees},
  booktitle={TREC 2019},
  year={2019}
}
@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/trec-dl-2019/judged"

Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.
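
The filtering that defines a judged subset can be sketched in a few lines: keep only the queries whose ID appears in at least one qrel. This illustrates the idea rather than the library's implementation; `judged_queries`, the sample records, and both namedtuples are hypothetical stand-ins.

```python
from collections import namedtuple

# Local stand-ins mirroring the query and qrel types of this dataset.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Return the queries that have at least one qrel, in their original order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up records for illustration only.
sample_queries = [
    GenericQuery("q1", "first example query"),
    GenericQuery("q2", "second example query"),
    GenericQuery("q3", "third example query"),
]
sample_qrels = [TrecQrel("q1", "d1", 2, "0"), TrecQrel("q3", "d7", 0, "0")]
print([query.query_id for query in judged_queries(sample_queries, sample_qrels)])
```

A query counts as judged even when all of its qrels have relevance 0, as with `q3` in the sample.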

queries
43 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019/judged docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019/judged')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
14K qrels

Inherits qrels from msmarco-document-v2/trec-dl-2019

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    8.2K    59.0%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    4.0K    28.4%
2    Highly relevant: The content of this document provides substantial information on the query.    1.0K    7.2%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    745    5.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2019/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Craswell2019TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2019TrecDl,
  title={Overview of the TREC 2019 deep learning track},
  author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees},
  booktitle={TREC 2019},
  year={2019}
}
@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2020/judged).

queries
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
7.9K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    6.4K    80.2%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    1.1K    13.3%
2    Highly relevant: The content of this document provides substantial information on the query.    279    3.5%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    233    2.9%
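
A count and percentage breakdown like the one above can be recomputed from any iterable of qrels. The following is a minimal sketch; `relevance_distribution`, the sample records, and the local `TrecQrel` stand-in are hypothetical, and the sample does not reproduce the real figures.

```python
from collections import Counter, namedtuple

# Local stand-in mirroring the qrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance level: (count, percentage of all qrels)}, sorted by level."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


# Made-up records for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 0, "0"),
    TrecQrel("q2", "d4", 1, "0"),
]
print(relevance_distribution(sample_qrels))
```

Levels that never occur are simply absent from the result, and an empty input yields an empty dictionary.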

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Craswell2020TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2020TrecDl,
  title={Overview of the TREC 2020 deep learning track},
  author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos},
  booktitle={TREC},
  year={2020}
}
@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

"msmarco-document-v2/trec-dl-2020/judged"

Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.

queries
45 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020/judged docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020/judged')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
7.9K qrels

Inherits qrels from msmarco-document-v2/trec-dl-2020

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    6.4K    80.2%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    1.1K    13.3%
2    Highly relevant: The content of this document provides substantial information on the query.    279    3.5%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    233    2.9%
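
A distribution like the one in the table can be tallied directly from the qrels. The following is a minimal sketch, not the method used to produce the table: the TrecQrel namedtuple is a local stand-in with the documented fields, and the sample records and the helper name relevance_distribution are hypothetical.

```python
from collections import Counter, namedtuple

# Local stand-in with the same fields as the documented TrecQrel type.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance: (count, percent_of_total)} for an iterable of qrels."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {rel: (n, round(100.0 * n / total, 1)) for rel, n in sorted(counts.items())}


# Hypothetical records for illustration only (not taken from the dataset).
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 1, "0"),
    TrecQrel("q2", "d4", 3, "0"),
]
distribution = relevance_distribution(sample_qrels)
```

With the real data, dataset.qrels_iter() can be passed in place of sample_qrels.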

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2020/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Craswell2020TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-document-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P@10, RR(rel=2)
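
In the measure list above, RR(rel=2) is generally read as reciprocal rank where only documents labeled 2 or higher count as relevant. A minimal sketch of that thresholding follows. It is an illustration rather than the evaluation tool's implementation: the function name, the sample ranking and labels are hypothetical, and treating unjudged documents as relevance 0 is an assumption.

```python
def reciprocal_rank(ranked_doc_ids, relevance_by_doc, min_rel=1):
    """Reciprocal rank of the first document labeled >= min_rel (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        # Assumption: documents without a judgment count as relevance 0.
        if relevance_by_doc.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking and labels for a single query.
ranking = ["d1", "d2", "d3", "d4"]
labels = {"d1": 1, "d2": 0, "d3": 2}

rr_any = reciprocal_rank(ranking, labels, min_rel=1)   # first doc with label >= 1 is at rank 1
rr_rel2 = reciprocal_rank(ranking, labels, min_rel=2)  # first doc with label >= 2 is at rank 3
```

The stricter threshold skips the level-1 document at rank 1, so the two values differ for the same ranking.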

queries
477 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021 docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
13K qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    4.9K    37.2%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    4.2K    32.0%
2    Highly relevant: The content of this document provides substantial information on the query.    2.8K    21.2%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    1.3K    9.6%
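
Graded labels like these feed gain-based measures such as nDCG@10. The sketch below shows one common formulation, assuming linear gain (the label itself) and a log2 rank discount. The exact variant used by a given evaluation tool may differ, and the function names and sample labels are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at 0-based position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k with linear gains; unjudged documents are assumed to have gain 0."""
    gains = [relevance_by_doc.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(relevance_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels for one query, on the 0-3 scale from the table.
labels = {"d1": 3, "d2": 2, "d3": 0}

best = ndcg_at_k(["d1", "d2", "d3"], labels)   # documents in ideal order
worse = ndcg_at_k(["d3", "d2", "d1"], labels)  # irrelevant document ranked first
```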

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

scoreddocs
48K scoreddocs

Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021')
dataset.get_results()

You can find more details about PyTerrier dataset API here.
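
Scored documents are often consumed per query, for example as a candidate list for reranking. The sketch below groups records by query_id and keeps the top k by score. It assumes that a higher score means a better match; GenericScoredDoc is a local stand-in with the documented fields, and the helper name and sample records are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same fields as the documented GenericScoredDoc type.
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=100):
    """Map each query_id to its k highest-scoring doc_ids, best first."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        qid: [sd.doc_id for sd in sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]]
        for qid, docs in by_query.items()
    }


# Hypothetical records for illustration only.
sample = [
    GenericScoredDoc("q1", "d1", 1.5),
    GenericScoredDoc("q1", "d2", 3.0),
    GenericScoredDoc("q2", "d3", 0.2),
]
candidates = top_k_per_query(sample, k=1)
```

With the real data, dataset.scoreddocs_iter() can be passed in place of sample.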


"msmarco-document-v2/trec-dl-2021/judged"

msmarco-document-v2/trec-dl-2021, but filtered down to the 57 queries with qrels.

Note that at this time, this is only available to those with TREC active participant login credentials.
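
The filtering described above amounts to keeping only the queries whose query_id appears in at least one qrel. A minimal sketch of that idea follows. It is not the library's implementation: the namedtuples are local stand-ins with the documented fields, and the helper name and sample records are hypothetical.

```python
from collections import namedtuple

# Local stand-ins with the same fields as the documented types.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Keep only the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Hypothetical records for illustration only.
queries = [
    GenericQuery("q1", "first example question"),
    GenericQuery("q2", "second example question"),
    GenericQuery("q3", "third example question"),
]
qrels = [
    TrecQrel("q1", "d1", 2, "0"),
    TrecQrel("q3", "d7", 0, "0"),
]
kept_ids = [query.query_id for query in judged_queries(queries, qrels)]
```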

Official evaluation measures: AP@100, nDCG@10, P@10, RR(rel=2)

queries
57 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
12M docs

Inherits docs from msmarco-document-v2

Language: en

Document type:
MsMarcoV2Document: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. headings: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, headings, body>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021/judged docs
[doc_id]    [url]    [title]    [headings]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021/judged')
# Index msmarco-document-v2
indexer = pt.IterDictIndexer('./indices/msmarco-document-v2', meta={"docno": 25})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'headings', 'body'])

You can find more details about PyTerrier indexing here.

qrels
13K qrels

Inherits qrels from msmarco-document-v2/trec-dl-2021

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Irrelevant: Document does not provide any useful information about the query    4.9K    37.2%
1    Relevant: Document provides some information relevant to the query, which may be minimal.    4.2K    32.0%
2    Highly relevant: The content of this document provides substantial information on the query.    2.8K    21.2%
3    Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.    1.3K    9.6%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

scoreddocs
5.7K scoreddocs

Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-document-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-document-v2/trec-dl-2021/judged scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document-v2/trec-dl-2021/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.
