ir_datasets: MSMARCO (document)

import ir_datasets
dataset = ir_datasets.load("msmarco-document")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  }
}

`"msmarco-document/anchor-text"`

For version 1 of MS MARCO, the anchor text collection enriches 1,703,834 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for each document that had more than 1,000 anchor texts; with this sampling, all anchor text is retained for 94% of the documents. The text field contains the anchor texts concatenated, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.
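As a rough illustration of the record layout described here, the sketch below caps a list of anchor texts at 1,000 by random sampling and derives the `text` field by concatenation. The names `AnchorTextDoc` and `build_anchor_record`, the fixed seed, the uniform sampling, and the space separator are assumptions made for the example; the source does not say how the sampling or concatenation was implemented.

```python
import random
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical stand-in that mirrors the documented fields."""
    doc_id: str
    text: str
    anchors: List[str]


def build_anchor_record(doc_id: str, anchors: List[str],
                        cap: int = 1000, seed: int = 0) -> AnchorTextDoc:
    # Keep every anchor text when the document has at most `cap` of them;
    # otherwise draw a uniform random sample of size `cap` (assumed strategy).
    if len(anchors) > cap:
        anchors = random.Random(seed).sample(anchors, cap)
    # Assumption: the concatenated `text` field joins the anchors with a space.
    return AnchorTextDoc(doc_id, " ".join(anchors), list(anchors))


# Made-up documents: one below the cap, one above it.
small = build_anchor_record("D1", ["home page", "click here"])
large = build_anchor_record("D2", [f"anchor {i}" for i in range(2500)])
print(len(small.anchors), len(large.anchors))
```

Documents below the cap keep all of their anchors, which corresponds to the 94% of documents for which nothing is dropped.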

docs

1.7M docs

Language: en

Document type:

MsMarcoAnchorTextDocument: (namedtuple)

doc_id: str
text: str
anchors: List[str]

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/anchor-text")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, anchors>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/anchor-text docs



[doc_id]    [text]    [anchors]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/anchor-text')
# Index msmarco-document/anchor-text
indexer = pt.IterDictIndexer('./indices/msmarco-document_anchor-text')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

\cite{Froebe2022Anchors}

Bibtex:

@inproceedings{Froebe2022Anchors, address = {Berlin Heidelberg New York}, author = {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen}, booktitle = {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}, month = apr, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Stavanger, Norway}, title = {{The Power of Anchor Text in the Neural Retrieval Era}}, year = 2022 }

{
  "docs": {
    "count": 1703834,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  }
}

`"msmarco-document/dev"`

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results per query from the Indri query likelihood (QL) model. They are used in the "re-ranking" setting.

Official evaluation measures: RR@10
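Because every query in this subset has exactly one positive judgment, RR@10 reduces to the reciprocal of the rank at which that single document appears, or zero if it is not in the top 10. The following is a minimal sketch on made-up IDs; the function names are hypothetical and this is not the official evaluation script.

```python
from typing import Dict, List


def reciprocal_rank_at_k(ranking: List[str], relevant: str, k: int = 10) -> float:
    # Reciprocal of the 1-based rank of the relevant document,
    # or 0.0 if it does not appear within the top k results.
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id == relevant:
            return 1.0 / rank
    return 0.0


def mean_rr_at_k(runs: Dict[str, List[str]], qrels: Dict[str, str], k: int = 10) -> float:
    # Average over all judged queries; a query with no ranking scores 0.
    scores = [reciprocal_rank_at_k(runs.get(qid, []), rel, k)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (made-up IDs): one positive document per query.
qrels = {"q1": "D3", "q2": "D9"}
runs = {"q1": ["D1", "D3", "D5"], "q2": ["D2", "D4"]}
print(mean_rr_at_k(runs, qrels))  # (1/2 + 0) / 2 = 0.25
```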

5.2K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/dev docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

5.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Document contains a passage labeled as relevant in msmarco-passage	`5.2K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

519K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/dev scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 5193
  },
  "qrels": {
    "count": 5193,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 5193
        }
      }
    }
  },
  "scoreddocs": {
    "count": 519300
  }
}

`"msmarco-document/eval"`

Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 100 results per query from the Indri query likelihood (QL) model. They are used in the "re-ranking" setting.

Official evaluation measures: RR@10
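In the re-ranking setting, a system reorders the provided candidates for each query instead of searching the full collection. A minimal sketch of preparing those candidate pools from `(query_id, doc_id, score)` records might look as follows; the local `ScoredDoc` tuple, the helper name, and the IDs and scores are hypothetical stand-ins for what `scoreddocs_iter()` yields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ScoredDoc(NamedTuple):
    query_id: str
    doc_id: str
    score: float


def candidates_by_query(scoreddocs: Iterable[ScoredDoc]) -> Dict[str, List[ScoredDoc]]:
    # Collect the first-stage candidates of each query into one list.
    grouped = defaultdict(list)
    for sd in scoreddocs:
        grouped[sd.query_id].append(sd)
    # Order each pool by descending first-stage score.
    for pool in grouped.values():
        pool.sort(key=lambda sd: sd.score, reverse=True)
    return dict(grouped)


# Made-up records; real scores and IDs will differ.
toy = [
    ScoredDoc("q1", "D2", -6.1),
    ScoredDoc("q1", "D1", -5.3),
    ScoredDoc("q2", "D7", -4.0),
]
pools = candidates_by_query(toy)
print([sd.doc_id for sd in pools["q1"]])  # ['D1', 'D2']
```

A re-ranker would then rescore each pool and emit the reordered list for submission, since the judgments for this subset are not available locally.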

5.8K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/eval queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/eval docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/eval')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

579K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/eval scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/eval')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 5793
  },
  "scoreddocs": {
    "count": 579300
  }
}

`"msmarco-document/orcas"`

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

Queries: From query log
Relevance Data: User clicks
Scored docs: Indri Query Likelihood model
Dataset Paper

Official evaluation measures: RR, nDCG

10M queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/orcas queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/orcas docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

19M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	User click	`19M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/orcas qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR, nDCG]
)

You can find more details about PyTerrier experiments here.

983M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/orcas scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Craswell2020Orcas}

Bibtex:

@article{Craswell2020Orcas, title={ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search}, author={Craswell, Nick and Campos, Daniel and Mitra, Bhaskar and Yilmaz, Emine and Billerbeck, Bodo}, journal={arXiv preprint arXiv:2006.05324}, year={2020} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 10405342
  },
  "qrels": {
    "count": 18823602,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 18823602
        }
      }
    }
  },
  "scoreddocs": {
    "count": 982951086
  }
}
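The counts in the metadata block above can be turned into per-query averages with simple arithmetic. The short sketch below copies the three counts by hand and makes no call to any library; the variable names are illustrative only.

```python
# Counts copied from the metadata block for msmarco-document/orcas.
orcas = {"queries": 10405342, "qrels": 18823602, "scoreddocs": 982951086}

# Each qrel is one clicked (query, document) pair.
clicks_per_query = orcas["qrels"] / orcas["queries"]
# Each scoreddoc is one first-stage (query, document, score) record.
scored_per_query = orcas["scoreddocs"] / orcas["queries"]

print(f"{clicks_per_query:.2f} clicked documents per query on average")
print(f"{scored_per_query:.1f} scored documents per query on average")
```

On average each query has fewer than two clicked documents, and the average number of scored documents per query stays below 100.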

`"msmarco-document/train"`

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results per query from the Indri query likelihood (QL) model. They are used in the "re-ranking" setting.

Official evaluation measures: RR@10
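The statement that every query has exactly one positive judgment can be checked mechanically once queries and qrels are loaded. The sketch below does so on made-up records; the local `Qrel` tuple mirrors the documented fields, and the helper name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def queries_without_single_positive(query_ids: Iterable[str],
                                    qrels: Iterable[Qrel]) -> List[str]:
    # Count positive judgments per query, then report every query
    # whose count differs from exactly one.
    positives = Counter(q.query_id for q in qrels if q.relevance > 0)
    return [qid for qid in query_ids if positives[qid] != 1]


# Toy data: q1 is well-formed, q2 has two positives, q3 has none.
toy_queries = ["q1", "q2", "q3"]
toy_qrels = [
    Qrel("q1", "D1", 1, "0"),
    Qrel("q2", "D2", 1, "0"),
    Qrel("q2", "D3", 1, "0"),
]
print(queries_without_single_positive(toy_queries, toy_qrels))  # ['q2', 'q3']
```

On the real subset the same check would be fed from `queries_iter()` and `qrels_iter()`, and an empty result would confirm the property.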

367K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/train docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

367K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Document contains a passage labeled as relevant in msmarco-passage	`367K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

37M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/train scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 367013
  },
  "qrels": {
    "count": 367013,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 367013
        }
      }
    }
  },
  "scoreddocs": {
    "count": 36701116
  }
}
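The counts above allow a quick consistency check against the "top 100" description of the scoreddocs. The sketch below only multiplies and subtracts the published counts; reading the gap as "some queries have fewer than 100 candidates" is an inference, since the metadata does not say which queries are affected.

```python
# Counts copied from the metadata block for msmarco-document/train.
train = {"queries": 367013, "scoreddocs": 36701116}
TOP_K = 100  # candidates per query according to the description

# If every query had a full list, there would be queries * 100 records.
upper_bound = train["queries"] * TOP_K
shortfall = upper_bound - train["scoreddocs"]
print(upper_bound, shortfall)  # 36701300 184
```

Code that consumes the scoreddocs should therefore not assume that every query has exactly 100 candidates.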

`"msmarco-document/trec-dl-2019"`

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document/trec-dl-2019/judged).

Shared Task Paper

Official evaluation measures: nDCG@10, RR, MAP

200 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

16K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`9.7K`	59.4%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`4.6K`	28.3%
2	Highly relevant: The content of this document provides substantial information on the query.	`1.1K`	7.1%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`841`	5.2%
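With four graded levels, nDCG@10 rewards placing higher-graded documents nearer the top. The sketch below uses the raw relevance level as the gain and a log2 rank discount; both choices are assumptions for illustration, and evaluation tools may use a different gain function. All IDs and judgments are made up.

```python
import math
from typing import Dict, List


def dcg(gains: List[int]) -> float:
    # Discounted cumulative gain: linear gain, log2(rank + 1) discount.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    # Unjudged documents are treated as level 0 (an assumption of this sketch).
    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments for one query on the 0-3 scale.
judgments = {"D1": 3, "D2": 2, "D3": 0, "D4": 1}
best = ndcg_at_k(["D1", "D2", "D4", "D3"], judgments)   # ideal ordering
worst = ndcg_at_k(["D3", "D4", "D2", "D1"], judgments)  # reversed ordering
print(best, worst)
```

The ideal ordering scores 1.0 by construction, while the reversed ordering scores strictly lower.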

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR, MAP]
)

You can find more details about PyTerrier experiments here.

20K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Craswell2019TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 200
  },
  "qrels": {
    "count": 16258,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 9661,
          "2": 1149,
          "1": 4607,
          "3": 841
        }
      }
    }
  },
  "scoreddocs": {
    "count": 20000
  }
}
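The percentages in the relevance-level table follow directly from `counts_by_value` in the metadata above. A short worked check, with the counts copied by hand:

```python
# counts_by_value copied from the metadata block for trec-dl-2019.
counts_by_value = {"0": 9661, "2": 1149, "1": 4607, "3": 841}

total = sum(counts_by_value.values())
# Share of each relevance level, in percent, rounded to one decimal place.
shares = {level: round(100 * n / total, 1)
          for level, n in sorted(counts_by_value.items())}
print(total, shares)
```

The total matches the qrels count of 16258, and the shares reproduce the 59.4%, 28.3%, 7.1%, and 5.2% listed for levels 0 to 3.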

`"msmarco-document/trec-dl-2019/judged"`

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.
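The filtering that defines this subset can be pictured as keeping only the queries whose ID occurs in at least one qrel. The sketch below shows that idea on made-up tuples; the helper name `judged_queries` is hypothetical, and the library's own implementation may differ.

```python
from typing import Iterable, List, Tuple

Query = Tuple[str, str]           # (query_id, text)
Judgment = Tuple[str, str, int]   # (query_id, doc_id, relevance)


def judged_queries(queries: Iterable[Query],
                   qrels: Iterable[Judgment]) -> List[Query]:
    # Collect the IDs of all queries that have at least one judgment.
    judged_ids = {query_id for query_id, _doc_id, _rel in qrels}
    # Keep the original query order while dropping unjudged queries.
    return [(qid, text) for qid, text in queries if qid in judged_ids]


# Toy data with made-up IDs and texts: only q1 and q3 were judged.
toy_queries = [("q1", "first query"), ("q2", "second query"), ("q3", "third query")]
toy_qrels = [("q1", "D1", 2), ("q3", "D5", 0), ("q3", "D6", 3)]
print(judged_queries(toy_queries, toy_qrels))
```

A query counts as judged even when all of its judgments are level 0, as with a query that only has irrelevant documents.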

Official evaluation measures: nDCG@10, RR, MAP

43 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019/judged queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019/judged docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

16K qrels

Inherits qrels from msmarco-document/trec-dl-2019

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`9.7K`	59.4%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`4.6K`	28.3%
2	Highly relevant: The content of this document provides substantial information on the query.	`1.1K`	7.1%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`841`	5.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR, MAP]
)

You can find more details about PyTerrier experiments here.

4.3K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2019/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Craswell2019TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 43
  },
  "qrels": {
    "count": 16258,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 9661,
          "2": 1149,
          "1": 4607,
          "3": 841
        }
      }
    }
  },
  "scoreddocs": {
    "count": 4300
  }
}

`"msmarco-document/trec-dl-2020"`

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (the filtered list is available in msmarco-document/trec-dl-2020/judged).

Shared Task Paper

Official evaluation measures: nDCG@10, RR, MAP
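RR and MAP are binary measures, so graded judgments have to be mapped to relevant or not relevant before they can be computed. The sketch below computes average precision for one query with a configurable cut-off `min_rel`; the default of 1 is an assumption of this example, because this page does not state which levels count as relevant. IDs and judgments are made up.

```python
from typing import Dict, List


def average_precision(ranking: List[str], judgments: Dict[str, int],
                      min_rel: int = 1) -> float:
    # Binarize: a document is relevant if its level is at least `min_rel`.
    relevant = {doc_id for doc_id, rel in judgments.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Normalize by all relevant documents, retrieved or not.
    return precision_sum / len(relevant)


# Toy judgments for one query on the 0-3 scale.
judgments = {"D1": 3, "D2": 0, "D3": 1}
ranking = ["D1", "D2", "D3"]
ap_lenient = average_precision(ranking, judgments, min_rel=1)
ap_strict = average_precision(ranking, judgments, min_rel=2)
print(ap_lenient, ap_strict)
```

MAP is the mean of this value over all judged queries. The two cut-offs give different scores for the same ranking, so the threshold should be reported alongside any result.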

200 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

9.1K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`7.3K`	80.6%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`1.2K`	13.0%
2	Highly relevant: The content of this document provides substantial information on the query.	`315`	3.5%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`265`	2.9%
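
The percentages in the table follow from the exact per-level counts listed in this dataset's metadata (7331, 1187, 315, and 265 out of 9098 qrels). A minimal sketch of that arithmetic, with the counts hard-coded here rather than loaded through ir_datasets:

```python
# Exact counts per relevance level, as listed in this dataset's metadata
# (the table rounds them, e.g. 7331 is shown as 7.3K).
counts_by_value = {0: 7331, 1: 1187, 2: 315, 3: 265}

total = sum(counts_by_value.values())

# Share of each relevance level, formatted with one decimal place.
percentages = {
    level: f"{100 * count / total:.1f}%"
    for level, count in sorted(counts_by_value.items())
}

for level, share in percentages.items():
    print(level, counts_by_value[level], share)
```

The same calculation applies to the other relevance tables on this page once their counts are substituted.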

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR, MAP]
)

You can find more details about PyTerrier experiments here.
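
To make the first official measure concrete, here is a rough pure-Python sketch of nDCG@10 for a single query. It assumes the graded relevance value (0 to 3) is used directly as the gain with a log2 rank discount, and that unjudged documents count as relevance 0; evaluation tools may differ on these choices. The function names and the toy judgments are hypothetical and are not part of ir_datasets or PyTerrier.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc_ids in the order the system returned them.
    qrels_for_query: dict mapping doc_id -> graded relevance.
    Unjudged documents are treated as relevance 0 (an assumption of this sketch).
    """
    gains = [qrels_for_query.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels_for_query.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and rankings for a single query.
toy_qrels = {"D1": 3, "D2": 0, "D3": 2, "D4": 1}
perfect = ndcg_at_k(["D1", "D3", "D4", "D2"], toy_qrels)
reversed_run = ndcg_at_k(["D2", "D4", "D3", "D1"], toy_qrels)
```

In practice the per-query values would be averaged over all judged queries, which is what an experiment framework reports.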

scoreddocs

20K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
dataset.get_results()

You can find more details about PyTerrier dataset API here.
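
With 20,000 scored documents for 200 queries, there are on average 100 scored documents per query. The sketch below shows one way to group such records by query and order them by descending score, for example before writing a TREC-style run file. It uses a stand-in namedtuple and made-up records instead of the real `scoreddocs_iter()` output, and the helper names `rank_scoreddocs` and `to_trec_run_lines` are illustrative only.

```python
from collections import defaultdict, namedtuple

# Stand-in for GenericScoredDoc; the fields match the documented namedtuple.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])

def rank_scoreddocs(scoreddocs):
    """Group scored docs by query and sort each group by descending score."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        qid: sorted(docs, key=lambda sd: sd.score, reverse=True)
        for qid, docs in by_query.items()
    }

def to_trec_run_lines(rankings, run_name="scoreddocs"):
    """Format rankings as TREC run lines: query_id Q0 doc_id rank score run_name."""
    lines = []
    for qid, docs in rankings.items():
        for rank, sd in enumerate(docs, start=1):
            lines.append(f"{qid} Q0 {sd.doc_id} {rank} {sd.score} {run_name}")
    return lines

# Hypothetical records; real ones would come from dataset.scoreddocs_iter().
toy = [
    ScoredDoc("q1", "D10", 1.5),
    ScoredDoc("q2", "D30", 0.2),
    ScoredDoc("q1", "D20", 2.5),
]
rankings = rank_scoreddocs(toy)
run_lines = to_trec_run_lines(rankings)
```

Because `rank_scoreddocs` accepts any iterable of objects with `query_id`, `doc_id`, and `score` attributes, the iterator returned by ir_datasets can be passed in directly.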

\cite{Craswell2020TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 200
  },
  "qrels": {
    "count": 9098,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 7331,
          "3": 265,
          "1": 1187,
          "2": 315
        }
      }
    }
  },
  "scoreddocs": {
    "count": 20000
  }
}

`"msmarco-document/trec-dl-2020/judged"`

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.
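
Conceptually, this subset can be reproduced by keeping only the queries whose IDs appear in the qrels. The sketch below illustrates that filter on made-up data with stand-in namedtuples; it is not how ir_datasets builds the subset internally, and `judged_queries` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-ins for the documented GenericQuery and TrecQrel namedtuples.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def judged_queries(queries, qrels):
    """Keep only the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [q for q in queries if q.query_id in judged_ids]

# Hypothetical data: three queries, two of which have judgments.
toy_queries = [Query("1", "first"), Query("2", "second"), Query("3", "third")]
toy_qrels = [
    Qrel("1", "D100", 2, "0"),
    Qrel("3", "D200", 0, "0"),
    Qrel("3", "D300", 1, "0"),
]

kept = judged_queries(toy_queries, toy_qrels)
```

Applied to `queries_iter()` and `qrels_iter()` of msmarco-document/trec-dl-2020, the same filter should, going by the counts listed on this page, keep 45 of the 200 queries.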

Official evaluation measures: nDCG@10, RR, MAP

45 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020/judged queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020/judged docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

9.1K qrels

Inherits qrels from msmarco-document/trec-dl-2020

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`7.3K`	80.6%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`1.2K`	13.0%
2	Highly relevant: The content of this document provides substantial information on the query.	`315`	3.5%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`265`	2.9%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR, MAP]
)

You can find more details about PyTerrier experiments here.

scoreddocs

4.5K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-2020/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

\cite{Craswell2020TrecDl,Bajaj2016Msmarco}

Bibtex:

@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 45
  },
  "qrels": {
    "count": 9098,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 7331,
          "3": 265,
          "1": 1187,
          "2": 315
        }
      }
    }
  },
  "scoreddocs": {
    "count": 4500
  }
}

`"msmarco-document/trec-dl-hard"`

A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.

data website
See Also: msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

50 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

8.5K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`5.2K`	60.7%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`2.5K`	29.5%
2	Highly relevant: The content of this document provides substantial information on the query.	`485`	5.7%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`356`	4.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.
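
The `rel=2` parameter means that only documents judged at level 2 or higher count as relevant when computing reciprocal rank. A minimal pure-Python sketch of that idea for a single query follows; the function name and the toy data are hypothetical, and unjudged documents are assumed to be non-relevant.

```python
def reciprocal_rank(ranked_doc_ids, qrels_for_query, rel=2):
    """Return 1/rank of the first document judged at least `rel`, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        # Unjudged documents default to relevance 0 (an assumption of this sketch).
        if qrels_for_query.get(doc_id, 0) >= rel:
            return 1.0 / rank
    return 0.0

# Hypothetical judgments and ranking for one query.
toy_qrels = {"D1": 1, "D2": 2, "D3": 3}
toy_ranking = ["D1", "D9", "D2", "D3"]

rr_rel2 = reciprocal_rank(toy_ranking, toy_qrels, rel=2)  # first hit is D2 at rank 3
rr_rel1 = reciprocal_rank(toy_ranking, toy_qrels, rel=1)  # first hit is D1 at rank 1
```

Raising the threshold from 1 to 2 lowers the score in this toy example because the top-ranked document is only marginally relevant.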

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 8544,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 5186,
          "2": 485,
          "1": 2517,
          "3": 356
        }
      }
    }
  }
}
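
The metadata above can be checked for internal consistency and used to derive a few summary figures. The sketch below embeds a trimmed copy of the JSON (only the `queries` and `qrels` parts) and is meant as an illustration of reading the structure, not as an ir_datasets API. Note that the relevance levels are JSON string keys and need converting to integers before comparison.

```python
import json

# Trimmed copy of the metadata for msmarco-document/trec-dl-hard.
metadata = json.loads("""
{
  "queries": {"count": 50},
  "qrels": {
    "count": 8544,
    "fields": {
      "relevance": {
        "counts_by_value": {"0": 5186, "2": 485, "1": 2517, "3": 356}
      }
    }
  }
}
""")

qrels_meta = metadata["qrels"]
raw_counts = qrels_meta["fields"]["relevance"]["counts_by_value"]
counts = {int(level): n for level, n in raw_counts.items()}

# The per-level counts should add up to the overall qrels count.
total_matches = sum(counts.values()) == qrels_meta["count"]

# Judgments that count as relevant under a threshold of 2, as in RR(rel=2).
relevant_at_2 = sum(n for level, n in counts.items() if level >= 2)

# Average number of judgments per query.
judgments_per_query = qrels_meta["count"] / metadata["queries"]["count"]
```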

`"msmarco-document/trec-dl-hard/fold1"`

Fold 1 of msmarco-document/trec-dl-hard
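
The five folds (fold1 through fold5, 10 queries each) partition the 50 queries of msmarco-document/trec-dl-hard. One plausible use, assumed here rather than stated on this page, is five-fold cross-validation. The sketch below only builds the dataset IDs for each split; `leave_one_fold_out` is a hypothetical helper, and each ID could then be passed to `ir_datasets.load`.

```python
BASE = "msmarco-document/trec-dl-hard"
FOLDS = [f"{BASE}/fold{i}" for i in range(1, 6)]

def leave_one_fold_out(folds):
    """Yield (train_ids, test_id) pairs, holding out each fold exactly once."""
    for held_out in folds:
        train = [fold for fold in folds if fold != held_out]
        yield train, held_out

splits = list(leave_one_fold_out(FOLDS))

for train_ids, test_id in splits:
    print("test:", test_id, "| train:", ", ".join(train_ids))
```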

Official evaluation measures: nDCG@10, RR(rel=2)

10 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold1 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold1 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

1.6K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`1.0K`	67.3%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`328`	21.1%
2	Highly relevant: The content of this document provides substantial information on the query.	`75`	4.8%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`106`	6.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold1 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 1557,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 328,
          "0": 1048,
          "3": 106,
          "2": 75
        }
      }
    }
  }
}

`"msmarco-document/trec-dl-hard/fold2"`

Fold 2 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

10 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold2 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold2 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

1.3K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`922`	68.6%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`304`	22.6%
2	Highly relevant: The content of this document provides substantial information on the query.	`78`	5.8%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`41`	3.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold2 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 1345,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 78,
          "1": 304,
          "0": 922,
          "3": 41
        }
      }
    }
  }
}

`"msmarco-document/trec-dl-hard/fold3"`

Fold 3 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

10 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold3 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold3 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

474 qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`333`	70.3%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`65`	13.7%
2	Highly relevant: The content of this document provides substantial information on the query.	`44`	9.3%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`32`	6.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold3 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 474,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 333,
          "2": 44,
          "3": 32,
          "1": 65
        }
      }
    }
  }
}

`"msmarco-document/trec-dl-hard/fold4"`

Fold 4 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

10 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold4 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold4 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

1.1K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query	`718`	68.1%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	`258`	24.5%
2	Highly relevant: The content of this document provides substantial information on the query.	`34`	3.2%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	`44`	4.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold4 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 3213835,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "D"
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 1054,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 258,
          "2": 34,
          "0": 718,
          "3": 44
        }
      }
    }
  }
}

`"msmarco-document/trec-dl-hard/fold5"`

Fold 5 of msmarco-document/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

10 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold5 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

3.2M docs

Inherits docs from msmarco-document

Language: en

Document type:

MsMarcoDocument: (namedtuple)

doc_id: str
url: str
title: str
body: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold5 docs



[doc_id]    [url]    [title]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])

You can find more details about PyTerrier indexing here.

qrels

4.1K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: Document does not provide any useful information about the query.	2.2K	52.6%
1	Relevant: Document provides some information relevant to the query, which may be minimal.	1.6K	38.0%
2	Highly relevant: The content of this document provides substantial information on the query.	254	6.2%
3	Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine.	133	3.2%
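The per-level counts and percentages in the table above can be recomputed from any iterable of qrels. The following is a minimal sketch, assuming only that each record exposes a `relevance` attribute as GenericQrel does. The helper `relevance_distribution` and the stand-in `Qrel` tuple are hypothetical and exist only for illustration.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as the documented GenericQrel (illustration only).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])

def relevance_distribution(qrels):
    """Return {relevance level: (count, percent of all qrels)}."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100.0 * n / total, 1))
        for level, n in sorted(counts.items())
    }

# Toy data; with the real collection, pass dataset.qrels_iter() instead.
toy = [
    Qrel("q1", "D1", 0),
    Qrel("q1", "D2", 1),
    Qrel("q1", "D3", 0),
    Qrel("q2", "D4", 3),
]
print(relevance_distribution(toy))
```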

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-document/trec-dl-hard/fold5 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.
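To make the two official measures concrete, here is a minimal pure-Python sketch of what they compute for a single query. It is not the implementation PyTerrier uses. It assumes that RR(rel=2) counts only documents judged 2 or higher as relevant, and that nDCG uses the relevance level itself as the gain with a log2(rank + 1) discount. The function names are hypothetical.

```python
import math

def reciprocal_rank(ranking, judgments, min_rel=2):
    """1/rank of the first document judged >= min_rel, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranking, start=1):
        if judgments.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranking, judgments, k=10):
    """nDCG@k with linear gain; unjudged documents count as relevance 0."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments for one query (doc_id -> relevance) and a system ranking.
judgments = {"D1": 3, "D2": 1, "D3": 2, "D4": 0}
ranking = ["D2", "D3", "D1", "D9"]

print(reciprocal_rank(ranking, judgments))  # first doc with rel >= 2 is at rank 2
print(ndcg_at_k(ranking, judgments))
```

Averaging these per-query values over all queries in the fold gives the figures that an experiment reports.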

\cite{Mackie2021DlHard,Bajaj2016Msmarco}

Bibtex:

@article{Mackie2021DlHard,
  title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset},
  author={Iain Mackie and Jeffrey Dalton and Andrew Yates},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.07975}
}

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}