ir_datasets: MSMARCO (passage)
A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
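For intuition, this kind of mojibake typically arises when UTF-8 bytes are decoded as Latin-1; re-encoding as Latin-1 and decoding as UTF-8 recovers the intended characters. A minimal illustration with a hypothetical string (ir_datasets already applies the correction to the collection for you, so no action is needed):
mojibake = "cafÃ©"  # hypothetical double-encoded form of "café"
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # café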
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
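In addition to sequential iteration, ir_datasets provides random access to passages by ID through a docs_store; a minimal sketch (passage ID "0" is used purely as an example):
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
docstore = dataset.docs_store()
doc = docstore.get("0")  # look up a single passage by its doc_id
doc.text
docs = docstore.get_many(["0", "1", "2"])  # mapping of doc_id to passage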
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } } }
Official dev set.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available dev queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
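The re-ranking setting described above can be reproduced by grouping the provided scoreddocs by query and re-scoring each candidate passage with your own model; a minimal sketch, where score_fn stands in for a hypothetical scorer you would supply:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

def score_fn(query_text, passage_text):
    return 0.0  # placeholder; replace with your own model

run = {}  # query_id -> list of (doc_id, score)
for sd in dataset.scoreddocs_iter():
    score = score_fn(queries[sd.query_id], docstore.get(sd.doc_id).text)
    run.setdefault(sd.query_id, []).append((sd.doc_id, score))
for query_id in run:
    run[query_id].sort(key=lambda pair: pair[1], reverse=True)  # best-first per query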
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 101093 }, "qrels": { "count": 59273, "fields": { "relevance": { "counts_by_value": { "1": 59273 } } } } }
Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
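The /judged view can be reproduced manually by keeping only the queries whose IDs appear in the qrels; a minimal sketch of that filter:
import ir_datasets
full = ir_datasets.load("msmarco-passage/dev")
judged_qids = {qrel.query_id for qrel in full.qrels_iter()}
judged_queries = [q for q in full.queries_iter() if q.query_id in judged_qids]
len(judged_queries)  # should match the query count of msmarco-passage/dev/judged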
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/dev
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 55578 }, "qrels": { "count": 59273, "fields": { "relevance": { "counts_by_value": { "1": 59273 } } } } }
Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 7.4K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 6980 }, "qrels": { "count": 7437, "fields": { "relevance": { "counts_by_value": { "1": 7437 } } } }, "scoreddocs": { "count": 6668967 } }
Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available eval queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
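Because the judgments are hidden, eval queries are scored by submitting a run to the leaderboard. The passage ranking leaderboard has historically accepted tab-separated runs of the form query_id, passage_id, rank (best first); a minimal sketch of writing such a file, using placeholder IDs and without any claim about the current submission requirements, which should be checked against the leaderboard instructions:
def write_msmarco_run(run, path):
    # run: dict mapping query_id to a list of doc_ids ordered best-first
    with open(path, "wt") as f:
        for query_id, doc_ids in run.items():
            for rank, doc_id in enumerate(doc_ids, start=1):
                f.write(f"{query_id}\t{doc_id}\t{rank}\n")

write_msmarco_run({"1": ["0", "1", "2"]}, "eval_run.tsv")  # placeholder IDs for illustration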
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 101092 } }
Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 6837 }, "scoreddocs": { "count": 6515736 } }
Official train set.
Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available train queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
docpairs provides access to the "official" sequence for pairwise training.
Official evaluation measures: RR@10
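The docpairs can be combined with the queries and the docs_store to materialize (query, positive passage, negative passage) text triples for pairwise training; in the official triples the first passage ID is the positive and the second the negative. A minimal sketch that only peeks at the first few pairs (the training loop itself is omitted):
import itertools
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

for pair in itertools.islice(dataset.docpairs_iter(), 5):
    query_text = queries[pair.query_id]
    pos_text = docstore.get(pair.doc_id_a).text  # doc_id_a: positive passage
    neg_text = docstore.get(pair.doc_id_b).text  # doc_id_b: negative passage
    # feed (query_text, pos_text, neg_text) to a pairwise loss here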
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 269919004 } }
Subset of msmarco-passage/train that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Inherits docpairs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 502939 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 269919004 } }
Subset of msmarco-passage/train that only includes queries containing a layman or expert medical term. Note that about 20% of the matches are false positives due to terms with multiple senses.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 55K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{MacAvaney2020MedMarco, author = {MacAvaney, Sean and Cohan, Arman and Goharian, Nazli}, title = {SLEDGE-Zero: A Zero-Shot Baseline for COVID-19 Literature Search}, booktitle = {EMNLP}, year = {2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 78895 }, "qrels": { "count": 54627, "fields": { "relevance": { "counts_by_value": { "1": 54627 } } } }, "scoreddocs": { "count": 48852277 }, "docpairs": { "count": 28969254 } }
Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set (msmarco-passage/train/split200-valid). This split has been used in various works.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808531 }, "qrels": { "count": 532630, "fields": { "relevance": { "counts_by_value": { "1": 532630 } } } }, "scoreddocs": { "count": 477883382 }, "docpairs": { "count": 269854839 } }
Subset of msmarco-passage/train containing only the 200 held-out queries, intended for use as a small validation set. This split has been used in various works.
Official evaluation measures: RR@10
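A common use of this split is to track MRR@10 on the 200 held-out queries during training; a minimal sketch of computing MRR@10 from a run, where the run dict is a hypothetical output of your own system:
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
relevant = {}  # query_id -> set of relevant doc_ids
for qrel in dataset.qrels_iter():
    relevant.setdefault(qrel.query_id, set()).add(qrel.doc_id)

def mrr_at_10(run):
    # run: dict mapping query_id to a list of doc_ids ordered best-first
    total = 0.0
    for query_id, doc_ids in run.items():
        for rank, doc_id in enumerate(doc_ids[:10], start=1):
            if doc_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break
    return total / len(run) if run else 0.0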
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 131 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 131, "fields": { "relevance": { "counts_by_value": { "1": 131 } } } }, "scoreddocs": { "count": 119011 }, "docpairs": { "count": 64165 } }
Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).
Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.
Official evaluation measures: RR@10
Inherits queries from msmarco-passage/train
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Inherits scoreddocs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 39780811 } }
Version of msmarco-passage/train, but with version 2 of the triples file.
This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).
Note that this file is sorted by the IDs it contains, so you will likely want to shuffle it before use. We opened an issue suggesting that a shuffled third version of the file be provided, so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.
Official evaluation measures: RR@10
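Because the v2 triples are sorted by ID, a shuffle is advisable before training. Materializing all ~398 million pairs in memory is impractical, so the sketch below only shuffles a prefix sample with a fixed seed; note that a prefix of an ID-sorted file is itself biased toward low IDs, and a proper shuffle requires an offline pass over the whole file:
import itertools
import random
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
sample = list(itertools.islice(dataset.docpairs_iter(), 1_000_000))  # prefix sample only
random.Random(42).shuffle(sample)  # fixed seed for a reproducible order
sample[0]  # namedtuple<query_id, doc_id_a, doc_id_b>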
Inherits queries from msmarco-passage/train
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Inherits scoreddocs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 397768673 } }
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
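For these graded judgments, the track's measures binarize relevance at level 2 for RR and AP (as also shown in the PyTerrier example below); the same evaluation can be run directly with the ir_measures package against a TREC-format run file. A minimal sketch, assuming a hypothetical run file run.txt:
import ir_datasets
import ir_measures
from ir_measures import nDCG, RR, AP

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
run = ir_measures.read_trec_run("run.txt")  # hypothetical run file in TREC format
ir_measures.calc_aggregate([nDCG@10, RR(rel=2), AP(rel=2)], dataset.qrels_iter(), run)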
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 9260, "fields": { "relevance": { "counts_by_value": { "0": 5158, "1": 1601, "2": 1804, "3": 697 } } } }, "scoreddocs": { "count": 189877 } }
Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2019
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 43 }, "qrels": { "count": 9260, "fields": { "relevance": { "counts_by_value": { "0": 5158, "1": 1601, "2": 1804, "3": 697 } } } }, "scoreddocs": { "count": 41042 } }
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
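For re-ranking or neural experiments it is often useful to keep the passage text retrievable from the index. A minimal variant of the indexing example above that also stores text as metadata (the index path and meta field lengths here are assumptions; adjust to your setup):
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# store docno and passage text as metadata so they can be fetched at retrieval time
indexer = pt.IterDictIndexer('./indices/msmarco-passage-with-text', meta={'docno': 20, 'text': 4096})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])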
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 11386, "fields": { "relevance": { "counts_by_value": { "2": 1020, "3": 646, "0": 7780, "1": 1940 } } } }, "scoreddocs": { "count": 190699 } }
Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2020
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 54 }, "qrels": { "count": 11386, "fields": { "relevance": { "counts_by_value": { "2": 1020, "3": 646, "0": 7780, "1": 1940 } } } }, "scoreddocs": { "count": 50024 } }
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
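Since DL-HARD draws its 50 queries from the TREC DL 2019/2020 query pools, the overlap can be checked directly; a minimal sketch:
import ir_datasets
hard = ir_datasets.load("msmarco-passage/trec-dl-hard")
dl19 = ir_datasets.load("msmarco-passage/trec-dl-2019")
dl20 = ir_datasets.load("msmarco-passage/trec-dl-2020")
hard_qids = {q.query_id for q in hard.queries_iter()}
pool_qids = {q.query_id for q in dl19.queries_iter()} | {q.query_id for q in dl20.queries_iter()}
len(hard_qids), len(hard_qids & pool_qids)  # the DL-HARD query ids should appear in the 2019/2020 pools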
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 2.5K | 57.8% |
1 | Related: The passage seems related to the query but does not answer it. | 810 | 19.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 634 | 14.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 350 | 8.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 4256, "fields": { "relevance": { "counts_by_value": { "0": 2462, "1": 810, "2": 634, "3": 350 } } } } }
Fold 1 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 582 | 54.3% |
1 | Related: The passage seems related to the query but does not answer it. | 197 | 18.4% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 181 | 16.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 112 | 10.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 1072, "fields": { "relevance": { "counts_by_value": { "0": 582, "1": 197, "2": 181, "3": 112 } } } } }
Fold 2 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 611 | 68.0% |
1 | Related: The passage seems related to the query but does not answer it. | 151 | 16.8% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 99 | 11.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 37 | 4.1% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 898, "fields": { "relevance": { "counts_by_value": { "3": 37, "2": 99, "0": 611, "1": 151 } } } } }
Fold 3 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 342 | 77.0% |
1 | Related: The passage seems related to the query but does not answer it. | 43 | 9.7% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 36 | 8.1% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 23 | 5.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 444, "fields": { "relevance": { "counts_by_value": { "0": 342, "1": 43, "2": 36, "3": 23 } } } } }
Fold 4 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 396 | 55.3% |
1 | Related: The passage seems related to the query but does not answer it. | 137 | 19.1% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 151 | 21.1% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 32 | 4.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 716, "fields": { "relevance": { "counts_by_value": { "3": 32, "2": 151, "1": 137, "0": 396 } } } } }
Fold 5 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 531 | 47.2% |
1 | Related: The passage seems related to the query but does not answer it. | 282 | 25.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 167 | 14.8% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 146 | 13.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 1126, "fields": { "relevance": { "counts_by_value": { "3": 146, "1": 282, "0": 531, "2": 167 } } } } }
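The five folds above partition the 50 DL-HARD queries into groups of 10, which makes per-fold evaluation (and cross-validation across folds) straightforward. A minimal per-fold sketch with ir_measures, where run_for is a placeholder for your own retrieval system returning {query_id: {doc_id: score}}:
import ir_datasets
import ir_measures
from ir_measures import nDCG, RR
def run_for(queries):
    return {q.query_id: {} for q in queries}  # placeholder: substitute your system's results
for i in range(1, 6):
    fold = ir_datasets.load(f"msmarco-passage/trec-dl-hard/fold{i}")
    run = run_for(list(fold.queries_iter()))
    print(f"fold{i}", ir_measures.calc_aggregate([nDCG@10, RR(rel=2)], fold.qrels_iter(), run))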