ir_datasets: MSMARCO (passage)
A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
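If you need random access to passages by ID rather than a full scan, ir_datasets also exposes a docs_store. A minimal sketch (the doc_id "7" is an arbitrary example):
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
docstore = dataset.docs_store() # builds/opens a local lookup structure
doc = docstore.get("7") # fetch a single passage by doc_id
print(doc.text)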
Official dev set.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting (see the example after the snippet below). Note that these are sub-sampled to about 1/8 of the total available dev queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
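To iterate over the re-ranking candidates mentioned above, use scoreddocs. A minimal sketch:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>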
Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
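Since every query in this subset has at least one judgment, it pairs naturally with the qrels. A minimal sketch (for this collection the qrels follow the TREC format):
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>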
Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
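To score a run against this subset with the official RR@10 measure, the companion ir_measures package can consume the qrels directly. A minimal sketch, assuming run.txt is a TREC-format run file you have produced (the path is hypothetical):
import ir_datasets
import ir_measures
from ir_measures import RR
dataset = ir_datasets.load("msmarco-passage/dev/small")
run = ir_measures.read_trec_run("run.txt") # hypothetical run file
print(ir_measures.calc_aggregate([RR@10], dataset.qrels_iter(), run))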
Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available eval queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Official train set.
Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
docpairs provides access to the "official" sequence for pairwise training (see the example after the snippet below).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
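The pairwise training sequence mentioned above is available through docpairs; each item pairs a query with one more-relevant and one less-relevant passage. A minimal sketch:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>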
Subset of msmarco-passage/train that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set (see msmarco-passage/train/split200-valid). This split has been used in various works.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Subset of msmarco-passage/train containing only the 200 queries held out as a small validation set. This split has been used in various works.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
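Because the NIST judgments here are graded, deep measures such as nDCG@10 are commonly reported for the DL tracks. A minimal sketch with ir_measures, again assuming run.txt is a TREC-format run file you have produced (the path is hypothetical):
import ir_datasets
import ir_measures
from ir_measures import nDCG
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
run = ir_measures.read_trec_run("run.txt") # hypothetical run file
print(ir_measures.calc_aggregate([nDCG@10], dataset.qrels_iter(), run))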
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Fold 1 of msmarco-passage/trec-dl-hard.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Fold 2 of msmarco-passage/trec-dl-hard.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Fold 3 of msmarco-passage/trec-dl-hard.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Fold 4 of msmarco-passage/trec-dl-hard.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
Fold 5 of msmarco-passage/trec-dl-hard.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.