Github: datasets/msmarco_passage_v2.py

ir_datasets: MSMARCO (passage, version 2)

Index
  1. msmarco-passage-v2
  2. msmarco-passage-v2/dev1
  3. msmarco-passage-v2/dev2
  4. msmarco-passage-v2/train
  5. msmarco-passage-v2/trec-dl-2021
  6. msmarco-passage-v2/trec-dl-2021/judged

"msmarco-passage-v2"

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

  • Version 1 of dataset: msmarco-passage
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper

Change Log

  • On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have changed results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.
Provides: docs
138M docs

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.
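The spans field records where each passage sits inside its source document (the one named by msmarco_document_id). A minimal sketch of how such offsets can be used, assuming the spans are (start, stop) character ranges into the document's body text; the namedtuple stand-in, document body, and offsets below are made-up sample data, not real corpus content:

```python
from collections import namedtuple

# Illustrative stand-in for ir_datasets' MsMarcoV2Passage namedtuple;
# the document body and offsets below are fabricated sample data.
MsMarcoV2Passage = namedtuple(
    "MsMarcoV2Passage", ["doc_id", "text", "spans", "msmarco_document_id"]
)

doc_body = "The quick brown fox. It jumps over the lazy dog."
passage = MsMarcoV2Passage(
    doc_id="msmarco_passage_00_0",
    text="It jumps over the lazy dog.",
    spans=((21, 48),),
    msmarco_document_id="msmarco_doc_00_0",
)

# Treating each (start, stop) span as a character range into the source
# document's body, slicing and joining reconstructs the passage text.
reconstructed = " ".join(doc_body[start:stop] for start, stop in passage.spans)
print(reconstructed)  # It jumps over the lazy dog.
```

This also shows why the passage corpus can be "linked up" with msmarco-document-v2: each passage carries both the parent document ID and its location within that document.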


"msmarco-passage-v2/dev1"

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
3.9K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
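The official measure for the dev sets is RR@10: the reciprocal rank of the first relevant passage within the top 10, averaged over queries. A minimal sketch of the computation on made-up ranked lists and qrels (the query and doc IDs are fabricated for illustration):

```python
def rr_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant doc within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Made-up run and qrels for two queries; real ones come from the dataset.
runs = {
    "q1": ["d3", "d7", "d1"],  # first relevant doc at rank 2 -> RR 0.5
    "q2": ["d9", "d8", "d2"],  # no relevant doc in top 10    -> RR 0.0
}
qrels = {"q1": {"d7"}, "q2": {"d4"}}

mrr = sum(rr_at_k(runs[q], qrels[q]) for q in runs) / len(runs)
print(mrr)  # 0.25
```

In practice the official numbers come from the task's evaluation tooling; this sketch only illustrates the measure itself.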


"msmarco-passage-v2/dev2"

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
4.3K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage-v2/train"

Official train set with 277,144 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is identified.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
277K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
477 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
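The TREC DL measures use graded relevance labels (nDCG@10 above; the rel=2 threshold binarizes labels for P@10 and RR). A minimal sketch of nDCG@10 under one common formulation (linear gains, log2 discount); the graded labels below are fabricated for illustration, and the official numbers come from the task's evaluation tooling:

```python
import math

def dcg_at_k(gains, k=10):
    # gains are graded relevance labels in ranked order; log2 rank discount
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    # normalize by the DCG of the ideal (descending-label) ordering
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Made-up graded labels (0-3 scale, as in TREC DL) for one query.
ranked_gains = [3, 0, 2]  # labels of docs in the order the system ranked them
all_gains = [3, 2, 0, 0]  # labels of every judged doc for the query
print(round(ndcg_at_k(ranked_gains, all_gains, k=10), 4))
```

A perfect ranking of the judged docs would score 1.0; here the relevant doc labeled 2 is discounted at rank 3 instead of rank 2, so the score falls just below 1.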


"msmarco-passage-v2/trec-dl-2021/judged"

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Note that at this time, these qrels are only available to those with TREC active participant login credentials.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

Provides: queries, docs, qrels, scoreddocs
53 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.