ir_datasets: MSMARCO (passage, version 2)

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, booktitle={CoCo@NIPS}, year={2016} }

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  }
}
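
The `max_len` of 28 recorded for `doc_id` matches the `meta={"docno": 28}` setting in the PyTerrier indexing example, so the metadata can presumably be used to size that field. A minimal sketch of reading the value, with the JSON above pasted in as a literal; the helper names `docno_meta_length` and `has_common_prefix` are hypothetical and not part of ir_datasets:

```python
import json

# Metadata literal copied from the JSON block above.
METADATA = json.loads("""
{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  }
}
""")


def docno_meta_length(metadata):
    """Return the longest doc_id length recorded in the metadata (hypothetical helper)."""
    return metadata["docs"]["fields"]["doc_id"]["max_len"]


def has_common_prefix(doc_id, metadata):
    """Check whether a doc_id starts with the recorded common prefix (hypothetical helper)."""
    return doc_id.startswith(metadata["docs"]["fields"]["doc_id"]["common_prefix"])


# Build the meta argument used when reserving space for docno in an index.
meta = {"docno": docno_meta_length(METADATA)}
print(meta)
```

Deriving the length from the metadata rather than hard-coding it keeps the index setting in step with the corpus if the identifiers ever change.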

`"msmarco-passage-v2/dev1"`

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed: labels from msmarco-passage are mapped to documents via URL, those documents are re-passaged, and the best approximate match is then identified.

Official evaluation measures: RR@10
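
As an illustration of what RR@10 measures, here is a minimal pure-Python sketch: the score for a query is the reciprocal of the rank of the first relevant passage within the top 10, or 0 if none appears. The function names and toy ids are made up for this example, and averaging over the queries that have qrels is an assumption; for real experiments a library such as `ir_measures` is the safer choice.

```python
def reciprocal_rank_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Return 1/rank of the first relevant doc within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank_at_k(run, qrels, k=10):
    """Average RR@k over the queries that have qrels.

    run:   query_id -> list of doc_ids, best first
    qrels: query_id -> set of relevant doc_ids
    """
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(run.get(query_id, []), relevant, k)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy data; the ids are invented for illustration.
toy_run = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}
toy_qrels = {"q1": {"d7"}, "q2": {"d5"}}
print(mean_reciprocal_rank_at_k(toy_run, toy_qrels))  # 0.25
```

Here `q1` contributes 1/2 (first relevant passage at rank 2) and `q2` contributes 0, giving a mean of 0.25.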

3.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

4.0K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`4.0K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
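
The exported qrels can be read back into Python with a few lines. This is a minimal sketch that assumes the `--format tsv` output is tab-separated in the column order shown above; `Qrel` is a local stand-in for `TrecQrel`, and `parse_qrels_tsv` is a hypothetical helper, not part of ir_datasets.

```python
from collections import namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def parse_qrels_tsv(lines):
    """Yield Qrel records from tab-separated lines, skipping blank ones."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, doc_id, relevance, iteration = line.split("\t")
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Made-up rows in the assumed column order.
sample = [
    "q1\tmsmarco_passage_00_123\t1\t0\n",
    "\n",
    "q2\tmsmarco_passage_01_456\t1\t0\n",
]
for qrel in parse_qrels_tsv(sample):
    print(qrel)
```

In practice `lines` would be an open file handle over the saved export rather than an in-memory list.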

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

390K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float
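
Scored docs arrive as a flat stream of (query_id, doc_id, score) records, so a common first step is to group them into a ranked list per query. The sketch below does this on toy data; `ScoredDoc` is a local stand-in for `GenericScoredDoc`, `top_k_by_query` is a hypothetical helper, and treating a higher score as better is an assumption.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the GenericScoredDoc fields listed above.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_by_query(scoreddocs, k=10):
    """Group scored docs by query and keep the k highest-scoring doc_ids.

    Assumes that a higher score means a better match.
    """
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        query_id: [
            sd.doc_id
            for sd in sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]
        ]
        for query_id, docs in by_query.items()
    }


# Toy records; ids and scores are invented for illustration.
toy = [
    ScoredDoc("q1", "d1", 3.2),
    ScoredDoc("q1", "d2", 7.5),
    ScoredDoc("q2", "d9", 1.0),
    ScoredDoc("q1", "d3", 5.1),
]
print(top_k_by_query(toy, k=2))  # {'q1': ['d2', 'd3'], 'q2': ['d9']}
```

With the real data, the iterable passed in would be `dataset.scoreddocs_iter()`. The dev1 count of 390,300 equals 3,903 queries times 100, which suggests 100 scored passages per query.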

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 3903
  },
  "qrels": {
    "count": 4009,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4009
        }
      }
    }
  },
  "scoreddocs": {
    "count": 390300
  }
}

`"msmarco-passage-v2/dev2"`

Official dev2 set with 4,281 queries.

Official evaluation measures: RR@10

4.3K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

4.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`4.4K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

428K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 4281
  },
  "qrels": {
    "count": 4411,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4411
        }
      }
    }
  },
  "scoreddocs": {
    "count": 428100
  }
}

`"msmarco-passage-v2/train"`

Official train set with 277,144 queries.

Official evaluation measures: RR@10

277K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

284K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`284K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

28M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 277144
  },
  "qrels": {
    "count": 284212,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 284212
        }
      }
    }
  },
  "scoreddocs": {
    "count": 27713673
  }
}

`"msmarco-passage-v2/trec-dl-2021"`

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

477 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

11K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`4.3K`	40.1%
1	Related: The passage seems related to the query but does not answer it.	`3.1K`	28.3%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`2.3K`	21.6%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.1K`	10.0%
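
Two of the official measures, P(rel=2)@10 and RR(rel=2), binarize these graded labels: only passages judged at level 2 or above count as relevant. A minimal sketch of that thresholding for precision at k, using toy judgments; the function name and ids are made up for this example.

```python
def precision_at_k(ranked_doc_ids, qrels_for_query, k=10, min_rel=2):
    """Fraction of the top k whose graded relevance is at least min_rel.

    qrels_for_query maps doc_id -> graded relevance (0 to 3);
    unjudged docs are treated as not relevant.
    """
    top = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top if qrels_for_query.get(doc_id, 0) >= min_rel)
    return hits / k


# Toy judgments on the 0-3 scale from the table above.
toy_qrels = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
toy_ranking = ["d1", "d2", "d3", "d4", "d5"]

print(precision_at_k(toy_ranking, toy_qrels, k=4, min_rel=2))  # 0.5
print(precision_at_k(toy_ranking, toy_qrels, k=4, min_rel=1))  # 0.75
```

Lowering the threshold to 1 also counts the "Related" passage `d2`, which is why the second value is higher.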

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

48K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 477
  },
  "qrels": {
    "count": 10828,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 4338,
          "3": 1086,
          "1": 3063,
          "2": 2341
        }
      }
    }
  },
  "scoreddocs": {
    "count": 47700
  }
}
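
The percentages in the relevance-level table can be recomputed from `counts_by_value` in this metadata. A short sketch, with the counts copied from the JSON above and a hypothetical helper name:

```python
# Counts copied from "counts_by_value" in the metadata above.
counts_by_value = {"0": 4338, "3": 1086, "1": 3063, "2": 2341}


def relevance_percentages(counts):
    """Map each relevance level to its share of all qrels, in percent (1 decimal)."""
    total = sum(counts.values())
    return {
        int(level): round(100 * n / total, 1)
        for level, n in sorted(counts.items(), key=lambda kv: int(kv[0]))
    }


print(sum(counts_by_value.values()))           # 10828
print(relevance_percentages(counts_by_value))  # {0: 40.1, 1: 28.3, 2: 21.6, 3: 10.0}
```

The total of 10,828 matches the qrels `count` field, and the shares match the table for this subset.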

`"msmarco-passage-v2/trec-dl-2021/judged"`

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
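
The filtering described here amounts to keeping only queries whose `query_id` appears in at least one qrel. The following is a sketch of that idea on toy records, not the library's actual implementation; `Query` and `Qrel` are local stand-ins for `GenericQuery` and `TrecQrel`, and `judged_queries` is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-ins mirroring the GenericQuery and TrecQrel fields.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Return the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Toy records; ids and text are invented for illustration.
toy_queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
toy_qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q3", "d4", 0, "0"), Qrel("q1", "d2", 1, "0")]

print([q.query_id for q in judged_queries(toy_queries, toy_qrels)])  # ['q1', 'q3']
```

Note that a query counts as judged even if all of its judgments are 0, as with `q3` here.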

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

53 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

11K qrels

Inherits qrels from msmarco-passage-v2/trec-dl-2021

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`4.3K`	40.1%
1	Related: The passage seems related to the query but does not answer it.	`3.1K`	28.3%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`2.3K`	21.6%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.1K`	10.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

5.3K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier