← home
Github: datasets/msmarco_passage_v2.py

ir_datasets: MSMARCO (passage, version 2)

Index
  1. msmarco-passage-v2
  2. msmarco-passage-v2/dev1
  3. msmarco-passage-v2/dev2
  4. msmarco-passage-v2/train
  5. msmarco-passage-v2/trec-dl-2021

"msmarco-passage-v2"

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

  • Version 1 of dataset: msmarco-passage
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper

Change Log

  • On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have change results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.
docs

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-passage-v2/dev1"

Official dev1 set with 3,903 queries.

Official evaluation measures: RR@10

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition
1Labeled by crowd worker as relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-passage-v2/dev2"

Official dev2 set with 4,281 queries.

Official evaluation measures: RR@10

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition
1Labeled by crowd worker as relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-passage-v2/train"

Official train set with 277,144 queries.

Official evaluation measures: RR@10

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition
1Labeled by crowd worker as relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }

"msmarco-passage-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier