← home
Github: datasets/msmarco_passage_v2.py

ir_datasets: MSMARCO (passage, version 2)

Index
  1. msmarco-passage-v2
  2. msmarco-passage-v2/dev1
  3. msmarco-passage-v2/dev2
  4. msmarco-passage-v2/train
  5. msmarco-passage-v2/trec-dl-2021
  6. msmarco-passage-v2/trec-dl-2021/judged
  7. msmarco-passage-v2/trec-dl-2022
  8. msmarco-passage-v2/trec-dl-2022/judged

"msmarco-passage-v2"

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

  • Version 1 of dataset: msmarco-passage
  • Documents: Text extracted from web pages
  • Queries: Natural language questions (from query log)
  • Dataset Paper

Change Log

  • On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have change results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.
docs
138M docs

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata

"msmarco-passage-v2/dev1"

Official dev1 set with 3,903 queries.

Note that that qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Official evaluation measures: RR@10

queries
3.9K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
4.0K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
1Based on mapping from v1 of MS MARCO4.0K100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev1.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
390K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev1.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata

"msmarco-passage-v2/dev2"

Official dev2 set with 4,281 queries.

Note that that qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Official evaluation measures: RR@10

queries
4.3K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
4.4K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
1Based on mapping from v1 of MS MARCO4.4K100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
428K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev2.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata

"msmarco-passage-v2/train"

Official train set with 277,144 queries.

Official evaluation measures: RR@10

queries
277K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.train')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
284K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
1Based on mapping from v1 of MS MARCO284K100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
28M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.train.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata

"msmarco-passage-v2/trec-dl-2021"

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

queries
477 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
11K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Irrelevant: The passage has nothing to do with the query.4.3K40.1%
1Related: The passage seems related to the query but does not answer it.3.1K28.3%
2Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.2.3K21.6%
3Perfectly relevant: The passage is dedicated to the query and contains the exact answer.1.1K10.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
48K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Metadata

"msmarco-passage-v2/trec-dl-2021/judged"

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

queries
53 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
11K qrels

Inherits qrels from msmarco-passage-v2/trec-dl-2021

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Irrelevant: The passage has nothing to do with the query.4.3K40.1%
1Related: The passage seems related to the query but does not answer it.3.1K28.3%
2Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.2.3K21.6%
3Perfectly relevant: The passage is dedicated to the query and contains the exact answer.1.1K10.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
5.3K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Metadata

"msmarco-passage-v2/trec-dl-2022"

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially-released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisions among systems to ensure that they were evaluated in the same settings.

queries
500 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022 docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
386K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Irrelevant: The passage has nothing to do with the query.286K74.1%
1Related: The passage seems related to the query but does not answer it.52K13.5%
2Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.46K11.9%
3Perfectly relevant: The passage is dedicated to the query and contains the exact answer.1.7K0.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
50K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Metadata

"msmarco-passage-v2/trec-dl-2022/judged"

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

queries
76 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:
MsMarcoV2Passage: (namedtuple)
  1. doc_id: str
  2. text: str
  3. spans: Tuple[Tuple[int,int], ...]
  4. msmarco_document_id: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged docs
[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
386K qrels

Inherits qrels from msmarco-passage-v2/trec-dl-2022

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Irrelevant: The passage has nothing to do with the query.286K74.1%
1Related: The passage seems related to the query but does not answer it.52K13.5%
2Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.46K11.9%
3Perfectly relevant: The passage is dedicated to the query and contains the exact answer.1.7K0.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
7.6K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Metadata