ir_datasets : MSMARCO (passage, version 2)

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Bajaj2016Msmarco}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={CoCo@NIPS},
  year={2016}
}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  }
}
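The metadata above says that no doc_id is longer than 28 characters and that every doc_id shares the prefix msmarco_passage_. The following is a minimal sketch, using only the standard library, of how those two values relate to the `meta={"docno": 28}` setting in the PyTerrier indexing example. The JSON is copied from this page rather than fetched from ir_datasets, and the helper `strip_prefix` and the sample ID are hypothetical names introduced for illustration.

```python
import json

# Metadata block copied from this page (an assumption: not read from ir_datasets).
METADATA_JSON = """
{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  }
}
"""

metadata = json.loads(METADATA_JSON)
doc_id_info = metadata["docs"]["fields"]["doc_id"]

# Width to reserve for the docno meta field so that no ID is truncated.
docno_width = doc_id_info["max_len"]

# Characters left for the variable part of an ID once the shared prefix is removed.
suffix_width = docno_width - len(doc_id_info["common_prefix"])


def strip_prefix(doc_id: str) -> str:
    """Hypothetical helper: drop the shared prefix from a doc_id."""
    prefix = doc_id_info["common_prefix"]
    if not doc_id.startswith(prefix):
        raise ValueError(f"unexpected doc_id: {doc_id!r}")
    return doc_id[len(prefix):]


# "msmarco_passage_00_0" is a made-up ID used only to exercise the helper.
print(docno_width, suffix_width, strip_prefix("msmarco_passage_00_0"))
```

Reserving exactly `max_len` characters keeps the index metadata compact while still fitting every ID in the collection.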

`"msmarco-passage-v2/dev1"`

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed: labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Official evaluation measures: RR@10
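As a rough illustration of what RR@10 measures, the sketch below computes the reciprocal rank of the first relevant passage within the top 10 results and averages it over queries. It is a plain-Python approximation written for this page, not the implementation used by ir_datasets or PyTerrier; the function names and the toy run and qrels are assumptions.

```python
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant doc within the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_rr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Average the reciprocal rank over every query that has qrels."""
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(run.get(query_id, []), relevant, k)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy data: q1 finds its relevant passage at rank 2, q2 never finds one.
toy_run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
toy_qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mean_rr_at_k(toy_run, toy_qrels))
```

Queries with no relevant passage in the top 10 contribute 0 to the mean, so the cutoff matters for systems that rank the answer just below it.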

3.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str
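To make the field layout concrete, here is a small sketch of a record with the same four fields. `PassageRecord` is a hypothetical stand-in, not the real MsMarcoV2Passage class, and the sample values are invented. The sketch also assumes that each span is a (start, end) offset pair, which this page does not state.

```python
from typing import NamedTuple, Tuple


class PassageRecord(NamedTuple):
    """Hypothetical stand-in mirroring the fields listed above."""
    doc_id: str
    text: str
    spans: Tuple[Tuple[int, int], ...]
    msmarco_document_id: str


def span_lengths(record: PassageRecord) -> Tuple[int, ...]:
    """Length of each span, assuming (start, end) offsets with an exclusive end."""
    return tuple(end - start for start, end in record.spans)


# Invented values, used only to show how the fields are accessed.
example = PassageRecord(
    doc_id="msmarco_passage_00_0",
    text="An example passage.",
    spans=((0, 10), (11, 16)),
    msmarco_document_id="msmarco_doc_00_0",
)
print(example.doc_id, span_lengths(example))
```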

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

4.0K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`4.0K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev1.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

390K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev1.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 3903
  },
  "qrels": {
    "count": 4009,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4009
        }
      }
    }
  },
  "scoreddocs": {
    "count": 390300
  }
}
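The counts above can be combined to sanity-check a download. The sketch below, which uses only the numbers printed in this block, derives the average number of scored passages and of qrels per query for dev1. The result is consistent with 100 scored passages per query, although this page does not state that figure; the variable names are illustrative.

```python
import json

# dev1 statistics copied from the block above (doc fields omitted for brevity).
DEV1_STATS = json.loads("""
{
  "queries": {"count": 3903},
  "qrels": {"count": 4009},
  "scoreddocs": {"count": 390300}
}
""")

query_count = DEV1_STATS["queries"]["count"]

# Average number of scored passages supplied per query.
scoreddocs_per_query = DEV1_STATS["scoreddocs"]["count"] / query_count

# Average number of relevance judgments per query.
qrels_per_query = DEV1_STATS["qrels"]["count"] / query_count

print(scoreddocs_per_query, round(qrels_per_query, 2))
```

With roughly one judgment per query, most dev1 queries have a single labeled passage, which is why a rank-based measure such as RR@10 is a natural fit.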

`"msmarco-passage-v2/dev2"`

Official dev2 set with 4,281 queries.

Official evaluation measures: RR@10

4.3K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

4.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`4.4K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

428K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev2.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 4281
  },
  "qrels": {
    "count": 4411,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4411
        }
      }
    }
  },
  "scoreddocs": {
    "count": 428100
  }
}

`"msmarco-passage-v2/train"`

Official train set with 277,144 queries.

Official evaluation measures: RR@10

277K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.train')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

284K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Based on mapping from v1 of MS MARCO	`284K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

28M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.train.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Bajaj2016Msmarco}

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 277144
  },
  "qrels": {
    "count": 284212,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 284212
        }
      }
    }
  },
  "scoreddocs": {
    "count": 27713673
  }
}

`"msmarco-passage-v2/trec-dl-2021"`

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
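The `rel=2` argument in two of these measures means that only passages judged 2 or higher count as relevant, while nDCG@10 uses the graded labels directly. The sketch below illustrates P(rel=2)@10 and nDCG@10 in plain Python. It assumes linear gains and a log2 rank discount, which may differ in detail from the evaluation tool you use; the function names and toy judgments are made up for the example.

```python
import math
from typing import Dict, List


def precision_at_k(ranking: List[str], judgments: Dict[str, int],
                   k: int = 10, min_rel: int = 2) -> float:
    """Fraction of the top k whose label is at least min_rel (unjudged counts as 0)."""
    hits = sum(1 for doc_id in ranking[:k] if judgments.get(doc_id, 0) >= min_rel)
    return hits / k


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gains and a log2(rank + 1) discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """DCG of the top k divided by the DCG of the best possible ordering."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy judgments on the 0-3 scale used by TREC DL; "x" is unjudged.
toy_judgments = {"a": 3, "b": 2, "c": 1, "d": 0}
toy_ranking = ["b", "d", "a", "x"]
print(precision_at_k(toy_ranking, toy_judgments))  # 2 of 10 slots hold a label >= 2 -> 0.2
print(ndcg_at_k(toy_ranking, toy_judgments))
```

Note that precision divides by k even when fewer than k passages are returned, so short rankings are penalized.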

477 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

11K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`4.3K`	40.1%
1	Related: The passage seems related to the query but does not answer it.	`3.1K`	28.3%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`2.3K`	21.6%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.1K`	10.0%
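The rounded counts and percentages in this table can be reproduced from the exact per-level counts that this page lists in the dataset statistics (4338, 3063, 2341 and 1086 judgments for levels 0 to 3). A minimal sketch follows; the variable names are illustrative.

```python
# Exact per-level counts, copied from the "counts_by_value" statistics on this page.
counts_by_value = {0: 4338, 1: 3063, 2: 2341, 3: 1086}

total = sum(counts_by_value.values())

# Share of all judgments at each relevance level, rounded as in the table.
percentages = {
    level: round(100 * count / total, 1)
    for level, count in sorted(counts_by_value.items())
}

# Judgments that count as relevant under a rel=2 threshold.
relevant_at_2 = sum(count for level, count in counts_by_value.items() if level >= 2)

print(total, percentages, relevant_at_2)
```

Under the rel=2 threshold used by P(rel=2)@10 and RR(rel=2), a little under a third of the judged passages count as relevant.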

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

48K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 477
  },
  "qrels": {
    "count": 10828,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 4338,
          "3": 1086,
          "1": 3063,
          "2": 2341
        }
      }
    }
  },
  "scoreddocs": {
    "count": 47700
  }
}

`"msmarco-passage-v2/trec-dl-2021/judged"`

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
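The filtering described here can be pictured as keeping only the queries whose ID appears in at least one qrel. The sketch below shows that idea on toy records; it is not the ir_datasets implementation, and the `Query` and `Qrel` tuples are simplified stand-ins for GenericQuery and TrecQrel with invented values.

```python
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    """Simplified stand-in for GenericQuery."""
    query_id: str
    text: str


class Qrel(NamedTuple):
    """Simplified stand-in for TrecQrel."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def judged_queries(queries: Iterable[Query], qrels: Iterable[Qrel]) -> List[Query]:
    """Keep the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Invented records: only q1 and q3 have judgments.
toy_queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
toy_qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q3", "d7", 0, "0"), Qrel("q3", "d8", 3, "0")]
print([query.query_id for query in judged_queries(toy_queries, toy_qrels)])
```

Evaluating on the judged subset avoids averaging in queries that can only ever score zero.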

Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)

53 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

11K qrels

Inherits qrels from msmarco-passage-v2/trec-dl-2021

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`4.3K`	40.1%
1	Related: The passage seems related to the query but does not answer it.	`3.1K`	28.3%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`2.3K`	21.6%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.1K`	10.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

5.3K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2021/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 53
  },
  "qrels": {
    "count": 10828,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 4338,
          "3": 1086,
          "1": 3063,
          "2": 2341
        }
      }
    }
  },
  "scoreddocs": {
    "count": 5300
  }
}

`"msmarco-passage-v2/trec-dl-2022"`

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially released qrels include relevance labels propagated to duplicate passages, while the results presented in the notebook papers remove duplicate documents. This means that the two sets of results are not directly comparable, and extra care should be taken when comparing systems to ensure that they were evaluated under the same settings.
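One way to picture the duplicate-removed setting is to collapse each ranked list so that only the first passage from every group of duplicates is kept before scoring. The sketch below does this with a hypothetical `canonical` mapping from a duplicate passage ID to a representative ID. This page does not describe how the duplicates are listed, so both the mapping format and the function name are assumptions.

```python
from typing import Dict, List


def collapse_duplicates(ranking: List[str], canonical: Dict[str, str]) -> List[str]:
    """Keep the highest-ranked passage of each duplicate group, preserving order."""
    seen_groups = set()
    kept = []
    for doc_id in ranking:
        # Passages missing from the mapping form their own single-member group.
        group = canonical.get(doc_id, doc_id)
        if group not in seen_groups:
            seen_groups.add(group)
            kept.append(doc_id)
    return kept


# Invented IDs: p2 duplicates p1, and p5 duplicates p4.
toy_canonical = {"p2": "p1", "p5": "p4"}
toy_ranking = ["p2", "p1", "p3", "p5", "p4"]
print(collapse_duplicates(toy_ranking, toy_canonical))
```

Whichever setting is chosen, applying it to every system being compared keeps the reported numbers consistent.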

500 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022 queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022 docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

386K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`286K`	74.1%
1	Related: The passage seems related to the query but does not answer it.	`52K`	13.5%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`46K`	11.9%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.7K`	0.4%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
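The relevance-level counts and percentages in the table above can be recomputed from any iterable of qrels. The following is a minimal sketch: the `Qrel` class is a stand-in that mirrors the fields of `TrecQrel`, and `relevance_distribution` is a hypothetical helper, not part of ir_datasets.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Qrel(NamedTuple):
    """Stand-in with the same fields as TrecQrel, for illustration only."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevance_distribution(qrels: Iterable[Qrel]) -> Dict[int, Tuple[int, float]]:
    """Map each relevance level to (count, percentage of all qrels)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


sample = [
    Qrel("q1", "d1", 0, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q2", "d4", 3, "0"),
]
distribution = relevance_distribution(sample)
```

Passing `dataset.qrels_iter()` from the example above in place of `sample` should give figures in line with the table, although that requires the qrels to be downloaded first.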

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

50K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
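Scored documents arrive as a flat stream of (query_id, doc_id, score) records, so a common first step is to group them into one ranking per query. The sketch below assumes higher scores are better; `ScoredDoc` is a stand-in mirroring `GenericScoredDoc`, and `rankings_by_query` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class ScoredDoc(NamedTuple):
    """Stand-in with the same fields as GenericScoredDoc."""
    query_id: str
    doc_id: str
    score: float


def rankings_by_query(
    scoreddocs: Iterable[ScoredDoc], k: Optional[int] = None
) -> Dict[str, List[ScoredDoc]]:
    """Group scored docs by query and sort each group by descending score.

    If k is given, only the top-k documents per query are kept.
    """
    grouped = defaultdict(list)
    for scoreddoc in scoreddocs:
        grouped[scoreddoc.query_id].append(scoreddoc)
    return {
        query_id: sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]
        for query_id, docs in grouped.items()
    }


sample = [
    ScoredDoc("q1", "d1", 1.5),
    ScoredDoc("q2", "d9", 0.2),
    ScoredDoc("q1", "d2", 4.0),
]
rankings = rankings_by_query(sample)
top1 = rankings_by_query(sample, k=1)
```

The same function can be applied to `dataset.scoreddocs_iter()`, for example to use the resulting rankings as candidates for re-ranking.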

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
dataset.get_results()

You can find more details about the PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

{
  "docs": {
    "count": 138364198,
    "fields": {
      "doc_id": {
        "max_len": 28,
        "common_prefix": "msmarco_passage_"
      }
    }
  },
  "queries": {
    "count": 500
  },
  "qrels": {
    "count": 386416,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 286459,
          "1": 52218,
          "2": 46080,
          "3": 1659
        }
      }
    }
  },
  "scoreddocs": {
    "count": 50000
  }
}

`"msmarco-passage-v2/trec-dl-2022/judged"`

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

76 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
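The "judged" subset is described as the parent dataset restricted to queries that have qrels. The same filtering can be sketched by hand as follows; the `Query` and `Qrel` classes are stand-ins mirroring `GenericQuery` and `TrecQrel`, and `judged_queries` is a hypothetical helper rather than the library's own implementation.

```python
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    """Stand-in with the same fields as GenericQuery."""
    query_id: str
    text: str


class Qrel(NamedTuple):
    """Stand-in with the same fields as TrecQrel."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def judged_queries(queries: Iterable[Query], qrels: Iterable[Qrel]) -> List[Query]:
    """Return the queries that have at least one qrel, in their original order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


sample_queries = [Query("1", "a"), Query("2", "b"), Query("3", "c")]
sample_qrels = [Qrel("1", "d1", 2, "0"), Qrel("3", "d7", 0, "0")]
judged = judged_queries(sample_queries, sample_qrels)
```

Applied to the queries and qrels of msmarco-passage-v2/trec-dl-2022, this kind of filter would be expected to reduce the 500 queries to the 76 listed for this subset.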

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022/judged queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

138M docs

Inherits docs from msmarco-passage-v2

Language: en

Document type:

MsMarcoV2Passage: (namedtuple)

doc_id: str
text: str
spans: Tuple[Tuple[int,int], ...]
msmarco_document_id: str

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022/judged docs



[doc_id]    [text]    [spans]    [msmarco_document_id]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

386K qrels

Inherits qrels from msmarco-passage-v2/trec-dl-2022

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Irrelevant: The passage has nothing to do with the query.	`286K`	74.1%
1	Related: The passage seems related to the query but does not answer it.	`52K`	13.5%
2	Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information.	`46K`	11.9%
3	Perfectly relevant: The passage is dedicated to the query and contains the exact answer.	`1.7K`	0.4%

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

7.6K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export msmarco-passage-v2/trec-dl-2022/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
dataset.get_results()

You can find more details about the PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.