ir_datasets
: MSMARCO (passage, version 2)Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.
Change Log
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } } }
Official dev1 set with 3,903 queries.
Note that that qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 4.0K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 3903 }, "qrels": { "count": 4009, "fields": { "relevance": { "counts_by_value": { "1": 4009 } } } }, "scoreddocs": { "count": 390300 } }
Official dev2 set with 4,281 queries.
Note that that qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 4.4K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 4281 }, "qrels": { "count": 4411, "fields": { "relevance": { "counts_by_value": { "1": 4411 } } } }, "scoreddocs": { "count": 428100 } }
Official train set with 277,144 queries.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 284K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 277144 }, "qrels": { "count": 284212, "fields": { "relevance": { "counts_by_value": { "1": 284212 } } } }, "scoreddocs": { "count": 27713673 } }
Official topics for the TREC Deep Learning (DL) 2021 shared task.
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 4.3K | 40.1% |
1 | Related: The passage seems related to the query but does not answer it. | 3.1K | 28.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 2.3K | 21.6% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.1K | 10.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 477 }, "qrels": { "count": 10828, "fields": { "relevance": { "counts_by_value": { "0": 4338, "3": 1086, "1": 3063, "2": 2341 } } } }, "scoreddocs": { "count": 47700 } }
msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage-v2/trec-dl-2021
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 4.3K | 40.1% |
1 | Related: The passage seems related to the query but does not answer it. | 3.1K | 28.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 2.3K | 21.6% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.1K | 10.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 53 }, "qrels": { "count": 10828, "fields": { "relevance": { "counts_by_value": { "0": 4338, "3": 1086, "1": 3063, "2": 2341 } } } }, "scoreddocs": { "count": 5300 } }