ir_datasets: MSMARCO (passage, version 2)
Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
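Because each passage record carries an `msmarco_document_id`, passages can be grouped by their source document. The following is a minimal sketch, assuming records shaped like the namedtuple shown above; the `Passage` stand-in, the `group_by_document` helper, and the sample IDs and span values are made up for illustration. With the real corpus you would pass `dataset.docs_iter()` instead of the sample list.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in mirroring the fields of the records yielded by docs_iter().
Passage = namedtuple("Passage", ["doc_id", "text", "spans", "msmarco_document_id"])


def group_by_document(passages):
    """Map each msmarco_document_id to the doc_ids of its passages, in input order."""
    groups = defaultdict(list)
    for passage in passages:
        groups[passage.msmarco_document_id].append(passage.doc_id)
    return dict(groups)


# Made-up sample records; real IDs and span values will differ.
sample = [
    Passage("p0", "first passage", [(0, 13)], "d0"),
    Passage("p1", "second passage", [(14, 28)], "d0"),
    Passage("p2", "another passage", [(0, 15)], "d1"),
]
grouped = group_by_document(sample)
print(grouped)
```

Grouping the full 138M-passage corpus this way keeps every ID in memory, so in practice you may prefer to process one document's passages at a time.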
ir_datasets export msmarco-passage-v2 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } } }
Official dev1 set with 3,903 queries.
Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.
Official evaluation measures: RR@10
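For readers unfamiliar with RR@10, the sketch below shows one plain-Python way to compute it: the reciprocal of the rank of the first relevant passage within the top 10, or 0 if none appears. The function names and the toy rankings are illustrative assumptions, not part of ir_datasets; an established evaluation library should be preferred for reported results.

```python
def reciprocal_rank_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Return 1/rank of the first relevant document in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank_at_k(rankings, relevant, k=10):
    """Average RR@k over the queries present in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(ranking, relevant.get(query_id, set()), k)
        for query_id, ranking in rankings.items()
    )
    return total / len(rankings)


# Toy data: q1 finds its relevant passage at rank 2, q2 finds none.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
relevant = {"q1": {"b"}, "q2": {"z"}}
score = mean_reciprocal_rank_at_k(rankings, relevant)
print(score)
```

This sketch averages over the queries that have a ranking; averaging over all judged queries instead is a design choice that changes the result when some queries return nothing.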
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev1.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev1 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev1')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 4.0K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
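Evaluation code usually needs the qrels as a lookup table rather than a flat stream. A minimal sketch, assuming records with the fields shown above, is given here; the `Qrel` stand-in, the `qrels_to_dict` helper, and the sample values are hypothetical. With the real data you would pass `dataset.qrels_iter()`.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields of the records yielded by qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    lookup = {}
    for qrel in qrels:
        lookup.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return lookup


# Made-up sample records.
sample_qrels = [
    Qrel("q1", "p0", 1, "0"),
    Qrel("q1", "p7", 1, "0"),
    Qrel("q2", "p3", 1, "0"),
]
lookup = qrels_to_dict(sample_qrels)
print(lookup["q1"])
```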
ir_datasets export msmarco-passage-v2/dev1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev1.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments (qrels) for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev1")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
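The scored documents can be turned into per-query rankings for re-ranking experiments. The sketch below assumes records with the fields shown above and assumes that a higher `score` means a better match; the `ScoredDoc` stand-in, the `top_k_per_query` helper, and the sample scores are hypothetical.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in mirroring the fields of the records yielded by scoreddocs_iter().
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Return {query_id: [doc_id, ...]} sorted by descending score, cut at k."""
    by_query = defaultdict(list)
    for scoreddoc in scoreddocs:
        by_query[scoreddoc.query_id].append(scoreddoc)
    return {
        query_id: [sd.doc_id for sd in sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]]
        for query_id, docs in by_query.items()
    }


# Made-up sample records; assumes higher scores are better.
sample_scores = [
    ScoredDoc("q1", "p0", 7.5),
    ScoredDoc("q1", "p1", 9.1),
    ScoredDoc("q1", "p2", 3.2),
    ScoredDoc("q2", "p9", 1.0),
]
top2 = top_k_per_query(sample_scores, k=2)
print(top2)
```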
ir_datasets export msmarco-passage-v2/dev1 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev1')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev1.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 3903 }, "qrels": { "count": 4009, "fields": { "relevance": { "counts_by_value": { "1": 4009 } } } }, "scoreddocs": { "count": 390300 } }
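The counts in the dev1 metadata record above can be sanity-checked with a little arithmetic. The sketch below copies only the four counts from that record (the `fields` entries are omitted for brevity) and derives per-query averages; the variable names are illustrative.

```python
import json

# Counts copied from the dev1 metadata record; nested "fields" entries omitted.
metadata = json.loads("""
{
  "docs": {"count": 138364198},
  "queries": {"count": 3903},
  "qrels": {"count": 4009},
  "scoreddocs": {"count": 390300}
}
""")

num_queries = metadata["queries"]["count"]
qrels_per_query = metadata["qrels"]["count"] / num_queries
scoreddocs_per_query = metadata["scoreddocs"]["count"] / num_queries

print(f"{qrels_per_query:.3f} qrels per query on average")
print(f"{scoreddocs_per_query:.1f} scored passages per query on average")
```

The scored-document count works out to exactly 100 per query for dev1, and the qrels average is only slightly above one per query, which is consistent with the sparse labels described for this subset.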
Official dev2 set with 4,281 queries.
Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.dev2.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.dev2')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 4.4K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.dev2.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments (qrels) for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/dev2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/dev2 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/dev2')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.dev2.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 4281 }, "qrels": { "count": 4411, "fields": { "relevance": { "counts_by_value": { "1": 4411 } } } }, "scoreddocs": { "count": 428100 } }
Official train set with 277,144 queries.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.train.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.train')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Based on mapping from v1 of MS MARCO | 284K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[RR@10]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.train.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments (qrels) for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.train.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 277144 }, "qrels": { "count": 284212, "fields": { "relevance": { "counts_by_value": { "1": 284212 } } } }, "scoreddocs": { "count": 27713673 } }
Official topics for the TREC Deep Learning (DL) 2021 shared task.
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
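The `rel=2` parameter in P(rel=2)@10 and RR(rel=2) means that only passages judged at level 2 or higher count as relevant. The sketch below illustrates that thresholding in plain Python; the function names and toy judgments are assumptions for illustration, and the precision helper assumes the common convention of dividing by k even when fewer than k passages are returned.

```python
def precision_at_k(ranked_doc_ids, judgments, k=10, min_rel=2):
    """Fraction of the top k whose judged relevance is at least min_rel."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if judgments.get(doc_id, 0) >= min_rel)
    return hits / k


def reciprocal_rank(ranked_doc_ids, judgments, min_rel=2):
    """1/rank of the first passage judged at least min_rel, else 0.0."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if judgments.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


# Toy judgments on the 0-3 scale; unjudged passages are treated as level 0.
judgments = {"a": 1, "b": 3, "c": 2, "d": 0}
ranking = ["a", "b", "d", "c"]

p10 = precision_at_k(ranking, judgments)
rr = reciprocal_rank(ranking, judgments)
print(p10, rr)
```

Passage "a" is ranked first but is only level 1 ("Related"), so it does not count under the `rel=2` threshold; the first counted hit is "b" at rank 2.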
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 4.3K | 40.1% |
1 | Related: The passage seems related to the query but does not answer it. | 3.1K | 28.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 2.3K | 21.6% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.1K | 10.0% |
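The percentages in the table follow directly from the exact per-level counts recorded in this subset's metadata (`counts_by_value`). A short sketch of that arithmetic, with illustrative variable names:

```python
# Exact per-level qrel counts for trec-dl-2021, as listed in the subset's metadata.
counts_by_value = {0: 4338, 1: 3063, 2: 2341, 3: 1086}

total = sum(counts_by_value.values())
shares = {
    rel: round(100 * count / total, 1)
    for rel, count in sorted(counts_by_value.items())
}

print(total)
for rel, share in shares.items():
    print(f"rel={rel}: {share}%")
```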
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments (qrels) for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 477 }, "qrels": { "count": 10828, "fields": { "relevance": { "counts_by_value": { "0": 4338, "3": 1086, "1": 3063, "2": 2341 } } } }, "scoreddocs": { "count": 47700 } }
msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
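The filtering described here amounts to keeping only the queries whose IDs appear in the qrels. The sketch below shows that idea on made-up records; the `Query` and `Qrel` stand-ins and the `judged_only` helper are hypothetical, and loading the `/judged` subset directly remains the simplest way to get the same result.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the query and qrel record fields.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_only(queries, qrels):
    """Keep the queries that have at least one qrel, preserving query order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up sample data: only q1 and q3 have judgments.
all_queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
sample_qrels = [Qrel("q3", "p5", 2, "0"), Qrel("q1", "p0", 0, "0"), Qrel("q1", "p4", 3, "0")]

judged = judged_only(all_queries, sample_qrels)
print([query.query_id for query in judged])
```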
Official evaluation measures: AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Inherits qrels from msmarco-passage-v2/trec-dl-2021
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 4.3K | 40.1% |
1 | Related: The passage seems related to the query but does not answer it. | 3.1K | 28.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 2.3K | 21.6% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.1K | 10.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[AP@100, nDCG@10, P(rel=2)@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments (qrels) for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2021/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2021/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2021/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2021.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 53 }, "qrels": { "count": 10828, "fields": { "relevance": { "counts_by_value": { "0": 4338, "3": 1086, "1": 3063, "2": 2341 } } } }, "scoreddocs": { "count": 5300 } }
Official topics for the TREC Deep Learning (DL) 2022 shared task.
Note that the officially released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisons among systems to ensure that they were evaluated in the same setting.
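To make the difference between the two settings concrete, the sketch below collapses a ranking so that only the first passage from each group of duplicates is kept. This is only one possible reading of "removing duplicates": the `canonical_id` mapping is a hypothetical input that you would have to build from a duplicate list obtained separately, it is not something the examples on this page provide, and the notebook papers may have used a different procedure.

```python
def drop_duplicates(ranked_doc_ids, canonical_id):
    """Keep the first-ranked passage of each duplicate group.

    `canonical_id` is a hypothetical mapping from a passage ID to the ID that
    represents its duplicate group; passages missing from it stand alone.
    """
    seen = set()
    kept = []
    for doc_id in ranked_doc_ids:
        group = canonical_id.get(doc_id, doc_id)
        if group not in seen:
            seen.add(group)
            kept.append(doc_id)
    return kept


# Made-up example: p2 and p5 are duplicates of p1.
canonical_id = {"p2": "p1", "p5": "p1"}
ranking = ["p2", "p3", "p1", "p5", "p4"]

deduplicated = drop_duplicates(ranking, canonical_id)
print(deduplicated)
```

Whichever convention is chosen, applying the same one to every system being compared is what keeps the comparison fair.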
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022 docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 286K | 74.1% |
1 | Related: The passage seems related to the query but does not answer it. | 52K | 13.5% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 46K | 11.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.7K | 0.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022')
dataset.get_results()
You can find more details about the PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
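The scoreddocs records are flat (query_id, doc_id, score) tuples; the metadata on this page lists 50,000 of them for 500 queries, an average of 100 per query. To use them as an initial ranking, it can help to group them by query and sort by score. The sketch below assumes that a higher score means a better match; the ScoredDoc namedtuple, the helper rank_by_query, and the toy records are hypothetical stand-ins for what dataset.scoreddocs_iter() yields.

```python
from collections import defaultdict, namedtuple

# Stand-in for the records yielded by dataset.scoreddocs_iter().
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def rank_by_query(scoreddocs, k=None):
    """Group scored docs by query and return doc_ids sorted by descending score.

    If k is given, only the top-k doc_ids per query are kept.
    """
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    ranked = {}
    for query_id, docs in by_query.items():
        docs.sort(key=lambda sd: sd.score, reverse=True)
        ranked[query_id] = [sd.doc_id for sd in docs[:k]]
    return ranked


# Toy records with made-up identifiers and scores.
toy_scoreddocs = [
    ScoredDoc("q1", "d_a", 11.2),
    ScoredDoc("q1", "d_b", 14.7),
    ScoredDoc("q2", "d_c", 9.1),
    ScoredDoc("q1", "d_c", 3.0),
]
ranking = rank_by_query(toy_scoreddocs, k=2)
print(ranking)
```

With the real dataset, passing dataset.scoreddocs_iter() in place of toy_scoreddocs should work the same way, since only the three named fields are used.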
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 500 }, "qrels": { "count": 386416, "fields": { "relevance": { "counts_by_value": { "0": 286459, "1": 52218, "2": 46080, "3": 1659 } } } }, "scoreddocs": { "count": 50000 } }
The same as msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries that have qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
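The judged subset keeps only the queries that have at least one qrel. The following sketch illustrates that kind of filter on toy data; it is not the ir_datasets implementation. The namedtuples mirror the field layouts shown on this page, while the helper judged_queries and all identifiers in the toy data are hypothetical.

```python
from collections import namedtuple

# Stand-ins mirroring the namedtuple layouts documented on this page.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Keep only the queries whose query_id appears in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Toy data with made-up identifiers.
toy_queries = [
    Query("q1", "first example query"),
    Query("q2", "second example query"),
    Query("q3", "third example query"),
]
toy_qrels = [
    Qrel("q1", "d_a", 2, "0"),
    Qrel("q3", "d_b", 0, "0"),
    Qrel("q3", "d_c", 1, "0"),
]
kept = judged_queries(toy_queries, toy_qrels)
print([query.query_id for query in kept])
```

Note that a query counts as judged even when all of its qrels are 0, since the filter looks at the presence of judgments rather than their values.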
Inherits docs from msmarco-passage-v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, spans, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged docs
[doc_id] [text] [spans] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
# Index msmarco-passage-v2
indexer = pt.IterDictIndexer('./indices/msmarco-passage-v2', meta={"docno": 28})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Inherits qrels from msmarco-passage-v2/trec-dl-2022
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 286K | 74.1% |
1 | Related: The passage seems related to the query but does not answer it. | 52K | 13.5% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 46K | 11.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 1.7K | 0.4% |
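Binary measures need a cutoff on this four-level scale. As an illustration, the sketch below maps graded labels to 0/1 under the assumption that levels 2 and above count as relevant; that threshold is an assumption of this example and should be checked against the evaluation setup you are following. The Qrel namedtuple, the helper binarize, and the toy records are hypothetical.

```python
from collections import namedtuple

# Stand-in for the records yielded by dataset.qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def binarize(qrels, min_rel=2):
    """Yield copies of the qrels with relevance mapped to 1 (>= min_rel) or 0."""
    for qrel in qrels:
        yield qrel._replace(relevance=int(qrel.relevance >= min_rel))


# Toy records with made-up identifiers, one per relevance level.
toy_qrels = [
    Qrel("q1", "d_a", 0, "0"),
    Qrel("q1", "d_b", 1, "0"),
    Qrel("q1", "d_c", 2, "0"),
    Qrel("q1", "d_d", 3, "0"),
]
binary = list(binarize(toy_qrels))
print([(qrel.doc_id, qrel.relevance) for qrel in binary])
```

Passing min_rel=1 instead treats "Related" passages as relevant too, which changes the number of positives considerably given the counts in the table above.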
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage-v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage-v2/trec-dl-2022/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage-v2/trec-dl-2022/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage-v2/trec-dl-2022/judged')
dataset.get_results()
You can find more details about the PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.msmarco-passage-v2.trec-dl-2022.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{ "docs": { "count": 138364198, "fields": { "doc_id": { "max_len": 28, "common_prefix": "msmarco_passage_" } } }, "queries": { "count": 76 }, "qrels": { "count": 386416, "fields": { "relevance": { "counts_by_value": { "0": 286459, "1": 52218, "2": 46080, "3": 1659 } } } }, "scoreddocs": { "count": 7600 } }