ir_datasets: MSMARCO (passage)
A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
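For intuition, this kind of mojibake typically arises when UTF-8 bytes are decoded as Latin-1; re-encoding as Latin-1 and decoding as UTF-8 recovers the intended characters. A minimal illustration with a hypothetical string (ir_datasets already applies the correction to the collection for you, so no action is needed):
mojibake = "cafÃ©"  # hypothetical double-encoded form of "café"
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # café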
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
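In addition to sequential iteration, ir_datasets provides random access to passages by ID through a docs_store; a minimal sketch (passage ID "0" is used purely as an example):
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
docstore = dataset.docs_store()
doc = docstore.get("0")  # look up a single passage by its doc_id
doc.text
docs = docstore.get_many(["0", "1", "2"])  # mapping of doc_id to passage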
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } } }
Official dev set.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available dev queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
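The re-ranking setting described above can be reproduced by grouping the provided scoreddocs by query and re-scoring each candidate passage with your own model; a minimal sketch, where score_fn stands in for a hypothetical scorer you would supply:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

def score_fn(query_text, passage_text):
    return 0.0  # placeholder; replace with your own model

run = {}  # query_id -> list of (doc_id, score)
for sd in dataset.scoreddocs_iter():
    score = score_fn(queries[sd.query_id], docstore.get(sd.doc_id).text)
    run.setdefault(sd.query_id, []).append((sd.doc_id, score))
for query_id in run:
    run[query_id].sort(key=lambda pair: pair[1], reverse=True)  # best-first per query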
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 101093 }, "qrels": { "count": 59273, "fields": { "relevance": { "counts_by_value": { "1": 59273 } } } } }
Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
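The /judged view can be reproduced manually by keeping only the queries whose IDs appear in the qrels; a minimal sketch of that filter:
import ir_datasets
full = ir_datasets.load("msmarco-passage/dev")
judged_qids = {qrel.query_id for qrel in full.qrels_iter()}
judged_queries = [q for q in full.queries_iter() if q.query_id in judged_qids]
len(judged_queries)  # should match the query count of msmarco-passage/dev/judged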
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/dev
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 55578 }, "qrels": { "count": 59273, "fields": { "relevance": { "counts_by_value": { "1": 59273 } } } } }
Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 7.4K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 6980 }, "qrels": { "count": 7437, "fields": { "relevance": { "counts_by_value": { "1": 7437 } } } }, "scoreddocs": { "count": 6668967 } }
Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available eval queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Official evaluation measures: RR@10
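Because the judgments are hidden, eval queries are scored by submitting a run to the leaderboard. The passage ranking leaderboard has historically accepted tab-separated runs of the form query_id, passage_id, rank (best first); a minimal sketch of writing such a file, using placeholder IDs and without any claim about the current submission requirements, which should be checked against the leaderboard instructions:
def write_msmarco_run(run, path):
    # run: dict mapping query_id to a list of doc_ids ordered best-first
    with open(path, "wt") as f:
        for query_id, doc_ids in run.items():
            for rank, doc_id in enumerate(doc_ids, start=1):
                f.write(f"{query_id}\t{doc_id}\t{rank}\n")

write_msmarco_run({"1": ["0", "1", "2"]}, "eval_run.tsv")  # placeholder IDs for illustration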
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 101092 } }
Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 6837 }, "scoreddocs": { "count": 6515736 } }
Official train set.
Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the available train queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
docpairs provides access to the "official" sequence for pairwise training.
Official evaluation measures: RR@10
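The docpairs can be combined with the queries and the docs_store to materialize (query, positive passage, negative passage) text triples for pairwise training; in the official triples the first passage ID is the positive and the second the negative. A minimal sketch that only peeks at the first few pairs (the training loop itself is omitted):
import itertools
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

for pair in itertools.islice(dataset.docpairs_iter(), 5):
    query_text = queries[pair.query_id]
    pos_text = docstore.get(pair.doc_id_a).text  # doc_id_a: positive passage
    neg_text = docstore.get(pair.doc_id_b).text  # doc_id_b: negative passage
    # feed (query_text, pos_text, neg_text) to a pairwise loss here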
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 269919004 } }
Subset of msmarco-passage/train that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Inherits docpairs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 502939 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 269919004 } }
Subset of msmarco-passage/train that only includes queries containing a layman or expert medical term. Note that about 20% of the matches are false positives due to terms with multiple senses.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 55K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{MacAvaney2020MedMarco, author = {MacAvaney, Sean and Cohan, Arman and Goharian, Nazli}, title = {SLEDGE-Zero: A Zero-Shot Baseline for COVID-19 Literature Search}, booktitle = {EMNLP}, year = {2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 78895 }, "qrels": { "count": 54627, "fields": { "relevance": { "counts_by_value": { "1": 54627 } } } }, "scoreddocs": { "count": 48852277 }, "docpairs": { "count": 28969254 } }
Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set (msmarco-passage/train/split200-valid). This split has been used in various works.
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808531 }, "qrels": { "count": 532630, "fields": { "relevance": { "counts_by_value": { "1": 532630 } } } }, "scoreddocs": { "count": 477883382 }, "docpairs": { "count": 269854839 } }
Subset of msmarco-passage/train containing only the 200 held-out queries, intended for use as a small validation set. This split has been used in various works.
Official evaluation measures: RR@10
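A common use of this split is to track MRR@10 on the 200 held-out queries during training; a minimal sketch of computing MRR@10 from a run, where the run dict is a hypothetical output of your own system:
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
relevant = {}  # query_id -> set of relevant doc_ids
for qrel in dataset.qrels_iter():
    relevant.setdefault(qrel.query_id, set()).add(qrel.doc_id)

def mrr_at_10(run):
    # run: dict mapping query_id to a list of doc_ids ordered best-first
    total = 0.0
    for query_id, doc_ids in run.items():
        for rank, doc_id in enumerate(doc_ids[:10], start=1):
            if doc_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break
    return total / len(run) if run else 0.0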
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 131 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 131, "fields": { "relevance": { "counts_by_value": { "1": 131 } } } }, "scoreddocs": { "count": 119011 }, "docpairs": { "count": 64165 } }
Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).
Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.
Official evaluation measures: RR@10
Inherits queries from msmarco-passage/train
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Inherits scoreddocs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-small')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-small docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 39780811 } }
Version of msmarco-passage/train, but with version 2 of the triples file.
This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).
Note that this file is sorted by the IDs it contains, so you will likely want to shuffle it before use. We opened an issue suggesting that a shuffled third version of the file be provided, so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.
Official evaluation measures: RR@10
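Because the v2 triples are sorted by ID, a shuffle is advisable before training. Materializing all ~398 million pairs in memory is impractical, so the sketch below only shuffles a prefix sample with a fixed seed; note that a prefix of an ID-sorted file is itself biased toward low IDs, and a proper shuffle requires an offline pass over the whole file:
import itertools
import random
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
sample = list(itertools.islice(dataset.docpairs_iter(), 1_000_000))  # prefix sample only
random.Random(42).shuffle(sample)  # fixed seed for a reproducible order
sample[0]  # namedtuple<query_id, doc_id_a, doc_id_b>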
Inherits queries from msmarco-passage/train
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Inherits scoreddocs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/triples-v2')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/triples-v2 docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 808731 }, "qrels": { "count": 532761, "fields": { "relevance": { "counts_by_value": { "1": 532761 } } } }, "scoreddocs": { "count": 478002393 }, "docpairs": { "count": 397768673 } }
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
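For these graded judgments, the track's measures binarize relevance at level 2 for RR and AP (as also shown in the PyTerrier example below); the same evaluation can be run directly with the ir_measures package against a TREC-format run file. A minimal sketch, assuming a hypothetical run file run.txt:
import ir_datasets
import ir_measures
from ir_measures import nDCG, RR, AP

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
run = ir_measures.read_trec_run("run.txt")  # hypothetical run file in TREC format
ir_measures.calc_aggregate([nDCG@10, RR(rel=2), AP(rel=2)], dataset.qrels_iter(), run)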
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 9260, "fields": { "relevance": { "counts_by_value": { "0": 5158, "1": 1601, "2": 1804, "3": 697 } } } }, "scoreddocs": { "count": 189877 } }
Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2019
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 43 }, "qrels": { "count": 9260, "fields": { "relevance": { "counts_by_value": { "0": 5158, "1": 1601, "2": 1804, "3": 697 } } } }, "scoreddocs": { "count": 41042 } }
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
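For re-ranking or neural experiments it is often useful to keep the passage text retrievable from the index. A minimal variant of the indexing example above that also stores text as metadata (the index path and meta field lengths here are assumptions; adjust to your setup):
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# store docno and passage text as metadata so they can be fetched at retrieval time
indexer = pt.IterDictIndexer('./indices/msmarco-passage-with-text', meta={'docno': 20, 'text': 4096})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])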
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 11386, "fields": { "relevance": { "counts_by_value": { "2": 1020, "3": 646, "0": 7780, "1": 1940 } } } }, "scoreddocs": { "count": 190699 } }
Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2020
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 54 }, "qrels": { "count": 11386, "fields": { "relevance": { "counts_by_value": { "2": 1020, "3": 646, "0": 7780, "1": 1940 } } } }, "scoreddocs": { "count": 50024 } }
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
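Since DL-HARD draws its 50 queries from the TREC DL 2019/2020 query pools, the overlap can be checked directly; a minimal sketch:
import ir_datasets
hard = ir_datasets.load("msmarco-passage/trec-dl-hard")
dl19 = ir_datasets.load("msmarco-passage/trec-dl-2019")
dl20 = ir_datasets.load("msmarco-passage/trec-dl-2020")
hard_qids = {q.query_id for q in hard.queries_iter()}
pool_qids = {q.query_id for q in dl19.queries_iter()} | {q.query_id for q in dl20.queries_iter()}
len(hard_qids), len(hard_qids & pool_qids)  # the DL-HARD query ids should appear in the 2019/2020 pools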
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 2.5K | 57.8% |
1 | Related: The passage seems related to the query but does not answer it. | 810 | 19.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 634 | 14.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 350 | 8.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 4256, "fields": { "relevance": { "counts_by_value": { "0": 2462, "1": 810, "2": 634, "3": 350 } } } } }
Fold 1 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 582 | 54.3% |
1 | Related: The passage seems related to the query but does not answer it. | 197 | 18.4% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 181 | 16.9% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 112 | 10.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 1072, "fields": { "relevance": { "counts_by_value": { "0": 582, "1": 197, "2": 181, "3": 112 } } } } }
Fold 2 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 611 | 68.0% |
1 | Related: The passage seems related to the query but does not answer it. | 151 | 16.8% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 99 | 11.0% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 37 | 4.1% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 898, "fields": { "relevance": { "counts_by_value": { "3": 37, "2": 99, "0": 611, "1": 151 } } } } }
Fold 3 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 342 | 77.0% |
1 | Related: The passage seems related to the query but does not answer it. | 43 | 9.7% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 36 | 8.1% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 23 | 5.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 444, "fields": { "relevance": { "counts_by_value": { "0": 342, "1": 43, "2": 36, "3": 23 } } } } }
Fold 4 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 396 | 55.3% |
1 | Related: The passage seems related to the query but does not answer it. | 137 | 19.1% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 151 | 21.1% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 32 | 4.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 716, "fields": { "relevance": { "counts_by_value": { "3": 32, "2": 151, "1": 137, "0": 396 } } } } }
Fold 5 of msmarco-passage/trec-dl-hard
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Irrelevant: The passage has nothing to do with the query. | 531 | 47.2% |
1 | Related: The passage seems related to the query but does not answer it. | 282 | 25.0% |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 167 | 14.8% |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 146 | 13.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[nDCG@10, RR(rel=2)]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 1126, "fields": { "relevance": { "counts_by_value": { "3": 146, "1": 282, "0": 531, "2": 167 } } } } }
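The five folds above partition the 50 DL-HARD queries into groups of 10, which makes per-fold evaluation (and cross-validation across folds) straightforward. A minimal per-fold sketch with ir_measures, where run_for is a placeholder for your own retrieval system returning {query_id: {doc_id: score}}:
import ir_datasets
import ir_measures
from ir_measures import nDCG, RR
def run_for(queries):
    return {q.query_id: {} for q in queries}  # placeholder: substitute your system's results
for i in range(1, 6):
    fold = ir_datasets.load(f"msmarco-passage/trec-dl-hard/fold{i}")
    run = run_for(list(fold.queries_iter()))
    print(f"fold{i}", ir_measures.calc_aggregate([nDCG@10, RR(rel=2)], fold.qrels_iter(), run))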