ir_datasets: MSMARCO (passage)
A passage ranking benchmark with a collection of 8.8 million passages and natural-language question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
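For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 results. The sketch below is only an illustration of the measure, not an official evaluation script; qrels and run are hypothetical toy inputs (use the official MS MARCO or trec_eval tooling for reported numbers).
def mrr_at_10(qrels, run):
    # qrels: {query_id: set of relevant doc_ids}
    # run: {query_id: list of doc_ids, ranked best-first}
    total = 0.0
    for query_id, ranking in run.items():
        relevant = qrels.get(query_id, set())
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(run)

print(mrr_at_10({"q1": {"d7"}}, {"q1": ["d3", "d7", "d9"]}))  # 0.5: first relevant passage at rank 2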
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
}
}
Official dev set.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
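A minimal sketch of this re-ranking setting is shown below: the BM25 candidates are grouped by query and re-scored with a stand-in function. my_model_score is a hypothetical placeholder (not part of ir_datasets), and passage text is fetched via the docs_store() lookup API.
import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("msmarco-passage/dev")
docstore = dataset.docs_store()  # random-access lookup of passages by doc_id
queries = {q.query_id: q.text for q in dataset.queries_iter()}

# Group the BM25 candidate doc_ids by query
candidates = defaultdict(list)
for sd in dataset.scoreddocs_iter():
    candidates[sd.query_id].append(sd.doc_id)

def my_model_score(query_text, passage_text):
    return 0.0  # hypothetical re-ranker; replace with a real model

run = {}
for query_id, doc_ids in candidates.items():
    run[query_id] = {doc_id: my_model_score(queries[query_id], docstore.get(doc_id).text)
                     for doc_id in doc_ids}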
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 101093
},
"qrels": {
"count": 59273,
"fields": {
"relevance": {
"counts_by_value": {
"1": 59273
}
}
}
},
"scoreddocs": {
"count": 6668967
}
}
Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/dev
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 59K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 55578
},
"qrels": {
"count": 59273,
"fields": {
"relevance": {
"counts_by_value": {
"1": 59273
}
}
}
},
"scoreddocs": {
"count": 6668967
}
}
Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 7.4K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 6980
},
"qrels": {
"count": 7437,
"fields": {
"relevance": {
"counts_by_value": {
"1": 7437
}
}
}
}
}
Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available eval queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
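Because the judgments are hidden, systems are evaluated by submitting a run to the leaderboard. The sketch below writes a run in the (query_id, doc_id, rank) TSV layout that the official MS MARCO passage evaluation script has historically accepted; run is a hypothetical {query_id: {doc_id: score}} dict, and the current submission instructions should be checked before submitting.
run = {"1102432": {"7187158": 18.7, "2021790": 15.2}}  # toy example; replace with your system's scores

with open("run.tsv", "w") as f:
    for query_id, doc_scores in run.items():
        ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
        for rank, (doc_id, _score) in enumerate(ranked[:1000], start=1):
            f.write(f"{query_id}\t{doc_id}\t{rank}\n")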
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 101092
},
"scoreddocs": {
"count": 6515736
}
}
Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 6837
}
}
Official train set.
Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
docpairs provides access to the "official" sequence for pairwise training.
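A sketch of turning docpairs into text triples for pairwise training is shown below; training_triples is an illustrative helper (not part of the ir_datasets API), and it assumes doc_id_a is the positive passage and doc_id_b the negative, as in the official training pairs.
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/train")
docstore = dataset.docs_store()
queries = {q.query_id: q.text for q in dataset.queries_iter()}

def training_triples():
    # Yield (query text, positive passage text, negative passage text)
    for pair in dataset.docpairs_iter():
        yield (queries[pair.query_id],
               docstore.get(pair.doc_id_a).text,
               docstore.get(pair.doc_id_b).text)

first = next(training_triples())  # pairs stream lazily in the official order (~270M of them)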
Official evaluation measures: RR@10
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [RR@10]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 808731
},
"qrels": {
"count": 532761,
"fields": {
"relevance": {
"counts_by_value": {
"1": 532761
}
}
}
},
"scoreddocs": {
"count": 478002393
},
"docpairs": {
"count": 269919004
}
}
Subset of msmarco-passage/train that only includes queries that have at least one qrel.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/train
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docpairs from msmarco-passage/train
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 502939
},
"qrels": {
"count": 532761,
"fields": {
"relevance": {
"counts_by_value": {
"1": 532761
}
}
}
},
"scoreddocs": {
"count": 478002393
},
"docpairs": {
"count": 269919004
}
}
Subset of msmarco-passage/train that only includes queries containing a layman or expert medical term. Note that about 20% of the matches are false positives due to terms with multiple senses.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 55K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{MacAvaney2020MedMarco, author = {MacAvaney, Sean and Cohan, Arman and Goharian, Nazli}, title = {SLEDGE-Zero: A Zero-Shot Baseline for COVID-19 Literature Search}, booktitle = {EMNLP}, year = {2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 78895
},
"qrels": {
"count": 54627,
"fields": {
"relevance": {
"counts_by_value": {
"1": 54627
}
}
}
},
"scoreddocs": {
"count": 48852277
},
"docpairs": {
"count": 28969254
}
}
Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set. This split has been used in various works.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 533K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 808531
},
"qrels": {
"count": 532630,
"fields": {
"relevance": {
"counts_by_value": {
"1": 532630
}
}
}
},
"scoreddocs": {
"count": 477883382
},
"docpairs": {
"count": 269854839
}
}
Subset of msmarco-passage/train containing only the 200 held-out queries, meant to be used as a small validation set. This split has been used in various works.
Official evaluation measures: RR@10
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Labeled by crowd worker as relevant | 131 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 200
},
"qrels": {
"count": 131,
"fields": {
"relevance": {
"counts_by_value": {
"1": 131
}
}
}
},
"scoreddocs": {
"count": 119011
},
"docpairs": {
"count": 64165
}
}
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available as msmarco-passage/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
| 1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
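Because these judgments are graded (0-3), RR and AP are conventionally computed with relevance binarized at level 2 or above, as in the PyTerrier experiment example below. An equivalent sketch using the separate ir_measures package (run is a hypothetical {query_id: {doc_id: score}} dict produced by your system):
import ir_datasets
from ir_measures import nDCG, RR, AP, calc_aggregate

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
run = {}  # hypothetical: fill with your system's scores

# Grades >= 2 count as relevant for RR and AP; nDCG uses the full graded labels
print(calc_aggregate([nDCG@10, RR(rel=2), AP(rel=2)], dataset.qrels_iter(), run))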
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 200
},
"qrels": {
"count": 9260,
"fields": {
"relevance": {
"counts_by_value": {
"0": 5158,
"1": 1601,
"2": 1804,
"3": 697
}
}
}
},
"scoreddocs": {
"count": 189877
}
}
Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2019
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 5.2K | 55.7% |
| 1 | Related: The passage seems related to the query but does not answer it. | 1.6K | 17.3% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.8K | 19.5% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 697 | 7.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 43
},
"qrels": {
"count": 9260,
"fields": {
"relevance": {
"counts_by_value": {
"0": 5158,
"1": 1601,
"2": 1804,
"3": 697
}
}
}
},
"scoreddocs": {
"count": 41042
}
}
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available as msmarco-passage/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
| 1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [nDCG@10, RR(rel=2), AP(rel=2)]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 200
},
"qrels": {
"count": 11386,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1020,
"3": 646,
"0": 7780,
"1": 1940
}
}
}
},
"scoreddocs": {
"count": 190699
}
}
Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from msmarco-passage/trec-dl-2020
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 7.8K | 68.3% |
| 1 | Related: The passage seems related to the query but does not answer it. | 1.9K | 17.0% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 1.0K | 9.0% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 646 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Craswell2020TrecDl, title={Overview of the TREC 2020 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos}, booktitle={TREC}, year={2020} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 54
},
"qrels": {
"count": 11386,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1020,
"3": 646,
"0": 7780,
"1": 1940
}
}
}
},
"scoreddocs": {
"count": 50024
}
}
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 2.5K | 57.8% |
| 1 | Related: The passage seems related to the query but does not answer it. | 810 | 19.0% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 634 | 14.9% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 350 | 8.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 4256,
"fields": {
"relevance": {
"counts_by_value": {
"0": 2462,
"1": 810,
"2": 634,
"3": 350
}
}
}
}
}
Fold 1 of msmarco-passage/trec-dl-hard
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 582 | 54.3% |
| 1 | Related: The passage seems related to the query but does not answer it. | 197 | 18.4% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 181 | 16.9% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 112 | 10.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 10
},
"qrels": {
"count": 1072,
"fields": {
"relevance": {
"counts_by_value": {
"0": 582,
"1": 197,
"2": 181,
"3": 112
}
}
}
}
}
Fold 2 of msmarco-passage/trec-dl-hard
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 611 | 68.0% |
| 1 | Related: The passage seems related to the query but does not answer it. | 151 | 16.8% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 99 | 11.0% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 37 | 4.1% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
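No official PyTerrier example is listed for these qrels; as a hedged sketch, assuming get_topics() and get_qrels() are available for this fold through the irds integration, an evaluation could look like:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
# ndcg_cut_10 is an illustrative choice of measure for these graded judgments
pt.Experiment([bm25], dataset.get_topics(), dataset.get_qrels(), eval_metrics=['ndcg_cut_10'])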
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 10
},
"qrels": {
"count": 898,
"fields": {
"relevance": {
"counts_by_value": {
"3": 37,
"2": 99,
"0": 611,
"1": 151
}
}
}
}
}
Fold 3 of msmarco-passage/trec-dl-hard
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
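No official PyTerrier example is listed for these queries; as a hedged sketch, assuming the irds integration exposes this fold's topics via get_topics(), retrieval would follow the same pattern as the other msmarco-passage subsets:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())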
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 342 | 77.0% |
| 1 | Related: The passage seems related to the query but does not answer it. | 43 | 9.7% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 36 | 8.1% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 23 | 5.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
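No official PyTerrier example is listed for these qrels; as a hedged sketch, assuming get_topics() and get_qrels() are available for this fold through the irds integration, an evaluation could look like:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
# ndcg_cut_10 is an illustrative choice of measure for these graded judgments
pt.Experiment([bm25], dataset.get_topics(), dataset.get_qrels(), eval_metrics=['ndcg_cut_10'])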
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 10
},
"qrels": {
"count": 444,
"fields": {
"relevance": {
"counts_by_value": {
"0": 342,
"1": 43,
"2": 36,
"3": 23
}
}
}
}
}
Fold 4 of msmarco-passage/trec-dl-hard
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
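No official PyTerrier example is listed for these queries; as a hedged sketch, assuming the irds integration exposes this fold's topics via get_topics(), retrieval would follow the same pattern as the other msmarco-passage subsets:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())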
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 396 | 55.3% |
| 1 | Related: The passage seems related to the query but does not answer it. | 137 | 19.1% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 151 | 21.1% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 32 | 4.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
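No official PyTerrier example is listed for these qrels; as a hedged sketch, assuming get_topics() and get_qrels() are available for this fold through the irds integration, an evaluation could look like:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
# ndcg_cut_10 is an illustrative choice of measure for these graded judgments
pt.Experiment([bm25], dataset.get_topics(), dataset.get_qrels(), eval_metrics=['ndcg_cut_10'])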
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 10
},
"qrels": {
"count": 716,
"fields": {
"relevance": {
"counts_by_value": {
"3": 32,
"2": 151,
"1": 137,
"0": 396
}
}
}
}
}
Fold 5 of msmarco-passage/trec-dl-hard
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
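No official PyTerrier example is listed for these queries; as a hedged sketch, assuming the irds integration exposes this fold's topics via get_topics(), retrieval would follow the same pattern as the other msmarco-passage subsets:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())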
Inherits docs from msmarco-passage
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant: The passage has nothing to do with the query. | 531 | 47.2% |
| 1 | Related: The passage seems related to the query but does not answer it. | 282 | 25.0% |
| 2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. | 167 | 14.8% |
| 3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. | 146 | 13.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
No example available for PyTerrier
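No official PyTerrier example is listed for these qrels; as a hedged sketch, assuming get_topics() and get_qrels() are available for this fold through the irds integration, an evaluation could look like:
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
# ndcg_cut_10 is an illustrative choice of measure for these graded judgments
pt.Experiment([bm25], dataset.get_topics(), dataset.get_qrels(), eval_metrics=['ndcg_cut_10'])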
Bibtex:
@article{Mackie2021DlHard, title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset}, author={Iain Mackie and Jeffrey Dalton and Andrew Yates}, journal={ArXiv}, year={2021}, volume={abs/2105.07975} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
Metadata:
{
"docs": {
"count": 8841823,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 10
},
"qrels": {
"count": 1126,
"fields": {
"relevance": {
"counts_by_value": {
"3": 146,
"1": 282,
"0": 531,
"2": 167
}
}
}
}
}