ir_datasets: MSMARCO (passage)
A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
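For reference, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage found within the top 10 results (0 if none is found). A minimal sketch, assuming rankings and judgments are held in plain dictionaries; the function and variable names are illustrative and not part of ir_datasets:

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant doc within the top k.

    rankings: dict mapping query_id -> list of doc_ids, best first
    relevant: dict mapping query_id -> set of relevant doc_ids
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked in rankings.items():
        rel = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data: q1 finds its relevant passage at rank 2, q2 finds none.
rankings = {"q1": ["d3", "d1"], "q2": ["d9", "d8"]}
relevant = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_k(rankings, relevant))  # 0.25
```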
Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
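To illustrate how this kind of error arises and can be reversed, here is a minimal sketch. It assumes the garbling came from UTF-8 bytes being misread as Latin-1; the helper names are hypothetical, and the correction ir_datasets actually applies may differ:

```python
def garble(text: str) -> str:
    """Simulate the double-encoding error: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Reverse the error; return the text unchanged if it is not garbled."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or the bytes are not valid UTF-8:
        # assume the text was fine to begin with.
        return text


broken = garble("公")   # begins with "å" and ends with "¬"
fixed = repair(broken)  # back to "公"
```

The `try`/`except` guard matters because applying the reversal blindly to text that was never garbled would either fail or corrupt it.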
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Official dev set.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
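In the re-ranking setting, the scoreddocs records are typically grouped into one candidate list per query before a model re-scores them. A minimal sketch, using stand-in records so it runs without downloading anything; `candidates_by_query` is a hypothetical helper, and because every stored score is 0 it treats each list as a candidate set without assuming any ranking:

```python
from collections import defaultdict, namedtuple


def candidates_by_query(scoreddocs):
    """Group (query_id, doc_id, score) records into per-query candidate lists."""
    grouped = defaultdict(list)
    for sd in scoreddocs:
        grouped[sd.query_id].append(sd.doc_id)
    return dict(grouped)


# Stand-in records for illustration. With ir_datasets, one would pass
# dataset.scoreddocs_iter() instead.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])
sample = [
    ScoredDoc("q1", "d5", 0.0),
    ScoredDoc("q1", "d2", 0.0),
    ScoredDoc("q2", "d9", 0.0),
]
candidates = candidates_by_query(sample)
```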
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
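The filtering that defines a "judged" subset can be sketched as follows. This is an illustration with stand-in records and a hypothetical helper name, not the library's own implementation:

```python
from collections import namedtuple

Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Keep only the queries that appear in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [q for q in queries if q.query_id in judged_ids]


# Stand-in data: only q1 has a relevance judgment.
queries = [Query("q1", "what is bm25"), Query("q2", "unjudged query")]
qrels = [Qrel("q1", "d7", 1, "0")]
kept = judged_queries(queries, qrels)
```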
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/dev/small qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available eval queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/eval/small docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/eval/small')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Official train set.
Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.
scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
docpairs provides access to the "official" sequence for pairwise training.
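For pairwise training, each docpair is usually resolved into a (query text, passage A text, passage B text) triple. The sketch below uses stand-in records and plain dictionaries for lookups; `training_triples` is a hypothetical helper, and it assumes that doc_id_a is the preferred (relevant) passage and doc_id_b the contrasting one, which the text above does not state:

```python
from collections import namedtuple

DocPair = namedtuple("DocPair", ["query_id", "doc_id_a", "doc_id_b"])


def training_triples(docpairs, query_text, doc_text):
    """Yield (query, passage_a, passage_b) text triples in docpair order.

    query_text / doc_text: dicts mapping ids to text. With ir_datasets these
    could be built from queries_iter() and a document lookup such as
    docs_store() (the exact lookup used is an assumption here).
    """
    for pair in docpairs:
        yield (
            query_text[pair.query_id],
            doc_text[pair.doc_id_a],  # assumed: preferred passage
            doc_text[pair.doc_id_b],  # assumed: contrasting passage
        )


# Stand-in data for illustration only.
query_text = {"q1": "what is bm25"}
doc_text = {"d1": "BM25 is a ranking function.", "d2": "Paris is in France."}
pairs = [DocPair("q1", "d1", "d2")]
triples = list(training_triples(pairs, query_text, doc_text))
```

Iterating in the stored order, rather than shuffling, is what preserves the "official" training sequence.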
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/train that only includes queries that have at least one qrel.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/judged docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.
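The kind of term matching described above, and why it produces false matches, can be sketched as follows. The term list and helper are hypothetical stand-ins for illustration; the actual lexicon and matching procedure used to build this subset are not given here:

```python
import re

# Hypothetical mini-lexicon; the real layman/expert term lists are not shown here.
MEDICAL_TERMS = {"cold", "fever", "hypertension"}


def has_medical_term(query, terms=MEDICAL_TERMS):
    """Return True if any lowercase word in the query is in the term list."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return bool(tokens & terms)


true_match = has_medical_term("what causes a fever")      # medical sense
false_match = has_medical_term("how cold is antarctica")  # "cold" as temperature
no_match = has_medical_term("capital of france")
```

Both of the first two queries match, although only the first is medical: "cold" has more than one sense, which is the source of the false matches mentioned above.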
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/medical')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/medical docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set (available as msmarco-passage/train/split200-valid). This split is used in various works.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-train')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/train containing only the 200 queries held out as a small validation set. This split is used in various works.
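The relationship between the two split200 subsets is a simple partition by query ID. A minimal sketch with a hypothetical helper and made-up IDs; the actual 200 held-out query IDs are fixed by prior work and are not reproduced here:

```python
def split_queries(query_ids, valid_ids):
    """Partition query IDs into (train, valid) given the held-out IDs."""
    valid_ids = set(valid_ids)
    train = [q for q in query_ids if q not in valid_ids]
    valid = [q for q in query_ids if q in valid_ids]
    return train, valid


# Made-up IDs for illustration only.
all_ids = ["q1", "q2", "q3", "q4"]
held_out = ["q2", "q4"]
train_ids, valid_ids = split_queries(all_ids, held_out)
```

Every training query lands in exactly one of the two lists, so the subsets do not overlap.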
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/train/split200-valid docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
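As a rough illustration of how these grades might be used, the sketch below tallies judgments per relevance level. The `Qrel` tuple, the `count_by_level` helper, and the sample rows are made up for this example. With ir_datasets installed, the same function should accept `dataset.qrels_iter()`, since it only reads the `relevance` field.

```python
from collections import Counter, namedtuple

# Stand-in for the qrel records shown in this section (hypothetical sample data).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Short labels taken from the relevance table above.
RELEVANCE_LABELS = {
    0: "Irrelevant",
    1: "Related",
    2: "Highly relevant",
    3: "Perfectly relevant",
}


def count_by_level(qrels):
    """Return {label: number of judgments} for every level in the table."""
    counts = Counter(q.relevance for q in qrels)
    return {label: counts.get(level, 0) for level, label in RELEVANCE_LABELS.items()}


sample_qrels = [
    Qrel("q1", "d1", 3, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q1", "d3", 2, "0"),
    Qrel("q2", "d4", 0, "0"),
]
level_counts = count_by_level(sample_qrels)
print(level_counts)
```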
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
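The exported rows can be read back with a few lines of Python. This is a minimal sketch that assumes the TSV output is tab-separated with the four columns in the order shown above; `parse_qrels_tsv` is a hypothetical helper and the sample text is made up.

```python
import csv
import io


def parse_qrels_tsv(stream):
    """Yield (query_id, doc_id, relevance, iteration) from tab-separated lines.

    Assumes one judgment per line, with columns in the order shown above.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield query_id, doc_id, int(relevance), iteration


# Made-up sample standing in for the exported file contents.
sample = "q1\td1\t3\t0\nq1\td2\t0\t0\n"
parsed = list(parse_qrels_tsv(io.StringIO(sample)))
print(parsed)
```

To read a real export, pass an open file handle instead of the `io.StringIO` object.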
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.
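The idea of a "judged" subset can be sketched as keeping only the queries whose ID appears in at least one qrel. This is an illustration of the concept, not how ir_datasets builds the subset internally. The tuples, the `judged_queries` helper, and the sample data are hypothetical.

```python
from collections import namedtuple

# Stand-ins for the query and qrel records (hypothetical sample data).
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Return the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


queries = [Query("q1", "first query"), Query("q2", "second query"), Query("q3", "third query")]
qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q3", "d7", 0, "0")]

subset = judged_queries(queries, qrels)
print([query.query_id for query in subset])
```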
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (the filtered list is available in msmarco-passage/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
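One possible use of scoreddocs is to collect the candidate passages for each query before re-ranking them. The sketch below groups records by `query_id`. The `ScoredDoc` tuple, the `candidates_by_query` helper, and the sample rows are hypothetical. With ir_datasets installed, `dataset.scoreddocs_iter()` could be passed in their place, since only the `query_id` and `doc_id` fields are read.

```python
from collections import defaultdict, namedtuple

# Stand-in for the scoreddoc records (hypothetical sample data).
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def candidates_by_query(scoreddocs):
    """Group candidate doc_ids under their query_id, preserving input order."""
    grouped = defaultdict(list)
    for scoreddoc in scoreddocs:
        grouped[scoreddoc.query_id].append(scoreddoc.doc_id)
    return dict(grouped)


sample_scoreddocs = [
    ScoredDoc("q1", "d1", 0.0),
    ScoredDoc("q2", "d5", 0.0),
    ScoredDoc("q1", "d3", 0.0),
]
candidates = candidates_by_query(sample_scoreddocs)
print(candidates)
```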
ir_datasets export msmarco-passage/trec-dl-2020 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-2020/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
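To make the nDCG@k measure used in the experiment above more concrete, here is a small sketch of one common formulation, computed over graded relevance for a single query. It assumes linear gain (the relevance grade itself) and a log2 rank discount. The measure implementation PyTerrier uses may differ in details such as tie handling. The `ndcg_at_k` helper and the sample judgments are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k):
    """nDCG@k for one query, with linear gain and a log2(rank + 1) discount.

    ranked_doc_ids: doc_ids in the order the system returned them.
    relevance_by_doc: {doc_id: graded relevance}; unjudged docs count as 0.
    """
    def dcg(gains):
        # enumerate() is 0-based, so rank 1 gets a discount of log2(2) = 1.
        return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))

    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments on the 0-3 scale from the relevance table.
judgments = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}

perfect = ndcg_at_k(["d1", "d2", "d4", "d3"], judgments, k=20)
swapped = ndcg_at_k(["d1", "d2", "d3", "d4"], judgments, k=20)
print(perfect, swapped)
```

A ranking that places the documents in decreasing order of relevance scores 1.0. Moving a grade-1 passage below an irrelevant one lowers the score slightly.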
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-passage
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
# Index msmarco-passage
indexer = pt.IterDictIndexer('./indices/msmarco-passage')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: The passage has nothing to do with the query. |
1 | Related: The passage seems related to the query but does not answer it. |
2 | Highly relevant: The passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information. |
3 | Perfectly relevant: The passage is dedicated to the query and contains the exact answer. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-passage/trec-dl-hard/fold5 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-passage') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.