ir_datasets: MSMARCO (document)
"Based on the questions in the [MS MARCO] Question Answering Dataset and the documents which answered the questions, a document ranking task was formulated. There are 3.2 million documents, and the goal is to rank them based on their relevance. Relevance labels are derived from which passages were marked as having the answer in the QnA dataset."
Language: en
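The label-derivation step described above can be sketched as follows. This is an illustrative reconstruction, not the official conversion script: the `derive_doc_qrels` helper and the `passage_to_doc` mapping are hypothetical names, and the real pipeline works from the QnA dataset's own annotation format.

```python
from collections import defaultdict

def derive_doc_qrels(passage_qrels, passage_to_doc):
    """Map passage-level positive labels up to their containing documents.

    passage_qrels: iterable of (query_id, passage_id, relevance) triples
    passage_to_doc: dict mapping passage_id -> doc_id
    """
    doc_qrels = defaultdict(int)
    for query_id, passage_id, relevance in passage_qrels:
        if relevance > 0:
            doc_id = passage_to_doc[passage_id]
            # A document counts as relevant if any of its passages held the answer.
            doc_qrels[(query_id, doc_id)] = max(doc_qrels[(query_id, doc_id)], 1)
    return dict(doc_qrels)

# Toy data: two passages of D1 judged for Q1, one of them positive.
qrels = [("Q1", "P1", 1), ("Q1", "P2", 0), ("Q2", "P3", 1)]
p2d = {"P1": "D1", "P2": "D1", "P3": "D2"}
print(derive_doc_qrels(qrels, p2d))  # {('Q1', 'D1'): 1, ('Q2', 'D2'): 1}
```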
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Official dev set. All queries have exactly 1 (positive) relevance judgment.
scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
Language: en
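The "re-ranking" setting mentioned above can be sketched as follows: the first-stage top-100 list fixes the candidate set, a second-stage scorer re-orders it, and (on the dev set, where each query has exactly one positive judgment) reciprocal rank evaluates the result. The `rerank` and `reciprocal_rank` helpers and the toy scorer below are illustrative assumptions, not part of ir_datasets:

```python
from collections import defaultdict, namedtuple

ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])

def rerank(scoreddocs, scorer):
    """Group first-stage candidates by query and re-order them with `scorer`."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd.doc_id)
    return {
        qid: sorted(docs, key=lambda d: scorer(qid, d), reverse=True)
        for qid, docs in by_query.items()
    }

def reciprocal_rank(ranking, positive_doc):
    """1/rank of the single judged-relevant document, 0 if it is absent."""
    for i, doc_id in enumerate(ranking, start=1):
        if doc_id == positive_doc:
            return 1.0 / i
    return 0.0

# Toy first-stage run (stands in for the Indri QL top 100).
run = [ScoredDoc("q1", "d1", 9.0), ScoredDoc("q1", "d2", 8.0), ScoredDoc("q1", "d3", 7.0)]
# Hypothetical second-stage scores that happen to prefer d2.
scores = {("q1", "d1"): 0.2, ("q1", "d2"): 0.9, ("q1", "d3"): 0.1}
reranked = rerank(run, lambda q, d: scores[(q, d)])
print(reciprocal_rank(reranked["q1"], "d2"))  # 1.0
```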
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/dev docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/dev scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Official eval set for submission to the MS MARCO leaderboard. Relevance judgments are hidden.
scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/eval queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/eval docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/eval')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/eval")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/eval scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."
Language: en
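A click-based qrels file like ORCAS's can be thought of as a collapsed click log. The sketch below is purely illustrative: the `clicks_to_qrels` helper and the configurable threshold are assumptions (the released ORCAS qrels are distributed pre-aggregated, with every click treated as relevance 1):

```python
from collections import Counter

def clicks_to_qrels(click_log, min_clicks=1):
    """Collapse a (query_id, doc_id) click log into binary qrels.

    ORCAS-style qrels treat any click as relevance 1; the threshold
    exists here purely for illustration.
    """
    counts = Counter(click_log)
    return {pair: 1 for pair, n in counts.items() if n >= min_clicks}

# Toy click log: D100 clicked twice for q7, D200 once for q8.
log = [("q7", "D100"), ("q7", "D100"), ("q8", "D200")]
print(clicks_to_qrels(log))  # {('q7', 'D100'): 1, ('q8', 'D200'): 1}
```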
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/orcas queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/orcas docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | User click |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/orcas qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/orcas')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/orcas")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/orcas scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Official train set. All queries have exactly 1 (positive) relevance judgment.
scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/train docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
1 | Labeled by crowd worker as relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/train')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries were judged by NIST assessors (a filtered list is available in msmarco-document/trec-dl-2019/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
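With graded labels like these, nDCG is the natural measure (PyTerrier's `nDCG@20` in the experiment example computes it for you). A minimal sketch of the computation, using toy judgments on the 0-3 scale; the helper names are illustrative, not from any of the libraries above:

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank+1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranking, qrels, k=20):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Toy judgments: D1 perfectly relevant, D2 marginally, D3 irrelevant.
qrels = {"D1": 3, "D2": 1, "D3": 0}
# Swapping D2 ahead of D1 costs us some nDCG relative to the ideal order.
print(round(ndcg_at_k(["D2", "D1", "D3"], qrels), 4))
```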
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-document/trec-dl-2019, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019/judged docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2019/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries were judged by NIST assessors (a filtered list is available in msmarco-document/trec-dl-2020/judged).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of msmarco-document/trec-dl-2020, only including queries with qrels.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020/judged docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-2020/judged')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-2020/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-2020/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold1 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query, it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold1')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold2 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query. |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query; it is worthy of being a top result in a search engine. |
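Binary metrics such as MAP need the graded 0-3 judgments above collapsed to 0/1. A minimal sketch, with the cutoff as a parameter since the appropriate threshold depends on the evaluation setup (the default of 2 here is an illustrative choice, not mandated by the dataset):

```python
def binarize(relevance, min_rel=2):
    """Collapse a graded judgment (0-3) to binary relevance.
    min_rel is the lowest grade still counted as relevant."""
    return int(relevance >= min_rel)
```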
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold2')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold3 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query. |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query; it is worthy of being a top result in a search engine. |
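A quick sanity check when working with a fold is the distribution of judgments over the four relevance levels. A small sketch over qrel namedtuples (the synthetic `Qrel` records here only stand in for what `dataset.qrels_iter()` yields):

```python
from collections import Counter, namedtuple

def relevance_histogram(qrels):
    """Count how many judgments fall at each graded relevance level."""
    return Counter(qrel.relevance for qrel in qrels)

# Synthetic stand-ins for the records yielded by dataset.qrels_iter():
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [Qrel("q1", "d1", 3, "0"), Qrel("q1", "d2", 0, "0"), Qrel("q2", "d3", 3, "0")]
```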
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
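Note that classic TREC tooling such as trec_eval expects qrels lines in the column order query_id, iteration, doc_id, relevance, which differs from the export column order shown above. A small adapter sketch (the `Qrel` namedtuple mirrors the fields yielded by `qrels_iter()`):

```python
from collections import namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def to_trec_qrels_line(qrel):
    """Render one judgment in the column order trec_eval expects:
    <query_id> <iteration> <doc_id> <relevance>"""
    return f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}"
```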
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold3')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold4 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query. |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query; it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold4')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
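The `pipeline(...)` call above returns a results frame with (among others) qid, docno, and score columns. To hand those results to other evaluation tools, one common shape is a nested run dict; a sketch operating on plain records, e.g. the output of the frame's `to_dict('records')`:

```python
def records_to_run(records):
    """Build a nested {qid: {docno: score}} run from result records,
    e.g. res.to_dict('records') on a PyTerrier results frame."""
    run = {}
    for r in records:
        run.setdefault(r["qid"], {})[r["docno"]] = r["score"]
    return run
```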
Language: en
Note: Uses docs from msmarco-document
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, body>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold5 docs
[doc_id] [url] [title] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
# Index msmarco-document
indexer = pt.IterDictIndexer('./indices/msmarco-document')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Irrelevant: Document does not provide any useful information about the query. |
1 | Relevant: Document provides some information relevant to the query, which may be minimal. |
2 | Highly relevant: The content of this document provides substantial information on the query. |
3 | Perfectly relevant: Document is dedicated to the query; it is worthy of being a top result in a search engine. |
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-document/trec-dl-hard/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-document/trec-dl-hard/fold5 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-document/trec-dl-hard/fold5')
index_ref = pt.IndexRef.of('./indices/msmarco-document') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
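The nDCG@20 measure used in the experiments above rewards placing high-grade documents early in the ranking. A minimal linear-gain implementation for reference (note that some tools use the exponential-gain variant, 2^rel - 1, so scores may differ across toolkits):

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded gains (linear-gain form)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranked_gains, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

For example, a ranking whose judgments are already in descending order scores 1.0, while burying the "perfectly relevant" document lowers the score.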