ir_datasets: TREC Disks 4 and 5
To use this dataset, you need a copy of TREC Disks 4 and 5, provided by NIST.
Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. It can take some time to process, but you will end up with a password-protected download link.
ir_datasets needs the following directories from the source:
ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/disks45/corpus. The source document files can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it raises an error.
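Before loading the dataset, it can help to see what is actually present under the expected location. The sketch below only lists the directories found there; the helper name `list_corpus_entries` is hypothetical and not part of ir_datasets, and the only path taken from the text is ~/.ir_datasets/disks45/corpus.

```python
import os
from pathlib import Path

# Location named in the text above; everything else here is illustrative.
DEFAULT_CORPUS_DIR = Path(os.path.expanduser("~/.ir_datasets/disks45/corpus"))


def list_corpus_entries(corpus_dir=DEFAULT_CORPUS_DIR):
    """Return the sorted names of directories (or links to directories)
    directly under corpus_dir, or an empty list if it does not exist."""
    corpus_dir = Path(corpus_dir)
    if not corpus_dir.is_dir():
        return []
    return sorted(p.name for p in corpus_dir.iterdir() if p.is_dir())


# Prints an empty list when nothing has been copied or linked yet.
print(list_corpus_entries())
```

An empty result suggests the directories have not been copied or linked yet, which is the situation in which ir_datasets would be expected to raise its error.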
TREC Disks 4 and 5, including documents from the Financial Times, the Congressional Record, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times.
This dataset is a placeholder for the complete collection; at this time, only the version without the Congressional Record (disks45/nocr) is provided.
A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } } }
The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."
The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 294K | 94.4% |
1 | relevant | 16K | 5.3% |
2 | highly relevant | 1.0K | 0.3% |
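The percentages in this table can be reproduced from the exact per-level counts listed in the metadata for disks45/nocr/trec-robust-2004 (293,998, 16,381, and 1,031 judgments out of 311,410). A minimal sketch; the helper name `relevance_shares` is made up for illustration:

```python
def relevance_shares(counts_by_value):
    """Map each relevance level to its share of all judgments, in percent."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in sorted(counts_by_value.items())
    }


# Counts copied from the metadata of disks45/nocr/trec-robust-2004.
robust04_counts = {0: 293998, 1: 16381, 2: 1031}
print(relevance_shares(robust04_counts))  # {0: 94.4, 1: 5.3, 2: 0.3}
```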
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
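Evaluation code often wants the judgments grouped by query rather than as a flat stream. The sketch below shows one way to build that structure from any iterable of records with the fields listed above; `group_qrels` is a hypothetical helper, and the sample records use made-up IDs instead of real data.

```python
from collections import defaultdict, namedtuple


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Stand-in records with made-up IDs, shaped like the namedtuple shown above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [
    Qrel("q1", "docA", 2, "0"),
    Qrel("q1", "docB", 0, "0"),
    Qrel("q2", "docA", 1, "0"),
]
print(group_qrels(sample))
```

With the licensed data in place, the same function would be called as group_qrels(dataset.qrels_iter()).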
ir_datasets export disks45/nocr/trec-robust-2004 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
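If the exported file is consumed from Python, it can be read back with the standard csv module. This sketch assumes the export is tab-separated with the four columns in the order shown above; `read_qrels_tsv` is a hypothetical helper, and the rows are made up.

```python
import csv
import io


def read_qrels_tsv(handle):
    """Yield (query_id, doc_id, relevance, iteration) from tab-separated text."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield query_id, doc_id, int(relevance), iteration


# Made-up rows standing in for real export output.
fake_export = io.StringIO("q1\tdocA\t2\t0\nq1\tdocB\t0\t0\n")
print(list(read_qrels_tsv(fake_export)))
```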
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 250 }, "qrels": { "count": 311410, "fields": { "relevance": { "counts_by_value": { "1": 16381, "0": 293998, "2": 1031 } } } } }
Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 95.2% |
1 | relevant | 2.8K | 4.5% |
2 | highly relevant | 229 | 0.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 62789, "fields": { "relevance": { "counts_by_value": { "0": 59765, "1": 2795, "2": 229 } } } } }
Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 94.3% |
1 | relevant | 3.3K | 5.2% |
2 | highly relevant | 337 | 0.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 63917, "fields": { "relevance": { "counts_by_value": { "1": 3334, "0": 60246, "2": 337 } } } } }
Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 59K | 93.6% |
1 | relevant | 3.9K | 6.2% |
2 | highly relevant | 165 | 0.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 62901, "fields": { "relevance": { "counts_by_value": { "0": 58859, "1": 3877, "2": 165 } } } } }
Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 55K | 95.1% |
1 | relevant | 2.7K | 4.7% |
2 | highly relevant | 152 | 0.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 57962, "fields": { "relevance": { "counts_by_value": { "0": 55103, "1": 2707, "2": 152 } } } } }
Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 94.0% |
1 | relevant | 3.7K | 5.7% |
2 | highly relevant | 148 | 0.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45, title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set}, author = {Ellen M. Voorhees}, doi = {10.18434/t47g6m}, year = {1996}, publisher = {National Institute of Standards and Technology} }
@inproceedings{Voorhees2004Robust, title={Overview of the TREC 2004 Robust Retrieval Track}, author={Ellen Voorhees}, booktitle={TREC}, year={2004} }
@inproceedings{Huston2014ACO, title={A Comparison of Retrieval Models using Term Dependencies}, author={Samuel Huston and W. Bruce Croft}, booktitle={CIKM}, year={2014} }
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 63841, "fields": { "relevance": { "counts_by_value": { "0": 60025, "1": 3668, "2": 148 } } } } }
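The five folds each hold 50 queries, and their qrels counts add up to the 311,410 judgments of the full disks45/nocr/trec-robust-2004 set, which is consistent with the folds partitioning it. The sketch below checks that sum and enumerates leave-one-fold-out splits by dataset ID. The leave-one-fold-out protocol is an assumption made for illustration, not something this page prescribes, and the helper names are hypothetical.

```python
# Per-fold qrels counts, copied from the metadata of each fold above.
FOLD_QREL_COUNTS = {
    "fold1": 62789,
    "fold2": 63917,
    "fold3": 62901,
    "fold4": 57962,
    "fold5": 63841,
}
ROBUST04_QREL_COUNT = 311410  # from the disks45/nocr/trec-robust-2004 metadata


def fold_id(name):
    """Return the ir_datasets ID for a fold name such as 'fold1'."""
    return f"disks45/nocr/trec-robust-2004/{name}"


def leave_one_fold_out(fold_names):
    """Yield (train_ids, test_id) pairs, holding out each fold once."""
    for held_out in fold_names:
        train = [fold_id(n) for n in fold_names if n != held_out]
        yield train, fold_id(held_out)


# The fold counts sum to the count of the full query set.
assert sum(FOLD_QREL_COUNTS.values()) == ROBUST04_QREL_COUNT

for train_ids, test_id in leave_one_fold_out(sorted(FOLD_QREL_COUNTS)):
    print("test:", test_id, "| train folds:", len(train_ids))
```

Each ID produced here can be passed to ir_datasets.load(...) in the same way as in the examples above.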
The TREC 7 Adhoc Retrieval track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec7.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec7')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 76K | 94.2% |
1 | relevant | 4.7K | 5.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec7.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1998Trec7,
  title = {Overview of the Seventh Text Retrieval Conference (TREC-7)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1998},
  booktitle = {TREC}
}
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 80345, "fields": { "relevance": { "counts_by_value": { "0": 75671, "1": 4674 } } } } }
The TREC 8 Adhoc Retrieval track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec8.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec8')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 82K | 94.6% |
1 | relevant | 4.7K | 5.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
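Evaluation code often wants judgments grouped by query rather than as a flat stream. The sketch below shows one way to do that with records that carry the qrel fields listed for this subset (query_id, doc_id, relevance, iteration). The `Qrel` stand-in type, the `qrels_to_dict` helper, and the sample records are all hypothetical and exist only so the example runs without the corpus.

```python
from collections import defaultdict, namedtuple

# Stand-in record type with the same field names as the qrel namedtuple.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def qrels_to_dict(qrels):
    """Group qrel records as {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for q in qrels:
        grouped[q.query_id][q.doc_id] = q.relevance
    return dict(grouped)

# Made-up records for illustration only; these are not real judgments.
sample = [
    Qrel("401", "DOC-A", 1, "0"),
    Qrel("401", "DOC-B", 0, "0"),
    Qrel("402", "DOC-A", 0, "0"),
]
by_query = qrels_to_dict(sample)

# Documents judged relevant (relevance > 0) for one query.
relevant_401 = [doc_id for doc_id, rel in by_query["401"].items() if rel > 0]
print(relevant_401)
```

With the corpus installed, `sample` could presumably be replaced by `dataset.qrels_iter()`, since the helper only reads the `query_id`, `doc_id`, and `relevance` attributes.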
ir_datasets export disks45/nocr/trec8 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec8.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1999Trec8,
  title = {Overview of the Eighth Text Retrieval Conference (TREC-8)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1999},
  booktitle = {TREC}
}
Metadata:
{ "docs": { "count": 528155, "fields": { "doc_id": { "max_len": 16, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 86830, "fields": { "relevance": { "counts_by_value": { "0": 82102, "1": 4728 } } } } }