ir_datasets: TREC Robust 2004

To use this dataset, you need a copy of TREC disks 4 and 5, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
ir_datasets needs the following directories from the source:

- FR94: Federal Register, 1994 (disk 4)
- FT: Financial Times (disk 4)
- FBIS: Foreign Broadcast Information Service (disk 5)
- LATIMES: Los Angeles Times (disk 5)
ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/trec-robust04/trec45. The source document files themselves can be either compressed or uncompressed (it seems they have been distributed both ways in the past). If ir_datasets does not find the files it expects, it will raise an error.
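For example, a minimal sketch of linking the source directories into place, assuming the disks were extracted under a hypothetical ~/data/trec45 directory (directory names may differ in case on your copy; adjust as needed):

import os
from pathlib import Path

# Hypothetical location of the extracted TREC disks 4 & 5; adjust as needed.
source = Path.home() / "data" / "trec45"
target = Path.home() / ".ir_datasets" / "trec-robust04" / "trec45"
target.mkdir(parents=True, exist_ok=True)

for name in ["FR94", "FT", "FBIS", "LATIMES"]:
    link = target / name
    if not link.exists():
        os.symlink(source / name, link)  # link each collection directory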
The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."
The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
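For instance, a small sketch (assuming the dataset files are already in place) that builds a lookup from query ID to topic and reads the individual fields; "301" is one of the 250 topic IDs (301-450 and 601-700):

import ir_datasets

dataset = ir_datasets.load("trec-robust04")
# Build an in-memory lookup from query_id to the full topic record.
queries = {q.query_id: q for q in dataset.queries_iter()}

topic = queries["301"]
print(topic.title)        # short keyword form of the topic
print(topic.description)  # one-sentence statement of the information need
print(topic.narrative)    # longer account of what counts as relevant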
ir_datasets export trec-robust04 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
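Documents can also be fetched by ID without scanning the whole collection, via the docs_store interface; a sketch, where "FBIS3-1" is an illustrative document ID:

import ir_datasets

dataset = ir_datasets.load("trec-robust04")
docstore = dataset.docs_store()
# Random access by document ID (the lookup structure is built on first use).
doc = docstore.get("FBIS3-1")
print(doc.text[:200])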
ir_datasets export trec-robust04 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 294K | 94.4% |
1 | relevant | 16K | 5.3% |
2 | highly relevant | 1.0K | 0.3% |
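The counts above can be reproduced directly from the qrels with a quick tally; a minimal sketch:

import ir_datasets
from collections import Counter

dataset = ir_datasets.load("trec-robust04")
# Tally judgments by relevance level to reproduce the table above.
counts = Counter(qrel.relevance for qrel in dataset.qrels_iter())
total = sum(counts.values())
for level in sorted(counts):
    print(level, counts[level], f"{counts[level] / total:.1%}")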
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
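Outside PyTerrier, the qrels iterator can also be fed to the companion ir_measures package to score a TREC-format run file; a sketch, where my_run.trec is a hypothetical run produced elsewhere:

import ir_datasets
import ir_measures
from ir_measures import AP, nDCG

dataset = ir_datasets.load("trec-robust04")
run = ir_measures.read_trec_run("my_run.trec")  # hypothetical run file
print(ir_measures.calc_aggregate([AP, nDCG@20], dataset.qrels_iter(), run))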
ir_datasets export trec-robust04 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 250},
  "qrels": {"count": 311410, "fields": {"relevance": {"counts_by_value": {"0": 293998, "1": 16381, "2": 1031}}}}
}
Robust04 Fold 1 (Title), proposed by Huston & Croft (2014) and used in numerous works.
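The five folds partition the 250 title queries for cross-validation: tune on four folds, evaluate on the held-out fold. A minimal sketch of that protocol (tuning and evaluation elided):

import ir_datasets

# Load all five Huston & Croft folds.
folds = [ir_datasets.load(f"trec-robust04/fold{i}") for i in range(1, 6)]

for held_out in range(5):
    test_queries = list(folds[held_out].queries_iter())
    train_queries = [
        q for i, fold in enumerate(folds) if i != held_out
        for q in fold.queries_iter()
    ]
    # ... tune parameters on train_queries, evaluate on test_queries ...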
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-robust04
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 95.2% |
1 | relevant | 2.8K | 4.5% |
2 | highly relevant | 229 | 0.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 50},
  "qrels": {"count": 62789, "fields": {"relevance": {"counts_by_value": {"0": 59765, "1": 2795, "2": 229}}}}
}
Robust04 Fold 2 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-robust04
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 94.3% |
1 | relevant | 3.3K | 5.2% |
2 | highly relevant | 337 | 0.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 50},
  "qrels": {"count": 63917, "fields": {"relevance": {"counts_by_value": {"0": 60246, "1": 3334, "2": 337}}}}
}
Robust04 Fold 3 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-robust04
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 59K | 93.6% |
1 | relevant | 3.9K | 6.2% |
2 | highly relevant | 165 | 0.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 50},
  "qrels": {"count": 62901, "fields": {"relevance": {"counts_by_value": {"0": 58859, "1": 3877, "2": 165}}}}
}
Robust04 Fold 4 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-robust04
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 55K | 95.1% |
1 | relevant | 2.7K | 4.7% |
2 | highly relevant | 152 | 0.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 50},
  "qrels": {"count": 57962, "fields": {"relevance": {"counts_by_value": {"0": 55103, "1": 2707, "2": 152}}}}
}
Robust04 Fold 5 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-robust04
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 60K | 94.0% |
1 | relevant | 3.7K | 5.7% |
2 | highly relevant | 148 | 0.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata:

{
  "docs": {"count": 528155, "fields": {"doc_id": {"max_len": 16, "common_prefix": ""}}},
  "queries": {"count": 50},
  "qrels": {"count": 63841, "fields": {"relevance": {"counts_by_value": {"0": 60025, "1": 3668, "2": 148}}}}
}