This documentation is for v0.4.1. See here for documentation of the latest version on PyPI.
ir_datasets: TREC Robust 2004
Index
- trec-robust04
- trec-robust04/fold1
- trec-robust04/fold2
- trec-robust04/fold3
- trec-robust04/fold4
- trec-robust04/fold5
Data Access Information
To use this dataset, you need a copy of TREC disks 4 and 5, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
ir_datasets needs the following directories from the source:
ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/trec-robust04/trec45. The source document files themselves can be either compressed or uncompressed (it seems they have been distributed both ways in the past). If ir_datasets does not find the files it expects, it will raise an error.
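For example, a minimal sketch (not part of ir_datasets itself) of linking a local copy into the expected location; the source path and directory names below are placeholders for your own copy:
import os
from pathlib import Path

source_root = Path('/path/to/your/trec45/copy')  # placeholder: wherever your NIST copy lives
target_root = Path.home() / '.ir_datasets' / 'trec-robust04' / 'trec45'
target_root.mkdir(parents=True, exist_ok=True)

# placeholder names: use the directory names listed above from your copy of disks 4 and 5
for name in ['<dir1>', '<dir2>']:
    link = target_root / name
    if not link.exists():
        os.symlink(source_root / name, link)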
"trec-robust04"
The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."
The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
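If you need random access to individual queries (e.g., looking a topic up by its ID), a simple sketch building a lookup table from the same iterator:
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
queries = {query.query_id: query for query in dataset.queries_iter()}
queries['301'].title  # access a topic by its ID (301 is one of the Robust04 topics)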
ir_datasets export trec-robust04 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
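For random access to individual documents by doc_id (rather than a full scan), you can also use the dataset's docs_store, assuming your ir_datasets version provides it; the doc_id below is a placeholder:
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
docstore = dataset.docs_store()
doc = docstore.get('<doc_id>')  # placeholder: any doc_id from TREC disks 4 and 5
doc.text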
ir_datasets export trec-robust04 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
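For instance, a short sketch tallying how many judgments fall into each relevance level:
import collections
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
collections.Counter(qrel.relevance for qrel in dataset.qrels_iter())
# e.g. Counter({0: ..., 1: ..., 2: ...})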
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
"trec-robust04/fold1"
Robust04 Fold 1 (Title), proposed by Huston & Croft (2014) and used in numerous works.
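Since the five folds partition the Robust04 topics, a typical use is cross-validation: tune on four folds and evaluate on the held-out one. A minimal sketch of assembling such a split with the ir_datasets API shown below:
import ir_datasets
test_fold = 'trec-robust04/fold1'
train_folds = [f'trec-robust04/fold{i}' for i in (2, 3, 4, 5)]
test_queries = list(ir_datasets.load(test_fold).queries_iter())
train_queries = [q for fold in train_folds for q in ir_datasets.load(fold).queries_iter()]
# together the two lists cover all Robust04 topics exactly once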
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from trec-robust04
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust,Huston2014ACO}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
"trec-robust04/fold2"
Robust04 Fold 2 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from trec-robust04
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust,Huston2014ACO}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
"trec-robust04/fold3"
Robust04 Fold 3 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from trec-robust04
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust,Huston2014ACO}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
"trec-robust04/fold4"
Robust04 Fold 4 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from trec-robust04
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust,Huston2014ACO}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
"trec-robust04/fold5"
Robust04 Fold 5 (Title), proposed by Huston & Croft (2014) and used in numerous works.
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from trec-robust04
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])
You can find more details about PyTerrier indexing here.
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples (Python API, CLI, PyTerrier):
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-robust04/fold5 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
ir_datasets.bib:
\cite{Voorhees2004Robust,Huston2014ACO}
Bibtex:
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}