Github: datasets/trec_robust04.py

ir_datasets: TREC Robust 2004

Index
  1. trec-robust04
  2. trec-robust04/fold1
  3. trec-robust04/fold2
  4. trec-robust04/fold3
  5. trec-robust04/fold4
  6. trec-robust04/fold5

Data Access Information

To use this dataset, you need a copy of TREC disks 4 and 5, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but once it completes you will receive a password-protected download link.

ir_datasets needs the following directories from the source:

ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/trec-robust04/trec45. The source document files may be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it raises an error.
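
Before loading the dataset, it can help to confirm that something is actually present at the expected location. The following is a minimal sketch, not part of ir_datasets: the helper names trec45_dir and list_source_dirs are hypothetical, and the sketch assumes the default home directory described above. It only lists what it finds and does not check the names of the required directories.

```python
from pathlib import Path


def trec45_dir(home: Path = Path.home()) -> Path:
    """Return the location where the TREC disks 4 and 5 directories are expected."""
    return home / ".ir_datasets" / "trec-robust04" / "trec45"


def list_source_dirs(root: Path) -> list:
    """Return the sorted names of directories (or links to directories) under root.

    Returns an empty list when root itself does not exist.
    """
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling list_source_dirs(trec45_dir()) returns an empty list when nothing has been copied or linked yet, which is the situation in which ir_datasets would raise an error.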


"trec-robust04"

trec-robust04 is deprecated. Consider using disks45/nocr/trec-robust-2004 instead, which provides better parsing of the corpus.
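
When migrating existing scripts, the old identifiers can be translated mechanically. This is a small sketch based only on the naming pattern in the deprecation notices on this page; the function replacement_id is a hypothetical helper, not an ir_datasets API.

```python
OLD_PREFIX = "trec-robust04"
NEW_PREFIX = "disks45/nocr/trec-robust-2004"


def replacement_id(dataset_id: str) -> str:
    """Map a deprecated trec-robust04 dataset ID to its suggested replacement.

    IDs that do not belong to trec-robust04 are returned unchanged.
    """
    if dataset_id == OLD_PREFIX:
        return NEW_PREFIX
    if dataset_id.startswith(OLD_PREFIX + "/"):
        # Keep the suffix (for example "/fold1") and swap only the prefix.
        return NEW_PREFIX + dataset_id[len(OLD_PREFIX):]
    return dataset_id
```

The result can be passed to ir_datasets.load in place of the deprecated ID.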

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

queries
250 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs
528K docs

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
311K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     not relevant     294K   94.4%
1     relevant         16K    5.3%
2     highly relevant  1.0K   0.3%
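
The counts and percentages per relevance level can be recomputed from any iterable of qrel records. The sketch below is illustrative: Qrel is a stand-in namedtuple with the same fields as TrecQrel, the sample rows are made up, and relevance_distribution is a hypothetical helper. With ir_datasets installed and the data in place, dataset.qrels_iter() could be passed instead of the sample list.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as the TrecQrel type described above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance_level: (count, percent)} for an iterable of qrel records."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100.0 * n / total, 1))
        for level, n in sorted(counts.items())
    }


# Made-up sample rows, for illustration only.
sample = [
    Qrel("q1", "d1", 0, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 0, "0"),
    Qrel("q2", "d4", 2, "0"),
]
distribution = relevance_distribution(sample)
```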

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
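
The exported text can be read back into Python without ir_datasets. This sketch assumes the export is tab-separated with the four columns in the order shown above; parse_qrels_tsv is a hypothetical helper and the sample rows are made up.

```python
def parse_qrels_tsv(text):
    """Parse tab-separated qrels lines into (query_id, doc_id, relevance, iteration) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = line.split("\t")
        rows.append((query_id, doc_id, int(relevance), iteration))
    return rows


# Made-up sample in the assumed column order.
sample_tsv = "q1\tDOC-A\t1\t0\nq1\tDOC-B\t0\t0\n"
parsed = parse_qrels_tsv(sample_tsv)
```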

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

"trec-robust04/fold1"

trec-robust04/fold1 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold1 instead, which provides better parsing of the corpus.

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works
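
The five folds are typically combined for cross-validation. The sketch below shows one common arrangement, holding out one fold for testing and using the other four for training. This arrangement is an assumption rather than something this page prescribes (some works also reserve a validation fold), and cross_validation_splits is a hypothetical helper.

```python
FOLDS = [f"trec-robust04/fold{i}" for i in range(1, 6)]


def cross_validation_splits(folds=FOLDS):
    """Yield (train_ids, test_id) pairs, holding out one fold at a time."""
    for held_out in folds:
        train = [f for f in folds if f != held_out]
        yield train, held_out


splits = list(cross_validation_splits())
```

Each ID in a split can be passed to ir_datasets.load to obtain the queries and qrels for that fold.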

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from trec-robust04

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold1 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    95.2%
1     relevant         2.8K   4.5%
2     highly relevant  229    0.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold1 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

"trec-robust04/fold2"

trec-robust04/fold2 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold2 instead, which provides better parsing of the corpus.

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from trec-robust04

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold2 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    94.3%
1     relevant         3.3K   5.2%
2     highly relevant  337    0.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold2 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

"trec-robust04/fold3"

trec-robust04/fold3 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold3 instead, which provides better parsing of the corpus.

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold3 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from trec-robust04

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold3 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     59K    93.6%
1     relevant         3.9K   6.2%
2     highly relevant  165    0.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold3 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

"trec-robust04/fold4"

trec-robust04/fold4 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold4 instead, which provides better parsing of the corpus.

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold4 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from trec-robust04

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold4 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
58K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     55K    95.1%
1     relevant         2.7K   4.7%
2     highly relevant  152    0.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold4 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

"trec-robust04/fold5"

trec-robust04/fold5 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold5 instead, which provides better parsing of the corpus.

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold5 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from trec-robust04

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold5 docs
[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    94.0%
1     relevant         3.7K   5.7%
2     highly relevant  148    0.2%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export trec-robust04/fold5 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}