`ir_datasets`: TREC Robust 2004

Index

trec-robust04
trec-robust04/fold1
trec-robust04/fold2
trec-robust04/fold3
trec-robust04/fold4
trec-robust04/fold5

Data Access Information

To use this dataset, you need a copy of TREC disks 4 and 5, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.

ir_datasets needs the following directories from the source:

FBIS
FR94
FT
LATIMES

ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/trec-robust04/trec45. The source document files can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it will raise an error.
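Before loading the dataset, it can help to confirm that the four directories are in place. The following is a minimal sketch, not part of ir_datasets: the helper name `missing_source_dirs` is hypothetical, and the default path assumes the standard ir_datasets home directory described above.

```python
from pathlib import Path

# Directory names listed above as required from TREC disks 4 and 5.
REQUIRED_DIRS = ("FBIS", "FR94", "FT", "LATIMES")


def missing_source_dirs(base):
    """Return the required directory names that are absent under `base`.

    Hypothetical helper for illustration; symlinked directories count as present.
    """
    base = Path(base).expanduser()
    return [name for name in REQUIRED_DIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    # Location described above; adjust it if your ir_datasets home differs.
    base = "~/.ir_datasets/trec-robust04/trec45"
    missing = missing_source_dirs(base)
    if missing:
        print("Missing under", base + ":", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

This only checks that the directories exist. It does not verify their contents, so ir_datasets may still raise an error if individual files are missing.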

`"trec-robust04"`

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Task Overview Paper
See also: aquaint/trec-robust-2005

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
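Because each query carries three text fields, a retrieval experiment usually has to pick one of them. The sketch below shows one way to build a query_id-to-text mapping for a chosen field. The `Query` namedtuple is a stand-in that mirrors the fields listed above so the example runs without the data, and `query_text_by_id` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecQuery, so the sketch runs without the data.
Query = namedtuple("Query", ["query_id", "title", "description", "narrative"])

QUERY_FIELDS = ("title", "description", "narrative")


def query_text_by_id(queries, field="title"):
    """Map each query_id to the text of one field (hypothetical helper)."""
    if field not in QUERY_FIELDS:
        raise ValueError(f"unknown query field: {field}")
    return {q.query_id: getattr(q, field) for q in queries}


# Made-up query for illustration only; it is not taken from the dataset.
sample = [
    Query("q1", "solar power", "Find documents about solar power.", "Details omitted."),
]
print(query_text_by_id(sample))
print(query_text_by_id(sample, field="description"))
```

With the real data, the same function should accept the iterator returned by dataset.queries_iter() in place of the sample list.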

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant
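Many evaluation measures treat relevance as binary, so the graded levels above are often collapsed with a threshold. The following sketch counts judgments per level and collects the documents at or above a minimum level for each query. The `Qrel` namedtuple is a stand-in mirroring the TrecQrel fields, the helper names are hypothetical, and the sample judgments are made up.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as TrecQrel, so the sketch runs without the data.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_histogram(qrels):
    """Count judgments per relevance level."""
    return Counter(q.relevance for q in qrels)


def relevant_docs(qrels, min_rel=1):
    """Map each query_id to the set of doc_ids judged at or above `min_rel`."""
    by_query = {}
    for q in qrels:
        if q.relevance >= min_rel:
            by_query.setdefault(q.query_id, set()).add(q.doc_id)
    return by_query


# Made-up judgments for illustration only.
sample = [
    Qrel("q1", "DOC-A", 1, "0"),
    Qrel("q1", "DOC-B", 0, "0"),
    Qrel("q2", "DOC-C", 2, "0"),
]
print(relevance_histogram(sample))
print(relevant_docs(sample, min_rel=2))
```

Setting min_rel=1 treats both "relevant" and "highly relevant" as positive, while min_rel=2 keeps only the highly relevant documents.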

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
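If the TSV export is saved to a file, it can be read back with the standard library. This sketch assumes one judgment per line with the four columns in the order shown above, separated by tabs. The function name `read_qrels_tsv` is hypothetical and the sample rows are made up.

```python
import csv
import io


def read_qrels_tsv(handle):
    """Parse tab-separated lines of query_id, doc_id, relevance, iteration."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        rows.append((query_id, doc_id, int(relevance), iteration))
    return rows


# Made-up rows standing in for the CLI output.
sample = io.StringIO("q1\tDOC-A\t1\t0\nq1\tDOC-B\t0\t0\n")
print(read_qrels_tsv(sample))
```

For a real export, open the saved file with newline="" and pass the file object instead of the StringIO sample.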

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

`"trec-robust04/fold1"`

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works.
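One common way to use five folds is leave-one-fold-out cross-validation, where each fold is held out for testing once and the rest are used for training or tuning. The sketch below enumerates those splits over the dataset IDs from the index at the top of this page. It is an assumption about usage rather than something this page prescribes, and `cross_validation_splits` is a hypothetical helper name.

```python
def cross_validation_splits(fold_names):
    """Yield (train_folds, test_fold) pairs, holding out each fold once."""
    for held_out in fold_names:
        train = [name for name in fold_names if name != held_out]
        yield train, held_out


# Dataset IDs of the five folds listed in the index.
FOLDS = [f"trec-robust04/fold{i}" for i in range(1, 6)]

for train, test in cross_validation_splits(FOLDS):
    print("test:", test, "| train:", ", ".join(train))
```

Each ID can then be passed to ir_datasets.load to obtain the queries and qrels for that fold.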

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from trec-robust04

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

`"trec-robust04/fold2"`

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from trec-robust04

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

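Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}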

`"trec-robust04/fold3"`

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from trec-robust04

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

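Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}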

`"trec-robust04/fold4"`

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from trec-robust04

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

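Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}
@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}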

`"trec-robust04/fold5"`

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from trec-robust04

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
