`ir_datasets`: TREC Robust 2004

Index

trec-robust04
trec-robust04/fold1
trec-robust04/fold2
trec-robust04/fold3
trec-robust04/fold4
trec-robust04/fold5

Data Access Information

To use this dataset, you need a copy of TREC disks 4 and 5, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.

ir_datasets needs the following directories from the source:

FBIS
FR94
FT
LATIMES

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/trec-robust04/trec45. The source document files themselves can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it raises an error.
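Before loading the dataset, it can help to confirm that the four directories are in place. The following is a minimal sketch that uses only the Python standard library; the helper name `missing_trec45_dirs` is hypothetical and is not part of ir_datasets. It checks for the directories only, not for the document files inside them.

```python
from pathlib import Path

# Directory names and base path as listed in the text above.
EXPECTED_DIRS = ("FBIS", "FR94", "FT", "LATIMES")
DEFAULT_BASE = Path.home() / ".ir_datasets" / "trec-robust04" / "trec45"


def missing_trec45_dirs(base=DEFAULT_BASE):
    """Return the expected directory names that are absent under `base`.

    Hypothetical helper for illustration; not an ir_datasets function.
    """
    base = Path(base)
    # is_dir() follows symlinks, so linked directories count as present.
    return [name for name in EXPECTED_DIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_trec45_dirs()
    if missing:
        print("Missing under", DEFAULT_BASE, "->", ", ".join(missing))
    else:
        print("All expected directories found under", DEFAULT_BASE)
```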

`"trec-robust04"`

trec-robust04 is deprecated. Consider using disks45/nocr/trec-robust-2004 instead, which provides better parsing of the corpus.

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Task Overview Paper
See also: aquaint/trec-robust-2005

queries

250 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
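As an illustration of the fields above, the sketch below builds a stand-in namedtuple with the same field names and selects one field as the query text. The stand-in class, the helper `query_text`, and the sample values are assumptions made for this example; the real `TrecQuery` type comes from ir_datasets.

```python
from collections import namedtuple

# Stand-in with the same field names as the TrecQuery type listed above.
TrecQuery = namedtuple("TrecQuery", ["query_id", "title", "description", "narrative"])


def query_text(query, field="title"):
    """Return one topic field with whitespace runs collapsed to single spaces."""
    return " ".join(getattr(query, field).split())


example = TrecQuery(
    query_id="q1",  # hypothetical ID and text, not taken from the collection
    title="example  keyword query",
    description="A one-sentence description\nof the information need.",
    narrative="A longer narrative stating what counts as relevant.",
)

print(query_text(example))
print(query_text(example, "description"))
```

Collapsing whitespace is a choice made for this sketch, on the assumption that topic fields may contain line breaks.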

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

311K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`294K`	94.4%
1	relevant	`16K`	5.3%
2	highly relevant	`1.0K`	0.3%
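Graded judgments like these are often grouped by query and, for binary measures, thresholded at a minimum relevance level. The sketch below shows one way to do that with a stand-in namedtuple that mirrors the `TrecQrel` fields; the helper names and the sample judgments are hypothetical and do not come from the collection.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as the TrecQrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    nested = defaultdict(dict)
    for qrel in qrels:
        nested[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(nested)


def relevant_docs(judged, min_relevance=1):
    """Return the doc_ids judged at or above `min_relevance`."""
    return {doc_id for doc_id, rel in judged.items() if rel >= min_relevance}


# Hypothetical judgments covering the three levels in the table above.
sample = [
    TrecQrel("q1", "docA", 0, "0"),
    TrecQrel("q1", "docB", 1, "0"),
    TrecQrel("q1", "docC", 2, "0"),
    TrecQrel("q2", "docA", 1, "0"),
]

by_query = qrels_by_query(sample)
print(sorted(relevant_docs(by_query["q1"])))
print(sorted(relevant_docs(by_query["q1"], min_relevance=2)))
```

With the real dataset, the same functions could be fed from `dataset.qrels_iter()` instead of the sample list.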

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
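If the exported judgments are saved to a file, they can be read back with the standard `csv` module. This is a sketch under the assumption that the export writes one record per line with tab-separated columns in the order shown above; the function name `read_qrels_tsv` and the sample rows are hypothetical.

```python
import csv
import io

# Column order as shown in the export output above.
FIELDS = ("query_id", "doc_id", "relevance", "iteration")


def read_qrels_tsv(stream):
    """Yield one dict per non-empty tab-separated line of `stream`."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(FIELDS, row))
        record["relevance"] = int(record["relevance"])
        yield record


# Hypothetical rows standing in for a saved export file.
sample = io.StringIO("q1\tdocA\t0\t0\nq1\tdocB\t2\t0\n")
rows = list(read_qrels_tsv(sample))
print(rows)
```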

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 250
  },
  "qrels": {
    "count": 311410,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 16381,
          "0": 293998,
          "2": 1031
        }
      }
    }
  }
}
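The percentages in the relevance-level table can be recomputed from the metadata counts. The sketch below copies the `qrels` figures from the JSON above and rounds each share to one decimal place; the helper name `relevance_shares` is hypothetical.

```python
import json

# qrels figures copied from the metadata block above.
QRELS_METADATA = json.loads("""
{
  "count": 311410,
  "counts_by_value": {"1": 16381, "0": 293998, "2": 1031}
}
""")


def relevance_shares(qrels_meta):
    """Return {relevance level: percentage of all judgments, 1 decimal}."""
    total = qrels_meta["count"]
    return {
        int(level): round(100 * n / total, 1)
        for level, n in sorted(qrels_meta["counts_by_value"].items())
    }


shares = relevance_shares(QRELS_METADATA)
for level, pct in shares.items():
    print(f"relevance {level}: {pct}%")
```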

`"trec-robust04/fold1"`

trec-robust04/fold1 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold1 instead, which provides better parsing of the corpus.

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works
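One common way to use the five folds is cross-validation: hold one fold out for testing and tune on the remaining four. The sketch below only generates the dataset IDs for each split. The function name `cross_validation_splits` is hypothetical, and whether a given paper trains on four folds or uses a separate validation fold is an assumption to check against that paper.

```python
# Dataset IDs of the five folds documented on this page.
FOLD_IDS = [f"trec-robust04/fold{i}" for i in range(1, 6)]


def cross_validation_splits(fold_ids=FOLD_IDS):
    """Yield (training fold IDs, held-out fold ID) for each fold in turn."""
    for held_out in fold_ids:
        train = [fold for fold in fold_ids if fold != held_out]
        yield train, held_out


for train_ids, test_id in cross_validation_splits():
    print("test:", test_id, "| train:", ", ".join(train_ids))
```

Each ID could then be passed to `ir_datasets.load(...)` to obtain that fold's queries and qrels.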

queries

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.fold1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Inherits docs from trec-robust04

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04.fold1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

63K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
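Unlike the `TrecQrel` type of the parent dataset, `GenericQrel` has no `iteration` field. Code that handles both can fall back to a default. The sketch below formats either type as a line in the conventional TREC qrels layout (query ID, iteration, document ID, relevance). The namedtuples are stand-ins that mirror the documented fields, and the default iteration of "0" is an assumption of this example.

```python
from collections import namedtuple

# Stand-ins mirroring the two qrel types documented on this page.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])
GenericQrel = namedtuple("GenericQrel", ["query_id", "doc_id", "relevance"])


def to_trec_qrels_line(qrel, default_iteration="0"):
    """Format a qrel as 'query_id iteration doc_id relevance'."""
    # GenericQrel has no iteration field, so fall back to the default.
    iteration = getattr(qrel, "iteration", default_iteration)
    return f"{qrel.query_id} {iteration} {qrel.doc_id} {qrel.relevance}"


# Hypothetical judgments, not taken from the collection.
print(to_trec_qrels_line(GenericQrel("q1", "docA", 1)))
print(to_trec_qrels_line(TrecQrel("q1", "docB", 2, "Q0")))
```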

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`60K`	95.2%
1	relevant	`2.8K`	4.5%
2	highly relevant	`229`	0.4%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold1 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold1')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.fold1.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62789,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 59765,
          "1": 2795,
          "2": 229
        }
      }
    }
  }
}

`"trec-robust04/fold2"`

trec-robust04/fold2 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold2 instead, which provides better parsing of the corpus.

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.fold2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Inherits docs from trec-robust04

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04.fold2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

64K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`60K`	94.3%
1	relevant	`3.3K`	5.2%
2	highly relevant	`337`	0.5%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold2 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold2')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.fold2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

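Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}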

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 63917,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 3334,
          "0": 60246,
          "2": 337
        }
      }
    }
  }
}

`"trec-robust04/fold3"`

trec-robust04/fold3 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold3 instead, which provides better parsing of the corpus.

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.fold3.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Inherits docs from trec-robust04

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04.fold3')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

63K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`59K`	93.6%
1	relevant	`3.9K`	6.2%
2	highly relevant	`165`	0.3%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold3 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold3')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.fold3.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

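Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}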

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62901,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 58859,
          "1": 3877,
          "2": 165
        }
      }
    }
  }
}

`"trec-robust04/fold4"`

trec-robust04/fold4 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold4 instead, which provides better parsing of the corpus.

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.fold4.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Inherits docs from trec-robust04

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04.fold4')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

58K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`55K`	95.1%
1	relevant	`2.7K`	4.7%
2	highly relevant	`152`	0.3%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold4 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold4')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.fold4.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

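Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}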

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 57962,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 55103,
          "1": 2707,
          "2": 152
        }
      }
    }
  }
}

`"trec-robust04/fold5"`

trec-robust04/fold5 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold5 instead, which provides better parsing of the corpus.

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-robust04.fold5.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

528K docs

Inherits docs from trec-robust04

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
# Index trec-robust04
indexer = pt.IterDictIndexer('./indices/trec-robust04')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-robust04.fold5')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

64K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`60K`	94.0%
1	relevant	`3.7K`	5.7%
2	highly relevant	`148`	0.2%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-robust04/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export trec-robust04/fold5 qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-robust04/fold5')
index_ref = pt.IndexRef.of('./indices/trec-robust04') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-robust04.fold5.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Voorhees2004Robust,Huston2014ACO}

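Bibtex:

@inproceedings{Voorhees2004Robust,
  title={Overview of the TREC 2004 Robust Retrieval Track},
  author={Ellen Voorhees},
  booktitle={TREC},
  year={2004}
}

@inproceedings{Huston2014ACO,
  title={A Comparison of Retrieval Models using Term Dependencies},
  author={Samuel Huston and W. Bruce Croft},
  booktitle={CIKM},
  year={2014}
}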

Metadata

{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 63841,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 60025,
          "1": 3668,
          "2": 148
        }
      }
    }
  }
}
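The qrels counts in the five fold metadata blocks can be cross-checked against the parent `trec-robust04` metadata. The sketch below copies the figures from the Metadata blocks on this page and sums them per relevance level. Matching totals are consistent with the folds partitioning the parent judgments, though the counts alone do not prove that no judgment is shared between folds.

```python
from collections import Counter

# qrels figures copied from the Metadata blocks on this page.
PARENT = {"count": 311410, "counts_by_value": {"0": 293998, "1": 16381, "2": 1031}}
FOLDS = {
    "fold1": {"count": 62789, "counts_by_value": {"0": 59765, "1": 2795, "2": 229}},
    "fold2": {"count": 63917, "counts_by_value": {"0": 60246, "1": 3334, "2": 337}},
    "fold3": {"count": 62901, "counts_by_value": {"0": 58859, "1": 3877, "2": 165}},
    "fold4": {"count": 57962, "counts_by_value": {"0": 55103, "1": 2707, "2": 152}},
    "fold5": {"count": 63841, "counts_by_value": {"0": 60025, "1": 3668, "2": 148}},
}


def summed_counts(folds):
    """Return (total qrels, per-level totals) summed over all folds."""
    per_level = Counter()
    for meta in folds.values():
        per_level.update(meta["counts_by_value"])
    return sum(meta["count"] for meta in folds.values()), dict(per_level)


total, per_level = summed_counts(FOLDS)
print("fold total:", total, "| parent total:", PARENT["count"])
print("per level matches parent:", per_level == PARENT["counts_by_value"])
```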
