ir_datasets: TREC Disks 4 and 5
To use this dataset, you need a copy of TREC Disks 4 and 5, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
ir_datasets needs the following directories from the source:
ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/disks45/corpus. The source document files themselves can be either compressed or uncompressed (it seems they have been distributed both ways in the past). If ir_datasets does not find the files it expects, it will raise an error.
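The copy/link step above can be scripted. A minimal sketch, assuming a POSIX filesystem; `link_corpus_dirs` and the example paths are hypothetical, and the directory names must come from your own copy of the disks:

```python
from pathlib import Path

def link_corpus_dirs(source_root, dest_root, names):
    """Symlink each required corpus directory from source_root into dest_root.

    Fails fast with FileNotFoundError if a directory is missing from the
    source, rather than waiting for ir_datasets to raise its own error later.
    """
    source_root, dest_root = Path(source_root), Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    for name in names:
        src = source_root / name
        if not src.is_dir():
            raise FileNotFoundError(f"expected corpus directory missing: {src}")
        dst = dest_root / name
        if not dst.exists():
            dst.symlink_to(src.resolve())

# Hypothetical usage (substitute the actual directory list):
# link_corpus_dirs('/media/trec-disks45',
#                  Path.home() / '.ir_datasets/disks45/corpus',
#                  [...])
```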
TREC Disks 4 and 5, including documents from the Financial Times, the Congressional Record, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times.
This dataset is a placeholder for the complete collection; at this time, only the version of the dataset without the Congressional Record (disks45/nocr) is provided.
A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  }
}
The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."
The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 294K | 94.4% | 
| 1 | relevant | 16K | 5.3% | 
| 2 | highly relevant | 1.0K | 0.3% | 
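The Count and % columns above follow directly from the raw `counts_by_value` figures reported in this dataset's metadata; a small sketch of the derivation:

```python
def relevance_breakdown(counts_by_value):
    """Recompute the percentage column of the relevance-level table
    from raw counts per relevance level."""
    total = sum(counts_by_value.values())
    return {level: round(100 * count / total, 1)
            for level, count in counts_by_value.items()}

# counts_by_value reported in the metadata for disks45/nocr/trec-robust-2004:
print(relevance_breakdown({0: 293998, 1: 16381, 2: 1031}))
# → {0: 94.4, 1: 5.3, 2: 0.3}
```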
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
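The TSV export above is easy to post-process. A minimal sketch of parsing it into per-query dictionaries, following the column order printed above (the sample query and document IDs are illustrative):

```python
import csv
from collections import defaultdict

def parse_qrels_tsv(lines):
    """Parse `ir_datasets export ... qrels --format tsv` output into
    {query_id: {doc_id: relevance}}; the iteration column is ignored."""
    qrels = defaultdict(dict)
    for query_id, doc_id, relevance, _iteration in csv.reader(lines, delimiter="\t"):
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)

sample = ["301\tFBIS3-10082\t1\t0", "301\tFT911-3\t0\t0"]
# parse_qrels_tsv(sample) → {'301': {'FBIS3-10082': 1, 'FT911-3': 0}}
```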
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
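MAP, used in the experiment above, is the mean over topics of average precision (AP). As a reference for what the measure computes, a minimal self-contained sketch (not PyTerrier's implementation; AP here divides by the number of judged-relevant documents):

```python
def average_precision(ranked_doc_ids, relevant):
    """AP for one topic: sum of precision@k at each rank k holding a
    relevant document, divided by the number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """run: {query_id: ranked doc ids}; qrels: {query_id: set of relevant doc ids}."""
    return sum(average_precision(run[q], qrels[q]) for q in qrels) / len(qrels)
```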
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 250
  },
  "qrels": {
    "count": 311410,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 16381,
          "0": 293998,
          "2": 1031
        }
      }
    }
  }
}
Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
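Folds 1–5 are typically combined for five-fold cross-validation, as in Huston & Croft (2014): tune on four folds and evaluate on the held-out fifth. A minimal sketch of assembling the splits (the dataset IDs follow the pattern above):

```python
FOLDS = [f"disks45/nocr/trec-robust-2004/fold{i}" for i in range(1, 6)]

def cross_validation_splits(folds):
    """Yield (train_folds, test_fold) pairs, one per held-out fold."""
    for i, test_fold in enumerate(folds):
        yield folds[:i] + folds[i + 1:], test_fold

for train, test in cross_validation_splits(FOLDS):
    pass  # e.g. tune retrieval parameters on `train`, evaluate on `test`
```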
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 60K | 95.2% | 
| 1 | relevant | 2.8K | 4.5% | 
| 2 | highly relevant | 229 | 0.4% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
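nDCG@20, used in the experiment above, discounts graded gains logarithmically by rank and normalises by the ideal ranking. A minimal sketch of one common formulation with linear gains (not PyTerrier's implementation), using the graded levels from the table above:

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a list of graded gains, by rank."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, relevance, k=20):
    """nDCG@k; `relevance` maps doc_id -> graded level (here 0/1/2),
    and unjudged documents count as 0."""
    gains = [relevance.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```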
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62789,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 59765,
          "1": 2795,
          "2": 229
        }
      }
    }
  }
}
Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 60K | 94.3% | 
| 1 | relevant | 3.3K | 5.2% | 
| 2 | highly relevant | 337 | 0.5% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 63917,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 3334,
          "0": 60246,
          "2": 337
        }
      }
    }
  }
}
Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 59K | 93.6% | 
| 1 | relevant | 3.9K | 6.2% | 
| 2 | highly relevant | 165 | 0.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62901,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 58859,
          "1": 3877,
          "2": 165
        }
      }
    }
  }
}
Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 55K | 95.1% | 
| 1 | relevant | 2.7K | 4.7% | 
| 2 | highly relevant | 152 | 0.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 57962,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 55103,
          "1": 2707,
          "2": 152
        }
      }
    }
  }
}
Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 60K | 94.0% | 
| 1 | relevant | 3.7K | 5.7% | 
| 2 | highly relevant | 148 | 0.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}
Metadata:
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 63841,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 60025,
          "1": 3668,
          "2": 148
        }
      }
    }
  }
}
The TREC 7 Adhoc Retrieval track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec7.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec7')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 76K | 94.2% | 
| 1 | relevant | 4.7K | 5.8% | 
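The distribution above can be recomputed directly from the qrels. A minimal sketch of the tally, using a toy qrels list in place of `dataset.qrels_iter()` (which requires the NIST-licensed corpus):

```python
from collections import Counter, namedtuple

def relevance_distribution(qrels):
    """Tally qrels by relevance level; return {level: (count, percent)}."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {level: (n, 100.0 * n / total) for level, n in sorted(counts.items())}

# Toy stand-in for dataset.qrels_iter()
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
qrels = [Qrel("351", "FT911-1", 0, "0"), Qrel("351", "FT911-2", 1, "0"),
         Qrel("352", "FT911-3", 0, "0"), Qrel("352", "FT911-4", 0, "0")]
print(relevance_distribution(qrels))  # {0: (3, 75.0), 1: (1, 25.0)}
```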
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec7.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1998Trec7,
  title = {Overview of the Seventh Text Retrieval Conference (TREC-7)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1998},
  booktitle = {TREC}
}
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 80345,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 75671,
          "1": 4674
        }
      }
    }
  }
}
The TREC 8 Adhoc Retrieval track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec8.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from disks45/nocr
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec8')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 82K | 94.6% | 
| 1 | relevant | 4.7K | 5.4% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
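The exported TSV can be read back into tuples for ad-hoc analysis outside of ir_datasets. A minimal sketch of such a parser; the `lines` list below stands in for the contents of the exported file:

```python
from collections import namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def parse_qrels_tsv(lines):
    """Parse tab-separated qrels lines into Qrel tuples."""
    out = []
    for line in lines:
        query_id, doc_id, relevance, iteration = line.rstrip("\n").split("\t")
        out.append(Qrel(query_id, doc_id, int(relevance), iteration))
    return out

# Toy stand-in for the exported file's lines
lines = ["401\tFBIS3-1\t1\t0", "401\tFBIS3-2\t0\t0"]
qrels = parse_qrels_tsv(lines)
print(qrels[0].relevance)  # 1
```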
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec8.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires experimaestro-ir to be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1999Trec8,
  title = {Overview of the Eighth Text Retrieval Conference (TREC-8)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1999},
  booktitle = {TREC}
}
{
  "docs": {
    "count": 528155,
    "fields": {
      "doc_id": {
        "max_len": 16,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 86830,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 82102,
          "1": 4728
        }
      }
    }
  }
}