Github: datasets/disks45.py

ir_datasets: TREC Disks 4 and 5

Index
  1. disks45
  2. disks45/nocr
  3. disks45/nocr/trec-robust-2004
  4. disks45/nocr/trec-robust-2004/fold1
  5. disks45/nocr/trec-robust-2004/fold2
  6. disks45/nocr/trec-robust-2004/fold3
  7. disks45/nocr/trec-robust-2004/fold4
  8. disks45/nocr/trec-robust-2004/fold5
  9. disks45/nocr/trec7
  10. disks45/nocr/trec8

Data Access Information

To use this dataset, you need a copy of TREC Disks 4 and 5, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, after which you will receive a password-protected download link.

ir_datasets needs the following directories from the source:

ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/disks45/corpus. The source document files can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it raises an error.
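
Before loading the dataset, it can help to list what actually sits under the corpus directory. The following is a minimal sketch, assuming the default ~/.ir_datasets home location; the helper name list_corpus_entries is hypothetical and is not part of ir_datasets.

```python
from pathlib import Path

# Default location described above; adjust if your ir_datasets home differs.
DEFAULT_CORPUS_DIR = Path.home() / ".ir_datasets" / "disks45" / "corpus"


def list_corpus_entries(corpus_dir=DEFAULT_CORPUS_DIR):
    """Return the sorted names of entries directly under corpus_dir.

    Hypothetical helper (not part of ir_datasets). Symlinked directories
    are listed too, since the corpus may be linked rather than copied.
    """
    corpus_dir = Path(corpus_dir)
    if not corpus_dir.is_dir():
        raise FileNotFoundError(f"corpus directory not found: {corpus_dir}")
    return sorted(entry.name for entry in corpus_dir.iterdir())
```

Calling list_corpus_entries() with no arguments inspects the default location; compare the result against the directories named in your NIST distribution.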


"disks45"

TREC Disks 4 and 5, including documents from the Financial Times, the Congressional Record, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times.

This dataset is a placeholder for the complete collection, but at this time, only the version of the dataset without the Congressional Record (disks45/nocr) is provided.


"disks45/nocr"

A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.

docs
528K docs

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.
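
As an illustration of working with these fields, the sketch below flattens a document into a dict with one searchable text field. The TrecParsedDoc defined here is a local stand-in with the same field names so the example runs without the corpus; the helper name doc_to_record, the "docno"/"text" keys, and the sample document are assumptions made for illustration.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields, so the sketch runs without
# the corpus. With data available, use the objects from dataset.docs_iter().
TrecParsedDoc = namedtuple(
    "TrecParsedDoc", ["doc_id", "title", "body", "marked_up_doc"]
)


def doc_to_record(doc):
    """Flatten a document into a dict with a single text field.

    Empty fields are skipped so that a missing title does not leave a
    leading blank line in the text.
    """
    parts = [doc.title.strip(), doc.body.strip()]
    text = "\n".join(part for part in parts if part)
    return {"docno": doc.doc_id, "text": text}


# Made-up document for illustration only; not taken from the collection.
example = TrecParsedDoc(
    "DOC-0001", "Sample title", "Sample body text.", b"<DOC>...</DOC>"
)
record = doc_to_record(example)
```

The marked_up_doc field is left untouched here; it holds the original markup as bytes for cases where the parsed title and body are not enough.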

CLI
ir_datasets export disks45/nocr docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}

"disks45/nocr/trec-robust-2004"

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

queries
250 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
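
Each topic carries three text fields, and experiments typically pick one or combine several. The sketch below shows one way to build a query string from chosen fields. The TrecQuery here is a local stand-in with the documented field names, and build_query_text and the sample topic are hypothetical.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields; with data available, use
# the objects yielded by dataset.queries_iter() instead.
TrecQuery = namedtuple(
    "TrecQuery", ["query_id", "title", "description", "narrative"]
)


def build_query_text(query, fields=("title",)):
    """Join the chosen fields into one whitespace-normalised string."""
    pieces = [getattr(query, field) for field in fields]
    # split()/join collapses newlines and repeated spaces inside the fields.
    return " ".join(" ".join(pieces).split())


# Made-up topic for illustration only; not taken from the track.
example = TrecQuery(
    "000", "sample topic", "A made-up\ndescription.", "A made-up narrative."
)
title_only = build_query_text(example)
title_and_desc = build_query_text(example, ("title", "description"))
```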

CLI
ir_datasets export disks45/nocr/trec-robust-2004 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
311K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     not relevant     294K   94.4%
1     relevant         16K    5.3%
2     highly relevant  1.0K   0.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
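
The per-level counts and percentages in the relevance table can be recomputed from any iterable of qrels. A minimal sketch follows, using a local TrecQrel stand-in and invented judgments so that it runs without the data; relevance_distribution is a hypothetical helper name.

```python
from collections import Counter, namedtuple

# Local stand-in mirroring the documented fields; with data available, pass
# dataset.qrels_iter() to relevance_distribution instead.
TrecQrel = namedtuple(
    "TrecQrel", ["query_id", "doc_id", "relevance", "iteration"]
)


def relevance_distribution(qrels):
    """Map each relevance level to (count, percentage of all judgments)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (n, 100.0 * n / total) for level, n in sorted(counts.items())
    }


# Invented judgments for illustration only.
sample_qrels = [
    TrecQrel("q1", "DOC-0001", 0, "0"),
    TrecQrel("q1", "DOC-0002", 0, "0"),
    TrecQrel("q1", "DOC-0003", 1, "0"),
    TrecQrel("q2", "DOC-0001", 2, "0"),
]
distribution = relevance_distribution(sample_qrels)
```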

CLI
ir_datasets export disks45/nocr/trec-robust-2004 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
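
The TSV output can be read back with the standard csv module. This is a minimal sketch, assuming the four columns appear in the order shown above; parse_qrels_tsv is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def parse_qrels_tsv(lines):
    """Parse tab-separated rows into (query_id, doc_id, relevance, iteration).

    The relevance column is converted to int; the other columns stay strings.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(qid, did, int(rel), it) for qid, did, rel, it in reader]


# Invented rows for illustration; in practice, read the exported file.
sample = io.StringIO("q1\tDOC-0001\t1\t0\nq1\tDOC-0002\t0\t0\n")
rows = parse_qrels_tsv(sample)
```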

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec-robust-2004/fold1"

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works
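
One common way to use the five folds is cross-validation, where each fold serves once as the test set while the remaining four supply training queries. The sketch below only generates the dataset IDs for each split. The four-to-one partition and the helper name cross_validation_splits are assumptions for illustration; individual papers may hold out a validation fold as well.

```python
# Dataset IDs of the five folds, as listed in the index of this page.
FOLD_IDS = [f"disks45/nocr/trec-robust-2004/fold{i}" for i in range(1, 6)]


def cross_validation_splits(fold_ids=FOLD_IDS):
    """Yield (train_ids, test_id) pairs, holding out each fold once."""
    for held_out in fold_ids:
        train = [fold for fold in fold_ids if fold != held_out]
        yield train, held_out


splits = list(cross_validation_splits())
```

Each ID in a split can be passed to ir_datasets.load() to obtain that fold's queries and qrels.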

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold1 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    95.2%
1     relevant         2.8K   4.5%
2     highly relevant  229    0.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold1 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec-robust-2004/fold2"

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold2 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    94.3%
1     relevant         3.3K   5.2%
2     highly relevant  337    0.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold2 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec-robust-2004/fold3"

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold3 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold3 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     59K    93.6%
1     relevant         3.9K   6.2%
2     highly relevant  165    0.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold3 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec-robust-2004/fold4"

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold4 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold4 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
58K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     55K    95.1%
1     relevant         2.7K   4.7%
2     highly relevant  152    0.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold4 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec-robust-2004/fold5"

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works

queries
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold5 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold5 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition       Count  %
0     not relevant     60K    94.0%
1     relevant         3.7K   5.7%
2     highly relevant  148    0.2%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec-robust-2004/fold5 qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
  title = {Overview of the TREC 2004 Robust Retrieval Track},
  author = {Ellen Voorhees},
  booktitle = {TREC},
  year = {2004}
}
@inproceedings{Huston2014ACO,
  title = {A Comparison of Retrieval Models using Term Dependencies},
  author = {Samuel Huston and W. Bruce Croft},
  booktitle = {CIKM},
  year = {2014}
}

"disks45/nocr/trec7"

The TREC 7 Adhoc Retrieval track.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec7 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec7 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
80K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     not relevant  76K    94.2%
1     relevant      4.7K   5.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec7 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees1998Trec7}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1998Trec7,
  title = {Overview of the Seventh Text Retrieval Conference (TREC-7)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1998},
  booktitle = {TREC}
}

"disks45/nocr/trec8"

The TREC 8 Adhoc Retrieval track.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec8 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs
528K docs

Inherits docs from disks45/nocr

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec8 docs
[doc_id]    [title]    [body]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])

You can find more details about PyTerrier indexing here.

qrels
87K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     not relevant  82K    94.6%
1     relevant      4.7K   5.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export disks45/nocr/trec8 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees1996Disks45,Voorhees1999Trec8}

Bibtex:

@misc{Voorhees1996Disks45,
  title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
  author = {Ellen M. Voorhees},
  doi = {10.18434/t47g6m},
  year = {1996},
  publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1999Trec8,
  title = {Overview of the Eighth Text Retrieval Conference (TREC-8)},
  author = {Ellen M. Voorhees and Donna Harman},
  year = {1999},
  booktitle = {TREC}
}