Github: datasets/disks45.py

ir_datasets: TREC Disks 4 and 5

Index
  1. disks45
  2. disks45/nocr
  3. disks45/nocr/trec-robust-2004
  4. disks45/nocr/trec-robust-2004/fold1
  5. disks45/nocr/trec-robust-2004/fold2
  6. disks45/nocr/trec-robust-2004/fold3
  7. disks45/nocr/trec-robust-2004/fold4
  8. disks45/nocr/trec-robust-2004/fold5
  9. disks45/nocr/trec7
  10. disks45/nocr/trec8

Data Access Information

To use this dataset, you need a copy of TREC Disks 4 and 5, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, after which you will receive a password-protected download link.

ir_datasets needs the following directories from the source:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/disks45/corpus. The source document files can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it will raise an error.
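Before loading the dataset, it can help to confirm that the corpus location exists and is populated. The following is a minimal sketch using only the standard library. `list_corpus_entries` is a hypothetical helper, not part of ir_datasets, and it assumes the default location under the home directory given above.

```python
from pathlib import Path


def list_corpus_entries(home=None):
    """Return (corpus_path, sorted entry names) for the expected corpus location.

    `home` defaults to the current user's home directory. This helper is
    illustrative only; ir_datasets performs its own checks when loading.
    """
    base = Path(home) if home is not None else Path.home()
    corpus = base / ".ir_datasets" / "disks45" / "corpus"
    # is_dir() follows symlinks, so a linked corpus directory also counts.
    if not corpus.is_dir():
        return corpus, []
    return corpus, sorted(entry.name for entry in corpus.iterdir())


if __name__ == "__main__":
    corpus, entries = list_corpus_entries()
    if entries:
        print(f"Found {len(entries)} entries under {corpus}: {entries}")
    else:
        print(f"Nothing found under {corpus}; copy or link the source directories there.")
```

An empty result only indicates that nothing is in place yet. It does not verify that the right directories or files are present.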


"disks45"

TREC Disks 4 and 5, including documents from the Financial Times, the Congressional Record, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times.

This dataset is a placeholder for the complete collection. At this time, only the version of the dataset without the Congressional Record (disks45/nocr) is provided.


"disks45/nocr"

A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.

528K docs

Language: en

Document type:
TrecParsedDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. body: str
  4. marked_up_doc: bytes

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>

You can find more details about the Python API here.
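With roughly 528K documents, it is often useful to inspect only the first few before committing to a full pass. The sketch below is one way to do that. `preview_docs` and `ExampleDoc` are hypothetical names introduced for illustration; `ExampleDoc` is a stand-in with the same four fields as TrecParsedDoc so that the snippet runs without the corpus, and its values are made up.

```python
from collections import namedtuple
from itertools import islice

# Stand-in with the same fields as the TrecParsedDoc listed above, so the
# sketch runs without the corpus. Real documents come from dataset.docs_iter().
ExampleDoc = namedtuple("ExampleDoc", ["doc_id", "title", "body", "marked_up_doc"])


def preview_docs(docs, limit=3, width=60):
    """Return (doc_id, title, body snippet, markup size in bytes) for the first docs."""
    previews = []
    for doc in islice(docs, limit):
        # Collapse runs of whitespace so the snippet fits on one line.
        snippet = " ".join(doc.body.split())[:width]
        previews.append((doc.doc_id, doc.title, snippet, len(doc.marked_up_doc)))
    return previews


sample = [ExampleDoc("DOC-1", "A title", "Some\n  body text.", b"<DOC>...</DOC>")]
print(preview_docs(sample))
# With the real corpus, pass the iterator from the example above instead:
#   preview_docs(ir_datasets.load("disks45/nocr").docs_iter())
```

Because `islice` stops after `limit` items, the rest of the iterator is never consumed.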


"disks45/nocr/trec-robust-2004"

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

250 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
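When working with run files or judgments, a lookup from query_id to query text is often more convenient than repeated iteration. The following sketch builds one from any iterable of TrecQuery-like tuples. `queries_by_id` and `ExampleQuery` are hypothetical names, and the sample values are made up so the snippet runs without the dataset.

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecQuery listed above.
ExampleQuery = namedtuple(
    "ExampleQuery", ["query_id", "title", "description", "narrative"]
)


def queries_by_id(queries, field="title"):
    """Map each query_id to the chosen text field (title by default)."""
    lookup = {}
    for query in queries:
        lookup[query.query_id] = getattr(query, field)
    return lookup


sample = [
    ExampleQuery("q1", "first title", "first description", "first narrative"),
    ExampleQuery("q2", "second title", "second description", "second narrative"),
]
print(queries_by_id(sample))
# With the real dataset, pass dataset.queries_iter() from the example above.
```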



"disks45/nocr/trec-robust-2004/fold1"

Robust04 Fold 1 (Title), as proposed by Huston & Croft (2014) and used in numerous works.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>



"disks45/nocr/trec-robust-2004/fold2"

Robust04 Fold 2 (Title), as proposed by Huston & Croft (2014) and used in numerous works.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>



"disks45/nocr/trec-robust-2004/fold3"

Robust04 Fold 3 (Title), as proposed by Huston & Croft (2014) and used in numerous works.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>



"disks45/nocr/trec-robust-2004/fold4"

Robust04 Fold 4 (Title), as proposed by Huston & Croft (2014) and used in numerous works.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>



"disks45/nocr/trec-robust-2004/fold5"

Robust04 Fold 5 (Title), as proposed by Huston & Croft (2014) and used in numerous works.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
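The five fold datasets each hold 50 queries. A common way to use such folds, assumed here rather than stated by the source, is to hold one fold out for testing and tune on the remaining four. The sketch below only enumerates the dataset IDs for each split; `held_out_splits` is a hypothetical helper name.

```python
# Dataset IDs of the five folds, as listed in the index above.
FOLD_IDS = [f"disks45/nocr/trec-robust-2004/fold{i}" for i in range(1, 6)]


def held_out_splits(fold_ids=FOLD_IDS):
    """Yield (train_ids, test_id) pairs, holding out each fold once."""
    for index, test_id in enumerate(fold_ids):
        train_ids = fold_ids[:index] + fold_ids[index + 1:]
        yield train_ids, test_id


for train_ids, test_id in held_out_splits():
    print(f"test on {test_id}, train on {len(train_ids)} other folds")
    # Each ID can then be passed to ir_datasets.load(...) as in the examples above.
```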



"disks45/nocr/trec7"

The TREC 7 Adhoc Retrieval track.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
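Each query carries three text fields, and a retrieval experiment has to decide which of them to send to the search system. As a minimal sketch, the hypothetical helper `query_text` below joins the selected fields into a single whitespace-normalized string. `ExampleQuery` is a stand-in with the same fields as TrecQuery, and its values are made up.

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecQuery listed above.
ExampleQuery = namedtuple(
    "ExampleQuery", ["query_id", "title", "description", "narrative"]
)


def query_text(query, fields=("title", "description")):
    """Join the chosen fields into one string, collapsing internal whitespace."""
    parts = [" ".join(getattr(query, name).split()) for name in fields]
    # Skip fields that are empty after normalization.
    return " ".join(part for part in parts if part)


example = ExampleQuery("q1", "solar  power", "Find documents\n about solar power.", "")
print(query_text(example))
print(query_text(example, fields=("title",)))
```

Keeping the field choice as a parameter makes it easy to compare, for example, title-only runs against title-plus-description runs.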



"disks45/nocr/trec8"

The TREC 8 Adhoc Retrieval track.

50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
