ir_datasets: TREC Disks 4 and 5
Index
- disks45
- disks45/nocr
- disks45/nocr/trec-robust-2004
- disks45/nocr/trec-robust-2004/fold1
- disks45/nocr/trec-robust-2004/fold2
- disks45/nocr/trec-robust-2004/fold3
- disks45/nocr/trec-robust-2004/fold4
- disks45/nocr/trec-robust-2004/fold5
- disks45/nocr/trec7
- disks45/nocr/trec8
Data Access Information
To use this dataset, you need a copy of TREC Disks 4 and 5, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
ir_datasets needs the following directories from the source:
ir_datasets expects the above directories to be copied or linked under ~/.ir_datasets/disks45/corpus. The source document files can be either compressed or uncompressed (they appear to have been distributed both ways in the past). If ir_datasets does not find the files it expects, it raises an error.
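A quick pre-flight check can confirm that the corpus directory is in place before loading the dataset. The sketch below is only an illustration and is not part of ir_datasets: the helper name `corpus_dir_ready` is hypothetical, and it assumes the default home directory `~/.ir_datasets` is in use. It tests only that the directory exists and is non-empty, because the required subdirectories are not listed here.

```python
from pathlib import Path
from typing import Optional


def corpus_dir_ready(base: Optional[Path] = None) -> bool:
    """Return True if <base>/disks45/corpus exists and has at least one entry.

    Hypothetical helper for illustration only; ir_datasets performs its own
    checks and raises an error if the files it expects are missing.
    """
    if base is None:
        # Assumed default location of the ir_datasets home directory.
        base = Path.home() / ".ir_datasets"
    corpus = base / "disks45" / "corpus"
    # is_dir() follows symlinks, so a linked corpus directory also counts.
    return corpus.is_dir() and any(corpus.iterdir())


if __name__ == "__main__":
    print("corpus directory ready:", corpus_dir_ready())
```

A `False` result only means the directory is absent or empty; a `True` result does not guarantee that every expected file is present.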
"disks45"
TREC Disks 4 and 5, including documents from the Financial Times, the Congressional Record, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times.
This dataset is a placeholder for the complete collection, but at this time, only the version of the dataset without the Congressional Record (disks45/nocr) is provided.
"disks45/nocr"
A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.
docs | Citation | Metadata
528K docs
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
ir_datasets.bib:
\cite{Voorhees1996Disks45}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
}
}
"disks45/nocr/trec-robust-2004"
The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."
The TREC Robust document collection is drawn from TREC Disks 4 and 5. Because the documents are copyrighted, the collection is for research use only, and access requires agreements to be filed with NIST. See details here.
queries | docs | qrels | Citation | Metadata
250 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
311K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 294K | 94.4% |
1 | relevant | 16K | 5.3% |
2 | highly relevant | 1.0K | 0.3% |
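Graded levels like these are often collapsed to binary relevance before computing measures such as MAP. The following is a minimal sketch, assuming any iterable of records with `query_id`, `doc_id`, and `relevance` attributes (such as the records yielded by `qrels_iter()`). The helper name `qrels_to_dict`, the `min_rel` threshold, and the sample records are illustrative and not part of ir_datasets.

```python
from collections import defaultdict, namedtuple

# Stand-in for TrecQrel records; the values below are made up for illustration.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_dict(qrels, min_rel=1):
    """Group judgments by query, keeping doc_ids with relevance >= min_rel."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


sample = [
    Qrel("q1", "docA", 0, "0"),
    Qrel("q1", "docB", 1, "0"),
    Qrel("q1", "docC", 2, "0"),
    Qrel("q2", "docA", 1, "0"),
]

binary = qrels_to_dict(sample)             # levels 1 and 2 both count as relevant
highly = qrels_to_dict(sample, min_rel=2)  # only level 2 counts as relevant
```

Queries with no document at or above the threshold are simply absent from the result, which is why `q2` does not appear in `highly`.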
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 250
},
"qrels": {
"count": 311410,
"fields": {
"relevance": {
"counts_by_value": {
"1": 16381,
"0": 293998,
"2": 1031
}
}
}
}
}
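The metadata block can be cross-checked against the relevance table: the per-level counts should sum to the qrels total, and the percentages follow directly from the counts. A short sketch using only the numbers shown above (the variable names are illustrative):

```python
import json

# The qrels portion of the metadata shown above.
metadata = json.loads("""
{
  "qrels": {
    "count": 311410,
    "fields": {"relevance": {"counts_by_value": {"1": 16381, "0": 293998, "2": 1031}}}
  }
}
""")

qrels = metadata["qrels"]
counts = {int(k): v for k, v in qrels["fields"]["relevance"]["counts_by_value"].items()}
total = qrels["count"]

# The per-level counts account for every judgment.
assert sum(counts.values()) == total

# Share of each relevance level, as a percentage rounded to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in sorted(counts.items())}
print(shares)  # {0: 94.4, 1: 5.3, 2: 0.3}
```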
"disks45/nocr/trec-robust-2004/fold1"
Robust04 Fold 1 (Title), as proposed by Huston & Croft (2014) and used in numerous works.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 60K | 95.2% |
1 | relevant | 2.8K | 4.5% |
2 | highly relevant | 229 | 0.4% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold1 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold1')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold1.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 62789,
"fields": {
"relevance": {
"counts_by_value": {
"0": 59765,
"1": 2795,
"2": 229
}
}
}
}
}
"disks45/nocr/trec-robust-2004/fold2"
Robust04 Fold 2 (Title), as proposed by Huston & Croft (2014) and used in numerous works.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 60K | 94.3% |
1 | relevant | 3.3K | 5.2% |
2 | highly relevant | 337 | 0.5% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold2 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold2')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold2.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 63917,
"fields": {
"relevance": {
"counts_by_value": {
"1": 3334,
"0": 60246,
"2": 337
}
}
}
}
}
"disks45/nocr/trec-robust-2004/fold3"
Robust04 Fold 3 (Title), as proposed by Huston & Croft (2014) and used in numerous works.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
63K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 59K | 93.6% |
1 | relevant | 3.9K | 6.2% |
2 | highly relevant | 165 | 0.3% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold3 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold3')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold3.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 62901,
"fields": {
"relevance": {
"counts_by_value": {
"0": 58859,
"1": 3877,
"2": 165
}
}
}
}
}
"disks45/nocr/trec-robust-2004/fold4"
Robust04 Fold 4 (Title), as proposed by Huston & Croft (2014) and used in numerous works.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
58K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 55K | 95.1% |
1 | relevant | 2.7K | 4.7% |
2 | highly relevant | 152 | 0.3% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold4 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold4')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold4.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 57962,
"fields": {
"relevance": {
"counts_by_value": {
"0": 55103,
"1": 2707,
"2": 152
}
}
}
}
}
"disks45/nocr/trec-robust-2004/fold5"
Robust04 Fold 5 (Title), as proposed by Huston & Croft (2014) and used in numerous works.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
64K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 60K | 94.0% |
1 | relevant | 3.7K | 5.7% |
2 | highly relevant | 148 | 0.2% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004/fold5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec-robust-2004/fold5 qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004/fold5')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec-robust-2004.fold5.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
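Because this fold uses three relevance levels (0, 1, 2), some evaluation setups first collapse them to binary judgments. The following is a minimal sketch of that step, not part of ir_datasets itself: `GenericQrel` here is a local stand-in with the same field names as the type listed above, `relevant_docs_by_query` is a hypothetical helper, and the sample judgments are invented for illustration.

```python
from collections import defaultdict, namedtuple

# Local stand-in for the GenericQrel type listed above (same field names).
GenericQrel = namedtuple("GenericQrel", ["query_id", "doc_id", "relevance"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Group doc_ids judged at or above min_relevance under their query_id."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


# Hypothetical judgments, for illustration only.
sample = [
    GenericQrel("q1", "doc-a", 0),
    GenericQrel("q1", "doc-b", 1),
    GenericQrel("q1", "doc-c", 2),
    GenericQrel("q2", "doc-a", 2),
]

binary = relevant_docs_by_query(sample)                   # levels 1 and 2 count as relevant
strict = relevant_docs_by_query(sample, min_relevance=2)  # only "highly relevant"
print(binary)
print(strict)
```

With the real data, the same helper should accept `dataset.qrels_iter()` directly, since it only relies on the three fields shown above.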
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees2004Robust,Huston2014ACO}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees2004Robust,
title={Overview of the TREC 2004 Robust Retrieval Track},
author={Ellen Voorhees},
booktitle={TREC},
year={2004}
}
@inproceedings{Huston2014ACO,
title={A Comparison of Retrieval Models using Term Dependencies},
author={Samuel Huston and W. Bruce Croft},
booktitle={CIKM},
year={2014}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 63841,
"fields": {
"relevance": {
"counts_by_value": {
"0": 60025,
"1": 3668,
"2": 148
}
}
}
}
}
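The percentages in the relevance-level table for this fold follow from the `counts_by_value` entries in the metadata above. As a small sketch (the `relevance_shares` helper is hypothetical, and the numbers are copied from the JSON block), the shares can be recomputed like this:

```python
# Values copied from the qrels metadata above (disks45/nocr/trec-robust-2004/fold5).
qrels_meta = {
    "count": 63841,
    "counts_by_value": {"0": 60025, "1": 3668, "2": 148},
}


def relevance_shares(meta):
    """Return {relevance level: percentage of all qrels}, rounded to one decimal."""
    total = meta["count"]
    return {
        int(level): round(100 * n / total, 1)
        for level, n in meta["counts_by_value"].items()
    }


shares = relevance_shares(qrels_meta)
for level, pct in sorted(shares.items()):
    print(f"relevance={level}: {pct}%")
```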
"disks45/nocr/trec7"
The TREC 7 Adhoc Retrieval track.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec7.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
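Each TrecQuery carries three text fields, and the PyTerrier example above selects the `title` variant. When working with the plain Python API, a similar choice has to be made by hand. The sketch below is one way to do it under stated assumptions: `TrecQuery` is a local stand-in with the field names listed above, `query_text` is a hypothetical helper, and the example topic is invented.

```python
from collections import namedtuple

# Local stand-in for the TrecQuery type listed above (same field names).
TrecQuery = namedtuple("TrecQuery", ["query_id", "title", "description", "narrative"])


def query_text(query, fields=("title",)):
    """Join the chosen fields into one string with whitespace normalized."""
    parts = (getattr(query, field) for field in fields)
    return " ".join(" ".join(parts).split())


# Hypothetical topic, for illustration only.
example = TrecQuery(
    "q0",
    "  solar power ",
    "What are uses\nof solar power?",
    "Hypothetical narrative.",
)

short_query = query_text(example)
long_query = query_text(example, fields=("title", "description"))
print(short_query)
print(long_query)
```

Normalizing whitespace is a design choice here: topic fields in TREC-style files often span several lines, and a single-line string is easier to pass to a retrieval system.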
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec7')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
80K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 76K | 94.2% |
1 | relevant | 4.7K | 5.8% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec7")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec7 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec7')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec7.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
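The CLI example above exports qrels as TSV with the columns `[query_id] [doc_id] [relevance] [iteration]`. A minimal sketch for reading such a file back into records follows. It assumes one record per line, tab-separated, in exactly that column order; `read_qrels_tsv` is a hypothetical helper, `TrecQrel` is a local stand-in for the type listed above, and the sample rows are invented.

```python
import csv
import io
from collections import namedtuple

# Local stand-in for the TrecQrel type listed above (same field names).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def read_qrels_tsv(handle):
    """Yield TrecQrel records from tab-separated lines (assumed column order)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield TrecQrel(query_id, doc_id, int(relevance), iteration)


# Hypothetical rows standing in for real exported output.
sample_tsv = "q1\tdoc-a\t0\t0\nq1\tdoc-b\t1\t0\n"
qrels = list(read_qrels_tsv(io.StringIO(sample_tsv)))
print(qrels)
```

In practice the handle would be an opened file holding the saved output of the export command. Only `relevance` is converted to `int`, matching the field types listed above, where `iteration` is a `str`.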
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees1998Trec7}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1998Trec7,
title = {Overview of the Seventh Text Retrieval Conference (TREC-7)},
author = {Ellen M. Voorhees and Donna Harman},
year = {1998},
booktitle = {TREC}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 80345,
"fields": {
"relevance": {
"counts_by_value": {
"0": 75671,
"1": 4674
}
}
}
}
}
"disks45/nocr/trec8"
The TREC 8 Adhoc Retrieval track.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.disks45.nocr.trec8.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
528K docs
Inherits docs from disks45/nocr
Language: en
Document type:
TrecParsedDoc: (namedtuple)
- doc_id: str
- title: str
- body: str
- marked_up_doc: bytes
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, body, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 docs
[doc_id] [title] [body] [marked_up_doc]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
# Index disks45/nocr
indexer = pt.IterDictIndexer('./indices/disks45_nocr')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.disks45.nocr.trec8')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
87K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 82K | 94.6% |
1 | relevant | 4.7K | 5.4% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec8")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export disks45/nocr/trec8 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:disks45/nocr/trec8')
index_ref = pt.IndexRef.of('./indices/disks45_nocr') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.disks45.nocr.trec8.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Voorhees1996Disks45,Voorhees1999Trec8}
Bibtex:
@misc{Voorhees1996Disks45,
title = {NIST TREC Disks 4 and 5: Retrieval Test Collections Document Set},
author = {Ellen M. Voorhees},
doi = {10.18434/t47g6m},
year = {1996},
publisher = {National Institute of Standards and Technology}
}
@inproceedings{Voorhees1999Trec8,
title = {Overview of the Eighth Text Retrieval Conference (TREC-8)},
author = {Ellen M. Voorhees and Donna Harman},
year = {1999},
booktitle = {TREC}
}
{
"docs": {
"count": 528155,
"fields": {
"doc_id": {
"max_len": 16,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 86830,
"fields": {
"relevance": {
"counts_by_value": {
"0": 82102,
"1": 4728
}
}
}
}
}