ir_datasets: GOV
To use this dataset, you need a copy of GOV, provided by the University of Glasgow.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to UoG to get a copy. The data are provided on hard drives that are shipped to you.
Once you have the data, ir_datasets will need the directories that look like the following:
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/gov/corpus.
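The linking step can be sketched in Python as follows. This is only an illustration: `link_corpus` is a hypothetical helper (not part of ir_datasets), and `/mnt/gov_drive` in the usage comment is a placeholder for wherever your copy of the data lives. The only path taken from the text above is `~/.ir_datasets/gov/corpus`.

```python
from pathlib import Path


def link_corpus(source_dir, home=None):
    """Symlink every entry of source_dir into <home>/.ir_datasets/gov/corpus.

    `home` defaults to the current user's home directory; it is a parameter
    only so the helper can be tried out against a scratch directory.
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".ir_datasets" / "gov" / "corpus"
    target.mkdir(parents=True, exist_ok=True)

    linked = []
    for entry in sorted(Path(source_dir).iterdir()):
        link = target / entry.name
        # Leave anything that is already in place untouched.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(entry.resolve(), target_is_directory=entry.is_dir())
        linked.append(link)
    return linked


# Hypothetical usage (the source path is a placeholder):
# link_corpus("/mnt/gov_drive")
```

Symlinks avoid duplicating the data; copying the directories instead should work equally well, since the text above allows either.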
GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.
The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
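Because the corpus is large, it can help to look at only the first few records rather than iterating over everything. The sketch below uses `itertools.islice` for that. The `Doc` namedtuple is a stand-in that only mirrors the field names listed above, the sample values are made up, and `summarize` is a hypothetical helper. The field types used here (bytes for `body`) are an assumption.

```python
from collections import namedtuple
from itertools import islice

# Stand-in record with the field names shown above. Real records would come
# from dataset.docs_iter(); the values below are invented for illustration.
Doc = namedtuple("Doc", ["doc_id", "url", "http_headers", "body", "body_content_type"])


def summarize(docs, limit=3):
    """Return (doc_id, url, body length) for the first `limit` documents."""
    return [(d.doc_id, d.url, len(d.body)) for d in islice(docs, limit)]


fake_docs = (
    Doc(f"doc-{i}", f"http://example.gov/{i}", "", b"<html></html>", "text/html")
    for i in range(10)
)
print(summarize(fake_docs, limit=2))

# With the real corpus in place, the same helper could be called as:
# summarize(dataset.docs_iter(), limit=5)
```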
The TREC Web Track 2002 ad-hoc ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
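Since these queries carry several text fields, a retrieval experiment usually has to choose which of them to send to the ranker. The following is a minimal sketch of one way to do that. `Query` is a stand-in with the field names shown above, the sample values are invented, and `query_text` is a hypothetical helper rather than anything provided by ir_datasets.

```python
from collections import namedtuple

# Stand-in record mirroring the fields listed above; real records would come
# from dataset.queries_iter().
Query = namedtuple("Query", ["query_id", "title", "description", "narrative"])


def query_text(query, fields=("title",)):
    """Join the selected fields into one string, skipping empty ones."""
    parts = [getattr(query, name).strip() for name in fields]
    return " ".join(part for part in parts if part)


example = Query("q1", "sample title ", "a longer description", "")
print(query_text(example))
print(query_text(example, fields=("title", "description", "narrative")))
```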
The TREC Web Track 2002 named page ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
The TREC Web Track 2003 ad-hoc ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>
The TREC Web Track 2003 named page ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
The TREC Web Track 2004 ad-hoc ranking benchmark.
Queries include a combination of topic distillation, homepage finding, and named page finding.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
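The same `queries_iter()` loop applies to every benchmark listed on this page, so they can be handled together. As a rough sketch, the hypothetical helper below counts the queries in each subset. The loader is passed in as an argument (in practice `ir_datasets.load`), so the function itself makes no assumptions beyond the `load(...)` and `queries_iter()` calls shown in the examples above.

```python
# Dataset identifiers as they appear in the examples on this page.
SUBSETS = [
    "gov/trec-web-2002",
    "gov/trec-web-2002/named-page",
    "gov/trec-web-2003",
    "gov/trec-web-2003/named-page",
    "gov/trec-web-2004",
]


def count_queries(load, subset_ids=SUBSETS):
    """Map each subset id to the number of queries it yields."""
    return {
        subset_id: sum(1 for _ in load(subset_id).queries_iter())
        for subset_id in subset_ids
    }


# Usage, assuming ir_datasets is installed and the data is available:
# import ir_datasets
# print(count_queries(ir_datasets.load))
```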