GitHub: datasets/gov.py

ir_datasets: GOV

Index
  1. gov
  2. gov/trec-web-2002
  3. gov/trec-web-2002/named-page
  4. gov/trec-web-2003
  5. gov/trec-web-2003/named-page
  6. gov/trec-web-2004

Data Access Information

To use this dataset, you need a copy of GOV, provided by the University of Glasgow (UoG).

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to UoG to get a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/gov/corpus.
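
One way to do the linking step is sketched below. It is a minimal illustration, not part of ir_datasets: the helper name link_corpus_dirs and the source path passed to it are hypothetical, and it simply symlinks every top-level directory of the shipped data into the corpus location named above without checking that the directory names are the ones ir_datasets expects.

```python
from pathlib import Path


def link_corpus_dirs(source_root, corpus_root=None):
    """Symlink each top-level directory of source_root into corpus_root.

    Hypothetical helper: source_root is wherever your copy of GOV lives.
    """
    source_root = Path(source_root).expanduser().resolve()
    if corpus_root is None:
        # Location named in the text above.
        corpus_root = Path.home() / ".ir_datasets" / "gov" / "corpus"
    corpus_root = Path(corpus_root).expanduser()
    corpus_root.mkdir(parents=True, exist_ok=True)

    linked = []
    for entry in sorted(source_root.iterdir()):
        if not entry.is_dir():
            continue  # this sketch links directories only
        target = corpus_root / entry.name
        if target.exists() or target.is_symlink():
            continue  # leave anything already in place untouched
        target.symlink_to(entry, target_is_directory=True)
        linked.append(target)
    return linked
```

Calling link_corpus_dirs("/path/to/your/gov/copy") (a placeholder path) returns the list of links it created; running it a second time creates nothing new.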


"gov"

GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.

The dataset is obtained for a fee from the University of Glasgow (UoG) and is shipped on a hard drive. More information is provided under Data Access Information above.

docs
1.2M docs

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
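
Because body is listed as raw bytes, a common next step is decoding it to text. The sketch below is one possible approach under stated assumptions: it looks for a charset label in http_headers and falls back to UTF-8 with replacement characters. The helper body_text is hypothetical, and the GovDoc namedtuple here is only a stand-in that mirrors the fields listed above so the example runs without the corpus; the example doc_id and URL are made up.

```python
import re
from collections import namedtuple

# Stand-in mirroring the GovDoc fields listed above; real documents come
# from dataset.docs_iter().
GovDoc = namedtuple(
    "GovDoc", ["doc_id", "url", "http_headers", "body", "body_content_type"]
)

_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def body_text(doc, default_encoding="utf-8"):
    """Decode doc.body to str, using a charset from the headers if one is given."""
    headers = doc.http_headers or ""
    if isinstance(headers, bytes):
        # Tolerate byte headers as well, in case the field is not a str.
        headers = headers.decode("latin-1")
    match = _CHARSET_RE.search(headers)
    encoding = match.group(1) if match else default_encoding
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to the default encoding.
        return doc.body.decode(default_encoding, errors="replace")


# Made-up example record, for illustration only.
example = GovDoc(
    doc_id="EXAMPLE-0",
    url="http://example.gov/",
    http_headers="HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\n",
    body=b"<html>caf\xe9</html>",
    body_content_type="text/html",
)
print(body_text(example))
```

The result still contains markup when body_content_type is text/html; stripping tags is left to whichever HTML parser you prefer.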

You can find more details about the Python API here.


"gov/trec-web-2002"

The TREC Web Track 2002 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>



"gov/trec-web-2002/named-page"

The TREC Web Track 2002 named page ranking benchmark.

queries
150 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"gov/trec-web-2003"

The TREC Web Track 2003 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
GovWeb02Query: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>
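
The query types on this page do not share a single text field: TrecQuery and GovWeb02Query expose title, while GenericQuery exposes text. If you want one code path for all gov subsets, a small accessor along these lines may help. It is a sketch only; query_text and queries_as_dict are hypothetical helpers, and the namedtuples are stand-ins mirroring the listed fields so the example runs without the data.

```python
from collections import namedtuple

# Stand-ins mirroring the query types listed on this page.
TrecQuery = namedtuple("TrecQuery", ["query_id", "title", "description", "narrative"])
GovWeb02Query = namedtuple("GovWeb02Query", ["query_id", "title", "description"])
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])


def query_text(query):
    """Return the short query string: `text` if the type has it, else `title`."""
    if hasattr(query, "text"):
        return query.text
    return query.title


def queries_as_dict(queries):
    """Map query_id to its short query string for any of the types above."""
    return {q.query_id: query_text(q).strip() for q in queries}


# Made-up queries, for illustration only.
sample = [
    TrecQuery("1", "example title ", "example description", "example narrative"),
    GovWeb02Query("2", "another title", "another description"),
    GenericQuery("3", "named page example"),
]
print(queries_as_dict(sample))
```

With the data in place, the same helper can be fed dataset.queries_iter() from any of the subsets, for example queries_as_dict(ir_datasets.load("gov/trec-web-2003").queries_iter()).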



"gov/trec-web-2003/named-page"

The TREC Web Track 2003 named page ranking benchmark.

queries
300 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"gov/trec-web-2004"

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

queries
225 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
