GitHub: datasets/gov.py

ir_datasets: GOV

Index
  1. gov
  2. gov/trec-web-2002
  3. gov/trec-web-2002/named-page
  4. gov/trec-web-2003
  5. gov/trec-web-2003/named-page
  6. gov/trec-web-2004

Data Access Information

To use this dataset, you need a copy of GOV, provided by the University of Glasgow.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to the University of Glasgow (UoG) to get a copy. The data are shipped to you on hard drives.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/gov/corpus.
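
The linking step can be sketched in Python as follows. This is a minimal sketch, not part of ir_datasets: the helper name link_corpus and the source path in the usage comment are hypothetical, and it simply links every sub-directory it finds rather than assuming any particular directory names.

```python
from pathlib import Path


def link_corpus(source_dir, target_dir):
    """Symlink each sub-directory of source_dir into target_dir.

    Existing entries in target_dir are left untouched. Returns the list of
    links that were created.
    """
    source = Path(source_dir).expanduser().resolve()
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)

    created = []
    for entry in sorted(source.iterdir()):
        if not entry.is_dir():
            continue  # only directories are linked
        link = target / entry.name
        if link.exists() or link.is_symlink():
            continue  # do not overwrite anything already in place
        link.symlink_to(entry, target_is_directory=True)
        created.append(link)
    return created


# Hypothetical usage; replace the first argument with the location of your copy:
# link_corpus("/path/to/gov_drive", "~/.ir_datasets/gov/corpus")
```

Symlinks avoid duplicating the data on disk; copying the directories instead works equally well if the drive will not stay mounted.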


"gov"

GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

docs
1.2M docs

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str
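
Because body is raw bytes, it usually needs decoding before text processing. The sketch below is one possible approach under stated assumptions: it assumes body_content_type is a MIME type string that may carry a charset parameter, and it uses a stand-in namedtuple with the same fields as GovDoc so the example runs without the corpus. The helper names and the sample document are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as the GovDoc type listed above.
GovDoc = namedtuple(
    "GovDoc", ["doc_id", "url", "http_headers", "body", "body_content_type"]
)


def charset_from_content_type(content_type):
    """Return the charset parameter of a MIME type string, or None."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.strip("\"'")
    return None


def decode_body(doc, fallback="utf-8"):
    """Decode doc.body to str, replacing bytes that cannot be decoded."""
    charset = charset_from_content_type(doc.body_content_type) or fallback
    try:
        return doc.body.decode(charset, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows about.
        return doc.body.decode(fallback, errors="replace")


# Made-up document, not taken from the collection.
sample = GovDoc(
    "G00-00-0000000",
    "http://example.gov/",
    "",
    "caf\xe9".encode("latin-1"),
    "text/html; charset=ISO-8859-1",
)
text = decode_body(sample)
```

Using errors="replace" keeps iteration over the whole collection from stopping on a single badly encoded page.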

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"gov/trec-web-2002"

The TREC Web Track 2002 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str
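
Retrieval systems typically take a single string per query, so the topic fields have to be combined in some way. The following is a small sketch of one option; the helper query_text and the sample topic are hypothetical, and the stand-in namedtuple only mirrors the fields listed above.

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecQuery type listed above.
TrecQuery = namedtuple(
    "TrecQuery", ["query_id", "title", "description", "narrative"]
)


def query_text(query, fields=("title",)):
    """Join the chosen topic fields into one whitespace-normalized string."""
    parts = [getattr(query, name) for name in fields]
    return " ".join(" ".join(parts).split())


# Made-up topic for illustration only; not taken from the benchmark.
example = TrecQuery(
    "0", "solar  energy", "Find pages about\nsolar energy programs.", ""
)
short_form = query_text(example)
long_form = query_text(example, fields=("title", "description"))
```

Title-only queries are the common default; adding the description gives longer, more verbose queries.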

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
57K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     Not Relevant  55K    97.2%
1     Relevant      1.6K   2.8%
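
For evaluation, the binary judgments are usually grouped by query first. The sketch below shows one way to do that and to compute precision at k for a ranked list. It uses a stand-in namedtuple with the TrecQrel fields and made-up IDs; the function names are hypothetical, and unjudged documents are assumed to count as not relevant.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as the TrecQrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels):
    """Group judgments into {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


def precision_at_k(ranking, judged, k=10, min_rel=1):
    """Fraction of the top k documents judged at least min_rel.

    Unjudged documents count as not relevant, and the denominator is
    always k, even when the ranking is shorter.
    """
    hits = sum(1 for doc_id in ranking[:k] if judged.get(doc_id, 0) >= min_rel)
    return hits / k


# Made-up judgments and ranking for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 1, "0"),
]
judged = qrels_by_query(sample_qrels)["q1"]
p_at_4 = precision_at_k(["d1", "d3", "d2", "d9"], judged, k=4)
```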

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
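
The exported lines can be read back into tuples with the standard csv module. This is a sketch under stated assumptions: it assumes one judgment per line, tab-separated, with the four fields in the order shown above and no header row. The function name and the sample lines are made up for illustration.

```python
import csv
import io
from collections import namedtuple

# Stand-in with the same fields as the TrecQrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def parse_qrels_tsv(stream):
    """Yield TrecQrel tuples from tab-separated lines, skipping malformed rows."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # blank or truncated line
        query_id, doc_id, relevance, iteration = row
        yield TrecQrel(query_id, doc_id, int(relevance), iteration)


# Made-up lines standing in for the exported output.
exported = io.StringIO("q1\tG00-00-0000000\t1\t0\n\nq1\tG00-00-0000001\t0\t0\n")
parsed = list(parse_qrels_tsv(exported))
```

Converting relevance back to int restores the type given in the TrecQrel definition; the other fields stay as strings.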

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Craswell2002TrecWeb}

Bibtex:

@inproceedings{Craswell2002TrecWeb, title={Overview of the TREC-2002 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2002} }

"gov/trec-web-2002/named-page"

The TREC Web Track 2002 named page ranking benchmark.

queries
150 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
170 qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition                Count  %
1     Name refers to this page  170    100.0%
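
Named page finding is commonly scored with mean reciprocal rank (MRR), since each query has only one or a few correct pages. The sketch below is a minimal illustration with hypothetical function names and made-up IDs; queries with no returned correct page contribute 0.

```python
def reciprocal_rank(ranking, correct_docs):
    """Return 1/rank of the first correct document, or 0.0 if none is ranked."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in correct_docs:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, correct_by_query):
    """Average reciprocal rank over every query that has a judgment.

    rankings maps query_id to a ranked list of doc_ids; queries missing
    from rankings are scored as 0.
    """
    if not correct_by_query:
        return 0.0
    total = sum(
        reciprocal_rank(rankings.get(query_id, []), docs)
        for query_id, docs in correct_by_query.items()
    )
    return total / len(correct_by_query)


# Made-up rankings and judgments for illustration only.
rankings = {"q1": ["a", "b"], "q2": ["c", "d", "e"]}
correct = {"q1": {"b"}, "q2": {"x"}, "q3": {"z"}}
mrr = mean_reciprocal_rank(rankings, correct)
```

Storing the correct pages as a set per query handles queries that have more than one acceptable page, which the judgment count being higher than the query count suggests can occur.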

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Craswell2002TrecWeb}

Bibtex:

@inproceedings{Craswell2002TrecWeb, title={Overview of the TREC-2002 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2002} }

"gov/trec-web-2003"

The TREC Web Track 2003 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
GovWeb02Query: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 queries
[query_id]    [title]    [description]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
51K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     Not Relevant  51K    99.0%
1     Relevant      516    1.0%
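
A count and percentage breakdown like the one above can be recomputed from any iterable of judgments. The following is a small sketch with a hypothetical helper name and made-up judgments; it only relies on each item having a relevance attribute.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as the TrecQrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance: (count, percent)} ordered by relevance level."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


# Made-up judgments for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 1, "0"),
    TrecQrel("q2", "d4", 0, "0"),
]
distribution = relevance_distribution(sample_qrels)
```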

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Craswell2003TrecWeb}

Bibtex:

@inproceedings{Craswell2003TrecWeb, title={Overview of the TREC 2003 Web Track}, author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu}, booktitle={TREC}, year={2003} }

"gov/trec-web-2003/named-page"

The TREC Web Track 2003 named page ranking benchmark.

queries
300 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
352 qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition                Count  %
1     Name refers to this page  352    100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Craswell2003TrecWeb}

Bibtex:

@inproceedings{Craswell2003TrecWeb, title={Overview of the TREC 2003 Web Track}, author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu}, booktitle={TREC}, year={2003} }

"gov/trec-web-2004"

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

queries
225 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
89K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     Not Relevant  87K    98.0%
1     Relevant      1.8K   2.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Craswell2004TrecWeb}

Bibtex:

@inproceedings{Craswell2004TrecWeb, title={Overview of the TREC-2004 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2004} }