GitHub: datasets/gov2.py

ir_datasets: GOV2

Index
  1. gov2
  2. gov2/trec-mq-2007
  3. gov2/trec-mq-2008
  4. gov2/trec-tb-2004
  5. gov2/trec-tb-2005
  6. gov2/trec-tb-2005/efficiency
  7. gov2/trec-tb-2005/named-page
  8. gov2/trec-tb-2006
  9. gov2/trec-tb-2006/efficiency
  10. gov2/trec-tb-2006/efficiency/10k
  11. gov2/trec-tb-2006/efficiency/stream1
  12. gov2/trec-tb-2006/efficiency/stream2
  13. gov2/trec-tb-2006/efficiency/stream3
  14. gov2/trec-tb-2006/efficiency/stream4
  15. gov2/trec-tb-2006/named-page

Data Access Information

To use this dataset, you need a copy of GOV2, provided by the University of Glasgow.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to UoG to get a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets will need the GOV2_data directory.

ir_datasets expects the above directory to be copied/linked under ~/.ir_datasets/gov2/corpus.
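
Linking the directory can be scripted. The following is a minimal sketch, not part of ir_datasets: link_corpus is a hypothetical helper, and the mount point in the usage comment is an assumed example path. It creates a symbolic link at the corpus location named above that points to your GOV2_data directory.

```python
from pathlib import Path


def link_corpus(source, target):
    """Create a symlink at `target` that points to the corpus directory `source`.

    Hypothetical helper for illustration; ir_datasets does not provide it.
    """
    source = Path(source).expanduser().resolve()
    target = Path(target).expanduser()
    if not source.is_dir():
        raise FileNotFoundError(f"corpus directory not found: {source}")
    # is_symlink() also catches a dangling link, which exists() would miss.
    if target.is_symlink() or target.exists():
        raise FileExistsError(f"{target} already exists; remove it first")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.symlink_to(source, target_is_directory=True)
    return target


# Usage (assumed mount point; corpus_dir is the location named above):
# link_corpus("/mnt/gov2/GOV2_data", corpus_dir)
```

Copying the directory works equally well; a link simply avoids duplicating a very large collection.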


"gov2"

GOV2 web document collection. Used for the TREC Terabyte Track.

The dataset is obtained for a fee from UoG and is shipped as a hard drive. More information is provided under Data Access Information above.

docs | Metadata
25M docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.
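
Because body is raw bytes, it has to be decoded before any text processing. The sketch below is one simple way to do that, assuming UTF-8 with replacement of undecodable bytes is acceptable; pages in a web crawl may use other encodings, so treat this as a simplification. body_text and the Doc stand-in (which only mirrors the Gov2Doc fields) are illustrative names, and the sample record is made up.

```python
from collections import namedtuple

# Stand-in with the same fields as Gov2Doc, so the sketch runs without the corpus.
Doc = namedtuple("Doc", ["doc_id", "url", "http_headers", "body", "body_content_type"])


def body_text(doc, encoding="utf-8"):
    """Decode doc.body (bytes) to str, replacing bytes that cannot be decoded."""
    return doc.body.decode(encoding, errors="replace")


# Made-up record for demonstration only.
example = Doc(
    "GX000-00-0000000",
    "http://example.gov/",
    "HTTP/1.1 200 OK",
    b"<html>caf\xc3\xa9</html>",
    "text/html",
)
print(body_text(example))
```

With the real collection, the same function can be applied to each doc yielded by dataset.docs_iter().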


"gov2/trec-mq-2007"

TREC 2007 Million Query track.

queries | docs | qrels | Citation | Metadata
10K queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-mq-2008"

TREC 2008 Million Query track.

queries | docs | qrels | Citation | Metadata
10K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2004"

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
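
The relevance judgments for this benchmark are read with dataset.qrels_iter(). The sketch below groups judgments by query. It assumes each qrel exposes query_id, doc_id, and relevance, and it uses a stand-in Qrel record with made-up values so that it runs without the corpus; group_qrels is an illustrative helper, not an ir_datasets function.

```python
from collections import defaultdict, namedtuple

# Stand-in record; assumed to mirror the fields of the qrels yielded by qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def group_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} built from an iterable of qrels."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


# Made-up judgments for demonstration only.
sample = [
    Qrel("701", "GX000-00-0000001", 1),
    Qrel("701", "GX000-00-0000002", 0),
    Qrel("702", "GX000-00-0000003", 2),
]
judged = group_qrels(sample)
print(sorted(judged))  # query IDs that have at least one judgment
```

With the real data, pass dataset.qrels_iter() in place of sample.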


"gov2/trec-tb-2005"

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.


"gov2/trec-tb-2005/efficiency"

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.

queries | docs | qrels | Citation | Metadata
50K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
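
Since only 50 of the 50,000 queries have judgments, effectiveness evaluation usually needs to separate the judged queries from the rest. A minimal sketch, assuming the judged queries can be identified by matching query_id values between the queries and the qrels; judged_queries is an illustrative helper, and the records below are made-up stand-ins.

```python
from collections import namedtuple

# Stand-in records so the sketch runs without the corpus.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def judged_queries(queries, qrels):
    """Keep only the queries whose query_id appears in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up data for demonstration only.
queries = [Query("1", "pet therapy"), Query("2", "tax forms"), Query("3", "bee hives")]
qrels = [Qrel("2", "GX000-00-0000001", 1)]
print([q.query_id for q in judged_queries(queries, qrels)])
```

With the real data, pass dataset.queries_iter() and dataset.qrels_iter() instead of the sample lists.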


"gov2/trec-tb-2005/named-page"

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries | docs | qrels | Citation | Metadata
252 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006"

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency"

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.

queries | docs | qrels | Citation | Metadata
100K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency/10k"

Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.

queries | docs | Citation | Metadata
10K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/10k")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency/stream1"

Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries | docs | Citation | Metadata
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency/stream2"

Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries | docs | Citation | Metadata
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency/stream3"

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries | docs | qrels | Citation | Metadata
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/efficiency/stream4"

Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries | docs | Citation | Metadata
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"gov2/trec-tb-2006/named-page"

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries | docs | qrels | Citation | Metadata
181 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.