GitHub: datasets/gov2.py

ir_datasets: GOV2

Index
  1. gov2
  2. gov2/trec-mq-2007
  3. gov2/trec-mq-2008
  4. gov2/trec-tb-2004
  5. gov2/trec-tb-2005
  6. gov2/trec-tb-2005/efficiency
  7. gov2/trec-tb-2005/named-page
  8. gov2/trec-tb-2006
  9. gov2/trec-tb-2006/efficiency
  10. gov2/trec-tb-2006/efficiency/10k
  11. gov2/trec-tb-2006/efficiency/stream1
  12. gov2/trec-tb-2006/efficiency/stream2
  13. gov2/trec-tb-2006/efficiency/stream3
  14. gov2/trec-tb-2006/efficiency/stream4
  15. gov2/trec-tb-2006/named-page

"gov2"

GOV2 web document collection. Used for the TREC Terabyte Track.

The dataset is obtained for a fee from the University of Glasgow (UoG) and is shipped on a hard drive. More information is available from UoG.

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
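Because body is bytes, it has to be decoded before it can be treated as text. The following is a minimal sketch, not the library's own behavior: the Gov2Doc stand-in, the body_text helper, the sample doc_id, and the header layout are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Stand-in with the same field names as Gov2Doc, so the sketch runs without the corpus.
Gov2Doc = namedtuple('Gov2Doc', ['doc_id', 'url', 'http_headers', 'body', 'body_content_type'])

# Matches e.g. "charset=ISO-8859-1" anywhere in the raw header string.
_CHARSET = re.compile(r'charset\s*=\s*["\']?([\w.-]+)', re.IGNORECASE)

def body_text(doc, fallback='utf-8'):
    """Hypothetical helper: decode doc.body using a charset named in the headers."""
    match = _CHARSET.search(doc.http_headers)
    encoding = match.group(1) if match else fallback
    try:
        return doc.body.decode(encoding, errors='replace')
    except LookupError:
        # The headers named a charset Python does not know; use the fallback.
        return doc.body.decode(fallback, errors='replace')

# Illustrative record only; the values are made up.
sample = Gov2Doc(
    doc_id='GX000-00-0000000',
    url='http://example.gov/',
    http_headers='HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n',
    body='<html>caf\xe9</html>'.encode('latin-1'),
    body_content_type='text/html',
)
print(body_text(sample))
```

With the real collection, the same helper could be applied to each doc yielded by dataset.docs_iter().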

"gov2/trec-mq-2007"

TREC 2007 Million Query track.

queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2007')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2007')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2007')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

Citation

bibtex: @inproceedings{Allen2007MQ, title={Million Query Track 2007 Overview}, author={James Allan and Ben Carterette and Javed A. Aslam and Virgil Pavlu and Blagovest Dachev and Evangelos Kanoulas}, booktitle={TREC}, year={2007} }
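The TrecPrel records carry an iprob value alongside each judgment. As a rough sketch, and assuming iprob is the probability with which a document was sampled for judging (an assumption, not something stated on this page), an inverse-probability (Horvitz-Thompson style) estimate of the number of relevant documents per query could look like this. The TrecPrel stand-in, the helper name, and the sample values are hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as TrecPrel.
TrecPrel = namedtuple('TrecPrel', ['query_id', 'doc_id', 'relevance', 'method', 'iprob'])

def estimated_relevant_counts(prels):
    """Hypothetical helper: sum 1/iprob over judged-relevant docs, per query.

    Assumes iprob is an inclusion probability in (0, 1].
    """
    totals = defaultdict(float)
    for p in prels:
        if p.relevance > 0 and p.iprob > 0:
            totals[p.query_id] += 1.0 / p.iprob
    return dict(totals)

# Made-up judgments for illustration only.
sample = [
    TrecPrel('1', 'D1', 1, 1, 0.5),
    TrecPrel('1', 'D2', 0, 1, 0.5),
    TrecPrel('1', 'D3', 2, 0, 1.0),
    TrecPrel('2', 'D4', 1, 1, 0.25),
]
print(estimated_relevant_counts(sample))
```

The meaning of the method codes is not covered here, so the sketch ignores that field.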

"gov2/trec-mq-2008"

TREC 2008 Million Query track.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2008')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2008')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-mq-2008')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

Citation

bibtex: @inproceedings{Allen2008MQ, title={Million Query Track 2008 Overview}, author={James Allan and Javed A. Aslam and Ben Carterette and Virgil Pavlu and Evangelos Kanoulas}, booktitle={TREC}, year={2008} }

"gov2/trec-tb-2004"

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2004')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2004')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2004')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Clarke2004TrecTerabyte, title={Overview of the TREC 2004 Terabyte Track}, author={Charles Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2004} }
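For ad-hoc evaluation it is often convenient to index the TrecQrel records by query and then score a ranked list against them. A minimal sketch follows, using a stand-in namedtuple and made-up IDs; the helper names and the choice of precision at k with a threshold of 1 ("Relevant" or better) are illustrative assumptions.

```python
from collections import namedtuple

# Stand-in with the same field names as TrecQrel.
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

def index_qrels(qrels):
    """Hypothetical helper: build {query_id: {doc_id: relevance}}."""
    indexed = {}
    for q in qrels:
        indexed.setdefault(q.query_id, {})[q.doc_id] = q.relevance
    return indexed

def precision_at_k(ranking, judged, k, min_rel=1):
    """Fraction of the top-k docs whose judged relevance is >= min_rel.

    Unjudged documents count as not relevant.
    """
    hits = sum(1 for doc_id in ranking[:k] if judged.get(doc_id, 0) >= min_rel)
    return hits / k

# Made-up judgments and ranking for illustration only.
qrels = [
    TrecQrel('701', 'D1', 2, '0'),
    TrecQrel('701', 'D2', 0, '0'),
    TrecQrel('701', 'D3', 1, '0'),
]
judged = index_qrels(qrels)['701']
ranking = ['D1', 'D2', 'D4', 'D3']
print(precision_at_k(ranking, judged, k=2))
```

With the real data, index_qrels could be fed directly from dataset.qrels_iter().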

"gov2/trec-tb-2005"

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }

"gov2/trec-tb-2005/efficiency"

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.
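Since only the 50 topics have judgments, effectiveness evaluation typically needs the judged subset of the queries. One way to select it, sketched here with stand-in namedtuples and made-up records (the helper name is hypothetical), is to keep the queries whose query_id appears in the qrels:

```python
from collections import namedtuple

# Stand-ins with the same field names as GenericQuery and TrecQrel.
GenericQuery = namedtuple('GenericQuery', ['query_id', 'text'])
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

def judged_queries(queries, qrels):
    """Hypothetical helper: keep only queries that have at least one judgment."""
    judged_ids = {q.query_id for q in qrels}
    return [q for q in queries if q.query_id in judged_ids]

# Made-up records for illustration only.
queries = [
    GenericQuery('1', 'first query'),
    GenericQuery('2', 'second query'),
    GenericQuery('3', 'third query'),
]
qrels = [TrecQrel('2', 'D1', 1, '0'), TrecQrel('2', 'D2', 0, '0')]
print(judged_queries(queries, qrels))
```

With the real data, dataset.queries_iter() and dataset.qrels_iter() could be passed in place of the sample lists.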

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/efficiency')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/efficiency')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/efficiency')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }

"gov2/trec-tb-2005/named-page"

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/named-page')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/named-page')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2005/named-page')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }
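Named-page tasks are commonly scored by the rank of the first acceptable page, so a mean reciprocal rank (MRR) sketch may be useful here. This is an illustrative assumption about how one might evaluate, not a statement of the track's official measure; the stand-in namedtuple, helper names, and sample IDs are hypothetical. Because the judgments include near-duplicates, any judged-relevant page counts as a hit.

```python
from collections import namedtuple

# Stand-in with the same field names as TrecQrel.
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant doc in the ranking, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(run, qrels):
    """Average reciprocal rank over queries that have at least one relevant doc.

    run maps query_id -> ranked list of doc_ids; missing queries score 0.0.
    """
    relevant = {}
    for q in qrels:
        if q.relevance > 0:
            relevant.setdefault(q.query_id, set()).add(q.doc_id)
    if not relevant:
        return 0.0
    scores = [reciprocal_rank(run.get(qid, []), docs) for qid, docs in relevant.items()]
    return sum(scores) / len(scores)

# Made-up judgments and rankings for illustration only.
qrels = [
    TrecQrel('NP1', 'A', 1, '0'),
    TrecQrel('NP1', 'A_dup', 1, '0'),
    TrecQrel('NP2', 'B', 1, '0'),
    TrecQrel('NP3', 'C', 1, '0'),
]
run = {'NP1': ['X', 'A_dup', 'A'], 'NP2': ['B'], 'NP3': ['Y']}
print(mean_reciprocal_rank(run, qrels))
```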

"gov2/trec-tb-2006"

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }
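The three relevance levels (0, 1, 2) lend themselves to graded measures. As one possible sketch, nDCG at k can be computed from a {doc_id: relevance} mapping. The gain 2^rel - 1 and the log2 rank discount are a common convention chosen here as an assumption, not something prescribed by this page; the helper names and sample IDs are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2(rank + 1) discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranking, judged, k):
    """nDCG@k for one query; unjudged docs are treated as relevance 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up judgments for illustration only.
judged = {'D1': 2, 'D2': 1, 'D3': 0}
print(ndcg_at_k(['D1', 'D2', 'D3'], judged, k=3))  # ideal order
print(ndcg_at_k(['D2', 'D1', 'D3'], judged, k=3))  # top two swapped, so lower
```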

"gov2/trec-tb-2006/efficiency"

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/10k"

Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/10k')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/10k')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

"gov2/trec-tb-2006/efficiency/stream1"

Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream1')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream1')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

"gov2/trec-tb-2006/efficiency/stream2"

Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream2')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream2')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

"gov2/trec-tb-2006/efficiency/stream3"

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream3')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream3')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant
2     Highly Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream3')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

"gov2/trec-tb-2006/efficiency/stream4"

Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream4')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/efficiency/stream4')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
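The four stream datasets share one naming pattern, so their IDs can be generated rather than typed out. The sketch below is self-contained: it takes the loader as a parameter (with the real library this would be ir_datasets.load) and demonstrates the loop with a fake loader. The helper and the fake classes are hypothetical. Four streams of 25,000 queries give 100,000, which matches the total stated for gov2/trec-tb-2006/efficiency, assuming the streams do not overlap.

```python
BASE = 'gov2/trec-tb-2006/efficiency'
STREAM_IDS = [f'{BASE}/stream{i}' for i in range(1, 5)]
QUERIES_PER_STREAM = 25_000

def iter_stream_queries(load, dataset_ids=STREAM_IDS):
    """Hypothetical helper: yield (dataset_id, query) pairs across all streams.

    load is any callable returning an object with queries_iter(),
    for example ir_datasets.load.
    """
    for dataset_id in dataset_ids:
        for query in load(dataset_id).queries_iter():
            yield dataset_id, query

class FakeDataset:
    """Tiny stand-in so the sketch runs without the corpus."""
    def __init__(self, dataset_id):
        self.dataset_id = dataset_id

    def queries_iter(self):
        # Two made-up queries per stream.
        yield ('q1', 'first')
        yield ('q2', 'second')

pairs = list(iter_stream_queries(FakeDataset))
print(STREAM_IDS)
print(QUERIES_PER_STREAM * len(STREAM_IDS))
```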

"gov2/trec-tb-2006/named-page"

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/named-page')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/named-page')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant
1     Relevant

Example

import ir_datasets
dataset = ir_datasets.load('gov2/trec-tb-2006/named-page')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex: @inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }