GitHub: datasets/gov2.py

ir_datasets: GOV2

Index
  1. gov2
  2. gov2/trec-mq-2007
  3. gov2/trec-mq-2008
  4. gov2/trec-tb-2004
  5. gov2/trec-tb-2005
  6. gov2/trec-tb-2005/efficiency
  7. gov2/trec-tb-2005/named-page
  8. gov2/trec-tb-2006
  9. gov2/trec-tb-2006/efficiency
  10. gov2/trec-tb-2006/efficiency/10k
  11. gov2/trec-tb-2006/efficiency/stream1
  12. gov2/trec-tb-2006/efficiency/stream2
  13. gov2/trec-tb-2006/efficiency/stream3
  14. gov2/trec-tb-2006/efficiency/stream4
  15. gov2/trec-tb-2006/named-page

Data Access Information

To use this dataset, you need a copy of GOV2, provided by the University of Glasgow.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to UoG to get a copy. The data are provided as hard drives that are shipped to you.

Once you have the data, ir_datasets will need the GOV2_data directory.

ir_datasets expects the above directory to be copied/linked under ~/.ir_datasets/gov2/corpus.
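
The linking step can be scripted. The following is a minimal sketch, not part of ir_datasets: the helper name `link_corpus` is hypothetical, and it assumes the GOV2_data directory from the drive is already reachable on the local file system and that the platform supports symbolic links. It refuses to overwrite anything that already exists at the target.

```python
from pathlib import Path


def link_corpus(source, target):
    """Symlink `target` to the GOV2_data directory at `source`.

    Hypothetical helper. Returns the target path. Raises if `source` is not
    a directory or if `target` already exists, rather than overwriting it.
    """
    source = Path(source).expanduser().resolve()
    target = Path(target).expanduser()
    if not source.is_dir():
        raise NotADirectoryError(f"GOV2_data directory not found: {source}")
    if target.exists() or target.is_symlink():
        raise FileExistsError(f"corpus path already exists: {target}")
    # Create the parent directories of the corpus path if they are missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.symlink_to(source, target_is_directory=True)
    return target
```

Call it with the location of GOV2_data as `source` and the corpus path given above as `target`. Copying the directory instead of linking it should work equally well if symlinks are not an option.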


"gov2"

GOV2 web document collection. Used for the TREC Terabyte Track.

The dataset is obtained for a fee from UoG and is shipped as a hard drive. See Data Access Information above for details.

docs
25M docs

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.
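
Because `body` is raw bytes, it usually has to be decoded before text processing. The sketch below shows one way to do that. It rests on assumptions that are not stated on this page: that a charset, when present, is declared in `http_headers`, and that latin-1 is an acceptable fallback. `Doc` is a stand-in with the same fields as Gov2Doc, `body_text` is a hypothetical helper, and the example record is made up.

```python
import re
from collections import namedtuple

# Stand-in with the same fields as Gov2Doc, so the sketch runs without the corpus.
Doc = namedtuple("Doc", ["doc_id", "url", "http_headers", "body", "body_content_type"])

# Matches e.g. "charset=utf-8" or 'charset="ISO-8859-1"' in a header block.
_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def body_text(doc, fallback="latin-1"):
    """Decode doc.body to str, guessing the charset from the HTTP headers."""
    match = _CHARSET.search(doc.http_headers or "")
    encoding = match.group(1) if match else fallback
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:  # the headers named a charset Python does not know
        return doc.body.decode(fallback, errors="replace")


example = Doc(
    doc_id="EXAMPLE-0001",  # made-up ID, not a real GOV2 document
    url="http://example.gov/",
    http_headers="HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n",
    body="<html>caf\u00e9</html>".encode("utf-8"),
    body_content_type="text/html",
)
print(body_text(example))
```

With the corpus in place, the same function could be applied to each `doc` yielded by `dataset.docs_iter()`.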

CLI
ir_datasets export gov2 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"gov2/trec-mq-2007"

TREC 2007 Million Query track.

queries
10K queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2007 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2007 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
73K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     54K    74.4%
1     Relevant         15K    20.1%
2     Highly Relevant  4.0K   5.5%
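
A distribution like the one above can be recomputed from the judgments themselves. This is a minimal sketch: `Prel` is a stand-in with the same fields as TrecPrel, `relevance_distribution` is a hypothetical helper, and the four rows are made up so that the code runs without the dataset.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as TrecPrel; the rows below are made up.
Prel = namedtuple("Prel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def relevance_distribution(qrels):
    """Return {relevance level: (count, percent of all judgments)}."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, round(100.0 * count / total, 1))
        for level, count in sorted(counts.items())
    }


toy_qrels = [
    Prel("q1", "d1", 0, 1, 0.5),
    Prel("q1", "d2", 1, 1, 0.5),
    Prel("q2", "d3", 0, 1, 0.25),
    Prel("q2", "d4", 2, 1, 0.25),
]
print(relevance_distribution(toy_qrels))
```

Passing `dataset.qrels_iter()` instead of `toy_qrels` should give the per-level counts for the real judgments.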

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2007 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Allen2007MQ}

Bibtex:

@inproceedings{Allen2007MQ, title={Million Query Track 2007 Overview}, author={James Allan and Ben Carterette and Javed A. Aslam and Virgil Pavlu and Blagovest Dachev and Evangelos Kanoulas}, booktitle={TREC}, year={2007} }

"gov2/trec-mq-2008"

TREC 2008 Million Query track.

queries
10K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2008 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2008 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
15K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     12K    80.7%
1     Relevant         2.9K   19.3%
2     Highly Relevant  0      0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-mq-2008 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Allen2008MQ}

Bibtex:

@inproceedings{Allen2008MQ, title={Million Query Track 2008 Overview}, author={James Allan and Javed A. Aslam and Ben Carterette and Virgil Pavlu and Evangelos Kanoulas}, booktitle={TREC}, year={2008} }

"gov2/trec-tb-2004"

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
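
TrecQuery carries three text fields, so code that needs a single query string has to choose which of them to use. The sketch below joins a configurable set of fields. `Topic` is a stand-in with the same fields as TrecQuery, `query_text` is a hypothetical helper, and the sample topic is made up. Defaulting to the title alone is an assumption, not a recommendation from this page.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecQuery; the topic below is made up.
Topic = namedtuple("Topic", ["query_id", "title", "description", "narrative"])


def query_text(query, fields=("title",)):
    """Join the selected topic fields into one string, skipping empty ones."""
    parts = (getattr(query, field).strip() for field in fields)
    return " ".join(part for part in parts if part)


topic = Topic("0", "sample title", "  A longer description.  ", "")
print(query_text(topic))
print(query_text(topic, fields=("title", "description")))
```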

CLI
ir_datasets export gov2/trec-tb-2004 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2004 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
58K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     47K    81.7%
1     Relevant         9.3K   16.1%
2     Highly Relevant  1.3K   2.2%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
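
Evaluation code often needs the judgments grouped by query rather than as a flat stream. The following sketch builds such a lookup. `Qrel` is a stand-in with the same fields as TrecQrel, `qrels_by_query` and its `min_relevance` parameter are hypothetical, and the rows are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as TrecQrel; the rows below are made up.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels, min_relevance=1):
    """Map query_id -> {doc_id: relevance} for judgments >= min_relevance."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


toy_qrels = [
    Qrel("701", "doc-a", 0, "0"),
    Qrel("701", "doc-b", 2, "0"),
    Qrel("702", "doc-c", 1, "0"),
]
print(qrels_by_query(toy_qrels))
```

Setting `min_relevance=0` keeps the non-relevant judgments as well, which matters when judged and unjudged documents need to be told apart.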

CLI
ir_datasets export gov2/trec-tb-2004 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2004TrecTerabyte}

Bibtex:

@inproceedings{Clarke2004TrecTerabyte, title={Overview of the TREC 2004 Terabyte Track}, author={Charles Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2004} }

"gov2/trec-tb-2005"

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
45K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     35K    77.0%
1     Relevant         7.8K   17.2%
2     Highly Relevant  2.6K   5.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2005TrecTerabyte}

Bibtex:

@inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }

"gov2/trec-tb-2005/efficiency"

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.
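
Since only a small share of these queries has judgments, effectiveness evaluation needs to restrict the query set first. The sketch below filters queries down to those that appear in the qrels. `Query` and `Qrel` are stand-ins with the same fields as GenericQuery and TrecQrel, `judged_queries` is a hypothetical helper, and the toy data is made up.

```python
from collections import namedtuple

# Stand-ins with the same fields as GenericQuery and TrecQrel; toy data only.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Yield only the queries whose query_id appears in the qrels."""
    judged_ids = {qrel.query_id for qrel in qrels}
    for query in queries:
        if query.query_id in judged_ids:
            yield query


toy_queries = [Query("1", "first"), Query("2", "second"), Query("3", "third")]
toy_qrels = [Qrel("2", "doc-a", 1, "0"), Qrel("2", "doc-b", 0, "0")]
print([query.query_id for query in judged_queries(toy_queries, toy_qrels)])
```

On the real dataset the two arguments would be `dataset.queries_iter()` and `dataset.qrels_iter()`. The full query stream remains useful for timing runs, which do not need judgments.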

queries
50K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/efficiency queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/efficiency docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
45K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     35K    77.0%
1     Relevant         7.8K   17.2%
2     Highly Relevant  2.6K   5.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/efficiency qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2005TrecTerabyte}

Bibtex:

@inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }

"gov2/trec-tb-2005/named-page"

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries
252 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/named-page queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
12K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     Not Relevant  0      0.0%
1     Relevant      12K    100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2005/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2005TrecTerabyte}

Bibtex:

@inproceedings{Clarke2005TrecTerabyte, title={The TREC 2005 Terabyte Track}, author={Charles L. A. Clarke and Falk Scholer and Ian Soboroff}, booktitle={TREC}, year={2005} }

"gov2/trec-tb-2006"

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
32K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     26K    81.6%
1     Relevant         5.5K   17.1%
2     Highly Relevant  426    1.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency"

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.

queries
100K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
32K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     26K    81.6%
1     Relevant         5.5K   17.1%
2     Highly Relevant  426    1.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/10k"

Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.

queries
10K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/10k")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/10k queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/10k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/10k docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/stream1"

Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream1 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/stream2"

Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream2 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream2 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/stream3"

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
32K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition       Count  %
0     Not Relevant     26K    81.6%
1     Relevant         5.5K   17.1%
2     Highly Relevant  426    1.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/efficiency/stream4"

Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).

queries
25K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream4 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/efficiency/stream4 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }

"gov2/trec-tb-2006/named-page"

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

queries
181 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/named-page queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
25M docs

Inherits docs from gov2

Language: en

Document type:
Gov2Doc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
2.4K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition    Count  %
0     Not Relevant  1.6K   65.8%
1     Relevant      807    34.2%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov2/trec-tb-2006/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Buttcher2006TrecTerabyte}

Bibtex:

@inproceedings{Buttcher2006TrecTerabyte, title={The TREC 2006 Terabyte Track}, author={Stefan B\"uttcher and Charles L. A. Clarke and Ian Soboroff}, booktitle={TREC}, year={2006} }