ir_datasets: GOV2

To use this dataset, you need a copy of GOV2, provided by the University of Glasgow.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to the University of Glasgow to obtain a copy. The data are provided as hard drives that are shipped to you.
Once you have the data, ir_datasets will need the GOV2_data directory. ir_datasets expects this directory to be copied or linked under ~/.ir_datasets/gov2/corpus.
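For example, a minimal sketch for linking the data into place, assuming you copied the shipped GOV2_data directory to the hypothetical location /mnt/gov2/GOV2_data:

import os

# Hypothetical source path: adjust to wherever you copied the shipped data.
src = "/mnt/gov2/GOV2_data"
dst = os.path.expanduser("~/.ir_datasets/gov2/corpus")

os.makedirs(os.path.dirname(dst), exist_ok=True)  # ensure ~/.ir_datasets/gov2 exists
if not os.path.exists(dst):
    os.symlink(src, dst)  # a symlink avoids duplicating ~400GB of data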
GOV2 web document collection. Used for the TREC Terabyte Track.
The dataset is obtained for a fee from the University of Glasgow and is shipped as a hard drive. More information is provided here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
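In addition to full iteration, ir_datasets supports slicing docs_iter() and random access by document ID through a docs_store. A minimal sketch (the document ID below is only illustrative of the GOV2 "GX..." format and is not guaranteed to exist):

import ir_datasets

dataset = ir_datasets.load("gov2")

# Slice the iterator rather than looping over all ~25M documents.
for doc in dataset.docs_iter()[:10]:
    print(doc.doc_id, doc.url)

# Random access by doc_id (illustrative ID; replace with a real one).
docs_store = dataset.docs_store()
doc = docs_store.get("GX000-00-0000000")
print(doc.body_content_type)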
TREC 2007 Million Query track.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2007 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2007 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2007")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2007 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
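The extra method and iprob fields reflect the Million Query track's statistical judging: method records which assessment procedure produced the judgment, and iprob is a sampling (inclusion) probability. A minimal sketch that collects the judged-relevant documents per query, assuming relevance > 0 counts as relevant:

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("gov2/trec-mq-2007")

# Map each query to the set of doc_ids judged relevant (relevance > 0).
relevant = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance > 0:
        relevant[qrel.query_id].add(qrel.doc_id)

print(len(relevant), "queries have at least one relevant document")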
TREC 2008 Million Query track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2008 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2008 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-mq-2008")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export gov2/trec-mq-2008 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2004 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
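TREC topics carry three fields of increasing length; a common choice is to use the short title as the query string, keeping the description and narrative as fuller statements of the information need. A minimal sketch:

import ir_datasets

dataset = ir_datasets.load("gov2/trec-tb-2004")

# Use the short `title` field as the query text; `description` and
# `narrative` are progressively longer statements of the information need.
queries = {q.query_id: q.title for q in dataset.queries_iter()}
print(len(queries), "topics")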
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2004 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2004 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
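If you need the judgments as a lookup table rather than a stream, ir_datasets also offers qrels_dict(), which returns a nested {query_id: {doc_id: relevance}} mapping (the iteration field is dropped). A minimal sketch:

import ir_datasets

dataset = ir_datasets.load("gov2/trec-tb-2004")

# Nested mapping {query_id: {doc_id: relevance}}, convenient for evaluation tools.
qrels = dataset.qrels_dict()
print(len(qrels), "judged topics")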
The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only those 50 topics have judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/efficiency queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
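Since only the 50 ad-hoc topics among the 50,000 queries have judgments, you may want to restrict evaluation to the judged subset. A minimal sketch that keeps only the queries appearing in the qrels:

import ir_datasets

dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")

# Only the 50 ad-hoc topics have judgments; keep just those queries.
judged_qids = {qrel.query_id for qrel in dataset.qrels_iter()}
judged_queries = [q for q in dataset.queries_iter() if q.query_id in judged_qids]
print(len(judged_queries), "of 50,000 queries have judgments")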
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/efficiency docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/efficiency")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/efficiency qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/named-page queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/named-page docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2005/named-page qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
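Because named-page finding is effectively a known-item task with a small set of acceptable pages (near-duplicates included), it can be handy to map each query to the pages judged as satisfying it. A minimal sketch:

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("gov2/trec-tb-2005/named-page")

# For each named-page query, collect the pages judged as satisfying it
# (near-duplicates of the target page are also marked relevant).
accepted = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance > 0:
        accepted[qrel.query_id].add(qrel.doc_id)

sizes = [len(pages) for pages in accepted.values()]
print("mean accepted pages per query:", sum(sizes) / len(sizes))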
The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only those 50 topics have judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/10k")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/10k queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
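The efficiency query set is split into subsets and streams; to confirm their sizes you can simply count the queries. A minimal sketch over a couple of the subsets (counting by iteration is a simple, if slow, sanity check):

import ir_datasets

for name in ["gov2/trec-tb-2006/efficiency/10k",
             "gov2/trec-tb-2006/efficiency/stream1"]:
    dataset = ir_datasets.load(name)
    n = sum(1 for _ in dataset.queries_iter())  # count queries in this subset
    print(name, n)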
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/10k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/10k docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream1 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream1 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream2 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream2 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
2 | Highly Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream4 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/efficiency/stream4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/efficiency/stream4 docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/named-page queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from gov2
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/named-page docs
[doc_id] [url] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant |
1 | Relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("gov2/trec-tb-2006/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov2/trec-tb-2006/named-page qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier