ir_datasets: GOV2
GOV2 web document collection. Used for the TREC Terabyte Track.
The dataset is obtained for a fee from the University of Glasgow (UoG) and is shipped as a hard drive. More information is provided here.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('gov2')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
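As a rough sketch of how the document fields listed above might be used, the snippet below tallies `body_content_type` values over a bounded number of documents. The helper `content_type_counts`, the stand-in `Doc` namedtuple, and the sample records (IDs, URLs, bodies) are hypothetical and exist only so the example runs without the GOV2 hard drive; only the field names come from the example above.

```python
from collections import Counter, namedtuple
from itertools import islice

# Hypothetical stand-in with the same field names as the namedtuple shown above.
Doc = namedtuple("Doc", ["doc_id", "url", "http_headers", "body", "body_content_type"])

def content_type_counts(docs, limit=1000):
    """Count body_content_type values over at most `limit` documents."""
    return Counter(doc.body_content_type for doc in islice(docs, limit))

# Made-up records for illustration; these are not real GOV2 documents.
sample_docs = [
    Doc("doc-1", "http://example.gov/a", b"", b"<html>a</html>", "text/html"),
    Doc("doc-2", "http://example.gov/b", b"", b"%PDF-", "application/pdf"),
    Doc("doc-3", "http://example.gov/c", b"", b"<html>c</html>", "text/html"),
]

counts = content_type_counts(sample_docs)
print(counts)
```

With a local copy of the collection in place, passing `ir_datasets.load('gov2').docs_iter()` instead of `sample_docs` should work the same way; the `limit` argument keeps the scan short on a collection of this size.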
TREC 2007 Million Query track.
TREC 2008 Million Query track.
The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.
The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.
The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.
Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.
Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).
Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).
Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).
Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).
The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.
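For the benchmarks above that carry relevance judgments, a quick sanity check is to count how many documents were judged per query. The sketch below assumes each qrels record exposes `query_id`, `doc_id`, and `relevance` attributes; the helper `judged_per_query`, the stand-in `Qrel` namedtuple, and the sample judgments are hypothetical and only serve to make the example runnable without the data.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a qrels record (field names are an assumption).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])

def judged_per_query(qrels, min_relevance=1):
    """Map each query_id to (judged count, count with relevance >= min_relevance)."""
    judged = Counter()
    relevant = Counter()
    for qrel in qrels:
        judged[qrel.query_id] += 1
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id] += 1
    return {qid: (judged[qid], relevant[qid]) for qid in judged}

# Made-up judgments for illustration; these are not real TREC qrels.
sample_qrels = [
    Qrel("q1", "doc-1", 1),
    Qrel("q1", "doc-2", 0),
    Qrel("q2", "doc-3", 2),
]

summary = judged_per_query(sample_qrels)
print(summary)
```

With the data available locally, the same helper could be fed `ir_datasets.load('gov2/trec-tb-2005').qrels_iter()`. For the efficiency subsets, only the 50 ad-hoc topics would be expected to appear in the result, since the remaining queries have no judgments.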