ir_datasets: ClueWeb09
ClueWeb 2009 web document collection. Contains over 1B web pages in 10 languages.
The dataset is obtained from CMU for a fee and is shipped on hard drives. More information is provided here.
Language: multiple/other/unknown
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
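Because the full collection holds over 1B pages, it can help to inspect only the first few records before committing to a full pass. The following is a minimal sketch, not part of ir_datasets: `peek` is a hypothetical helper, and `fake_docs` is a stand-in generator used so the snippet runs without the licensed corpus.

```python
from itertools import islice


def peek(iterable, n=5):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(iterable, n))


# Stand-in for dataset.docs_iter(); the real iterator needs the ClueWeb09 drives.
fake_docs = (f"doc-{i}" for i in range(1_000_000))

# Only three items are pulled from the generator; the rest are never produced.
first_three = peek(fake_docs, 3)
print(first_three)
```

With the corpus available, the same helper could presumably be applied as `peek(dataset.docs_iter(), 3)`.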
Subset of ClueWeb09 with only Arabic-language documents.
Language: ar
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/ar')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
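Evaluation code usually wants the judgments indexed by query rather than as a flat stream. The sketch below shows one way to do that. It assumes only the field names listed above: `Qrel` is a hypothetical stand-in for the namedtuple yielded by `qrels_iter()`, `group_qrels` is an illustrative helper, and the sample IDs and the `method` and `iprob` values are invented placeholders.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in mirroring the documented qrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def group_qrels(qrels):
    """Build a mapping of query_id -> {doc_id: relevance}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Invented sample judgments; real values come from dataset.qrels_iter().
sample = [
    Qrel("1", "docA", 2, 1, 1.0),
    Qrel("1", "docB", 0, 1, 0.5),
    Qrel("2", "docC", 1, 1, 1.0),
]

by_query = group_qrels(sample)
print(by_query)
```

The nested dictionary keeps only the relevance grade; the `method` and `iprob` fields would need to be carried along separately if an evaluation measure depends on them.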
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
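The six-level scale in the table above is often collapsed to a binary relevant/non-relevant label before computing simple measures. The following sketch illustrates this under stated assumptions: `Qrel` is a hypothetical stand-in with the documented field names, `GRADE_LABELS` copies the short labels from the table, and the cut-off `min_grade=1` is an assumed choice, not something this page prescribes.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the documented qrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Short labels taken from the relevance table above.
GRADE_LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def is_relevant(qrel, min_grade=1):
    """Treat a judgment as relevant when its grade reaches min_grade (assumed cut-off)."""
    return qrel.relevance >= min_grade


# Invented sample judgments; real values come from dataset.qrels_iter().
sample = [
    Qrel("51", "docA", -2, "0"),
    Qrel("51", "docB", 0, "0"),
    Qrel("51", "docC", 1, "0"),
    Qrel("51", "docD", 4, "0"),
]

labels = [GRADE_LABELS[q.relevance] for q in sample]
relevant_docs = [q.doc_id for q in sample if is_relevant(q)]
print(labels, relevant_docs)
```

Raising `min_grade` to 2 would count only HRel, Key, and Nav pages as relevant, which is a stricter reading of the same table.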
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
Subset of ClueWeb09 with only German-language documents.
Language: de
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/de')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only English-language documents.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
Subset of ClueWeb09 with only Spanish-language documents.
Language: es
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/es')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only French-language documents.
Language: fr
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/fr')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only Italian-language documents.
Language: it
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/it')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only Japanese-language documents.
Language: ja
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/ja')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only Korean-language documents.
Language: ko
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/ko')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Subset of ClueWeb09 with only Portuguese-language documents.
Language: pt
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/pt')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
TREC 2009 Million Query track.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
Language: multiple/other/unknown
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
Subset of ClueWeb09 with only Chinese-language documents.
Language: zh
Example
import ir_datasets
dataset = ir_datasets.load('clueweb09/zh')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
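The language-specific subsets on this page all use identifiers of the form `clueweb09/<code>`. As a small sketch, the helper below builds such an identifier and rejects codes that this page does not list. `LANGUAGE_SUBSETS` and `subset_id` are hypothetical names introduced for illustration, not part of ir_datasets.

```python
# Language codes of the subsets documented on this page.
LANGUAGE_SUBSETS = ("ar", "de", "en", "es", "fr", "it", "ja", "ko", "pt", "zh")


def subset_id(lang):
    """Return the dataset identifier for a language subset, e.g. 'clueweb09/de'."""
    if lang not in LANGUAGE_SUBSETS:
        raise ValueError(f"no ClueWeb09 subset documented for language {lang!r}")
    return f"clueweb09/{lang}"


# Build the identifiers for every documented language subset.
all_ids = [subset_id(code) for code in LANGUAGE_SUBSETS]
print(all_ids)
```

The resulting string could then be passed to `ir_datasets.load`, as in the examples above.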