ir_datasets: ClueWeb09
To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.
Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided as hard drives that are shipped to you.
Once you have the data, ir_datasets will need the directories that look like the following:
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
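If the data already lives elsewhere on disk, a symbolic link avoids copying it. The following is a minimal sketch under stated assumptions: `link_corpus` is a hypothetical helper (not part of ir_datasets), the source path in the usage note is a placeholder, and the home directory is assumed to be the default `~/.ir_datasets`.

```python
from pathlib import Path


def link_corpus(source_dir, ir_datasets_home=None):
    """Symlink an existing ClueWeb09 copy to <home>/clueweb09/corpus.

    Hypothetical helper for illustration; not part of ir_datasets.
    """
    # Assumption: ir_datasets uses ~/.ir_datasets unless told otherwise.
    home = Path(ir_datasets_home) if ir_datasets_home else Path.home() / ".ir_datasets"
    target = home / "clueweb09" / "corpus"
    # Create <home>/clueweb09 if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        raise FileExistsError(f"{target} already exists; remove it first")
    # Point the expected location at the directory that holds the data.
    target.symlink_to(Path(source_dir).resolve(), target_is_directory=True)
    return target
```

A call such as `link_corpus("/path/to/your/clueweb09")` (placeholder path) would then make the data visible at the location ir_datasets expects.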
ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.
The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
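With over 1B pages, a full pass over `docs_iter()` is slow, so it can help to look at only the first few documents. A minimal sketch, assuming a hypothetical helper named `peek` built on the standard library's `itertools.islice`:

```python
from itertools import islice


def peek(iterable, n=3):
    """Return at most the first n items; only n items are pulled from the iterator."""
    return list(islice(iterable, n))


# With the corpus in place, this would inspect the first three documents:
#   import ir_datasets
#   dataset = ir_datasets.load("clueweb09")
#   for doc in peek(dataset.docs_iter()):
#       print(doc.doc_id, doc.url)
```

The same helper works unchanged on `queries_iter()` and `qrels_iter()`, since it only relies on plain iteration.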
Subset of ClueWeb09 with only Arabic-language documents.
Language: ar
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ar docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
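Evaluation code often needs judgments grouped by query rather than as a flat stream. A minimal sketch, assuming a hypothetical helper `qrels_to_dict` that relies only on the `query_id`, `doc_id`, and `relevance` fields listed for the qrels above:

```python
from collections import defaultdict


def qrels_to_dict(qrels):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        # A later judgment for the same (query, doc) pair overwrites an earlier one.
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Intended use once the dataset is available:
#   judgments = qrels_to_dict(dataset.qrels_iter())
```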
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
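To see how judgments are spread across these levels, the numeric values can be mapped back to their short labels and tallied. A minimal sketch: `LABELS` and `count_by_level` are hypothetical names, and the mapping is copied from the table above.

```python
from collections import Counter

# Short labels copied from the relevance-level table above.
LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def count_by_level(qrels):
    """Count judgments per label; levels missing from LABELS are kept as str(level)."""
    return Counter(LABELS.get(q.relevance, str(q.relevance)) for q in qrels)


# Intended use once the dataset is available:
#   print(count_by_level(dataset.qrels_iter()))
```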
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
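Some evaluation measures expect binary labels rather than graded ones. A minimal sketch, assuming a threshold of 1 so that Rel and above count as relevant while Junk (-2) and Non (0) do not; the threshold is an illustrative choice, not something this page prescribes, and `is_relevant` is a hypothetical helper.

```python
def is_relevant(relevance, threshold=1):
    """Binary relevance: True when the graded level meets the threshold."""
    # Assumption: levels below the threshold, including Junk (-2), are non-relevant.
    return relevance >= threshold
```

Raising the threshold, for example `is_relevant(level, threshold=2)`, keeps only HRel, Key, and Nav pages.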
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only German-language documents.
Language: de
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/de docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only English-language documents.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from clueweb09/en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from clueweb09/en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from clueweb09/en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: en
Note: Uses docs from clueweb09/en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Spanish-language documents.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/es docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only French-language documents.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/fr docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Italian-language documents.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/it docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Japanese-language documents.
Language: ja
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ja docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Korean-language documents.
Language: ko
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ko docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Portuguese-language documents.
Language: pt
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/pt docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
TREC 2009 Million Query track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: multiple/other/unknown
Note: Uses docs from clueweb09
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
Subset of ClueWeb09 with only Chinese-language documents.
Language: zh
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/zh docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier