ir_datasets: ClueWeb09

To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to obtain a copy. The data are provided as hard drives that are shipped to you.

Once you have the data, ir_datasets expects the source directories to be copied or linked under ~/.ir_datasets/clueweb09/corpus.
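For instance, a minimal setup sketch in Python; /mnt/clueweb09 is a hypothetical mount point, so substitute wherever your copy of the drives actually lives:

# Minimal sketch: link (rather than copy) the ClueWeb09 source directories
# into the location ir_datasets expects. /mnt/clueweb09 is a hypothetical
# mount point; adjust it to wherever your copy lives.
from pathlib import Path

source = Path("/mnt/clueweb09")
target = Path.home() / ".ir_datasets" / "clueweb09" / "corpus"
target.mkdir(parents=True, exist_ok=True)

for src_dir in sorted(p for p in source.iterdir() if p.is_dir()):
    link = target / src_dir.name
    if not link.exists():
        link.symlink_to(src_dir)  # symlinks avoid duplicating terabytes of data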
ClueWeb 2009 web document collection. Contains over 1B web pages in 10 languages.
The dataset is obtained from CMU for a fee and is shipped on hard drives. More information is provided here.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
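Beyond full iteration, ir_datasets supports efficient slicing of docs_iter and random access by doc_id through a docstore. A short sketch; the doc_id shown is illustrative only:

import ir_datasets

dataset = ir_datasets.load("clueweb09")

# Fancy slicing fetches a range without scanning all ~1B documents.
for doc in dataset.docs_iter()[:3]:
    print(doc.doc_id, doc.url)

# Random access by doc_id; lookup structures are built lazily on first use.
docstore = dataset.docs_store()
doc = docstore.get("clueweb09-en0000-00-00000")  # illustrative doc_id
print(doc.url)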
{ "docs": { "count": 1040859705, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-" } } } }
Subset of ClueWeb09 with only Arabic-language documents.
Language: ar
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ar docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 29192662, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-ar000" } } } }
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 50220423, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } } }
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
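Each TREC Web Track topic carries a longer description, a type, and subtopics alongside the keyword query. A small sketch of reading those fields, using only the attributes of the namedtuple above:

import ir_datasets

dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")

# Print the full topic, not just the keyword query.
for query in dataset.queries_iter():
    print(query.query_id, query.query)
    print("  type:", query.type)
    print("  description:", query.description)
    for subtopic in query.subtopics:  # one entry per judged subtopic
        print("  subtopic:", subtopic)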
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 9.1K | 69.5% |
1 | relevant | 2.5K | 19.2% |
2 | highly relevant | 1.5K | 11.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
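Many evaluation tools expect judgments grouped as {query_id: {doc_id: relevance}}. A minimal sketch of building that mapping from the flat qrels stream; the printed totals should match the metadata below:

import ir_datasets
from collections import defaultdict

dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")

# Group the flat qrels stream by query.
qrels = defaultdict(dict)
for qrel in dataset.qrels_iter():
    qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

print(len(qrels), "judged queries")
print(sum(len(docs) for docs in qrels.values()), "judgments")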
Bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}

Metadata:
{ "docs": { "count": 50220423, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 13118, "fields": { "relevance": { "counts_by_value": { "0": 9116, "1": 2514, "2": 1488 } } } } }
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 715 | 4.5% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 76.0% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.3K | 14.6% |
2 | HRel: The content of this page provides substantial information on the topic. | 682 | 4.3% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 90 | 0.6% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
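The 2010-2012 tracks use graded labels (see the relevance levels above). For binary metrics, one common convention, not something the track mandates, is to treat labels >= 1 as relevant; a sketch:

import ir_datasets
from collections import Counter

dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")

# Collapse graded labels to binary: Rel/HRel/Key/Nav -> relevant, Non/Junk -> not.
counts = Counter(
    "relevant" if qrel.relevance >= 1 else "not relevant"
    for qrel in dataset.qrels_iter()
)
print(counts)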
Bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}

Metadata:
{ "docs": { "count": 50220423, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 15845, "fields": { "relevance": { "counts_by_value": { "0": 12040, "1": 2318, "-2": 715, "2": 682, "3": 90 } } } } }
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 499 | 3.8% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 11K | 83.5% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.1K | 8.4% |
2 | HRel: The content of this page provides substantial information on the topic. | 354 | 2.7% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 208 | 1.6% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}

Metadata:
{ "docs": { "count": 50220423, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 13081, "fields": { "relevance": { "counts_by_value": { "0": 10920, "1": 1100, "2": 354, "-2": 499, "3": 208 } } } } }
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 561 | 5.6% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 7.2K | 71.6% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.4K | 13.8% |
2 | HRel: The content of this page provides substantial information on the topic. | 300 | 3.0% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 17 | 0.2% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 580 | 5.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2012TrecWeb,
  title={Overview of the TREC 2012 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
  booktitle={TREC},
  year={2012}
}

Metadata:
{ "docs": { "count": 50220423, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 10022, "fields": { "relevance": { "counts_by_value": { "-2": 561, "0": 7178, "1": 1386, "4": 580, "2": 300, "3": 17 } } } } }
Subset of ClueWeb09 with only German-language documents.
Language: de
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/de docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 49814309, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-de00" } } } }
Subset of ClueWeb09 with only English-language documents.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 503903810, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } } }
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 17K | 70.9% |
1 | relevant | 4.8K | 20.5% |
2 | highly relevant | 2.0K | 8.6% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}

Metadata:
{ "docs": { "count": 503903810, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 23601, "fields": { "relevance": { "counts_by_value": { "0": 16743, "1": 4832, "2": 2026 } } } } }
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.4K | 5.6% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 19K | 73.7% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 4.0K | 15.9% |
2 | HRel: The content of this page provides substantial information on the topic. | 1.1K | 4.3% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 138 | 0.5% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}

Metadata:
{ "docs": { "count": 503903810, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 25329, "fields": { "relevance": { "counts_by_value": { "0": 18665, "1": 4018, "-2": 1431, "2": 1077, "3": 138 } } } } }
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.0K | 5.3% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 15K | 78.5% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.0K | 10.5% |
2 | HRel: The content of this page provides substantial information on the topic. | 711 | 3.7% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 408 | 2.1% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}

Metadata:
{ "docs": { "count": 503903810, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 19381, "fields": { "relevance": { "counts_by_value": { "0": 15205, "2": 711, "1": 2038, "-2": 1019, "3": 408 } } } } }
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 858 | 5.3% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 72.7% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.2K | 13.8% |
2 | HRel: The content of this page provides substantial information on the topic. | 405 | 2.5% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 52 | 0.3% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 858 | 5.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
Bibtex:
@inproceedings{Clarke2012TrecWeb,
  title={Overview of the TREC 2012 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
  booktitle={TREC},
  year={2012}
}

Metadata:
{ "docs": { "count": 503903810, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-en" } } }, "queries": { "count": 50 }, "qrels": { "count": 16055, "fields": { "relevance": { "counts_by_value": { "-2": 858, "0": 11674, "1": 2208, "4": 858, "2": 405, "3": 52 } } } } }
Subset of ClueWeb09 with only Spanish-language documents.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/es docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 79333950, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-es" } } } }
Subset of ClueWeb09 with only French-language documents.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/fr docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 50883172, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-fr" } } } }
Subset of ClueWeb09 with only Italian-language documents.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/it docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 27250729, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-it" } } } }
Subset of ClueWeb09 with only Japanese-language documents.
Language: ja
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ja docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 67337717, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-ja" } } } }
Subset of ClueWeb09 with only Korean-language documents.
Language: ko
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ko docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 18075141, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-ko000" } } } }
Subset of ClueWeb09 with only Portuguese-language documents.
Language: pt
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/pt docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 37578858, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-pt" } } } }
The TREC 2009 Million Query Track benchmark. Contains 40,000 queries with sampled relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id] [text]
...
You can find more details about the CLI here.
Inherits docs from clueweb09
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 26K | 74.1% |
1 | relevant | 5.9K | 17.0% |
2 | highly relevant | 3.1K | 9.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
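The MQ-2009 qrels carry two extra fields: method and iprob appear to record how each judgment was obtained, with iprob being the inclusion probability used by sampling-based estimators such as statAP (this reading of the fields is an assumption). A sketch that tallies judgments by method:

import ir_datasets
from collections import Counter

dataset = ir_datasets.load("clueweb09/trec-mq-2009")

# Tally judgments by the method field; iprob is also available per qrel.
methods = Counter(qrel.method for qrel in dataset.qrels_iter())
print(methods)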
Bibtex:
@inproceedings{Carterette2009MQ,
  title={Million Query Track 2009 Overview},
  author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas},
  booktitle={TREC},
  year={2009}
}

Metadata:
{ "docs": { "count": 1040859705, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-" } } }, "queries": { "count": 40000 }, "qrels": { "count": 34534, "fields": { "relevance": { "counts_by_value": { "0": 25586, "1": 5856, "2": 3092 } } } } }
Subset of ClueWeb09 with only Chinese-language documents.
Language: zh
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/zh docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
{ "docs": { "count": 177489357, "fields": { "doc_id": { "max_len": 25, "common_prefix": "clueweb09-zh" } } } }