ir_datasets: ClueWeb09
To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided on hard drives that are shipped to you.
Once you have the data, ir_datasets will need the directories that look like the following:
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
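If copying the data is impractical, symlinking the directories is one option. The sketch below is only an illustration and is not part of ir_datasets: the helper name `link_corpus` and the mount point in the usage comment are hypothetical, and it assumes every subdirectory of the source location belongs in the corpus directory.

```python
from pathlib import Path
from typing import List


def link_corpus(source: Path, corpus: Path) -> List[str]:
    """Symlink each subdirectory of `source` into `corpus`; return the names linked."""
    corpus.mkdir(parents=True, exist_ok=True)
    linked = []
    for entry in sorted(source.iterdir()):
        target = corpus / entry.name
        # Skip plain files and anything already present in the corpus directory.
        if not entry.is_dir() or target.exists() or target.is_symlink():
            continue
        target.symlink_to(entry.resolve(), target_is_directory=True)
        linked.append(entry.name)
    return linked


# Usage (hypothetical mount point; adjust to where the drive is mounted):
#   link_corpus(Path("/mnt/clueweb09"),
#               Path.home() / ".ir_datasets" / "clueweb09" / "corpus")
```

Running the helper a second time links nothing new, so it is safe to repeat after adding further directories.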
ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.
The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
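Because the collection holds over a billion pages, it is often useful to look at only the first few records before committing to a full pass. A minimal sketch using the standard library follows; the helper name `head` is an assumption for illustration, and the ir_datasets calls appear only in comments because they need the corpus to be installed.

```python
from itertools import islice


def head(records, n=3):
    """Return the first `n` items of an iterable without reading the rest."""
    return list(islice(records, n))


# With the corpus in place, this could be used as (not executed here):
#   import ir_datasets
#   dataset = ir_datasets.load("clueweb09")
#   for doc in head(dataset.docs_iter()):
#       print(doc.doc_id, doc.url)
```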
ir_datasets export clueweb09 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 1040859705,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-"
}
}
}
}
Subset of ClueWeb09 with only Arabic-language documents.
Language: ar
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ar docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ar')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 29192662,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ar000"
}
}
}
}
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
}
}
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | not relevant | 9.1K | 69.5% |
| 1 | relevant | 2.5K | 19.2% |
| 2 | highly relevant | 1.5K | 11.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
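Evaluation code usually needs the judgments as a lookup table rather than a flat stream. The following is a sketch under stated assumptions: `qrels_by_query` is a hypothetical helper, and the sample records use made-up IDs that only mirror the field names listed above. With the data installed, `dataset.qrels_iter()` could be passed in instead.

```python
from collections import defaultdict, namedtuple


def qrels_by_query(qrels):
    """Map query_id -> {doc_id: relevance} for records exposing those fields."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Stand-in records with made-up IDs; real ones come from dataset.qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "method", "iprob"])
sample = [
    Qrel("1", "doc-a", 2, 1, 1.0),
    Qrel("1", "doc-b", 0, 1, 1.0),
    Qrel("2", "doc-a", 1, 1, 1.0),
]

lookup = qrels_by_query(sample)
print(lookup["1"]["doc-a"])  # 2
```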
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 13118,
"fields": {
"relevance": {
"counts_by_value": {
"0": 9116,
"1": 2514,
"2": 1488
}
}
}
}
}
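The percentages in the relevance table follow directly from `counts_by_value` in the metadata. As a small worked example (the function name `relevance_shares` is an assumption for illustration), each share is the level's count divided by the total number of judgments:

```python
# Counts copied from the metadata for clueweb09/catb/trec-web-2009.
counts_by_value = {"0": 9116, "1": 2514, "2": 1488}


def relevance_shares(counts):
    """Return each relevance level's share of all judgments, in percent."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


shares = relevance_shares(counts_by_value)
print(shares)  # {'0': 69.5, '1': 19.2, '2': 11.3}
```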
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | not relevant | 12K | 75.0% |
| 1 | relevant | 4.1K | 25.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
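Diversity judgments carry a `subtopic_id`, so a natural first step is to collect the relevant documents for each query subtopic. The sketch below assumes that any relevance above 0 counts as relevant; `relevant_docs_by_subtopic` is a hypothetical helper, and the sample records use made-up IDs in place of `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple


def relevant_docs_by_subtopic(qrels):
    """Map (query_id, subtopic_id) -> set of doc_ids judged with relevance > 0."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Stand-in records with made-up IDs; real ones come from dataset.qrels_iter().
DivQrel = namedtuple("DivQrel", ["query_id", "doc_id", "relevance", "subtopic_id"])
sample = [
    DivQrel("1", "doc-a", 1, "1"),
    DivQrel("1", "doc-b", 0, "1"),
    DivQrel("1", "doc-a", 1, "2"),
    DivQrel("1", "doc-c", 1, "2"),
]

by_subtopic = relevant_docs_by_subtopic(sample)
print(sorted(by_subtopic[("1", "2")]))  # ['doc-a', 'doc-c']
```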
ir_datasets export clueweb09/catb/trec-web-2009/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 16347,
"fields": {
"relevance": {
"counts_by_value": {
"0": 12266,
"1": 4081
}
}
}
}
}
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 715 | 4.5% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 76.0% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.3K | 14.6% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 682 | 4.3% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 90 | 0.6% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
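Many evaluation measures expect binary labels, so graded judgments like these are often collapsed first. The sketch below is one possible convention, not one prescribed by the track: it assumes a threshold of 1, so Junk (-2) and Non (0) both become non-relevant. The name `to_binary` is hypothetical, and the counts are the exact values reported in this subset's metadata.

```python
def to_binary(relevance, min_relevant=1):
    """Collapse a graded label to 1 (relevant) or 0 (non-relevant)."""
    return 1 if relevance >= min_relevant else 0


# Exact counts per relevance level for clueweb09/catb/trec-web-2010.
counts_by_value = {-2: 715, 0: 12040, 1: 2318, 2: 682, 3: 90}

relevant = sum(n for level, n in counts_by_value.items() if to_binary(level))
non_relevant = sum(counts_by_value.values()) - relevant
print(relevant, non_relevant)  # 3090 12755
```

Raising `min_relevant` to 2 would keep only the HRel, Key and Nav levels as relevant.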
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 15845,
"fields": {
"relevance": {
"counts_by_value": {
"0": 12040,
"1": 2318,
"-2": 715,
"2": 682,
"3": 90
}
}
}
}
}
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 100.0% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 5522,
"fields": {
"relevance": {
"counts_by_value": {
"1": 5522
}
}
}
}
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 499 | 3.8% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 11K | 83.5% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.1K | 8.4% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 354 | 2.7% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 208 | 1.6% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 13081,
"fields": {
"relevance": {
"counts_by_value": {
"0": 10920,
"1": 1100,
"2": 354,
"-2": 499,
"3": 208
}
}
}
}
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.7K | 3.9% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 38K | 85.8% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.0K | 6.9% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 919 | 2.1% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 556 | 1.3% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 43889,
"fields": {
"relevance": {
"counts_by_value": {
"0": 37665,
"1": 3016,
"2": 919,
"-2": 1733,
"3": 556
}
}
}
}
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 561 | 5.6% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 7.2K | 71.6% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.4K | 13.8% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 300 | 3.0% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 17 | 0.2% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 580 | 5.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb,
  title={Overview of the TREC 2012 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
  booktitle={TREC},
  year={2012}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 10022,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 561,
"0": 7178,
"1": 1386,
"4": 580,
"2": 300,
"3": 17
}
}
}
}
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 2.2K | 5.7% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 31K | 78.7% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.5K | 9.0% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 887 | 2.3% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 47 | 0.1% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 1.7K | 4.3% |
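For measures that expect binary labels, the six grades above are often collapsed to relevant/non-relevant. A minimal sketch of one way to do that, assuming a cutoff of grade 1 (an illustrative choice, not something this page prescribes) and using made-up records in place of the licensed data; `Qrel`, `GRADE_LABELS`, and `binarize` are hypothetical names:

```python
from collections import namedtuple

# Hypothetical stand-in for a qrel record; the fields mirror this dataset's qrels namedtuple.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "subtopic_id"])

# Short names taken from the relevance-level table above.
GRADE_LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def binarize(relevance, min_rel=1):
    """Return 1 when the grade is at least min_rel, else 0 (assumed cutoff)."""
    return 1 if relevance >= min_rel else 0


# Made-up judgments, for illustration only.
sample = [
    Qrel("q1", "doc-a", -2, "1"),
    Qrel("q1", "doc-b", 0, "1"),
    Qrel("q1", "doc-c", 2, "2"),
    Qrel("q1", "doc-d", 4, "1"),
]

labels = [(q.doc_id, GRADE_LABELS[q.relevance], binarize(q.relevance)) for q in sample]
for row in labels:
    print(row)
```

Raising `min_rel` to 2 would count only HRel, Key, and Nav pages as relevant.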
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 38992,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 2237,
"0": 30669,
"1": 3494,
"4": 1658,
"2": 887,
"3": 47
}
}
}
}
}
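The Count and % columns of this dataset's relevance table can be recomputed from `counts_by_value` in the metadata above. A short sketch using the values as listed; the variable names are illustrative:

```python
# counts_by_value copied from the qrels metadata above.
counts_by_value = {"-2": 2237, "0": 30669, "1": 3494, "4": 1658, "2": 887, "3": 47}

total = sum(counts_by_value.values())

# Share of each relevance level, as a percentage rounded to one decimal place.
percentages = {
    int(level): round(100 * count / total, 1)
    for level, count in counts_by_value.items()
}

for level in sorted(percentages):
    print(f"{level:>2}: {percentages[level]}%")
```

The total equals the reported qrels count of 38992, and the rounded shares agree with the table.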
Subset of ClueWeb09 with only German-language documents.
Language: de
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/de docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.de')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 49814309,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-de00"
}
}
}
}
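The `doc_id` metadata above (a common prefix and a maximum length) is enough for a cheap sanity check on IDs before a lookup. A sketch, assuming the prefix and length reported for this subset; `looks_like_de_doc_id` is a hypothetical helper, and the sample IDs are illustrative rather than taken from the corpus:

```python
# Values copied from the "doc_id" metadata above.
COMMON_PREFIX = "clueweb09-de00"
MAX_LEN = 25


def looks_like_de_doc_id(doc_id):
    """Cheap plausibility check: right prefix and no longer than the reported maximum."""
    return doc_id.startswith(COMMON_PREFIX) and len(doc_id) <= MAX_LEN


# Illustrative IDs (assumed shape, not verified against the corpus).
german_id = "clueweb09-de0000-00-00000"
english_id = "clueweb09-en0000-00-00000"

print(looks_like_de_doc_id(german_id))
print(looks_like_de_doc_id(english_id))
```

A check like this only filters out obviously wrong IDs; it does not confirm that a document exists.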
Subset of ClueWeb09 with only English-language documents.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
}
}
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | not relevant | 17K | 70.9% |
| 1 | relevant | 4.8K | 20.5% |
| 2 | highly relevant | 2.0K | 8.6% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 23601,
"fields": {
"relevance": {
"counts_by_value": {
"0": 16743,
"1": 4832,
"2": 2026
}
}
}
}
}
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | not relevant | 21K | 76.8% |
| 1 | relevant | 6.5K | 23.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 27964,
"fields": {
"relevance": {
"counts_by_value": {
"0": 21465,
"1": 6499
}
}
}
}
}
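Because each diversity judgment carries a `subtopic_id`, relevant documents can be grouped per subtopic, a common first step before computing diversity measures. A minimal sketch over made-up records; `Qrel` and `relevant_docs_by_subtopic` are hypothetical names, and treating relevance above 0 as relevant follows the two-level table for this dataset:

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a diversity qrel record; fields mirror the qrels namedtuple.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "subtopic_id"])


def relevant_docs_by_subtopic(qrels):
    """Map (query_id, subtopic_id) to the set of doc_ids judged relevant for it."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Made-up judgments, for illustration only.
sample = [
    Qrel("q1", "doc-a", 1, "1"),
    Qrel("q1", "doc-b", 0, "1"),
    Qrel("q1", "doc-a", 1, "2"),
    Qrel("q1", "doc-c", 1, "2"),
]

grouped = relevant_docs_by_subtopic(sample)
for key in sorted(grouped):
    print(key, sorted(grouped[key]))
```

With real data, `sample` would be replaced by the records yielded by `dataset.qrels_iter()`.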
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.4K | 5.6% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 19K | 73.7% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 4.0K | 15.9% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.1K | 4.3% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 138 | 0.5% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 25329,
"fields": {
"relevance": {
"counts_by_value": {
"0": 18665,
"1": 4018,
"-2": 1431,
"2": 1077,
"3": 138
}
}
}
}
}
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 9.0K | 100.0% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 9006,
"fields": {
"relevance": {
"counts_by_value": {
"1": 9006
}
}
}
}
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.0K | 5.3% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 15K | 78.5% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.0K | 10.5% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 711 | 3.7% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 408 | 2.1% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 19381,
"fields": {
"relevance": {
"counts_by_value": {
"0": 15205,
"2": 711,
"1": 2038,
"-2": 1019,
"3": 408
}
}
}
}
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.3% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 53K | 81.8% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 8.4% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.8K | 2.8% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 1.1K | 1.7% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 64868,
"fields": {
"relevance": {
"counts_by_value": {
"0": 53055,
"2": 1828,
"1": 5469,
"-2": 3435,
"3": 1081
}
}
}
}
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 858 | 5.3% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 72.7% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.2K | 13.8% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 405 | 2.5% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 52 | 0.3% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 858 | 5.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 16055,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 858,
"0": 11674,
"1": 2208,
"4": 858,
"2": 405,
"3": 52
}
}
}
}
}
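The metadata blocks for the four ad-hoc tracks on this page (2009 through 2012) each report 50 queries, so the average number of judgments per query can be compared directly. A quick calculation using the qrels counts as listed; the names `qrels_counts` and `NUM_QUERIES` are illustrative:

```python
# Qrels counts copied from the ad-hoc metadata blocks on this page.
qrels_counts = {2009: 23601, 2010: 25329, 2011: 19381, 2012: 16055}
NUM_QUERIES = 50  # each track reports 50 queries

# Average number of judgments per query, rounded to one decimal place.
judgments_per_query = {
    year: round(count / NUM_QUERIES, 1) for year, count in qrels_counts.items()
}

for year, average in judgments_per_query.items():
    print(year, average)
```

These are plain averages over the reported totals; the distribution across individual queries is not given on this page.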
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.4% |
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 50K | 79.6% |
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.6K | 8.9% |
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.2K | 1.9% |
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 130 | 0.2% |
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 2.5K | 4.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
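Because each qrel carries a subtopic_id, diversity judgments are typically grouped per query and subtopic before evaluation. The following is a minimal sketch of that grouping, not part of ir_datasets: the helper name relevant_docs_by_subtopic, the min_rel threshold of 1, and the sample records (including the placeholder doc_ids) are all assumptions made for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for the qrel namedtuple yielded by dataset.qrels_iter().
# The records below are made up; real doc_ids start with "clueweb09-en".
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "subtopic_id"])


def relevant_docs_by_subtopic(qrels, min_rel=1):
    """Map (query_id, subtopic_id) to the set of doc_ids judged relevant.

    min_rel=1 is an assumed threshold: it keeps Rel, HRel, Key and Nav
    and drops Non (0) and Junk (-2).
    """
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


sample_qrels = [
    Qrel("151", "doc-a", 1, "1"),
    Qrel("151", "doc-b", 0, "1"),
    Qrel("151", "doc-a", 2, "2"),
    Qrel("151", "doc-c", -2, "2"),
    Qrel("152", "doc-d", 4, "1"),
]

grouped = relevant_docs_by_subtopic(sample_qrels)
for key in sorted(grouped):
    print(key, sorted(grouped[key]))
```

With a local copy of the corpus configured, the same helper could be called as relevant_docs_by_subtopic(dataset.qrels_iter()).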
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 62394,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 3373,
"0": 49653,
"1": 5578,
"4": 2486,
"2": 1174,
"3": 130
}
}
}
}
}
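The percentages in the relevance table can be recomputed from the counts_by_value metadata. A small sketch, with the counts copied from the JSON above; the helper name relevance_percentages is an illustrative choice, not an ir_datasets function.

```python
# Counts copied from the "counts_by_value" metadata for
# clueweb09/en/trec-web-2012/diversity.
counts_by_value = {
    "-2": 3373,
    "0": 49653,
    "1": 5578,
    "4": 2486,
    "2": 1174,
    "3": 130,
}


def relevance_percentages(counts):
    """Return {relevance level: share of all qrels in percent, 1 decimal}."""
    total = sum(counts.values())
    return {
        int(level): round(100 * n / total, 1)
        for level, n in sorted(counts.items(), key=lambda kv: int(kv[0]))
    }


total_qrels = sum(counts_by_value.values())
percentages = relevance_percentages(counts_by_value)
print(total_qrels)
for level, pct in percentages.items():
    print(level, pct)
```

The total matches the qrels count of 62394, and the rounded shares agree with the % column of the table.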
Subset of ClueWeb09 with only Spanish-language documents.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/es docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.es')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 79333950,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-es"
}
}
}
}
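The common_prefix values in the metadata ("clueweb09-es", "clueweb09-fr", and so on) suggest that the two characters after "clueweb09-" identify the language subset. The sketch below relies on that assumption; doc_id_language is a hypothetical helper, not part of ir_datasets, and the sample doc_ids are made up to match the 25-character max_len.

```python
CLUEWEB09_PREFIX = "clueweb09-"


def doc_id_language(doc_id):
    """Return the two-letter code that follows the "clueweb09-" prefix.

    Assumption: the two characters after the prefix name the language
    subset, as the common_prefix metadata of each subset suggests.
    """
    if not doc_id.startswith(CLUEWEB09_PREFIX):
        raise ValueError(f"not a ClueWeb09 doc_id: {doc_id!r}")
    code = doc_id[len(CLUEWEB09_PREFIX):len(CLUEWEB09_PREFIX) + 2]
    if len(code) != 2:
        raise ValueError(f"doc_id too short: {doc_id!r}")
    return code


# Made-up doc_ids used only to exercise the helper.
sample_ids = ["clueweb09-es0000-00-00000", "clueweb09-fr0000-00-00000"]
languages = [doc_id_language(doc_id) for doc_id in sample_ids]
print(languages)
```

A helper like this can be handy when iterating over the full clueweb09 collection and routing documents to per-language processing.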
Subset of ClueWeb09 with only French-language documents.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/fr docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.fr')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 50883172,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-fr"
}
}
}
}
Subset of ClueWeb09 with only Italian-language documents.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/it docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.it')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 27250729,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-it"
}
}
}
}
Subset of ClueWeb09 with only Japanese-language documents.
Language: ja
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ja docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ja')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 67337717,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ja"
}
}
}
}
Subset of ClueWeb09 with only Korean-language documents.
Language: ko
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ko docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ko')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 18075141,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ko000"
}
}
}
}
Subset of ClueWeb09 with only Portuguese-language documents.
Language: pt
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/pt docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.pt')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 37578858,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-pt"
}
}
}
}
TREC 2009 Million Query track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.trec-mq-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.trec-mq-2009')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | not relevant | 26K | 74.1% |
| 1 | relevant | 5.9K | 17.0% |
| 2 | highly relevant | 3.1K | 9.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.trec-mq-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Carterette2009MQ, title={Million Query Track 2009 Overview}, author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas}, booktitle={TREC}, year={2009} }
{
"docs": {
"count": 1040859705,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-"
}
}
},
"queries": {
"count": 40000
},
"qrels": {
"count": 34534,
"fields": {
"relevance": {
"counts_by_value": {
"0": 25586,
"1": 5856,
"2": 3092
}
}
}
}
}
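The metadata above lists 40000 queries but only 34534 qrels, so at least 5466 queries cannot have any judgment. When evaluating, it may help to restrict the query set to the judged ones. A minimal sketch, assuming the query and qrel namedtuples shown in the examples above; judged_queries is a hypothetical helper and the sample records are made up.

```python
from collections import namedtuple

# Stand-ins for the namedtuples yielded by queries_iter() and qrels_iter().
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def judged_queries(queries, qrels):
    """Keep only the queries that have at least one qrel, in original order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up records for illustration only.
sample_queries = [
    Query("q1", "first query"),
    Query("q2", "second query"),
    Query("q3", "third query"),
]
sample_qrels = [
    Qrel("q1", "doc-a", 1, 1, 0.5),
    Qrel("q1", "doc-b", 0, 1, 0.5),
    Qrel("q3", "doc-c", 2, 1, 0.25),
]

kept = judged_queries(sample_queries, sample_qrels)
print([query.query_id for query in kept])
```

With the real dataset, the call would be judged_queries(dataset.queries_iter(), dataset.qrels_iter()).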
Subset of ClueWeb09 with only Chinese-language documents.
Language: zh
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/zh docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.zh')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 177489357,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-zh"
}
}
}
}