This documentation is for v0.5.8. See here for documentation of the current latest version on PyPI.
ir_datasets: ClueWeb09
Index
- clueweb09
- clueweb09/ar
- clueweb09/catb
- clueweb09/catb/trec-web-2009
- clueweb09/catb/trec-web-2009/diversity
- clueweb09/catb/trec-web-2010
- clueweb09/catb/trec-web-2010/diversity
- clueweb09/catb/trec-web-2011
- clueweb09/catb/trec-web-2011/diversity
- clueweb09/catb/trec-web-2012
- clueweb09/catb/trec-web-2012/diversity
- clueweb09/de
- clueweb09/en
- clueweb09/en/trec-web-2009
- clueweb09/en/trec-web-2009/diversity
- clueweb09/en/trec-web-2010
- clueweb09/en/trec-web-2010/diversity
- clueweb09/en/trec-web-2011
- clueweb09/en/trec-web-2011/diversity
- clueweb09/en/trec-web-2012
- clueweb09/en/trec-web-2012/diversity
- clueweb09/es
- clueweb09/fr
- clueweb09/it
- clueweb09/ja
- clueweb09/ko
- clueweb09/pt
- clueweb09/trec-mq-2009
- clueweb09/zh
Data Access Information
To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.
Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided as hard drives that are shipped to you.
Once you have the data, ir_datasets needs directories that look like the following:
- ClueWeb09_English_1
- ClueWeb09_English_2
- ...
- ClueWeb09_Arabic_1
- ...
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
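The layout described above can be sanity-checked before loading the dataset. The following is a minimal sketch, not part of ir_datasets: the helper name `find_corpus_dirs` is hypothetical, and the `ClueWeb09_<Language>_<N>` pattern is an assumption inferred from the directory names listed above.

```python
from pathlib import Path
import re

# Pattern inferred from the directory names listed above (an assumption,
# not something ir_datasets exposes): ClueWeb09_<Language>_<number>.
_DIR_PATTERN = re.compile(r"^ClueWeb09_([A-Za-z]+)_(\d+)$")


def find_corpus_dirs(corpus_root):
    """Group ClueWeb09 disk directories found under corpus_root by language."""
    found = {}
    root = Path(corpus_root).expanduser()
    if not root.is_dir():
        return found
    for entry in sorted(root.iterdir()):
        match = _DIR_PATTERN.match(entry.name)
        # is_dir() follows symlinks, so linked directories are counted too.
        if match and entry.is_dir():
            found.setdefault(match.group(1), []).append(entry.name)
    return found


if __name__ == "__main__":
    corpus = "~/.ir_datasets/clueweb09/corpus"
    for language, dirs in find_corpus_dirs(corpus).items():
        print(language, len(dirs), "directories")
```

An empty result suggests that the directories have not yet been copied or linked to the expected location.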
"clueweb09"
ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.
The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.
docs | Metadata
1.0B docs
Language: multiple/other/unknown
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
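Because `body` is raw bytes, it usually has to be decoded before text processing. The sketch below shows one possible approach; the helper `body_text` and its fallback list of encodings are assumptions for illustration, and the `WarcDoc` defined here is only a stand-in with the same field names so that the example runs without the corpus.

```python
from collections import namedtuple

# Stand-in with the same fields as the WarcDoc type documented above, so the
# sketch runs without a copy of the corpus.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)


def body_text(doc, encodings=("utf-8", "latin-1")):
    """Decode doc.body, trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return doc.body.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return doc.body.decode("utf-8", errors="replace")


doc = WarcDoc(
    doc_id="clueweb09-example",  # made-up ID, for illustration only
    url="http://example.com/",
    date="",
    http_headers=b"HTTP/1.1 200 OK",
    body="<html>caf\u00e9</html>".encode("utf-8"),
    body_content_type="text/html",
)
print(body_text(doc))
```

With real data, the same helper could be applied to each document yielded by `dataset.docs_iter()`.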
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 1040859705,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-"
}
}
}
}
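The `doc_id` metadata above (maximum length and common prefix) can serve as a cheap plausibility check on document IDs coming from run files or other external sources. This is only a sketch: `plausible_doc_id` is a hypothetical helper, and passing the check does not guarantee that the ID exists in the corpus.

```python
import json

# The metadata block shown above, embedded so the example is self-contained.
METADATA = json.loads("""
{"docs": {"count": 1040859705,
          "fields": {"doc_id": {"max_len": 25, "common_prefix": "clueweb09-"}}}}
""")


def plausible_doc_id(doc_id, metadata=METADATA):
    """Return True if doc_id is consistent with the doc_id field metadata."""
    field = metadata["docs"]["fields"]["doc_id"]
    return (
        len(doc_id) <= field["max_len"]
        and doc_id.startswith(field["common_prefix"])
    )


# Illustrative IDs only; they are not taken from the corpus.
print(plausible_doc_id("clueweb09-en0000-00-00000"))
print(plausible_doc_id("clueweb12-0000tw-00-00000"))
```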
"clueweb09/ar"
Subset of ClueWeb09 with only Arabic-language documents.
docs | Metadata
29M docs
Language: ar
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/ar docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ar')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 29192662,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ar000"
}
}
}
}
"clueweb09/catb"
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
docs | Metadata
50M docs
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
}
}
"clueweb09/catb/trec-web-2009"
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
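Each query carries its subtopics as a nested tuple, which is often easier to work with once flattened into rows. The sketch below illustrates this with stand-in namedtuples that mirror the field names documented above; the helper `subtopic_rows` and all query values are made up for illustration.

```python
from collections import namedtuple

# Stand-ins mirroring the documented field names, so the sketch runs
# without the dataset.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def subtopic_rows(queries):
    """Yield one (query_id, number, type, text) row per subtopic."""
    for query in queries:
        for subtopic in query.subtopics:
            yield (query.query_id, subtopic.number, subtopic.type, subtopic.text)


# Placeholder values, not taken from the benchmark.
example = TrecWebTrackQuery(
    query_id="q1",
    query="example query",
    description="placeholder description",
    type="faceted",
    subtopics=(
        TrecSubtopic("1", "first aspect", "inf"),
        TrecSubtopic("2", "second aspect", "nav"),
    ),
)
rows = list(subtopic_rows([example]))
for row in rows:
    print(row)
```

With the real dataset, `dataset.queries_iter()` could be passed in place of the example list.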
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
13K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- method: int
- iprob: float
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 9.1K | 69.5% |
1 | relevant | 2.5K | 19.2% |
2 | highly relevant | 1.5K | 11.3% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
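Output of the CLI export shown above can be read back into typed records. The following sketch assumes that the TSV export writes one tab-separated record per line, without a header row, in the field order listed above; the parser name `parse_prels_tsv`, the stand-in `TrecPrel`, and the sample rows are all illustrative.

```python
import csv
import io
from collections import namedtuple

# Stand-in mirroring the documented TrecPrel fields.
TrecPrel = namedtuple(
    "TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"]
)


def parse_prels_tsv(stream):
    """Yield TrecPrel records from tab-separated lines (assumed: no header)."""
    for row in csv.reader(stream, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, method, iprob = row
        yield TrecPrel(query_id, doc_id, int(relevance), int(method), float(iprob))


# Made-up rows standing in for real export output.
sample = io.StringIO("1\tdoc-a\t2\t1\t0.5\n2\tdoc-b\t0\t0\t1.0\n")
prels = list(parse_prels_tsv(sample))
for prel in prels:
    print(prel)
```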
ir_datasets.bib:
\cite{Clarke2009TrecWeb}
Bibtex:
@inproceedings{Clarke2009TrecWeb,
title={Overview of the TREC 2009 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
booktitle={TREC},
year={2009}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 13118,
"fields": {
"relevance": {
"counts_by_value": {
"0": 9116,
"1": 2514,
"2": 1488
}
}
}
}
}
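The percentages in the relevance table can be reproduced from the `counts_by_value` entry of the metadata. A minimal sketch, using the counts shown above (the function name `relevance_percentages` is hypothetical):

```python
def relevance_percentages(counts_by_value):
    """Map each relevance level to its share of all qrels, in percent."""
    total = sum(counts_by_value.values())
    return {
        int(level): round(100 * count / total, 1)
        for level, count in counts_by_value.items()
    }


# counts_by_value copied from the metadata above.
counts = {"0": 9116, "1": 2514, "2": 1488}
percentages = relevance_percentages(counts)
print(sum(counts.values()))  # total number of qrels
print(percentages)
```

The total of 13118 matches the qrels count, and the rounded shares agree with the 69.5%, 19.2%, and 11.3% listed in the table.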
"clueweb09/catb/trec-web-2009/diversity"
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
16K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 12K | 75.0% |
1 | relevant | 4.1K | 25.0% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2009/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
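Diversity qrels are judged per subtopic, so a common first step is to group the relevant documents by (query, subtopic) pair. The sketch below shows one way to do that; `relevant_docs_by_subtopic` is a hypothetical helper, the `TrecSubQrel` here is a stand-in with the documented field names, and the sample judgments are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the documented TrecSubQrel fields.
TrecSubQrel = namedtuple(
    "TrecSubQrel", ["query_id", "doc_id", "relevance", "subtopic_id"]
)


def relevant_docs_by_subtopic(qrels, min_relevance=1):
    """Map (query_id, subtopic_id) to the set of relevant doc_ids."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Made-up judgments, for illustration only.
sample_qrels = [
    TrecSubQrel("q1", "doc-a", 1, "1"),
    TrecSubQrel("q1", "doc-b", 0, "1"),
    TrecSubQrel("q1", "doc-a", 1, "2"),
    TrecSubQrel("q2", "doc-c", 1, "1"),
]
grouped = relevant_docs_by_subtopic(sample_qrels)
print(grouped)
```

With the real dataset, `dataset.qrels_iter()` could be passed in place of the sample list.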
ir_datasets.bib:
\cite{Clarke2009TrecWeb}
Bibtex:
@inproceedings{Clarke2009TrecWeb,
title={Overview of the TREC 2009 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
booktitle={TREC},
year={2009}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 16347,
"fields": {
"relevance": {
"counts_by_value": {
"0": 12266,
"1": 4081
}
}
}
}
}
"clueweb09/catb/trec-web-2010"
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
16K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 715 | 4.5% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 76.0% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.3K | 14.6% |
2 | HRel: The content of this page provides substantial information on the topic. | 682 | 4.3% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 90 | 0.6% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
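Because this scale includes a negative "Junk" label, downstream code often needs to map it onto binary labels or non-negative gains. The sketch below shows one common convention, which is an assumption rather than something this page prescribes: treat levels of 1 and above as relevant, and clamp negative labels to zero gain. The names `LEVEL_NAMES`, `to_binary`, and `to_gain` are hypothetical.

```python
# Short names for the levels, copied from the table above.
LEVEL_NAMES = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def to_binary(relevance, threshold=1):
    """Return 1 if the graded label meets the threshold, else 0."""
    return 1 if relevance >= threshold else 0


def to_gain(relevance):
    """Clamp the Junk label (-2) to 0 so that it never yields a negative gain."""
    return max(relevance, 0)


for level in sorted(LEVEL_NAMES):
    print(level, LEVEL_NAMES[level], to_binary(level), to_gain(level))
```

Raising `threshold` to 2 would count only HRel, Key, and Nav pages as relevant.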
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2010TrecWeb}
Bibtex:
@inproceedings{Clarke2010TrecWeb,
title={Overview of the TREC 2010 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
booktitle={TREC},
year={2010}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 15845,
"fields": {
"relevance": {
"counts_by_value": {
"0": 12040,
"1": 2318,
"-2": 715,
"2": 682,
"3": 90
}
}
}
}
}
"clueweb09/catb/trec-web-2010/diversity"
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
5.5K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 100.0% |
2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2010/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2010TrecWeb}
Bibtex:
@inproceedings{Clarke2010TrecWeb,
title={Overview of the TREC 2010 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
booktitle={TREC},
year={2010}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 5522,
"fields": {
"relevance": {
"counts_by_value": {
"1": 5522
}
}
}
}
}
"clueweb09/catb/trec-web-2011"
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
13K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 499 | 3.8% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 11K | 83.5% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.1K | 8.4% |
2 | HRel: The content of this page provides substantial information on the topic. | 354 | 2.7% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 208 | 1.6% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2011TrecWeb}
Bibtex:
@inproceedings{Clarke2011TrecWeb,
title={Overview of the TREC 2011 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
booktitle={TREC},
year={2011}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 13081,
"fields": {
"relevance": {
"counts_by_value": {
"0": 10920,
"1": 1100,
"2": 354,
"-2": 499,
"3": 208
}
}
}
}
}
"clueweb09/catb/trec-web-2011/diversity"
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
44K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.7K | 3.9% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 38K | 85.8% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.0K | 6.9% |
2 | HRel: The content of this page provides substantial information on the topic. | 919 | 2.1% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 556 | 1.3% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
CLI:
ir_datasets export clueweb09/catb/trec-web-2011/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
PyTerrier:
No example available for PyTerrier
XPM-IR:
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2011TrecWeb}
Bibtex:
@inproceedings{Clarke2011TrecWeb,
title={Overview of the TREC 2011 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
booktitle={TREC},
year={2011}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 43889,
"fields": {
"relevance": {
"counts_by_value": {
"0": 37665,
"1": 3016,
"2": 919,
"-2": 1733,
"3": 556
}
}
}
}
}
"clueweb09/catb/trec-web-2012"
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
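Each query carries a tuple of subtopics, so a flat table of (query, subtopic) rows is often easier to work with. The following is a minimal sketch under stated assumptions: the namedtuples are local stand-ins that mirror the fields listed above, `subtopic_rows` is a hypothetical helper, and the sample query is invented for illustration.

```python
from collections import namedtuple

# Local stand-ins mirroring the field lists above (not imported from ir_datasets).
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def subtopic_rows(queries):
    """Yield one (query_id, subtopic_number, subtopic_text) row per subtopic."""
    for query in queries:
        for subtopic in query.subtopics:
            yield (query.query_id, subtopic.number, subtopic.text)


# Hypothetical query, for illustration only (not real TREC data).
sample_queries = [
    TrecWebTrackQuery(
        "q1",
        "example query",
        "An example description.",
        "faceted",
        (
            TrecSubtopic("1", "first facet", "inf"),
            TrecSubtopic("2", "second facet", "nav"),
        ),
    )
]

rows = list(subtopic_rows(sample_queries))
for row in rows:
    print(row)
```

With the real data available, `dataset.queries_iter()` could be passed in place of `sample_queries`.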
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
10K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 561 | 5.6% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 7.2K | 71.6% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.4K | 13.8% |
2 | HRel: The content of this page provides substantial information on the topic. | 300 | 3.0% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 17 | 0.2% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 580 | 5.8% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
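The TSV produced by the CLI export above can be read back with the standard library. This is a minimal sketch that assumes one judgment per line, tab-separated columns in the order shown above (query_id, doc_id, relevance, iteration), and no header row. `read_qrels_tsv` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def read_qrels_tsv(handle):
    """Parse tab-separated qrels into (query_id, doc_id, relevance, iteration) tuples."""
    rows = []
    for query_id, doc_id, relevance, iteration in csv.reader(handle, delimiter="\t"):
        # relevance is the only integer column; the others stay as strings
        rows.append((query_id, doc_id, int(relevance), iteration))
    return rows


# Hypothetical export output, for illustration only.
sample_tsv = "151\tdoc-a\t1\t0\n151\tdoc-b\t-2\t0\n"
parsed = read_qrels_tsv(io.StringIO(sample_tsv))
print(parsed)
```

For a real export, an open file handle (opened with `newline=""`) could be passed instead of the `io.StringIO` object.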
ir_datasets.bib:
\cite{Clarke2012TrecWeb}
Bibtex:
@inproceedings{Clarke2012TrecWeb,
title={Overview of the TREC 2012 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
booktitle={TREC},
year={2012}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 10022,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 561,
"0": 7178,
"1": 1386,
"4": 580,
"2": 300,
"3": 17
}
}
}
}
}
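The percentages in the relevance-level table for this dataset follow directly from the `counts_by_value` metadata. A short worked example, using only the counts listed above (the variable names are illustrative):

```python
# Counts copied from the "counts_by_value" metadata of clueweb09/catb/trec-web-2012.
counts_by_value = {-2: 561, 0: 7178, 1: 1386, 2: 300, 3: 17, 4: 580}

# The sum of the per-level counts should match the qrels "count" field.
total = sum(counts_by_value.values())

# Share of each relevance level, rounded to one decimal place as in the table.
percentages = {
    level: round(100 * count / total, 1)
    for level, count in sorted(counts_by_value.items())
}

for level, pct in percentages.items():
    print(f"{level:>2}: {counts_by_value[level]:>5} ({pct}%)")
```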
"clueweb09/catb/trec-web-2012/diversity"
The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
50M docs
Inherits docs from clueweb09/catb
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
39K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 2.2K | 5.7% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 31K | 78.7% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.5K | 9.0% |
2 | HRel: The content of this page provides substantial information on the topic. | 887 | 2.3% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 47 | 0.1% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 1.7K | 4.3% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
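Because each diversity judgment is tied to a subtopic, a natural first step is to group relevant documents per (query, subtopic) pair. The sketch below assumes a local stand-in for the TrecSubQrel namedtuple described above; `relevant_docs_by_subtopic` is a hypothetical helper, the threshold of 1 is an assumption, and the sample judgments are invented.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecSubQrel fields listed above.
TrecSubQrel = namedtuple(
    "TrecSubQrel", ["query_id", "doc_id", "relevance", "subtopic_id"]
)


def relevant_docs_by_subtopic(qrels, min_relevance=1):
    """Map (query_id, subtopic_id) to the set of doc_ids judged >= min_relevance."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Hypothetical judgments, for illustration only (not real data).
sample_qrels = [
    TrecSubQrel("151", "doc-a", 1, "1"),
    TrecSubQrel("151", "doc-b", 0, "1"),
    TrecSubQrel("151", "doc-a", 2, "2"),
    TrecSubQrel("152", "doc-c", -2, "1"),
]

grouped = relevant_docs_by_subtopic(sample_qrels)
print(grouped)
```

With the real data available, `dataset.qrels_iter()` could be passed in place of `sample_qrels`.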
ir_datasets.bib:
\cite{Clarke2012TrecWeb}
Bibtex:
@inproceedings{Clarke2012TrecWeb,
title={Overview of the TREC 2012 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
booktitle={TREC},
year={2012}
}
{
"docs": {
"count": 50220423,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 38992,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 2237,
"0": 30669,
"1": 3494,
"4": 1658,
"2": 887,
"3": 47
}
}
}
}
}
"clueweb09/de"
Subset of ClueWeb09 with only German-language documents.
docs | Metadata
50M docs
Language: de
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/de docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.de')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
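Since `body` and `http_headers` are raw bytes, text processing needs a decoding step. The following is a minimal sketch, not the library's own method: it assumes the charset, when declared, appears as `charset=...` in the raw HTTP headers, and falls back to UTF-8 otherwise. `WarcDoc` here is a local stand-in, `decode_body` is a hypothetical helper, and the sample document is invented.

```python
import re
from collections import namedtuple

# Local stand-in mirroring the WarcDoc fields listed above (not imported from ir_datasets).
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)

# Assumption: the charset, when present, appears as "charset=..." in the headers.
CHARSET_RE = re.compile(rb"charset=([\w.-]+)", re.IGNORECASE)


def decode_body(doc, default="utf-8"):
    """Decode doc.body to str, preferring the charset declared in the headers."""
    match = CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:  # unknown or misspelled charset name
        return doc.body.decode(default, errors="replace")


# Hypothetical document, for illustration only.
sample_doc = WarcDoc(
    doc_id="clueweb09-de0000-00-00000",
    url="http://example.com/",
    date="2009-01-01T00:00:00Z",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    body="Grüße".encode("latin-1"),
    body_content_type="text/html",
)

text = decode_body(sample_doc)
print(text)
```

Using `errors="replace"` keeps the loop running on malformed pages at the cost of substituting undecodable bytes.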
{
"docs": {
"count": 49814309,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-de00"
}
}
}
}
"clueweb09/en"
Subset of ClueWeb09 with only English-language documents.
docs | Metadata
504M docs
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
}
}
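The `doc_id` metadata above (a common prefix and a maximum length) is enough for a cheap sanity check on document IDs, for example before looking them up in a run file. A minimal sketch: `looks_like_en_doc_id` is a hypothetical helper and the candidate IDs are invented for illustration.

```python
# Values copied from the clueweb09/en metadata shown above.
COMMON_PREFIX = "clueweb09-en"
MAX_LEN = 25


def looks_like_en_doc_id(doc_id):
    """Cheap check based only on the metadata prefix and maximum length."""
    return doc_id.startswith(COMMON_PREFIX) and len(doc_id) <= MAX_LEN


# Hypothetical IDs, for illustration only.
candidates = [
    "clueweb09-en0000-00-00000",
    "clueweb09-de0000-00-00000",
    "clueweb09-en0000-00-00000-extra",
]

valid = [doc_id for doc_id in candidates if looks_like_en_doc_id(doc_id)]
print(valid)
```

This check only rules out obviously foreign IDs; it does not confirm that a document exists in the corpus.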
"clueweb09/en/trec-web-2009"
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
24K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- method: int
- iprob: float
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 17K | 70.9% |
1 | relevant | 4.8K | 20.5% |
2 | highly relevant | 2.0K | 8.6% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
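The TrecPrel records carry an `iprob` field in addition to the relevance label. This page does not define it; the sketch below assumes it is the inclusion probability of a sampled judgment, as in the TREC "prels" format, and uses it for a simple inverse-probability estimate of the number of relevant documents per query. The namedtuple is a local stand-in, `estimated_relevant_counts` is a hypothetical helper, and the sample records are invented.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecPrel fields listed above.
TrecPrel = namedtuple(
    "TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"]
)


def estimated_relevant_counts(prels, min_relevance=1):
    """Per query, sum 1 / iprob over judgments with relevance >= min_relevance.

    Assumption: iprob is the probability that the document was sampled for
    judging, so each relevant judgment stands in for 1 / iprob documents.
    """
    totals = defaultdict(float)
    for prel in prels:
        if prel.relevance >= min_relevance and prel.iprob > 0:
            totals[prel.query_id] += 1.0 / prel.iprob
    return dict(totals)


# Hypothetical records, for illustration only (not real data).
sample_prels = [
    TrecPrel("1", "doc-a", 1, 1, 1.0),
    TrecPrel("1", "doc-b", 2, 1, 0.5),
    TrecPrel("1", "doc-c", 0, 1, 0.25),
    TrecPrel("2", "doc-d", 1, 1, 0.25),
]

estimates = estimated_relevant_counts(sample_prels)
print(estimates)
```

If `iprob` turns out to mean something else in a given release, the weighting above does not apply and plain counts should be used instead.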
ir_datasets.bib:
\cite{Clarke2009TrecWeb}
Bibtex:
@inproceedings{Clarke2009TrecWeb,
title={Overview of the TREC 2009 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
booktitle={TREC},
year={2009}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 23601,
"fields": {
"relevance": {
"counts_by_value": {
"0": 16743,
"1": 4832,
"2": 2026
}
}
}
}
}
"clueweb09/en/trec-web-2009/diversity"
The TREC Web Track 2009 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
28K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 21K | 76.8% |
1 | relevant | 6.5K | 23.2% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2009TrecWeb}
Bibtex:
@inproceedings{Clarke2009TrecWeb,
title={Overview of the TREC 2009 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
booktitle={TREC},
year={2009}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 27964,
"fields": {
"relevance": {
"counts_by_value": {
"0": 21465,
"1": 6499
}
}
}
}
}
"clueweb09/en/trec-web-2010"
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
25K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.4K | 5.6% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 19K | 73.7% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 4.0K | 15.9% |
2 | HRel: The content of this page provides substantial information on the topic. | 1.1K | 4.3% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 138 | 0.5% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
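The graded scale above includes a negative label (-2, junk), while many evaluation measures expect binary labels. One possible conversion is sketched below; the threshold of 1 is an assumption rather than something this page prescribes, `binary_qrels` is a hypothetical helper, the namedtuple is a local stand-in, and the sample judgments are invented.

```python
from collections import namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def binary_qrels(qrels, min_relevance=1):
    """Build {query_id: {doc_id: 0 or 1}}.

    Labels below min_relevance, including the junk label -2, map to 0.
    """
    result = {}
    for qrel in qrels:
        label = 1 if qrel.relevance >= min_relevance else 0
        result.setdefault(qrel.query_id, {})[qrel.doc_id] = label
    return result


# Hypothetical judgments, for illustration only (not real data).
sample_qrels = [
    TrecQrel("51", "doc-a", 2, "0"),
    TrecQrel("51", "doc-b", -2, "0"),
    TrecQrel("51", "doc-c", 0, "0"),
    TrecQrel("52", "doc-a", 1, "0"),
]

binary = binary_qrels(sample_qrels)
print(binary)
```

Raising `min_relevance` to 2 would keep only the HRel level and above as relevant.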
ir_datasets.bib:
\cite{Clarke2010TrecWeb}
Bibtex:
@inproceedings{Clarke2010TrecWeb,
title={Overview of the TREC 2010 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
booktitle={TREC},
year={2010}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 25329,
"fields": {
"relevance": {
"counts_by_value": {
"0": 18665,
"1": 4018,
"-2": 1431,
"2": 1077,
"3": 138
}
}
}
}
}
"clueweb09/en/trec-web-2010/diversity"
The TREC Web Track 2010 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
9.0K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 9.0K | 100.0% |
2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2010TrecWeb}
Bibtex:
@inproceedings{Clarke2010TrecWeb,
title={Overview of the TREC 2010 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
booktitle={TREC},
year={2010}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 9006,
"fields": {
"relevance": {
"counts_by_value": {
"1": 9006
}
}
}
}
}
"clueweb09/en/trec-web-2011"
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
19K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.0K | 5.3% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 15K | 78.5% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.0K | 10.5% |
2 | HRel: The content of this page provides substantial information on the topic. | 711 | 3.7% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 408 | 2.1% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2011TrecWeb}
Bibtex:
@inproceedings{Clarke2011TrecWeb,
title={Overview of the TREC 2011 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
booktitle={TREC},
year={2011}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 19381,
"fields": {
"relevance": {
"counts_by_value": {
"0": 15205,
"2": 711,
"1": 2038,
"-2": 1019,
"3": 408
}
}
}
}
}
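The percentages in the relevance table follow directly from `counts_by_value` and the qrels `count` in the metadata above. A small sketch of that arithmetic, assuming the JSON is available as a string (here the `qrels` portion is copied inline; `relevance_percentages` is a hypothetical helper name):

```python
import json

# Copy of the "qrels" portion of the metadata shown above.
METADATA = json.loads("""
{"qrels": {"count": 19381,
           "fields": {"relevance": {"counts_by_value":
               {"0": 15205, "2": 711, "1": 2038, "-2": 1019, "3": 408}}}}}
""")


def relevance_percentages(metadata):
    """Return {relevance level: percent of all qrels}, rounded to one decimal."""
    qrels = metadata["qrels"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    total = qrels["count"]
    # JSON object keys are strings, so convert them back to integer levels.
    return {int(level): round(100 * n / total, 1) for level, n in counts.items()}


percentages = relevance_percentages(METADATA)
counts_total = sum(METADATA["qrels"]["fields"]["relevance"]["counts_by_value"].values())
```

Summing the per-level counts and comparing against `count` is a cheap consistency check on the metadata.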
"clueweb09/en/trec-web-2011/diversity"
The TREC Web Track 2011 diversity ranking benchmark. Contains 50 queries with deep relevance judgments assessed per subtopic.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
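Because `subtopics` is a tuple of nested namedtuples, it is often convenient to flatten a query into one row per subtopic. A minimal sketch, assuming only the field names listed above; both namedtuples are redefined locally, and `SAMPLE_QUERY` with its text and `type` values is made up purely for illustration:

```python
from collections import namedtuple

# Local stand-ins mirroring the documented field names.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery", ["query_id", "query", "description", "type", "subtopics"]
)

# Hypothetical query; real code would take items from dataset.queries_iter().
SAMPLE_QUERY = TrecWebTrackQuery(
    query_id="0",
    query="example query",
    description="A made-up information need used only for illustration.",
    type="faceted",
    subtopics=(
        TrecSubtopic("1", "first made-up facet", "inf"),
        TrecSubtopic("2", "second made-up facet", "nav"),
    ),
)


def subtopic_rows(query):
    """Flatten a query into (query_id, subtopic number, subtopic type, text) rows."""
    return [(query.query_id, s.number, s.type, s.text) for s in query.subtopics]


rows = subtopic_rows(SAMPLE_QUERY)
```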
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
65K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.3% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 53K | 81.8% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 8.4% |
2 | HRel: The content of this page provides substantial information on the topic. | 1.8K | 2.8% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 1.1K | 1.7% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% |
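Unlike `TrecQrel`, each `TrecSubQrel` row judges a document against one subtopic, so a document can appear several times per query. The sketch below groups judgments by document and computes a simple subtopic-coverage fraction for a ranking. It assumes only the fields listed above; the sample rows, the `min_rel=1` threshold, and the helper names are hypothetical, and the coverage fraction is an illustration rather than an official TREC diversity measure.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecSubQrel fields documented above.
TrecSubQrel = namedtuple("TrecSubQrel", ["query_id", "doc_id", "relevance", "subtopic_id"])

# Hypothetical sample rows; real code would iterate dataset.qrels_iter().
SAMPLE_SUBQRELS = [
    TrecSubQrel("101", "doc-a", 1, "1"),
    TrecSubQrel("101", "doc-a", 0, "2"),
    TrecSubQrel("101", "doc-b", 2, "2"),
    TrecSubQrel("101", "doc-c", -2, "1"),
]


def subtopics_covered(qrels, min_rel=1):
    """Map (query_id, doc_id) to the set of subtopics the document is relevant to."""
    covered = defaultdict(set)
    for q in qrels:
        if q.relevance >= min_rel:
            covered[(q.query_id, q.doc_id)].add(q.subtopic_id)
    return dict(covered)


def subtopic_coverage(ranked_doc_ids, query_id, covered, n_subtopics):
    """Fraction of a query's subtopics hit by at least one document in the ranking."""
    seen = set()
    for doc_id in ranked_doc_ids:
        seen |= covered.get((query_id, doc_id), set())
    return len(seen) / n_subtopics


covered = subtopics_covered(SAMPLE_SUBQRELS)
partial = subtopic_coverage(["doc-a", "doc-c"], "101", covered, n_subtopics=2)
full = subtopic_coverage(["doc-a", "doc-b"], "101", covered, n_subtopics=2)
```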
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2011TrecWeb}
Bibtex:
@inproceedings{Clarke2011TrecWeb,
title={Overview of the TREC 2011 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
booktitle={TREC},
year={2011}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 64868,
"fields": {
"relevance": {
"counts_by_value": {
"0": 53055,
"2": 1828,
"1": 5469,
"-2": 3435,
"3": 1081
}
}
}
}
}
"clueweb09/en/trec-web-2012"
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
16K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 858 | 5.3% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 72.7% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.2K | 13.8% |
2 | HRel: The content of this page provides substantial information on the topic. | 405 | 2.5% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 52 | 0.3% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 858 | 5.3% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2012TrecWeb}
Bibtex:
@inproceedings{Clarke2012TrecWeb,
title={Overview of the TREC 2012 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
booktitle={TREC},
year={2012}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 16055,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 858,
"0": 11674,
"1": 2208,
"4": 858,
"2": 405,
"3": 52
}
}
}
}
}
"clueweb09/en/trec-web-2012/diversity"
The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries with deep relevance judgments assessed per subtopic.
queries | docs | qrels | Citation | Metadata
50 queries
Language: en
Query type:
TrecWebTrackQuery: (namedtuple)
- query_id: str
- query: str
- description: str
- type: str
- subtopics: Tuple[
TrecSubtopic: (namedtuple)
- number: str
- text: str
- type: str
, ...]
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity queries
[query_id] [query] [description] [type] [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
504M docs
Inherits docs from clueweb09/en
Language: en
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
62K qrels
Query relevance judgment type:
TrecSubQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- subtopic_id: str
Relevance levels
Rel. | Definition | Count | % |
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.4% |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 50K | 79.6% |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.6K | 8.9% |
2 | HRel: The content of this page provides substantial information on the topic. | 1.2K | 1.9% |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 130 | 0.2% |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 2.5K | 4.0% |
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity qrels --format tsv
[query_id] [doc_id] [relevance] [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Clarke2012TrecWeb}
Bibtex:
@inproceedings{Clarke2012TrecWeb,
title={Overview of the TREC 2012 Web Track},
author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
booktitle={TREC},
year={2012}
}
{
"docs": {
"count": 503903810,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-en"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 62394,
"fields": {
"relevance": {
"counts_by_value": {
"-2": 3373,
"0": 49653,
"1": 5578,
"4": 2486,
"2": 1174,
"3": 130
}
}
}
}
}
"clueweb09/es"
Subset of ClueWeb09 with only Spanish-language documents.
docs | Metadata
79M docs
Language: es
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
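Since `body` and `http_headers` are raw `bytes`, text processing needs a decoding step first. The following is a rough sketch of one way to do that, assuming the character set (when present) can be read from a `charset=` parameter in the HTTP headers. The `body_text` helper, the regular expression, and every value in `SAMPLE_DOC` are hypothetical; real pages may declare their encoding elsewhere or not at all.

```python
import re
from collections import namedtuple

# Local stand-in mirroring the WarcDoc fields documented above.
WarcDoc = namedtuple(
    "WarcDoc", ["doc_id", "url", "date", "http_headers", "body", "body_content_type"]
)

# Hypothetical document; real code would take items from dataset.docs_iter().
SAMPLE_DOC = WarcDoc(
    doc_id="clueweb09-es0000-00-00000",
    url="http://example.com/",
    date="2009-01-01T00:00:00Z",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    body="<html><body>ma\u00f1ana</body></html>".encode("iso-8859-1"),
    body_content_type="text/html",
)

# Assumption: the charset, if declared, appears as "charset=<name>" in the headers.
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def body_text(doc, default="utf-8"):
    """Decode doc.body using the header charset, falling back to `default`."""
    match = CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:
        # Unknown charset name in the headers: fall back to the default.
        return doc.body.decode(default, errors="replace")


text = body_text(SAMPLE_DOC)
```

Using `errors="replace"` keeps iteration over a very large crawl from stopping on a single malformed page, at the cost of silently substituting undecodable bytes.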
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/es docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.es')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 79333950,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-es"
}
}
}
}
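The `doc_id` metadata above (a `max_len` of 25 and a common prefix of `clueweb09-es`) is consistent with fixed-width identifiers. The sketch below splits such an ID into parts, assuming the pattern `clueweb09-<2-letter language><4-character segment>-<2-digit file>-<5-digit record>`. That pattern, the field names in the returned dict, and the sample ID are assumptions for illustration and are not taken from this page.

```python
import re

# Assumed ID layout, e.g. "clueweb09-es0000-00-00000" (25 characters).
DOC_ID_RE = re.compile(r"clueweb09-([a-z]{2})([a-z0-9]{4})-(\d{2})-(\d{5})")


def parse_doc_id(doc_id):
    """Split a doc_id into its assumed components, or return None if it does not match."""
    match = DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        return None
    language, segment, file_no, record_no = match.groups()
    return {
        "language": language,
        "segment": segment,
        "file": file_no,
        "record": record_no,
    }


SAMPLE_ID = "clueweb09-es0000-00-00000"  # hypothetical ID following the assumed layout
parsed = parse_doc_id(SAMPLE_ID)
```

Returning `None` for non-matching IDs rather than raising makes it easy to count how many IDs in a run file deviate from the assumed layout.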
"clueweb09/fr"
Subset of ClueWeb09 with only French-language documents.
docs | Metadata
51M docs
Language: fr
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/fr docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.fr')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 50883172,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-fr"
}
}
}
}
"clueweb09/it"
Subset of ClueWeb09 with only Italian-language documents.
docs | Metadata
27M docs
Language: it
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/it docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.it')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 27250729,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-it"
}
}
}
}
"clueweb09/ja"
Subset of ClueWeb09 with only Japanese-language documents.
docs | Metadata
67M docs
Language: ja
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ja docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ja')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 67337717,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ja"
}
}
}
}
"clueweb09/ko"
Subset of ClueWeb09 with only Korean-language documents.
docs | Metadata
18M docs
Language: ko
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ko docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ko')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 18075141,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-ko000"
}
}
}
}
"clueweb09/pt"
Subset of ClueWeb09 with only Portuguese-language documents.
docs | Metadata
38M docs
Language: pt
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/pt docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.pt')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 37578858,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-pt"
}
}
}
}
"clueweb09/trec-mq-2009"
TREC 2009 Million Query track.
queries | docs | qrels | Citation | Metadata
40K queries
Language: en
Query type:
GenericQuery: (namedtuple)
- query_id: str
- text: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.trec-mq-2009.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
1.0B docs
Inherits docs from clueweb09
Language: multiple/other/unknown
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.trec-mq-2009')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
35K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- method: int
- iprob: float
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 26K | 74.1% |
1 | relevant | 5.9K | 17.0% |
2 | highly relevant | 3.1K | 9.0% |
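`TrecPrel` rows carry two extra fields beyond a plain qrel: `method` and `iprob`. The sketch below assumes that `iprob` is the probability with which the judged document was included in the assessment sample; this page does not state that, so treat it as an assumption. Under that assumption, weighting each relevant judgment by `1 / iprob` gives an inverse-probability estimate of the number of relevant documents per query. The sample rows and the `estimated_relevant_counts` helper are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecPrel fields documented above.
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])

# Hypothetical sample rows; real code would iterate dataset.qrels_iter().
SAMPLE_PRELS = [
    TrecPrel("20001", "doc-a", 1, 1, 0.5),
    TrecPrel("20001", "doc-b", 0, 1, 0.25),
    TrecPrel("20001", "doc-c", 2, 1, 0.25),
    TrecPrel("20002", "doc-d", 1, 1, 1.0),
]


def estimated_relevant_counts(prels, min_rel=1):
    """Inverse-probability estimate of relevant documents per query.

    Assumes iprob is a sampling inclusion probability (an assumption, see above).
    """
    estimates = defaultdict(float)
    for p in prels:
        if p.relevance >= min_rel and p.iprob > 0:
            estimates[p.query_id] += 1.0 / p.iprob
    return dict(estimates)


estimates = estimated_relevant_counts(SAMPLE_PRELS)
```

Rows with a non-positive `iprob` are skipped here to avoid division by zero; how such rows should be handled in a real evaluation is left open.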
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id] [doc_id] [relevance] [method] [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.trec-mq-2009.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
ir_datasets.bib:
\cite{Carterette2009MQ}
Bibtex:
@inproceedings{Carterette2009MQ,
title={Million Query Track 2009 Overview},
author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas},
booktitle={TREC},
year={2009}
}
{
"docs": {
"count": 1040859705,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-"
}
}
},
"queries": {
"count": 40000
},
"qrels": {
"count": 34534,
"fields": {
"relevance": {
"counts_by_value": {
"0": 25586,
"1": 5856,
"2": 3092
}
}
}
}
}
"clueweb09/zh"
Subset of ClueWeb09 with only Chinese-language documents.
docs | Metadata
177M docs
Language: zh
Document type:
WarcDoc: (namedtuple)
- doc_id: str
- url: str
- date: str
- http_headers: bytes
- body: bytes
- body_content_type: str
Examples:
Python API | CLI | PyTerrier | XPM-IR
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/zh docs
[doc_id] [url] [date] [http_headers] [body] [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.zh')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 177489357,
"fields": {
"doc_id": {
"max_len": 25,
"common_prefix": "clueweb09-zh"
}
}
}
}