Github: datasets/clueweb09.py

ir_datasets: ClueWeb09

Index
  1. clueweb09
  2. clueweb09/ar
  3. clueweb09/catb
  4. clueweb09/catb/trec-web-2009
  5. clueweb09/catb/trec-web-2010
  6. clueweb09/catb/trec-web-2011
  7. clueweb09/catb/trec-web-2012
  8. clueweb09/de
  9. clueweb09/en
  10. clueweb09/en/trec-web-2009
  11. clueweb09/en/trec-web-2010
  12. clueweb09/en/trec-web-2011
  13. clueweb09/en/trec-web-2012
  14. clueweb09/es
  15. clueweb09/fr
  16. clueweb09/it
  17. clueweb09/ja
  18. clueweb09/ko
  19. clueweb09/pt
  20. clueweb09/trec-mq-2009
  21. clueweb09/zh

Data Access Information

To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are shipped to you on hard drives.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
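
As a rough sketch of the linking step, the snippet below symlinks every top-level directory found under a source location into the corpus directory. The helper name `link_corpus_dirs` and the source path `/mnt/clueweb09` are assumptions made for illustration. The helper links whatever directories it finds and does not check them against the layout ir_datasets expects.

```python
from pathlib import Path


def link_corpus_dirs(source_root: Path, corpus_root: Path) -> list:
    """Symlink each top-level directory of source_root into corpus_root.

    Returns the sorted names of the directories that were newly linked.
    """
    corpus_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for entry in sorted(source_root.iterdir()):
        target = corpus_root / entry.name
        # Skip plain files and anything already present at the destination.
        if not entry.is_dir() or target.exists() or target.is_symlink():
            continue
        target.symlink_to(entry.resolve(), target_is_directory=True)
        linked.append(entry.name)
    return linked


# Usage (hypothetical source path; substitute wherever your drives are mounted):
# link_corpus_dirs(Path("/mnt/clueweb09"),
#                  Path.home() / ".ir_datasets" / "clueweb09" / "corpus")
```

Symlinking rather than copying avoids duplicating the collection on disk, at the cost of requiring the source drives to stay mounted.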


"clueweb09"

ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

docs
1.0B docs

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
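
Because `body` is raw bytes, most uses need a decoding step first. The following is a minimal sketch of one way to do that, assuming `http_headers` holds the HTTP response header block as bytes and that a charset, when present, appears in its Content-Type line. The `WarcDoc` namedtuple here is a local stand-in with the same field names, `decode_body` is a hypothetical helper, and the sample values are made up.

```python
from collections import namedtuple
from email.message import Message

# Local stand-in with the same fields as the WarcDoc type listed above.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)


def decode_body(doc, default_encoding="utf-8"):
    """Decode doc.body to str, using the Content-Type charset when one is given."""
    charset = None
    for line in doc.http_headers.decode("latin-1").splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            # Let the standard library parse the header parameters.
            msg = Message()
            msg["Content-Type"] = value.strip()
            charset = msg.get_content_charset()
            break
    try:
        return doc.body.decode(charset or default_encoding, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows; fall back.
        return doc.body.decode(default_encoding, errors="replace")


# Made-up sample record for illustration only.
sample = WarcDoc(
    "example-doc-id",
    "http://example.com/",
    "2009-01-01",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    "caf\u00e9".encode("latin-1"),
    "text/html",
)
text = decode_body(sample)
```

Undecodable bytes are replaced rather than raising, which is usually the safer choice when iterating over a large web crawl.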

CLI
ir_datasets export clueweb09 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/ar"

Subset of ClueWeb09 with only Arabic-language documents.

docs
29M docs

Language: ar

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ar docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/catb"

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs
50M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/catb/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.
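
Since each query nests its subtopics in a tuple, a flat view is often easier to work with. The sketch below yields one row per subtopic. The namedtuples are local stand-ins that mirror the field names listed above, `subtopic_rows` is a hypothetical helper, and the sample query is invented for illustration.

```python
from collections import namedtuple

# Local stand-ins mirroring the field names listed above.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def subtopic_rows(queries):
    """Yield (query_id, subtopic number, subtopic type, subtopic text) rows."""
    for query in queries:
        for sub in query.subtopics:
            yield (query.query_id, sub.number, sub.type, sub.text)


# Invented sample query for illustration only.
sample_queries = [
    TrecWebTrackQuery(
        "q1",
        "example query",
        "An example description.",
        "example-type",
        (
            TrecSubtopic("1", "first example subtopic", "example-type"),
            TrecSubtopic("2", "second example subtopic", "example-type"),
        ),
    )
]
rows = list(subtopic_rows(sample_queries))
```

With the real dataset, the same function can be fed `dataset.queries_iter()` directly.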

CLI
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
13K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition    Count    %
0    not relevant    9.1K    69.5%
1    relevant    2.5K    19.2%
2    highly relevant    1.5K    11.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.
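
A per-level tally like the one in the relevance table can be computed from the qrels themselves. This is a minimal sketch under stated assumptions: `TrecPrel` is a local stand-in with the same field names, `relevance_distribution` is a hypothetical helper, and the sample judgments (including their `method` and `iprob` values) are invented.

```python
from collections import Counter, namedtuple

# Local stand-in with the same fields as the TrecPrel type listed above.
TrecPrel = namedtuple(
    "TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"]
)


def relevance_distribution(qrels):
    """Map each relevance level to (count, percent of all judgments)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 1))
        for level, n in sorted(counts.items())
    }


# Invented sample judgments for illustration only.
sample_qrels = [
    TrecPrel("q1", "doc-a", 0, 1, 1.0),
    TrecPrel("q1", "doc-b", 0, 1, 1.0),
    TrecPrel("q1", "doc-c", 1, 1, 1.0),
    TrecPrel("q2", "doc-d", 2, 1, 1.0),
]
distribution = relevance_distribution(sample_qrels)
```

Passing `dataset.qrels_iter()` instead of the sample list should give counts comparable to the table, up to rounding.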

CLI
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

"clueweb09/catb/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
16K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    715    4.5%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    12K    76.0%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    2.3K    14.6%
2    HRel: The content of this page provides substantial information on the topic.    682    4.3%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    90    0.6%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    0    0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
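
Because these judgments are graded and include a negative Junk level, evaluation code often needs an explicit cut-off. The sketch below collects, per query, the documents judged at or above a threshold, so that Junk (-2) and Non (0) both fall below the default. `TrecQrel` is a local stand-in with the same field names, `relevant_docs` and its default threshold are assumptions for illustration, and the sample judgments are invented.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same fields as the TrecQrel type listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs(qrels, min_relevance=1):
    """Group doc_ids by query_id, keeping judgments with relevance >= min_relevance."""
    by_query = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            by_query[qrel.query_id].add(qrel.doc_id)
    return dict(by_query)


# Invented sample judgments for illustration only.
sample_qrels = [
    TrecQrel("q1", "doc-a", -2, "0"),  # Junk
    TrecQrel("q1", "doc-b", 0, "0"),   # Non
    TrecQrel("q1", "doc-c", 1, "0"),   # Rel
    TrecQrel("q1", "doc-d", 3, "0"),   # Key
    TrecQrel("q2", "doc-e", 2, "0"),   # HRel
]
binary = relevant_docs(sample_qrels)
strict = relevant_docs(sample_qrels, min_relevance=2)
```

Raising `min_relevance` restricts the set to the higher grades; queries with no document at or above the threshold are simply absent from the result.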

CLI
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

"clueweb09/catb/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
13K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    499    3.8%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    11K    83.5%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    1.1K    8.4%
2    HRel: The content of this page provides substantial information on the topic.    354    2.7%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    208    1.6%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    0    0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

"clueweb09/catb/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
10K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    561    5.6%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    7.2K    71.6%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    1.4K    13.8%
2    HRel: The content of this page provides substantial information on the topic.    300    3.0%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    17    0.2%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    580    5.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

"clueweb09/de"

Subset of ClueWeb09 with only German-language documents.

docs
50M docs

Language: de

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/de docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/en"

Subset of ClueWeb09 with only English-language documents.

docs
504M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/en/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
504M docs

Inherits docs from clueweb09/en

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
24K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition    Count    %
0    not relevant    17K    70.9%
1    relevant    4.8K    20.5%
2    highly relevant    2.0K    8.6%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

"clueweb09/en/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
504M docs

Inherits docs from clueweb09/en

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
25K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    1.4K    5.6%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    19K    73.7%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    4.0K    15.9%
2    HRel: The content of this page provides substantial information on the topic.    1.1K    4.3%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    138    0.5%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    0    0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

"clueweb09/en/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
504M docs

Inherits docs from clueweb09/en

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
19K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    1.0K    5.3%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    15K    78.5%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    2.0K    10.5%
2    HRel: The content of this page provides substantial information on the topic.    711    3.7%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    408    2.1%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    0    0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

"clueweb09/en/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
504M docs

Inherits docs from clueweb09/en

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
16K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    858    5.3%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    12K    72.7%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    2.2K    13.8%
2    HRel: The content of this page provides substantial information on the topic.    405    2.5%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    52    0.3%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    858    5.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

"clueweb09/es"

Subset of ClueWeb09 with only Spanish-language documents.

docs
79M docs

Language: es

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/es docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

"clueweb09/fr"

Subset of ClueWeb09 with only French-language documents.

docs
51M docs

Language: fr

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/fr docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/it"

Subset of ClueWeb09 with only Italian-language documents.

docs
27M docs

Language: it

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/it docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/ja"

Subset of ClueWeb09 with only Japanese-language documents.

docs
67M docs

Language: ja

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ja docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/ko"

Subset of ClueWeb09 with only Korean-language documents.

docs
18M docs

Language: ko

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ko docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/pt"

Subset of ClueWeb09 with only Portuguese-language documents.

docs
38M docs

Language: pt

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/pt docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/trec-mq-2009"

TREC 2009 Million Query track.

queries
40K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.0B docs

Inherits docs from clueweb09

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
35K qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition    Count    %
0    not relevant    26K    74.1%
1    relevant    5.9K    17.0%
2    highly relevant    3.1K    9.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier
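
A common next step is to group the judgments by query and keep only documents at or above a chosen relevance level from the table above (1 = relevant, 2 = highly relevant). The following is a minimal sketch, not part of ir_datasets: the helper name relevant_docs_by_query and the sample judgments are hypothetical, and a local namedtuple mirrors the TrecPrel fields so the snippet runs without the corpus.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the TrecPrel fields documented above. Hypothetical;
# used so the sketch runs without data.
SamplePrel = namedtuple("SamplePrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the doc_ids judged at or above min_relevance."""
    grouped = defaultdict(list)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id].append(qrel.doc_id)
    return dict(grouped)


# Made-up judgments for illustration only; ids and values are not real.
sample = [
    SamplePrel("20001", "doc-a", 0, 1, 0.5),
    SamplePrel("20001", "doc-b", 2, 1, 0.5),
    SamplePrel("20002", "doc-c", 1, 1, 0.25),
]
relevant = relevant_docs_by_query(sample)
print(relevant)
```

With the dataset available, dataset.qrels_iter() can be passed in place of sample. Queries with no judgment at or above the threshold do not appear in the result.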

Citation

ir_datasets.bib:

\cite{Carterette2009MQ}

Bibtex:

@inproceedings{Carterette2009MQ,
  title={Million Query Track 2009 Overview},
  author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas},
  booktitle={TREC},
  year={2009}
}


"clueweb09/zh"

Subset of ClueWeb09 with only Chinese-language documents.

docs
177M docs

Language: zh

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/zh docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier
