Github: datasets/clueweb09.py

ir_datasets: ClueWeb09

Index
  1. clueweb09
  2. clueweb09/ar
  3. clueweb09/catb
  4. clueweb09/catb/trec-web-2009
  5. clueweb09/catb/trec-web-2010
  6. clueweb09/catb/trec-web-2011
  7. clueweb09/catb/trec-web-2012
  8. clueweb09/de
  9. clueweb09/en
  10. clueweb09/en/trec-web-2009
  11. clueweb09/en/trec-web-2010
  12. clueweb09/en/trec-web-2011
  13. clueweb09/en/trec-web-2012
  14. clueweb09/es
  15. clueweb09/fr
  16. clueweb09/it
  17. clueweb09/ja
  18. clueweb09/ko
  19. clueweb09/pt
  20. clueweb09/trec-mq-2009
  21. clueweb09/zh

Data Access Information

To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
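Because the corpus is large, linking is usually more practical than copying. The following is a minimal sketch of the linking step, assuming a POSIX-like system where symlinks are allowed. The helper name `link_corpus_dirs` and the mount point mentioned in the final comment are hypothetical and are not part of ir_datasets.

```python
from pathlib import Path


def link_corpus_dirs(source_root, corpus_root=None):
    """Symlink every top-level directory of source_root into corpus_root.

    corpus_root defaults to ~/.ir_datasets/clueweb09/corpus, the location
    named in the text above. Existing entries are left untouched.
    """
    source_root = Path(source_root)
    if corpus_root is None:
        corpus_root = Path.home() / ".ir_datasets" / "clueweb09" / "corpus"
    corpus_root = Path(corpus_root)
    corpus_root.mkdir(parents=True, exist_ok=True)

    linked = []
    for child in sorted(source_root.iterdir()):
        if not child.is_dir():
            continue  # only directories are linked; loose files are skipped
        link = corpus_root / child.name
        if not link.exists() and not link.is_symlink():
            link.symlink_to(child.resolve(), target_is_directory=True)
        linked.append(link)
    return linked


# Hypothetical usage, where /mnt/clueweb09 stands for your own copy:
# link_corpus_dirs("/mnt/clueweb09")
```

The function returns the list of link paths so you can check which directories ended up under the corpus root.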


"clueweb09"

The ClueWeb 2009 web document collection. Contains over 1 billion web pages in 10 languages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

docs

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
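Since `body` and `http_headers` are raw bytes, most uses need a decoding step first. The sketch below shows one possible approach: look for a `charset=` declaration in the HTTP headers and fall back to UTF-8. The `WarcDoc` namedtuple here is a local stand-in with the fields listed above, `decode_body` is a hypothetical helper, and the sample record is made up so that the snippet runs without the corpus.

```python
import re
from collections import namedtuple

# Local stand-in with the same fields as the WarcDoc type listed above.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)

# Matches e.g. "charset=ISO-8859-1" or 'charset="utf-8"' in the raw headers.
_CHARSET_RE = re.compile(rb'charset="?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_body(doc, default="utf-8"):
    """Decode doc.body to str, guessing the encoding from the HTTP headers."""
    match = _CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows about.
        return doc.body.decode(default, errors="replace")


# Made-up record for illustration only.
sample = WarcDoc(
    doc_id="doc-1",
    url="http://example.com/",
    date="2009-01-01",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    body="caf\u00e9".encode("latin-1"),
    body_content_type="text/html",
)
text = decode_body(sample)
```

With the real corpus, the same helper could be applied to each `doc` yielded by `dataset.docs_iter()`. Pages that declare their encoding only inside the HTML are not handled by this sketch.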

CLI
ir_datasets export clueweb09 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/ar"

Subset of ClueWeb09 with only Arabic-language documents.

docs

Language: ar

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ar docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/catb"

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/catb/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.
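Each query carries a nested tuple of subtopics, which is often easier to work with once flattened into one row per subtopic. The sketch below illustrates this with local stand-ins for the two namedtuple types listed above. The helper `subtopic_rows` is hypothetical and the sample query and its field values are invented for illustration.

```python
from collections import namedtuple

# Local stand-ins mirroring the field lists shown above.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def subtopic_rows(queries):
    """Yield (query_id, subtopic number, subtopic type, subtopic text) rows."""
    for query in queries:
        for sub in query.subtopics:
            yield (query.query_id, sub.number, sub.type, sub.text)


# Invented query, for illustration only.
example = TrecWebTrackQuery(
    query_id="q1",
    query="example query",
    description="A made-up topic.",
    type="faceted",
    subtopics=(
        TrecSubtopic("1", "first facet", "inf"),
        TrecSubtopic("2", "second facet", "nav"),
    ),
)
rows = list(subtopic_rows([example]))
```

With the real data, `subtopic_rows(dataset.queries_iter())` would produce the same kind of rows for every topic.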

CLI
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/catb

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition
0    not relevant
1    relevant
2    highly relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.
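The `method` and `iprob` fields distinguish these judgments from ordinary qrels. The sketch below assumes that `iprob` is the probability with which a document was included in the judged sample; the field name suggests this, but this page does not state it. Under that assumption, weighting each relevant judged document by `1 / iprob` gives an estimate of the number of relevant documents per query. `TrecPrel` is a local stand-in and `estimate_relevant_counts` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecPrel field list shown above.
TrecPrel = namedtuple(
    "TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"]
)


def estimate_relevant_counts(prels, min_rel=1):
    """Estimate relevant documents per query by inverse-probability weighting.

    ASSUMPTION: iprob is the inclusion probability of the judged document.
    Records with iprob <= 0 are skipped to avoid division by zero.
    """
    totals = defaultdict(float)
    for prel in prels:
        if prel.relevance >= min_rel and prel.iprob > 0:
            totals[prel.query_id] += 1.0 / prel.iprob
    return dict(totals)


# Made-up judgments, for illustration only.
sample_prels = [
    TrecPrel("q1", "d1", 1, 1, 0.5),
    TrecPrel("q1", "d2", 0, 1, 0.5),
    TrecPrel("q1", "d3", 2, 1, 0.25),
    TrecPrel("q2", "d4", 1, 0, 1.0),
]
estimates = estimate_relevant_counts(sample_prels)
```

With the real data the input would be `dataset.qrels_iter()`. Consult the track overview cited below before relying on this reading of `iprob`.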

CLI
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.
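If the exported TSV is redirected to a file, it can be read back with the standard `csv` module. The sketch below assumes one record per line, five tab-separated columns in the order shown above, and no header row. The helper `read_prels_tsv` and the sample lines are hypothetical.

```python
import csv
import io


def read_prels_tsv(handle):
    """Yield (query_id, doc_id, relevance, method, iprob) from a TSV stream."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        query_id, doc_id, relevance, method, iprob = row
        yield (query_id, doc_id, int(relevance), int(method), float(iprob))


# Made-up lines standing in for the exported file.
sample_tsv = io.StringIO("q1\tdoc-a\t1\t1\t0.5\nq1\tdoc-b\t0\t1\t0.25\n")
parsed = list(read_prels_tsv(sample_tsv))
```

For a real export, pass an open file object instead of the `io.StringIO` stand-in.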

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

"clueweb09/catb/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/catb

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
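Many evaluation measures need binary relevance, so the graded levels above have to be collapsed at some threshold. The sketch below groups relevant document IDs by query, treating every level at or above `min_rel` as relevant. The default threshold of 1 is an assumption rather than an official rule; `TrecQrel` is a local stand-in and `relevant_docs_by_query` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecQrel field list shown above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs_by_query(qrels, min_rel=1):
    """Map each query_id to the set of doc_ids judged at or above min_rel."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


# Made-up judgments, for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 4, "0"),
    TrecQrel("q1", "d2", -2, "0"),
    TrecQrel("q1", "d3", 1, "0"),
    TrecQrel("q2", "d4", 0, "0"),
]
relevant = relevant_docs_by_query(sample_qrels)
```

With the real data the input would be `dataset.qrels_iter()`. Queries with no document at or above the threshold are absent from the result.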

CLI
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

"clueweb09/catb/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/catb

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
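Graded levels such as these are typically consumed by gain-based measures. As a worked illustration, the sketch below computes DCG at a cutoff using exponential gain, with negative (junk) labels and unjudged documents both treated as gain 0. This is one common convention and is not presented as the official Web Track measure. The function `dcg` and the sample labels are hypothetical.

```python
import math


def dcg(ranked_doc_ids, rels, k=10):
    """Discounted cumulative gain at cutoff k.

    rels maps doc_id -> graded relevance. Unjudged documents and negative
    labels (e.g. -2, junk) contribute a gain of 0.
    """
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        gain = 2 ** max(rels.get(doc_id, 0), 0) - 1
        total += gain / math.log2(rank + 1)
    return total


# Made-up labels and ranking, for illustration only.
labels = {"a": 4, "b": -2, "c": 1}
score = dcg(["a", "b", "c"], labels)
```

In this example the first document contributes 15 / log2(2) = 15, the junk document contributes 0, and the third contributes 1 / log2(4) = 0.5.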

CLI
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

"clueweb09/catb/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/catb

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
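A quick way to get a feel for the judgments is to count how often each relevance level occurs. The sketch below tallies labels and reports them under the short names from the table above. `TrecQrel` is a local stand-in, `label_histogram` is a hypothetical helper, and the sample judgments are made up.

```python
from collections import Counter, namedtuple

# Local stand-in mirroring the TrecQrel field list shown above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Short names taken from the relevance-level table above.
LEVEL_NAMES = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def label_histogram(qrels):
    """Count judgments per relevance level, ordered from lowest to highest."""
    counts = Counter(qrel.relevance for qrel in qrels)
    return {
        LEVEL_NAMES.get(level, str(level)): count
        for level, count in sorted(counts.items())
    }


# Made-up judgments, for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", -2, "0"),
    TrecQrel("q2", "d4", 4, "0"),
    TrecQrel("q2", "d5", 1, "0"),
]
histogram = label_histogram(sample_qrels)
```

Levels that do not appear in the table fall back to their numeric value as a string, so unexpected labels stay visible rather than being dropped.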

CLI
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

"clueweb09/de"

Subset of ClueWeb09 with only German-language documents.

docs

Language: de

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/de docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/en"

Subset of ClueWeb09 with only English-language documents.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/en/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition
0    not relevant
1    relevant
2    highly relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

"clueweb09/en/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

"clueweb09/en/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

"clueweb09/en/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

Language: en

Note: Uses docs from clueweb09/en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2    HRel: The content of this page provides substantial information on the topic.
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

"clueweb09/es"

Subset of ClueWeb09 with only Spanish-language documents.

docs

Language: es

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/es docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/fr"

Subset of ClueWeb09 with only French-language documents.

docs

Language: fr

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/fr docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/it"

Subset of ClueWeb09 with only Italian-language documents.

docs

Language: it

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/it docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/ja"

Subset of ClueWeb09 with only Japanese-language documents.

docs

Language: ja

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ja docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/ko"

Subset of ClueWeb09 with only Korean-language documents.

docs

Language: ko

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/ko docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/pt"

Subset of ClueWeb09 with only Portuguese-language documents.

docs

Language: pt

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/pt docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier


"clueweb09/trec-mq-2009"

TREC 2009 Million Query track.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier
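
A common next step is to turn the query records into a lookup table keyed by `query_id`. The sketch below is illustrative only: `GenericQueryStub` is a hypothetical stand-in with the two fields listed above, `build_query_lookup` is not an ir_datasets function, and the sample queries are made up.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the GenericQuery fields listed above.
GenericQueryStub = namedtuple("GenericQueryStub", ["query_id", "text"])


def build_query_lookup(queries):
    """Map each query_id to its text; later duplicates overwrite earlier ones."""
    return {query.query_id: query.text for query in queries}


# Made-up queries for illustration only; not taken from the track data.
example_queries = [
    GenericQueryStub("q1", "first example query"),
    GenericQueryStub("q2", "second example query"),
]

lookup = build_query_lookup(example_queries)
print(lookup["q2"])
```

With the real dataset, `dataset.queries_iter()` could be passed in place of `example_queries`.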

docs

Language: multiple/other/unknown

Note: Uses docs from clueweb09

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.    Definition
0       not relevant
1       relevant
2       highly relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier
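
To see how the judgments are distributed, the qrels can be tallied per query using the relevance levels listed above. This is a minimal sketch under stated assumptions: `TrecPrelStub` is a hypothetical stand-in with the same five fields as `TrecPrel`, `count_by_relevance` is an illustrative helper rather than an ir_datasets function, and the sample records (including their `method` and `iprob` values) are invented.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical stand-in mirroring the TrecPrel fields listed above.
TrecPrelStub = namedtuple(
    "TrecPrelStub", ["query_id", "doc_id", "relevance", "method", "iprob"]
)

# Labels copied from the relevance level table for this dataset.
RELEVANCE_LABELS = {0: "not relevant", 1: "relevant", 2: "highly relevant"}


def count_by_relevance(qrels):
    """Return {query_id: Counter({label: number of judged documents})}."""
    counts = defaultdict(Counter)
    for qrel in qrels:
        # Any level outside the table is grouped under "other".
        label = RELEVANCE_LABELS.get(qrel.relevance, "other")
        counts[qrel.query_id][label] += 1
    return counts


# Made-up judgments for illustration only; not taken from the track data.
example_qrels = [
    TrecPrelStub("q1", "doc-a", 2, 1, 0.5),
    TrecPrelStub("q1", "doc-b", 0, 1, 0.5),
    TrecPrelStub("q1", "doc-c", 0, 0, 1.0),
    TrecPrelStub("q2", "doc-a", 1, 1, 0.25),
]

counts = count_by_relevance(example_qrels)
for query_id in sorted(counts):
    print(query_id, dict(counts[query_id]))
```

With the real dataset, `dataset.qrels_iter()` could be passed in place of `example_qrels`.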

Citation

ir_datasets.bib:

\cite{Carterette2009MQ}

Bibtex:

@inproceedings{Carterette2009MQ, title={Million Query Track 2009 Overview}, author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas}, booktitle={TREC}, year={2009} }

"clueweb09/zh"

Subset of ClueWeb09 with only Chinese-language documents.

docs

Language: zh

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb09/zh docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier