ir_datasets: ClueWeb09

import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
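Because the loop above walks every document in the collection, it can help to cap the iteration while exploring. The following is a minimal sketch that assumes only that docs_iter() returns an ordinary Python iterator; fake_docs_iter is a hypothetical stand-in so the snippet runs without access to the corpus.

```python
from itertools import islice


def take(iterator, n):
    """Return at most the first n items of an iterator as a list."""
    return list(islice(iterator, n))


def fake_docs_iter():
    # Hypothetical stand-in for dataset.docs_iter(); yields made-up IDs forever.
    i = 0
    while True:
        yield f"example-doc-{i}"
        i += 1


first_three = take(fake_docs_iter(), 3)
```

With a loaded dataset, the equivalent call would be take(dataset.docs_iter(), 3).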

CLI

ir_datasets export clueweb09 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/ar"`

Subset of ClueWeb09 with only Arabic-language documents.

docs

Language: ar

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
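Since body is exposed as bytes, it usually has to be decoded before any text processing. The sketch below is one possible approach, not part of ir_datasets: it assumes the character set, when present, is announced as charset=... in http_headers, and falls back to UTF-8 otherwise. The WarcDoc stand-in, the decode_body helper, and the sample record are all hypothetical.

```python
import re
from collections import namedtuple

# Stand-in with the same fields as the WarcDoc namedtuple documented above.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)

# Assumption: the charset, if any, appears as "charset=..." in the raw headers.
_CHARSET_RE = re.compile(rb"charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_body(doc, default="utf-8"):
    """Decode doc.body using the declared charset, else the default encoding."""
    match = _CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows about.
        return doc.body.decode(default, errors="replace")


# Made-up record for illustration; not taken from the corpus.
sample = WarcDoc(
    doc_id="example-doc-0",
    url="http://example.com/",
    date="2009-01-01T00:00:00Z",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=windows-1256\r\n\r\n",
    body="مرحبا".encode("windows-1256"),
    body_content_type="text/html",
)
decoded = decode_body(sample)
```

Using errors="replace" keeps the loop running on malformed pages at the cost of substituting undecodable bytes.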

CLI

ir_datasets export clueweb09/ar docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/catb"`

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/catb/trec-web-2009"`

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.
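Each query carries its subtopics as a nested tuple, so per-subtopic processing needs a second loop. A minimal sketch of flattening them into rows follows; the namedtuples are stand-ins that mirror the documented fields, and the sample query, including its type values, is invented for illustration.

```python
from collections import namedtuple

# Stand-ins mirroring the documented query and subtopic fields.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def flatten_subtopics(query):
    """Return one (query_id, number, type, text) row per subtopic of a query."""
    return [(query.query_id, st.number, st.type, st.text) for st in query.subtopics]


# Made-up query; the field values are illustrative, not benchmark data.
sample = TrecWebTrackQuery(
    query_id="0",
    query="example query",
    description="An invented information need.",
    type="faceted",
    subtopics=(
        TrecSubtopic(number="1", text="First invented aspect.", type="inf"),
        TrecSubtopic(number="2", text="Second invented aspect.", type="nav"),
    ),
)
rows = flatten_subtopics(sample)
```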

CLI

ir_datasets export clueweb09/catb/trec-web-2009 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.
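Evaluation code often needs the judgments grouped by query rather than as a flat stream. The sketch below collects the documents judged at or above a chosen relevance level; TrecPrel is a stand-in with the documented fields, and the sample judgments are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the documented TrecPrel fields.
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above min_relevance."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id].add(qrel.doc_id)
    return dict(grouped)


# Invented judgments for illustration only.
sample_qrels = [
    TrecPrel("1", "doc-a", 0, 1, 1.0),
    TrecPrel("1", "doc-b", 2, 1, 0.5),
    TrecPrel("2", "doc-c", 1, 0, 0.25),
]
relevant = relevant_docs_by_query(sample_qrels)
```

Passing min_relevance=2 would keep only the documents judged highly relevant.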

CLI

ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.
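The exported judgments can be read back with the standard library. This is a sketch under the assumption that the TSV export writes one record per line with the five fields in the order shown above and no header row; parse_prels_tsv and PrelRow are hypothetical names, and the sample text is invented.

```python
import csv
import io
from collections import namedtuple

# Hypothetical row type following the column order of the export shown above.
PrelRow = namedtuple("PrelRow", ["query_id", "doc_id", "relevance", "method", "iprob"])


def parse_prels_tsv(text):
    """Parse tab-separated judgment lines into typed PrelRow records."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        query_id, doc_id, relevance, method, iprob = fields
        rows.append(PrelRow(query_id, doc_id, int(relevance), int(method), float(iprob)))
    return rows


# Invented export text; real output would come from the CLI command above.
sample_output = "1\tdoc-a\t0\t1\t1.0\n1\tdoc-b\t2\t1\t0.5\n"
parsed = parse_prels_tsv(sample_output)
```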

No example available for PyTerrier

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

`"clueweb09/catb/trec-web-2010"`

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.
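The graded scale above can be turned into short labels or collapsed to a binary judgment. The sketch below copies the labels from the table; the choice of 1 as the default cut-off for "relevant" is an assumption made for illustration, not something the benchmark prescribes.

```python
# Labels copied from the relevance table above.
RELEVANCE_LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def label_for(relevance):
    """Return the short label for a relevance level, or 'unknown' if unlisted."""
    return RELEVANCE_LABELS.get(relevance, "unknown")


def is_relevant(relevance, threshold=1):
    """Binary view of a graded judgment; the threshold is an illustrative choice."""
    return relevance >= threshold


labels = [label_for(r) for r in (-2, 0, 3)]
```

Note that Junk (-2) falls below any non-negative threshold, so it is treated the same as Non in the binary view.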

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

`"clueweb09/catb/trec-web-2011"`

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

`"clueweb09/catb/trec-web-2012"`

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

`"clueweb09/de"`

Subset of ClueWeb09 with only German-language documents.

docs

Language: de

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/de docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/en"`

Subset of ClueWeb09 with only English-language documents.

docs

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/en/trec-web-2009"`

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

`"clueweb09/en/trec-web-2010"`

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.
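Graded levels like these are commonly fed into a gain-based measure. The following sketch computes DCG and nDCG with the widely used exponential gain 2^rel - 1, clipping the negative Junk level to zero gain. Both the gain function and the clipping are assumptions made here for illustration, and this is not necessarily the measure reported by the track.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance levels."""
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        gain = 2 ** max(rel, 0) - 1  # assumption: negative levels contribute no gain
        total += gain / math.log2(rank + 1)
    return total


def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """DCG of a ranking divided by the DCG of the ideal ordering of all judgments."""
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Invented relevance levels for one ranked list.
score = dcg_at_k([3, 2, -2, 0, 1], k=5)
```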

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

`"clueweb09/en/trec-web-2011"`

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

`"clueweb09/en/trec-web-2012"`

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2	HRel: The content of this page provides substantial information on the topic.
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

`"clueweb09/es"`

Subset of ClueWeb09 with only Spanish-language documents.

docs

Language: es

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/es docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/fr"`

Subset of ClueWeb09 with only French-language documents.

docs

Language: fr

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/fr docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/it"`

Subset of ClueWeb09 with only Italian-language documents.

docs

Language: it

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/it docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/ja"`

Subset of ClueWeb09 with only Japanese-language documents.

docs

Language: ja

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/ja docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/ko"`

Subset of ClueWeb09 with only Korean-language documents.

docs

Language: ko

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/ko docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/pt"`

Subset of ClueWeb09 with only Portuguese-language documents.

docs

Language: pt

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/pt docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

`"clueweb09/trec-mq-2009"`

TREC 2009 Million Query track.

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/trec-mq-2009 queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

Inherits docs from clueweb09

Language: multiple/other/unknown

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/trec-mq-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.
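Queries and judgments come from separate iterators, so restricting the query set to those that actually have judgments takes a small join. A minimal sketch, assuming some queries may carry no judgments; the namedtuples are stand-ins with the documented fields and the sample records are invented.

```python
from collections import namedtuple

# Stand-ins mirroring the documented query and judgment fields.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def judged_queries(queries, qrels):
    """Return the queries that have at least one judgment, in their original order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Invented records for illustration only.
sample_queries = [
    GenericQuery("10", "first invented query"),
    GenericQuery("11", "second invented query"),
    GenericQuery("12", "third invented query"),
]
sample_qrels = [
    TrecPrel("10", "doc-a", 1, 1, 1.0),
    TrecPrel("12", "doc-b", 0, 1, 0.5),
    TrecPrel("12", "doc-c", 2, 0, 0.1),
]
kept = judged_queries(sample_queries, sample_qrels)
```

With a loaded dataset, the call would be judged_queries(dataset.queries_iter(), dataset.qrels_iter()).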

CLI

ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Carterette2009MQ}

Bibtex:

@inproceedings{Carterette2009MQ, title={Million Query Track 2009 Overview}, author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas}, booktitle={TREC}, year={2009} }

`"clueweb09/zh"`

Subset of ClueWeb09 with only Chinese-language documents.

docs

Language: zh

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/zh docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.
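The language-specific subsets on this page all follow the pattern clueweb09/<language code>. The sketch below builds those identifiers from the codes listed here; subset_id is a hypothetical helper, not part of ir_datasets.

```python
# Language codes and names as listed in the subset sections of this page.
LANGUAGE_SUBSETS = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pt": "Portuguese",
    "zh": "Chinese",
}


def subset_id(language_code):
    """Return the dataset ID for a language subset, e.g. 'clueweb09/ja'."""
    if language_code not in LANGUAGE_SUBSETS:
        raise ValueError(f"no ClueWeb09 subset listed for {language_code!r}")
    return f"clueweb09/{language_code}"


all_ids = [subset_id(code) for code in LANGUAGE_SUBSETS]
```

Each resulting string can be passed to ir_datasets.load in the same way as the examples above.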