ir_datasets: ClueWeb09

import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 1040859705,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-"
      }
    }
  }
}
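The `doc_id` statistics above (a shared `clueweb09-` prefix and a maximum length of 25 characters) can serve as a cheap sanity check on identifiers before attempting a lookup. The following is a minimal sketch only: `looks_like_clueweb09_id` is a hypothetical helper, not part of ir_datasets, and the sample IDs are illustrative strings rather than IDs confirmed to exist in the corpus.

```python
# Values copied from the metadata block above.
DOC_ID_PREFIX = "clueweb09-"
DOC_ID_MAX_LEN = 25


def looks_like_clueweb09_id(doc_id: str) -> bool:
    """Cheap plausibility check; it does not prove the document exists."""
    return doc_id.startswith(DOC_ID_PREFIX) and len(doc_id) <= DOC_ID_MAX_LEN


# Illustrative IDs only (hypothetical, chosen to exercise both branches).
plausible = looks_like_clueweb09_id("clueweb09-en0000-00-00000")
implausible = looks_like_clueweb09_id("clueweb12-0000tw-00-00000")
print(plausible, implausible)
```

A check like this only filters out obviously malformed IDs; it says nothing about whether a given document is present in a particular subset.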

`"clueweb09/ar"`

Subset of ClueWeb09 with only Arabic-language documents.

docs

29M docs

Language: ar

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str
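Because `http_headers` and `body` are raw `bytes`, the text has to be decoded before it can be tokenized or indexed. The sketch below shows one possible approach, assuming the character set (when present) is declared in the HTTP headers as `charset=...`. The `WarcDoc` namedtuple here is a local stand-in with the same field names so the example runs without the corpus, `decode_body` is a hypothetical helper, and the sample record is made up.

```python
import re
from collections import namedtuple

# Local stand-in with the field names listed above (not the library's class).
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)

# Matches e.g. b"charset=windows-1256" or b'charset="utf-8"'.
_CHARSET_RE = re.compile(rb"charset=[\"']?([\w.:-]+)", re.IGNORECASE)


def decode_body(doc, default="utf-8"):
    """Decode doc.body using the charset named in the HTTP headers, if any."""
    match = _CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:  # unknown or misspelled charset name
        return doc.body.decode(default, errors="replace")


# Made-up record for illustration.
greeting = "\u0645\u0631\u062d\u0628\u0627"  # Arabic "marhaba" (hello)
sample = WarcDoc(
    doc_id="clueweb09-ar0000-00-00000",
    url="http://example.com/",
    date="2009-01-01T00:00:00Z",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=windows-1256\r\n",
    body=("<p>" + greeting + "</p>").encode("windows-1256"),
    body_content_type="text/html",
)
text = decode_body(sample)
```

Using `errors="replace"` trades fidelity for robustness: undecodable bytes become replacement characters instead of raising, which is usually preferable when iterating over millions of crawled pages.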

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/ar docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 29192662,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ar000"
      }
    }
  }
}

`"clueweb09/catb"`

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs

50M docs

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  }
}

`"clueweb09/catb/trec-web-2009"`

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str
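Since `subtopics` is a tuple of nested namedtuples, it is often convenient to flatten it into one row per subtopic. The sketch below illustrates this with local stand-in namedtuples that mirror the field names above; `flatten_subtopics` is a hypothetical helper and the query is made up. With the real dataset, the same function could be applied to each item from `dataset.queries_iter()`.

```python
from collections import namedtuple

# Local stand-ins mirroring the field names listed above.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def flatten_subtopics(query):
    """Yield one (query_id, subtopic number, subtopic text) row per subtopic."""
    for sub in query.subtopics:
        yield (query.query_id, sub.number, sub.text)


# Made-up query for illustration; all field values are assumptions.
example = TrecWebTrackQuery(
    query_id="0",
    query="example query",
    description="A hypothetical information need.",
    type="faceted",
    subtopics=(
        TrecSubtopic(number="1", text="first hypothetical aspect", type="inf"),
        TrecSubtopic(number="2", text="second hypothetical aspect", type="nav"),
    ),
)
rows = list(flatten_subtopics(example))
```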

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

13K qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`9.1K`	69.5%
1	relevant	`2.5K`	19.2%
2	highly relevant	`1.5K`	11.3%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...
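The exported judgments can be read back into typed records. The sketch below assumes one judgment per line, with the five fields in the order shown above, separated by tabs, and no header row; that layout is an assumption to verify against actual output. `parse_prels_tsv` is a hypothetical helper, `TrecPrel` is a local stand-in, and the two sample lines are made up.

```python
import csv
import io
from collections import namedtuple

# Local stand-in with the TrecPrel field names.
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def parse_prels_tsv(handle):
    """Parse tab-separated lines into TrecPrel tuples with typed fields."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, method, iprob = row
        yield TrecPrel(query_id, doc_id, int(relevance), int(method), float(iprob))


# Two made-up lines standing in for real export output.
sample = io.StringIO(
    "1\tclueweb09-en0000-00-00000\t1\t1\t0.5\n"
    "1\tclueweb09-en0000-00-00001\t0\t0\t0.25\n"
)
prels = list(parse_prels_tsv(sample))
```

In practice the `handle` would be an open file containing the redirected output of the export command rather than an in-memory string.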

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 13118,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 9116,
          "1": 2514,
          "2": 1488
        }
      }
    }
  }
}
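The percentages in the relevance-level table follow directly from the `counts_by_value` metadata. As a small worked example (the numbers are copied from the block above; nothing is loaded from the dataset):

```python
# Counts copied from the "counts_by_value" metadata above.
counts_by_value = {"0": 9116, "1": 2514, "2": 1488}

total = sum(counts_by_value.values())
percentages = {
    level: round(100 * count / total, 1)
    for level, count in counts_by_value.items()
}
print(total)        # 13118
print(percentages)  # {'0': 69.5, '1': 19.2, '2': 11.3}
```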

`"clueweb09/catb/trec-web-2009/diversity"`

The TREC Web Track 2009 diversity ranking benchmark. Contains 50 queries with subtopic-level relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

16K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`12K`	75.0%
1	relevant	`4.1K`	25.0%
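Diversity judgments are made per subtopic, so a common first step is to group relevant documents by `(query_id, subtopic_id)`. The sketch below is one way to do that, using a local stand-in for `TrecSubQrel` and made-up judgments; `relevant_docs_by_subtopic` and its `min_rel` parameter are hypothetical names introduced for illustration.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the TrecSubQrel field names listed above.
TrecSubQrel = namedtuple("TrecSubQrel", ["query_id", "doc_id", "relevance", "subtopic_id"])


def relevant_docs_by_subtopic(qrels, min_rel=1):
    """Map (query_id, subtopic_id) to the set of doc_ids judged >= min_rel."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Made-up judgments for illustration.
sample = [
    TrecSubQrel("1", "doc-a", 1, "1"),
    TrecSubQrel("1", "doc-b", 0, "1"),  # not relevant, so it is dropped
    TrecSubQrel("1", "doc-a", 1, "2"),  # same document, different subtopic
    TrecSubQrel("2", "doc-c", 1, "1"),
]
by_subtopic = relevant_docs_by_subtopic(sample)
```

With the real dataset, the `sample` list would be replaced by `dataset.qrels_iter()`.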

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2009/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 16347,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 12266,
          "1": 4081
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2010"`

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

16K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`715`	4.5%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`12K`	76.0%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`2.3K`	14.6%
2	HRel: The content of this page provides substantial information on the topic.	`682`	4.3%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`90`	0.6%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%
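Many evaluation measures expect binary labels, so the graded scale above often has to be collapsed. The sketch below treats grades at or above a threshold as relevant and everything else, including Junk (-2), as non-relevant. The default threshold of 1 is an assumption made for illustration, not a recommendation from the track; `to_binary` is a hypothetical helper, `TrecQrel` is a local stand-in, and the judgments are made up.

```python
from collections import namedtuple

# Local stand-in with the TrecQrel field names listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def to_binary(qrels, threshold=1):
    """Return {query_id: {doc_id: 0 or 1}}; grades >= threshold become 1."""
    binary = {}
    for qrel in qrels:
        label = 1 if qrel.relevance >= threshold else 0
        binary.setdefault(qrel.query_id, {})[qrel.doc_id] = label
    return binary


# Made-up judgments covering several grades from the table above.
sample = [
    TrecQrel("51", "doc-a", -2, "0"),  # Junk
    TrecQrel("51", "doc-b", 0, "0"),   # Non
    TrecQrel("51", "doc-c", 1, "0"),   # Rel
    TrecQrel("52", "doc-d", 3, "0"),   # Key
]
binary = to_binary(sample)
```

Raising `threshold` to 2 would count only HRel, Key, and Nav pages as relevant, which changes which documents contribute to binary measures.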

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 15845,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 12040,
          "1": 2318,
          "-2": 715,
          "2": 682,
          "3": 90
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2010/diversity"`

The TREC Web Track 2010 diversity ranking benchmark. Contains 50 queries with subtopic-level relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

5.5K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`0`	0.0%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`0`	0.0%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`5.5K`	100.0%
2	HRel: The content of this page provides substantial information on the topic.	`0`	0.0%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`0`	0.0%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2010/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 5522,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 5522
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2011"`

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

13K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`499`	3.8%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`11K`	83.5%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`1.1K`	8.4%
2	HRel: The content of this page provides substantial information on the topic.	`354`	2.7%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`208`	1.6%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 13081,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 10920,
          "1": 1100,
          "2": 354,
          "-2": 499,
          "3": 208
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2011/diversity"`

The TREC Web Track 2011 diversity ranking benchmark. Contains 50 queries with subtopic-level relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

44K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`1.7K`	3.9%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`38K`	85.8%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`3.0K`	6.9%
2	HRel: The content of this page provides substantial information on the topic.	`919`	2.1%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`556`	1.3%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2011/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 43889,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 37665,
          "1": 3016,
          "2": 919,
          "-2": 1733,
          "3": 556
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2012"`

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

10K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`561`	5.6%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`7.2K`	71.6%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`1.4K`	13.8%
2	HRel: The content of this page provides substantial information on the topic.	`300`	3.0%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`17`	0.2%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`580`	5.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 10022,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 561,
          "0": 7178,
          "1": 1386,
          "4": 580,
          "2": 300,
          "3": 17
        }
      }
    }
  }
}

`"clueweb09/catb/trec-web-2012/diversity"`

The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries with subtopic-level relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

50M docs

Inherits docs from clueweb09/catb

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

39K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`2.2K`	5.7%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`31K`	78.7%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`3.5K`	9.0%
2	HRel: The content of this page provides substantial information on the topic.	`887`	2.3%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`47`	0.1%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`1.7K`	4.3%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/catb/trec-web-2012/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 38992,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 2237,
          "0": 30669,
          "1": 3494,
          "4": 1658,
          "2": 887,
          "3": 47
        }
      }
    }
  }
}

`"clueweb09/de"`

Subset of ClueWeb09 with only German-language documents.

docs

50M docs

Language: de

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/de docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 49814309,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-de00"
      }
    }
  }
}

`"clueweb09/en"`

Subset of ClueWeb09 with only English-language documents.

docs

504M docs

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  }
}
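The `common_prefix` values reported for the language subsets suggest that each `doc_id` carries a two-letter language code right after `clueweb09-`. The sketch below relies on that assumption, which is inferred from the metadata rather than stated on this page. `language_of` is a hypothetical helper and the ID is illustrative.

```python
import re

# Assumed layout, inferred from the common_prefix and max_len metadata:
# "clueweb09-" + two-letter language code + remaining segment identifiers.
DOC_ID_PATTERN = re.compile(r"^clueweb09-([a-z]{2})")


def language_of(doc_id):
    """Return the two-letter code embedded in a doc_id, or None if absent."""
    match = DOC_ID_PATTERN.match(doc_id)
    return match.group(1) if match else None


# Illustrative ID with the documented maximum length of 25 characters.
print(language_of("clueweb09-en0000-00-00000"))
```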

`"clueweb09/en/trec-web-2009"`

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier
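Each query nests a tuple of subtopics, so flattening them usually takes a second loop. The sketch below uses stand-in namedtuples that only mirror the documented field names, so it runs without the corpus. `subtopic_rows` is a hypothetical helper and the example query is invented.

```python
from typing import NamedTuple, Tuple


class TrecSubtopic(NamedTuple):
    number: str
    text: str
    type: str


class TrecWebTrackQuery(NamedTuple):
    query_id: str
    query: str
    description: str
    type: str
    subtopics: Tuple[TrecSubtopic, ...]


def subtopic_rows(queries):
    """Yield (query_id, subtopic number, subtopic text) for every subtopic."""
    for query in queries:
        for subtopic in query.subtopics:
            yield query.query_id, subtopic.number, subtopic.text


# Hypothetical query, for illustration only.
example = TrecWebTrackQuery(
    query_id="q1",
    query="example query",
    description="An invented description.",
    type="faceted",
    subtopics=(
        TrecSubtopic("1", "first invented subtopic", "inf"),
        TrecSubtopic("2", "second invented subtopic", "nav"),
    ),
)
rows = list(subtopic_rows([example]))
print(rows)
```

With the library installed and the corpus available, the same helper should accept the iterator returned by `dataset.queries_iter()`, since it only reads the documented fields.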

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

24K qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`17K`	70.9%
1	relevant	`4.8K`	20.5%
2	highly relevant	`2.0K`	8.6%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.

No example available for PyTerrier
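This page does not describe the `method` and `iprob` fields. The sketch below assumes `iprob` is the probability with which a document was sampled for judging, in which case summing `1 / iprob` over relevant judgments gives an inverse-probability estimate of the number of relevant documents per query. The `TrecPrel` class is a stand-in mirroring the documented fields, `estimated_relevant_counts` is a hypothetical helper, and the judgments are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class TrecPrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    method: int
    iprob: float


def estimated_relevant_counts(prels, min_relevance=1):
    """Sum 1/iprob over relevant judgments, grouped by query."""
    totals = defaultdict(float)
    for prel in prels:
        # Skip non-relevant judgments and guard against a zero probability.
        if prel.relevance >= min_relevance and prel.iprob > 0:
            totals[prel.query_id] += 1.0 / prel.iprob
    return dict(totals)


# Invented judgments, for illustration only.
sample = [
    TrecPrel("q1", "doc-a", 1, 1, 0.5),
    TrecPrel("q1", "doc-b", 0, 1, 0.5),
    TrecPrel("q1", "doc-c", 2, 1, 0.25),
]
print(estimated_relevant_counts(sample))
```

If the assumption about `iprob` does not hold for this file, the estimate is meaningless, so it is worth checking the track overview before relying on it.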

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 23601,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 16743,
          "1": 4832,
          "2": 2026
        }
      }
    }
  }
}
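The percentages in the relevance table can be reproduced from `counts_by_value` in the metadata above. A short sketch, with the counts copied from that block:

```python
# Counts copied from the "qrels" metadata above.
counts_by_value = {"0": 16743, "1": 4832, "2": 2026}

total = sum(counts_by_value.values())
shares = {
    level: round(100 * count / total, 1)
    for level, count in counts_by_value.items()
}
print(total, shares)
```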

`"clueweb09/en/trec-web-2009/diversity"`

The TREC Web Track 2009 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

28K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`21K`	76.8%
1	relevant	`6.5K`	23.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2009/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier
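Diversity judgments are recorded per subtopic, so a common first step is grouping relevant documents by `(query_id, subtopic_id)`. The sketch below uses a stand-in `TrecSubQrel` with the documented field names. `relevant_docs_by_subtopic` is a hypothetical helper and the judgments are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class TrecSubQrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def relevant_docs_by_subtopic(qrels):
    """Map (query_id, subtopic_id) to the set of doc_ids with relevance >= 1."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= 1:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


# Invented judgments, for illustration only.
sample = [
    TrecSubQrel("q1", "doc-a", 1, "1"),
    TrecSubQrel("q1", "doc-b", 0, "1"),
    TrecSubQrel("q1", "doc-a", 1, "2"),
    TrecSubQrel("q1", "doc-c", 1, "2"),
]
print(relevant_docs_by_subtopic(sample))
```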

\cite{Clarke2009TrecWeb}

Bibtex:

@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 27964,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 21465,
          "1": 6499
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2010"`

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

25K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`1.4K`	5.6%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`19K`	73.7%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`4.0K`	15.9%
2	HRel: The content of this page provides substantial information on the topic.	`1.1K`	4.3%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`138`	0.5%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier
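The graded scale in the relevance table above can be carried into code as a small lookup. The sketch below takes the short labels from that table. The binarisation threshold is an analysis choice and not part of the dataset, and both helpers are hypothetical.

```python
# Short labels taken from the relevance table above.
RELEVANCE_LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def describe(relevance):
    """Return the short label for a grade, or "unknown" for unlisted values."""
    return RELEVANCE_LABELS.get(relevance, "unknown")


def is_relevant(relevance, threshold=1):
    """Binarise a graded judgment; junk (-2) and non-relevant (0) fall below 1."""
    return relevance >= threshold


for grade in (-2, 0, 1, 2, 3, 4):
    print(grade, describe(grade), is_relevant(grade))
```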

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 25329,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 18665,
          "1": 4018,
          "-2": 1431,
          "2": 1077,
          "3": 138
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2010/diversity"`

The TREC Web Track 2010 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

9.0K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`0`	0.0%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`0`	0.0%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`9.0K`	100.0%
2	HRel: The content of this page provides substantial information on the topic.	`0`	0.0%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`0`	0.0%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2010/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2010TrecWeb}

Bibtex:

@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 9006,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 9006
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2011"`

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

19K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`1.0K`	5.3%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`15K`	78.5%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`2.0K`	10.5%
2	HRel: The content of this page provides substantial information on the topic.	`711`	3.7%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`408`	2.1%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier
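The TSV produced by the export command above can be read back with the standard library. The sketch below assumes one judgment per line, tab-separated, in the column order shown, with no header line. `read_qrels_tsv` is a hypothetical helper and the sample lines are invented.

```python
import csv
import io

FIELDS = ("query_id", "doc_id", "relevance", "iteration")


def read_qrels_tsv(handle):
    """Parse tab-separated qrels lines into dicts, converting relevance to int."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(FIELDS, row))
        record["relevance"] = int(record["relevance"])
        yield record


# Invented lines standing in for the exported output.
sample = io.StringIO(
    "q1\tclueweb09-en0000-00-00000\t2\t0\n"
    "q1\tclueweb09-en0000-00-00001\t-2\t0\n"
)
records = list(read_qrels_tsv(sample))
print(records)
```

For a real export, pass an open file handle (opened with `newline=""`) instead of the `StringIO` object.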

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 19381,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 15205,
          "2": 711,
          "1": 2038,
          "-2": 1019,
          "3": 408
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2011/diversity"`

The TREC Web Track 2011 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

65K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`3.4K`	5.3%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`53K`	81.8%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`5.5K`	8.4%
2	HRel: The content of this page provides substantial information on the topic.	`1.8K`	2.8%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`1.1K`	1.7%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2011/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2011TrecWeb}

Bibtex:

@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 64868,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 53055,
          "2": 1828,
          "1": 5469,
          "-2": 3435,
          "3": 1081
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2012"`

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

16K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`858`	5.3%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`12K`	72.7%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`2.2K`	13.8%
2	HRel: The content of this page provides substantial information on the topic.	`405`	2.5%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`52`	0.3%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`858`	5.3%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 16055,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 858,
          "0": 11674,
          "1": 2208,
          "4": 858,
          "2": 405,
          "3": 52
        }
      }
    }
  }
}

`"clueweb09/en/trec-web-2012/diversity"`

The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

50 queries

Language: en

Query type:

TrecWebTrackQuery: (namedtuple)

query_id: str
query: str
description: str
type: str
subtopics: Tuple[TrecSubtopic, ...]

TrecSubtopic: (namedtuple)

number: str
text: str
type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012/diversity queries



[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

504M docs

Inherits docs from clueweb09/en

Language: en

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012/diversity docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

62K qrels

Query relevance judgment type:

TrecSubQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
subtopic_id: str

Relevance levels

Rel.	Definition	Count	%
-2	Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk	`3.4K`	5.4%
0	Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.	`50K`	79.6%
1	Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.	`5.6K`	8.9%
2	HRel: The content of this page provides substantial information on the topic.	`1.2K`	1.9%
3	Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.	`130`	0.2%
4	Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.	`2.5K`	4.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/en/trec-web-2012/diversity qrels --format tsv



[query_id]    [doc_id]    [relevance]    [subtopic_id]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Clarke2012TrecWeb}

Bibtex:

@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }

{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62394,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 3373,
          "0": 49653,
          "1": 5578,
          "4": 2486,
          "2": 1174,
          "3": 130
        }
      }
    }
  }
}

`"clueweb09/es"`

Subset of ClueWeb09 with only Spanish-language documents.

docs

79M docs

Language: es

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/es docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 79333950,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-es"
      }
    }
  }
}

`"clueweb09/fr"`

Subset of ClueWeb09 with only French-language documents.

docs

51M docs

Language: fr

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/fr docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 50883172,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-fr"
      }
    }
  }
}

`"clueweb09/it"`

Subset of ClueWeb09 with only Italian-language documents.

docs

27M docs

Language: it

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/it docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 27250729,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-it"
      }
    }
  }
}

`"clueweb09/ja"`

Subset of ClueWeb09 with only Japanese-language documents.

docs

67M docs

Language: ja

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/ja docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 67337717,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ja"
      }
    }
  }
}

`"clueweb09/ko"`

Subset of ClueWeb09 with only Korean-language documents.

docs

18M docs

Language: ko

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/ko docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 18075141,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ko000"
      }
    }
  }
}

`"clueweb09/pt"`

Subset of ClueWeb09 with only Portuguese-language documents.

docs

38M docs

Language: pt

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/pt docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

{
  "docs": {
    "count": 37578858,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-pt"
      }
    }
  }
}

`"clueweb09/trec-mq-2009"`

Queries and relevance judgments from the TREC 2009 Million Query track.

queries

40K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/trec-mq-2009 queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

1.0B docs

Inherits docs from clueweb09

Language: multiple/other/unknown

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/trec-mq-2009 docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

35K qrels

Query relevance judgment type:

TrecPrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
method: int
iprob: float

Relevance levels

Rel.  Definition       Count  %
0     not relevant     26K    74.1%
1     relevant         5.9K   17.0%
2     highly relevant  3.1K   9.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

You can find more details about the Python API here.
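
Evaluation code usually needs judgments grouped by query rather than as a flat stream. The sketch below shows one way to do that; `relevant_by_query` is a hypothetical helper, `TrecPrel` is a local stand-in with the same fields, and the judgments are invented so the code runs without the dataset.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same fields as TrecPrel.
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])

# Hypothetical judgments for illustration only.
qrels = [
    TrecPrel("20001", "clueweb09-en0000-00-00001", 0, 1, 0.25),
    TrecPrel("20001", "clueweb09-en0000-00-00002", 2, 1, 0.50),
    TrecPrel("20002", "clueweb09-en0000-00-00003", 1, 0, 1.00),
]


def relevant_by_query(qrels, min_relevance=1):
    """Map query_id -> {doc_id: relevance}, keeping judgments at or above min_relevance."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


grouped = relevant_by_query(qrels)
```

With the real dataset, `relevant_by_query(dataset.qrels_iter())` would take the place of the invented list.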

CLI

ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...

You can find more details about the CLI here.
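
The exported judgments can be read back with the standard `csv` module. This is a sketch under assumptions: it assumes the export contains one tab-separated record per line, in the field order shown above; `read_qrels_tsv` is a hypothetical helper and the sample lines are invented.

```python
import csv
import io

FIELDS = ["query_id", "doc_id", "relevance", "method", "iprob"]

# Hypothetical export lines for illustration only.
exported = (
    "20001\tclueweb09-en0000-00-00001\t1\t1\t0.25\n"
    "20001\tclueweb09-en0000-00-00002\t0\t0\t1.0\n"
)


def read_qrels_tsv(stream):
    """Yield one dict per TSV line, converting the numeric fields."""
    for row in csv.reader(stream, delimiter="\t"):
        record = dict(zip(FIELDS, row))
        record["relevance"] = int(record["relevance"])
        record["method"] = int(record["method"])
        record["iprob"] = float(record["iprob"])
        yield record


rows = list(read_qrels_tsv(io.StringIO(exported)))
```

For a file on disk, an open file handle can be passed in place of the `io.StringIO` object.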

No example available for PyTerrier

\cite{Carterette2009MQ}

Bibtex:

@inproceedings{Carterette2009MQ,
  title={Million Query Track 2009 Overview},
  author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas},
  booktitle={TREC},
  year={2009}
}

{
  "docs": {
    "count": 1040859705,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-"
      }
    }
  },
  "queries": {
    "count": 40000
  },
  "qrels": {
    "count": 34534,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 25586,
          "1": 5856,
          "2": 3092
        }
      }
    }
  }
}
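
The percentages in the relevance-level table follow directly from the `counts_by_value` entries in the metadata above. A short sketch of that arithmetic, with the counts copied from the block:

```python
# Counts copied from the "qrels" metadata for clueweb09/trec-mq-2009.
counts_by_value = {"0": 25586, "1": 5856, "2": 3092}

total = sum(counts_by_value.values())

# Share of each relevance level, as a percentage rounded to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in counts_by_value.items()}
```

The total matches the `qrels` count of 34534, and the rounded shares sum to 100.1 rather than 100 because each level is rounded independently.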

`"clueweb09/zh"`

Subset of ClueWeb09 with only Chinese-language documents.

docs

177M docs

Language: zh

Document type:

WarcDoc: (namedtuple)

doc_id: str
url: str
date: str
http_headers: bytes
body: bytes
body_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export clueweb09/zh docs



[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier