GitHub: datasets/clueweb09.py

ir_datasets: ClueWeb09

Index
  1. clueweb09
  2. clueweb09/ar
  3. clueweb09/catb
  4. clueweb09/catb/trec-web-2009
  5. clueweb09/catb/trec-web-2009/diversity
  6. clueweb09/catb/trec-web-2010
  7. clueweb09/catb/trec-web-2010/diversity
  8. clueweb09/catb/trec-web-2011
  9. clueweb09/catb/trec-web-2011/diversity
  10. clueweb09/catb/trec-web-2012
  11. clueweb09/catb/trec-web-2012/diversity
  12. clueweb09/de
  13. clueweb09/en
  14. clueweb09/en/trec-web-2009
  15. clueweb09/en/trec-web-2009/diversity
  16. clueweb09/en/trec-web-2010
  17. clueweb09/en/trec-web-2010/diversity
  18. clueweb09/en/trec-web-2011
  19. clueweb09/en/trec-web-2011/diversity
  20. clueweb09/en/trec-web-2012
  21. clueweb09/en/trec-web-2012/diversity
  22. clueweb09/es
  23. clueweb09/fr
  24. clueweb09/it
  25. clueweb09/ja
  26. clueweb09/ko
  27. clueweb09/pt
  28. clueweb09/trec-mq-2009
  29. clueweb09/zh

Data Access Information

To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
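
The linking step can be sketched as follows. This is a minimal illustration, not part of ir_datasets: the helper name link_corpus_dirs and the mount point in the usage comment are assumptions, and the helper simply symlinks every sub-directory it finds rather than checking for specific directory names.

```python
from pathlib import Path


def link_corpus_dirs(source_root, corpus_dir):
    """Symlink every sub-directory of source_root into corpus_dir.

    Returns the sorted names of the directories that were newly linked.
    """
    source_root = Path(source_root)
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    for entry in sorted(source_root.iterdir()):
        if not entry.is_dir():
            continue  # skip loose files; only directories are linked
        target = corpus_dir / entry.name
        if target.exists() or target.is_symlink():
            continue  # leave anything that is already in place untouched
        target.symlink_to(entry.resolve(), target_is_directory=True)
        linked.append(entry.name)
    return linked


# Hypothetical usage; "/mnt/clueweb09" is an assumed mount point:
# link_corpus_dirs("/mnt/clueweb09", Path.home() / ".ir_datasets/clueweb09/corpus")
```

Symlinks avoid duplicating the data; copying the directories instead would work equally well if disk space allows.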


"clueweb09"

ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.

The dataset is obtained for a fee from CMU and is shipped on hard drives. More information is provided under Data Access Information above.

docs | Metadata
1.0B docs

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
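
Because body is typed as bytes, it usually has to be decoded before text processing. The sketch below shows one way to do that. The WarcDoc defined here is only a stand-in with the same field names as the type listed above, the sample record is invented, and the UTF-8 default is an assumption, since real pages may use other encodings.

```python
from collections import namedtuple

# Stand-in with the same fields as the WarcDoc type listed above, so the
# sketch runs without a copy of the corpus.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)


def body_text(doc, encoding="utf-8"):
    """Decode doc.body to str, replacing bytes that cannot be decoded.

    The default encoding is an assumption; pages may declare a different
    charset in their HTTP headers or markup.
    """
    return doc.body.decode(encoding, errors="replace")


sample = WarcDoc(
    doc_id="example-0",  # invented identifier, not a real ClueWeb09 ID
    url="http://example.com/",
    date="",
    http_headers=b"Content-Type: text/html",
    body=b"<html><body>caf\xc3\xa9</body></html>",
    body_content_type="text/html",
)
print(body_text(sample))
```

Using errors="replace" keeps iteration over a very large corpus from stopping on the first malformed page.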


"clueweb09/ar"

Subset of ClueWeb09 with only Arabic-language documents.

docs | Metadata
29M docs

Language: ar

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/catb"

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs | Metadata
50M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

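
With tens of millions of documents, it is often useful to look at only the first few before committing to a full pass. A minimal sketch using the standard library follows; the helper name take is an assumption, and the ir_datasets calls appear only in comments because they require the corpus to be in place.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of an iterable as a list, consuming no more."""
    return list(islice(iterable, n))


# With the corpus in place, this would look like:
#   import ir_datasets
#   dataset = ir_datasets.load("clueweb09/catb")
#   first_docs = take(dataset.docs_iter(), 5)
```

islice stops pulling from the iterator after n items, so the rest of the collection is never read.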


"clueweb09/catb/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

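
Each query carries a tuple of subtopics, so a nested loop is enough to flatten them into rows. The sketch below uses stand-in namedtuples with the field names listed above and an invented query; the helper name flatten_subtopics is an assumption, not part of ir_datasets.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of the query types listed above.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery", ["query_id", "query", "description", "type", "subtopics"]
)


def flatten_subtopics(queries):
    """Yield (query_id, subtopic number, subtopic text) for every subtopic."""
    for query in queries:
        for subtopic in query.subtopics:
            yield query.query_id, subtopic.number, subtopic.text


# Invented record for illustration only; not taken from the benchmark.
example = TrecWebTrackQuery(
    query_id="q1",
    query="example query",
    description="An invented description.",
    type="t",
    subtopics=(
        TrecSubtopic("1", "first invented subtopic", "t"),
        TrecSubtopic("2", "second invented subtopic", "t"),
    ),
)
for row in flatten_subtopics([example]):
    print(row)
```

With the data available, passing dataset.queries_iter() in place of the example list should work the same way, since the function only relies on the listed field names.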


"clueweb09/catb/trec-web-2009/diversity"

The TREC Web Track 2009 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2010/diversity"

The TREC Web Track 2010 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2011/diversity"

The TREC Web Track 2011 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/catb/trec-web-2012/diversity"

The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/de"

Subset of ClueWeb09 with only German-language documents.

docs | Metadata
50M docs

Language: de

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/en"

Subset of ClueWeb09 with only English-language documents.

docs | Metadata
504M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/en/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2009/diversity"

The TREC Web Track 2009 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2010/diversity"

The TREC Web Track 2010 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2011/diversity"

The TREC Web Track 2011 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>



"clueweb09/en/trec-web-2012/diversity"

The TREC Web Track 2012 diversity ranking benchmark. Contains 50 queries broken into subtopics.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

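
The sixteen Web Track entries on this page follow one naming pattern: a collection (catb or en), a year from 2009 to 2012, and an optional diversity suffix. A small helper can build the identifier instead of hard-coding each string. The function name trec_web_id is an assumption made for this sketch and is not part of ir_datasets.

```python
def trec_web_id(collection, year, diversity=False):
    """Build a TREC Web Track dataset ID as listed in the index of this page."""
    if collection not in ("catb", "en"):
        raise ValueError(f"unknown collection: {collection!r}")
    if year not in (2009, 2010, 2011, 2012):
        raise ValueError(f"no Web Track entry for year: {year!r}")
    dataset_id = f"clueweb09/{collection}/trec-web-{year}"
    return dataset_id + "/diversity" if diversity else dataset_id


# Print all sixteen identifiers in index order.
for collection in ("catb", "en"):
    for year in (2009, 2010, 2011, 2012):
        print(trec_web_id(collection, year))
        print(trec_web_id(collection, year, diversity=True))
```

The resulting string can be passed to ir_datasets.load in the same way as the literal identifiers in the examples above.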


"clueweb09/es"

Subset of ClueWeb09 with only Spanish-language documents.

docs | Metadata
79M docs

Language: es

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/fr"

Subset of ClueWeb09 with only French-language documents.

docs | Metadata
51M docs

Language: fr

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/it"

Subset of ClueWeb09 with only Italian-language documents.

docs | Metadata
27M docs

Language: it

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/ja"

Subset of ClueWeb09 with only Japanese-language documents.

docs | Metadata
67M docs

Language: ja

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/ko"

Subset of ClueWeb09 with only Korean-language documents.

docs | Metadata
18M docs

Language: ko

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/pt"

Subset of ClueWeb09 with only Portuguese-language documents.

docs | Metadata
38M docs

Language: pt

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>



"clueweb09/trec-mq-2009"

TREC 2009 Million Query track.

queries | docs | qrels | Citation | Metadata
40K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

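
With roughly 40K queries, a lookup table from query_id to text is often more convenient than repeated iteration. The sketch below uses a stand-in GenericQuery with the two fields listed above and invented records; the helper name queries_by_id is an assumption.

```python
from collections import namedtuple

# Stand-in mirroring the GenericQuery fields listed above.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])


def queries_by_id(queries):
    """Map query_id -> text; a later duplicate ID overwrites an earlier one."""
    return {q.query_id: q.text for q in queries}


# Invented records for illustration only; not taken from the track.
sample_queries = [
    GenericQuery("a1", "first invented query"),
    GenericQuery("a2", "second invented query"),
]
lookup = queries_by_id(sample_queries)
print(lookup["a2"])
```

With the data available, the same function can be applied to dataset.queries_iter().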


"clueweb09/zh"

Subset of ClueWeb09 with only Chinese-language documents.

docs | Metadata
177M docs

Language: zh

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

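
As a consistency check, the rounded per-language document counts listed on this page can be summed and compared with the "over 1B web pages, in 10 languages" figure given for the full collection. The counts below are the rounded values from the language subsets above, in millions, so the total is approximate.

```python
# Rounded document counts (in millions) as listed for each language subset.
DOCS_MILLIONS = {
    "ar": 29,
    "de": 50,
    "en": 504,
    "es": 79,
    "fr": 51,
    "it": 27,
    "ja": 67,
    "ko": 18,
    "pt": 38,
    "zh": 177,
}

total_millions = sum(DOCS_MILLIONS.values())
print(f"{len(DOCS_MILLIONS)} languages, about {total_millions}M docs "
      f"({total_millions / 1000:.2f}B)")
# prints: 10 languages, about 1040M docs (1.04B)
```

The total of about 1.04B agrees with the 1.0B docs reported for clueweb09.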