Github: datasets/clueweb09.py

ir_datasets: ClueWeb09

Index
  1. clueweb09
  2. clueweb09/ar
  3. clueweb09/catb
  4. clueweb09/catb/trec-web-2009
  5. clueweb09/catb/trec-web-2010
  6. clueweb09/catb/trec-web-2011
  7. clueweb09/catb/trec-web-2012
  8. clueweb09/de
  9. clueweb09/en
  10. clueweb09/en/trec-web-2009
  11. clueweb09/en/trec-web-2010
  12. clueweb09/en/trec-web-2011
  13. clueweb09/en/trec-web-2012
  14. clueweb09/es
  15. clueweb09/fr
  16. clueweb09/it
  17. clueweb09/ja
  18. clueweb09/ko
  19. clueweb09/pt
  20. clueweb09/trec-mq-2009
  21. clueweb09/zh

"clueweb09"

ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.

The dataset is obtained for a fee from CMU and is shipped on hard drives. More information is available on the ClueWeb09 project page.

docs

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
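
The `body` field holds raw bytes, so most uses need a decoding step first. The following is a minimal sketch and not part of ir_datasets: the helper name `decode_body`, the UTF-8 fallback, and the assumption that a charset (when present) appears in a `Content-Type` line of `http_headers` are all illustrative choices. A stand-in namedtuple with the same fields as `WarcDoc` is used so the snippet runs without the corpus.

```python
import codecs
import re
from collections import namedtuple

# Stand-in with the same fields as WarcDoc, so the sketch runs without the corpus.
WarcDoc = namedtuple(
    "WarcDoc", ["doc_id", "url", "date", "http_headers", "body", "body_content_type"]
)

# Matches e.g. b"Content-Type: text/html; charset=ISO-8859-1"
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_body(doc, fallback="utf-8"):
    """Decode doc.body to str, using a charset found in the HTTP headers if any."""
    encoding = fallback
    match = _CHARSET_RE.search(doc.http_headers or b"")
    if match:
        candidate = match.group(1).decode("ascii", errors="ignore")
        try:
            codecs.lookup(candidate)  # raises LookupError for unknown codec names
            encoding = candidate
        except LookupError:
            pass
    # errors="replace" keeps going on malformed bytes instead of raising
    return doc.body.decode(encoding, errors="replace")


# Invented document for illustration only.
example = WarcDoc(
    doc_id="hypothetical-doc-id",
    url="http://example.com/",
    date="",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    body="<html>caf\u00e9</html>".encode("latin-1"),
    body_content_type="text/html",
)
print(decode_body(example))
```

With the real collection, the same helper could be applied to each `doc` yielded by `dataset.docs_iter()`.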

"clueweb09/ar"

Subset of ClueWeb09 with only Arabic-language documents.

docs

Language: ar

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/ar')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/catb"

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/catb/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
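
Because `subtopics` is a tuple of nested `TrecSubtopic` records, it is often convenient to flatten it into one row per subtopic. The sketch below assumes only the field names listed above; the helper name `flatten_subtopics` and the sample query are invented for illustration, and stand-in namedtuples are used so it runs without the dataset.

```python
from collections import namedtuple

# Stand-ins mirroring the field lists above, so the sketch runs without the dataset.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery", ["query_id", "query", "description", "type", "subtopics"]
)


def flatten_subtopics(queries):
    """Yield one (query_id, number, type, text) row per subtopic of each query."""
    for query in queries:
        for subtopic in query.subtopics:
            yield (query.query_id, subtopic.number, subtopic.type, subtopic.text)


# Made-up query for illustration; real values come from dataset.queries_iter().
sample = [
    TrecWebTrackQuery(
        query_id="q1",
        query="example query",
        description="An invented description.",
        type="faceted",
        subtopics=(
            TrecSubtopic(number="1", text="first invented subtopic", type="inf"),
            TrecSubtopic(number="2", text="second invented subtopic", type="nav"),
        ),
    )
]
rows = list(flatten_subtopics(sample))
for row in rows:
    print(row)
```

Passing `dataset.queries_iter()` in place of `sample` would produce the same rows for the real topics.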

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
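
The `method` and `iprob` fields suggest that these judgments come from a sampled pool. As a rough sketch, and assuming `iprob` is the probability with which the judged document was included in the sample (this page does not define the field), a Horvitz-Thompson style estimate of the number of relevant documents per query weights each relevant judgment by `1 / iprob`. The function name and the `min_rel` threshold are illustrative, and this is not presented as the track's official evaluation procedure.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as TrecPrel, so the sketch runs without the dataset.
TrecPrel = namedtuple("TrecPrel", ["query_id", "doc_id", "relevance", "method", "iprob"])


def estimate_relevant_counts(prels, min_rel=1):
    """Estimate relevant documents per query by inverse-probability weighting.

    Assumes prel.iprob is an inclusion probability; entries with iprob <= 0
    are skipped because they cannot be weighted.
    """
    totals = defaultdict(float)
    for prel in prels:
        if prel.relevance >= min_rel and prel.iprob > 0:
            totals[prel.query_id] += 1.0 / prel.iprob
    return dict(totals)


# Invented judgments for illustration; real ones come from dataset.qrels_iter().
sample = [
    TrecPrel("q1", "d1", 1, 1, 0.5),
    TrecPrel("q1", "d2", 2, 1, 1.0),
    TrecPrel("q1", "d3", 0, 1, 0.25),
    TrecPrel("q2", "d4", 1, 1, 0.25),
]
estimates = estimate_relevant_counts(sample)
print(estimates)
```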

Citation

bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}

"clueweb09/catb/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2010')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
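
The graded scale above includes a negative level (-2, Junk), which gain-based metrics usually cannot consume directly. The sketch below shows one possible way to handle it: clamping negative grades to 0 so that junk pages count as non-relevant. That clamping is an assumption made for illustration, not something this page prescribes; `LEVEL_LABELS`, `to_gain`, and `build_gain_table` are hypothetical names.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecQrel, so the sketch runs without the dataset.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Short labels copied from the relevance-level table above.
LEVEL_LABELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def to_gain(relevance):
    """Clamp negative grades (Junk) to 0; an illustrative choice, not a standard."""
    return max(relevance, 0)


def build_gain_table(qrels):
    """Return {query_id: {doc_id: gain}} from an iterable of qrels."""
    table = {}
    for qrel in qrels:
        table.setdefault(qrel.query_id, {})[qrel.doc_id] = to_gain(qrel.relevance)
    return table


# Invented judgments for illustration; real ones come from dataset.qrels_iter().
sample = [
    TrecQrel("q1", "d1", -2, "0"),
    TrecQrel("q1", "d2", 3, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]
gains = build_gain_table(sample)
print(gains)
```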

Citation

bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}

"clueweb09/catb/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2011')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}

"clueweb09/catb/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/catb/trec-web-2012')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex:
@inproceedings{Clarke2012TrecWeb,
  title={Overview of the TREC 2012 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
  booktitle={TREC},
  year={2012}
}

"clueweb09/de"

Subset of ClueWeb09 with only German-language documents.

docs

Language: de

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/de')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/en"

Subset of ClueWeb09 with only English-language documents.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/en/trec-web-2009"

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>

Citation

bibtex:
@inproceedings{Clarke2009TrecWeb,
  title={Overview of the TREC 2009 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
  booktitle={TREC},
  year={2009}
}

"clueweb09/en/trec-web-2010"

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2010')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex:
@inproceedings{Clarke2010TrecWeb,
  title={Overview of the TREC 2010 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack},
  booktitle={TREC},
  year={2010}
}

"clueweb09/en/trec-web-2011"

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2011')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex:
@inproceedings{Clarke2011TrecWeb,
  title={Overview of the TREC 2011 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees},
  booktitle={TREC},
  year={2011}
}

"clueweb09/en/trec-web-2012"

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
0     Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
1     Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
2     HRel: The content of this page provides substantial information on the topic.
3     Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
4     Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/en/trec-web-2012')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation

bibtex:
@inproceedings{Clarke2012TrecWeb,
  title={Overview of the TREC 2012 Web Track},
  author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
  booktitle={TREC},
  year={2012}
}

"clueweb09/es"

Subset of ClueWeb09 with only Spanish-language documents.

docs

Language: es

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/es')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/fr"

Subset of ClueWeb09 with only French-language documents.

docs

Language: fr

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/fr')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/it"

Subset of ClueWeb09 with only Italian-language documents.

docs

Language: it

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/it')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/ja"

Subset of ClueWeb09 with only Japanese-language documents.

docs

Language: ja

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/ja')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/ko"

Subset of ClueWeb09 with only Korean-language documents.

docs

Language: ko

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/ko')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/pt"

Subset of ClueWeb09 with only Portuguese-language documents.

docs

Language: pt

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/pt')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

"clueweb09/trec-mq-2009"

TREC 2009 Million Query track.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: multiple/other/unknown

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

qrels

Query relevance judgment type:
TrecPrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. method: int
  5. iprob: float

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/trec-mq-2009')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
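
Iterating the whole collection just to read a handful of judged documents is expensive, so a lookup by `doc_id` is usually preferable. The sketch below combines `qrels_iter()` with `docs_store().get(doc_id)`, assuming that `docs_store()` is available for this collection in your ir_datasets version. The helper name `judged_docs`, the `min_rel` threshold, and the in-memory `FakeDataset` and `FakeDocsStore` stand-ins are hypothetical and exist only so the snippet runs without the corpus.

```python
from collections import namedtuple

# In-memory stand-ins so the sketch runs without the corpus; with the real data,
# `dataset` would be ir_datasets.load('clueweb09/trec-mq-2009').
Prel = namedtuple("Prel", ["query_id", "doc_id", "relevance", "method", "iprob"])


class FakeDocsStore:
    def __init__(self, docs):
        self._docs = docs

    def get(self, doc_id):
        return self._docs[doc_id]


class FakeDataset:
    def __init__(self, prels, docs):
        self._prels = prels
        self._docs = docs

    def qrels_iter(self):
        return iter(self._prels)

    def docs_store(self):
        return FakeDocsStore(self._docs)


def judged_docs(dataset, query_id, min_rel=1):
    """Return {doc_id: doc} for documents judged at or above min_rel for one query."""
    doc_ids = [
        qrel.doc_id
        for qrel in dataset.qrels_iter()
        if qrel.query_id == query_id and qrel.relevance >= min_rel
    ]
    store = dataset.docs_store()  # random-access lookup by doc_id
    return {doc_id: store.get(doc_id) for doc_id in doc_ids}


# Invented judgments and documents for illustration only.
fake = FakeDataset(
    prels=[
        Prel("q1", "d1", 1, 1, 0.5),
        Prel("q1", "d2", 0, 1, 0.5),
        Prel("q2", "d3", 2, 1, 1.0),
    ],
    docs={"d1": "doc one", "d2": "doc two", "d3": "doc three"},
)
found = judged_docs(fake, "q1")
print(found)
```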

Citation

bibtex:
@inproceedings{Carterette2009MQ,
  title={Million Query Track 2009 Overview},
  author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas},
  booktitle={TREC},
  year={2009}
}

"clueweb09/zh"

Subset of ClueWeb09 with only Chinese-language documents.

docs

Language: zh

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('clueweb09/zh')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>