Github: datasets/clueweb12.py

ir_datasets: ClueWeb12

Index
  1. clueweb12
  2. clueweb12/b13
  3. clueweb12/b13/clef-ehealth
  4. clueweb12/b13/clef-ehealth/cs
  5. clueweb12/b13/clef-ehealth/de
  6. clueweb12/b13/clef-ehealth/fr
  7. clueweb12/b13/clef-ehealth/hu
  8. clueweb12/b13/clef-ehealth/pl
  9. clueweb12/b13/clef-ehealth/sv
  10. clueweb12/b13/ntcir-www-1
  11. clueweb12/b13/ntcir-www-2
  12. clueweb12/b13/ntcir-www-3
  13. clueweb12/b13/trec-misinfo-2019
  14. clueweb12/touche-2020-task-2
  15. clueweb12/touche-2021-task-2
  16. clueweb12/trec-web-2013
  17. clueweb12/trec-web-2014

Data Access Information

To use this dataset, you need a copy of ClueWeb 2012, provided by CMU.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are shipped to you on hard drives.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb12/corpus.
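
As a minimal sketch of the linking step, assuming the corpus directories sit at the top level of a mounted drive, each one could be symlinked into the expected location as follows. The helper name `link_corpus_dirs` and any source path you pass to it are hypothetical; only the target location comes from the text above.

```python
from pathlib import Path


def link_corpus_dirs(source_root, target_root):
    """Symlink every top-level directory of source_root into target_root.

    Illustrative helper, not part of ir_datasets. Both paths are supplied
    by the caller. Returns the names of the directories found, in sorted order.
    """
    source_root = Path(source_root)
    target_root = Path(target_root)
    target_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for entry in sorted(source_root.iterdir()):
        if not entry.is_dir():
            continue  # skip loose files such as checksums or READMEs
        link = target_root / entry.name
        # Leave anything that is already in place untouched.
        if not (link.exists() or link.is_symlink()):
            link.symlink_to(entry.resolve(), target_is_directory=True)
        linked.append(link.name)
    return linked
```

A call might look like `link_corpus_dirs("/mnt/clueweb12", Path.home() / ".ir_datasets" / "clueweb12" / "corpus")`, where `/mnt/clueweb12` is a placeholder for wherever the drives are mounted.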


"clueweb12"

ClueWeb 2012 web document collection. Contains 733M web pages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

docs
733M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
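
Because `body` and `http_headers` are raw bytes, text processing usually needs a decoding step first. The sketch below is one possible approach, assuming the character set (when declared at all) appears as `charset=...` inside `http_headers`. The helper `decode_body` is illustrative and not part of ir_datasets.

```python
import re

# Matches a declaration such as "charset=ISO-8859-1" in the raw header bytes.
_CHARSET_RE = re.compile(rb"charset=[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def decode_body(body, http_headers, default="utf-8"):
    """Decode raw body bytes to str, replacing undecodable bytes.

    Uses the charset declared in the headers when present and known to
    Python, otherwise falls back to `default`.
    """
    match = _CHARSET_RE.search(http_headers)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python recognizes.
        return body.decode(default, errors="replace")
```

Inside the loop above this could be applied as `decode_body(doc.body, doc.http_headers)`. Pages that declare their encoding only in an HTML `<meta>` tag are not handled by this sketch.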

CLI
ir_datasets export clueweb12 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

"clueweb12/b13"

Official subset of the ClueWeb12 dataset with 52M web pages.

docs
52M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

"clueweb12/b13/clef-ehealth"

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using the iteration field (2016 is iteration 0, 2017 is iteration 1).
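
To work with one assessment year at a time, the judgments can be grouped by their iteration value. The following is a minimal sketch, assuming the iteration values are the strings "0" and "1". `StandInQrel` and `group_by_year` are names made up for illustration, with `StandInQrel` mirroring the fields of the judgment records in this dataset.

```python
from collections import namedtuple

# Stand-in record used only for this illustration.
StandInQrel = namedtuple(
    "StandInQrel",
    ["query_id", "doc_id", "relevance", "trustworthiness", "understandability", "iteration"],
)

# Mapping taken from the description above: iteration 0 is 2016, iteration 1 is 2017.
ITERATION_TO_YEAR = {"0": 2016, "1": 2017}


def group_by_year(qrels):
    """Return {2016: [...], 2017: [...]} with each judgment filed under its year."""
    by_year = {year: [] for year in ITERATION_TO_YEAR.values()}
    for qrel in qrels:
        # str() tolerates integer iteration values as well as strings.
        by_year[ITERATION_TO_YEAR[str(qrel.iteration)]].append(qrel)
    return by_year
```

With the corpus and judgments available, the same function could be applied to `ir_datasets.load("clueweb12/b13/clef-ehealth").qrels_iter()`.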

queries
300 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/cs"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: cs

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/cs")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/cs queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/cs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/cs docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/cs")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/cs qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/de"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: de

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/de")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/de queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/de docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/de")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/de qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/fr"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: fr

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/fr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/fr queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/fr docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/fr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/fr qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/hu"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: hu

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/hu")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/hu queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/hu")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/hu docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/hu")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/hu qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/pl"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: pl

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/pl")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/pl queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/pl")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/pl docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/pl")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/pl qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/clef-ehealth/sv"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

queries
300 queries

Language: sv

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/sv")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/sv queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/sv")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/sv docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
269K qrels
Query relevance judgment type:
EhealthQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. trustworthiness: int
  5. understandability: int
  6. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    231K    85.7%
1    Somewhat relevant    23K    8.4%
2    Highly relevant    16K    5.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/sv")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/clef-ehealth/sv qrels --format tsv
[query_id]    [doc_id]    [relevance]    [trustworthiness]    [understandability]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Zuccon2016ClefEhealth,Palotti2017ClefEhealth}

Bibtex:

@inproceedings{Zuccon2016ClefEhealth,
  title={The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval},
  author={Guido Zuccon and Joao Palotti and Lorraine Goeuriot and Liadh Kelly and Mihai Lupu and Pavel Pecina and Henning M{\"u}ller and Julie Budaher and Anthony Deacon},
  booktitle={CLEF},
  year={2016}
}
@inproceedings{Palotti2017ClefEhealth,
  title={CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search},
  author={Joao Palotti and Guido Zuccon and Jimmy and Pavel Pecina and Mihai Lupu and Lorraine Goeuriot and Liadh Kelly and Allan Hanbury},
  booktitle={CLEF},
  year={2017}
}
Metadata

"clueweb12/b13/ntcir-www-1"

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

queries
100 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-1 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-1 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
25K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Two annotators rated as non-relevant    9.8K    38.6%
1    One annotator rated as relevant, one as non-relevant    5.2K    20.4%
2    Two annotators rated as relevant, OR one rated as highly relevant and one as non-relevant    4.7K    18.5%
3    One annotator rated as highly relevant, one as relevant    4.1K    16.0%
4    Two annotators rated as highly relevant    1.7K    6.6%
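
The level definitions above are consistent with summing two per-assessor grades, where non-relevant counts as 0, relevant as 1, and highly relevant as 2. That reading is inferred from the table rather than stated by the dataset description, so the following is only a sketch under that assumption. The names `GRADES` and `aggregate` are illustrative.

```python
# Per-assessor grades (assumed numeric encoding, inferred from the level table).
GRADES = {"non-relevant": 0, "relevant": 1, "highly relevant": 2}


def aggregate(grade_a, grade_b):
    """Combine two assessors' labels into a single 0-4 relevance level by summing."""
    return GRADES[grade_a] + GRADES[grade_b]
```

Under this assumption, level 2 arises in two ways (relevant plus relevant, or highly relevant plus non-relevant), which matches the "OR" in the definition of level 2.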

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-1 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Luo2017Www1}

Bibtex:

@inproceedings{Luo2017Www1,
  title={Overview of the NTCIR-13 We Want Web Task},
  author={Cheng Luo and Tetsuya Sakai and Yiqun Liu and Zhicheng Dou and Chenyan Xiong and Jingfang Xu},
  booktitle={NTCIR},
  year={2017}
}
Metadata

"clueweb12/b13/ntcir-www-2"

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.
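
The judgment depth quoted above can be checked by counting judgments per query. The following sketch works on any iterable of records that have a `query_id` attribute. The function names are made up for illustration and are not part of ir_datasets.

```python
from collections import Counter


def judgments_per_query(qrels):
    """Count how many judgments each query_id has."""
    return Counter(qrel.query_id for qrel in qrels)


def mean_judgments_per_query(qrels):
    """Average number of judgments per judged query (0.0 if there are none)."""
    counts = judgments_per_query(qrels)
    return sum(counts.values()) / len(counts) if counts else 0.0
```

With the judgments available, `mean_judgments_per_query(ir_datasets.load("clueweb12/b13/ntcir-www-2").qrels_iter())` would give the observed average for this benchmark.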

queries
80 queries

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-2 queries
[query_id]    [title]    [description]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-2 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
28K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Two annotators rated as non-relevant    13K    48.2%
1    One annotator rated as relevant, one as non-relevant    6.5K    23.4%
2    Two annotators rated as relevant, OR one rated as highly relevant and one as non-relevant    4.7K    16.9%
3    One annotator rated as highly relevant, one as relevant    2.3K    8.4%
4    Two annotators rated as highly relevant    857    3.1%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Mao2018OWww2}

Bibtex:

@inproceedings{Mao2018OWww2,
  title={Overview of the NTCIR-14 We Want Web Task},
  author={Jiaxin Mao and Tetsuya Sakai and Cheng Luo and Peng Xiao and Yiqun Liu and Zhicheng Dou},
  booktitle={NTCIR},
  year={2018}
}
Metadata

"clueweb12/b13/ntcir-www-3"

The NTCIR-15 We Want Web (WWW) 3 ad-hoc ranking benchmark. Contains 160 queries with deep relevance judgments (to be released). 80 of the queries are from clueweb12/b13/ntcir-www-2.

queries
160 queries

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-3 queries
[query_id]    [title]    [description]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/ntcir-www-3 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

"clueweb12/b13/trec-misinfo-2019"

The TREC Medical Misinformation 2019 dataset.

queries
51 queries

Language: en

Query type:
MisinfoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. cochranedoi: str
  4. description: str
  5. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/trec-misinfo-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, cochranedoi, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/trec-misinfo-2019 queries
[query_id]    [title]    [cochranedoi]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
52M docs

Inherits docs from clueweb12/b13

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/trec-misinfo-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/b13/trec-misinfo-2019 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
23K qrels
Query relevance judgment type:
MisinfoQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. effectiveness: int
  5. redibility: int

Relevance levels

Rel.    Definition    Count    %
0    Not relevant    19K    81.8%
1    Relevant    3.1K    13.7%
2    Highly relevant    1.0K    4.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/trec-misinfo-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, effectiveness, redibility>

You can find more details about the Python API here.
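
The count and percentage columns of the relevance table can be recomputed from the qrels themselves. This is an illustrative sketch: the MisinfoQrel stand-in copies the documented field names, and the sample judgments are invented. With the real data, the argument would be dataset.qrels_iter().

```python
from collections import Counter, namedtuple

# Stand-in using the field names documented above (sample values are made up).
MisinfoQrel = namedtuple(
    "MisinfoQrel",
    ["query_id", "doc_id", "relevance", "effectiveness", "redibility"],
)

def relevance_distribution(qrels):
    """Map each relevance level to a (count, percentage) pair."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {level: (n, 100.0 * n / total) for level, n in sorted(counts.items())}

sample_qrels = [
    MisinfoQrel("1", "doc-a", 0, 0, 0),
    MisinfoQrel("1", "doc-b", 0, 0, 0),
    MisinfoQrel("1", "doc-c", 1, 0, 0),
    MisinfoQrel("2", "doc-d", 2, 0, 0),
]
for level, (count, pct) in relevance_distribution(sample_qrels).items():
    print(level, count, f"{pct:.1f}%")
```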

CLI
ir_datasets export clueweb12/b13/trec-misinfo-2019 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [effectiveness]    [redibility]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Abualsaud2019TrecDecision}

Bibtex:

@inproceedings{Abualsaud2019TrecDecision, title={Overview of the TREC 2019 Decision Track}, author={Mustafa Abualsaud and Christina Lioma and Maria Maistro and Mark D. Smucker and Guido Zuccon}, booktitle={TREC}, year={2019} }

"clueweb12/touche-2020-task-2"

Decision-making processes, whether at the societal or the personal level, eventually reach a point where one side challenges the other with a why-question, which is a prompt to justify one's stance. Technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on argument retrieval at CLEF 2020 and features two tasks.

Given a comparative question, retrieve and rank documents from ClueWeb12 that help to answer it.

Documents are judged based on their general topical relevance.

queries
50 queries

Language: en

Query type:
ToucheQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2020-task-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/touche-2020-task-2 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
733M docs

Inherits docs from clueweb12

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2020-task-2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/touche-2020-task-2 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
1.8K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    not relevant    961    53.9%
1    relevant    448    25.1%
2    highly relevant    374    21.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2020-task-2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
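
Many evaluation tools expect qrels as a nested mapping rather than a flat stream. One possible conversion is sketched below; the {query_id: {doc_id: relevance}} layout is a common convention and an assumption here, not something ir_datasets prescribes. The TrecQrel stand-in and the sample rows are hypothetical.

```python
from collections import namedtuple

# Stand-in with the documented TrecQrel fields (sample rows are made up).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

def qrels_to_dict(qrels):
    """Group flat qrels into {query_id: {doc_id: relevance}}."""
    nested = {}
    for q in qrels:
        nested.setdefault(q.query_id, {})[q.doc_id] = q.relevance
    return nested

sample_qrels = [
    TrecQrel("1", "doc-a", 2, "0"),
    TrecQrel("1", "doc-b", 0, "0"),
    TrecQrel("2", "doc-a", 1, "0"),
]
print(qrels_to_dict(sample_qrels))
```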

CLI
ir_datasets export clueweb12/touche-2020-task-2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Bondarenko2020Touche,Braunstain2016Support,Rafalak2014Credibility}

Bibtex:

@inproceedings{Bondarenko2020Touche, address = {Berlin Heidelberg New York}, author = {Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen}, booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 11th International Conference of the CLEF Association (CLEF 2020)}, doi = {10.1007/978-3-030-58219-7\_26}, editor = {Avi Arampatzis and Evangelos Kanoulas and Theodora Tsikrika and Stefanos Vrochidis and Hideo Joho and Christina Lioma and Carsten Eickhoff and Aur{\'e}lie N{\'e}v{\'e}ol and Linda Cappellato and Nicola Ferro}, month = sep, pages = {384-395}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Thessaloniki, Greece}, title = {{Overview of Touch{\'e} 2020: Argument Retrieval}}, url = {https://link.springer.com/chapter/10.1007/978-3-030-58219-7_26}, volume = 12260, year = 2020, }
@inproceedings{Braunstain2016Support, author = {Liora Braunstain and Oren Kurland and David Carmel and Idan Szpektor and Anna Shtok}, editor = {Nicola Ferro and Fabio Crestani and Marie{-}Francine Moens and Josiane Mothe and Fabrizio Silvestri and Giorgio Maria Di Nunzio and Claudia Hauff and Gianmaria Silvello}, title = {Supporting Human Answers for Advice-Seeking Questions in {CQA} Sites}, booktitle = {Advances in Information Retrieval - 38th European Conference on {IR} Research, {ECIR} 2016, Padua, Italy, March 20-23, 2016. Proceedings}, series = {Lecture Notes in Computer Science}, volume = {9626}, pages = {129--141}, publisher = {Springer}, year = {2016}, url = {https://doi.org/10.1007/978-3-319-30671-1\_10}, doi = {10.1007/978-3-319-30671-1\_10}, timestamp = {Sun, 25 Oct 2020 22:33:09 +0100}, biburl = {https://dblp.org/rec/conf/ecir/BraunstainKCSS16.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
@inproceedings{Rafalak2014Credibility, author = {Rafalak, Maria and Abramczuk, Katarzyna and Wierzbicki, Adam}, title = {Incredible: Is (Almost) All Web Content Trustworthy? Analysis of Psychological Factors Related to Website Credibility Evaluation}, year = {2014}, isbn = {9781450327459}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2567948.2578997}, doi = {10.1145/2567948.2578997}, booktitle = {Proceedings of the 23rd International Conference on World Wide Web}, pages = {1117–1122}, numpages = {6}, keywords = {bias, risk-taking, online behavior, trust, credibility}, location = {Seoul, Korea}, series = {WWW '14 Companion} }

"clueweb12/touche-2021-task-2"

Decision-making processes, whether at the societal or the personal level, often reach a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021 and features two tasks.

Given a comparative question, retrieve and rank documents from ClueWeb12 that help to answer it.

Documents are judged on both their general topical relevance and their rhetorical quality, i.e., the "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, or makes use of other detrimental style choices.

queries
50 queries

Language: en

Query type:
ToucheQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2021-task-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/touche-2021-task-2 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
733M docs

Inherits docs from clueweb12

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2021-task-2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/touche-2021-task-2 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
2.1K qrels
Query relevance judgment type:
ToucheQualityQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. quality: int
  5. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    not relevant    1.4K    69.1%
1    relevant    377    18.2%
2    highly relevant    264    12.7%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2021-task-2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, quality, iteration>

You can find more details about the Python API here.
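
Since each judgment carries both a relevance and a quality label, the two can be combined when selecting documents. The sketch below keeps judgments that meet a minimum on both dimensions. The thresholds are illustrative parameters only, because the quality scale is not described on this page; the ToucheQualityQrel stand-in and the sample rows are hypothetical.

```python
from collections import namedtuple

# Stand-in with the documented ToucheQualityQrel fields (sample rows are made up).
ToucheQualityQrel = namedtuple(
    "ToucheQualityQrel",
    ["query_id", "doc_id", "relevance", "quality", "iteration"],
)

def filter_qrels(qrels, min_relevance=1, min_quality=1):
    """Keep judgments meeting both thresholds (threshold values are assumptions)."""
    return [
        q for q in qrels
        if q.relevance >= min_relevance and q.quality >= min_quality
    ]

sample_qrels = [
    ToucheQualityQrel("51", "doc-a", 2, 2, "0"),
    ToucheQualityQrel("51", "doc-b", 2, 0, "0"),
    ToucheQualityQrel("51", "doc-c", 0, 2, "0"),
]
print([q.doc_id for q in filter_qrels(sample_qrels)])
```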

CLI
ir_datasets export clueweb12/touche-2021-task-2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [quality]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Bondarenko2021Touche}

Bibtex:

@inproceedings{Bondarenko2021Touche, address = {Berlin Heidelberg New York}, author = {Alexander Bondarenko and Lukas Gienapp and Maik Fr{\"o}be and Meriem Beloucif and Yamen Ajjour and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen}, booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021)}, doi = {10.1007/978-3-030-85251-1\_28}, editor = {{K. Sel{\c{c}}uk} Candan and Bogdan Ionescu and Lorraine Goeuriot and Henning M{\"u}ller and Alexis Joly and Maria Maistro and Florina Piroi and Guglielmo Faggioli and Nicola Ferro}, month = sep, pages = {450-467}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Bucharest, Romania}, title = {{Overview of Touch{\'e} 2021: Argument Retrieval}}, url = {https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28}, volume = 12880, year = 2021, }

"clueweb12/trec-web-2013"

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.
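
The subtopics field nests a tuple of TrecSubtopic records inside each query. As a small illustration, the sketch below flattens that structure into one row per subtopic. The namedtuple stand-ins copy the documented field names, and the sample query is invented for the example.

```python
from collections import namedtuple

# Stand-ins with the documented field names (sample content is made up).
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)

def flatten_subtopics(queries):
    """Yield one (query_id, number, type, text) row per subtopic."""
    for q in queries:
        for st in q.subtopics:
            yield (q.query_id, st.number, st.type, st.text)

sample = TrecWebTrackQuery(
    query_id="1",
    query="example query",
    description="An example information need.",
    type="faceted",
    subtopics=(
        TrecSubtopic("1", "first facet", "inf"),
        TrecSubtopic("2", "second facet", "nav"),
    ),
)
for row in flatten_subtopics([sample]):
    print(row)
```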

CLI
ir_datasets export clueweb12/trec-web-2013 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
733M docs

Inherits docs from clueweb12

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2013")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/trec-web-2013 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
14K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    234    1.6%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    10K    69.7%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    3.0K    21.0%
2    HRel: The content of this page provides substantial information on the topic.    920    6.4%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    179    1.2%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    7    0.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2013")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
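
The relevance scale above runs from -2 (Junk) to 4 (Nav), so binary metrics need a cut-off. A minimal sketch of one way to binarize the labels follows, assuming that everything at or above 1 (Rel) counts as relevant and that junk is treated like non-relevant; the cut-off is a choice made for this example, not part of the track definition. The TrecQrel stand-in and the sample rows are hypothetical.

```python
from collections import namedtuple

# Stand-in with the documented TrecQrel fields (sample rows are made up).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

def binarize(qrels, min_rel=1):
    """Map (query_id, doc_id) to 1 if relevance >= min_rel, else 0."""
    return {(q.query_id, q.doc_id): int(q.relevance >= min_rel) for q in qrels}

sample_qrels = [
    TrecQrel("201", "doc-a", -2, "0"),  # Junk
    TrecQrel("201", "doc-b", 0, "0"),   # Non
    TrecQrel("201", "doc-c", 1, "0"),   # Rel
    TrecQrel("201", "doc-d", 4, "0"),   # Nav
]
print(binarize(sample_qrels))
```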

CLI
ir_datasets export clueweb12/trec-web-2013 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{CollinsThompson2013TrecWeb}

Bibtex:

@inproceedings{CollinsThompson2013TrecWeb, title={TREC 2013 Web Track Overview}, author={Kevyn Collins-Thompson and Paul Bennett and Fernando Diaz and Charles L. A. Clarke and Ellen M. Voorhees}, booktitle={TREC}, year={2013} }

"clueweb12/trec-web-2014"

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/trec-web-2014 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
733M docs

Inherits docs from clueweb12

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export clueweb12/trec-web-2014 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
14K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
-2    Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk    556    3.9%
0    Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.    8.2K    56.9%
1    Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.    3.8K    26.2%
2    HRel: The content of this page provides substantial information on the topic.    1.6K    11.2%
3    Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.    230    1.6%
4    Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.    33    0.2%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
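
Graded metrics need a gain for each relevance level, and the -2 (Junk) label would produce a fractional gain if used directly as an exponent. The sketch below shows one common convention, clamping negative labels to zero and using an exponential gain with a log2 rank discount. This is an illustration only and is not claimed to be the track's official evaluation procedure.

```python
import math

def dcg(relevances, k=None):
    """DCG over a ranked list of relevance labels, treating negatives as 0."""
    ranked = relevances[:k] if k is not None else relevances
    return sum(
        (2 ** max(rel, 0) - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(ranked, start=1)
    )

# Labels of a hypothetical ranking: Key (3), Junk (-2), Rel (1).
print(dcg([3, -2, 1]))
```

In practice the labels would be looked up from the qrels for each retrieved doc_id, with unjudged documents typically assigned 0.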

CLI
ir_datasets export clueweb12/trec-web-2014 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{CollinsThompson2014TrecWeb}

Bibtex:

@inproceedings{CollinsThompson2014TrecWeb, title={TREC 2014 Web Track Overview}, author={Kevyn Collins-Thompson and Craig Macdonald and Paul Bennett and Fernando Diaz and Ellen M. Voorhees}, booktitle={TREC}, year={2014} }