GitHub: datasets/clueweb12.py

ir_datasets: ClueWeb12

Index
  1. clueweb12
  2. clueweb12/b13
  3. clueweb12/b13/clef-ehealth
  4. clueweb12/b13/clef-ehealth/cs
  5. clueweb12/b13/clef-ehealth/de
  6. clueweb12/b13/clef-ehealth/fr
  7. clueweb12/b13/clef-ehealth/hu
  8. clueweb12/b13/clef-ehealth/pl
  9. clueweb12/b13/clef-ehealth/sv
  10. clueweb12/b13/ntcir-www-1
  11. clueweb12/b13/ntcir-www-2
  12. clueweb12/b13/ntcir-www-3
  13. clueweb12/b13/trec-misinfo-2019
  14. clueweb12/touche-2020-task-2
  15. clueweb12/touche-2021-task-2
  16. clueweb12/trec-web-2013
  17. clueweb12/trec-web-2014

Data Access Information

To use this dataset, you need a copy of ClueWeb 2012, provided by CMU.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to obtain a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets expects the corpus directories to be copied or linked under ~/.ir_datasets/clueweb12/corpus.
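
If you prefer linking over copying, a small script can do it. Below is a minimal sketch, assuming your ClueWeb12 copy is mounted at /mnt/clueweb12 (a hypothetical path; adjust it to wherever your drives are mounted):

import os
from pathlib import Path

src_root = Path("/mnt/clueweb12")  # hypothetical mount point of the shipped drives
dest_root = Path.home() / ".ir_datasets" / "clueweb12" / "corpus"
dest_root.mkdir(parents=True, exist_ok=True)

for entry in src_root.iterdir():
    link = dest_root / entry.name
    if not link.exists():
        os.symlink(entry, link)  # symlink instead of copy to save disk space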


"clueweb12"

ClueWeb 2012 web document collection. Contains 733M web pages.

The dataset is obtained for a fee from CMU and is shipped on hard drives. More information is provided here.

Provides: docs
733M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
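
Because body holds raw bytes and body_content_type varies, you generally filter and decode documents yourself. The following is a minimal sketch; assuming UTF-8 with replacement is a simplification, since real pages declare their charsets in HTTP headers or HTML meta tags:

import ir_datasets

dataset = ir_datasets.load("clueweb12")
for doc in dataset.docs_iter():
    if doc.body_content_type == "text/html":
        # doc.body is raw bytes; UTF-8 with replacement is a simplifying assumption
        html = doc.body.decode("utf-8", errors="replace")
        print(doc.doc_id, doc.url, len(html))
        break  # first HTML page only, for illustration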


"clueweb12/b13"

Official subset of the ClueWeb12 dataset, containing 52M web pages.

Provides: docs
52M docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
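
Sequential iteration is not the only access pattern; ir_datasets also supports random access by document ID through docs_store(). A short sketch (the document ID below is made up to show the ClueWeb12 ID format; substitute a real one from your copy):

import ir_datasets

dataset = ir_datasets.load("clueweb12/b13")
docstore = dataset.docs_store()
# hypothetical ID in the ClueWeb12 format; substitute a real one
doc = docstore.get("clueweb12-0000tw-00-00000")
print(doc.url, doc.body_content_type)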


"clueweb12/b13/clef-ehealth"

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments that include trustworthiness and understandability scores in addition to the standard relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished by the iteration field (2016 is iteration 0, 2017 is iteration 1).

Provides: queries, docs, qrels
300 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
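
To work with one assessment year at a time, filter the qrels on their iteration field, as described above. A minimal sketch (treating the iteration values as the strings "0" and "1" is an assumption based on the description above):

import ir_datasets

dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
qrels_2016 = [qrel for qrel in dataset.qrels_iter() if qrel.iteration == "0"]
qrels_2017 = [qrel for qrel in dataset.qrels_iter() if qrel.iteration == "1"]
print(len(qrels_2016), "judgments from 2016,", len(qrels_2017), "from 2017")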


"clueweb12/b13/clef-ehealth/cs"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: cs

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/cs")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/clef-ehealth/de"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: de

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/de")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/clef-ehealth/fr"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: fr

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/fr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/clef-ehealth/hu"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: hu

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/hu")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/clef-ehealth/pl"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: pl

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/pl")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/clef-ehealth/sv"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

Provides: queries, docs, qrels
300 queries

Language: sv

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/sv")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"clueweb12/b13/ntcir-www-1"

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (on average 255 per query), aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

Provides: queries, docs, qrels
100 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
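
The judgment depth quoted above is easy to verify from the qrels themselves; for example:

import ir_datasets
from collections import Counter

dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
# count the number of judgments per query
counts = Counter(qrel.query_id for qrel in dataset.qrels_iter())
avg = sum(counts.values()) / len(counts)
print(f"{len(counts)} judged queries, {avg:.0f} judgments per query on average")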


"clueweb12/b13/ntcir-www-2"

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (on average 345 per query), aggregated from two assessors.

Provides: queries, docs, qrels
80 queries

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>

You can find more details about the Python API here.
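
Unlike the GenericQuery datasets above, NtcirQuery carries both a short title and a longer description, so you choose which field to feed to your system. One common (but by no means required) choice is to use the title as the query string:

import ir_datasets

dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for query in dataset.queries_iter():
    query_text = query.title  # keep query.description for analysis or expansion
    print(query.query_id, query_text)
    break  # first query only, for illustration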


"clueweb12/b13/ntcir-www-3"

The NTCIR-15 We Want Web (WWW) 3 ad-hoc ranking benchmark. Contains 160 queries with deep relevance judgments (to be released). 80 of the queries are from clueweb12/b13/ntcir-www-2.

Provides: queries, docs
160 queries

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>

You can find more details about the Python API here.


"clueweb12/b13/trec-misinfo-2019"

The TREC Medical Misinformation 2019 dataset.

Provides: queries, docs, qrels
51 queries

Language: en

Query type:
MisinfoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. cochranedoi: str
  4. description: str
  5. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/trec-misinfo-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, cochranedoi, description, narrative>

You can find more details about the Python API here.


"clueweb12/touche-2020-task-2"

Decision-making processes, be it at the societal or at the personal level, eventually come to a point where one side challenges the other with a why-question, that is, a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on argument retrieval at CLEF 2020, featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance.

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
ToucheQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2020-task-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
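
Since documents here are judged on graded topical relevance, a quick look at the label distribution in the qrels can be useful; a small sketch:

import ir_datasets
from collections import Counter

dataset = ir_datasets.load("clueweb12/touche-2020-task-2")
# tally how many judgments fall into each relevance grade
labels = Counter(qrel.relevance for qrel in dataset.qrels_iter())
for label, count in sorted(labels.items()):
    print(f"relevance {label}: {count} judgments")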


"clueweb12/touche-2021-task-2"

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming feasible. Touché 2021 is the second lab on argument retrieval at CLEF 2021, featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance and their rhetorical quality, i.e., the "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, and (3) whether it avoids profanity, typos, and other detrimental style choices.

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
ToucheQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/touche-2021-task-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.


"clueweb12/trec-web-2013"

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.
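
Each query carries a tuple of subtopics, themselves namedtuples, which you can walk directly; for example:

import ir_datasets

dataset = ir_datasets.load("clueweb12/trec-web-2013")
for query in dataset.queries_iter():
    print(query.query_id, query.query, f"({query.type})")
    for subtopic in query.subtopics:
        # each subtopic is a namedtuple<number, text, type>
        print("  subtopic", subtopic.number, f"[{subtopic.type}]", subtopic.text)
    break  # first query only, for illustration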


"clueweb12/trec-web-2014"

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>

You can find more details about the Python API here.