Github: datasets/clueweb12.py

ir_datasets: ClueWeb12

Index
  1. clueweb12
  2. clueweb12/b13
  3. clueweb12/b13/clef-ehealth
  4. clueweb12/b13/clef-ehealth/cs
  5. clueweb12/b13/clef-ehealth/de
  6. clueweb12/b13/clef-ehealth/fr
  7. clueweb12/b13/clef-ehealth/hu
  8. clueweb12/b13/clef-ehealth/pl
  9. clueweb12/b13/clef-ehealth/sv
  10. clueweb12/b13/ntcir-www-1
  11. clueweb12/b13/ntcir-www-2
  12. clueweb12/b13/ntcir-www-3
  13. clueweb12/b13/trec-misinfo-2019
  14. clueweb12/trec-web-2013
  15. clueweb12/trec-web-2014

Data Access Information

To use this dataset, you need a copy of ClueWeb 2012, provided by CMU.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided on hard drives that are shipped to you.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb12/corpus.
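The copy/link step can be scripted. The following is a minimal sketch, not part of ir_datasets: `link_corpus_dirs` and its `source_root` argument are hypothetical names, and the sketch simply symlinks every top-level directory it finds under `source_root`, which may be more than ir_datasets actually needs.

```python
from pathlib import Path


def link_corpus_dirs(source_root, corpus_dir=None):
    """Symlink each top-level directory under source_root into corpus_dir.

    source_root is a hypothetical path where the ClueWeb12 drives are
    mounted or copied. Entries that already exist in corpus_dir are skipped.
    """
    source_root = Path(source_root)
    if corpus_dir is None:
        # Default location named in the text above.
        corpus_dir = Path.home() / ".ir_datasets" / "clueweb12" / "corpus"
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    for entry in sorted(source_root.iterdir()):
        target = corpus_dir / entry.name
        already_there = target.exists() or target.is_symlink()
        if entry.is_dir() and not already_there:
            target.symlink_to(entry.resolve(), target_is_directory=True)
            linked.append(target)
    return linked
```

Calling `link_corpus_dirs("/mnt/clueweb12")` (the mount point is made up) would populate the default corpus directory. Running it a second time links nothing new.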


"clueweb12"

ClueWeb 2012 web document collection. Contains 733M web pages.

The dataset is obtained for a fee from CMU and is shipped as hard drives. More information is provided in the Data Access Information section above.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>

You can find more details about the Python API here.
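Because `body` and `http_headers` are bytes, the body has to be decoded before it can be used as text. The sketch below is one possible approach under stated assumptions: the `WarcDoc` namedtuple is a local stand-in with the six fields listed above, `body_text` is a hypothetical helper, the charset is assumed to be declared in the HTTP headers, and the sample document is made up.

```python
import re
from collections import namedtuple

# Stand-in with the six documented fields; the real class ships with ir_datasets.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)

# Matches e.g. "charset=ISO-8859-1" anywhere in the raw header bytes.
_CHARSET_RE = re.compile(rb"charset=[\"']?([\w.:-]+)", re.IGNORECASE)


def body_text(doc, default_encoding="utf-8"):
    """Decode doc.body using a charset found in the HTTP headers, if any."""
    match = _CHARSET_RE.search(doc.http_headers)
    encoding = match.group(1).decode("ascii") if match else default_encoding
    try:
        return doc.body.decode(encoding, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the default encoding.
        return doc.body.decode(default_encoding, errors="replace")


# Made-up document for illustration only.
sample = WarcDoc(
    doc_id="example-doc-0",
    url="http://example.com/",
    date="",
    http_headers=b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n",
    body=b"<p>caf\xe9</p>",  # \xe9 is "e with acute accent" in ISO-8859-1
    body_content_type="text/html",
)
print(body_text(sample))
```

Undecodable bytes are replaced instead of raising, which is usually preferable when scanning a large crawl.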


"clueweb12/b13"

Official subset of the ClueWeb12 datasets with 52M web pages.

docs

Language: en

Document type:
WarcDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. date: str
  4. http_headers: bytes
  5. body: bytes
  6. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
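With 52M pages, it is often useful to inspect a small prefix of the collection before processing everything. The sketch below counts `body_content_type` values over the first few documents of any document iterator. `content_type_counts` is a hypothetical helper, and the `WarcDoc` stand-in and sample documents are made up so the example runs without the corpus.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in with the six documented fields, used only to make the sketch runnable.
WarcDoc = namedtuple(
    "WarcDoc",
    ["doc_id", "url", "date", "http_headers", "body", "body_content_type"],
)


def content_type_counts(docs, limit=1000):
    """Count body_content_type values over at most `limit` documents."""
    return Counter(doc.body_content_type for doc in islice(docs, limit))


# Made-up documents for illustration only.
sample_docs = [
    WarcDoc("doc-1", "http://example.com/a", "", b"", b"<html></html>", "text/html"),
    WarcDoc("doc-2", "http://example.com/b", "", b"", b"<html></html>", "text/html"),
    WarcDoc("doc-3", "http://example.com/c", "", b"", b"hello", "text/plain"),
]
print(content_type_counts(iter(sample_docs), limit=2))
```

With the real data, the first argument would be `dataset.docs_iter()` from the example above.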



"clueweb12/b13/clef-ehealth"

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using the iteration field (2016 is iteration 0, 2017 is iteration 1).

queries (docs, qrels, and citation also available)

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
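The description above says the assessment year is encoded in the iteration (0 for 2016, 1 for 2017). A minimal sketch of splitting judgments by year follows. The `EhealthQrel` stand-in and its field names other than `iteration` are assumptions based on that description, `qrels_by_year` is a hypothetical helper, and the sample judgments are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in qrel type; the field layout is an assumption, not the library's definition.
EhealthQrel = namedtuple(
    "EhealthQrel",
    ["query_id", "doc_id", "relevance", "trustworthiness", "understandability", "iteration"],
)

# Mapping stated in the dataset description: iteration 0 is 2016, iteration 1 is 2017.
ITERATION_TO_YEAR = {"0": 2016, "1": 2017}


def qrels_by_year(qrels):
    """Group judgments by assessment year using the iteration field."""
    by_year = defaultdict(list)
    for qrel in qrels:
        # str() so the lookup works whether iteration is stored as "0" or 0.
        by_year[ITERATION_TO_YEAR[str(qrel.iteration)]].append(qrel)
    return dict(by_year)


# Made-up judgments for illustration only.
sample_qrels = [
    EhealthQrel("q1", "doc-a", 1, 80, 60, "0"),
    EhealthQrel("q1", "doc-b", 0, 20, 40, "1"),
    EhealthQrel("q2", "doc-c", 2, 90, 70, "0"),
]
for year, items in sorted(qrels_by_year(sample_qrels).items()):
    print(year, len(items))
```

With the real data, the input would be `dataset.qrels_iter()`.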



"clueweb12/b13/clef-ehealth/cs"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: cs

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/cs")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
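For cross-lingual experiments it can help to line up each translated query with its English original. The sketch below assumes, without confirmation from this page, that the translated datasets reuse the query IDs of clueweb12/b13/clef-ehealth. `pair_translations` is a hypothetical helper, `GenericQuery` is a local stand-in, and the query texts are placeholders. The same approach would apply to the other translated variants.

```python
from collections import namedtuple

# Stand-in with the two documented fields.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])


def pair_translations(english_queries, translated_queries):
    """Return {query_id: (english_text, translated_text)} for IDs present in both."""
    english = {q.query_id: q.text for q in english_queries}
    return {
        q.query_id: (english[q.query_id], q.text)
        for q in translated_queries
        if q.query_id in english
    }


# Placeholder queries for illustration only.
english_sample = [
    GenericQuery("1", "<English text of query 1>"),
    GenericQuery("2", "<English text of query 2>"),
]
czech_sample = [GenericQuery("1", "<Czech text of query 1>")]
print(pair_translations(english_sample, czech_sample))
```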



"clueweb12/b13/clef-ehealth/de"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: de

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/de")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"clueweb12/b13/clef-ehealth/fr"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: fr

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/fr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"clueweb12/b13/clef-ehealth/hu"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: hu

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/hu")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"clueweb12/b13/clef-ehealth/pl"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: pl

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/pl")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"clueweb12/b13/clef-ehealth/sv"

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

queries (docs, qrels, and citation also available)

Language: sv

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/clef-ehealth/sv")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"clueweb12/b13/ntcir-www-1"

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

queries (docs, qrels, and citation also available)

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
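The "avg 255 per query" figure in the description is a judgments-per-query average. A minimal sketch of computing that kind of statistic from any iterable of qrels follows. The `Qrel` stand-in, the helper names, and the sample judgments are all made up for illustration, and only a `query_id` attribute is assumed on each judgment.

```python
from collections import Counter, namedtuple

# Stand-in judgment record; only query_id is used by the helpers below.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def judgments_per_query(qrels):
    """Count how many judgments each query has."""
    return Counter(qrel.query_id for qrel in qrels)


def average_judgments(qrels):
    """Mean number of judgments per judged query (0.0 for empty input)."""
    counts = judgments_per_query(qrels)
    return sum(counts.values()) / len(counts) if counts else 0.0


# Made-up judgments for illustration only.
sample_qrels = [
    Qrel("q1", "doc-a", 2),
    Qrel("q1", "doc-b", 0),
    Qrel("q1", "doc-c", 1),
    Qrel("q2", "doc-a", 0),
]
print(average_judgments(sample_qrels))
```

Queries with no judgments at all do not appear in the qrels, so they are not counted in the denominator.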



"clueweb12/b13/ntcir-www-2"

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.

queries (docs, qrels, and citation also available)

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>



"clueweb12/b13/ntcir-www-3"

The NTCIR-15 We Want Web (WWW) 3 ad-hoc ranking benchmark. Contains 160 queries with deep relevance judgments (to be released). 80 of the queries are from clueweb12/b13/ntcir-www-2.

queries (docs also available)

Language: en

Query type:
NtcirQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/ntcir-www-3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>
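Since 80 of the queries come from clueweb12/b13/ntcir-www-2, it may be useful to separate the reused queries from the new ones. The sketch below assumes that reused queries keep the same `query_id` in both datasets, which this page does not confirm. `split_by_origin` is a hypothetical helper, `NtcirQuery` is a local stand-in with the documented fields, and the sample queries are made up.

```python
from collections import namedtuple

# Stand-in with the three documented fields.
NtcirQuery = namedtuple("NtcirQuery", ["query_id", "title", "description"])


def split_by_origin(www3_queries, www2_queries):
    """Split WWW-3 queries into (reused, new) by matching query_id against WWW-2."""
    www2_ids = {q.query_id for q in www2_queries}
    reused, new = [], []
    for query in www3_queries:
        (reused if query.query_id in www2_ids else new).append(query)
    return reused, new


# Made-up queries for illustration only.
www2_sample = [NtcirQuery("A", "title a", "description a")]
www3_sample = [
    NtcirQuery("A", "title a", "description a"),
    NtcirQuery("B", "title b", "description b"),
]
reused, new = split_by_origin(www3_sample, www2_sample)
print(len(reused), len(new))
```

If the IDs turn out not to match across the two datasets, comparing on `title` instead would be a natural alternative.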



"clueweb12/b13/trec-misinfo-2019"

The TREC Medical Misinformation 2019 dataset.

queries (docs, qrels, and citation also available)

Language: en

Query type:
MisinfoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. cochranedoi: str
  4. description: str
  5. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/b13/trec-misinfo-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, cochranedoi, description, narrative>
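The `cochranedoi` field ties each query to a Cochrane review. As a minimal sketch, assuming the field holds a bare DOI string (not verified here), a resolver URL can be built as below. `cochrane_url` is a hypothetical helper, `MisinfoQuery` is a local stand-in with the documented fields, and the sample query and DOI are made up.

```python
from collections import namedtuple
from urllib.parse import quote

# Stand-in with the five documented fields.
MisinfoQuery = namedtuple(
    "MisinfoQuery",
    ["query_id", "title", "cochranedoi", "description", "narrative"],
)


def cochrane_url(query):
    """Build a doi.org resolver URL, assuming cochranedoi is a bare DOI."""
    doi = query.cochranedoi.strip()
    # Keep "/" unescaped because it separates the DOI prefix and suffix.
    return "https://doi.org/" + quote(doi, safe="/")


# Made-up query and DOI for illustration only.
sample_query = MisinfoQuery(
    query_id="1",
    title="<title>",
    cochranedoi="10.1002/14651858.CD000000.pub2",
    description="<description>",
    narrative="<narrative>",
)
print(cochrane_url(sample_query))
```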



"clueweb12/trec-web-2013"

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries (docs, qrels, and citation also available)

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
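Each query carries a nested tuple of subtopics, which is awkward to work with in tabular tools. The sketch below flattens queries into one row per subtopic. `subtopic_rows` is a hypothetical helper, the two namedtuples are local stand-ins with the documented fields, and the sample query (including its `type` values) is made up.

```python
from collections import namedtuple

# Stand-ins with the documented fields.
TrecSubtopic = namedtuple("TrecSubtopic", ["number", "text", "type"])
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def subtopic_rows(queries):
    """Yield (query_id, subtopic_number, subtopic_type, subtopic_text) tuples."""
    for query in queries:
        for subtopic in query.subtopics:
            yield (query.query_id, subtopic.number, subtopic.type, subtopic.text)


# Made-up query for illustration only; the type values are hypothetical.
sample_queries = [
    TrecWebTrackQuery(
        query_id="1",
        query="<query text>",
        description="<description>",
        type="faceted",
        subtopics=(
            TrecSubtopic("1", "<first subtopic>", "inf"),
            TrecSubtopic("2", "<second subtopic>", "nav"),
        ),
    ),
]
for row in subtopic_rows(sample_queries):
    print(row)
```

Queries with an empty `subtopics` tuple produce no rows.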



"clueweb12/trec-web-2014"

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

queries (docs, qrels, and citation also available)

Language: en

Query type:
TrecWebTrackQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. description: str
  4. type: str
  5. subtopics: Tuple[
    TrecSubtopic: (namedtuple)
    1. number: str
    2. text: str
    3. type: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("clueweb12/trec-web-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
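The query-level `type` field can be used to partition the topics, for example to evaluate each kind of query separately. The sketch below groups query IDs by `type`. `group_by_type` is a hypothetical helper, `TrecWebTrackQuery` is a local stand-in with the documented fields, and the sample queries and their `type` values are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in with the documented fields.
TrecWebTrackQuery = namedtuple(
    "TrecWebTrackQuery",
    ["query_id", "query", "description", "type", "subtopics"],
)


def group_by_type(queries):
    """Return {type: [query_id, ...]} preserving the input order within each group."""
    groups = defaultdict(list)
    for query in queries:
        groups[query.type].append(query.query_id)
    return dict(groups)


# Made-up queries for illustration only; the type values are hypothetical.
sample_queries = [
    TrecWebTrackQuery("1", "<query 1>", "<description>", "faceted", ()),
    TrecWebTrackQuery("2", "<query 2>", "<description>", "single", ()),
    TrecWebTrackQuery("3", "<query 3>", "<description>", "faceted", ()),
]
print(group_by_type(sample_queries))
```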
