ir_datasets: ClueWeb12
ClueWeb 2012 web document collection. Contains 733M web pages.
The dataset is obtained from CMU for a fee and is shipped on hard drives. More information is provided here.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Official subset of the ClueWeb12 datasets with 52M web pages.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.
This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using the iteration field (2016 is iteration 0, 2017 is iteration 1).
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
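Since the 2016 and 2017 judgments are combined, it can be handy to separate them again by the iteration field. The following is a minimal sketch under stated assumptions: `Qrel`, `split_by_year`, and the sample records (including the score values) are hypothetical stand-ins that only mirror the qrel fields listed above; real records would come from `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in mirroring the qrel fields listed above.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance',
                           'trustworthiness', 'understandability', 'iteration'])

# Mapping described in the text: iteration 0 is 2016, iteration 1 is 2017.
YEAR_BY_ITERATION = {'0': 2016, '1': 2017}


def split_by_year(qrels):
    """Group qrels by assessment year, using the iteration field."""
    by_year = defaultdict(list)
    for qrel in qrels:
        # str() so the lookup works whether iteration is stored as str or int.
        by_year[YEAR_BY_ITERATION[str(qrel.iteration)]].append(qrel)
    return dict(by_year)


# Made-up records for illustration only.
sample_qrels = [
    Qrel('q1', 'doc-a', 2, 80, 60, '0'),
    Qrel('q1', 'doc-b', 0, 10, 90, '1'),
    Qrel('q2', 'doc-c', 1, 50, 50, '1'),
]
by_year = split_by_year(sample_qrels)
print({year: len(items) for year, items in by_year.items()})
```

With real data, passing `dataset.qrels_iter()` in place of `sample_qrels` should work the same way, provided the iteration values are 0 and 1 as described.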
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.
Language: cs
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/cs')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/cs')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/cs')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.
Language: de
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/de')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/de')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/de')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.
Language: fr
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/fr')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/fr')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/fr')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.
Language: hu
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/hu')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/hu')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/hu')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.
Language: pl
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/pl')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/pl')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/pl')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.
Language: sv
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/sv')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/sv')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Somewhat relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/clef-ehealth/sv')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, trustworthiness, understandability, iteration>
The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-1')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-1')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Two annotators rated as non-relevant |
1 | One annotator rated as relevant, one as non-relevant |
2 | Two annotators rated as relevant, OR one rated as highly relevant and one as non-relevant |
3 | One annotator rated as highly relevant, one as relevant |
4 | Two annotators rated as highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-1')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
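The five levels in the table above are consistent with simply summing two per-annotator grades. The sketch below checks that reading row by row. It assumes each annotator assigns 0 (non-relevant), 1 (relevant), or 2 (highly relevant); the `aggregate` helper is hypothetical and is an observation about the table, not a statement of how the task organizers computed the labels.

```python
# Assumed per-annotator grades for this sketch:
# 0 = non-relevant, 1 = relevant, 2 = highly relevant.
VALID_GRADES = (0, 1, 2)


def aggregate(grade_a, grade_b):
    """Combine two annotator grades into one 0-4 level by summing them."""
    for grade in (grade_a, grade_b):
        if grade not in VALID_GRADES:
            raise ValueError(f'unexpected grade: {grade!r}')
    return grade_a + grade_b


# Each row of the table above, written as (grade_a, grade_b) -> level.
table_rows = {
    (0, 0): 0,  # two annotators rated as non-relevant
    (1, 0): 1,  # one relevant, one non-relevant
    (1, 1): 2,  # two relevant
    (2, 0): 2,  # one highly relevant, one non-relevant
    (2, 1): 3,  # one highly relevant, one relevant
    (2, 2): 4,  # two highly relevant
}
matches = all(aggregate(a, b) == level for (a, b), level in table_rows.items())
print(matches)
```

One consequence of this reading is that level 2 is ambiguous: it covers both agreement on "relevant" and a split between "highly relevant" and "non-relevant".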
The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-2')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-2')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Two annotators rated as non-relevant |
1 | One annotator rated as relevant, one as non-relevant |
2 | Two annotators rated as relevant, OR one rated as highly relevant and one as non-relevant |
3 | One annotator rated as highly relevant, one as relevant |
4 | Two annotators rated as highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-2')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
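The "avg ... per query" figures quoted for the NTCIR WWW benchmarks describe judgment depth, which can be recomputed from the qrels. The following is a minimal sketch: `Qrel`, `judgments_per_query`, `average_depth`, and the sample records are hypothetical stand-ins that mirror the qrel fields listed above; with the real data, `dataset.qrels_iter()` would supply the records.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in mirroring the qrel fields listed above.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance', 'iteration'])


def judgments_per_query(qrels):
    """Count how many judged documents each query has."""
    return Counter(qrel.query_id for qrel in qrels)


def average_depth(qrels):
    """Mean number of judgments per query (0.0 if there are no qrels)."""
    counts = judgments_per_query(qrels)
    return sum(counts.values()) / len(counts) if counts else 0.0


# Made-up records for illustration only.
sample_qrels = [
    Qrel('0001', 'doc-a', 0, '0'),
    Qrel('0001', 'doc-b', 3, '0'),
    Qrel('0001', 'doc-c', 1, '0'),
    Qrel('0002', 'doc-d', 4, '0'),
]
avg = average_depth(sample_qrels)
print(avg)
```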
The NTCIR-15 We Want Web (WWW) 3 ad-hoc ranking benchmark. Contains 160 queries with deep relevance judgments (to be released). 80 of the queries are from clueweb12/b13/ntcir-www-2.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-3')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/ntcir-www-3')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
The TREC Medical Misinformation 2019 dataset.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/trec-misinfo-2019')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, cochranedoi, description, narrative>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/trec-misinfo-2019')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
0 | Not relevant |
1 | Relevant |
2 | Highly relevant |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/b13/trec-misinfo-2019')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, effectiveness, redibility>
The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2013')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2013')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk. |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2013')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
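Because the TREC Web Track scale includes a negative "Junk" level, it is often useful to name the levels and to decide explicitly how to binarize them. The sketch below is illustrative only: `Qrel`, `LEVEL_NAMES`, `is_relevant`, `drop_junk`, and the sample records are hypothetical, and the choice of 1 (Rel) as the minimum relevant level is an assumption, not something the benchmark prescribes.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the qrel fields listed above.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

# Short names taken from the relevance-level table above.
LEVEL_NAMES = {-2: 'Junk', 0: 'Non', 1: 'Rel', 2: 'HRel', 3: 'Key', 4: 'Nav'}


def is_relevant(qrel, min_level=1):
    """Binary relevance; min_level=1 is an assumed threshold."""
    return qrel.relevance >= min_level


def drop_junk(qrels):
    """Remove documents judged as Junk (-2)."""
    return [qrel for qrel in qrels if qrel.relevance != -2]


# Made-up records for illustration only.
sample_qrels = [
    Qrel('201', 'doc-a', -2, '0'),
    Qrel('201', 'doc-b', 0, '0'),
    Qrel('201', 'doc-c', 1, '0'),
    Qrel('201', 'doc-d', 4, '0'),
]
kept = drop_junk(sample_qrels)
relevant = [qrel for qrel in kept if is_relevant(qrel)]
print([LEVEL_NAMES[qrel.relevance] for qrel in relevant])
```

Raising `min_level` to 2 would count only HRel, Key, and Nav pages as relevant, which may suit stricter evaluations.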
The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2014')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, query, description, type, subtopics>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2014')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
Relevance levels
Rel. | Definition |
---|---|
-2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk. |
0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. |
1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. |
2 | HRel: The content of this page provides substantial information on the topic. |
3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. |
4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. |
Example
import ir_datasets
dataset = ir_datasets.load('clueweb12/trec-web-2014')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>