ir_datasets: HC4 (HLTCOE CLIR Common-Crawl Collection)
To access the documents of this dataset, you will need to download them from Common Crawl. The scripts for downloading and validating the documents are in the HLTCOE/HC4 repository. Use the following commands to download the documents:
git clone https://github.com/hltcoe/HC4
cd HC4
pip install -r requirements.txt
python download_documents.py --storage ~/.ir_datasets/hc4/ \
    --zho ./resources/hc4/zho/ids.jsonl.gz \
    --fas ./resources/hc4/fas/ids.jsonl.gz \
    --rus ./resources/hc4/rus/ids.*.jsonl.gz \
    --jobs {number of processes}
After the download, post-process the downloaded files to verify that all and only the specified documents were downloaded, and to reorder the collection so that it matches the ordering specified in the ID files:
for lang in zho fas rus; do
    python fix_document_order.py --hc4_file ~/.ir_datasets/hc4/$lang/hc4_docs.jsonl \
        --id_file ./resources/hc4/$lang/ids*.jsonl.gz \
        --check_hash
done
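To make "all and only the specified documents, in the specified order" concrete, here is a minimal Python sketch of that kind of check. It does not reproduce the logic of fix_document_order.py; the helper name compare_ids and the sample IDs are hypothetical, and hash checking is left out.

```python
def compare_ids(expected_ids, downloaded_ids):
    """Compare the expected document IDs with the IDs actually downloaded.

    Returns (missing, extra, in_order). Hypothetical helper for illustration;
    it is not part of the HC4 repository.
    """
    expected = set(expected_ids)
    downloaded = set(downloaded_ids)
    missing = sorted(expected - downloaded)  # specified but not downloaded
    extra = sorted(downloaded - expected)    # downloaded but not specified
    # The collection is correctly ordered only if both sequences are identical.
    in_order = list(expected_ids) == list(downloaded_ids)
    return missing, extra, in_order


# Made-up IDs standing in for the contents of an ID file and of hc4_docs.jsonl.
expected_ids = ["doc-a", "doc-b", "doc-c"]
downloaded_ids = ["doc-b", "doc-a", "doc-d"]

missing, extra, in_order = compare_ids(expected_ids, downloaded_ids)
print(missing, extra, in_order)
```

A collection passes this check only when both lists are empty and the order flag is true.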
You can also store the documents in another directory and create a symbolic link to it at ~/.ir_datasets/hc4/.
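One way to create such a link from Python is sketched below. The helper link_storage and the directory names are hypothetical; the demonstration runs inside a temporary directory so that it does not touch a real ~/.ir_datasets folder.

```python
import tempfile
from pathlib import Path


def link_storage(actual_dir, link_path):
    """Make link_path a symbolic link to actual_dir (hypothetical helper)."""
    actual_dir = Path(actual_dir)
    link_path = Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # Refuse to overwrite anything that already exists at the link location.
    if link_path.exists() or link_path.is_symlink():
        raise FileExistsError(f"{link_path} already exists")
    link_path.symlink_to(actual_dir.resolve(), target_is_directory=True)
    return link_path


# Demonstration in a temporary directory; for real use the link path
# would be Path.home() / ".ir_datasets" / "hc4".
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "big_disk" / "hc4"
    storage.mkdir(parents=True)
    link = link_storage(storage, Path(tmp) / "home" / ".ir_datasets" / "hc4")
    linked_ok = link.is_symlink() and link.resolve() == storage.resolve()
```

The helper raises instead of replacing an existing path, so an already downloaded collection is not hidden by accident.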
HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments.
Bibtex:
@article{Lawrie2022HC4, author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang}, title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR}, booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}}, year = {2022}, month = apr, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Stavanger, Norway}, url = {https://arxiv.org/abs/2201.09992} }
The Persian collection contains English queries and Persian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Persian is available.
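The choice between these query variants can be sketched as follows. The sketch assumes that title holds the English query, ht_title the human translation, and mt_title the machine translation; that reading of the prefixes is an assumption based on the field names of the query objects. Query is a stand-in namedtuple with a subset of the fields, and query_text is a hypothetical helper, not part of ir_datasets.

```python
from collections import namedtuple

# Stand-in with a subset of the query fields; real query objects come from
# dataset.queries_iter() on one of the hc4 splits.
Query = namedtuple("Query", ["query_id", "title", "ht_title", "mt_title"])

# Assumed meaning of the fields: title = English original,
# ht_title = human translation, mt_title = machine translation.
FIELD_BY_MODE = {"english": "title", "human": "ht_title", "machine": "mt_title"}


def query_text(query, mode):
    """Return the title variant for the chosen setup (hypothetical helper)."""
    return getattr(query, FIELD_BY_MODE[mode])


example = Query("1", "<English title>", "<human translation>", "<machine translation>")
print(query_text(example, "machine"))
```

Under these assumptions, the "english" mode corresponds to a cross-language run with the original query, while the other two modes feed a query in the document language to a monolingual retrieval system.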
Language: fa
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/fa docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Metadata:
{ "docs": { "count": 486486, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } } }
Development split of hc4/fa.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/fa/dev queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/fa
Language: fa
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/fa/dev docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 456 | 80.7% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 46 | 8.1% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 63 | 11.2% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/fa/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
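Evaluation code often wants the judgments grouped by query rather than as a flat stream. The sketch below shows one way to do that with the graded labels 0, 1, and 3; Qrel is a stand-in namedtuple with the same fields as the qrels shown above, the sample judgments are made up, and group_qrels is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Stand-in for the qrel objects yielded by dataset.qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels, min_relevance=1):
    """Map query_id -> {doc_id: relevance}, keeping labels >= min_relevance."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up judgments using the three relevance levels of the collection.
sample = [
    Qrel("q1", "d1", 3, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q2", "d3", 1, "0"),
]

judged = group_qrels(sample, min_relevance=0)  # every judged document
valuable = group_qrels(sample)                 # somewhat- or very-valuable only
print(valuable)
```

Setting min_relevance to 3 would keep only the very-valuable documents, which is one possible way to binarize the graded labels.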
Metadata:
{ "docs": { "count": 486486, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 565, "fields": { "relevance": { "counts_by_value": { "0": 456, "3": 63, "1": 46 } } } } }
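The qrels part of the metadata record for hc4/fa/dev carries the same counts as the relevance table for that split. As a small worked check, the sketch below parses that part of the record and recomputes the percentage column; relevance_shares is a hypothetical helper name.

```python
import json

# The "qrels" portion of the hc4/fa/dev metadata record.
metadata = json.loads(
    '{"qrels": {"count": 565, "fields": {"relevance": '
    '{"counts_by_value": {"0": 456, "3": 63, "1": 46}}}}}'
)


def relevance_shares(meta):
    """Return {relevance level: percentage of all qrels}, rounded to 0.1."""
    qrels = meta["qrels"]
    total = qrels["count"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    return {int(level): round(100 * n / total, 1) for level, n in counts.items()}


shares = relevance_shares(metadata)
print(shares)
```

The resulting shares are 80.7, 8.1, and 11.2 percent for levels 0, 1, and 3, in agreement with the table.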
Test split of hc4/fa.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/fa/test queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/fa
Language: fa
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/fa/test docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 2101 | 83.3% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 215 | 8.5% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 206 | 8.2% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/fa/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 486486, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 2522, "fields": { "relevance": { "counts_by_value": { "0": 2101, "1": 215, "3": 206 } } } } }
Train split of hc4/fa.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/fa/train queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/fa
Language: fa
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/fa/train docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 67 | 59.8% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 23 | 20.5% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 22 | 19.6% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/fa/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 486486, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 8 }, "qrels": { "count": 112, "fields": { "relevance": { "counts_by_value": { "1": 23, "3": 22, "0": 67 } } } } }
The Russian collection contains English queries and Russian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Russian is available.
Language: ru
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/ru docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Metadata:
{ "docs": { "count": 4721064, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } } }
Development split of hc4/ru.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/ru/dev queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/ru
Language: ru
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/ru/dev docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 186 | 70.2% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 67 | 25.3% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 12 | 4.5% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/ru/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 4721064, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 4 }, "qrels": { "count": 265, "fields": { "relevance": { "counts_by_value": { "0": 186, "1": 67, "3": 12 } } } } }
Test split of hc4/ru.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/ru/test queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/ru
Language: ru
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/ru/test docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 2297 | 77.3% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 411 | 13.8% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 262 | 8.8% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/ru/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 4721064, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 2970, "fields": { "relevance": { "counts_by_value": { "0": 2297, "1": 411, "3": 262 } } } } }
Train split of hc4/ru.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/ru/train queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/ru
Language: ru
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/ru/train docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 38 | 41.3% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 31 | 33.7% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 23 | 25.0% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/ru/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 4721064, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 7 }, "qrels": { "count": 92, "fields": { "relevance": { "counts_by_value": { "1": 31, "3": 23, "0": 38 } } } } }
The Chinese collection contains English queries and Chinese documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Chinese is available.
Language: zh
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/zh docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Metadata:
{ "docs": { "count": 646305, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } } }
Development split of hc4/zh.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/zh/dev queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/zh
Language: zh
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/zh/dev docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 374 | 80.3% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 30 | 6.4% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 62 | 13.3% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/zh/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 646305, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 10 }, "qrels": { "count": 466, "fields": { "relevance": { "counts_by_value": { "0": 374, "3": 62, "1": 30 } } } } }
Test split of hc4/zh.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/zh/test queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/zh
Language: zh
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/zh/test docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 2277 | 82.8% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 192 | 7.0% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 282 | 10.3% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/zh/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 646305, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 2751, "fields": { "relevance": { "counts_by_value": { "0": 2277, "3": 282, "1": 192 } } } } }
Train split of hc4/zh.
Language: multiple/other/unknown
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
CLI:
ir_datasets export hc4/zh/train queries
[query_id] [title] [description] [ht_title] [ht_description] [mt_title] [mt_description] [narrative_by_relevance] [report] [report_url] [report_date] [translation_lang]
...
Inherits docs from hc4/zh
Language: zh
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, time, cc_file>
CLI:
ir_datasets export hc4/zh/train docs
[doc_id] [title] [text] [url] [time] [cc_file]
...
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 173 | 50.7% |
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 140 | 41.1% |
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 28 | 8.2% |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
CLI:
ir_datasets export hc4/zh/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
Metadata:
{ "docs": { "count": 646305, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 23 }, "qrels": { "count": 341, "fields": { "relevance": { "counts_by_value": { "0": 173, "1": 140, "3": 28 } } } } }