Github: datasets/hc4.py

ir_datasets: HC4 (HLTCOE CLIR Common-Crawl Collection)

Index
  1. hc4
  2. hc4/fa
  3. hc4/fa/dev
  4. hc4/fa/test
  5. hc4/fa/train
  6. hc4/ru
  7. hc4/ru/dev
  8. hc4/ru/test
  9. hc4/ru/train
  10. hc4/zh
  11. hc4/zh/dev
  12. hc4/zh/test
  13. hc4/zh/train

Data Access Information

To access the documents of this dataset, you will need to download them from Common Crawl. The scripts for downloading and validating the documents are in the HLTCOE/HC4 repository. Use the following commands to download the documents:

git clone https://github.com/hltcoe/HC4
cd HC4
pip install -r requirements.txt
python download_documents.py --storage ~/.ir_datasets/hc4/ \
    --zho ./resources/hc4/zho/ids.jsonl.gz \
    --fas ./resources/hc4/fas/ids.jsonl.gz \
    --rus ./resources/hc4/rus/ids.*.jsonl.gz \
    --jobs {number of processes}

After downloading, post-process the downloaded files to verify that all, and only, the specified documents were downloaded, and to reorder the collection so that it matches the ordering given in the id files.

for lang in zho fas rus; do
  python fix_document_order.py --hc4_file ~/.ir_datasets/hc4/$lang/hc4_docs.jsonl \
   --id_file ./resources/hc4/$lang/ids*.jsonl.gz \
   --check_hash
done
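
As an additional sanity check after post-processing, the document order can be compared against the id files directly. The following is a minimal sketch, not part of the HC4 scripts: the function names are hypothetical, and it assumes that every line in both the document file and the id files is a JSON object carrying the document identifier under an "id" key. It does not check hashes, so it does not replace fix_document_order.py.

```python
import gzip
import json
from itertools import zip_longest
from pathlib import Path
from typing import Iterable, Iterator, Optional


def read_ids(path: Path, key: str = "id") -> Iterator[str]:
    """Yield the `key` field of each JSON line in a plain or gzipped file.

    The "id" key is an assumption about the file layout.
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)[key]


def first_mismatch(docs_file: Path, id_files: Iterable[Path],
                   key: str = "id") -> Optional[int]:
    """Return the 0-based position of the first id that differs, or None.

    `id_files` are read in the order given; a missing or extra document
    shows up as a mismatch at the position where one side runs out.
    """
    expected = (i for p in id_files for i in read_ids(p, key))
    actual = read_ids(docs_file, key)
    for pos, (e, a) in enumerate(zip_longest(expected, actual)):
        if e != a:
            return pos
    return None
```

A return value of None means the two sequences of ids are identical in content and order. For Russian, which has several id files, the caller decides the order in which they are passed.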

You can also store the documents in another directory and create a symbolic link at ~/.ir_datasets/hc4/ that points to it.
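
A minimal sketch of creating such a link from Python is shown here. The helper name link_hc4_storage and the example storage path are hypothetical; the sketch refuses to overwrite anything that already exists at the link location.

```python
from pathlib import Path


def link_hc4_storage(actual_dir: Path, link_path: Path) -> None:
    """Make `link_path` a symbolic link that points to `actual_dir`."""
    actual_dir = actual_dir.expanduser().resolve()
    link_path = link_path.expanduser()
    if not actual_dir.is_dir():
        raise NotADirectoryError(str(actual_dir))
    # Do not clobber an existing directory, file, or (possibly broken) link.
    if link_path.is_symlink() or link_path.exists():
        raise FileExistsError(str(link_path))
    link_path.parent.mkdir(parents=True, exist_ok=True)
    link_path.symlink_to(actual_dir, target_is_directory=True)


# Example call (the storage path /data/hc4 is made up for illustration):
# link_hc4_storage(Path("/data/hc4"), Path("~/.ir_datasets/hc4"))
```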


"hc4"

HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments.

  • Documents: Web pages from Common Crawl in Chinese, Persian, and Russian.
  • Queries: English TREC-style title/description queries. The narrative field contains an example passage for each relevance level. Human and machine translations of the titles and descriptions into the target language (i.e., the document language) are provided in the query object. (Titles and descriptions are also machine-translated into all three target languages, including languages in which the topic was not assessed, to facilitate CLIR for pairs other than English-to-X, e.g., Persian-to-Chinese. Please refer to the original dataset repository for these additional resources.)
  • Report: Each query comes with an English report that is designed to be written by professional searchers prior to the search.
  • Qrels: Documents are judged in three levels of relevance. Please refer to the dataset paper for the full definition of the levels.
  • Repository
  • Dataset Paper
Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

"hc4/fa"

The Persian collection contains English queries and Persian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval under the assumption that a machine translation of the query into Persian is available.
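
The choice between the original English query and its translations can be sketched as a small helper. This is an illustration only: the function query_text is hypothetical, and it assumes that the ht_ and mt_ prefixes on the query fields stand for the human and machine translations, respectively.

```python
from collections import namedtuple

# Field-name prefixes per query variant (the ht_/mt_ reading is an assumption).
_PREFIX = {"english": "", "human": "ht_", "machine": "mt_"}


def query_text(query, source="english", field="title"):
    """Return the title or description of `query` in the requested variant."""
    if source not in _PREFIX:
        raise ValueError(f"unknown source: {source!r}")
    if field not in ("title", "description"):
        raise ValueError(f"unknown field: {field!r}")
    return getattr(query, _PREFIX[source] + field)


# Stand-in for a query object, so the sketch runs without downloading data.
DemoQuery = namedtuple(
    "DemoQuery",
    ["query_id", "title", "description",
     "ht_title", "ht_description", "mt_title", "mt_description"],
)
demo = DemoQuery("0", "en title", "en desc",
                 "ht title", "ht desc", "mt title", "mt desc")
print(query_text(demo, "human"))  # human-translated title, for monolingual runs
```

With real data, the same helper could be applied to each item of dataset.queries_iter() for one of the splits, for example "hc4/fa/dev".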

486K docs

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.


"hc4/fa/dev"

Development split of hc4/fa.

10 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.
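
The relevance judgments of a split can be summarized per relevance level. The sketch below is a hedged illustration: the helper relevance_histogram is hypothetical, and it assumes each judgment exposes a relevance attribute, as the TREC-style qrels returned by qrels_iter() in ir_datasets do. A stand-in record type is used so that the example runs without the downloaded data.

```python
from collections import Counter, namedtuple


def relevance_histogram(qrels):
    """Count judgments per relevance level for any iterable of qrels."""
    return Counter(q.relevance for q in qrels)


# Stand-in judgments; the ids and levels here are invented for illustration.
DemoQrel = namedtuple("DemoQrel", ["query_id", "doc_id", "relevance"])
demo_qrels = [
    DemoQrel("1", "d1", 0),
    DemoQrel("1", "d2", 1),
    DemoQrel("2", "d3", 0),
]
print(relevance_histogram(demo_qrels))

# With the real data (requires the dataset to be available locally):
# import ir_datasets
# dataset = ir_datasets.load("hc4/fa/dev")
# print(relevance_histogram(dataset.qrels_iter()))
```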


"hc4/fa/test"

Test split of hc4/fa.

50 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/fa/train"

Train split of hc4/fa.

8 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/ru"

The Russian collection contains English queries and Russian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval under the assumption that a machine translation of the query into Russian is available.

4.7M docs

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.
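
With millions of documents in this collection, it can be convenient to inspect only the first few records rather than run the full loop. A minimal sketch using the standard library follows; the helper name head is hypothetical, and it works on any iterator, including the one returned by docs_iter().

```python
from itertools import islice


def head(records, n=3):
    """Return the first `n` items of an iterable without consuming the rest."""
    return list(islice(records, n))


# Stand-in generator so the sketch runs without the downloaded documents.
def demo_docs():
    for i in range(1_000_000):
        yield {"doc_id": f"doc-{i}"}


for doc in head(demo_docs(), n=2):
    print(doc["doc_id"])

# With the real data (requires the documents to be downloaded first):
# import ir_datasets
# dataset = ir_datasets.load("hc4/ru")
# for doc in head(dataset.docs_iter(), n=2):
#     print(doc.doc_id, doc.title)
```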


"hc4/ru/dev"

Development split of hc4/ru.

4 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/ru/test"

Test split of hc4/ru.

50 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/ru/train"

Train split of hc4/ru.

7 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/zh"

The Chinese collection contains English queries and Chinese documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval under the assumption that a machine translation of the query into Chinese is available.

646K docs

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.


"hc4/zh/dev"

Development split of hc4/zh.

10 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/zh/test"

Test split of hc4/zh.

50 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"hc4/zh/train"

Train split of hc4/zh.

23 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.