HC4 (HLTCOE CLIR Common-Crawl Collection)

`"hc4"`

HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments.

Documents: Web pages from Common Crawl in Chinese, Persian, and Russian.
Queries: English TREC-style title/description queries. The narrative field contains an example passage for each relevance level. Human and machine translations of the titles and descriptions into the target language (i.e., the document language) are provided in the query object. (Titles and descriptions are also machine-translated into all three target languages, including languages in which the topic was not assessed, to facilitate CLIR for pairs other than English-to-X, e.g., Persian-to-Chinese. Please refer to the original dataset repository for these additional resources.)
Report: Each query comes with an English report, designed as one that a professional searcher would have written prior to the search.
Qrels: Documents are judged in three levels of relevance. Please refer to the dataset paper for the full definition of the levels.
Repository
Dataset Paper

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

`"hc4/fa"`

The Persian collection contains English queries and Persian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Persian is available.
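The three retrieval settings mentioned above differ only in which query fields are read. The following is a minimal sketch of that choice, not part of ir_datasets: the `Query` stand-in mirrors a subset of the field names listed for the query type on this page, while the helper `query_text`, its `mode` labels, and the placeholder strings are hypothetical names introduced for illustration.

```python
from collections import namedtuple

# Stand-in with a subset of the documented query fields; a real query
# object would come from dataset.queries_iter().
Query = namedtuple(
    "Query",
    ["query_id", "title", "description",
     "ht_title", "ht_description", "mt_title", "mt_description"],
)

def query_text(query, mode="clir", use_description=False):
    """Return the query string for one retrieval setting.

    mode="clir": original English fields (cross-language retrieval)
    mode="ht":   human translation (monolingual retrieval)
    mode="mt":   machine translation of the query
    """
    prefix = {"clir": "", "ht": "ht_", "mt": "mt_"}[mode]
    text = getattr(query, prefix + "title")
    if use_description:
        text = text + " " + getattr(query, prefix + "description")
    return text

# Placeholder values, not taken from the dataset.
example = Query("q1", "title en", "desc en",
                "title ht", "desc ht", "title mt", "desc mt")
```

Calling `query_text(example, mode="mt", use_description=True)` joins the machine-translated title and description with a space. Whether to include the description is a design choice left to the experiment.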

docs

486K docs

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.
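Indexers usually want one text string per document. As a hedged sketch, the snippet below joins `title` and `text` into a single field keyed by `doc_id`. The `Doc` namedtuple is a stand-in with the field names documented above, and `to_index_record`, the output keys, and the sample values are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in with the documented document fields; real documents would
# come from dataset.docs_iter().
Doc = namedtuple("Doc", ["doc_id", "title", "text", "url", "time", "cc_file"])

def to_index_record(doc):
    """Join the non-empty title and body into one searchable string."""
    parts = [p.strip() for p in (doc.title, doc.text) if p and p.strip()]
    return {"doc_id": doc.doc_id, "text": "\n".join(parts)}

# Placeholder values, not taken from the collection.
sample = Doc(
    doc_id="doc-0001",
    title="Example title",
    text="Example body.",
    url="https://example.org/a",
    time="",
    cc_file="",
)
record = to_index_record(sample)
```

Empty titles are skipped rather than emitted as blank lines, so documents without a title still produce clean text.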

CLI

ir_datasets export hc4/fa docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 486486,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"hc4/fa/dev"`

Development split of hc4/fa.

queries

10 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/dev queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

486K docs

Inherits docs from hc4/fa

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/dev docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

565 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`456`	80.7%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`46`	8.1%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`63`	11.2%
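Because the grades are 0, 1 and 3 rather than 0, 1 and 2, a graded measure that uses the grade directly as gain weights very-valuable documents three times as heavily as somewhat-valuable ones. The sketch below computes nDCG with that linear gain. Using the raw grade as the gain is an assumption here, not something this page specifies, and the function names and toy judgments are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """nDCG@k for one query; unjudged documents count as gain 0."""
    gains = [qrels_for_query.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments for a single query (made-up doc IDs).
toy_qrels = {"a": 3, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], toy_qrels)
reversed_run = ndcg_at_k(["c", "b", "a"], toy_qrels)
```

A ranking that places the grade-3 document first scores 1.0 on the toy data, while the reversed ranking scores strictly lower.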

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
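The exported TSV can be read back without ir_datasets. This is a minimal sketch that assumes the column order shown above (query_id, doc_id, relevance, iteration), tab separators, and no header row. The `Qrel` stand-in, `parse_qrels_tsv`, and the sample lines are hypothetical.

```python
from collections import namedtuple

# Stand-in mirroring the documented TrecQrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def parse_qrels_tsv(lines):
    """Yield one Qrel per non-empty tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, doc_id, relevance, iteration = line.split("\t")
        yield Qrel(query_id, doc_id, int(relevance), iteration)

# Made-up sample lines in the assumed export layout.
sample_lines = ["q1\tdoc-a\t3\t0\n", "q1\tdoc-b\t0\t0\n", "\n"]
parsed = list(parse_qrels_tsv(sample_lines))
```

The relevance column is converted to `int` so it can be compared against the grades 0, 1 and 3.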

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 486486,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 565,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 456,
          "3": 63,
          "1": 46
        }
      }
    }
  }
}
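The metadata block is plain JSON, so it can be sanity-checked directly. The sketch below copies the values shown above and confirms that the per-grade counts add up to the total number of qrels. The helper name `qrels_counts_consistent` is hypothetical.

```python
import json

# Values copied from the hc4/fa/dev metadata shown above.
metadata = json.loads("""
{
  "docs": {"count": 486486,
           "fields": {"doc_id": {"max_len": 36, "common_prefix": ""}}},
  "queries": {"count": 10},
  "qrels": {"count": 565,
            "fields": {"relevance": {"counts_by_value":
                       {"0": 456, "3": 63, "1": 46}}}}
}
""")

def qrels_counts_consistent(meta):
    """True when the per-grade counts sum to the reported qrels count."""
    qrels = meta["qrels"]
    by_value = qrels["fields"]["relevance"]["counts_by_value"]
    return sum(by_value.values()) == qrels["count"]

consistent = qrels_counts_consistent(metadata)
```

Note that JSON object keys are strings, so the grades appear as "0", "1" and "3" and need converting before numeric comparison.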

`"hc4/fa/test"`

Test split of hc4/fa.

queries

50 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/test queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

486K docs

Inherits docs from hc4/fa

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/test docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

2.5K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.1K`	83.3%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`215`	8.5%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`206`	8.2%
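Measures that expect binary relevance need a cutoff on these grades. As a sketch under the assumption that either cutoff may be reasonable depending on the experiment, the snippet below counts how many judgments are treated as relevant at each one, using the exact counts from this split's metadata. The function names are hypothetical.

```python
def is_relevant(relevance, min_rel=1):
    """Binary relevance: True when the grade reaches the chosen cutoff."""
    return relevance >= min_rel

def relevant_total(counts_by_value, min_rel=1):
    """Number of judgments that count as relevant at the given cutoff."""
    return sum(n for grade, n in counts_by_value.items()
               if is_relevant(grade, min_rel))

# Exact per-grade counts for hc4/fa/test, taken from its metadata.
test_counts = {0: 2101, 1: 215, 3: 206}
lenient = relevant_total(test_counts, min_rel=1)  # grades 1 and 3
strict = relevant_total(test_counts, min_rel=3)   # grade 3 only
```

With `min_rel=1` both somewhat-valuable and very-valuable documents count as relevant. With `min_rel=3` only very-valuable documents do.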

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 486486,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 2522,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2101,
          "1": 215,
          "3": 206
        }
      }
    }
  }
}

`"hc4/fa/train"`

Train split of hc4/fa.

queries

8 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/train queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

486K docs

Inherits docs from hc4/fa

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/train docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

112 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`67`	59.8%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`23`	20.5%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`22`	19.6%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/fa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/fa/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 486486,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 8
  },
  "qrels": {
    "count": 112,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 23,
          "3": 22,
          "0": 67
        }
      }
    }
  }
}

`"hc4/ru"`

The Russian collection contains English queries and Russian documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Russian is available.

docs

4.7M docs

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 4721064,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"hc4/ru/dev"`

Development split of hc4/ru.

queries

4 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/dev queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

4.7M docs

Inherits docs from hc4/ru

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/dev docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

265 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`186`	70.2%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`67`	25.3%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`12`	4.5%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 4721064,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 4
  },
  "qrels": {
    "count": 265,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 186,
          "1": 67,
          "3": 12
        }
      }
    }
  }
}

`"hc4/ru/test"`

Test split of hc4/ru.

queries

50 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/test queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

4.7M docs

Inherits docs from hc4/ru

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/test docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

3.0K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.3K`	77.3%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`411`	13.8%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`262`	8.8%
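The Count column is abbreviated for large values (for example `2.3K`), while the split's metadata carries exact counts. As a small sketch, the percentage column can be recomputed from those exact counts. The helper name `relevance_percentages` is hypothetical, and rounding to one decimal place is an assumption that matches the table's precision.

```python
def relevance_percentages(counts_by_value):
    """Share of judgments per grade, as percentages with one decimal."""
    total = sum(counts_by_value.values())
    return {grade: round(100 * n / total, 1)
            for grade, n in sorted(counts_by_value.items())}

# Exact per-grade counts for hc4/ru/test, taken from its metadata.
ru_test_counts = {0: 2297, 1: 411, 3: 262}
percentages = relevance_percentages(ru_test_counts)
```

The result reproduces the 77.3%, 13.8% and 8.8% figures in the table.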

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 4721064,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 2970,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2297,
          "1": 411,
          "3": 262
        }
      }
    }
  }
}

`"hc4/ru/train"`

Train split of hc4/ru.

queries

7 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/train queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

4.7M docs

Inherits docs from hc4/ru

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/train docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

92 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`38`	41.3%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`31`	33.7%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`23`	25.0%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/ru/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/ru/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 4721064,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7
  },
  "qrels": {
    "count": 92,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 31,
          "3": 23,
          "0": 38
        }
      }
    }
  }
}

`"hc4/zh"`

The Chinese collection contains English queries and Chinese documents for retrieval. Human- and machine-translated queries are provided in the query object for running monolingual retrieval, or cross-language retrieval that assumes a machine translation of the query into Chinese is available.

docs

646K docs

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 646305,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"hc4/zh/dev"`

Development split of hc4/zh.

queries

10 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/dev queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

646K docs

Inherits docs from hc4/zh

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/dev docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

466 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`374`	80.3%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`30`	6.4%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`62`	13.3%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Metadata

{
  "docs": {
    "count": 646305,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 10
  },
  "qrels": {
    "count": 466,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 374,
          "3": 62,
          "1": 30
        }
      }
    }
  }
}

`"hc4/zh/test"`

Test split of hc4/zh.

queries

50 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/test queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

646K docs

Inherits docs from hc4/zh

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.
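When feeding documents to an indexer, the `title` and `text` fields are often concatenated into one string. The sketch below shows one way to do that. It assumes either field may be empty, and the `Doc` stand-in and helper names are illustrative rather than part of ir_datasets.

```python
from collections import namedtuple

# Stand-in for the document type listed above, with the same field names.
# Real documents come from dataset.docs_iter().
Doc = namedtuple("Doc", ["doc_id", "title", "text", "url", "time", "cc_file"])


def index_text(doc):
    """Join title and body with a newline, skipping empty parts."""
    parts = [doc.title.strip(), doc.text.strip()]
    return "\n".join(part for part in parts if part)


def iter_index_records(docs):
    """Yield (doc_id, text) pairs suitable for passing to an indexer."""
    for doc in docs:
        yield doc.doc_id, index_text(doc)


# Placeholder documents; none of these values come from HC4.
sample = [
    Doc("d1", "Title", "Body text.", "", "", ""),
    Doc("d2", "", "Body only.", "", "", ""),
]
records = dict(iter_index_records(sample))
print(records)
```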

CLI

ir_datasets export hc4/zh/test docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

2.8K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.3K`	82.8%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`192`	7.0%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`282`	10.3%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
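The counts and percentages in the relevance-level table can be recomputed from the qrels themselves. The sketch below uses synthetic qrels built to match the reported counts for this split. The `relevance_distribution` helper is an illustrative name, and the query IDs, document IDs, and iteration values are placeholders.

```python
from collections import Counter, namedtuple

# Stand-in for the qrel type listed above, with the same field names.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Map each relevance level to (count, percentage rounded to 0.1)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 1))
        for level, n in sorted(counts.items())
    }


# Synthetic qrels that reproduce the per-level counts reported for hc4/zh/test.
reported_counts = {0: 2277, 1: 192, 3: 282}
synthetic = [
    TrecQrel("q", f"d{level}-{i}", level, "0")
    for level, n in reported_counts.items()
    for i in range(n)
]

distribution = relevance_distribution(synthetic)
print(distribution)
```

With a loaded dataset, `relevance_distribution(dataset.qrels_iter())` would compute the same summary from the real judgments.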

CLI

ir_datasets export hc4/zh/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4, author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang}, title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR}, booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}}, year = {2022}, month = apr, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Stavanger, Norway}, url = {https://arxiv.org/abs/2201.09992} }

Metadata

{
  "docs": {
    "count": 646305,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 2751,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2277,
          "3": 282,
          "1": 192
        }
      }
    }
  }
}

`"hc4/zh/train"`

Train split of hc4/zh.

queries

23 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/train queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

646K docs

Inherits docs from hc4/zh

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export hc4/zh/train docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

341 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`173`	50.7%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`140`	41.1%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`28`	8.2%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("hc4/zh/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
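Evaluation code often wants qrels grouped by query, sometimes keeping only documents at or above a chosen level. The levels used here are 0, 1, and 3, so a threshold of 3 keeps only the "Very-valuable" judgments. The sketch below uses an illustrative helper, `qrels_by_query`, and placeholder sample qrels.

```python
from collections import defaultdict, namedtuple

# Stand-in for the qrel type listed above, with the same field names.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels, min_relevance=0):
    """Group qrels as {query_id: {doc_id: relevance}}, dropping low levels."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Placeholder judgments; none of these IDs come from HC4.
sample = [
    TrecQrel("q1", "d1", 3, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]

all_judged = qrels_by_query(sample)
very_valuable = qrels_by_query(sample, min_relevance=3)
print(all_judged)
print(very_valuable)
```

Note that a query with no document at or above the threshold is absent from the result, so downstream code should not assume every query ID appears as a key.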

CLI

ir_datasets export hc4/zh/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4, author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang}, title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR}, booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}}, year = {2022}, month = apr, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Stavanger, Norway}, url = {https://arxiv.org/abs/2201.09992} }

Metadata

{
  "docs": {
    "count": 646305,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 23
  },
  "qrels": {
    "count": 341,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 173,
          "1": 140,
          "3": 28
        }
      }
    }
  }
}
