Github: datasets/neuclir.py

ir_datasets: NeuCLIR Corpus

Index
  1. neuclir
  2. neuclir/1
  3. neuclir/1/fa
  4. neuclir/1/fa/hc4-filtered
  5. neuclir/1/ru
  6. neuclir/1/ru/hc4-filtered
  7. neuclir/1/zh
  8. neuclir/1/zh/hc4-filtered

Data Access Information

To access the documents of this dataset, you will need to download them from Common Crawl. The scripts for downloading and validating the documents are in NeuCLIR/download-collection. Use the following commands to download the documents:

git clone https://github.com/NeuCLIR/download-collection
cd download-collection
pip install -r requirements.txt
python download_documents.py --storage ~/.ir_datasets/neuclir/1 \
--zho ./resource/zho/ids.jsonl.gz \
--fas ./resource/fas/ids.jsonl.gz \
--rus ./resource/rus/ids.*.jsonl.gz \
--jobs {number of processes}

After the download completes, post-process the downloaded files to verify that all, and only, the specified documents were downloaded, and to reorder the collection so that it matches the ordering specified in the id files.

for lang in zho fas rus; do
  python fix_document_order.py --raw_download_file ~/.ir_datasets/neuclir/1/$lang/docs.jsonl \
   --id_file ./resource/$lang/ids*.jsonl.gz \
   --check_hash
done
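The check described above (all and only the specified documents, in the order of the id files) can be sketched in a few lines of Python. This is a simplified illustration, not the logic of fix_document_order.py itself: the function name check_and_reorder is hypothetical, documents are assumed to be dicts with an "id" key, and hash checking is left out.

```python
def check_and_reorder(expected_ids, docs):
    """Return docs sorted into the order of expected_ids.

    Hypothetical helper: assumes each doc is a dict with an "id" key.
    Raises ValueError if a document is missing, unexpected, or duplicated.
    """
    by_id = {}
    for doc in docs:
        if doc["id"] in by_id:
            raise ValueError(f"duplicate document: {doc['id']}")
        by_id[doc["id"]] = doc

    expected = set(expected_ids)
    downloaded = set(by_id)
    missing = expected - downloaded
    unexpected = downloaded - expected
    if missing or unexpected:
        raise ValueError(
            f"{len(missing)} missing and {len(unexpected)} unexpected documents"
        )
    # Emit documents in the order given by the id file.
    return [by_id[doc_id] for doc_id in expected_ids]
```

For a collection of this size, the provided script remains the right tool; the sketch only makes the two conditions (set equality and ordering) explicit.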

You can also store the documents in another directory and create a symbolic link to it at ~/.ir_datasets/neuclir/1/.
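If you prefer to create the link from Python rather than the shell, the following is a minimal sketch using only the standard library. The function name link_storage and both path arguments are illustrative assumptions; pass the directory where the documents actually live and the location that ir_datasets is expected to read from.

```python
from pathlib import Path


def link_storage(actual_dir, expected_dir):
    """Create expected_dir as a symbolic link pointing at actual_dir.

    Hypothetical helper. Raises FileExistsError if expected_dir already exists.
    """
    actual = Path(actual_dir).expanduser().resolve()
    link = Path(expected_dir).expanduser()
    if not actual.is_dir():
        raise NotADirectoryError(str(actual))
    # Make sure the parent of the link location exists.
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(actual, target_is_directory=True)
    return link
```

The second argument would be the storage path given to the download command, and the first any directory with enough free space.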


"neuclir"

This is the dataset created for the TREC 2022 NeuCLIR Track. NIST will develop and release the topics by June 2022. Relevance judgements will be available after the evaluation (around November).

The collection is designed to be similar to [HC4], and a large portion of the HC4 documents are carried over into it. For development, users can run experiments on this collection with the HC4 queries and qrels.

  • Documents: Web pages from Common Crawl in Chinese, Persian, and Russian.
  • Queries: (To be released) English TREC-style title/description queries. The narrative field contains an example passage for each relevance level. Human and machine translations of the titles and descriptions into the target language (i.e., the document language) are provided in the query object.
  • Qrels: (To be released) Documents are judged in three levels of relevance. Please refer to the dataset paper for the full definition of the levels.
  • See also: hc4
  • NeuCLIR Track Website
  • Collection Repository

"neuclir/1"

Version 1 of the NeuCLIR corpus.


"neuclir/1/fa"

The Persian collection contains English queries (to be released) and Persian documents for retrieval. Human- and machine-translated queries will be provided in the query object. They support monolingual retrieval as well as cross-language retrieval that assumes a machine translation of the query into Persian is available.

2.2M docs

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>
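A common next step is turning each document into a record for an indexer. The sketch below is one possible way to do that, not part of ir_datasets: ExampleDoc is a stand-in that mirrors the six fields listed above, and to_index_record and its output keys ("id", "contents") are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Stand-in with the same fields as the document type listed above."""
    doc_id: str
    title: str
    text: str
    url: str
    time: str
    cc_file: str


def to_index_record(doc):
    """Join title and text into one field, skipping whichever is empty."""
    contents = "\n".join(part for part in (doc.title, doc.text) if part)
    return {"id": doc.doc_id, "contents": contents}
```

Because the function only reads the doc_id, title, and text attributes, it should work unchanged on the objects yielded by dataset.docs_iter() once the corpus is available locally.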

You can find more details about the Python API here.


"neuclir/1/fa/hc4-filtered"

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

60 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
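The query object carries three versions of each title and description: the English original, a human translation (ht_*), and a machine translation (mt_*). The sketch below shows one way to pick the text for a given retrieval setup. ExampleQuery is a stand-in holding only the relevant fields, and query_text with its variant names ("english", "human", "machine") is a hypothetical helper, not an ir_datasets function.

```python
from typing import NamedTuple


class ExampleQuery(NamedTuple):
    """Stand-in holding a subset of the query fields listed above."""
    query_id: str
    title: str
    description: str
    ht_title: str
    ht_description: str
    mt_title: str
    mt_description: str


# Maps a variant name (illustrative) to its (title, description) field names.
VARIANT_FIELDS = {
    "english": ("title", "description"),
    "human": ("ht_title", "ht_description"),
    "machine": ("mt_title", "mt_description"),
}


def query_text(query, variant="english", use_description=False):
    """Return the title, optionally followed by the description."""
    title_field, description_field = VARIANT_FIELDS[variant]
    parts = [getattr(query, title_field)]
    if use_description:
        parts.append(getattr(query, description_field))
    return " ".join(part for part in parts if part)
```

Under these assumptions, the "english" variant corresponds to a cross-language run against the documents, while "human" and "machine" correspond to monolingual runs in the document language.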

You can find more details about the Python API here.


"neuclir/1/ru"

The Russian collection contains English queries (to be released) and Russian documents for retrieval. Human- and machine-translated queries will be provided in the query object. They support monolingual retrieval as well as cross-language retrieval that assumes a machine translation of the query into Russian is available.

4.6M docs

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.


"neuclir/1/ru/hc4-filtered"

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

54 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.


"neuclir/1/zh"

The Chinese collection contains English queries (to be released) and Chinese documents for retrieval. Human- and machine-translated queries will be provided in the query object. They support monolingual retrieval as well as cross-language retrieval that assumes a machine translation of the query into Chinese is available.

3.2M docs

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.


"neuclir/1/zh/hc4-filtered"

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

60 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>
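Since documents are judged on graded relevance levels, it is often convenient to group judgments by query and to count how many fall at each level. The sketch below assumes each judgment exposes query_id, doc_id, and relevance attributes; this page does not list the qrels type, so ExampleQrel is a stand-in and both helper functions are hypothetical.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class ExampleQrel(NamedTuple):
    """Stand-in judgment record; the field names are an assumption."""
    query_id: str
    doc_id: str
    relevance: int


def group_qrels(qrels):
    """Build a {query_id: {doc_id: relevance}} lookup table."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


def relevance_histogram(qrels):
    """Count how many judgments fall at each relevance level."""
    return Counter(qrel.relevance for qrel in qrels)
```

If the assumption about field names holds, the records yielded by dataset.qrels_iter() could be passed to either function directly. The nested-dict layout is a common input format for evaluation tools.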

You can find more details about the Python API here.