Github: datasets/clirmatrix.py

ir_datasets: CLIRMatrix

Index
  1. clirmatrix

"clirmatrix"

CLIRMatrix is a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval.

With 139 languages, there are 19,182 language pairs in total. That is too many to list individually in the catalog, so dataset IDs are instead matched against the patterns below.
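The pair count can be checked with a line of arithmetic. The sketch below assumes that a "language pair" means an ordered (document language, query language) pair of two distinct languages; that reading is an assumption, though it reproduces the stated total.

```python
NUM_LANGUAGES = 139  # number of languages stated above

# Assumption: pairs are ordered and the two languages differ,
# so each of the 139 languages pairs with the other 138.
total_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

print(total_pairs)  # 19182
```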

"clirmatrix/{lang}" (e.g., "clirmatrix/en"):

The document corpus for the given language. Documents are provided as GenericDocs.
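As a minimal sketch, a corpus could be read through the ir_datasets Python API roughly as follows. The helper names `corpus_id` and `peek_corpus` are hypothetical and exist only for illustration; the sketch also assumes the ir_datasets package is installed and that the corpus can be downloaded on first use.

```python
def corpus_id(lang):
    """Build the dataset ID for one language's document corpus."""
    return f"clirmatrix/{lang}"


def peek_corpus(lang, limit=3):
    """Return the first `limit` (doc_id, text) pairs of a corpus.

    ir_datasets is imported lazily so that this module can be imported
    even where the package is not installed.
    """
    import ir_datasets

    dataset = ir_datasets.load(corpus_id(lang))
    docs = []
    for doc in dataset.docs_iter():  # each doc has doc_id and text fields
        if len(docs) >= limit:
            break
        docs.append((doc.doc_id, doc.text))
    return docs
```

Calling `peek_corpus("en")` would return up to three documents from the English corpus; the first call may take a while because the data has to be fetched.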

"clirmatrix/{doc_lang}/{bi139-base|bi139-full}/{query_lang}/{train|dev|test1|test2}" (e.g., "clirmatrix/en/bi139-full/de/train"):

Documents are provided as GenericDocs, queries are provided as GenericQuerys, and qrels are provided as TrecQrels.

Supported languages are: af, als, am, an, ar, arz, ast, az, azb, ba, bar, be, bg, bn, bpy, br, bs, bug, ca, cdo, ce, ceb, ckb, cs, cv, cy, da, de, diq, el, eml, en, eo, es, et, eu, fa, fi, fo, fr, fy, ga, gd, gl, gu, he, hi, hr, hsb, ht, hu, hy, ia, id, ilo, io, is, it, ja, jv, ka, kk, kn, ko, ku, ky, la, lb, li, lmo, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, my, mzn, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pl, pms, pnb, ps, pt, qu, ro, ru, sa, sah, scn, sco, sd, sh, si, simple, sk, sl, sq, sr, su, sv, sw, szl, ta, te, tg, th, tl, tr, tt, uk, ur, uz, vec, vi, vo, wa, war, wuu, xmf, yi, yo, zh
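To illustrate how a pattern like the one above can stand in for thousands of catalog entries, the following sketch parses and validates a bi139 dataset ID with a regular expression. It is not the library's own matching code; the names `BI139_LANGS` and `parse_bi139_id` are hypothetical, and the language set is copied from the list above.

```python
import re

# The 139 language codes listed above for the bi139 subsets.
BI139_LANGS = frozenset("""
af als am an ar arz ast az azb ba bar be bg bn bpy br bs bug ca cdo
ce ceb ckb cs cv cy da de diq el eml en eo es et eu fa fi fo fr
fy ga gd gl gu he hi hr hsb ht hu hy ia id ilo io is it ja jv
ka kk kn ko ku ky la lb li lmo lt lv mai mg mhr min mk ml mn mr
mrj ms my mzn nap nds ne new nl nn no oc or os pa pl pms pnb ps pt
qu ro ru sa sah scn sco sd sh si simple sk sl sq sr su sv sw szl ta
te tg th tl tr tt uk ur uz vec vi vo wa war wuu xmf yi yo zh
""".split())

# Mirrors the documented ID layout:
# clirmatrix/{doc_lang}/{bi139-base|bi139-full}/{query_lang}/{split}
_BI139_ID = re.compile(
    r"clirmatrix/(?P<doc_lang>[a-z]+)/(?P<subset>bi139-base|bi139-full)"
    r"/(?P<query_lang>[a-z]+)/(?P<split>train|dev|test1|test2)"
)


def parse_bi139_id(dataset_id):
    """Split a bi139 dataset ID into its parts, or raise ValueError."""
    match = _BI139_ID.fullmatch(dataset_id)
    if match is None:
        raise ValueError(f"not a bi139 dataset ID: {dataset_id!r}")
    parts = match.groupdict()
    for key in ("doc_lang", "query_lang"):
        if parts[key] not in BI139_LANGS:
            raise ValueError(f"unsupported {key}: {parts[key]!r}")
    return parts
```

For example, `parse_bi139_id("clirmatrix/en/bi139-full/de/train")` yields the document language `en`, the subset `bi139-full`, the query language `de`, and the split `train`.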

"clirmatrix/{doc_lang}/multi8/{query_lang}/{train|dev|test1|test2}" (e.g., "clirmatrix/en/multi8/de/train"):

Documents are provided as GenericDocs, queries are provided as GenericQuerys, and qrels are provided as TrecQrels.

Supported languages are: ar, de, en, es, fr, ja, ru, zh
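A minimal sketch of reading the queries and qrels of one multi8 split is shown below. The helpers `group_qrels` and `load_multi8_split` are hypothetical names; the sketch assumes ir_datasets is installed, that queries expose `query_id` and `text`, and that qrels expose `query_id`, `doc_id`, and `relevance`.

```python
from collections import defaultdict


def group_qrels(qrels):
    """Group relevance judgments as {query_id: {doc_id: relevance}}."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


def load_multi8_split(doc_lang="en", query_lang="de", split="train"):
    """Load the queries and grouped qrels for one multi8 split.

    ir_datasets is imported lazily; the first call may download data.
    """
    import ir_datasets

    dataset = ir_datasets.load(
        f"clirmatrix/{doc_lang}/multi8/{query_lang}/{split}"
    )
    queries = {q.query_id: q.text for q in dataset.queries_iter()}
    return queries, group_qrels(dataset.qrels_iter())
```

Grouping the judgments by query makes it straightforward to look up which documents are relevant to a given query when evaluating a retrieval run.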

Citation

ir_datasets.bib:

\cite{Sun2020Clirmatrix}

Bibtex:

@inproceedings{Sun2020Clirmatrix,
  title = "{CLIRM}atrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval",
  author = "Sun, Shuo and Duh, Kevin",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.emnlp-main.340",
  doi = "10.18653/v1/2020.emnlp-main.340",
  pages = "4160--4170"
}