ir_datasets: CLIRMatrix
CLIRMatrix is a massive collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval.
With 139 languages, there are 19,182 total language pairs. That is too many to list individually in the catalog, so dataset IDs are matched against the following patterns instead.
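The 19,182 figure is consistent with counting ordered (document language, query language) pairs of distinct languages, that is, 139 × 138. A quick check under that assumed interpretation (the placeholder language codes are made up and serve only the count):

```python
from itertools import permutations

NUM_LANGUAGES = 139  # number of languages stated in the text

# Placeholder codes; only how many there are matters here.
langs = [f"lang{i}" for i in range(NUM_LANGUAGES)]

# Ordered pairs (doc_lang, query_lang) with doc_lang != query_lang.
ordered_pairs = sum(1 for _ in permutations(langs, 2))

assert ordered_pairs == NUM_LANGUAGES * (NUM_LANGUAGES - 1)
print(ordered_pairs)  # 19182
```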
"clirmatrix/{lang}" (e.g., "clirmatrix/en"):
The document corpus for the given language. Documents are provided as GenericDocs.
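As a minimal sketch, assuming the `ir_datasets` Python package is installed, a corpus could be previewed as shown here. `corpus_id` and `preview_corpus` are hypothetical helper names, not part of the library, and the first access to a corpus may trigger a download.

```python
from itertools import islice


def corpus_id(lang: str) -> str:
    """Build the dataset ID for a single-language document corpus."""
    return f"clirmatrix/{lang}"


def preview_corpus(lang: str, limit: int = 3):
    """Return (doc_id, text prefix) for the first `limit` documents."""
    # Imported lazily so that defining this helper does not require the package.
    import ir_datasets

    dataset = ir_datasets.load(corpus_id(lang))
    # GenericDoc records expose `doc_id` and `text`.
    return [(doc.doc_id, doc.text[:80]) for doc in islice(dataset.docs_iter(), limit)]


print(corpus_id("en"))  # clirmatrix/en
```

Calling `preview_corpus("en")` would then iterate over the English corpus; it is left uncalled here because it needs the data to be available locally.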
"clirmatrix/{doc_lang}/{bi139-base|bi139-full}/{query_lang}/{train|dev|test1|test2}" (e.g., "clirmatrix/en/bi139-full/de/train"):
Documents are provided as GenericDocs, queries are provided as GenericQuerys, and qrels are provided as TrecQrels.
Supported languages are: af, als, am, an, ar, arz, ast, az, azb, ba, bar, be, bg, bn, bpy, br, bs, bug, ca, cdo, ce, ceb, ckb, cs, cv, cy, da, de, diq, el, eml, en, eo, es, et, eu, fa, fi, fo, fr, fy, ga, gd, gl, gu, he, hi, hr, hsb, ht, hu, hy, ia, id, ilo, io, is, it, ja, jv, ka, kk, kn, ko, ku, ky, la, lb, li, lmo, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, my, mzn, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pl, pms, pnb, ps, pt, qu, ro, ru, sa, sah, scn, sco, sd, sh, si, simple, sk, sl, sq, sr, su, sv, sw, szl, ta, te, tg, th, tl, tr, tt, uk, ur, uz, vec, vi, vo, wa, war, wuu, xmf, yi, yo, zh
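The bilingual pattern can be assembled programmatically. The sketch below uses hypothetical helpers (`bi139_id`, `load_pair`) and assumes the `ir_datasets` package is installed; it checks only the subset and split names, not the language codes.

```python
BI139_SUBSETS = ("bi139-base", "bi139-full")
SPLITS = ("train", "dev", "test1", "test2")


def bi139_id(doc_lang: str, subset: str, query_lang: str, split: str) -> str:
    """Build a bilingual dataset ID following the pattern above."""
    if subset not in BI139_SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Language codes are not validated here; see the supported list above.
    return f"clirmatrix/{doc_lang}/{subset}/{query_lang}/{split}"


def load_pair(doc_lang: str, subset: str, query_lang: str, split: str):
    """Load queries and qrels for one language pair (may download data)."""
    import ir_datasets  # lazy import: only needed when data is actually loaded

    dataset = ir_datasets.load(bi139_id(doc_lang, subset, query_lang, split))
    # GenericQuery exposes `query_id` and `text`.
    queries = {q.query_id: q.text for q in dataset.queries_iter()}
    # TrecQrel exposes `query_id`, `doc_id`, `relevance`, and `iteration`.
    qrels = [(r.query_id, r.doc_id, r.relevance) for r in dataset.qrels_iter()]
    return queries, qrels


print(bi139_id("en", "bi139-full", "de", "train"))
# clirmatrix/en/bi139-full/de/train
```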
"clirmatrix/{doc_lang}/multi8/{query_lang}/{train|dev|test1|test2}" (e.g., "clirmatrix/en/multi8/de/train"):
Documents are provided as GenericDocs, queries are provided as GenericQuerys, and qrels are provided as TrecQrels.
Supported languages are: ar, de, en, es, fr, ja, ru, zh
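To illustrate how a pattern like the multi8 one can be matched against a concrete ID, here is a small sketch using a regular expression. The regex and `parse_multi8_id` are assumptions made for illustration, not the library's own matching code. Whether identical document and query languages are accepted is not stated above, so the sketch does not check for it.

```python
import re

MULTI8_LANGS = ("ar", "de", "en", "es", "fr", "ja", "ru", "zh")

# Hypothetical regex mirroring the multi8 pattern in the text.
MULTI8_ID = re.compile(
    r"clirmatrix/(?P<doc_lang>[a-z]+)/multi8/(?P<query_lang>[a-z]+)/"
    r"(?P<split>train|dev|test1|test2)"
)


def parse_multi8_id(dataset_id: str):
    """Return the ID's components, or None if it does not fit the pattern."""
    match = MULTI8_ID.fullmatch(dataset_id)
    if match is None:
        return None
    if match["doc_lang"] not in MULTI8_LANGS or match["query_lang"] not in MULTI8_LANGS:
        return None
    return match.groupdict()


print(parse_multi8_id("clirmatrix/en/multi8/de/train"))
# {'doc_lang': 'en', 'query_lang': 'de', 'split': 'train'}
```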