GitHub: datasets/trec_mandarin.py

ir_datasets: TREC Mandarin

Index
  1. trec-mandarin
  2. trec-mandarin/trec5
  3. trec-mandarin/trec6

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2000T52.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: LDC2000T52.tgz.

ir_datasets expects this file to be copied/linked as ~/.ir_datasets/trec-mandarin/corpus.tgz.
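
The copy/link step can be scripted. The following is a minimal sketch, assuming a POSIX-style filesystem where symlinks are allowed; `link_corpus` and its `home` parameter are hypothetical helpers written for this illustration, not part of ir_datasets.

```python
from pathlib import Path


def link_corpus(source, home=None):
    """Symlink the downloaded LDC archive to the path ir_datasets expects.

    `source` is wherever LDC2000T52.tgz was saved (your choice).
    `home` defaults to ~/.ir_datasets, the directory named above.
    """
    source = Path(source).expanduser().resolve()
    if not source.is_file():
        raise FileNotFoundError(f"corpus archive not found: {source}")

    home = Path(home).expanduser() if home else Path.home() / ".ir_datasets"
    target = home / "trec-mandarin" / "corpus.tgz"
    target.parent.mkdir(parents=True, exist_ok=True)

    if target.is_symlink():
        # Drop a stale link so the function can be re-run safely.
        target.unlink()
    elif target.exists():
        # A real copy is already in place; leave it untouched.
        return target

    target.symlink_to(source)
    return target
```

For example, `link_corpus("~/Downloads/LDC2000T52.tgz")` would create the link, where the download location is only an illustrative guess.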


"trec-mandarin"

A collection of Mandarin news articles in Simplified Chinese, used for multilingual evaluation in TREC 5 and TREC 6.

Document collection from LDC2000T52.

docs | Citation | Metadata
165K docs

Language: zh

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python API | CLI | PyTerrier
import ir_datasets
dataset = ir_datasets.load("trec-mandarin")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
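
Because the collection holds roughly 165K documents, it can help to look at only the first few. The sketch below is one way to do that, assuming the records expose the `doc_id` and `text` fields listed above; `preview_docs` is a hypothetical helper, and the local `TrecDoc` stand-in and its sample values are made up so the code can be tried without the licensed corpus.

```python
from collections import namedtuple
from itertools import islice

# Local stand-in with the same fields as the TrecDoc type listed above.
TrecDoc = namedtuple("TrecDoc", ["doc_id", "text", "marked_up_doc"])


def preview_docs(docs, limit=3, width=40):
    """Return (doc_id, text snippet) pairs for the first `limit` documents."""
    previews = []
    for doc in islice(docs, limit):
        # Collapse whitespace so multi-line article text fits on one line.
        snippet = " ".join(doc.text.split())[:width]
        previews.append((doc.doc_id, snippet))
    return previews


if __name__ == "__main__":
    # Placeholder records, not real corpus content.
    sample = [
        TrecDoc("doc-1", "第一篇\n文章 的 内容", "<DOC>...</DOC>"),
        TrecDoc("doc-2", "第二篇文章", "<DOC>...</DOC>"),
    ]
    for doc_id, snippet in preview_docs(sample, limit=1):
        print(doc_id, snippet)
```

With the corpus in place, the same helper should accept the real iterator, for example `preview_docs(ir_datasets.load("trec-mandarin").docs_iter())`.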

More details about the Python API are available in the ir_datasets documentation.


"trec-mandarin/trec5"

Mandarin Chinese benchmark from TREC 5.

queries | docs | qrels | Citation | Metadata
28 queries

Language: multiple/other/unknown

Query type:
TrecMandarinQuery: (namedtuple)
  1. query_id: str
  2. title_en: str
  3. title_zh: str
  4. description_en: str
  5. description_zh: str
  6. narrative_en: str
  7. narrative_zh: str

Examples:

Python API | CLI | PyTerrier
import ir_datasets
dataset = ir_datasets.load("trec-mandarin/trec5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>
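
Each topic carries parallel English and Chinese fields, so a retrieval run has to pick one language. The sketch below shows one possible way to build the query string from the field names listed above; `query_text` is a hypothetical helper, and the local namedtuple and its sample values are stand-ins rather than real topics.

```python
from collections import namedtuple

# Local stand-in mirroring the TrecMandarinQuery fields listed above.
TrecMandarinQuery = namedtuple("TrecMandarinQuery", [
    "query_id",
    "title_en", "title_zh",
    "description_en", "description_zh",
    "narrative_en", "narrative_zh",
])


def query_text(query, lang="zh", parts=("title",)):
    """Join the chosen parts (title/description/narrative) in one language."""
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang!r}")
    # Field names follow the <part>_<lang> pattern, e.g. title_zh.
    return " ".join(getattr(query, f"{part}_{lang}") for part in parts)


if __name__ == "__main__":
    # Placeholder topic, not taken from the benchmark.
    topic = TrecMandarinQuery(
        "0", "sample title", "示例标题",
        "sample description", "示例描述",
        "sample narrative", "示例叙述",
    )
    print(query_text(topic))
    print(query_text(topic, lang="en", parts=("title", "description")))
```

Applied to the real data, this would be called as `query_text(query)` inside the `queries_iter()` loop shown above.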

More details about the Python API are available in the ir_datasets documentation.


"trec-mandarin/trec6"

Mandarin Chinese benchmark from TREC 6.

queries | docs | qrels | Citation | Metadata
26 queries

Language: multiple/other/unknown

Query type:
TrecMandarinQuery: (namedtuple)
  1. query_id: str
  2. title_en: str
  3. title_zh: str
  4. description_en: str
  5. description_zh: str
  6. narrative_en: str
  7. narrative_zh: str

Examples:

Python API | CLI | PyTerrier
import ir_datasets
dataset = ir_datasets.load("trec-mandarin/trec6")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>
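
This benchmark also lists qrels, although their record type is not shown in this section. The sketch below groups relevance judgments by topic, assuming each record exposes `query_id`, `doc_id` and `relevance` attributes as TREC-style qrels in ir_datasets generally do; `Qrel` is a local stand-in, `relevant_docs_by_query` is a hypothetical helper, and the sample judgments are made up.

```python
from collections import defaultdict, namedtuple

# Local stand-in; the field names are an assumption about the qrels records.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged relevant."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


if __name__ == "__main__":
    # Placeholder judgments, not taken from the benchmark.
    sample = [
        Qrel("q1", "doc-1", 1),
        Qrel("q1", "doc-2", 0),
        Qrel("q2", "doc-3", 1),
    ]
    print(relevant_docs_by_query(sample))
```

With the corpus in place, the helper should accept `dataset.qrels_iter()` in the same way.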

More details about the Python API are available in the ir_datasets documentation.