Github: datasets/trec_mandarin.py

ir_datasets: TREC Mandarin

Index
  1. trec-mandarin
  2. trec-mandarin/trec5
  3. trec-mandarin/trec6

"trec-mandarin"

A collection of Mandarin news articles written in Simplified Chinese, used for multilingual evaluation in TREC 5 and TREC 6.

Document collection from LDC2000T52.

docs

Language: zh

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str
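
As a minimal sketch of working with these fields, the helper below previews the first few records from any iterator of TrecDoc-like objects. The `TrecDoc` stand-in, the `preview_docs` name, and the sample records are illustrative assumptions, not part of ir_datasets; with the real collection the iterator would be `dataset.docs_iter()`.

```python
from collections import namedtuple
from itertools import islice

# Stand-in mirroring the TrecDoc fields listed above (assumption for illustration).
TrecDoc = namedtuple("TrecDoc", ["doc_id", "text", "marked_up_doc"])


def preview_docs(docs, n=3, width=20):
    """Return (doc_id, text prefix) pairs for the first n documents."""
    # islice stops after n items, so the rest of the iterator is never consumed.
    return [(doc.doc_id, doc.text[:width]) for doc in islice(docs, n)]


# Made-up sample records; real documents would come from dataset.docs_iter().
sample = [
    TrecDoc("doc-1", "第一篇新闻报道的正文", "<DOC>...</DOC>"),
    TrecDoc("doc-2", "第二篇新闻报道的正文", "<DOC>...</DOC>"),
]
previews = preview_docs(iter(sample), n=1, width=5)
```

Slicing `text` works on characters rather than bytes, so the prefix never splits a Chinese character.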

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

Citation
bibtex: @misc{LDC2000T52, title={TREC Mandarin LDC2000T52}, author={Rogers, Willie}, year={2000}, url={https://catalog.ldc.upenn.edu/LDC2000T52}, publisher={Linguistic Data Consortium} }

"trec-mandarin/trec5"

Mandarin Chinese benchmark from TREC 5.

queries

Language: multiple/other/unknown

Query type:
TrecMandarinQuery: (namedtuple)
  1. query_id: str
  2. title_en: str
  3. title_zh: str
  4. description_en: str
  5. description_zh: str
  6. narrative_en: str
  7. narrative_zh: str
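
Because every topic part comes in an English and a Chinese variant, a retrieval run usually has to pick one language. The sketch below shows one way to do that, assuming the `<part>_<lang>` naming pattern of the fields above. The `TrecMandarinQuery` stand-in, the `query_text` helper, and the sample query are hypothetical and only serve the illustration.

```python
from collections import namedtuple

# Stand-in mirroring the TrecMandarinQuery fields listed above (assumption).
TrecMandarinQuery = namedtuple(
    "TrecMandarinQuery",
    ["query_id", "title_en", "title_zh", "description_en",
     "description_zh", "narrative_en", "narrative_zh"],
)


def query_text(query, lang="zh", parts=("title", "description")):
    """Join the requested parts of a query in one language ('en' or 'zh')."""
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang!r}")
    # Field names follow the <part>_<lang> pattern, e.g. title_zh.
    return " ".join(getattr(query, f"{part}_{lang}") for part in parts)


# Made-up sample query; real ones would come from dataset.queries_iter().
q = TrecMandarinQuery(
    "q1", "Sample title", "示例标题",
    "Sample description", "示例描述",
    "Sample narrative", "示例叙述",
)
zh_text = query_text(q)                          # title and description in Chinese
en_title = query_text(q, lang="en", parts=("title",))
```

Joining with a space is a choice made for this sketch; a Chinese tokenizer may not need any separator.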

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec5')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>

docs

Language: zh

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec5')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str
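
Evaluation code often needs the judgments keyed by query rather than as a flat stream. The following sketch groups TrecQrel-like records into a nested dictionary. The `TrecQrel` stand-in, the `qrels_to_dict` name, and the sample judgments are assumptions made for illustration; in practice the input would be `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the TrecQrel fields listed above (assumption).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_dict(qrels):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    # Convert to a plain dict so missing queries raise KeyError instead of
    # silently creating empty entries.
    return dict(by_query)


# Made-up sample judgments; real ones would come from dataset.qrels_iter().
sample = [
    TrecQrel("q1", "doc-1", 1, "0"),
    TrecQrel("q1", "doc-2", 0, "0"),
    TrecQrel("q2", "doc-1", 1, "0"),
]
judgments = qrels_to_dict(sample)
```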

Relevance levels

Rel.  Definition
0     not relevant
1     relevant

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec5')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation
bibtex: @inproceedings{harman1997chinese, title={Spanish and Chinese Document Retrieval in TREC-5}, author={Alan Smeaton and Ross Wilkinson}, booktitle={TREC}, year={1996} }

"trec-mandarin/trec6"

Mandarin Chinese benchmark from TREC 6.

queries

Language: multiple/other/unknown

Query type:
TrecMandarinQuery: (namedtuple)
  1. query_id: str
  2. title_en: str
  3. title_zh: str
  4. description_en: str
  5. description_zh: str
  6. narrative_en: str
  7. narrative_zh: str

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec6')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>

docs

Language: zh

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec6')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str
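
Since `relevance` takes only the values 0 and 1 in this benchmark, set-based measures apply directly. As a hedged sketch, the function below computes precision at k for one query from a ranked list of document IDs and that query's judgments. The `precision_at_k` name and the sample data are hypothetical, and treating unjudged documents as not relevant is an assumption of this example, not something the dataset prescribes.

```python
def precision_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """Fraction of the top-k ranked documents judged relevant (relevance >= 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    # Assumption: documents without a judgment count as not relevant (0).
    hits = sum(1 for doc_id in top_k if relevance_by_doc.get(doc_id, 0) >= 1)
    # Dividing by k (not len(top_k)) penalises rankings shorter than k.
    return hits / k


# Made-up judgments and ranking for a single query.
judged = {"doc-1": 1, "doc-2": 0, "doc-3": 1}
ranking = ["doc-3", "doc-9", "doc-1", "doc-2"]
p_at_2 = precision_at_k(ranking, judged, k=2)
p_at_4 = precision_at_k(ranking, judged, k=4)
```

Here `p_at_2` is 0.5 (one relevant document in the top two) and `p_at_4` is 0.5 (two relevant documents in the top four).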

Relevance levels

Rel.  Definition
0     not relevant
1     relevant

Example

import ir_datasets
dataset = ir_datasets.load('trec-mandarin/trec6')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation
bibtex: @inproceedings{wilkinson1998chinese, title={Chinese Document Retrieval at TREC-6}, author={Ross Wilkinson}, booktitle={TREC}, year={1997} }