This documentation is for v0.5.0. See here for documentation of the latest version on PyPI.
ir_datasets: TREC Mandarin
Index
- trec-mandarin
- trec-mandarin/trec5
- trec-mandarin/trec6
Data Access Information
To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2000T52.
Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.
The source file is: LDC2000T52.tgz.
ir_datasets expects this file to be copied/linked as ~/.ir_datasets/trec-mandarin/corpus.tgz.
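The linking step can be scripted. The following is a minimal sketch, assuming the archive has already been downloaded from the LDC; the helper name link_corpus and its home parameter are hypothetical and not part of ir_datasets. It only creates the directory and a symbolic link at the path given above.

```python
from pathlib import Path


def link_corpus(source, home=None):
    """Symlink a downloaded LDC2000T52.tgz to the path ir_datasets expects.

    `link_corpus` and `home` are illustrative names, not an ir_datasets API.
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".ir_datasets" / "trec-mandarin" / "corpus.tgz"
    # Create ~/.ir_datasets/trec-mandarin/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave any existing file or link untouched.
    if not target.exists() and not target.is_symlink():
        target.symlink_to(Path(source).resolve())
    return target


# Usage (the source path is a placeholder for wherever the archive was saved):
# link_corpus("/path/to/LDC2000T52.tgz")
```

A symbolic link avoids duplicating the archive on disk; copying the file to the same location should work equally well.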
"trec-mandarin"
A collection of Mandarin news articles written in Simplified Chinese, used for multilingual evaluation in TREC 5 and TREC 6.
Document collection from LDC2000T52.
165K docs
Language: zh
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin")
    for doc in dataset.docs_iter():
        doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin docs
    [doc_id] [text] [marked_up_doc]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
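Since the collection holds roughly 165K documents, it can help to look at only the first few while exploring. The sketch below is one way to do that; preview_docs is a hypothetical helper, and it relies only on the doc_id and text fields of the TrecDoc type listed above.

```python
from itertools import islice


def preview_docs(docs, n=3, width=40):
    """Return (doc_id, snippet) pairs for the first n documents.

    `docs` is any iterable of objects with `doc_id` and `text` attributes,
    such as the iterator returned by dataset.docs_iter().
    """
    return [(doc.doc_id, doc.text[:width]) for doc in islice(docs, n)]


# Usage (requires the corpus to be in place):
# import ir_datasets
# dataset = ir_datasets.load("trec-mandarin")
# for doc_id, snippet in preview_docs(dataset.docs_iter()):
#     print(doc_id, snippet)
```

Using islice stops the iteration early, so the rest of the archive is not read.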
ir_datasets.bib:
\cite{Rogers2000Mandarin}
Bibtex:
@misc{Rogers2000Mandarin,
title={TREC Mandarin LDC2000T52},
author={Rogers, Willie},
year={2000},
url={https://catalog.ldc.upenn.edu/LDC2000T52},
publisher={Linguistic Data Consortium}
}
Metadata:
{
  "docs": {
    "count": 164789,
    "fields": {
      "doc_id": {
        "max_len": 22,
        "common_prefix": ""
      }
    }
  }
}
"trec-mandarin/trec5"
Mandarin Chinese benchmark from TREC 5.
28 queries
Language: multiple/other/unknown
Query type:
TrecMandarinQuery: (namedtuple)
- query_id: str
- title_en: str
- title_zh: str
- description_en: str
- description_zh: str
- narrative_en: str
- narrative_zh: str
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec5")
    for query in dataset.queries_iter():
        query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec5 queries
    [query_id] [title_en] [title_zh] [description_en] [description_zh] [narrative_en] [narrative_zh]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
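Each TrecMandarinQuery carries its title, description, and narrative in both English and Chinese, so a retrieval run usually has to pick one language and a subset of parts. The following is a minimal sketch of that selection; query_text is a hypothetical helper, and joining the parts with a space is an assumption made here, not something ir_datasets prescribes.

```python
def query_text(query, lang="zh", parts=("title", "description")):
    """Join the requested parts of a query in a single language.

    Field names follow the TrecMandarinQuery layout: <part>_<lang>,
    with part in {title, description, narrative} and lang in {en, zh}.
    """
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang!r}")
    return " ".join(getattr(query, f"{part}_{lang}") for part in parts)


# Usage (requires the corpus to be in place):
# import ir_datasets
# dataset = ir_datasets.load("trec-mandarin/trec5")
# for query in dataset.queries_iter():
#     print(query.query_id, query_text(query, lang="zh", parts=("title",)))
```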
165K docs
Inherits docs from trec-mandarin
Language: zh
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec5")
    for doc in dataset.docs_iter():
        doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec5 docs
    [doc_id] [text] [marked_up_doc]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
16K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 13K | 86.0% |
1 | relevant | 2.2K | 14.0% |
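The percentages in the table follow from the exact per-level counts recorded in this dataset's metadata (13406 and 2182). As a small worked check, with relevance_shares being a hypothetical helper name:

```python
def relevance_shares(counts):
    """Return each relevance level's share of all judgments, in percent."""
    total = sum(counts.values())
    return {rel: round(100 * n / total, 1) for rel, n in counts.items()}


# Exact counts for trec-mandarin/trec5, taken from its metadata.
trec5_counts = {0: 13406, 1: 2182}
print(sum(trec5_counts.values()))      # 15588 judgments in total
print(relevance_shares(trec5_counts))  # {0: 86.0, 1: 14.0}
```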
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec5")
    for qrel in dataset.qrels_iter():
        qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec5 qrels --format tsv
    [query_id] [doc_id] [relevance] [iteration]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
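For evaluation, the flat stream of TrecQrel records is often easier to use once grouped by query. The sketch below builds such a lookup by hand from the qrels iterator; group_qrels and relevant_docs are hypothetical helper names, and treating any relevance above 0 as relevant matches the two levels defined above.

```python
from collections import defaultdict


def group_qrels(qrels):
    """Map query_id -> {doc_id: relevance} from an iterable of qrels."""
    judged = defaultdict(dict)
    for qrel in qrels:
        judged[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(judged)


def relevant_docs(judged, query_id):
    """Return the sorted doc_ids judged relevant (relevance > 0) for a query."""
    return sorted(
        doc_id for doc_id, rel in judged.get(query_id, {}).items() if rel > 0
    )


# Usage (requires the corpus to be in place):
# import ir_datasets
# dataset = ir_datasets.load("trec-mandarin/trec5")
# judged = group_qrels(dataset.qrels_iter())
# for query_id in judged:
#     print(query_id, len(relevant_docs(judged, query_id)))
```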
ir_datasets.bib:
\cite{Harman1997Chinese,Rogers2000Mandarin}
Bibtex:
@inproceedings{Harman1997Chinese,
title={Spanish and Chinese Document Retrieval in TREC-5},
author={Alan Smeaton and Ross Wilkinson},
booktitle={TREC},
year={1996}
}
@misc{Rogers2000Mandarin,
title={TREC Mandarin LDC2000T52},
author={Rogers, Willie},
year={2000},
url={https://catalog.ldc.upenn.edu/LDC2000T52},
publisher={Linguistic Data Consortium}
}
Metadata:
{
  "docs": {
    "count": 164789,
    "fields": {
      "doc_id": {
        "max_len": 22,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 28
  },
  "qrels": {
    "count": 15588,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 13406,
          "1": 2182
        }
      }
    }
  }
}
"trec-mandarin/trec6"
Mandarin Chinese benchmark from TREC 6.
26 queries
Language: multiple/other/unknown
Query type:
TrecMandarinQuery: (namedtuple)
- query_id: str
- title_en: str
- title_zh: str
- description_en: str
- description_zh: str
- narrative_en: str
- narrative_zh: str
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec6")
    for query in dataset.queries_iter():
        query # namedtuple<query_id, title_en, title_zh, description_en, description_zh, narrative_en, narrative_zh>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec6 queries
    [query_id] [title_en] [title_zh] [description_en] [description_zh] [narrative_en] [narrative_zh]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
165K docs
Inherits docs from trec-mandarin
Language: zh
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec6")
    for doc in dataset.docs_iter():
        doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec6 docs
    [doc_id] [text] [marked_up_doc]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
9.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition | Count | % |
0 | not relevant | 6.3K | 68.0% |
1 | relevant | 3.0K | 32.0% |
Examples:
Python API:
    import ir_datasets
    dataset = ir_datasets.load("trec-mandarin/trec6")
    for qrel in dataset.qrels_iter():
        qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
    ir_datasets export trec-mandarin/trec6 qrels --format tsv
    [query_id] [doc_id] [relevance] [iteration]
    ...
You can find more details about the CLI here.
No example available for PyTerrier
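Evaluation tools in the TREC tradition usually expect qrels as whitespace-separated lines in the order query, iteration, document, relevance, which differs from the field order of TrecQrel. The following is a minimal sketch of that conversion; to_trec_qrels_line and write_trec_qrels are hypothetical helpers, not part of ir_datasets.

```python
def to_trec_qrels_line(qrel):
    """Format one qrel as 'query_id iteration doc_id relevance'."""
    return f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}"


def write_trec_qrels(qrels, out):
    """Write qrels to a text file object, one line each; return the count."""
    written = 0
    for qrel in qrels:
        out.write(to_trec_qrels_line(qrel) + "\n")
        written += 1
    return written


# Usage (requires the corpus to be in place; the output file name is arbitrary):
# import ir_datasets
# dataset = ir_datasets.load("trec-mandarin/trec6")
# with open("trec6.qrels", "w", encoding="utf-8") as out:
#     write_trec_qrels(dataset.qrels_iter(), out)
```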
ir_datasets.bib:
\cite{Wilkinson1998Chinese,Rogers2000Mandarin}
Bibtex:
@inproceedings{Wilkinson1998Chinese,
title={Chinese Document Retrieval at TREC-6},
author={Ross Wilkinson},
booktitle={TREC},
year={1997}
}
@misc{Rogers2000Mandarin,
title={TREC Mandarin LDC2000T52},
author={Rogers, Willie},
year={2000},
url={https://catalog.ldc.upenn.edu/LDC2000T52},
publisher={Linguistic Data Consortium}
}
Metadata:
{
  "docs": {
    "count": 164789,
    "fields": {
      "doc_id": {
        "max_len": 22,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 26
  },
  "qrels": {
    "count": 9236,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2958,
          "0": 6278
        }
      }
    }
  }
}