This documentation is for v0.4.1. See here for documentation of the latest version on PyPI.
ir_datasets: TREC Arabic
Index
- trec-arabic
- trec-arabic/ar2001
- trec-arabic/ar2002
Data Access Information
To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2001T55.
Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.
The source file is: arabic_newswire_a_LDC2001T55.tgz.
ir_datasets expects this file to be copied/linked as ~/.ir_datasets/trec-arabic/corpus.tgz.
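One way to put the archive in place is a symbolic link. The sketch below is only an illustration: `link_corpus` is a hypothetical helper (not part of ir_datasets), and the download location in the commented call is an assumption about where the LDC file was saved.

```python
from pathlib import Path


def link_corpus(source: Path, home: Path) -> Path:
    """Symlink the LDC archive to trec-arabic/corpus.tgz under `home`.

    Hypothetical helper, not an ir_datasets function. `home` is the
    ir_datasets directory, which this page gives as ~/.ir_datasets.
    """
    target = home / "trec-arabic" / "corpus.tgz"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing file or link untouched rather than overwrite it.
    if not (target.exists() or target.is_symlink()):
        target.symlink_to(source.resolve())
    return target


# Example call (the Downloads path is an assumption):
# link_corpus(
#     Path.home() / "Downloads" / "arabic_newswire_a_LDC2001T55.tgz",
#     Path.home() / ".ir_datasets",
# )
```

Copying the file instead of linking it works equally well; a link simply avoids keeping two copies of the archive.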
"trec-arabic"
A collection of news articles in Arabic, used for multi-lingual evaluation in TREC 2001 and TREC 2002.
Document collection from LDC2001T55.
docs | Citation
Language: ar
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
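As a small illustration of working with the document fields, the sketch below previews the first few documents of an iterator. The `TrecDoc` class here is a local stand-in that mirrors the fields listed above, and `preview_docs` is a hypothetical helper, not an ir_datasets function.

```python
from itertools import islice
from typing import Iterable, List, NamedTuple


class TrecDoc(NamedTuple):
    """Local stand-in mirroring the documented TrecDoc fields."""
    doc_id: str
    text: str
    marked_up_doc: str


def preview_docs(docs: Iterable[TrecDoc], n: int = 3, width: int = 60) -> List[str]:
    """Return one 'doc_id: snippet' line for each of the first n documents."""
    lines = []
    for doc in islice(docs, n):
        # Collapse runs of whitespace, then truncate to `width` characters.
        snippet = " ".join(doc.text.split())[:width]
        lines.append(f"{doc.doc_id}: {snippet}")
    return lines


# With the corpus in place, this could be called as:
# import ir_datasets
# for line in preview_docs(ir_datasets.load("trec-arabic").docs_iter()):
#     print(line)
```

Using `islice` stops after `n` documents, so the rest of the collection is never read.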
ir_datasets.bib:
\cite{Graff2001Arabic}
Bibtex:
@misc{Graff2001Arabic,
title={Arabic Newswire Part 1 LDC2001T55},
author={Graff, David and Walker, Kevin},
year={2001},
url={https://catalog.ldc.upenn.edu/LDC2001T55},
publisher={Linguistic Data Consortium}
}
"trec-arabic/ar2001"
Arabic benchmark from TREC 2001.
queries | docs | qrels | Citation
Language: ar
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2001")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2001 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
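Retrieval experiments on TREC topics commonly use either the title alone or the title combined with the description. The sketch below shows one way to assemble a query string from chosen fields. `TrecQuery` is a local stand-in mirroring the fields listed above, and `build_query_text` is a hypothetical helper, not an ir_datasets function.

```python
from typing import NamedTuple, Sequence


class TrecQuery(NamedTuple):
    """Local stand-in mirroring the documented TrecQuery fields."""
    query_id: str
    title: str
    description: str
    narrative: str


def build_query_text(query: TrecQuery, fields: Sequence[str] = ("title",)) -> str:
    """Join the selected fields of a query into one string, skipping empty ones."""
    parts = [getattr(query, name).strip() for name in fields]
    return " ".join(part for part in parts if part)


# With the dataset available, this could be used as:
# import ir_datasets
# for query in ir_datasets.load("trec-arabic/ar2001").queries_iter():
#     text = build_query_text(query, fields=("title", "description"))
```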
Language: ar
Note: Uses docs from trec-arabic
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2001")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2001 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2001")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2001 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
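Because the judgments are binary (0 or 1), they map directly onto simple set-based measures. The sketch below groups judgments by query and computes precision at k for a ranked list. It is an illustration only, not the official track measure: `TrecQrel` is a local stand-in mirroring the fields listed above, `group_qrels` and `precision_at_k` are hypothetical helpers, and treating unjudged documents as not relevant is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Sequence


class TrecQrel(NamedTuple):
    """Local stand-in mirroring the documented TrecQrel fields."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def group_qrels(qrels: Iterable[TrecQrel]) -> Dict[str, Dict[str, int]]:
    """Map query_id -> {doc_id: relevance}."""
    grouped: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


def precision_at_k(ranking: Sequence[str], judged: Dict[str, int], k: int) -> float:
    """Fraction of the top-k ranked doc_ids judged relevant (relevance >= 1).

    Unjudged documents count as not relevant (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranking[:k] if judged.get(doc_id, 0) >= 1)
    return hits / k


# With the dataset available, the judgments could be grouped as:
# import ir_datasets
# judged = group_qrels(ir_datasets.load("trec-arabic/ar2001").qrels_iter())
```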
ir_datasets.bib:
\cite{Gey2001Arabic,Graff2001Arabic}
Bibtex:
@inproceedings{Gey2001Arabic,
title={The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic using English, French or Arabic Queries},
author={Fredric Gey and Douglas Oard},
booktitle={TREC},
year={2001}
}
@misc{Graff2001Arabic,
title={Arabic Newswire Part 1 LDC2001T55},
author={Graff, David and Walker, Kevin},
year={2001},
url={https://catalog.ldc.upenn.edu/LDC2001T55},
publisher={Linguistic Data Consortium}
}
"trec-arabic/ar2002"
Arabic benchmark from TREC 2002.
queries | docs | qrels | Citation
Language: ar
Query type:
TrecQuery: (namedtuple)
- query_id: str
- title: str
- description: str
- narrative: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2002")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2002 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
Language: ar
Note: Uses docs from trec-arabic
Document type:
TrecDoc: (namedtuple)
- doc_id: str
- text: str
- marked_up_doc: str
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2002")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2002 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
Query relevance judgment type:
TrecQrel: (namedtuple)
- query_id: str
- doc_id: str
- relevance: int
- iteration: str
Relevance levels
Rel. | Definition |
0 | not relevant |
1 | relevant |
Examples:
Python API:
import ir_datasets
dataset = ir_datasets.load("trec-arabic/ar2002")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
CLI:
ir_datasets export trec-arabic/ar2002 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
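The column order shown for the export above (query_id, doc_id, relevance, iteration) differs from the classic TREC qrels file layout, which is query_id, iteration, doc_id, relevance, separated by whitespace. The sketch below writes judgments in that classic order. `TrecQrel` is a local stand-in mirroring the fields listed above, and `write_trec_qrels` is a hypothetical helper, not an ir_datasets function.

```python
import io
from typing import Iterable, NamedTuple, TextIO


class TrecQrel(NamedTuple):
    """Local stand-in mirroring the documented TrecQrel fields."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def write_trec_qrels(qrels: Iterable[TrecQrel], out: TextIO) -> int:
    """Write 'query_id iteration doc_id relevance' lines; return the line count."""
    count = 0
    for qrel in qrels:
        out.write(f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}\n")
        count += 1
    return count


# Small in-memory demonstration with made-up identifiers:
buffer = io.StringIO()
write_trec_qrels([TrecQrel("26", "DOC-A", 1, "0")], buffer)

# With the dataset available, a file could be written as:
# import ir_datasets
# with open("ar2002.qrels", "w", encoding="utf-8") as handle:
#     write_trec_qrels(ir_datasets.load("trec-arabic/ar2002").qrels_iter(), handle)
```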
ir_datasets.bib:
\cite{Gey2002Arabic,Graff2001Arabic}
Bibtex:
@inproceedings{Gey2002Arabic,
title={The TREC-2002 Arabic/English CLIR Track},
author={Fredric Gey and Douglas Oard},
booktitle={TREC},
year={2002}
}
@misc{Graff2001Arabic,
title={Arabic Newswire Part 1 LDC2001T55},
author={Graff, David and Walker, Kevin},
year={2001},
url={https://catalog.ldc.upenn.edu/LDC2001T55},
publisher={Linguistic Data Consortium}
}