ir_datasets: TREC Spanish
To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium. The specific resource needed is LDC2000T51.
Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.
The source file is: LDC2000T51.tgz.
ir_datasets expects this file to be copied/linked as ~/.ir_datasets/trec-spanish/corpus.tgz.
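The linking step can be sketched in Python as follows. This is a minimal sketch, not part of ir_datasets: the helper name `link_corpus` and its `home` parameter are hypothetical, and it assumes the downloaded archive is on the local filesystem and that symlinks are supported.

```python
from pathlib import Path


def link_corpus(source: Path, home: Path = Path.home()) -> Path:
    """Symlink the LDC archive to <home>/.ir_datasets/trec-spanish/corpus.tgz.

    `link_corpus` and `home` are illustrative names, not an ir_datasets API.
    """
    target = home / ".ir_datasets" / "trec-spanish" / "corpus.tgz"
    # Create ~/.ir_datasets/trec-spanish/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing file or link untouched rather than overwriting it.
    if not (target.exists() or target.is_symlink()):
        target.symlink_to(source.resolve())
    return target
```

Calling `link_corpus(Path("LDC2000T51.tgz"))` from the download directory would then place the link under the current user's home directory.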
A collection of news articles in Spanish, used for multi-lingual evaluation in TREC 3 and TREC 4.
Document collection from LDC2000T51.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-spanish docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-spanish')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}
Metadata:
{ "docs": { "count": 120605, "fields": { "doc_id": { "max_len": 13, "common_prefix": "" } } } }
Spanish benchmark from TREC 3.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_es, title_en, description_es, description_en, narrative_es, narrative_en>
You can find more details about the Python API here.
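Because each TREC 3 query carries parallel Spanish and English fields, a common first step is to keep only one language. The sketch below collects the Spanish titles into a dictionary; `spanish_titles` is a hypothetical helper, and `FakeQuery` is a stand-in namedtuple (with placeholder values) so the snippet runs without the licensed corpus.

```python
from collections import namedtuple
from typing import Dict, Iterable


def spanish_titles(queries: Iterable) -> Dict[str, str]:
    """Map each query_id to its Spanish title (the title_es field)."""
    return {q.query_id: q.title_es for q in queries}


# Stand-in for the namedtuples yielded by queries_iter(); values are placeholders.
FakeQuery = namedtuple("FakeQuery", ["query_id", "title_es", "title_en"])
demo = spanish_titles([FakeQuery("q1", "titulo uno", "title one")])
print(demo)
```

With the real data, the same helper would be called as `spanish_titles(ir_datasets.load("trec-spanish/trec3").queries_iter())`.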
ir_datasets export trec-spanish/trec3 queries
[query_id] [title_es] [title_en] [description_es] [description_en] [narrative_es] [narrative_en]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-spanish.trec3.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from trec-spanish
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-spanish/trec3 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-spanish.trec3')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 14K | 74.9% |
1 | relevant | 4.8K | 25.1% |
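The percentages in this table follow from the exact per-level counts reported in this page's metadata for TREC 3 (4766 relevant and 14239 not relevant judgments, 19005 in total). A short sketch of that arithmetic, where `relevance_shares` is a hypothetical helper name:

```python
def relevance_shares(counts):
    """Return each relevance level's share of all judgments, in percent."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


# Exact counts from the TREC 3 qrels metadata on this page.
trec3_counts = {"1": 4766, "0": 14239}
print(relevance_shares(trec3_counts))
```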
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
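For evaluation it is often convenient to look up judgments by query rather than stream them. The sketch below groups qrel records into a nested dictionary; `group_qrels` is a hypothetical helper, and `Qrel` is a stand-in namedtuple with placeholder identifiers so the snippet runs without the corpus.

```python
from collections import defaultdict, namedtuple


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


# Stand-in for the namedtuples yielded by qrels_iter(); ids are placeholders.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [Qrel("q1", "doc-a", 1, "0"), Qrel("q1", "doc-b", 0, "0"), Qrel("q2", "doc-a", 0, "0")]
grouped = group_qrels(sample)
print(grouped["q1"])
```

With the real data, the input would be `ir_datasets.load("trec-spanish/trec3").qrels_iter()`.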
ir_datasets export trec-spanish/trec3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-spanish.trec3.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Harman1994Trec3,
  title={Overview of the Third Text REtrieval Conference (TREC-3)},
  author={Donna Harman},
  booktitle={TREC},
  year={1994}
}
@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}
Metadata:
{ "docs": { "count": 120605, "fields": { "doc_id": { "max_len": 13, "common_prefix": "" } } }, "queries": { "count": 25 }, "qrels": { "count": 19005, "fields": { "relevance": { "counts_by_value": { "1": 4766, "0": 14239 } } } } }
Spanish benchmark from TREC 4.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, description_es1, description_en1, description_es2, description_en2>
You can find more details about the Python API here.
ir_datasets export trec-spanish/trec4 queries
[query_id] [description_es1] [description_en1] [description_es2] [description_en2]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-spanish.trec4.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from trec-spanish
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>
You can find more details about the Python API here.
ir_datasets export trec-spanish/trec4 docs
[doc_id] [text] [marked_up_doc]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-spanish.trec4')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 11K | 83.2% |
1 | relevant | 2.2K | 16.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-spanish/trec4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-spanish.trec4.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Harman1995Trec4,
  title={Overview of the Fourth Text REtrieval Conference (TREC-4)},
  author={Donna Harman},
  booktitle={TREC},
  year={1995}
}
@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}
Metadata:
{ "docs": { "count": 120605, "fields": { "doc_id": { "max_len": 13, "common_prefix": "" } } }, "queries": { "count": 25 }, "qrels": { "count": 13109, "fields": { "relevance": { "counts_by_value": { "1": 2202, "0": 10907 } } } } }