`ir_datasets`: TREC Spanish

Index

trec-spanish
trec-spanish/trec3
trec-spanish/trec4

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium. The specific resource needed is LDC2000T51.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: LDC2000T51.tgz.

ir_datasets expects this file to be copied/linked as ~/.ir_datasets/trec-spanish/corpus.tgz.
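The linking step can be sketched as follows. This is a minimal sketch, not part of ir_datasets: the helper name `link_corpus` and its `home` parameter are hypothetical, and it assumes a symlink is acceptable where a copy would also work.

```python
from pathlib import Path


def link_corpus(source: Path, home: Path = Path.home() / ".ir_datasets") -> Path:
    """Symlink the downloaded LDC archive to the path ir_datasets looks for.

    `link_corpus` and `home` are illustrative names, not an ir_datasets API.
    """
    target = home / "trec-spanish" / "corpus.tgz"
    # Create ~/.ir_datasets/trec-spanish/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing file or link untouched rather than overwrite it.
    if not (target.exists() or target.is_symlink()):
        target.symlink_to(source.resolve())
    return target


# Usage (path to the download is an assumption):
# link_corpus(Path("~/Downloads/LDC2000T51.tgz").expanduser())
```

A symlink avoids duplicating the archive on disk; copying the file to the same location should work equally well.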

`"trec-spanish"`

A collection of news articles in Spanish, used for multi-lingual evaluation in TREC 3 and TREC 4.

Document collection from LDC2000T51.

docs

121K docs

Language: es

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rogers2000Spanish}

Bibtex:

@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}

Metadata

{
  "docs": {
    "count": 120605,
    "fields": {
      "doc_id": {
        "max_len": 13,
        "common_prefix": ""
      }
    }
  }
}
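Figures like `count` and `max_len` in the metadata above could be recomputed from a document iterator. The sketch below is an assumption about how one might do that, not the way ir_datasets derives its metadata; `doc_id_stats` is a hypothetical helper, and the stand-in `TrecDoc` tuples with made-up IDs keep it runnable without the corpus.

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecDoc type documented above.
TrecDoc = namedtuple("TrecDoc", ["doc_id", "text", "marked_up_doc"])


def doc_id_stats(docs):
    """Count documents and track the longest doc_id in a single pass."""
    count = 0
    max_len = 0
    for doc in docs:
        count += 1
        max_len = max(max_len, len(doc.doc_id))
    return {"count": count, "max_len": max_len}


# Hypothetical documents; real IDs come from the LDC corpus.
sample = [
    TrecDoc("DOC-0001", "texto uno", "<DOC>texto uno</DOC>"),
    TrecDoc("DOC-000002", "texto dos", "<DOC>texto dos</DOC>"),
]
stats = doc_id_stats(sample)

# With the corpus installed, the same function accepts the real iterator:
# import ir_datasets
# doc_id_stats(ir_datasets.load("trec-spanish").docs_iter())
```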

`"trec-spanish/trec3"`

Spanish benchmark from TREC 3.

Task Overview Paper

queries

25 queries

Language: multiple/other/unknown

Query type:

TrecSpanish3Query: (namedtuple)

query_id: str
title_es: str
title_en: str
description_es: str
description_en: str
narrative_es: str
narrative_en: str
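Because each topic field comes in a Spanish and an English version, a retrieval run has to pick one language. The following is a minimal sketch of that selection; `topic_text` is a hypothetical helper, and the stand-in tuple with placeholder text only mirrors the field names listed above.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecSpanish3Query.
Query = namedtuple(
    "Query",
    [
        "query_id",
        "title_es", "title_en",
        "description_es", "description_en",
        "narrative_es", "narrative_en",
    ],
)


def topic_text(query, lang="es", fields=("title", "description")):
    """Join the chosen fields of one language into a single query string."""
    parts = [getattr(query, f"{field}_{lang}") for field in fields]
    return " ".join(part.strip() for part in parts if part.strip())


# Placeholder content, not an actual TREC topic.
example = Query(
    "q1",
    "titulo", "title",
    "descripcion", "description",
    "narrativa", "narrative",
)
spanish = topic_text(example)                      # "titulo descripcion"
english = topic_text(example, "en", ("title",))    # "title"
```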

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title_es, title_en, description_es, description_en, narrative_es, narrative_en>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish/trec3 queries



[query_id]    [title_es]    [title_en]    [description_es]    [description_en]    [narrative_es]    [narrative_en]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

121K docs

Inherits docs from trec-spanish

Language: es

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish/trec3 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

19K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`14K`	74.9%
1	relevant	`4.8K`	25.1%
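The percentages in this table follow from the exact counts given under `counts_by_value` in the Metadata listing for this subset. A short sketch of the arithmetic, with `relevance_shares` as a hypothetical helper name:

```python
def relevance_shares(counts_by_value):
    """Convert per-level judgment counts into percentages of the total."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in counts_by_value.items()
    }


# Counts copied from the trec-spanish/trec3 metadata.
trec3_counts = {"1": 4766, "0": 14239}
shares = relevance_shares(trec3_counts)
print(shares)  # {'1': 25.1, '0': 74.9}
```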

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
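Evaluation code usually needs the judgments grouped by query rather than as a flat stream. One way to do that is sketched below; `qrels_by_query` is a hypothetical helper, and the sample tuples use made-up IDs so the block runs without the corpus.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as the TrecQrel type documented above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Hypothetical judgments for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q2", "d1", 0, "0"),
]
lookup = qrels_by_query(sample_qrels)

# With the corpus installed:
# import ir_datasets
# lookup = qrels_by_query(ir_datasets.load("trec-spanish/trec3").qrels_iter())
```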

CLI

ir_datasets export trec-spanish/trec3 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Harman1994Trec3,Rogers2000Spanish}

Bibtex:

@inproceedings{Harman1994Trec3,
  title={Overview of the Third Text REtrieval Conference (TREC-3)},
  author={Donna Harman},
  booktitle={TREC},
  year={1994}
}
@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}

Metadata

{
  "docs": {
    "count": 120605,
    "fields": {
      "doc_id": {
        "max_len": 13,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 25
  },
  "qrels": {
    "count": 19005,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4766,
          "0": 14239
        }
      }
    }
  }
}

`"trec-spanish/trec4"`

Spanish benchmark from TREC 4.

Task Overview Paper

queries

25 queries

Language: multiple/other/unknown

Query type:

TrecSpanish4Query: (namedtuple)

query_id: str
description_es1: str
description_en1: str
description_es2: str
description_en2: str
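Each TREC 4 topic carries two description fields per language. The listing does not say how the two differ, so the sketch below simply assumes the numeric suffix marks two alternative formulations and collects whichever are non-empty; `descriptions` is a hypothetical helper and the sample values are placeholders.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecSpanish4Query.
Query4 = namedtuple(
    "Query4",
    [
        "query_id",
        "description_es1", "description_en1",
        "description_es2", "description_en2",
    ],
)


def descriptions(query, lang="es"):
    """Return the non-empty description variants for one language, in order."""
    variants = (getattr(query, f"description_{lang}{n}") for n in (1, 2))
    return [text.strip() for text in variants if text.strip()]


# Placeholder content, not an actual TREC topic.
example4 = Query4("q1", "primera", "first", "", "second")
es_variants = descriptions(example4)        # ["primera"]
en_variants = descriptions(example4, "en")  # ["first", "second"]
```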

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, description_es1, description_en1, description_es2, description_en2>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish/trec4 queries



[query_id]    [description_es1]    [description_en1]    [description_es2]    [description_en2]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

121K docs

Inherits docs from trec-spanish

Language: es

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish/trec4 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

13K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str
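The four fields map directly onto the conventional TREC qrels text layout, where each line reads `query_id iteration doc_id relevance`. A minimal sketch of that serialization, with `to_trec_line` as a hypothetical helper and made-up IDs:

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecQrel type documented above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def to_trec_line(qrel):
    """Format one judgment in the order: query_id, iteration, doc_id, relevance."""
    return f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}"


# Hypothetical judgment for illustration only.
line = to_trec_line(TrecQrel("q1", "d7", 1, "0"))
```

Note that the namedtuple stores `iteration` last, while the text format places it second.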

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`11K`	83.2%
1	relevant	`2.2K`	16.8%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("trec-spanish/trec4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-spanish/trec4 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Harman1995Trec4,Rogers2000Spanish}

Bibtex:

@inproceedings{Harman1995Trec4,
  title={Overview of the Fourth Text REtrieval Conference (TREC-4)},
  author={Donna Harman},
  booktitle={TREC},
  year={1995}
}
@misc{Rogers2000Spanish,
  title={TREC Spanish LDC2000T51},
  author={Rogers, Willie},
  year={2000},
  url={https://catalog.ldc.upenn.edu/LDC2000T51},
  publisher={Linguistic Data Consortium}
}

Metadata

{
  "docs": {
    "count": 120605,
    "fields": {
      "doc_id": {
        "max_len": 13,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 25
  },
  "qrels": {
    "count": 13109,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2202,
          "0": 10907
        }
      }
    }
  }
}
