← home
Github: datasets/aquaint.py

ir_datasets: AQUAINT

Index
  1. aquaint
  2. aquaint/trec-robust-2005

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the the Linguistic Data Consortium. The specific resource needed is LDC2002T31.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: aquaint_comp_LDC2002T31.tgz.

ir_datasets expects this file to be copied/linked in ~/.ir_datasets/aquaint/.


"aquaint"

A document collection of about 1M English newswire text. Sources are the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

docsCitationMetadata
1.0M docs

Language: en

Document type:
TrecDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. marked_up_doc: str

Examples:

Python APICLIPyTerrierXPM-IR
import ir_datasets
dataset = ir_datasets.load("aquaint")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.


"aquaint/trec-robust-2005"

The TREC Robust 2005 dataset. Contains a subset of 50 "hard" queries from trec-robust04.

queriesdocsqrelsCitationMetadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python APICLIPyTerrierXPM-IR
import ir_datasets
dataset = ir_datasets.load("aquaint/trec-robust-2005")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.