`ir_datasets`: AQUAINT

Index

aquaint
aquaint/trec-robust-2005

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2002T31.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: aquaint_comp_LDC2002T31.tgz.

ir_datasets expects this file to be copied/linked in ~/.ir_datasets/aquaint/.
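As a quick sanity check before loading the dataset, the sketch below tests whether the archive sits at the location described above. It uses only the standard library; the helper names `aquaint_source_path` and `aquaint_source_present` and the `home` parameter are illustrative assumptions, not part of ir_datasets, and the sketch assumes the default `~/.ir_datasets` directory is in use.

```python
from pathlib import Path
from typing import Optional

# File name and directory are taken from the text above.
SOURCE_FILE = "aquaint_comp_LDC2002T31.tgz"


def aquaint_source_path(home: Optional[Path] = None) -> Path:
    """Return the path where the AQUAINT archive is expected.

    `home` is a hypothetical override (handy for testing); by default
    the current user's home directory is used.
    """
    base = home if home is not None else Path.home()
    return base / ".ir_datasets" / "aquaint" / SOURCE_FILE


def aquaint_source_present(home: Optional[Path] = None) -> bool:
    """True if the archive, or a symlink pointing to it, exists there."""
    return aquaint_source_path(home).is_file()


if __name__ == "__main__":
    path = aquaint_source_path()
    status = "found" if aquaint_source_present() else "missing"
    print(f"{path}: {status}")
```

Because `Path.is_file()` follows symbolic links, this works whether the file was copied or linked into place.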

`"aquaint"`

A collection of about 1M English newswire documents. Sources are the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

Dataset details

docs

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aquaint")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export aquaint docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:aquaint')
# Index aquaint
indexer = pt.IterDictIndexer('./indices/aquaint')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Graff2002Aquaint}

Bibtex:

@misc{Graff2002Aquaint, title={The AQUAINT Corpus of English News Text}, author={David Graff}, year={2002}, url={https://catalog.ldc.upenn.edu/LDC2002T31}, publisher={Linguistic Data Consortium} }

`"aquaint/trec-robust-2005"`

The TREC Robust 2005 dataset. Contains a subset of 50 "hard" queries from trec-robust04.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Shared task site
Task overview paper
See also: trec-robust04

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aquaint/trec-robust-2005")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
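Since each query carries three text fields, a common first step is choosing which one to search with. The following is a minimal sketch of that selection. It defines a local stand-in namedtuple with the `TrecQuery` fields listed above so that it runs without ir_datasets installed; the helper `queries_by_field` and the example records are hypothetical placeholders, not real TREC topics.

```python
from collections import namedtuple

# Stand-in mirroring the TrecQuery fields listed above. With ir_datasets
# installed, dataset.queries_iter() yields objects with the same field names.
TrecQuery = namedtuple(
    "TrecQuery", ["query_id", "title", "description", "narrative"]
)


def queries_by_field(queries, field="title"):
    """Map query_id -> text of the chosen field (hypothetical helper)."""
    if field not in ("title", "description", "narrative"):
        raise ValueError(f"unknown query field: {field}")
    return {q.query_id: getattr(q, field) for q in queries}


# Placeholder records for illustration only; not real TREC topics.
example_queries = [
    TrecQuery("q1", "placeholder title", "placeholder description",
              "placeholder narrative"),
    TrecQuery("q2", "another title", "another description",
              "another narrative"),
]

titles = queries_by_field(example_queries, "title")
print(titles)
```

With the real dataset, passing `dataset.queries_iter()` in place of `example_queries` should work the same way, since the helper relies only on the field names.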

CLI

ir_datasets export aquaint/trec-robust-2005 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:aquaint/trec-robust-2005')
index_ref = pt.IndexRef.of('./indices/aquaint') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Inherits docs from aquaint

Language: en

Document type:

TrecDoc: (namedtuple)

doc_id: str
text: str
marked_up_doc: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aquaint/trec-robust-2005")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, marked_up_doc>

You can find more details about the Python API here.

CLI

ir_datasets export aquaint/trec-robust-2005 docs



[doc_id]    [text]    [marked_up_doc]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:aquaint/trec-robust-2005')
# Index aquaint
indexer = pt.IterDictIndexer('./indices/aquaint')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'marked_up_doc'])

You can find more details about PyTerrier indexing here.

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	not relevant
1	relevant
2	highly relevant

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aquaint/trec-robust-2005")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
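To illustrate how the graded relevance levels above might be used, here is a minimal sketch that counts judgments per level and collects the documents judged at or above a threshold for each query. It uses a local stand-in namedtuple with the `TrecQrel` fields so it runs without the corpus; the helpers `count_by_relevance` and `relevant_docs` and the example records are hypothetical and do not come from the actual qrels file.

```python
from collections import Counter, namedtuple

# Stand-in mirroring the TrecQrel fields listed above. With ir_datasets
# installed, dataset.qrels_iter() yields objects with the same field names.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Labels copied from the relevance levels table above.
RELEVANCE_LABELS = {0: "not relevant", 1: "relevant", 2: "highly relevant"}


def count_by_relevance(qrels):
    """Count judgments per relevance label (hypothetical helper)."""
    counts = Counter(q.relevance for q in qrels)
    return {
        RELEVANCE_LABELS.get(level, f"unknown ({level})"): n
        for level, n in sorted(counts.items())
    }


def relevant_docs(qrels, min_relevance=1):
    """Group doc_ids judged at or above min_relevance by query_id."""
    grouped = {}
    for q in qrels:
        if q.relevance >= min_relevance:
            grouped.setdefault(q.query_id, set()).add(q.doc_id)
    return grouped


# Placeholder judgments for illustration only; not real TREC qrels.
example_qrels = [
    TrecQrel("q1", "doc-a", 0, "0"),
    TrecQrel("q1", "doc-b", 1, "0"),
    TrecQrel("q1", "doc-c", 2, "0"),
    TrecQrel("q2", "doc-a", 2, "0"),
]

print(count_by_relevance(example_qrels))
print(relevant_docs(example_qrels, min_relevance=2))
```

Setting `min_relevance=2` keeps only the "highly relevant" judgments, which is one way to binarize graded labels for measures that expect binary relevance.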

CLI

ir_datasets export aquaint/trec-robust-2005 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:aquaint/trec-robust-2005')
index_ref = pt.IndexRef.of('./indices/aquaint') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Voorhees2005Robust,Graff2002Aquaint}

Bibtex:

@inproceedings{Voorhees2005Robust, title={Overview of the TREC 2005 Robust Retrieval Track}, author={Ellen M. Voorhees}, booktitle={TREC}, year={2005} }

@misc{Graff2002Aquaint, title={The AQUAINT Corpus of English News Text}, author={David Graff}, year={2002}, url={https://catalog.ldc.upenn.edu/LDC2002T31}, publisher={Linguistic Data Consortium} }
