Github: datasets/aol_ia.py

ir_datasets: AOL-IA (Internet Archive)

Index
  1. aol-ia

Data Access Information

To use the documents of this dataset, you will need to run the download script in aolia-tools. To run the script, use the following commands:

git clone https://github.com/terrierteam/aolia-tools
cd aolia-tools
pip install -r requirements.txt
python downloader.py

It takes around 2 days to download all documents.


"aol-ia"

This is a version of the AOL Query Log in which the documents are the versions captured by the Internet Archive around the time of the query log (early 2006).

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.
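The hash-based ID scheme described above can be illustrated with a minimal sketch. The helper names (hashed_id, normalize_query), the choice of MD5, the 12-character truncation, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions made for illustration; this page does not state which hash or normalization ir_datasets actually uses, so the IDs produced here should not be expected to match the dataset's.

```python
import hashlib


def hashed_id(text: str, length: int = 12) -> str:
    """Hypothetical helper: prefix of the MD5 hex digest of the UTF-8 text.

    MD5 and the 12-character length are arbitrary choices for this sketch.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:length]


def normalize_query(query: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(query.lower().split())


# A document ID derived from the URL as it appears in the log.
doc_id = hashed_id("http://www.example.com/")

# Two raw queries that normalize to the same text share one query ID.
query_id = hashed_id(normalize_query("  Example   QUERY "))
same_query_id = hashed_id(normalize_query("example query"))

print(doc_id, query_id, query_id == same_query_id)
```

Because the ID depends only on the input string, the same URL or normalized query always maps to the same ID, which is what lets queries, qrels, and qlogs be joined on these values.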

10.0M queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str
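As a rough illustration of the record shape listed above, the following sketch defines a stand-in type with the same two string fields. The class here is a local stand-in written for this example, not an import from ir_datasets, and the sample values are made up.

```python
from typing import NamedTuple


class GenericQuery(NamedTuple):
    """Local stand-in mirroring the two fields listed above."""
    query_id: str
    text: str


# Made-up values for illustration only.
query = GenericQuery(query_id="abc123", text="example query")

# Fields can be read by name, by position, or by unpacking.
by_name = query.text
by_position = query[1]
query_id, text = query

print(query_id, text)
```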

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.
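With around 10.0M queries, it is often useful to look at only the first few records rather than iterate over everything. The sketch below uses itertools.islice for that. The helper first_queries is a hypothetical name, and the generator of stand-in records replaces dataset.queries_iter() so the example runs without downloading any data.

```python
from collections import namedtuple
from itertools import islice

# Stand-in record type with the same fields as the queries above.
Query = namedtuple("Query", ["query_id", "text"])


def first_queries(queries, n=5):
    """Hypothetical helper: return the first n items of any query iterator."""
    return list(islice(queries, n))


# Stand-in for dataset.queries_iter(); values are made up.
sample = (Query(f"q{i}", f"example query {i}") for i in range(1000))

preview = first_queries(sample, 3)
for q in preview:
    print(q.query_id, q.text)
```

With the real dataset, the same helper could be called as first_queries(dataset.queries_iter(), 3), since islice stops consuming the iterator after n items.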