`ir_datasets`: AOL-IA (Internet Archive)

Data Access Information

To use the documents of this dataset, you will need to run the download script in aolia-tools. To run the script, use the following commands:

 git clone https://github.com/terrierteam/aolia-tools
 cd aolia-tools
 pip install -r requirements.txt
 python downloader.py

It takes around 2 days to download all documents.

`"aol-ia"`

This is a version of the AOL Query Log in which the documents are the page versions that appeared around the time of the query log (early 2006), as retrieved from the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (with the iteration value set to the user ID). Full information (including the original query) is available from qlogs.
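The hash-based ID scheme described above can be illustrated with a minimal sketch. Several details here are assumptions rather than facts stated on this page: MD5 as the hash function, a 12-character hex prefix (chosen only because the metadata reports a maximum doc_id length of 12), and lowercasing plus whitespace collapsing as the query normalization. The helper names are hypothetical, and IDs produced this way are not guaranteed to match those assigned by ir_datasets.

```python
import hashlib

# Assumption: 12 characters, mirroring the doc_id max_len in the metadata.
ID_LENGTH = 12


def hash_id(value: str) -> str:
    """Hypothetical helper: derive a short, stable ID from a string.

    MD5 is an assumption; the page only says "a hash".
    """
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:ID_LENGTH]


def normalize_query(query: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def make_doc_id(url: str) -> str:
    """Document ID from the URL as it appears in the query log."""
    return hash_id(url)


def make_query_id(query: str) -> str:
    """Query ID from the normalized query text."""
    return hash_id(normalize_query(query))
```

Because the ID depends only on the input string, the same URL or normalized query always maps to the same ID, which is what lets queries, qrels, and qlogs refer to each other without IDs in the original log.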

queries

10.0M queries

Language: multiple/other/unknown

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aol-ia")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export aol-ia queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

1.5M docs

Language: multiple/other/unknown

Document type:

AolIaDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
ia_url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aol-ia")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, ia_url>

You can find more details about the Python API here.

CLI

ir_datasets export aol-ia docs



[doc_id]    [title]    [text]    [url]    [ia_url]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

19M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	clicked	`19M`	100.0%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
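Since the iteration field of each qrel holds the user ID, clicks can be grouped per user. The following is a minimal sketch under stated assumptions: the `TrecQrel` namedtuple is a local stand-in that mirrors the fields listed above, `clicks_by_user` is a hypothetical helper, and the sample records are invented for illustration. With the real dataset, `dataset.qrels_iter()` would be passed in place of `sample_qrels`.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def clicks_by_user(qrels):
    """Hypothetical helper: map each user ID (iteration) to clicked doc IDs."""
    by_user = defaultdict(set)
    for qrel in qrels:
        by_user[qrel.iteration].add(qrel.doc_id)
    return dict(by_user)


# Invented sample records, for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "u1"),
    TrecQrel("q2", "d2", 1, "u1"),
    TrecQrel("q1", "d1", 1, "u2"),
]

clicked = clicks_by_user(sample_qrels)
```

Note that with roughly 19M qrels, holding every user's clicks in memory at once may be costly; streaming or filtering to a subset of users first is a reasonable alternative.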

CLI

ir_datasets export aol-ia qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qlogs

36M qlogs

Query Log type:

AolQlog: (namedtuple)

user_id: str
query_id: str
query: str
query_orig: str
time: datetime
items: Tuple[LogItem, ...]

LogItem: (namedtuple)

doc_id: str
rank: int
clicked: bool

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qlog in dataset.qlogs_iter():
    qlog # namedtuple<user_id, query_id, query, query_orig, time, items>

You can find more details about the Python API here.
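Each qlog record carries a tuple of items, so finding what a user clicked means filtering on the clicked flag. The sketch below assumes local stand-ins for `AolQlog` and `LogItem` that mirror the fields listed above; `clicked_items` is a hypothetical helper and the sample record is invented for illustration. With the real dataset, records from `dataset.qlogs_iter()` would be passed to the helper instead.

```python
from collections import namedtuple
from datetime import datetime

# Local stand-ins mirroring the record types listed above.
LogItem = namedtuple("LogItem", ["doc_id", "rank", "clicked"])
AolQlog = namedtuple(
    "AolQlog", ["user_id", "query_id", "query", "query_orig", "time", "items"]
)


def clicked_items(qlog):
    """Hypothetical helper: (doc_id, rank) pairs for the clicked items."""
    return [(item.doc_id, item.rank) for item in qlog.items if item.clicked]


# Invented sample record, for illustration only.
sample_qlog = AolQlog(
    user_id="u1",
    query_id="q1",
    query="example query",
    query_orig="Example Query",
    time=datetime(2006, 3, 1, 12, 0, 0),
    items=(LogItem("d1", 1, True), LogItem("d2", 2, False)),
)

clicks = clicked_items(sample_qlog)
```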

CLI

No example available for CLI

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Pass2006Picture,MacAvaney2022Reproducing}

Bibtex:

@inproceedings{Pass2006Picture,
  title={A picture of search},
  author={Pass, Greg and Chowdhury, Abdur and Torgeson, Cayley},
  booktitle={InfoScale},
  year={2006}
}

@inproceedings{MacAvaney2022Reproducing,
  author={MacAvaney, Sean and Macdonald, Craig and Ounis, Iadh},
  title={Reproducing Personalised Session Search over the AOL Query Log},
  booktitle={ECIR},
  year={2022}
}

Metadata

{
  "docs": {
    "count": 1525586,
    "fields": {
      "doc_id": {
        "max_len": 12,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 9966939
  },
  "qrels": {
    "count": 19442629,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 19442629
        }
      }
    }
  },
  "qlogs": {
    "count": 36389567
  }
}
