Github: datasets/aol_ia.py

ir_datasets: AOL-IA (Internet Archive)

Index
  1. aol-ia

Data Access Information

To use the documents of this dataset, you will need to run the download script in aolia-tools. To run the script, use the following commands:

git clone https://github.com/terrierteam/aolia-tools
cd aolia-tools
pip install -r requirements.txt
python downloader.py

It takes around 2 days to download all documents.


"aol-ia"

This is a version of the AOL Query Log. Documents are the versions archived by the Internet Archive around the time of the query log (early 2006).

The query log does not include document or query IDs, so ir_datasets creates them. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalised query. All unique normalised queries are available from queries, and all clicked documents are available from qrels (with the iteration value set to the user ID). Full information (including the original query) is available from qlogs.
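
As a rough illustration of hash-derived IDs, the sketch below builds a deterministic ID from a URL or from a normalised query. The hash function (MD5), the 12-character truncation, and the rules in normalise_query are assumptions made for illustration only. This page does not state the scheme ir_datasets actually uses, so IDs produced here should not be expected to match those in the dataset.

```python
import hashlib

ID_LENGTH = 12  # assumed truncation length, not taken from ir_datasets


def normalise_query(text: str) -> str:
    """Hypothetical normaliser: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def hash_id(value: str) -> str:
    """Return a short, deterministic hex ID for a string (assumed: MD5)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:ID_LENGTH]


def query_id(raw_query: str) -> str:
    # Queries that normalise to the same text share one ID.
    return hash_id(normalise_query(raw_query))


def doc_id(url: str) -> str:
    # The URL is hashed exactly as given, with no normalisation.
    return hash_id(url)
```

Under these assumptions, two raw queries that differ only in case or spacing map to the same query ID, which is why the number of unique queries can be smaller than the number of log entries.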

queries
10.0M queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export aol-ia queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs
1.5M docs

Language: multiple/other/unknown

Document type:
AolIaDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. ia_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, ia_url>

You can find more details about the Python API here.

CLI
ir_datasets export aol-ia docs
[doc_id]    [title]    [text]    [url]    [ia_url]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
19M qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
1       clicked       19M      100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
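
Because the iteration field carries the user ID, qrels can be grouped per user. The following is a minimal sketch of that grouping. The local TrecQrel class only mirrors the fields listed above so the example runs without downloading the dataset, and the sample records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class TrecQrel(NamedTuple):
    # Stand-in that mirrors the field order listed above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str  # holds the user ID in aol-ia


def clicks_by_user(qrels: Iterable[TrecQrel]) -> Dict[str, Set[str]]:
    """Map each user ID (stored in `iteration`) to the doc IDs they clicked."""
    clicks: Dict[str, Set[str]] = defaultdict(set)
    for qrel in qrels:
        clicks[qrel.iteration].add(qrel.doc_id)
    return dict(clicks)


# Made-up records for illustration only.
sample = [
    TrecQrel("q1", "d1", 1, "user-a"),
    TrecQrel("q1", "d2", 1, "user-b"),
    TrecQrel("q2", "d1", 1, "user-a"),
    TrecQrel("q3", "d3", 1, "user-a"),
]
result = clicks_by_user(sample)
```

With the dataset available, dataset.qrels_iter() could be passed in place of sample. With 19M qrels, holding every user's clicks in memory may be costly, so a streaming or on-disk aggregation might be preferable.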

CLI
ir_datasets export aol-ia qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qlogs
36M qlogs
Query Log type:
AolQlog: (namedtuple)
  1. user_id: str
  2. query_id: str
  3. query: str
  4. query_orig: str
  5. time: datetime
  6. items: Tuple[
    LogItem: (namedtuple)
    1. doc_id: str
    2. rank: int
    3. clicked: bool
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qlog in dataset.qlogs_iter():
    qlog # namedtuple<user_id, query_id, query, query_orig, time, items>

You can find more details about the Python API here.
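
Each qlog record nests its result items, so extracting clicks means walking the items tuple. The sketch below shows one way to do that. The AolQlog and LogItem classes are local stand-ins that mirror the fields listed above, and the sample records are invented.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Tuple


class LogItem(NamedTuple):
    doc_id: str
    rank: int
    clicked: bool


class AolQlog(NamedTuple):
    user_id: str
    query_id: str
    query: str
    query_orig: str
    time: datetime
    items: Tuple[LogItem, ...]


def clicked_pairs(qlogs: Iterable[AolQlog]) -> Iterator[Tuple[str, str, str]]:
    """Yield (user_id, query_id, doc_id) for every clicked item."""
    for qlog in qlogs:
        for item in qlog.items:
            if item.clicked:
                yield qlog.user_id, qlog.query_id, item.doc_id


# Invented records for illustration only.
sample = [
    AolQlog("user-a", "q1", "cheap flights", "Cheap  Flights",
            datetime(2006, 3, 1, 12, 0, 0),
            (LogItem("d1", 1, True), LogItem("d2", 2, False))),
    AolQlog("user-b", "q2", "weather", "weather",
            datetime(2006, 3, 2, 9, 30, 0),
            ()),  # a query with no recorded items
]
pairs = list(clicked_pairs(sample))
```

The generator keeps memory use flat, which matters when the input is the full dataset.qlogs_iter() stream of 36M records.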

CLI

No example available for CLI

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Pass2006Picture,MacAvaney2022Reproducing}

Bibtex:

@inproceedings{Pass2006Picture,
  title={A picture of search},
  author={Pass, Greg and Chowdhury, Abdur and Torgeson, Cayley},
  booktitle={InfoScale},
  year={2006}
}
@inproceedings{MacAvaney2022Reproducing,
  author={MacAvaney, Sean and Macdonald, Craig and Ounis, Iadh},
  title={Reproducing Personalised Session Search over the AOL Query Log},
  booktitle={ECIR},
  year={2022}
}