ir_datasets: AOL-IA (Internet Archive)

To use the documents of this dataset, you will need to run the download script in aolia-tools. To run the script, use the following commands:
git clone https://github.com/terrierteam/aolia-tools
cd aolia-tools
pip install -r requirements.txt
python downloader.py
It takes around 2 days to download all documents.
This is a version of the AOL Query Log. Documents are the versions of the clicked pages as they appeared around the time of the query log (early 2006), retrieved via the Internet Archive.
The query log does not include document or query IDs; these are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalised query. All unique normalised queries are available from queries, and all clicked documents are available from qrels (with the iteration value set to the user ID). Full information (including the original query) is available from qlogs.
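For illustration only, the sketch below shows how such hash-based IDs can be derived from a URL or a normalised query string; the exact normalisation, digest, and truncation used internally by ir_datasets may differ.
import hashlib

def hash_id(value, length=12):
    # Illustrative only: derive a stable ID by hashing a string.
    # The digest and truncation used by ir_datasets may differ.
    return hashlib.md5(value.encode('utf-8')).hexdigest()[:length]

doc_id = hash_id('http://www.example.com/')  # hypothetical clicked URL
query_id = hash_id('example query')          # hypothetical normalised query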
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
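The iterator yields queries in a fixed order; if you instead need to look up query text by ID, one option is to materialise a dict from the documented fields. This is a usage sketch, not a dedicated ir_datasets API, and it holds all ~10M unique queries in memory.
import ir_datasets

dataset = ir_datasets.load("aol-ia")
# Build an in-memory lookup from query_id to query text.
query_text = {q.query_id: q.text for q in dataset.queries_iter()}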
ir_datasets export aol-ia queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.aol-ia.queries') # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, text, url, ia_url>
You can find more details about the Python API here.
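Iterating all ~1.5M documents is unnecessary when you only need a few records; ir_datasets also exposes a document store for random access by doc_id. A sketch (the doc_id below is hypothetical):
import ir_datasets

dataset = ir_datasets.load("aol-ia")
store = dataset.docs_store()
doc = store.get("0000003e8b5e")  # hypothetical doc_id; same fields as docs_iter()
print(doc.title, doc.url)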
ir_datasets export aol-ia docs
[doc_id] [title] [text] [url] [ia_url]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.aol-ia')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | clicked | 19M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
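Since the iteration field carries the user ID, the same query/document pair can appear once per user who clicked it. A minimal sketch that collects the set of clicked documents per query using only the documented fields:
from collections import defaultdict

import ir_datasets

dataset = ir_datasets.load("aol-ia")
clicked = defaultdict(set)
for qrel in dataset.qrels_iter():
    # qrel.iteration holds the user ID, so duplicate clicks across users collapse here
    clicked[qrel.query_id].add(qrel.doc_id)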
ir_datasets export aol-ia qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.aol-ia.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("aol-ia")
for qlog in dataset.qlogs_iter():
    qlog  # namedtuple<user_id, query_id, query, query_orig, time, items>
You can find more details about the Python API here.
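For example, the full log (around 36M records) can be scanned once to count how many log entries each user produced; a minimal sketch using only the documented fields:
from collections import Counter

import ir_datasets

dataset = ir_datasets.load("aol-ia")
queries_per_user = Counter()
for qlog in dataset.qlogs_iter():
    # each record carries the original query and any clicked items for one log entry
    queries_per_user[qlog.user_id] += 1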
No example available for CLI
No example available for PyTerrier
No example available for XPM-IR
Bibtex:
@inproceedings{Pass2006Picture,
  title={A picture of search},
  author={Pass, Greg and Chowdhury, Abdur and Torgeson, Cayley},
  booktitle={InfoScale},
  year={2006}
}
@inproceedings{MacAvaney2022Reproducing,
  author={MacAvaney, Sean and Macdonald, Craig and Ounis, Iadh},
  title={Reproducing Personalised Session Search over the AOL Query Log},
  booktitle={ECIR},
  year={2022}
}
Metadata:
{
  "docs": {
    "count": 1525586,
    "fields": {"doc_id": {"max_len": 12, "common_prefix": ""}}
  },
  "queries": {"count": 9966939},
  "qrels": {
    "count": 19442629,
    "fields": {"relevance": {"counts_by_value": {"1": 19442629}}}
  },
  "qlogs": {"count": 36389567}
}