GitHub: allenai/ir_datasets

ir_datasets: Experimaestro

Experimaestro is an experiment manager framework. The experimaestro-ir package provides support for IR data and experiments.

To get started with experimaestro-ir, see this guide. You will need to run:

pip install experimaestro-ir

Basic Usage

Datasets are referenced using irds.{dotted-irds}, where {dotted-irds} is the ir_datasets dataset ID with the slashes replaced by dots. For example, to load the antique/train dataset in Experimaestro, use:

from datamaestro import prepare_dataset
dataset = prepare_dataset("irds.antique.train")
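
The ID mapping described above is mechanical, so it can be scripted when many datasets are involved. The following is a minimal sketch, assuming the only transformation is the irds. prefix plus the slash-to-dot replacement; the helper name to_datamaestro_id is hypothetical and is not part of datamaestro or ir_datasets:

```python
def to_datamaestro_id(irds_id: str) -> str:
    """Convert an ir_datasets ID (e.g. "antique/train") to the irds.* form.

    Hypothetical helper for illustration; not part of datamaestro or ir_datasets.
    """
    # Drop any leading/trailing slash, then replace the remaining slashes with dots
    return "irds." + irds_id.strip("/").replace("/", ".")


print(to_datamaestro_id("antique/train"))  # irds.antique.train
```

The resulting string can then be passed to prepare_dataset as in the example above.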

Experimaestro's Dataset objects have a different API and use a different naming convention than ir_datasets, but they provide similar functionality. For instance, in the examples below, documents are reached through dataset.documents and queries through dataset.topics.


To get the first 20 documents:

for _, document in zip(range(20), dataset.documents.iter()):
    print(document.docid, document.text)  # document ID and text
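
An equivalent way to take the first N items from an iterator is itertools.islice from the Python standard library. The sketch below uses a stand-in generator so that it runs without any dataset installed; the Document class and iter_documents function are hypothetical placeholders for whatever dataset.documents.iter() yields, and only the docid and text attributes shown above are assumed:

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator


@dataclass
class Document:
    # Stand-in for illustration only; real objects come from dataset.documents.iter()
    docid: str
    text: str


def iter_documents() -> Iterator[Document]:
    """Hypothetical stand-in for dataset.documents.iter()."""
    for i in range(1000):
        yield Document(docid=f"doc{i}", text=f"text of document {i}")


# islice stops after 20 items and leaves the rest of the iterator unconsumed
first_20 = list(islice(iter_documents(), 20))
for document in first_20:
    print(document.docid, document.text)
```

With a real dataset, iter_documents() would be replaced by dataset.documents.iter().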

To get the topics:

for topic in dataset.topics.iter():
    print(topic.qid, topic.text, topic.metadata)  # query ID, text, and metadata

Running an experiment

from experimaestro import experiment
from datamaestro import prepare_dataset
from xpmir.measures import AP, nDCG
from xpmir.interfaces.anserini import IndexCollection, AnseriniRetriever
from xpmir.rankers.standard import BM25
from xpmir.evaluation import Evaluate
import os
import logging

dataset = prepare_dataset("irds.antique.train")

with experiment("workdir", "evaluate-bm25", port=12345) as xp:
    # Build the index
    xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])
    index = IndexCollection(documents=dataset.documents).submit()

    bm25_retriever = AnseriniRetriever(k=1500, index=index, model=BM25())
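    # Evaluate the retriever; the call is closed and submitted like the indexing task above
    bm25_eval = Evaluate(dataset=dataset, retriever=bm25_retriever, measures=[
        AP, nDCG@10
    ]).submit()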

print("BM25 results")