ir_datasets: Experimaestro
Experimaestro is an experiment manager framework. The experimaestro-ir package provides support for IR data and experiments.
To get started with experimaestro-ir, see this guide. You will need to run:
pip install experimaestro-ir
Datasets are referenced using irds.{dotted-irds}, where {dotted-irds} is the ir_datasets dataset ID with the slashes replaced by dots. For example, to load the antique/train dataset in Experimaestro, use:
from datamaestro import prepare_dataset
dataset = prepare_dataset("irds.antique.train")
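The naming rule described above can be sketched as a small helper. The function name `to_datamaestro_id` is hypothetical (it is not part of datamaestro or ir_datasets); it only applies the stated convention of prefixing `irds.` and replacing slashes with dots.

```python
def to_datamaestro_id(irds_id: str) -> str:
    """Convert an ir_datasets ID (e.g. "antique/train") to its dotted form.

    Hypothetical helper: it only illustrates the naming convention and
    does not check that the dataset actually exists.
    """
    # Drop stray leading/trailing slashes, then replace the separators
    return "irds." + irds_id.strip("/").replace("/", ".")


print(to_datamaestro_id("antique/train"))
```

The resulting string is what would be passed to `prepare_dataset`.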
Experimaestro's Dataset objects have a different API and use a different naming convention than ir_datasets, but they provide similar functionality. The naming is as follows:
Experimaestro's... | Uses ir_datasets'... |
---|---|
dataset.documents.iter() | dataset.docs_iter() |
dataset.topics.iter() | dataset.queries_iter() |
dataset.assessments.iter() | dataset.qrels_iter() |
To get the first 20 documents:
for _, document in zip(range(20), dataset.documents.iter()):
    print(document.docid, document.text)  # document ID and text
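The `zip(range(20), ...)` idiom stops after 20 items without reading the rest of the iterator. An equivalent way to write this is `itertools.islice`, sketched below on a stand-in generator. `fake_documents` is a hypothetical placeholder for `dataset.documents.iter()`, used so that the example runs without downloading a dataset.

```python
from itertools import islice


def fake_documents():
    """Hypothetical stand-in for dataset.documents.iter()."""
    for i in range(1000):
        yield (f"doc{i}", f"text of document {i}")


# islice consumes only the first 20 items of the iterator
first_20 = list(islice(fake_documents(), 20))

for docid, text in first_20:
    print(docid, text)
```

Both forms are lazy, so neither requires iterating over the whole collection.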
To get the topics:
for topic in dataset.topics.iter():
    print(topic.qid, topic.text, topic.metadata)  # (qid, text, metadata) tuple
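The following complete example indexes the collection with Anserini and evaluates a BM25 retriever:
from experimaestro import experiment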
from datamaestro import prepare_dataset
from xpmir.measures import AP, nDCG
from xpmir.interfaces.anserini import IndexCollection, AnseriniRetriever
from xpmir.rankers.standard import BM25
from xpmir.evaluation import Evaluate
import os
import logging
logging.basicConfig(level=logging.INFO)
dataset = prepare_dataset("irds.antique.train")
with experiment("workdir", "evaluate-bm25", port=12345) as xp:
    # Pass JAVA_HOME through to the experiment (Anserini runs on Java)
    xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])

    # Build the index
    index = IndexCollection(documents=dataset.documents).submit()

    # Retrieve with BM25 and evaluate the results
    bm25_retriever = AnseriniRetriever(k=1500, index=index, model=BM25())
    bm25_eval = Evaluate(
        dataset=dataset,
        retriever=bm25_retriever,
        measures=[AP, nDCG@10],
    ).submit()

print("BM25 results")
print(bm25_eval.results.read_text())