Github: allenai/ir_datasets

PyTerrier & ir_datasets

PyTerrier is a Python interface to the Terrier search engine that enables the creation of flexible retrieval pipelines.

To get started with PyTerrier, see this guide.

The documentation for each dataset includes PyTerrier examples for indexing, retrieval, and experimentation. Click on the PyTerrier tab in the documentation to see these examples.

Basic Usage

The PyTerrier library includes its own dataset API, which will use the ir_datasets implementation under the hood if the dataset ID is prefixed with irds:. For example, to load the antique/test dataset in PyTerrier, run:

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/test') # 'irds:<ir-datasets-id>'

PyTerrier Dataset objects use a different API and naming convention than ir_datasets, but they provide similar functionality. When a PyTerrier dataset wraps an ir_datasets dataset, its methods map to the ir_datasets equivalents as follows:

PyTerrier's...            Uses...          Notes
get_corpus_iter()         docs_iter()
get_corpus_lang()         docs_lang()
get_topics(variant=None)  queries_iter()   When multiple query fields are available (e.g., title, description, narrative), variant selects which one to use
get_topics_lang()         queries_lang()
get_qrels()               qrels_iter()
info_url()                                 Provides the URL of the corresponding ir_datasets documentation page
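
To make the naming difference concrete, here is a minimal sketch of the kind of conversion involved when an ir_datasets document record is turned into a PyTerrier-style dictionary. It does not call either library: Doc is a hypothetical stand-in for an ir_datasets record, to_pyterrier_doc is an illustrative helper, and the sketch assumes the main difference is that ir_datasets names the identifier doc_id while PyTerrier expects docno.

```python
from collections import namedtuple

# Hypothetical stand-in for an ir_datasets document record
# (identifier first, followed by the remaining fields).
Doc = namedtuple("Doc", ["doc_id", "text"])


def to_pyterrier_doc(doc):
    """Rename doc_id to docno and keep the remaining fields unchanged."""
    record = doc._asdict()
    record["docno"] = record.pop("doc_id")
    return record


docs = [Doc("d1", "first document"), Doc("d2", "second document")]
converted = [to_pyterrier_doc(d) for d in docs]
print(converted[0])
```

The real wrapper performs this mapping for you, so this is only meant to show what the records passed to an indexer roughly look like.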

Indexing a Dataset

The pt.IterDictIndexer class can index an ir_datasets document collection. When calling index(), pass the document fields you want to include in the index via the fields argument. The available fields are listed on the dataset's documentation page, and suggestions for each dataset are given in the PyTerrier samples.

dataset = pt.get_dataset('irds:antique') # use a pyterrier dataset object here
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

More information about indexing in PyTerrier can be found here.

Performing Retrieval

PyTerrier provides a variety of retrieval functions with a common API. Pipelines often start with retrieval over an inverted index using a scoring function such as BM25.

dataset = pt.get_dataset('irds:antique/test') # use a pyterrier dataset object here
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
bm25(dataset.get_topics())

Some datasets have multiple query formats (e.g., title, description, narrative). To select which one to use, specify the variant:

dataset = pt.get_dataset('irds:trec-robust04')
bm25(dataset.get_topics('description')) # bm25 is the retriever built in the previous example

More information about retrieval and ranking in PyTerrier can be found here.

Running an Experiment

PyTerrier also provides an interface for conducting IR experiments.

from pyterrier.measures import *
dataset = pt.get_dataset('irds:antique/test') # use a pyterrier dataset object here
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')
pt.Experiment(
    [bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
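
For readers unfamiliar with the two measures requested above, the following sketch computes average precision (MAP is its mean over all queries) and nDCG@k for a single toy ranking in plain Python. It is not how PyTerrier evaluates runs internally. The function names and the toy data are hypothetical, and the nDCG variant shown assumes the graded label is used directly as the gain with a log2(rank + 1) discount.

```python
import math


def average_precision(ranked_docnos, relevant):
    """Average of precision values at the ranks where relevant documents appear."""
    hits = 0
    total = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_docnos, gains, k):
    """DCG of the top-k ranking divided by the DCG of an ideal top-k ranking."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(d, 0) for d in ranked_docnos[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Toy data: one query, four retrieved documents, two judged relevant.
ranking = ["d3", "d1", "d7", "d2"]
qrels = {"d1": 2, "d2": 1}  # docno -> graded relevance label

ap = average_precision(ranking, {d for d, g in qrels.items() if g > 0})
ndcg = ndcg_at_k(ranking, qrels, k=20)
print(ap, ndcg)
```

In this toy case the relevant documents sit at ranks 2 and 4, so both measures are penalised relative to a ranking that places them first.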

More information about experiments in PyTerrier can be found here.

Document Text

Some re-ranking models, such as those based on BERT, make use of the document text. PyTerrier can use ir_datasets's fast document lookups (via docs_store) for this: pass an ir_datasets-backed dataset object to pt.text.get_text when building a retrieval pipeline:

dataset = pt.get_dataset('irds:antique/test') # use a pyterrier dataset object here
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipe = (pt.BatchRetrieve(index_ref, wmodel="DPH")
     >> pt.text.get_text(dataset, "text")
     >> pt.text.scorer(wmodel="DPH"))

Note that the second argument specifies the document field to use.
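
Conceptually, the text-loading step takes each retrieved row, looks up the document by its identifier, and attaches the requested field. The sketch below illustrates that idea in plain Python. It is not PyTerrier's implementation: doc_store is a hypothetical in-memory stand-in for a document store, add_text is an illustrative helper, and result rows are shown as dictionaries rather than a DataFrame.

```python
# Hypothetical in-memory stand-in for a document store: docno -> document fields.
doc_store = {
    "d1": {"text": "how do antibiotics work"},
    "d2": {"text": "why is the sky blue"},
}


def add_text(results, store, field):
    """Return copies of the result rows with the requested document field attached."""
    return [{**row, field: store[row["docno"]][field]} for row in results]


# Toy first-stage results for one query.
results = [
    {"qid": "q1", "docno": "d2", "score": 7.1},
    {"qid": "q1", "docno": "d1", "score": 3.4},
]

with_text = add_text(results, doc_store, "text")
print(with_text[0])
```

A later pipeline stage, such as a text-based re-ranker, can then read the attached field from each row.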

More information about working with document text in PyTerrier can be found here.

Further Information