
ir_datasets: Design

This document summarises the design decisions behind ir_datasets and the reasoning for them.

Why Python?

Python was a natural choice for building ir_datasets. Python is the language of choice for much of the recent work in Information Retrieval (particularly work that uses deep neural networks). Python also has a rich ecosystem of software packages, giving ir_datasets easy access to a variety of development and runtime tools. Finally, Python has a relatively simple syntax and code structure, simplifying the development and maintenance of the package.

The decision to build this library with Python isn't without its downsides, however. Processing some data formats is slow in Python, requiring specialised Python extensions to handle them efficiently (e.g., pyautocorpus was built to take advantage of a fast C-based library for extracting text from Wikipedia markup).

Why NamedTuples?

NamedTuples are used as the core data objects in ir_datasets. These objects are fast, lightweight, immutable, and reasonably self-documenting. Further, since they are part of the Python standard library, they do not introduce any dependencies.

NamedTuples are also very flexible, and can easily be converted into a variety of formats:

import ir_datasets
# Look up a single document by its ID in the ANTIQUE corpus.
dataset = ir_datasets.load('antique')
doc = dataset.docs.lookup('1424320_8')

# To a tuple:
tuple(doc)
# ('1424320_8', 'As others pointed out, there are[SNIP]cash.. . Regards.')

# To a dict:
doc._asdict()
# {'doc_id': '1424320_8', 'text': 'As others pointed out, there are[SNIP]cash.. . Regards.'}

# To a Pandas Dataframe:
import pandas as pd
pd.DataFrame(dataset.docs)
#            doc_id                                               text
# 0       2020338_0  A small group of politicians believed strongly...
# 1       2020338_1             Because there is a lot of oil in Iraq.
# ...           ...                                                ...
# 403664  1424320_8  As others pointed out, there are investor lend...
# 403665  1424320_9  You can finance up to 100% of the property val...

Before going with NamedTuples, we also considered several other approaches for data objects:

Why not dataclass? The recently-added Python dataclasses provide much of the same functionality as NamedTuple. However, they come with some drawbacks. They are generally slower than NamedTuples (though starting in Python 3.10, slots=True can help with some operations). NamedTuples also allow for easier transformation into other common types, including dicts and tuples. This can be particularly helpful for reducing storage overhead (e.g., the docstore for msmarco-passage is 23% smaller because the contents are encoded as NamedTuples rather than dataclasses). Finally, since dataclasses were only introduced in Python 3.7, they are incompatible with the popular 3.6 release.
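
For illustration, the conversion difference looks something like the following sketch. (The record types here are hypothetical stand-ins, not ir_datasets' own classes.)

from collections import namedtuple
from dataclasses import dataclass, astuple, asdict

DocNT = namedtuple('DocNT', ['doc_id', 'text'])

@dataclass
class DocDC:
    doc_id: str
    text: str

nt = DocNT('1424320_8', 'As others pointed out, ...')
dc = DocDC('1424320_8', 'As others pointed out, ...')

# NamedTuples are tuples, so conversions are direct:
tuple(nt)
nt._asdict()

# Dataclasses need the dataclasses helpers, which recursively copy fields:
# tuple(dc) raises TypeError: 'DocDC' object is not iterable
astuple(dc)
asdict(dc)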

Why not dict? Dictionary objects are a popular choice for easy, ad hoc data records, since they can be easily created, extended, and serialised into formats like JSON. However, they do not provide a clear, fixed definition of the available fields, so a separate definition structure would need to be provided for information like the type of each field. They are also larger in memory, and operations such as looking up items by field are slower than with NamedTuples.
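
To make the size difference concrete, here is a minimal sketch (modelled on ir_datasets' GenericDoc type; exact byte counts vary by Python version and platform):

import sys
from collections import namedtuple

GenericDoc = namedtuple('GenericDoc', ['doc_id', 'text'])

as_namedtuple = GenericDoc('1424320_8', 'As others pointed out, ...')
as_dict = {'doc_id': '1424320_8', 'text': 'As others pointed out, ...'}

sys.getsizeof(as_namedtuple)  # e.g., 56 bytes on a 64-bit CPython
sys.getsizeof(as_dict)        # e.g., 184 bytes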

Why not protobuf? Protocol Buffers are a serialisation format allowing data to be shared across systems and platforms. The Python binding provides much of the same functionality as NamedTuples, with the added benefit of a well-defined serialisation format. However, they introduce a relatively large dependency and would add non-trivial complexity to the build process (e.g., protobufs need to be compiled to generate the corresponding Python code). They also do not convert as easily to other formats (e.g., Pandas treats them as object types by default, rather than expanding a column out for each field). Per our benchmarking, they are also slightly slower to read from disk, compared to our approach with NamedTuples.

Why LZ4?

The LZ4 compression algorithm is used throughout ir_datasets when data needs to be stored. Most notably, most docstore objects (used for looking up documents by ID, iterating quickly, or slicing document iterators) compress document contents using LZ4. We found this compression format preferable to others for a variety of reasons. Most importantly, it provides by far the fastest decompression while achieving a reasonable compression ratio. It also allows for fine-grained control over the output frames, which enables fast seeking to specific offsets; this is helpful for providing functionality like fast document lookups. As a bonus, the software package is lighter-weight than competing compression libraries (e.g., zstandard, snappy).
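
The sketch below illustrates the idea (it is not ir_datasets' actual docstore code): each document is compressed into its own LZ4 frame, and a small index of byte offsets makes it possible to decompress exactly one frame per lookup.

import lz4.frame

docs = [b'first document text', b'second document text', b'third document text']

# Compress each document into its own frame, recording where it starts.
frames, index = [], []
pos = 0
for doc in docs:
    frame = lz4.frame.compress(doc)
    index.append((pos, len(frame)))
    frames.append(frame)
    pos += len(frame)
blob = b''.join(frames)  # in practice, written to disk

# Random access: decompress only the single small frame for document 2.
start, length = index[2]
lz4.frame.decompress(blob[start:start + length])
# b'third document text'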

In some cases, we cannot avoid using other compression techniques. For instance, the ClueWeb datasets would be too large and take too long to re-compress as LZ4. To provide fast lookup and fancy slicing functionality for these datasets, we built the zlib-state library, which allows for more flexible seeking to particular points in gzip files. By distributing files that provide periodic checkpoints, efficient corpus processing is possible while managing storage costs.
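
The checkpointing idea can be sketched with the standard zlib module alone. (This is only an in-memory illustration; the point of zlib-state is that it can persist and restore decompressor state across processes, which zlib's Decompress.copy() cannot.)

import zlib

data = b'\n'.join(b'record %06d' % i for i in range(100_000))
compressed = zlib.compress(data)

CHUNK = 64 * 1024
checkpoints = []  # (output offset, input offset, decompressor snapshot)

# First pass: decompress the whole stream once, snapshotting the
# decompressor state at each input chunk boundary.
d = zlib.decompressobj()
out_pos = 0
for in_pos in range(0, len(compressed), CHUNK):
    checkpoints.append((out_pos, in_pos, d.copy()))
    out_pos += len(d.decompress(compressed[in_pos:in_pos + CHUNK]))

def read_at(target, length):
    # Resume from the nearest checkpoint rather than the start of the stream.
    out0, in0, snap = max((c for c in checkpoints if c[0] <= target),
                          key=lambda c: c[0])
    d = snap.copy()  # copy again so the checkpoint stays reusable
    buf, in_pos = b'', in0
    while len(buf) < (target - out0) + length and in_pos < len(compressed):
        buf += d.decompress(compressed[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    return buf[target - out0:target - out0 + length]

read_at(700_000, 13)
# b'record 050000'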