Github: allenai/ir_datasets

ir_datasets: Beta Python API

This is an experimental version of the Python API; it may be buggy and is subject to change in future versions. See here for the official Python API. For now, both versions of the Python API live side by side.

Dataset objects

Datasets can be obtained through ir_datasets.load("dataset-id") or constructed with ir_datasets.create_dataset(...). Dataset objects provide the following methods:

dataset.has_[docs|queries|qrels|scoreddocs|docpairs|qlogs]() -> bool

Returns True if this dataset provides the corresponding entity type (e.g., if dataset.has_docs() returns True, dataset.docs is available).
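The has_* checks can be combined into a small helper that reports which entity types a dataset offers. This is a minimal sketch: available_entities and StubDataset are hypothetical names introduced for illustration, and StubDataset only stands in for a real dataset object so the snippet runs without downloading anything.

```python
ENTITY_TYPES = ("docs", "queries", "qrels", "scoreddocs", "docpairs", "qlogs")


def available_entities(dataset):
    """Return the entity types for which dataset.has_<entity>() is true."""
    found = []
    for name in ENTITY_TYPES:
        # Tolerate objects that lack a given has_* method.
        checker = getattr(dataset, f"has_{name}", None)
        if callable(checker) and checker():
            found.append(name)
    return found


class StubDataset:
    """Hypothetical stand-in for a dataset object (not part of ir_datasets)."""

    def has_docs(self):
        return True

    def has_queries(self):
        return True

    def has_qrels(self):
        return False


entities = available_entities(StubDataset())
```

With a real dataset, the same helper would be called on the object returned by ir_datasets.load("dataset-id").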

iter(dataset.docs) -> iter[namedtuple]

Returns an iterator of namedtuples, where each item is a document in the collection.

len(dataset.docs) -> int

Returns the number of documents in the collection.

dataset.docs[start:stop:skip] -> iter[namedtuple]

Returns an iterator of namedtuples by index, specified by the slice given.

# First 10 documents
dataset.docs[:10]
# Last 10 documents
dataset.docs[-10:]
# Every second document, starting with the first document
dataset.docs[::2]
# Every second document, starting with the second document
dataset.docs[1::2]
# The first half of the collection
dataset.docs[:1/2]
# The middle third of collection
dataset.docs[1/3:2/3]

Note that the fancy slicing mechanics are faster and more sophisticated than itertools.islice; documents are not processed if they are skipped.
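To make the fractional forms such as [:1/2] concrete, the sketch below shows one way a slice with float bounds could be resolved to integer positions. It is an illustration only, not the library's implementation: resolve_slice is a hypothetical helper, and truncating with int() is an assumption about rounding.

```python
def resolve_slice(s, count):
    """Resolve a slice that may contain fractional bounds into a range.

    Float bounds are treated as fractions of the collection size
    (assumption: truncated toward zero); int and None bounds follow
    the usual Python slicing rules.
    """
    def to_index(value):
        if isinstance(value, float):
            return int(value * count)
        return value

    normalized = slice(to_index(s.start), to_index(s.stop), s.step)
    # slice.indices() handles None, negative values, and out-of-range bounds.
    return range(*normalized.indices(count))


first_half = resolve_slice(slice(None, 1 / 2), 12)
middle_half = resolve_slice(slice(1 / 4, 3 / 4), 12)
last_ten = resolve_slice(slice(-10, None), 12)
```

Because the result is a range of positions, a reader can jump straight to the selected documents rather than iterating over and discarding the skipped ones, which is the advantage over itertools.islice noted above.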

dataset.docs.type -> type

Returns the NamedTuple type of the items yielded by iter(dataset.docs). The available fields and their types can be found with _fields and __annotations__:

dataset.docs.type._fields
('doc_id', 'title', 'doi', 'date', 'abstract')
dataset.docs.type.__annotations__
{
  'doc_id': str,
  'title': str,
  'doi': str,
  'date': str,
  'abstract': str
}
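The same introspection works on any typing.NamedTuple, so it can be tried without loading a dataset. In the sketch below, ExampleDoc is a hypothetical type that mirrors the fields listed above, and describe_fields is an illustrative helper, not part of ir_datasets.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Hypothetical document type with the fields shown above."""
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str


def describe_fields(record_type):
    """Return (field name, type name) pairs in field order."""
    annotations = record_type.__annotations__
    return [(name, annotations[name].__name__) for name in record_type._fields]


field_summary = describe_fields(ExampleDoc)
```

Passing dataset.docs.type to the helper instead of ExampleDoc would list the fields of a real collection.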

dataset.docs.lookup(doc_ids) -> Dict[str, namedtuple]

Returns a dictionary mapping all doc_ids found in the collection to their contents.

dataset.docs.lookup_iter(doc_ids) -> Iterable[namedtuple]

Returns an iterable of all docs associated with the specified doc_ids found in the collection.
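Since lookup only returns the IDs that were found, callers often want to know which requested IDs were absent. The sketch below shows one way to do that. split_found_missing is a hypothetical helper, and stub_lookup is a stand-in with the same "found IDs only" behavior described above; with a real dataset, dataset.docs.lookup would be passed instead.

```python
def split_found_missing(lookup, doc_ids):
    """Call lookup(doc_ids) and report which requested IDs were not found."""
    found = lookup(doc_ids)
    missing = [doc_id for doc_id in doc_ids if doc_id not in found]
    return found, missing


# Stand-in store and lookup function (assumption: illustration only).
_STORE = {"d1": "first document", "d2": "second document"}


def stub_lookup(doc_ids):
    """Return a dict containing only the IDs present in the store."""
    return {doc_id: _STORE[doc_id] for doc_id in doc_ids if doc_id in _STORE}


found, missing = split_found_missing(stub_lookup, ["d1", "d9"])
```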

dataset.docs.lang -> str

Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the documents in this collection. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown.

dataset.docs.metadata -> dict

Returns available metadata about the documents from this dataset (e.g., count).

iter(dataset.queries) -> iter[namedtuple]

Returns an iterator over namedtuples representing queries in the dataset.

len(dataset.queries) -> int

Returns the number of queries in the dataset.

dataset.queries.type -> type

Returns the type of the namedtuple returned by iter(dataset.queries), including _fields and __annotations__.

dataset.queries.lang -> str

Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the queries. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown. Note that some datasets include translations as different query fields.

dataset.queries.lookup(query_ids) -> Dict[str, namedtuple]

Returns a dictionary mapping all query_ids found in the dataset to their contents.

dataset.queries.lookup_iter(query_ids) -> Iterable[namedtuple]

Returns an iterable of all queries associated with the specified query_ids found in the dataset.

dataset.queries.metadata -> dict

Returns available metadata about the queries from this dataset (e.g., count).

iter(dataset.qrels) -> iter[namedtuple]

Returns an iterator over namedtuples representing query relevance assessments in the dataset.

len(dataset.qrels) -> int

Returns the number of qrels in the dataset.

dataset.qrels.type -> type

Returns the type of the namedtuple returned by iter(dataset.qrels), including _fields and __annotations__.

dataset.qrels.defs -> dict[int, str]

Returns a mapping between relevance levels and a textual description of what each level represents (e.g., 0 representing not relevant, 1 representing possibly relevant, 2 representing definitely relevant).
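One use for such a mapping is labeling a relevance distribution. The following is a minimal sketch under stated assumptions: ExampleQrel is a hypothetical record type with a relevance field, example_defs is made-up data in the dict[int, str] shape described above, and count_by_label is an illustrative helper.

```python
from collections import Counter, namedtuple

# Hypothetical qrel record; real field names depend on the dataset.
ExampleQrel = namedtuple("ExampleQrel", ["query_id", "doc_id", "relevance"])


def count_by_label(qrels, defs):
    """Count qrels per relevance level, keyed by the level's description."""
    counts = Counter(qrel.relevance for qrel in qrels)
    # Fall back to a generic label for levels without a description.
    return {defs.get(level, f"level {level}"): n
            for level, n in sorted(counts.items())}


example_defs = {0: "not relevant", 1: "possibly relevant", 2: "definitely relevant"}
example_qrels = [
    ExampleQrel("q1", "d1", 2),
    ExampleQrel("q1", "d2", 0),
    ExampleQrel("q2", "d3", 0),
]
label_counts = count_by_label(example_qrels, example_defs)
```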

dataset.qrels.asdict() -> dict[str, dict[str, int]]

Returns a dict of dicts representing all qrels for this collection. Note that this will load all qrels into memory. The outer dict key is the query_id and the inner key is the doc_id. This is useful in tools such as pytrec_eval.
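The nested structure can be pictured by building it by hand from individual records. This sketch is not the library's code: ExampleQrel is a hypothetical record type assumed to carry query_id, doc_id, and relevance, and qrels_to_nested_dict is an illustrative helper.

```python
from collections import namedtuple

# Hypothetical qrel record; real field names depend on the dataset.
ExampleQrel = namedtuple("ExampleQrel", ["query_id", "doc_id", "relevance"])


def qrels_to_nested_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    nested = {}
    for qrel in qrels:
        nested.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return nested


nested_qrels = qrels_to_nested_dict([
    ExampleQrel("q1", "d1", 1),
    ExampleQrel("q1", "d2", 0),
    ExampleQrel("q2", "d1", 2),
])
```

The outer key is the query_id and the inner key is the doc_id, so nested_qrels["q1"]["d1"] gives the relevance of document d1 for query q1.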

dataset.qrels.metadata -> dict

Returns available metadata about the qrels from this dataset (e.g., count).

iter(dataset.scoreddocs) -> iter[namedtuple]

Returns an iterator over namedtuples representing scored docs (e.g., initial rankings for re-ranking tasks) in the dataset.
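For re-ranking, scored docs are usually grouped per query and ordered by score. The sketch below assumes each record has query_id, doc_id, and score fields; ExampleScoredDoc and top_k_per_query are hypothetical names, and actual field names should be checked with dataset.scoreddocs.type.

```python
from collections import defaultdict, namedtuple

# Hypothetical scored-doc record (assumed fields).
ExampleScoredDoc = namedtuple("ExampleScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs."""
    by_query = defaultdict(list)
    for scored in scoreddocs:
        by_query[scored.query_id].append(scored)
    return {
        query_id: [
            s.doc_id
            for s in sorted(items, key=lambda s: s.score, reverse=True)[:k]
        ]
        for query_id, items in by_query.items()
    }


initial_ranking = top_k_per_query(
    [
        ExampleScoredDoc("q1", "d1", 0.2),
        ExampleScoredDoc("q1", "d2", 0.9),
        ExampleScoredDoc("q1", "d3", 0.5),
        ExampleScoredDoc("q2", "d1", 1.5),
    ],
    k=2,
)
```

Note that this groups everything in memory, which may not suit very large scoreddocs collections.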

len(dataset.scoreddocs) -> int

Returns the number of scoreddocs in the collection.

dataset.scoreddocs.type -> type

Returns the type of the namedtuple returned by iter(dataset.scoreddocs), including _fields and __annotations__.

dataset.scoreddocs.metadata -> dict

Returns available metadata about the scoreddocs from this dataset (e.g., count).

iter(dataset.docpairs) -> iter[namedtuple]

Returns an iterator over namedtuples representing doc pairs (e.g., training pairs) in the dataset.
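Training code typically needs the text behind each pair rather than the IDs. The sketch below resolves pairs into text triples. It assumes records with query_id, doc_id_a, and doc_id_b fields; ExampleDocPair and pairs_to_text_triples are hypothetical names, and the two text dictionaries stand in for the results of the lookup methods.

```python
from collections import namedtuple

# Hypothetical doc-pair record (assumed fields).
ExampleDocPair = namedtuple("ExampleDocPair", ["query_id", "doc_id_a", "doc_id_b"])


def pairs_to_text_triples(docpairs, query_text, doc_text):
    """Yield (query, doc_a, doc_b) text triples, skipping unresolved IDs."""
    for pair in docpairs:
        try:
            yield (
                query_text[pair.query_id],
                doc_text[pair.doc_id_a],
                doc_text[pair.doc_id_b],
            )
        except KeyError:
            # An ID was missing from the supplied texts; skip this pair.
            continue


triples = list(pairs_to_text_triples(
    [ExampleDocPair("q1", "d1", "d2"), ExampleDocPair("q1", "d1", "d9")],
    query_text={"q1": "example query"},
    doc_text={"d1": "first text", "d2": "second text"},
))
```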

len(dataset.docpairs) -> int

Returns the number of docpairs in the collection.

dataset.docpairs.type -> type

Returns the type of the namedtuple returned by iter(dataset.docpairs), including _fields and __annotations__.

dataset.docpairs.metadata -> dict

Returns available metadata about the docpairs from this dataset (e.g., count).

iter(dataset.qlogs) -> iter[namedtuple]

Returns an iterator over namedtuples representing query log records in the dataset.

len(dataset.qlogs) -> int

Returns the number of qlogs in the collection.

dataset.qlogs.type -> type

Returns the type of the namedtuple returned by iter(dataset.qlogs), including _fields and __annotations__.

dataset.qlogs.metadata -> dict

Returns available metadata about the qlogs from this dataset (e.g., count).