ir_datasets: Beta Python API
Datasets can be obtained through ir_datasets.load("dataset-id")
or constructed with ir_datasets.create_dataset(...). Dataset objects provide the
following methods:
dataset.has_[docs|queries|qrels|scoreddocs|docpairs|qlogs]() -> bool
Returns True if this dataset provides the corresponding entity type
(e.g., if dataset.has_docs() returns True, dataset.docs is available).
iter(dataset.docs) -> iter[namedtuple]
Returns an iterator of namedtuples, where each item is a document in the collection.
len(dataset.docs) -> int
Returns the number of documents in the collection.
dataset.docs[start:stop:skip] -> iter[namedtuple]
Returns an iterator of namedtuples by index, specified by the slice given.
# First 10 documents
dataset.docs[:10]
# Last 10 documents
dataset.docs[-10:]
# Every 2 documents
dataset.docs[::2]
# Every 2 documents, starting with the second document
dataset.docs[1::2]
# The first half of the collection
dataset.docs[:1/2]
# The middle third of the collection
dataset.docs[1/3:2/3]
Note that the fancy slicing mechanics are faster and more sophisticated than
itertools.islice; documents are not processed if they are skipped.
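One plausible way to read the fractional slices shown above is that a float bound is scaled by the collection length before any document is read, so the positions to fetch are known up front and skipped documents never need to be touched. The sketch below illustrates that idea only; resolve_slice is a hypothetical helper, and the truncating rounding is an assumption rather than the library's documented behavior.

```python
def resolve_slice(s, count):
    """Map a slice with int or float bounds onto a range over `count` items.

    Assumption: a float bound is a fraction of the collection length and is
    truncated to an integer index. The library's exact rounding may differ.
    """

    def to_index(bound):
        # Scale fractional bounds; leave ints and None untouched.
        return int(bound * count) if isinstance(bound, float) else bound

    resolved = slice(to_index(s.start), to_index(s.stop), s.step)
    start, stop, step = resolved.indices(count)
    return range(start, stop, step)


# Positions selected from a 10-document collection.
print(list(resolve_slice(slice(None, 1 / 2), 10)))  # the first half
print(list(resolve_slice(slice(-3, None), 10)))     # the last 3 documents
print(list(resolve_slice(slice(1, None, 2), 10)))   # every 2, from the second
```

Since the result is a range of positions, only those positions would have to be read from the underlying store, in contrast to itertools.islice, which consumes every item up to the stop point.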
dataset.docs.type -> type
Returns the NamedTuple type that the iter(dataset.docs) returns.
The available fields and type information can be found with _fields and __annotations__:
dataset.docs.type._fields
('doc_id', 'title', 'doi', 'date', 'abstract')
dataset.docs.type.__annotations__
{
'doc_id': str,
'title': str,
'doi': str,
'date': str,
'abstract': str
}
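The _fields and __annotations__ attributes are standard on Python NamedTuple types, so they can be inspected generically. The sketch below uses a hypothetical ExampleDoc type that mirrors the fields listed above; it is not the type any particular dataset returns, and describe_fields is an illustrative helper.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Hypothetical document type mirroring the fields listed above."""

    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str


def describe_fields(doc_type):
    """Return (field name, type name) pairs for a NamedTuple type."""
    annotations = doc_type.__annotations__
    return [
        # Fall back to repr() for annotations that have no __name__.
        (name, getattr(annotations[name], "__name__", repr(annotations[name])))
        for name in doc_type._fields
    ]


for name, type_name in describe_fields(ExampleDoc):
    print(f"{name}: {type_name}")
```

The same helper should work when passed dataset.docs.type, assuming that type carries annotations for all of its fields.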
dataset.docs.lookup(doc_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all doc_ids found in the collection to their contents.
dataset.docs.lookup_iter(doc_ids) -> Iterable[namedtuple]
Returns an iterable of all docs associated with the specified doc_ids found in the collection.
dataset.docs.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the documents in this collection. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown.
dataset.docs.metadata -> dict
Returns available metadata about the documents from this dataset (e.g., count).
iter(dataset.queries) -> iter[namedtuple]
Returns an iterator over namedtuples representing queries in the dataset.
len(dataset.queries) -> int
Returns the number of queries in the dataset.
dataset.queries.type -> type
Returns the type of the namedtuple returned by iter(dataset.queries),
including _fields and __annotations__.
dataset.queries.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the queries. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown. Note that some datasets include translations as different query fields.
dataset.queries.lookup(query_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all query_ids found in the dataset to their contents.
dataset.queries.lookup_iter(query_ids) -> Iterable[namedtuple]
Returns an iterable of all queries associated with the specified query_ids found in the dataset.
dataset.queries.metadata -> dict
Returns available metadata about the queries from this dataset (e.g., count).
iter(dataset.qrels) -> iter[namedtuple]
Returns an iterator over namedtuples representing query relevance assessments in the dataset.
len(dataset.qrels) -> int
Returns the number of qrels in the dataset.
dataset.qrels.type -> type
Returns the type of the namedtuple returned by iter(dataset.qrels),
including _fields and __annotations__.
dataset.qrels.defs -> dict[int, str]
Returns a mapping between relevance levels and a textual description of what each level represents (e.g., 0 representing not relevant, 1 representing possibly relevant, 2 representing definitely relevant).
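A mapping of this kind can be used to summarize judgments with readable labels. The sketch below is illustrative only: ExampleQrel is a hypothetical record type, the field name relevance is an assumption about how the level is stored, and the defs values simply reuse the example levels given above.

```python
from collections import Counter, namedtuple

# Hypothetical qrel record; the `relevance` field name is an assumption.
ExampleQrel = namedtuple("ExampleQrel", ["query_id", "doc_id", "relevance"])


def count_by_label(qrels, defs):
    """Count judgments per relevance level, keyed by the level's description."""
    counts = Counter(qrel.relevance for qrel in qrels)
    # Levels without a description fall back to a generic label.
    return {
        defs.get(level, f"level {level}"): n
        for level, n in sorted(counts.items())
    }


example_defs = {0: "not relevant", 1: "possibly relevant", 2: "definitely relevant"}
example_qrels = [
    ExampleQrel("q1", "d1", 2),
    ExampleQrel("q1", "d2", 0),
    ExampleQrel("q2", "d3", 0),
]
print(count_by_label(example_qrels, example_defs))
```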
dataset.qrels.asdict() -> dict[str, dict[str, int]]
Returns a dict of dicts representing all qrels for this collection. Note
that this will load all qrels into memory. The outer dict key is the
query_id and the inner key is the doc_id.
This is useful in tools such as pytrec_eval.
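To make the nested shape concrete, the sketch below builds an equivalent dict of dicts from a list of qrel records. It illustrates the structure only and is not the library's implementation; ExampleQrel and qrels_to_dict are hypothetical names, and the relevance field name is an assumption.

```python
from collections import namedtuple

# Hypothetical qrel record; the `relevance` field name is an assumption.
ExampleQrel = namedtuple("ExampleQrel", ["query_id", "doc_id", "relevance"])


def qrels_to_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    result = {}
    for qrel in qrels:
        # Create the inner dict on first sight of a query_id.
        result.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return result


nested = qrels_to_dict([
    ExampleQrel("q1", "d1", 2),
    ExampleQrel("q1", "d2", 0),
    ExampleQrel("q2", "d3", 1),
])
print(nested["q1"]["d1"])
```

Evaluation tools that take relevance judgments as a dict keyed by query and then by document can consume this shape directly.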
dataset.qrels.metadata -> dict
Returns available metadata about the qrels from this dataset (e.g., count).
iter(dataset.scoreddocs) -> iter[namedtuple]
Returns an iterator over namedtuples representing scored docs (e.g., initial rankings for re-ranking tasks) in the dataset.
len(dataset.scoreddocs) -> int
Returns the number of scoreddocs in the collection.
dataset.scoreddocs.type -> type
Returns the type of the namedtuple returned by iter(dataset.scoreddocs),
including _fields and __annotations__.
dataset.scoreddocs.metadata -> dict
Returns available metadata about the scoreddocs from this dataset (e.g., count).
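For a re-ranking task, scored docs are typically grouped into one ranked list per query. The following sketch shows one way to do that, assuming each record exposes query_id, doc_id, and score fields; ExampleScoredDoc and rankings_by_query are hypothetical names used for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical scored-doc record; the field names are assumptions.
ExampleScoredDoc = namedtuple("ExampleScoredDoc", ["query_id", "doc_id", "score"])


def rankings_by_query(scoreddocs):
    """Group scored docs by query and order each group by descending score."""
    by_query = defaultdict(list)
    for scored in scoreddocs:
        by_query[scored.query_id].append(scored)
    return {
        query_id: [s.doc_id for s in sorted(items, key=lambda s: s.score, reverse=True)]
        for query_id, items in by_query.items()
    }


initial = rankings_by_query([
    ExampleScoredDoc("q1", "d1", 1.5),
    ExampleScoredDoc("q1", "d2", 3.0),
    ExampleScoredDoc("q2", "d3", 0.2),
])
print(initial["q1"])
```

Note that this loads every record into memory, which may not be practical for large collections.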
iter(dataset.docpairs) -> iter[namedtuple]
Returns an iterator over namedtuples representing doc pairs (e.g., training pairs) in the dataset.
len(dataset.docpairs) -> int
Returns the number of docpairs in the collection.
dataset.docpairs.type -> type
Returns the type of the namedtuple returned by iter(dataset.docpairs),
including _fields and __annotations__.
dataset.docpairs.metadata -> dict
Returns available metadata about the docpairs from this dataset (e.g., count).
iter(dataset.qlogs) -> iter[namedtuple]
Returns an iterator over namedtuples representing query log records in the dataset.
len(dataset.qlogs) -> int
Returns the number of qlogs in the collection.
dataset.qlogs.type -> type
Returns the type of the namedtuple returned by iter(dataset.qlogs),
including _fields and __annotations__.
dataset.qlogs.metadata -> dict
Returns available metadata about the qlogs from this dataset (e.g., count).