ir_datasets: Beta Python API
Datasets can be obtained through ir_datasets.load("dataset-id") or constructed with ir_datasets.create_dataset(...). Dataset objects provide the following methods:
dataset.has_[docs|queries|qrels|scoreddocs|docpairs|qlogs]() -> bool
Returns True if this dataset provides the corresponding entity type (e.g., if dataset.has_docs() returns True, dataset.docs is available).
iter(dataset.docs) -> iter[namedtuple]
Returns an iterator of namedtuples, where each item is a document in the collection.
len(dataset.docs) -> int
Returns the number of documents in the collection.
dataset.docs[start:stop:skip] -> iter[namedtuple]
Returns an iterator of namedtuples by index, as specified by the given slice.
# First 10 documents
dataset.docs[:10]
# Last 10 documents
dataset.docs[-10:]
# Every other document, starting with the first document
dataset.docs[::2]
# Every other document, starting with the second document
dataset.docs[1::2]
# The first half of the collection
dataset.docs[:1/2]
# The middle third of collection
dataset.docs[1/3:2/3]
Note that this slicing is faster and more sophisticated than itertools.islice; documents that are skipped are not processed.
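The fractional bounds can be read as positions relative to the collection size. The sketch below shows one way such bounds might be resolved to integer indices on a plain list. The rule used here (fraction times count, truncated) is an assumption and not necessarily what the library does, and the helpers resolve_bound and fancy_slice are hypothetical. The sketch also does not reproduce the efficiency of skipping unprocessed documents.

```python
def resolve_bound(bound, count):
    """Map a slice bound to an integer index.

    Assumed rule: a float bound is a fraction of the collection size,
    truncated toward zero. Integer and None bounds pass through unchanged.
    """
    if isinstance(bound, float):
        return int(bound * count)
    return bound


def fancy_slice(items, start=None, stop=None, step=None):
    """Slice a list using integer or fractional start/stop bounds."""
    count = len(items)
    bounds = slice(resolve_bound(start, count), resolve_bound(stop, count), step)
    return items[bounds]


docs = list(range(12))  # stand-in for a collection of 12 documents
print(fancy_slice(docs, stop=1/2))         # first half
print(fancy_slice(docs, 1/3, 2/3))         # middle third
print(fancy_slice(docs, start=-3))         # last 3 documents
print(fancy_slice(docs, start=1, step=2))  # every other document from the second
```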
dataset.docs.type -> type
Returns the NamedTuple type that iter(dataset.docs) returns. The available fields and type information can be found with _fields and __annotations__:
dataset.docs.type._fields
('doc_id', 'title', 'doi', 'date', 'abstract')
dataset.docs.type.__annotations__
{
'doc_id': str,
'title': str,
'doi': str,
'date': str,
'abstract': str
}
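To illustrate how _fields and __annotations__ behave, the sketch below defines a NamedTuple with the same fields as the output above and inspects it. ExampleDoc and describe are hypothetical names used only for this illustration; a real dataset supplies its own type through dataset.docs.type.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Hypothetical document type mirroring the fields listed above."""
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str


def describe(doc_type):
    """Pair each field name with the name of its annotated type."""
    annotations = doc_type.__annotations__
    return {field: annotations[field].__name__ for field in doc_type._fields}


print(ExampleDoc._fields)
print(describe(ExampleDoc))

# Fields are accessible by name or by position on each instance.
doc = ExampleDoc("d1", "A title", "10.0000/example", "2020-01-01", "An abstract.")
print(doc.title, doc[0])
```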
dataset.docs.lookup(doc_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all doc_ids found in the collection to their contents.
dataset.docs.lookup_iter(doc_ids) -> Iterable[namedtuple]
Returns an iterable of all docs associated with the specified doc_ids found in the collection.
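Because lookup is described as returning only the doc_ids found in the collection, a caller may want to detect the IDs that were not returned. The sketch below shows one way to do that. The helper split_found_missing and the stand-in class are hypothetical, and the assumption that missing IDs are simply omitted from the result is inferred from the description above.

```python
from collections import namedtuple

ExampleDoc = namedtuple("ExampleDoc", ["doc_id", "title"])  # hypothetical type


class _StandInDocs:
    """Hypothetical stand-in for dataset.docs, backed by an in-memory dict."""

    def __init__(self, docs):
        self._by_id = {doc.doc_id: doc for doc in docs}

    def lookup(self, doc_ids):
        # Assumed behavior: IDs absent from the collection are omitted.
        return {d: self._by_id[d] for d in doc_ids if d in self._by_id}

    def lookup_iter(self, doc_ids):
        return (self._by_id[d] for d in doc_ids if d in self._by_id)


def split_found_missing(docs, doc_ids):
    """Separate the requested IDs into found documents and missing IDs."""
    found = docs.lookup(doc_ids)
    missing = [doc_id for doc_id in doc_ids if doc_id not in found]
    return found, missing


docs = _StandInDocs([ExampleDoc("d1", "First"), ExampleDoc("d2", "Second")])
found, missing = split_found_missing(docs, ["d1", "d3"])
print(found["d1"].title, missing)
```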
dataset.docs.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the documents in this collection. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown.
dataset.docs.metadata -> dict
Returns available metadata about the documents from this dataset (e.g., count).
iter(dataset.queries) -> iter[namedtuple]
Returns an iterator over namedtuples representing queries in the dataset.
len(dataset.queries) -> int
Returns the number of queries in the dataset.
dataset.queries.type -> type
Returns the type of the namedtuple returned by iter(dataset.queries), including _fields and __annotations__.
dataset.queries.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the queries. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown. Note that some datasets include translations as different query fields.
dataset.queries.lookup(query_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all query_ids found in the dataset to their contents.
dataset.queries.lookup_iter(query_ids) -> Iterable[namedtuple]
Returns an iterable of all queries associated with the specified query_ids found in the dataset.
dataset.queries.metadata -> dict
Returns available metadata about the queries from this dataset (e.g., count).
iter(dataset.qrels) -> iter[namedtuple]
Returns an iterator over namedtuples representing query relevance assessments in the dataset.
len(dataset.qrels) -> int
Returns the number of qrels in the dataset.
dataset.qrels.type -> type
Returns the type of the namedtuple returned by iter(dataset.qrels), including _fields and __annotations__.
dataset.qrels.defs -> dict[int, str]
Returns a mapping between relevance levels and a textual description of what each level represents (e.g., 0 representing not relevant, 1 representing possibly relevant, 2 representing definitely relevant).
dataset.qrels.asdict() -> dict[str, dict[str, int]]
Returns a dict of dicts representing all qrels for this collection. Note that this will load all qrels into memory. The outer dict key is the query_id and the inner key is the doc_id. This is useful in tools such as pytrec_eval.
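The nested structure can be pictured by building it by hand from an iterable of qrels. The sketch below is not the library's implementation. ExampleQrel and qrels_to_dict are hypothetical names, and the field name relevance is an assumption based on the int values in the return type above.

```python
from collections import namedtuple

# Hypothetical qrel type; real datasets define their own via dataset.qrels.type.
ExampleQrel = namedtuple("ExampleQrel", ["query_id", "doc_id", "relevance"])


def qrels_to_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    result = {}
    for qrel in qrels:
        result.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return result


qrels = [
    ExampleQrel("q1", "d1", 2),
    ExampleQrel("q1", "d2", 0),
    ExampleQrel("q2", "d1", 1),
]
nested = qrels_to_dict(qrels)
print(nested)
```

Looking up a single judgment is then a matter of two key accesses, for example nested["q1"]["d1"].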
dataset.qrels.metadata -> dict
Returns available metadata about the qrels from this dataset (e.g., count).
iter(dataset.scoreddocs) -> iter[namedtuple]
Returns an iterator over namedtuples representing scored docs (e.g., initial rankings for re-ranking tasks) in the dataset.
len(dataset.scoreddocs) -> int
Returns the number of scoreddocs in the collection.
dataset.scoreddocs.type -> type
Returns the type of the namedtuple returned by iter(dataset.scoreddocs), including _fields and __annotations__.
dataset.scoreddocs.metadata -> dict
Returns available metadata about the scoreddocs from this dataset (e.g., count).
iter(dataset.docpairs) -> iter[namedtuple]
Returns an iterator over namedtuples representing doc pairs (e.g., training pairs) in the dataset.
len(dataset.docpairs) -> int
Returns the number of docpairs in the collection.
dataset.docpairs.type -> type
Returns the type of the namedtuple returned by iter(dataset.docpairs), including _fields and __annotations__.
dataset.docpairs.metadata -> dict
Returns available metadata about the docpairs from this dataset (e.g., count).
iter(dataset.qlogs) -> iter[namedtuple]
Returns an iterator over namedtuples representing query log records in the dataset.
len(dataset.qlogs) -> int
Returns the number of qlogs in the collection.
dataset.qlogs.type -> type
Returns the type of the namedtuple returned by iter(dataset.qlogs), including _fields and __annotations__.
dataset.qlogs.metadata -> dict
Returns available metadata about the qlogs from this dataset (e.g., count).