ir_datasets: Beta Python API
Datasets can be obtained through ir_datasets.load("dataset-id") or constructed with ir_datasets.create_dataset(...). Dataset objects provide the following methods:
dataset.has_docs() -> bool
Returns True if this dataset supports the dataset.docs_* methods.
dataset.has_queries() -> bool
Returns True if this dataset supports the dataset.queries_* methods.
dataset.has_qrels() -> bool
Returns True if this dataset supports the dataset.qrels_* methods.
dataset.has_scoreddocs() -> bool
Returns True if this dataset supports the dataset.scoreddocs_* methods.
dataset.has_docpairs() -> bool
Returns True if this dataset supports the dataset.docpairs_* methods.
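The has_*() checks above can be combined to find out what a dataset offers before iterating over it. The following is a minimal sketch; supported_record_types is a hypothetical helper name, not part of ir_datasets, and it relies only on the five has_*() methods listed above.

```python
def supported_record_types(dataset):
    """Return the names of the record types that `dataset` reports supporting."""
    checks = {
        "docs": dataset.has_docs,
        "queries": dataset.has_queries,
        "qrels": dataset.has_qrels,
        "scoreddocs": dataset.has_scoreddocs,
        "docpairs": dataset.has_docpairs,
    }
    # Call each has_*() method and keep the names for which it returns True.
    return [name for name, check in checks.items() if check()]
```

With the library installed, passing the object returned by ir_datasets.load("dataset-id") should yield a list such as ["docs", "queries", "qrels"], depending on the dataset.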
iter(dataset.docs) -> iter[namedtuple]
Returns an iterator of namedtuples, where each item is a document in the collection.
len(dataset.docs) -> int
Returns the number of documents in the collection.
dataset.docs[start:stop:skip] -> iter[namedtuple]
Returns an iterator of namedtuples selected by index, as specified by the given slice.
# First 10 documents
dataset.docs[:10]
# Last 10 documents
dataset.docs[-10:]
# Every second document, starting with the first document
dataset.docs[::2]
# Every second document, starting with the second document
dataset.docs[1::2]
# The first half of the collection
dataset.docs[:1/2]
# The middle third of the collection
dataset.docs[1/3:2/3]
Note that these slicing mechanics are faster and more sophisticated than itertools.islice; documents that are skipped are not processed.
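To see why this matters, the sketch below counts how many records a plain generator has to produce when itertools.islice skips ahead. It uses only the standard library; parse_all is a hypothetical stand-in for a document parser and is not how ir_datasets reads its files.

```python
import itertools


def count_parsed(skip, take, total=1000):
    """Count how many records a generator produces when islice skips ahead."""
    parsed = 0

    def parse_all():
        # Hypothetical parser: every yielded record costs one unit of work.
        nonlocal parsed
        for i in range(total):
            parsed += 1
            yield f"doc{i}"

    wanted = list(itertools.islice(parse_all(), skip, skip + take))
    return len(wanted), parsed


# islice must consume the 900 skipped records before reaching the 10 wanted ones.
print(count_parsed(skip=900, take=10))  # (10, 910)
```

Slicing dataset.docs directly is described above as avoiding that work for the skipped documents.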
dataset.docs.type -> type
Returns the NamedTuple type of the records yielded by iter(dataset.docs).
The available fields and type information can be found with _fields and __annotations__:
dataset.docs.type._fields
('doc_id', 'title', 'doi', 'date', 'abstract')
dataset.docs.type.__annotations__
{
'doc_id': str,
'title': str,
'doi': str,
'date': str,
'abstract': str
}
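The same introspection works on any NamedTuple type, which makes it easy to write code that adapts to a dataset's fields. Below is a minimal sketch; ExampleDoc is a hypothetical type that mirrors the fields shown above, and describe_fields is an illustrative helper rather than a library function.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Hypothetical record type mirroring the fields listed above."""
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str


def describe_fields(record_type):
    """Return (field name, type name) pairs for a NamedTuple type."""
    annotations = getattr(record_type, "__annotations__", {})
    return [
        (name, getattr(annotations.get(name), "__name__", "unknown"))
        for name in record_type._fields
    ]


print(describe_fields(ExampleDoc))
```

In practice, dataset.docs.type would take the place of ExampleDoc.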
dataset.docs.lookup(doc_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all doc_ids found in the collection to their contents.
dataset.docs.lookup_iter(doc_ids) -> Iterable[namedtuple]
Returns an iterable of all docs associated with the specified doc_ids found in the collection.
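Because lookup only returns entries for doc_ids found in the collection, it can be worth checking which requested IDs are absent. The following is a minimal sketch under that reading; split_found_missing is a hypothetical helper name, and docs stands for an object such as dataset.docs.

```python
def split_found_missing(docs, doc_ids):
    """Look up doc_ids and report which ones the collection did not contain."""
    doc_ids = list(doc_ids)
    # Per the description above, the result maps each found doc_id to its record.
    found = docs.lookup(doc_ids)
    missing = [doc_id for doc_id in doc_ids if doc_id not in found]
    return found, missing
```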
dataset.docs.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the documents in this collection. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown.
iter(dataset.queries) -> iter[namedtuple]
Returns an iterator over namedtuples representing queries in the dataset.
len(dataset.queries) -> int
Returns the number of queries in the dataset.
dataset.queries.type -> type
Returns the type of the namedtuple yielded by iter(dataset.queries), including _fields and __annotations__.
dataset.queries.lang -> str
Returns the two-character ISO 639-1 language code (e.g., "en" for English) of the queries. Returns None if there are multiple languages, a language not represented by an ISO 639-1 code, or the language is otherwise unknown. Note that some datasets include translations as different query fields.
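The two lang attributes can be compared to tell whether a dataset pairs queries and documents in the same language. This is a minimal sketch; language_setting is a hypothetical helper, and it treats None on either side as unknown, following the descriptions above.

```python
def language_setting(dataset):
    """Classify a dataset as 'monolingual', 'cross-lingual', or 'unknown'."""
    docs_lang = dataset.docs.lang
    queries_lang = dataset.queries.lang
    # None means mixed, unrepresentable, or unknown languages, so no claim is made.
    if docs_lang is None or queries_lang is None:
        return "unknown"
    return "monolingual" if docs_lang == queries_lang else "cross-lingual"
```

Datasets that carry translations as separate query fields may still report a single queries.lang value, so the result is only a first approximation.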
dataset.queries.lookup(query_ids) -> Dict[str, namedtuple]
Returns a dictionary mapping all query_ids found in the dataset to their contents.
dataset.queries.lookup_iter(query_ids) -> Iterable[namedtuple]
Returns an iterable of all queries associated with the specified query_ids found in the dataset.
iter(dataset.qrels) -> iter[namedtuple]
Returns an iterator over namedtuples representing query relevance assessments in the dataset.
len(dataset.qrels) -> int
Returns the number of qrels in the dataset.
dataset.qrels.type -> type
Returns the type of the namedtuple yielded by iter(dataset.qrels), including _fields and __annotations__.
dataset.qrels.defs -> dict[int, str]
Returns a mapping between relevance levels and a textual description of what each level represents (e.g., 0 representing not relevant, 1 representing possibly relevant, 2 representing definitely relevant).
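One use of this mapping is to label a summary of the judgments. The sketch below counts qrels per relevance level and attaches each level's description. It assumes each qrel record exposes a relevance attribute, which is not stated above and should be checked against dataset.qrels.type._fields; relevance_histogram is a hypothetical helper name.

```python
from collections import Counter


def relevance_histogram(qrels, defs):
    """Count judgments per relevance level, labelled with the level's description."""
    # `relevance` is an assumed field name on each qrel record.
    counts = Counter(qrel.relevance for qrel in qrels)
    return {
        level: (defs.get(level, "no description"), count)
        for level, count in sorted(counts.items())
    }
```

It would be called with iter(dataset.qrels) and dataset.qrels.defs.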
dataset.qrels.asdict() -> dict[str, dict[str, int]]
Returns a dict of dicts representing all qrels for this collection. Note that this will load all qrels into memory. The outer dict key is the query_id and the inner key is the doc_id. This is useful in tools such as pytrec_eval.
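The nested structure can be pictured by building it by hand from an iterator of qrels. This is a sketch of the shape only, not the library's implementation, and it assumes each record has query_id, doc_id, and relevance attributes; qrels_to_nested_dict is a hypothetical name.

```python
def qrels_to_nested_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    nested = {}
    for qrel in qrels:
        # Field names are assumptions; check dataset.qrels.type._fields.
        nested.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return nested
```

Looking up a single judgment is then nested[query_id][doc_id].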
iter(dataset.scoreddocs) -> iter[namedtuple]
Returns an iterator over namedtuples representing scored docs (e.g., initial rankings for re-ranking tasks) in the dataset.
dataset.scoreddocs.type -> type
Returns the type of the namedtuple yielded by iter(dataset.scoreddocs), including _fields and __annotations__.
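For a re-ranking task, scored docs are usually grouped by query and ordered by score. The following is a minimal sketch, assuming each record has query_id, doc_id, and score attributes (confirm with dataset.scoreddocs.type._fields); top_k_per_query is a hypothetical helper.

```python
from collections import defaultdict


def top_k_per_query(scoreddocs, k=10):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs per query."""
    by_query = defaultdict(list)
    for record in scoreddocs:
        # query_id, doc_id and score are assumed field names.
        by_query[record.query_id].append((record.score, record.doc_id))
    return {
        query_id: [
            doc_id
            for _, doc_id in sorted(pairs, key=lambda pair: pair[0], reverse=True)[:k]
        ]
        for query_id, pairs in by_query.items()
    }
```

Note that this holds every scored doc in memory, which may be impractical for large initial rankings.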
iter(dataset.docpairs) -> iter[namedtuple]
Returns an iterator over namedtuples representing doc pairs (e.g., training pairs) in the dataset.
dataset.docpairs.type -> type
Returns the type of the namedtuple yielded by iter(dataset.docpairs), including _fields and __annotations__.
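Doc pairs typically carry only identifiers, so training code needs to resolve them into records. The sketch below does this with the lookup methods described earlier. It assumes each pair has query_id, doc_id_a, and doc_id_b attributes, which this page does not state; materialize_pairs is a hypothetical helper.

```python
import itertools


def materialize_pairs(dataset, limit=3):
    """Resolve the first `limit` doc pairs into (query, doc_a, doc_b) records."""
    pairs = list(itertools.islice(iter(dataset.docpairs), limit))
    # query_id, doc_id_a and doc_id_b are assumed field names;
    # check dataset.docpairs.type._fields before relying on them.
    queries = dataset.queries.lookup(sorted({p.query_id for p in pairs}))
    docs = dataset.docs.lookup(
        sorted({p.doc_id_a for p in pairs} | {p.doc_id_b for p in pairs})
    )
    resolved = []
    for p in pairs:
        # Skip pairs whose identifiers were not found by lookup.
        if p.query_id in queries and p.doc_id_a in docs and p.doc_id_b in docs:
            resolved.append((queries[p.query_id], docs[p.doc_id_a], docs[p.doc_id_b]))
    return resolved
```

Batching the lookups this way avoids one lookup call per pair.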