GitHub: datasets/cranfield.py

ir_datasets: Cranfield

Index
  1. cranfield

"cranfield"

A small corpus of 1,400 scientific abstracts.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cranfield")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export cranfield queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cranfield')
index_ref = pt.IndexRef.of('./indices/cranfield') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
CranfieldDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. author: str
  5. bib: str
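
Because each document is a namedtuple, its fields can be read by name or unpacked by position. The following is a minimal sketch of that access pattern. It assumes a locally defined stand-in type, `DocRecord`, that mirrors the field order listed above rather than importing the class from ir_datasets. The helper `to_index_text` and the sample values are hypothetical and not taken from the corpus.

```python
from typing import NamedTuple


class DocRecord(NamedTuple):
    # Local stand-in mirroring the CranfieldDoc field order listed above.
    doc_id: str
    title: str
    text: str
    author: str
    bib: str


def to_index_text(doc: DocRecord) -> str:
    """Join title and text into one string, skipping empty parts."""
    parts = [doc.title.strip(), doc.text.strip()]
    return " ".join(p for p in parts if p)


# Invented sample record, for illustration only.
sample = DocRecord(
    doc_id="d1",
    title="a sample title",
    text="a sample abstract .",
    author="a. author",
    bib="j. sample 1, 1950",
)

combined = to_index_text(sample)  # access by field name inside the helper
doc_id, title, *_ = sample        # positional unpacking also works
```

Joining the title and text is only one possible choice; the fields can equally be kept separate when the indexer supports fielded documents.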

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cranfield")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, author, bib>

You can find more details about the Python API here.

CLI
ir_datasets export cranfield docs
[doc_id]    [title]    [text]    [author]    [bib]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cranfield')
# Index cranfield
indexer = pt.IterDictIndexer('./indices/cranfield')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'author', 'bib'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str
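
Evaluation code often wants judgments grouped per query rather than as a flat stream. The sketch below shows one way to build such a nested mapping from records shaped like the type above. It assumes a local stand-in namedtuple `Qrel` with the same four fields; the helper name `group_by_query` and the toy judgments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Qrel(NamedTuple):
    # Local stand-in mirroring the TrecQrel field order listed above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def group_by_query(qrels: Iterable[Qrel]) -> Dict[str, Dict[str, int]]:
    """Build {query_id: {doc_id: relevance}} from a flat stream of qrels."""
    grouped: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Invented judgments, for illustration only.
toy_qrels = [
    Qrel("1", "184", 2, "0"),
    Qrel("1", "29", 2, "0"),
    Qrel("2", "12", 3, "0"),
    Qrel("2", "15", -1, "0"),
]

by_query = group_by_query(toy_qrels)
```

The same function should accept the records yielded by `dataset.qrels_iter()`, since it relies only on the three field names `query_id`, `doc_id`, and `relevance`.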

Relevance levels

Rel.    Definition
-1      References of no interest.
1       References of minimum interest, for example, those that have been included from an historical viewpoint.
2       References which were useful, either as general background to the work or as suggesting methods of tackling certain aspects of the work.
3       References of a high degree of relevance, the lack of which either would have made the research impracticable or would have resulted in a considerable amount of extra work.
4       References which are a complete answer to the question.
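
Binary measures need a cutoff on this graded scale, and the -1 level means a simple "non-zero" check would wrongly count documents of no interest as relevant. The sketch below filters on an explicit minimum level instead. The helper `relevant_docs`, the default threshold of 1, and the toy judgments are assumptions made for illustration, not part of the dataset or of ir_datasets.

```python
from typing import Dict, Set


def relevant_docs(judgments: Dict[str, int], min_rel: int = 1) -> Set[str]:
    """Return the doc_ids judged at or above min_rel.

    With the default of 1, levels 1-4 count as relevant and -1 does not.
    """
    return {doc_id for doc_id, rel in judgments.items() if rel >= min_rel}


# Invented judgments for one query, for illustration only.
toy_judgments = {"d1": -1, "d2": 1, "d3": 3, "d4": 4}

lenient = relevant_docs(toy_judgments)            # any level of interest
strict = relevant_docs(toy_judgments, min_rel=3)  # highly relevant or complete answers
```

Which threshold is appropriate depends on the experiment; it is worth reporting alongside any binary measure.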

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cranfield")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cranfield qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cranfield')
index_ref = pt.IndexRef.of('./indices/cranfield') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.