ir_datasets: PubMed Central (TREC CDS)
Bio-medical articles from PubMed Central. Currently, this only includes the subsets used for the TREC Clinical Decision Support (CDS) 2014-16 tasks.
Subset of PMC articles used for the TREC 2014 and 2015 tasks (v1). Includes titles, abstracts, and full text. Collected from the open access segment on January 21, 2014.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>
You can find more details about the Python API here.
ir_datasets export pmc/v1 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...
You can find more details about the CLI here.
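Assuming the export is tab-separated with the column order shown above, the rows can be consumed with Python's csv module. A minimal sketch (the sample row below is made up for illustration):

```python
import csv
import io

# Hypothetical sample of `ir_datasets export pmc/v1 docs` output,
# tab-separated in the order: doc_id, journal, title, abstract, body.
sample = "3001234\tBMC Med\tA title\tAn abstract\tBody text\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
fields = ["doc_id", "journal", "title", "abstract", "body"]
docs = [dict(zip(fields, row)) for row in reader]

print(docs[0]["doc_id"])  # first column is the PMC document id
```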
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.pmc.v1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 733111,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
The TREC Clinical Decision Support (CDS) track from 2014.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, description, summary>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2014 queries
[query_id]    [type]    [description]    [summary]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.pmc.v1.trec-cds-2014.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from pmc/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2014 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.pmc.v1.trec-cds-2014')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 35K | 91.2% | 
| 1 | possibly relevant | 1.7K | 4.4% | 
| 2 | definitely relevant | 1.7K | 4.4% | 
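A distribution like the one above can be tallied from `qrels_iter()` with a `collections.Counter`. A sketch over a few hand-made qrel tuples (the real iterator yields namedtuples with the same `relevance` field):

```python
from collections import Counter, namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hand-made stand-ins for dataset.qrels_iter() output
qrels = [
    Qrel("1", "100001", 0, "0"),
    Qrel("1", "100002", 2, "0"),
    Qrel("2", "100003", 1, "0"),
    Qrel("2", "100004", 0, "0"),
]

counts = Counter(q.relevance for q in qrels)
total = sum(counts.values())
for rel in sorted(counts):
    print(rel, counts[rel], f"{100 * counts[rel] / total:.1f}%")
```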
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2014 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.pmc.v1.trec-cds-2014.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Simpson2014TrecCds, title={Overview of the TREC 2014 Clinical Decision Support Track}, author={Matthew S. Simpson and Ellen M. Voorhees and William Hersh}, booktitle={TREC}, year={2014} }
{
  "docs": {
    "count": 733111,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 30
  },
  "qrels": {
    "count": 37949,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 34593,
          "2": 1683,
          "1": 1673
        }
      }
    }
  }
}
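The per-level counts in the stats block above can be cross-checked against the qrels total, and the relevance-table percentages recomputed, with a few lines of JSON handling:

```python
import json

# Qrels portion of the dataset stats above
stats = json.loads("""
{"qrels": {"count": 37949,
           "fields": {"relevance": {"counts_by_value":
               {"0": 34593, "2": 1683, "1": 1673}}}}}
""")

by_value = stats["qrels"]["fields"]["relevance"]["counts_by_value"]
total = stats["qrels"]["count"]
assert sum(by_value.values()) == total  # per-level counts sum to the total

# Percentages as shown in the relevance-levels table
print({k: round(100 * v / total, 1) for k, v in sorted(by_value.items())})
# → {'0': 91.2, '1': 4.4, '2': 4.4}
```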
The TREC Clinical Decision Support (CDS) track from 2015.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, description, summary>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2015 queries
[query_id]    [type]    [description]    [summary]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.pmc.v1.trec-cds-2015.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from pmc/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2015 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.pmc.v1.trec-cds-2015')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 33K | 86.8% | 
| 1 | possibly relevant | 3.0K | 7.9% | 
| 2 | definitely relevant | 2.0K | 5.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export pmc/v1/trec-cds-2015 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.pmc.v1.trec-cds-2015.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Roberts2015TrecCds, title={Overview of the TREC 2015 Clinical Decision Support Track}, author={Kirk Roberts and Matthew S. Simpson and Ellen Voorhees and William R. Hersh}, booktitle={TREC}, year={2015} }
{
  "docs": {
    "count": 733111,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 30
  },
  "qrels": {
    "count": 37807,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2979,
          "0": 32817,
          "2": 2011
        }
      }
    }
  }
}
Subset of PMC articles used for the TREC 2016 task (v2). Includes titles, abstracts, and full text. Collected from the open access segment on March 28, 2016.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>
You can find more details about the Python API here.
ir_datasets export pmc/v2 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2')
# Index pmc/v2
indexer = pt.IterDictIndexer('./indices/pmc_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.pmc.v2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 1255260,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
The TREC Clinical Decision Support (CDS) track from 2016.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, note, description, summary>
You can find more details about the Python API here.
ir_datasets export pmc/v2/trec-cds-2016 queries
[query_id]    [type]    [note]    [description]    [summary]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
index_ref = pt.IndexRef.of('./indices/pmc_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.pmc.v2.trec-cds-2016.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from pmc/v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>
You can find more details about the Python API here.
ir_datasets export pmc/v2/trec-cds-2016 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
# Index pmc/v2
indexer = pt.IterDictIndexer('./indices/pmc_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.pmc.v2.trec-cds-2016')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 32K | 85.5% | 
| 1 | possibly relevant | 3.4K | 9.0% | 
| 2 | definitely relevant | 2.0K | 5.4% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export pmc/v2/trec-cds-2016 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
index_ref = pt.IndexRef.of('./indices/pmc_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.pmc.v2.trec-cds-2016.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Roberts2016TrecCds, title={Overview of the TREC 2016 Clinical Decision Support Track}, author={Kirk Roberts and Dina Demner-Fushman and Ellen M. Voorhees and William R. Hersh}, booktitle={TREC}, year={2016} }
{
  "docs": {
    "count": 1255260,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 30
  },
  "qrels": {
    "count": 37707,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 32246,
          "1": 3411,
          "2": 2050
        }
      }
    }
  }
}