GitHub: datasets/pmc.py

ir_datasets: PubMed Central (TREC CDS)

Index
  1. pmc
  2. pmc/v1
  3. pmc/v1/trec-cds-2014
  4. pmc/v1/trec-cds-2015
  5. pmc/v2
  6. pmc/v2/trec-cds-2016

"pmc"

Biomedical articles from PubMed Central. Currently, this only includes the subsets used for the TREC Clinical Decision Support (CDS) 2014-16 tasks.


"pmc/v1"

Subset of PMC articles used for the TREC CDS 2014 and 2015 tasks (v1). Includes titles, abstracts, and full text. Collected from the open access segment on January 21, 2014.

docs
733K docs

Language: en

Document type:
PmcDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. abstract: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>

You can find more details about the Python API here.
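
Beyond sequential iteration, ir_datasets also provides a docs_store for random access by doc_id; a minimal sketch (the doc_id below is a hypothetical placeholder):

import ir_datasets
dataset = ir_datasets.load("pmc/v1")
docs_store = dataset.docs_store()  # builds/uses a local lookup structure on first use
doc = docs_store.get("1234567")    # hypothetical doc_id; returns a PmcDoc namedtuple
print(doc.title)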

CLI
ir_datasets export pmc/v1 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...

You can find more details about the CLI here.
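
The export command also supports other output formats; for example, JSON Lines output keeps the field names with each record (format name assumed from the CLI documentation):

ir_datasets export pmc/v1 docs --format jsonl
{"doc_id": ..., "journal": ..., "title": ..., "abstract": ..., "body": ...}
...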

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])

You can find more details about PyTerrier indexing here.


"pmc/v1/trec-cds-2014"

The TREC Clinical Decision Support (CDS) track from 2014.

queries
30 queries

Language: en

Query type:
TrecCdsQuery: (namedtuple)
  1. query_id: str
  2. type: str
  3. description: str
  4. summary: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, description, summary>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v1/trec-cds-2014 queries
[query_id]    [type]    [description]    [summary]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))

You can find more details about PyTerrier retrieval here.

docs
733K docs

Inherits docs from pmc/v1

Language: en

Document type:
PmcDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. abstract: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v1/trec-cds-2014 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])

You can find more details about PyTerrier indexing here.

qrels
38K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition           Count   %
  0   not relevant          35K   91.2%
  1   possibly relevant    1.7K    4.4%
  2   definitely relevant  1.7K    4.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
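
The relevance distribution shown in the table above can be recomputed with a quick pass over the qrels; a small sketch:

import collections
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2014")
# count judgments per relevance level (0, 1, 2)
counts = collections.Counter(qrel.relevance for qrel in dataset.qrels_iter())
total = sum(counts.values())
for rel, count in sorted(counts.items()):
    print(rel, count, f"{count / total:.1%}")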

CLI
ir_datasets export pmc/v1/trec-cds-2014 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2014')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Simpson2014TrecCds}

Bibtex:

@inproceedings{Simpson2014TrecCds,
  title={Overview of the TREC 2014 Clinical Decision Support Track},
  author={Matthew S. Simpson and Ellen M. Voorhees and William Hersh},
  booktitle={TREC},
  year={2014}
}

"pmc/v1/trec-cds-2015"

The TREC Clinical Decision Support (CDS) track from 2015.

queries
30 queries

Language: en

Query type:
TrecCdsQuery: (namedtuple)
  1. query_id: str
  2. type: str
  3. description: str
  4. summary: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, description, summary>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v1/trec-cds-2015 queries
[query_id]    [type]    [description]    [summary]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))

You can find more details about PyTerrier retrieval here.

docs
733K docs

Inherits docs from pmc/v1

Language: en

Document type:
PmcDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. abstract: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v1/trec-cds-2015 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
# Index pmc/v1
indexer = pt.IterDictIndexer('./indices/pmc_v1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])

You can find more details about PyTerrier indexing here.

qrels
38K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition           Count   %
  0   not relevant          33K   86.8%
  1   possibly relevant    3.0K    7.9%
  2   definitely relevant  2.0K    5.3%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v1/trec-cds-2015 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v1/trec-cds-2015')
index_ref = pt.IndexRef.of('./indices/pmc_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
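
If you already have a run in TREC format (for example, produced outside PyTerrier), the qrels can also be consumed directly by the companion ir_measures package; a sketch, assuming ir_measures is installed and run.txt is a hypothetical TREC-format run file:

import ir_datasets
import ir_measures
from ir_measures import AP, nDCG
dataset = ir_datasets.load("pmc/v1/trec-cds-2015")
run = ir_measures.read_trec_run("run.txt")  # hypothetical run file path
# qrels_iter() yields namedtuples that ir_measures accepts directly
print(ir_measures.calc_aggregate([AP, nDCG@20], dataset.qrels_iter(), run))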

Citation

ir_datasets.bib:

\cite{Roberts2015TrecCds}

Bibtex:

@inproceedings{Roberts2015TrecCds,
  title={Overview of the TREC 2015 Clinical Decision Support Track},
  author={Kirk Roberts and Matthew S. Simpson and Ellen Voorhees and William R. Hersh},
  booktitle={TREC},
  year={2015}
}

"pmc/v2"

Subset of PMC articles used for the TREC CDS 2016 task (v2). Includes titles, abstracts, and full text. Collected from the open access segment on March 28, 2016.

docs
1.3M docs

Language: en

Document type:
PmcDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. abstract: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>

You can find more details about the Python API here.
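
Since the v2 collection contains 1.3M articles, a full pass can take a while; docs_iter supports slicing, which is convenient for sampling (a minimal sketch):

import ir_datasets
dataset = ir_datasets.load("pmc/v2")
# first 10 documents
for doc in dataset.docs_iter()[:10]:
    print(doc.doc_id, doc.title)
# roughly a 1% sample: every 100th document
for doc in dataset.docs_iter()[::100]:
    pass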

CLI
ir_datasets export pmc/v2 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2')
# Index pmc/v2
indexer = pt.IterDictIndexer('./indices/pmc_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])

You can find more details about PyTerrier indexing here.


"pmc/v2/trec-cds-2016"

The TREC Clinical Decision Support (CDS) track from 2016.

queries
30 queries

Language: en

Query type:
TrecCds2016Query: (namedtuple)
  1. query_id: str
  2. type: str
  3. note: str
  4. description: str
  5. summary: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for query in dataset.queries_iter():
    query # namedtuple<query_id, type, note, description, summary>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v2/trec-cds-2016 queries
[query_id]    [type]    [note]    [description]    [summary]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
index_ref = pt.IndexRef.of('./indices/pmc_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('type'))

You can find more details about PyTerrier retrieval here.
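
The 2016 topics carry several text fields (note, description, summary). If you want to retrieve with a specific field rather than the default, you can build the topics frame yourself; a sketch that uses the summary field and assumes the pmc_v2 index built above (the punctuation stripping is a simple workaround for the Terrier query parser):

import re
import pandas as pd
import ir_datasets
import pyterrier as pt
pt.init()
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
# use the summary field as the query text; strip special characters
topics = pd.DataFrame([
    {"qid": q.query_id, "query": re.sub(r"[^A-Za-z0-9 ]", " ", q.summary)}
    for q in dataset.queries_iter()
])
index_ref = pt.IndexRef.of('./indices/pmc_v2')  # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
res = pipeline(topics)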

docs
1.3M docs

Inherits docs from pmc/v2

Language: en

Document type:
PmcDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. abstract: str
  5. body: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, abstract, body>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v2/trec-cds-2016 docs
[doc_id]    [journal]    [title]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
# Index pmc/v2
indexer = pt.IterDictIndexer('./indices/pmc_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['journal', 'title', 'abstract', 'body'])

You can find more details about PyTerrier indexing here.

qrels
38K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition           Count   %
  0   not relevant          32K   85.5%
  1   possibly relevant    3.4K    9.0%
  2   definitely relevant  2.0K    5.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export pmc/v2/trec-cds-2016 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:pmc/v2/trec-cds-2016')
index_ref = pt.IndexRef.of('./indices/pmc_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('type'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Roberts2016TrecCds}

Bibtex:

@inproceedings{Roberts2016TrecCds,
  title={Overview of the TREC 2016 Clinical Decision Support Track},
  author={Kirk Roberts and Dina Demner-Fushman and Ellen M. Voorhees and William R. Hersh},
  booktitle={TREC},
  year={2016}
}