Github: datasets/cord19.py

ir_datasets: CORD-19

Index
  1. cord19
  2. cord19/fulltext
  3. cord19/fulltext/trec-covid
  4. cord19/trec-covid
  5. cord19/trec-covid/round1
  6. cord19/trec-covid/round2
  7. cord19/trec-covid/round3
  8. cord19/trec-covid/round4
  9. cord19/trec-covid/round5

"cord19"

Collection of scientific articles related to COVID-19.

Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.

Note that this version of the document collection only provides article meta-data. To get the full text, use cord19/fulltext.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.
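As a further illustration, the sketch below joins the title and abstract of each record into one string, a common preprocessing step before indexing. The Cord19Doc namedtuple here is a local stand-in that mirrors the fields listed above (an assumption, so the snippet runs without downloading the collection), and doc_text is a hypothetical helper, not part of ir_datasets. It also assumes that either field may be empty.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the Cord19Doc fields listed above,
# so this sketch runs without downloading the collection.
Cord19Doc = namedtuple("Cord19Doc", ["doc_id", "title", "doi", "date", "abstract"])


def doc_text(doc):
    """Join title and abstract, skipping whichever one is empty."""
    parts = [doc.title.strip(), doc.abstract.strip()]
    return "\n".join(p for p in parts if p)


# Made-up sample records, for illustration only.
sample_docs = [
    Cord19Doc("d1", "A title", "10.0/x", "2020-01-01", "An abstract."),
    Cord19Doc("d2", "Title only", "", "2020", ""),
]

texts = {doc.doc_id: doc_text(doc) for doc in sample_docs}
```

With the real collection, the same helper could be applied to each doc yielded by dataset.docs_iter().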

CLI
ir_datasets export cord19 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

Citation
bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

"cord19/fulltext"

Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

docs

Language: en

Document type:
Cord19FullTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str
  6. body: Tuple[
    Cord19FullTextSection: (namedtuple)
    1. title: str
    2. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

You can find more details about the Python API here.
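Because body is a tuple of sections rather than a single string, it usually needs flattening before it can be indexed or searched as text. The sketch below shows one way to do that. The two namedtuples are local stand-ins mirroring the types listed above (an assumption, so the snippet runs offline), flatten_body is a hypothetical helper, and the handling of empty section titles is likewise an assumption.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the document types listed above.
Cord19FullTextSection = namedtuple("Cord19FullTextSection", ["title", "text"])
Cord19FullTextDoc = namedtuple(
    "Cord19FullTextDoc", ["doc_id", "title", "doi", "date", "abstract", "body"]
)


def flatten_body(doc):
    """Concatenate all sections; a section title is kept only if non-empty."""
    lines = []
    for section in doc.body:
        if section.title:
            lines.append(section.title)
        lines.append(section.text)
    return "\n".join(lines)


# Made-up sample record, for illustration only.
sample = Cord19FullTextDoc(
    "d1", "A title", "", "2020", "An abstract.",
    (
        Cord19FullTextSection("Introduction", "Intro text."),
        Cord19FullTextSection("", "Untitled text."),
    ),
)

flat = flatten_body(sample)
```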

CLI
ir_datasets export cord19/fulltext docs
[doc_id]    [title]    [doi]    [date]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

Citation
bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

"cord19/fulltext/trec-covid"

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.
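Each query carries three text fields, so a retrieval experiment has to decide which of them to send to the ranker. The sketch below builds a query string from a configurable subset of the fields. TrecQuery is a local stand-in mirroring the type listed above, and query_text is a hypothetical helper; neither the helper nor the sample topic comes from ir_datasets.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the TrecQuery fields listed above.
TrecQuery = namedtuple("TrecQuery", ["query_id", "title", "description", "narrative"])


def query_text(query, fields=("title",)):
    """Concatenate the selected query fields, in the order given."""
    return " ".join(getattr(query, name) for name in fields)


# Made-up sample topic, for illustration only.
sample_query = TrecQuery("q1", "short keywords", "A longer question?", "Extra guidance.")

title_only = query_text(sample_query)
title_and_description = query_text(sample_query, fields=("title", "description"))
```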

CLI
ir_datasets export cord19/fulltext/trec-covid queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from cord19/fulltext

Document type:
Cord19FullTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str
  6. body: Tuple[
    Cord19FullTextSection: (namedtuple)
    1. title: str
    2. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/fulltext/trec-covid docs
[doc_id]    [title]    [doi]    [date]    [abstract]    [body]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
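Many evaluation measures expect binary labels, while the judgments here are graded (0, 1, 2). The sketch below groups qrels by query and binarizes them at a configurable threshold. TrecQrel is a local stand-in mirroring the type listed above, qrels_to_binary is a hypothetical helper, and counting "Partially Relevant" as relevant (min_rel=1) is an assumption rather than a rule from the task.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_binary(qrels, min_rel=1):
    """Map query_id -> {doc_id: 0 or 1}, where 1 means relevance >= min_rel."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = 1 if qrel.relevance >= min_rel else 0
    return dict(by_query)


# Made-up sample judgments, for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 2, "1"),
    TrecQrel("q1", "d2", 1, "1"),
    TrecQrel("q1", "d3", 0, "1"),
    TrecQrel("q2", "d1", 0, "1"),
]

lenient = qrels_to_binary(sample_qrels)            # levels 1 and 2 count as relevant
strict = qrels_to_binary(sample_qrels, min_rel=2)  # only level 2 counts as relevant
```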

CLI
ir_datasets export cord19/fulltext/trec-covid qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid"

The complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from cord19

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
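To make the role of the graded relevance levels in nDCG@20 concrete, the sketch below computes nDCG@k for a single query in plain Python. It assumes linear gain (the relevance level itself), a log2 rank discount, and a gain of 0 for unjudged documents. Evaluation tools may differ in these details, so treat it as an illustration rather than a reimplementation of what PyTerrier reports. The names dcg and ndcg_at_k are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judged, k=20):
    """nDCG@k for one query; `judged` maps doc_id -> relevance level (0, 1, 2).

    Assumption: documents without a judgment contribute a gain of 0.
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments and rankings, for illustration only.
judged = {"a": 2, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], judged)
swapped = ndcg_at_k(["b", "a", "c"], judged)
```

Ranking the partially relevant document above the fully relevant one lowers the score, which is the effect the graded levels are meant to capture.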

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid/round1"

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round1 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round1 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
# Index cord19/trec-covid/round1
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round1 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid/round2"

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
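Because the per-round qrels cover only that round's judgments, a ranking scored against them gets no credit for documents judged in earlier rounds. One possible way to handle this, sketched below, is to drop previously judged (query, document) pairs from a ranking before scoring it. This is an assumption about how you might want to evaluate, not a procedure stated on this page, and drop_previously_judged is a hypothetical helper.

```python
def drop_previously_judged(ranking, prior_judged):
    """Remove (query_id, doc_id) pairs that were judged in an earlier round.

    ranking: list of (query_id, doc_id) tuples in rank order.
    prior_judged: set of (query_id, doc_id) tuples from earlier rounds.
    The relative order of the remaining pairs is preserved.
    """
    return [pair for pair in ranking if pair not in prior_judged]


# Made-up identifiers, for illustration only.
ranking = [("q1", "d1"), ("q1", "d2"), ("q1", "d3"), ("q2", "d1")]
prior_judged = {("q1", "d2"), ("q2", "d9")}

residual = drop_previously_judged(ranking, prior_judged)
```

Note that ("q2", "d1") is kept: a document judged for one query is only removed for that query.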

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round2 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round2 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
# Index cord19/trec-covid/round2
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round2 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid/round3"

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round3 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round3 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
# Index cord19/trec-covid/round3
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round3')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round3 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid/round4"

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round4 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round4 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
# Index cord19/trec-covid/round4
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round4')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round4 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid/round5"

Round 5 of the TREC COVID task. Includes 50 queries related to COVID-19. This uses the "2020-07-16" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round5 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from cord19

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round5 docs
[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0    Not Relevant: everything else.
1    Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2    Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export cord19/trec-covid/round5 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }