GitHub: datasets/cord19.py

ir_datasets: CORD-19

Index
  1. cord19
  2. cord19/fulltext
  3. cord19/fulltext/trec-covid
  4. cord19/trec-covid

"cord19"

Collection of scientific articles related to COVID-19.

Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.

Note that this version of the document collection only provides article metadata. To get the full text, use cord19/fulltext.

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('cord19')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

Citation
bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

"cord19/fulltext"

Version of the cord19 dataset that includes article full texts. It takes longer to load than the version that only includes article metadata.

docs

Language: en

Document type:
Cord19FullTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str
  6. body: Tuple[
    Cord19FullTextSection: (namedtuple)
    1. title: str
    2. text: str
    , ...]
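Because body is a tuple of sections rather than a single string, code that needs one text per document has to flatten it. The following is a minimal sketch, assuming that joining the title, abstract, and section texts with blank lines suits your use. The two namedtuples are local stand-ins that mirror the fields listed above (so the snippet runs without downloading anything), and flatten_fulltext is a hypothetical helper, not part of ir_datasets.

```python
from collections import namedtuple

# Local stand-ins mirroring the documented fields. With the real data, the
# records come from ir_datasets.load('cord19/fulltext').docs_iter().
Cord19FullTextSection = namedtuple('Cord19FullTextSection', ['title', 'text'])
Cord19FullTextDoc = namedtuple(
    'Cord19FullTextDoc',
    ['doc_id', 'title', 'doi', 'date', 'abstract', 'body'])


def flatten_fulltext(doc):
    """Join title, abstract, and every body section into one string.

    Hypothetical helper. Empty parts are skipped so that a missing
    abstract or an untitled section does not leave a blank paragraph.
    """
    parts = [doc.title, doc.abstract]
    for section in doc.body:
        parts.append(section.title)
        parts.append(section.text)
    return '\n\n'.join(p for p in parts if p)


# Made-up placeholder record, not an actual CORD-19 document.
example = Cord19FullTextDoc(
    doc_id='example-id', title='A title', doi='', date='', abstract='',
    body=(Cord19FullTextSection('Introduction', 'First paragraph.'),
          Cord19FullTextSection('', 'Untitled section text.')))
flat = flatten_fulltext(example)
```

With the real dataset, the same function can be applied to each doc yielded by dataset.docs_iter().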

Example

import ir_datasets
dataset = ir_datasets.load('cord19/fulltext')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

Citation
bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

"cord19/fulltext/trec-covid"

Version of the cord19/trec-covid dataset that includes article full texts. It takes longer to load than the version that only includes article metadata.

Queries and qrels are the same as in cord19/trec-covid; only the documents differ, using the extended records from cord19/fulltext.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Example

import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:
Cord19FullTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str
  6. body: Tuple[
    Cord19FullTextSection: (namedtuple)
    1. title: str
    2. text: str
    , ...]

Example

import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant: everything else.
1     Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2     Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.
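
The graded levels above are often collapsed into binary judgments and grouped per query before evaluation. The following is a rough sketch, assuming a default threshold of 1, so that partially relevant articles count as relevant. That threshold is an illustrative choice, not something this page prescribes. TrecQrel here is a local stand-in with the documented fields, and group_qrels is a hypothetical helper, not part of ir_datasets.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the documented TrecQrel fields.
TrecQrel = namedtuple(
    'TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])


def group_qrels(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above
    min_relevance. Hypothetical helper; the threshold is an assumption."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id].add(qrel.doc_id)
    return dict(grouped)


# Made-up placeholder judgments, not actual TREC-COVID qrels.
sample = [
    TrecQrel('1', 'doc-a', 2, '1'),
    TrecQrel('1', 'doc-b', 0, '1'),
    TrecQrel('2', 'doc-c', 1, '1'),
]
relevant = group_qrels(sample)                  # levels 1 and 2
strict = group_qrels(sample, min_relevance=2)   # level 2 only
```

With the real data, pass dataset.qrels_iter() in place of sample.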

Example

import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

"cord19/trec-covid"

The TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str
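
Each query carries three text fields, so a retrieval run has to decide which of them to use as the query string. The following is a small sketch, assuming that space-joining the chosen fields is adequate. TrecQuery is a local stand-in with the documented fields, and query_text is a hypothetical helper, not part of ir_datasets.

```python
from collections import namedtuple

# Local stand-in mirroring the documented TrecQuery fields.
TrecQuery = namedtuple(
    'TrecQuery', ['query_id', 'title', 'description', 'narrative'])


def query_text(query, fields=('title',)):
    """Concatenate the selected non-empty fields into one query string.

    Hypothetical helper; which fields to use is left to the caller.
    """
    values = (getattr(query, name) for name in fields)
    return ' '.join(v for v in values if v)


# Made-up placeholder topic, not an actual TREC-COVID query.
topic = TrecQuery(
    'example-1', 'short keywords', 'A full question?', 'Longer guidance.')
title_only = query_text(topic)
title_and_description = query_text(topic, ('title', 'description'))
```

With the real data, the same function can be applied to each query yielded by dataset.queries_iter().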

Example

import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:
Cord19Doc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. doi: str
  4. date: str
  5. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Not Relevant: everything else.
1     Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2     Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Example

import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

Citation
bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }