CORD-19 - ir_datasets

`"cord19"`

Collection of scientific articles related to COVID-19.

Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.

Note that this version of the document collection provides only article metadata. To get the full text, use cord19/fulltext.

Document collection site

docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>
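
Since this variant exposes only metadata, one common use is to build a single searchable string per document from its title and abstract. The following is a minimal sketch, not part of ir_datasets: `doc_text` is a hypothetical helper, it assumes either field may be empty, and the sample record is invented so the block runs without downloading the collection.

```python
from collections import namedtuple


def doc_text(doc):
    """Join the non-empty title and abstract of a document with a newline."""
    parts = [doc.title, doc.abstract]
    return "\n".join(p.strip() for p in parts if p and p.strip())


# Stand-in with the same field names as Cord19Doc (all values are invented).
SampleDoc = namedtuple("SampleDoc", ["doc_id", "title", "doi", "date", "abstract"])
sample = SampleDoc("x1", "A title", "", "2020-01-01", "")
print(doc_text(sample))

# With the real collection (requires ir_datasets and a download):
#   import ir_datasets
#   for doc in ir_datasets.load('cord19').docs_iter():
#       text = doc_text(doc)
```

The helper reads only the `title` and `abstract` attributes, so it should work equally on records from `docs_iter()`.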

Citation

bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

`"cord19/fulltext"`

Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that includes only article metadata.

docs

Language: en

Document type:

Cord19FullTextDoc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str
body: Tuple[Cord19FullTextSection, ...]

Cord19FullTextSection: (namedtuple)

title: str
text: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19/fulltext')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>
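
The `body` field is a tuple of sections, each with a `title` and a `text`, so it usually has to be flattened before indexing or tokenizing. The sketch below shows one way to do that. `flatten_body` is a hypothetical helper rather than an ir_datasets function, and the stand-in records are invented so the block runs without the download.

```python
from collections import namedtuple


def flatten_body(doc, sep="\n\n"):
    """Concatenate the non-empty section titles and texts of a full-text document."""
    chunks = []
    for section in doc.body:
        if section.title:
            chunks.append(section.title)
        if section.text:
            chunks.append(section.text)
    return sep.join(chunks)


# Stand-ins with the same field names as Cord19FullTextSection and
# Cord19FullTextDoc (all values are invented).
Section = namedtuple("Section", ["title", "text"])
FullDoc = namedtuple("FullDoc", ["doc_id", "title", "doi", "date", "abstract", "body"])
sample = FullDoc(
    "x1", "T", "", "", "A",
    (Section("Intro", "First."), Section("", "Second.")),
)
print(flatten_body(sample, sep=" | "))
```

Skipping empty titles is a design choice made here, on the assumption that some sections may carry no heading.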

Citation

bibtex: @article{Wang2020CORD19TC, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

`"cord19/fulltext/trec-covid"`

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that includes only article metadata.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
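
Each query carries three text fields (`title`, `description`, `narrative`), and a retrieval run typically sends only one of them to the search system. The sketch below shows one way to pick the field. `query_text` is a hypothetical helper, the fallback to `title` is an assumption made for illustration, and the sample query is invented.

```python
from collections import namedtuple


def query_text(query, field="title"):
    """Return the requested query field, falling back to the title if it is empty."""
    value = getattr(query, field)
    return value if value else query.title


# Stand-in with the same field names as TrecQuery (all values are invented).
Query = namedtuple("Query", ["query_id", "title", "description", "narrative"])
sample = Query("1", "short form", "longer question form", "")
print(query_text(sample, "description"))

# With the real queries (requires ir_datasets and a download):
#   import ir_datasets
#   for query in ir_datasets.load('cord19/fulltext/trec-covid').queries_iter():
#       text = query_text(query, "description")
```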

docs

Language: en

Document type:

Cord19FullTextDoc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str
body: Tuple[Cord19FullTextSection, ...]

Cord19FullTextSection: (namedtuple)

title: str
text: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	Not Relevant: everything else.
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Example


import ir_datasets
dataset = ir_datasets.load('cord19/fulltext/trec-covid')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
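
Evaluation code often needs the judgments grouped by query rather than as a flat stream. The sketch below builds a nested mapping from the `query_id`, `doc_id`, and `relevance` fields. `group_qrels` is a hypothetical helper, and the sample judgments are invented so the block runs without the download.

```python
from collections import defaultdict, namedtuple


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels.

    If a (query_id, doc_id) pair occurs more than once, the last one seen wins.
    """
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Stand-in with the same field names as TrecQrel (all values are invented).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [
    Qrel("1", "a", 2, "1"),
    Qrel("1", "b", 0, "1"),
    Qrel("2", "a", 1, "1"),
]
print(group_qrels(sample))

# With the real judgments (requires ir_datasets and a download):
#   import ir_datasets
#   dataset = ir_datasets.load('cord19/fulltext/trec-covid')
#   judged = group_qrels(dataset.qrels_iter())
```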

Citation

bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

`"cord19/trec-covid"`

The TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Example


import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	Not Relevant: everything else.
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.

Example


import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
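
To see how the judgments are distributed over the relevance levels listed above, the qrels can be tallied by their `relevance` value. The following is a minimal sketch: `relevance_counts` and `LEVEL_NAMES` are hypothetical names, the labels are copied from the table above, and the sample judgments are invented.

```python
from collections import Counter, namedtuple

# Labels taken from the relevance-level table for this dataset.
LEVEL_NAMES = {0: "Not Relevant", 1: "Partially Relevant", 2: "Relevant"}


def relevance_counts(qrels):
    """Count judgments per relevance level, keyed by the level's label."""
    counts = Counter(qrel.relevance for qrel in qrels)
    return {
        LEVEL_NAMES.get(level, f"Other ({level})"): n
        for level, n in sorted(counts.items())
    }


# Stand-in with the same field names as TrecQrel (all values are invented).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [
    Qrel("1", "a", 0, "1"),
    Qrel("1", "b", 2, "1"),
    Qrel("2", "a", 2, "1"),
    Qrel("2", "c", 1, "1"),
]
print(relevance_counts(sample))

# With the real judgments (requires ir_datasets and a download):
#   import ir_datasets
#   dataset = ir_datasets.load('cord19/trec-covid')
#   print(relevance_counts(dataset.qrels_iter()))
```

Any value outside the three documented levels is reported under an "Other" label instead of being dropped, so unexpected values stay visible.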

Citation

bibtex: @article{Voorhees2020TRECCOVIDCA, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }
