ir_datasets : CORD-19

import ir_datasets
dataset = ir_datasets.load("cord19")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

\cite{Wang2020Cord19}

Bibtex:

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 192509,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  }
}

`"cord19/fulltext"`

Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that only includes article metadata.

docs

193K docs

Language: en

Document type:

Cord19FullTextDoc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str
body: Tuple[Cord19FullTextSection, ...]

Cord19FullTextSection: (namedtuple)

title: str
text: str
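Because body is a tuple of sections rather than a single string, it often needs to be flattened before it can be indexed or searched as text. The following is a minimal sketch of one way to do that. The classes Section and FullTextDoc are local stand-ins that mirror the documented field layout (they are not the classes shipped with ir_datasets), and flatten_body and the example document are hypothetical.

```python
from typing import NamedTuple, Tuple


# Local stand-ins mirroring the documented fields; not the ir_datasets classes.
class Section(NamedTuple):
    title: str
    text: str


class FullTextDoc(NamedTuple):
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str
    body: Tuple[Section, ...]


def flatten_body(doc, separator="\n\n"):
    """Join section titles and texts into one string, skipping empty parts."""
    parts = []
    for section in doc.body:
        if section.title:
            parts.append(section.title)
        if section.text:
            parts.append(section.text)
    return separator.join(parts)


# Hypothetical document, for illustration only.
example = FullTextDoc(
    doc_id="abc12345",
    title="Example title",
    doi="",
    date="2020-01-01",
    abstract="Example abstract.",
    body=(
        Section("Introduction", "First section."),
        Section("", "Untitled section."),
    ),
)
flat = flatten_body(example)
```

The same function should work on documents yielded by docs_iter(), provided they expose body as an iterable of objects with title and text attributes, as described above.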

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/fulltext")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/fulltext docs



[doc_id]    [title]    [doi]    [date]    [abstract]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

\cite{Wang2020Cord19}

Bibtex:

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 192509,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  }
}

`"cord19/fulltext/trec-covid"`

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article metadata.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
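Each query carries three text variants (title, description, narrative), so a retrieval run has to pick one. As a rough sketch, the helper below builds a query_id to text mapping for a chosen variant. Query is a local stand-in for the documented namedtuple, and query_text_by_id and the sample queries are hypothetical.

```python
from typing import NamedTuple


# Local stand-in mirroring the documented TrecQuery fields.
class Query(NamedTuple):
    query_id: str
    title: str
    description: str
    narrative: str


TEXT_FIELDS = ("title", "description", "narrative")


def query_text_by_id(queries, field="title"):
    """Map each query_id to the requested text variant."""
    if field not in TEXT_FIELDS:
        raise ValueError(f"field must be one of {TEXT_FIELDS}, got {field!r}")
    return {query.query_id: getattr(query, field) for query in queries}


# Hypothetical queries, for illustration only.
sample_queries = [
    Query("1", "short keywords", "A one-sentence question?", "Longer background."),
    Query("2", "other keywords", "Another question?", "More background."),
]
titles = query_text_by_id(sample_queries)
descriptions = query_text_by_id(sample_queries, field="description")
```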

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/fulltext/trec-covid queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

193K docs

Inherits docs from cord19/fulltext

Language: en

Document type:

Cord19FullTextDoc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str
body: Tuple[Cord19FullTextSection, ...]

Cord19FullTextSection: (namedtuple)

title: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract, body>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/fulltext/trec-covid docs



[doc_id]    [title]    [doi]    [date]    [abstract]    [body]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

69K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`43K`	61.5%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`11K`	15.9%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`16K`	22.5%
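The percentage column can be reproduced from the per-level counts that this dataset's metadata reports under counts_by_value. A short worked calculation follows; the variable names are illustrative. Note that the metadata also lists two judgments with value -1, which have no row in the table.

```python
# Per-level qrel counts as listed in the metadata for cord19/fulltext/trec-covid.
counts_by_value = {"2": 15609, "1": 11055, "0": 42652, "-1": 2}

total = sum(counts_by_value.values())

# Share of each relevance level, as a percentage rounded to one decimal place.
shares = {
    level: round(100 * count / total, 1)
    for level, count in counts_by_value.items()
}
```

Here total comes to 69318, matching the qrels count in the metadata, and the shares for levels 0, 1, and 2 round to the percentages shown in the table.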

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/fulltext/trec-covid qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 192509,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 69318,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 15609,
          "1": 11055,
          "0": 42652,
          "-1": 2
        }
      }
    }
  }
}

`"cord19/trec-covid"`

The Complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

193K docs

Inherits docs from cord19

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

69K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`43K`	61.5%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`11K`	15.9%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`16K`	22.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...
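The exported rows can be read back with the standard csv module. The sketch below assumes the export is tab-separated, has no header row, and lists the columns in the order shown above; parse_qrels_tsv and the sample rows are hypothetical and only illustrate the shape of the data.

```python
import csv
import io

# Column order as shown in the CLI output above.
FIELDS = ("query_id", "doc_id", "relevance", "iteration")


def parse_qrels_tsv(stream):
    """Yield one dict per non-empty row, converting relevance to int."""
    reader = csv.reader(stream, delimiter="\t")
    for row in reader:
        if not row:
            continue
        record = dict(zip(FIELDS, row))
        record["relevance"] = int(record["relevance"])
        yield record


# Hypothetical rows standing in for real exported output.
sample = "1\tabc12345\t2\t1\n1\tdef67890\t0\t1\n"
rows = list(parse_qrels_tsv(io.StringIO(sample)))
```

For a real export, pass an open file handle (or sys.stdin when piping from the CLI) in place of the io.StringIO object.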

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 192509,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 69318,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 15609,
          "1": 11055,
          "0": 42652,
          "-1": 2
        }
      }
    }
  }
}

`"cord19/trec-covid/round1"`

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

30 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round1 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

51K docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round1 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
# Index cord19/trec-covid/round1
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

8.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str
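Evaluation code usually needs random access to judgments by query and document rather than a flat stream. One possible way to arrange them is a nested dictionary, sketched below. Qrel is a local stand-in for the documented namedtuple, and build_qrels_lookup and the sample judgments are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


# Local stand-in mirroring the documented TrecQrel fields.
class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def build_qrels_lookup(qrels):
    """Return {query_id: {doc_id: relevance}} from an iterable of qrels."""
    lookup = defaultdict(dict)
    for qrel in qrels:
        lookup[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(lookup)


# Hypothetical judgments, for illustration only.
sample_qrels = [
    Qrel("1", "doc_a", 2, "1"),
    Qrel("1", "doc_b", 0, "1"),
    Qrel("2", "doc_a", 1, "1"),
]
lookup = build_qrels_lookup(sample_qrels)
```

Unjudged pairs are simply absent from the mapping, so a call such as lookup.get("1", {}).get("doc_z") returns None instead of raising an error.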

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`6.3K`	72.9%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`1.1K`	12.8%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`1.2K`	14.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round1 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 51078,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 30
  },
  "qrels": {
    "count": 8691,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1237,
          "1": 1115,
          "0": 6339
        }
      }
    }
  }
}

`"cord19/trec-covid/round2"`

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that these qrels contain only the judgments made in this round, not those from the prior round. For the cumulative judgments, use the "complete" version (cord19/trec-covid).

35 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round2 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

60K docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round2 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
# Index cord19/trec-covid/round2
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

12K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`9.0K`	75.1%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`1.4K`	11.7%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`1.6K`	13.2%
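Since relevance is graded on three levels, binary measures need a cut-off: either level 1 and above, or level 2 only. The sketch below counts relevant documents per query under a chosen threshold. The function relevant_per_query, its min_relevance parameter, and the sample judgments are hypothetical; qrels are treated as plain 4-tuples in the documented field order, which namedtuples also satisfy.

```python
from collections import Counter


def relevant_per_query(qrels, min_relevance=1):
    """Count judged documents per query whose grade meets the threshold."""
    counts = Counter()
    for query_id, _doc_id, relevance, _iteration in qrels:
        if relevance >= min_relevance:
            counts[query_id] += 1
    return dict(counts)


# Hypothetical judgments in (query_id, doc_id, relevance, iteration) order.
sample_qrels = [
    ("1", "doc_a", 2, "1"),
    ("1", "doc_b", 1, "1"),
    ("1", "doc_c", 0, "1"),
    ("2", "doc_a", 1, "1"),
]
lenient = relevant_per_query(sample_qrels, min_relevance=1)
strict = relevant_per_query(sample_qrels, min_relevance=2)
```

Queries with no document at or above the threshold are absent from the result, which is why query "2" drops out under the strict setting.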

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 59887,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 35
  },
  "qrels": {
    "count": 12037,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 9035,
          "1": 1410,
          "2": 1592
        }
      }
    }
  }
}

`"cord19/trec-covid/round3"`

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that these qrels contain only the judgments made in this round, not those from prior rounds. For the cumulative judgments, use the "complete" version (cord19/trec-covid).

40 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round3 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

128K docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round3 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
# Index cord19/trec-covid/round3
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round3')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

13K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`8.0K`	63.0%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`2.1K`	16.4%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`2.6K`	20.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round3 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 128492,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 40
  },
  "qrels": {
    "count": 12713,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2089,
          "0": 8015,
          "2": 2609
        }
      }
    }
  }
}

`"cord19/trec-covid/round4"`

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that these qrels contain only the judgments made in this round, not those from prior rounds. For the cumulative judgments, use the "complete" version (cord19/trec-covid).

45 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round4 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

158K docs

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round4 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
# Index cord19/trec-covid/round4
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round4')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

13K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`7.4K`	56.1%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`2.3K`	17.2%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`3.5K`	26.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round4 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }

{
  "docs": {
    "count": 158274,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 45
  },
  "qrels": {
    "count": 13262,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2279,
          "0": 7438,
          "2": 3545
        }
      }
    }
  }
}
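The relevance percentages shown in this round's table can be recomputed from the `counts_by_value` entries in the metadata above. The following is a minimal sketch with the relevant part of the metadata pasted in as a string; `relevance_shares` is an illustrative helper name, not part of ir_datasets.

```python
import json

# Relevant part of the round 4 metadata, copied from the JSON block above.
metadata = json.loads("""
{
  "qrels": {
    "count": 13262,
    "fields": {
      "relevance": {
        "counts_by_value": {"1": 2279, "0": 7438, "2": 3545}
      }
    }
  }
}
""")


def relevance_shares(meta):
    """Return {relevance_level: percent of all qrels}, rounded to one decimal."""
    qrels = meta["qrels"]
    total = qrels["count"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    # JSON keys are strings, so convert them to int for readability.
    return {int(level): round(100 * n / total, 1)
            for level, n in sorted(counts.items())}


shares = relevance_shares(metadata)
print(shares)  # {0: 56.1, 1: 17.2, 2: 26.7}
```

The 26.7% for level 2 agrees with the share listed for "Relevant" in the round 4 relevance table.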

`"cord19/trec-covid/round5"`

Round 5 of the TREC-COVID task. Includes 50 queries related to COVID-19. This round uses the "2020-07-16" version of the collection.

Note that these qrels contain only the judgments collected in round 5; judgments from prior rounds are not included. To evaluate against the judgments from all rounds, use the "complete" version (cord19/trec-covid).
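To see which judgments a single round leaves out, the round's qrels can be compared with the complete set by (query_id, doc_id) pair. This is a minimal sketch: `Qrel` is a local stand-in with the same fields as TrecQrel so that it runs without downloading anything, the helper names are hypothetical, and the toy records are made up for illustration rather than taken from TREC-COVID.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecQrel (assumption: only the field
# names matter for this comparison).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_pairs(qrels):
    """Set of (query_id, doc_id) pairs that have a judgment."""
    return {(q.query_id, q.doc_id) for q in qrels}


def missing_from_round(round_qrels, complete_qrels):
    """Pairs judged in the complete set but absent from one round's qrels."""
    return judged_pairs(complete_qrels) - judged_pairs(round_qrels)


# Toy records, invented for illustration only.
round_only = [Qrel("1", "docA", 2, "5")]
complete = [Qrel("1", "docA", 2, "5"), Qrel("1", "docB", 1, "1")]

print(missing_from_round(round_only, complete))  # {('1', 'docB')}
```

With the real data, the two arguments would be `ir_datasets.load("cord19/trec-covid/round5").qrels_iter()` and `ir_datasets.load("cord19/trec-covid").qrels_iter()`.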

50 queries

Inherits queries from cord19/trec-covid

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round5 queries



[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

193K docs

Inherits docs from cord19

Language: en

Document type:

Cord19Doc: (namedtuple)

doc_id: str
title: str
doi: str
date: str
abstract: str

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, doi, date, abstract>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round5 docs



[doc_id]    [title]    [doi]    [date]    [abstract]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels

23K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant: everything else.	`12K`	52.9%
1	Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.	`4.2K`	18.3%
2	Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.	`6.7K`	28.8%
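Because relevance is graded on three levels, evaluation that needs a binary notion of relevance has to pick a cutoff: level 2 only (strict) or levels 1 and 2 (lenient). The sketch below shows one way to apply such a cutoff. `Qrel` is a local stand-in mirroring the TrecQrel fields, `relevant_docs` is a hypothetical helper, and the toy records are invented for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecQrel, so the sketch runs offline.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs(qrels, min_rel=2):
    """Map each query_id to the set of doc_ids judged at or above min_rel."""
    by_query = {}
    for q in qrels:
        if q.relevance >= min_rel:
            by_query.setdefault(q.query_id, set()).add(q.doc_id)
    return by_query


# Toy records, invented for illustration only.
toy = [
    Qrel("1", "docA", 2, "5"),  # Relevant
    Qrel("1", "docB", 1, "5"),  # Partially Relevant
    Qrel("2", "docC", 0, "5"),  # Not Relevant
]

strict = relevant_docs(toy, min_rel=2)   # only fully relevant articles
lenient = relevant_docs(toy, min_rel=1)  # partially relevant also counts
print(strict)
print(lenient)
```

Queries with no document at or above the cutoff (query "2" here) do not appear in the result, so callers that need every query should fill in empty sets themselves.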

Examples:

import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export cord19/trec-covid/round5 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Voorhees2020TrecCovid,Wang2020Cord19}

Bibtex:

@article{Voorhees2020TrecCovid,
  title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection},
  author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.04474}
}

@article{Wang2020Cord19,
  title={CORD-19: The Covid-19 Open Research Dataset},
  author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
  journal={ArXiv},
  year={2020}
}