ir_datasets: CORD-19
Collection of scientific articles related to COVID-19.
Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.
Note that this version of the document collection only provides article metadata. To get the full text, use cord19/fulltext.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
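The exported lines can also be read back into Python. A minimal sketch, assuming the export is tab-separated with one record per line in the field order shown above; the `DocRow` type, the `parse_docs_export` helper, and the sample row are hypothetical and not part of ir_datasets:

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class DocRow(NamedTuple):
    # Field order mirrors the export layout shown above.
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str


def parse_docs_export(stream: TextIO) -> Iterator[DocRow]:
    """Yield one DocRow per tab-separated line, skipping malformed lines."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) == len(DocRow._fields):
            yield DocRow(*fields)


# Hypothetical sample standing in for real `ir_datasets export cord19 docs` output.
sample = "doc1\tAn example title\t10.0000/example\t2020-01-01\tAn example abstract.\n"
rows = list(parse_docs_export(io.StringIO(sample)))
print(rows[0].doc_id, rows[0].date)
```

In practice the stream would be a file or pipe holding the real export rather than the in-memory sample.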
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that only includes article metadata.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract, body>
You can find more details about the Python API here.
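The structure of the `body` field is not spelled out above. A minimal sketch of flattening it into one string, assuming `body` is a sequence of sections that each carry a `title` and a `text`; the `Section` and `FullTextDoc` stand-ins and the `flatten_body` helper are hypothetical, so check them against the objects actually returned by `docs_iter()`:

```python
from typing import NamedTuple, Tuple


class Section(NamedTuple):
    # Assumed shape of one body section (not confirmed by the text above).
    title: str
    text: str


class FullTextDoc(NamedTuple):
    doc_id: str
    title: str
    doi: str
    date: str
    abstract: str
    body: Tuple[Section, ...]


def flatten_body(doc: FullTextDoc) -> str:
    """Join section titles and texts into a single newline-separated string."""
    parts = []
    for section in doc.body:
        if section.title:
            parts.append(section.title)
        parts.append(section.text)
    return "\n".join(parts)


# Hypothetical document used only to exercise the helper.
example = FullTextDoc(
    doc_id="doc1",
    title="An example title",
    doi="10.0000/example",
    date="2020-01-01",
    abstract="An example abstract.",
    body=(Section("Introduction", "First paragraph."), Section("", "Second paragraph.")),
)
print(flatten_body(example))
```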
ir_datasets export cord19/fulltext docs
[doc_id] [title] [doi] [date] [abstract] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article metadata.
Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/fulltext/trec-covid queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from cord19/fulltext
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract, body>
You can find more details about the Python API here.
ir_datasets export cord19/fulltext/trec-covid docs
[doc_id] [title] [doi] [date] [abstract] [body]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
# Index cord19/fulltext
indexer = pt.IterDictIndexer('./indices/cord19_fulltext')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/fulltext/trec-covid")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
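Evaluation code usually needs the judgments grouped by query rather than as a flat stream. A minimal sketch of that grouping, assuming each qrel exposes the four fields listed above; the `Qrel` stand-in, the `group_qrels` helper, and the sample judgments are hypothetical:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class Qrel(NamedTuple):
    # Stand-in with the same field names as the qrels shown above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def group_qrels(qrels: Iterable[Qrel]) -> Dict[str, Dict[str, int]]:
    """Map query_id -> {doc_id: relevance}."""
    grouped: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Hypothetical judgments; real ones would come from dataset.qrels_iter().
sample = [
    Qrel("1", "a", 2, "1"),
    Qrel("1", "b", 0, "1"),
    Qrel("2", "c", 1, "1"),
]
grouped = group_qrels(sample)
label_counts = Counter(q.relevance for q in sample)
print(grouped)
print(label_counts)
```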
ir_datasets export cord19/fulltext/trec-covid qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/fulltext/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19_fulltext') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
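To see how the graded labels (0, 1, 2) feed into a measure such as nDCG@20, here is a minimal sketch of one common nDCG formulation (linear gain, log2 rank discount, unjudged documents treated as 0). The `dcg` and `ndcg_at_k` helpers and the toy data are hypothetical, and the implementation used by PyTerrier may differ in gain function or tie handling:

```python
import math
from typing import Dict, List


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 20) -> float:
    """nDCG@k for one query; documents missing from qrels count as 0 (an assumption)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments for a single query: one relevant, one partially relevant, one not relevant.
qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
swapped = ndcg_at_k(["d2", "d1", "d3"], qrels)
print(perfect, swapped)
```

Ranking the partially relevant document above the relevant one lowers the score, which is the behaviour the graded labels are meant to capture.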
The complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from cord19
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round1 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round1 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
# Index cord19/trec-covid/round1
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round1')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round1")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round1')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.
Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
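Because each later round's qrels leave out documents judged in earlier rounds, one way to score a run against a single round is to first drop the documents that were already judged for each query. A minimal sketch of that filtering, assuming runs and prior judgments are held as plain dictionaries; the `drop_previously_judged` helper and the toy data are hypothetical and not part of ir_datasets:

```python
from typing import Dict, List, Set


def drop_previously_judged(
    run: Dict[str, List[str]],
    prior_judged: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Remove, per query, any ranked document that was judged in an earlier round."""
    return {
        query_id: [doc_id for doc_id in ranking if doc_id not in prior_judged.get(query_id, set())]
        for query_id, ranking in run.items()
    }


# Toy run (query_id -> ranked doc_ids) and toy earlier-round judgments.
run = {"1": ["a", "b", "c"], "2": ["b", "d"]}
prior_judged = {"1": {"b"}}
residual_run = drop_previously_judged(run, prior_judged)
print(residual_run)
```

The relative order of the remaining documents is preserved, so the filtered run can be scored directly against the round's own qrels.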
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round2 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round2 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
# Index cord19/trec-covid/round2
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round2")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round2')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.
Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round3 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round3 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
# Index cord19/trec-covid/round3
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round3')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round3")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round3')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round3') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.
Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round4 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round4 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
# Index cord19/trec-covid/round4
indexer = pt.IterDictIndexer('./indices/cord19_trec-covid_round4')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round4")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round4')
index_ref = pt.IndexRef.of('./indices/cord19_trec-covid_round4') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Round 5 of the TREC COVID task. Includes 50 queries related to COVID-19. This uses the "2020-07-16" version of the collection.
Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round5 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from cord19
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, title, doi, date, abstract>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round5 docs
[doc_id] [title] [doi] [date] [abstract]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
# Index cord19
indexer = pt.IterDictIndexer('./indices/cord19')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'doi', 'date', 'abstract'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | Not Relevant: everything else. |
1 | Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer. |
2 | Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question. |
Examples:
import ir_datasets
dataset = ir_datasets.load("cord19/trec-covid/round5")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export cord19/trec-covid/round5 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:cord19/trec-covid/round5')
index_ref = pt.IndexRef.of('./indices/cord19') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.