GitHub: allenai/ir_datasets

ir_datasets: Catalog

ir_datasets provides a common interface to many IR ranking datasets.

Getting Started

Install with pip:

pip install ir_datasets==0.4.3
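Once the package is installed, a dataset from the index below can be loaded by its identifier. The following is a minimal sketch rather than an official recipe: `first_n` and `preview` are hypothetical helper names introduced here, and the example identifier `vaswani` is assumed to be one of the datasets available as an automatic download.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def preview(dataset_id, n=3):
    """Load a dataset by identifier and return its first n docs and queries."""
    # Imported lazily so that first_n stays usable without the package installed.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    # docs_iter() and queries_iter() yield one record at a time, so only the
    # first n records are materialized here.
    docs = first_n(dataset.docs_iter(), n)
    queries = first_n(dataset.queries_iter(), n)
    return docs, queries


# Example call (assumption: "vaswani" downloads automatically on first use):
# docs, queries = preview("vaswani")
```

Iterating lazily keeps memory use low for large collections, which is why the sketch slices the iterator instead of building a full list first.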

Dataset Index

✅: Data available as automatic download

⚠️: Data available from a third party

⬆️: Data inherited from a parent dataset

Dataset docs queries qrels scoreddocs docpairs
antique
antique/test⬆️
antique/test/non-offensive⬆️
antique/train⬆️
antique/train/split200-train⬆️
antique/train/split200-valid⬆️
aquaint⚠️
aquaint/trec-robust-2005⬆️
beir
beir/arguana
beir/climate-fever
beir/cqadupstack/android
beir/cqadupstack/english
beir/cqadupstack/gaming
beir/cqadupstack/gis
beir/cqadupstack/mathematica
beir/cqadupstack/physics
beir/cqadupstack/programmers
beir/cqadupstack/stats
beir/cqadupstack/tex
beir/cqadupstack/unix
beir/cqadupstack/webmasters
beir/cqadupstack/wordpress
beir/dbpedia-entity
beir/dbpedia-entity/dev⬆️
beir/dbpedia-entity/test⬆️
beir/fever
beir/fever/dev⬆️
beir/fever/test⬆️
beir/fever/train⬆️
beir/fiqa
beir/fiqa/dev⬆️
beir/fiqa/test⬆️
beir/fiqa/train⬆️
beir/hotpotqa
beir/hotpotqa/dev⬆️
beir/hotpotqa/test⬆️
beir/hotpotqa/train⬆️
beir/msmarco
beir/msmarco/dev⬆️
beir/msmarco/test⬆️
beir/msmarco/train⬆️
beir/nfcorpus
beir/nfcorpus/dev⬆️
beir/nfcorpus/test⬆️
beir/nfcorpus/train⬆️
beir/nq
beir/quora
beir/quora/dev⬆️
beir/quora/test⬆️
beir/scidocs
beir/scifact
beir/scifact/test⬆️
beir/scifact/train⬆️
beir/trec-covid
beir/webis-touche2020
c4
c4/en-noclean-tr
c4/en-noclean-tr/trec-misinfo-2021⬆️
car
car/v1.5
car/v1.5/test200⬆️
car/v1.5/train/fold0⬆️
car/v1.5/train/fold1⬆️
car/v1.5/train/fold2⬆️
car/v1.5/train/fold3⬆️
car/v1.5/train/fold4⬆️
car/v1.5/trec-y1⬆️
car/v1.5/trec-y1/auto⬆️⬆️
car/v1.5/trec-y1/manual⬆️⬆️
clinicaltrials
clinicaltrials/2017
clinicaltrials/2017/trec-pm-2017⬆️
clinicaltrials/2017/trec-pm-2018⬆️
clinicaltrials/2019
clinicaltrials/2019/trec-pm-2019⬆️
clinicaltrials/2021
clinicaltrials/2021/trec-ct-2021⬆️
clirmatrix
clueweb09⚠️
clueweb09/ar⚠️
clueweb09/catb⚠️
clueweb09/catb/trec-web-2009⬆️
clueweb09/catb/trec-web-2010⬆️
clueweb09/catb/trec-web-2011⬆️
clueweb09/catb/trec-web-2012⬆️
clueweb09/de⚠️
clueweb09/en⚠️
clueweb09/en/trec-web-2009⬆️
clueweb09/en/trec-web-2010⬆️
clueweb09/en/trec-web-2011⬆️
clueweb09/en/trec-web-2012⬆️
clueweb09/es⚠️
clueweb09/fr⚠️
clueweb09/it⚠️
clueweb09/ja⚠️
clueweb09/ko⚠️
clueweb09/pt⚠️
clueweb09/trec-mq-2009⬆️
clueweb09/zh⚠️
clueweb12⚠️
clueweb12/b13⚠️
clueweb12/b13/clef-ehealth⬆️
clueweb12/b13/clef-ehealth/cs⬆️
clueweb12/b13/clef-ehealth/de⬆️
clueweb12/b13/clef-ehealth/fr⬆️
clueweb12/b13/clef-ehealth/hu⬆️
clueweb12/b13/clef-ehealth/pl⬆️
clueweb12/b13/clef-ehealth/sv⬆️
clueweb12/b13/ntcir-www-1⬆️
clueweb12/b13/ntcir-www-2⬆️
clueweb12/b13/ntcir-www-3⬆️
clueweb12/b13/trec-misinfo-2019⬆️
clueweb12/trec-web-2013⬆️
clueweb12/trec-web-2014⬆️
codesearchnet
codesearchnet/challenge
codesearchnet/test
codesearchnet/train
codesearchnet/valid
cord19
cord19/fulltext
cord19/fulltext/trec-covid⬆️
cord19/trec-covid⬆️
cord19/trec-covid/round1
cord19/trec-covid/round2
cord19/trec-covid/round3
cord19/trec-covid/round4
cord19/trec-covid/round5⬆️⬆️
cranfield
dpr-w100
dpr-w100/natural-questions/dev⬆️
dpr-w100/natural-questions/train⬆️
dpr-w100/trivia-qa/dev⬆️
dpr-w100/trivia-qa/train⬆️
gov⚠️
gov/trec-web-2002⬆️
gov/trec-web-2002/named-page⬆️
gov/trec-web-2003⬆️
gov/trec-web-2003/named-page⬆️
gov/trec-web-2004⬆️
gov2⚠️
gov2/trec-mq-2007⬆️
gov2/trec-mq-2008⬆️
gov2/trec-tb-2004⬆️
gov2/trec-tb-2005⬆️
gov2/trec-tb-2005/efficiency⬆️
gov2/trec-tb-2005/named-page⬆️
gov2/trec-tb-2006⬆️
gov2/trec-tb-2006/efficiency⬆️
gov2/trec-tb-2006/efficiency/10k⬆️
gov2/trec-tb-2006/efficiency/stream1⬆️
gov2/trec-tb-2006/efficiency/stream2⬆️
gov2/trec-tb-2006/efficiency/stream3⬆️
gov2/trec-tb-2006/efficiency/stream4⬆️
gov2/trec-tb-2006/named-page⬆️
highwire
highwire/trec-genomics-2006⬆️
highwire/trec-genomics-2007⬆️
medline
medline/2004
medline/2004/trec-genomics-2004⬆️
medline/2004/trec-genomics-2005⬆️
medline/2017
medline/2017/trec-pm-2017⬆️
medline/2017/trec-pm-2018⬆️
mmarco
mmarco/de
mmarco/de/dev⬆️
mmarco/de/train⬆️
mmarco/es
mmarco/es/dev⬆️
mmarco/es/train⬆️
mmarco/fr
mmarco/fr/dev⬆️
mmarco/fr/train⬆️
mmarco/id
mmarco/id/dev⬆️
mmarco/id/train⬆️
mmarco/it
mmarco/it/dev⬆️
mmarco/it/train⬆️
mmarco/pt
mmarco/pt/dev⬆️
mmarco/pt/train⬆️
mmarco/ru
mmarco/ru/dev⬆️
mmarco/ru/train⬆️
mmarco/zh
mmarco/zh/dev⬆️
mmarco/zh/train⬆️
mr-tydi
mr-tydi/ar
mr-tydi/ar/dev⬆️
mr-tydi/ar/test⬆️
mr-tydi/ar/train⬆️
mr-tydi/bn
mr-tydi/bn/dev⬆️
mr-tydi/bn/test⬆️
mr-tydi/bn/train⬆️
mr-tydi/en
mr-tydi/en/dev⬆️
mr-tydi/en/test⬆️
mr-tydi/en/train⬆️
mr-tydi/fi
mr-tydi/fi/dev⬆️
mr-tydi/fi/test⬆️
mr-tydi/fi/train⬆️
mr-tydi/id
mr-tydi/id/dev⬆️
mr-tydi/id/test⬆️
mr-tydi/id/train⬆️
mr-tydi/ja
mr-tydi/ja/dev⬆️
mr-tydi/ja/test⬆️
mr-tydi/ja/train⬆️
mr-tydi/ko
mr-tydi/ko/dev⬆️
mr-tydi/ko/test⬆️
mr-tydi/ko/train⬆️
mr-tydi/ru
mr-tydi/ru/dev⬆️
mr-tydi/ru/test⬆️
mr-tydi/ru/train⬆️
mr-tydi/sw
mr-tydi/sw/dev⬆️
mr-tydi/sw/test⬆️
mr-tydi/sw/train⬆️
mr-tydi/te
mr-tydi/te/dev⬆️
mr-tydi/te/test⬆️
mr-tydi/te/train⬆️
mr-tydi/th
mr-tydi/th/dev⬆️
mr-tydi/th/test⬆️
mr-tydi/th/train⬆️
msmarco-document
msmarco-document/dev⬆️
msmarco-document/eval⬆️
msmarco-document/orcas⬆️
msmarco-document/train⬆️
msmarco-document/trec-dl-2019⬆️
msmarco-document/trec-dl-2019/judged⬆️⬆️
msmarco-document/trec-dl-2020⬆️
msmarco-document/trec-dl-2020/judged⬆️⬆️
msmarco-document/trec-dl-hard⬆️
msmarco-document/trec-dl-hard/fold1⬆️
msmarco-document/trec-dl-hard/fold2⬆️
msmarco-document/trec-dl-hard/fold3⬆️
msmarco-document/trec-dl-hard/fold4⬆️
msmarco-document/trec-dl-hard/fold5⬆️
msmarco-document-v2
msmarco-document-v2/dev1⬆️
msmarco-document-v2/dev2⬆️
msmarco-document-v2/train⬆️
msmarco-document-v2/trec-dl-2019⬆️
msmarco-document-v2/trec-dl-2019/judged⬆️⬆️
msmarco-document-v2/trec-dl-2020⬆️
msmarco-document-v2/trec-dl-2020/judged⬆️⬆️
msmarco-document-v2/trec-dl-2021⬆️
msmarco-passage
msmarco-passage/dev⬆️
msmarco-passage/dev/judged⬆️⬆️
msmarco-passage/dev/small⬆️
msmarco-passage/eval⬆️
msmarco-passage/eval/small⬆️
msmarco-passage/train⬆️
msmarco-passage/train/judged⬆️⬆️⬆️
msmarco-passage/train/medical⬆️
msmarco-passage/train/split200-train⬆️
msmarco-passage/train/split200-valid⬆️
msmarco-passage/trec-dl-2019⬆️
msmarco-passage/trec-dl-2019/judged⬆️⬆️
msmarco-passage/trec-dl-2020⬆️
msmarco-passage/trec-dl-2020/judged⬆️⬆️
msmarco-passage/trec-dl-hard⬆️
msmarco-passage/trec-dl-hard/fold1⬆️
msmarco-passage/trec-dl-hard/fold2⬆️
msmarco-passage/trec-dl-hard/fold3⬆️
msmarco-passage/trec-dl-hard/fold4⬆️
msmarco-passage/trec-dl-hard/fold5⬆️
msmarco-passage-v2
msmarco-passage-v2/dev1⬆️
msmarco-passage-v2/dev2⬆️
msmarco-passage-v2/train⬆️
msmarco-passage-v2/trec-dl-2021⬆️
msmarco-qna
msmarco-qna/dev⬆️
msmarco-qna/eval⬆️
msmarco-qna/train⬆️
natural-questions
natural-questions/dev⬆️
natural-questions/train⬆️
nfcorpus
nfcorpus/dev⬆️
nfcorpus/dev/nontopic⬆️
nfcorpus/dev/video⬆️
nfcorpus/test⬆️
nfcorpus/test/nontopic⬆️
nfcorpus/test/video⬆️
nfcorpus/train⬆️
nfcorpus/train/nontopic⬆️
nfcorpus/train/video⬆️
nyt⚠️
nyt/trec-core-2017⬆️
nyt/wksup⬆️⚠️⚠️
nyt/wksup/train⬆️⚠️⚠️
nyt/wksup/valid⬆️⚠️⚠️
pmc
pmc/v1
pmc/v1/trec-cds-2014⬆️
pmc/v1/trec-cds-2015⬆️
pmc/v2
pmc/v2/trec-cds-2016⬆️
trec-arabic⚠️
trec-arabic/ar2001⬆️
trec-arabic/ar2002⬆️
trec-fair-2021
trec-fair-2021/eval⬆️
trec-fair-2021/train⬆️
trec-mandarin⚠️
trec-mandarin/trec5⬆️
trec-mandarin/trec6⬆️
trec-robust04⚠️
trec-robust04/fold1⬆️
trec-robust04/fold2⬆️
trec-robust04/fold3⬆️
trec-robust04/fold4⬆️
trec-robust04/fold5⬆️
trec-spanish⚠️
trec-spanish/trec3⬆️
trec-spanish/trec4⬆️
tripclick⚠️
tripclick/test⬆️⚠️⚠️
tripclick/test/head⬆️⚠️⚠️
tripclick/test/tail⬆️⚠️⚠️
tripclick/test/torso⬆️⚠️⚠️
tripclick/train⬆️⚠️⚠️⚠️
tripclick/train/head⬆️⚠️⚠️
tripclick/train/head/dctr⬆️⬆️⚠️
tripclick/train/tail⬆️⚠️⚠️
tripclick/train/torso⬆️⚠️⚠️
tripclick/val⬆️⚠️⚠️⚠️
tripclick/val/head⬆️⚠️⚠️⚠️
tripclick/val/head/dctr⬆️⬆️⚠️⬆️
tripclick/val/tail⬆️⚠️⚠️⚠️
tripclick/val/torso⬆️⚠️⚠️⚠️
tweets2013-ia
tweets2013-ia/trec-mb-2013⬆️
tweets2013-ia/trec-mb-2014⬆️
vaswani
wapo
wapo/v2⚠️
wapo/v2/trec-core-2018⬆️
wapo/v2/trec-news-2018⬆️
wapo/v2/trec-news-2019⬆️
wapo/v3/trec-news-2020
wikir
wikir/en1k
wikir/en1k/test⬆️
wikir/en1k/training⬆️
wikir/en1k/validation⬆️
wikir/en59k
wikir/en59k/test⬆️
wikir/en59k/training⬆️
wikir/en59k/validation⬆️
wikir/en78k
wikir/en78k/test⬆️
wikir/en78k/training⬆️
wikir/en78k/validation⬆️
wikir/ens78k
wikir/ens78k/test⬆️
wikir/ens78k/training⬆️
wikir/ens78k/validation⬆️
wikir/es13k
wikir/es13k/test⬆️
wikir/es13k/training⬆️
wikir/es13k/validation⬆️
wikir/fr14k
wikir/fr14k/test⬆️
wikir/fr14k/training⬆️
wikir/fr14k/validation⬆️
wikir/it16k
wikir/it16k/test⬆️
wikir/it16k/training⬆️
wikir/it16k/validation⬆️
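The five data columns of the index (docs, queries, qrels, scoreddocs, docpairs) can also be checked programmatically on a loaded dataset object. The sketch below assumes the object exposes `has_docs()`, `has_queries()`, `has_qrels()`, `has_scoreddocs()` and `has_docpairs()` methods; `COMPONENTS` and `available_components` are hypothetical names used only for illustration.

```python
# Column names taken from the header of the dataset index.
COMPONENTS = ("docs", "queries", "qrels", "scoreddocs", "docpairs")


def available_components(dataset):
    """Map each index column to whether the dataset reports providing it.

    Assumes `dataset` exposes one has_<component>() method per column.
    """
    return {name: bool(getattr(dataset, f"has_{name}")()) for name in COMPONENTS}


# Example call (requires the package and, for some datasets, third-party data):
# import ir_datasets
# print(available_components(ir_datasets.load("antique/test")))
```

Checking availability first avoids starting an iteration over a component that a given subset does not provide.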

Citation

When using datasets provided by this package, be sure to cite them properly. BibTeX for each dataset can be found on that dataset's documentation page.

If you use this tool, please cite our SIGIR resource paper:

@inproceedings{macavaney:sigir2021-irds,
  author = {MacAvaney, Sean and Yates, Andrew and Feldman, Sergey and Downey, Doug and Cohan, Arman and Goharian, Nazli},
  title = {Simplified Data Wrangling with ir_datasets},
  year = {2021},
  booktitle = {SIGIR}
}