GitHub: allenai/ir_datasets

ir_datasets: Catalog

ir_datasets provides a common interface to many IR ranking datasets.

Getting Started

Install with pip:

pip install ir_datasets==0.4.0
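
Once the package is installed, a dataset can be loaded by its ID from the index below. The following is a minimal sketch rather than official usage: `first_queries` and `demo` are hypothetical helper names, and the sketch assumes that query records expose `query_id` and `text` attributes, which may not hold for every dataset.

```python
from itertools import islice


def first_queries(dataset, n=3):
    """Return up to n (query_id, text) pairs from a dataset object.

    Assumes each query record exposes `query_id` and `text` attributes.
    """
    return [(q.query_id, q.text) for q in islice(dataset.queries_iter(), n)]


def demo(dataset_id="antique/test"):
    """Load a dataset by ID and print a few of its queries."""
    # Imported here so first_queries can be used without the package installed.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    for query_id, text in first_queries(dataset):
        print(query_id, text)
```

Calling `demo()` requires ir_datasets to be installed and may trigger a download the first time a dataset is accessed.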

Dataset Index

✅: Data available as automatic download

⚠️: Data available from a third party

Dataset docs queries qrels scoreddocs docpairs
antique
antique/test
antique/test/non-offensive
antique/train
antique/train/split200-train
antique/train/split200-valid
aquaint⚠️
aquaint/trec-robust-2005⚠️
beir
beir/arguana
beir/climate-fever
beir/cqadupstack/android
beir/cqadupstack/english
beir/cqadupstack/gaming
beir/cqadupstack/gis
beir/cqadupstack/mathematica
beir/cqadupstack/physics
beir/cqadupstack/programmers
beir/cqadupstack/stats
beir/cqadupstack/tex
beir/cqadupstack/unix
beir/cqadupstack/webmasters
beir/cqadupstack/wordpress
beir/dbpedia-entity
beir/dbpedia-entity/dev
beir/dbpedia-entity/test
beir/fever
beir/fever/dev
beir/fever/test
beir/fever/train
beir/fiqa
beir/fiqa/dev
beir/fiqa/test
beir/fiqa/train
beir/hotpotqa
beir/hotpotqa/dev
beir/hotpotqa/test
beir/hotpotqa/train
beir/msmarco
beir/msmarco/dev
beir/msmarco/test
beir/msmarco/train
beir/nfcorpus
beir/nfcorpus/dev
beir/nfcorpus/test
beir/nfcorpus/train
beir/nq
beir/quora
beir/quora/dev
beir/quora/test
beir/scidocs
beir/scifact
beir/scifact/test
beir/scifact/train
beir/trec-covid
beir/webis-touche2020
car
car/v1.5
car/v1.5/test200
car/v1.5/train/fold0
car/v1.5/train/fold1
car/v1.5/train/fold2
car/v1.5/train/fold3
car/v1.5/train/fold4
car/v1.5/trec-y1
car/v1.5/trec-y1/auto
car/v1.5/trec-y1/manual
clinicaltrials
clinicaltrials/2017
clinicaltrials/2017/trec-pm-2017
clinicaltrials/2017/trec-pm-2018
clinicaltrials/2019
clinicaltrials/2019/trec-pm-2019
clinicaltrials/2021
clirmatrix
clueweb09⚠️
clueweb09/ar⚠️
clueweb09/catb⚠️
clueweb09/catb/trec-web-2009⚠️
clueweb09/catb/trec-web-2010⚠️
clueweb09/catb/trec-web-2011⚠️
clueweb09/catb/trec-web-2012⚠️
clueweb09/de⚠️
clueweb09/en⚠️
clueweb09/en/trec-web-2009⚠️
clueweb09/en/trec-web-2010⚠️
clueweb09/en/trec-web-2011⚠️
clueweb09/en/trec-web-2012⚠️
clueweb09/es⚠️
clueweb09/fr⚠️
clueweb09/it⚠️
clueweb09/ja⚠️
clueweb09/ko⚠️
clueweb09/pt⚠️
clueweb09/trec-mq-2009⚠️
clueweb09/zh⚠️
clueweb12⚠️
clueweb12/b13⚠️
clueweb12/b13/clef-ehealth⚠️
clueweb12/b13/clef-ehealth/cs⚠️
clueweb12/b13/clef-ehealth/de⚠️
clueweb12/b13/clef-ehealth/fr⚠️
clueweb12/b13/clef-ehealth/hu⚠️
clueweb12/b13/clef-ehealth/pl⚠️
clueweb12/b13/clef-ehealth/sv⚠️
clueweb12/b13/ntcir-www-1⚠️
clueweb12/b13/ntcir-www-2⚠️
clueweb12/b13/ntcir-www-3⚠️
clueweb12/b13/trec-misinfo-2019⚠️
clueweb12/trec-web-2013⚠️
clueweb12/trec-web-2014⚠️
codesearchnet
codesearchnet/challenge
codesearchnet/test
codesearchnet/train
codesearchnet/valid
cord19
cord19/fulltext
cord19/fulltext/trec-covid
cord19/trec-covid
cord19/trec-covid/round1
cord19/trec-covid/round2
cord19/trec-covid/round3
cord19/trec-covid/round4
cord19/trec-covid/round5
cranfield
dpr-w100
dpr-w100/natural-questions/dev
dpr-w100/natural-questions/train
dpr-w100/trivia-qa/dev
dpr-w100/trivia-qa/train
gov⚠️
gov/trec-web-2002⚠️
gov/trec-web-2002/named-page⚠️
gov/trec-web-2003⚠️
gov/trec-web-2003/named-page⚠️
gov/trec-web-2004⚠️
gov2⚠️
gov2/trec-mq-2007⚠️
gov2/trec-mq-2008⚠️
gov2/trec-tb-2004⚠️
gov2/trec-tb-2005⚠️
gov2/trec-tb-2005/efficiency⚠️
gov2/trec-tb-2005/named-page⚠️
gov2/trec-tb-2006⚠️
gov2/trec-tb-2006/efficiency⚠️
gov2/trec-tb-2006/efficiency/10k⚠️
gov2/trec-tb-2006/efficiency/stream1⚠️
gov2/trec-tb-2006/efficiency/stream2⚠️
gov2/trec-tb-2006/efficiency/stream3⚠️
gov2/trec-tb-2006/efficiency/stream4⚠️
gov2/trec-tb-2006/named-page⚠️
highwire
highwire/trec-genomics-2006
highwire/trec-genomics-2007
medline
medline/2004
medline/2004/trec-genomics-2004
medline/2004/trec-genomics-2005
medline/2017
medline/2017/trec-pm-2017
medline/2017/trec-pm-2018
msmarco-document
msmarco-document/dev
msmarco-document/eval
msmarco-document/orcas
msmarco-document/train
msmarco-document/trec-dl-2019
msmarco-document/trec-dl-2019/judged
msmarco-document/trec-dl-2020
msmarco-document/trec-dl-2020/judged
msmarco-document/trec-dl-hard
msmarco-document/trec-dl-hard/fold1
msmarco-document/trec-dl-hard/fold2
msmarco-document/trec-dl-hard/fold3
msmarco-document/trec-dl-hard/fold4
msmarco-document/trec-dl-hard/fold5
msmarco-passage
msmarco-passage/dev
msmarco-passage/dev/judged
msmarco-passage/dev/small
msmarco-passage/eval
msmarco-passage/eval/small
msmarco-passage/train
msmarco-passage/train/judged
msmarco-passage/train/medical
msmarco-passage/train/split200-train
msmarco-passage/train/split200-valid
msmarco-passage/trec-dl-2019
msmarco-passage/trec-dl-2019/judged
msmarco-passage/trec-dl-2020
msmarco-passage/trec-dl-2020/judged
msmarco-passage/trec-dl-hard
msmarco-passage/trec-dl-hard/fold1
msmarco-passage/trec-dl-hard/fold2
msmarco-passage/trec-dl-hard/fold3
msmarco-passage/trec-dl-hard/fold4
msmarco-passage/trec-dl-hard/fold5
msmarco-qna
msmarco-qna/dev
msmarco-qna/eval
msmarco-qna/train
natural-questions
natural-questions/dev
natural-questions/train
nfcorpus
nfcorpus/dev
nfcorpus/dev/nontopic
nfcorpus/dev/video
nfcorpus/test
nfcorpus/test/nontopic
nfcorpus/test/video
nfcorpus/train
nfcorpus/train/nontopic
nfcorpus/train/video
nyt⚠️
nyt/trec-core-2017⚠️
nyt/wksup⚠️⚠️⚠️
nyt/wksup/train⚠️⚠️⚠️
nyt/wksup/valid⚠️⚠️⚠️
pmc
pmc/v1
pmc/v1/trec-cds-2014
pmc/v1/trec-cds-2015
pmc/v2
pmc/v2/trec-cds-2016
trec-arabic⚠️
trec-arabic/ar2001⚠️
trec-arabic/ar2002⚠️
trec-mandarin⚠️
trec-mandarin/trec5⚠️
trec-mandarin/trec6⚠️
trec-robust04⚠️
trec-robust04/fold1⚠️
trec-robust04/fold2⚠️
trec-robust04/fold3⚠️
trec-robust04/fold4⚠️
trec-robust04/fold5⚠️
trec-spanish⚠️
trec-spanish/trec3⚠️
trec-spanish/trec4⚠️
tripclick⚠️
tripclick/test⚠️⚠️⚠️
tripclick/test/head⚠️⚠️⚠️
tripclick/test/tail⚠️⚠️⚠️
tripclick/test/torso⚠️⚠️⚠️
tripclick/train⚠️⚠️⚠️⚠️
tripclick/train/head⚠️⚠️⚠️
tripclick/train/head/dctr⚠️⚠️⚠️
tripclick/train/tail⚠️⚠️⚠️
tripclick/train/torso⚠️⚠️⚠️
tripclick/val⚠️⚠️⚠️⚠️
tripclick/val/head⚠️⚠️⚠️⚠️
tripclick/val/head/dctr⚠️⚠️⚠️⚠️
tripclick/val/tail⚠️⚠️⚠️⚠️
tripclick/val/torso⚠️⚠️⚠️⚠️
tweets2013-ia
tweets2013-ia/trec-mb-2013
tweets2013-ia/trec-mb-2014
vaswani
wapo
wapo/v2⚠️
wapo/v2/trec-core-2018⚠️
wapo/v2/trec-news-2018⚠️
wapo/v2/trec-news-2019⚠️
wapo/v3/trec-news-2020
wikir
wikir/en1k
wikir/en1k/test
wikir/en1k/training
wikir/en1k/validation
wikir/en59k
wikir/en59k/test
wikir/en59k/training
wikir/en59k/validation
wikir/es13k
wikir/es13k/test
wikir/es13k/training
wikir/es13k/validation
wikir/fr14k
wikir/fr14k/test
wikir/fr14k/training
wikir/fr14k/validation
wikir/it16k
wikir/it16k/test
wikir/it16k/training
wikir/it16k/validation
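
The index columns (docs, queries, qrels, scoreddocs, docpairs) describe which entity types each dataset provides. A rough way to check this from Python is sketched below. It assumes the dataset object exposes `has_docs()`, `has_queries()`, and similar methods; `available_entity_types` is a hypothetical helper, and it simply skips any check method it cannot find.

```python
# Entity types, named after the columns of the dataset index.
ENTITY_TYPES = ("docs", "queries", "qrels", "scoreddocs", "docpairs")


def available_entity_types(dataset):
    """List the entity types a dataset object reports as available.

    Looks up `has_<type>()` on the object; a missing method is treated
    as "not available" rather than raising an error.
    """
    found = []
    for name in ENTITY_TYPES:
        checker = getattr(dataset, f"has_{name}", None)
        if callable(checker) and checker():
            found.append(name)
    return found
```

With the package installed, `available_entity_types(ir_datasets.load("antique/test"))` would report the types that dataset provides.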

Citation

When using datasets provided by this package, be sure to cite them properly. BibTeX for each dataset can be found on that dataset's documentation page, or in the Python interface via dataset.documentation()['bibtex'] (when available).
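
Because the 'bibtex' entry is only present when available, a lookup that tolerates its absence can be convenient. The sketch below assumes `documentation()` returns a dictionary-like object (or nothing); `bibtex_for` is a hypothetical helper, not part of the package.

```python
def bibtex_for(dataset):
    """Return the BibTeX string for a dataset object, or None if unavailable."""
    docs = dataset.documentation() or {}
    # .get avoids a KeyError for datasets without a 'bibtex' entry.
    return docs.get("bibtex")
```

For example, `bibtex_for(ir_datasets.load("antique"))` would return that dataset's BibTeX string if one is recorded, and None otherwise.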

If you use this tool, please cite our SIGIR resource paper:

@inproceedings{macavaney:sigir2021-irds,
  author = {MacAvaney, Sean and Yates, Andrew and Feldman, Sergey and Downey, Doug and Cohan, Arman and Goharian, Nazli},
  title = {Simplified Data Wrangling with ir_datasets},
  year = {2021},
  booktitle = {SIGIR}
}