← home
Github: datasets/beir.py

ir_datasets: Beir (benchmark suite)

Index
  1. beir
  2. beir/arguana
  3. beir/climate-fever
  4. beir/cqadupstack/android
  5. beir/cqadupstack/english
  6. beir/cqadupstack/gaming
  7. beir/cqadupstack/gis
  8. beir/cqadupstack/mathematica
  9. beir/cqadupstack/physics
  10. beir/cqadupstack/programmers
  11. beir/cqadupstack/stats
  12. beir/cqadupstack/tex
  13. beir/cqadupstack/unix
  14. beir/cqadupstack/webmasters
  15. beir/cqadupstack/wordpress
  16. beir/dbpedia-entity
  17. beir/dbpedia-entity/dev
  18. beir/dbpedia-entity/test
  19. beir/fever
  20. beir/fever/dev
  21. beir/fever/test
  22. beir/fever/train
  23. beir/fiqa
  24. beir/fiqa/dev
  25. beir/fiqa/test
  26. beir/fiqa/train
  27. beir/hotpotqa
  28. beir/hotpotqa/dev
  29. beir/hotpotqa/test
  30. beir/hotpotqa/train
  31. beir/msmarco
  32. beir/msmarco/dev
  33. beir/msmarco/test
  34. beir/msmarco/train
  35. beir/nfcorpus
  36. beir/nfcorpus/dev
  37. beir/nfcorpus/test
  38. beir/nfcorpus/train
  39. beir/nq
  40. beir/quora
  41. beir/quora/dev
  42. beir/quora/test
  43. beir/scidocs
  44. beir/scifact
  45. beir/scifact/test
  46. beir/scifact/train
  47. beir/trec-covid
  48. beir/webis-touche2020

"beir"

Beir is a suite of benchmarks to test zero-shot transfer.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/arguana"

A version of the ArguAna Counterargs dataset, for argument retrieval.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/arguana queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/arguana docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
# Index beir/arguana
indexer = pt.IterDictIndexer('./indices/beir_arguana')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/arguana qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Wachsmuth2018Arguana,Thakur2021Beir}

Bibtex:

@inproceedings{Wachsmuth2018Arguana, author = "Wachsmuth, Henning and Syed, Shahbaz and Stein, Benno", title = "Retrieval of the Best Counterargument without Prior Topic Knowledge", booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", year = "2018", publisher = "Association for Computational Linguistics", location = "Melbourne, Australia", pages = "241--251", url = "http://aclweb.org/anthology/P18-1023" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/climate-fever"

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/climate-fever queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/climate-fever docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
# Index beir/climate-fever
indexer = pt.IterDictIndexer('./indices/beir_climate-fever')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/climate-fever qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Diggelmann2020CLIMATEFEVERAD,Thakur2021Beir}

Bibtex:

@article{Diggelmann2020CLIMATEFEVERAD, title={CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims}, author={T. Diggelmann and Jordan L. Boyd-Graber and Jannis Bulian and Massimiliano Ciaramita and Markus Leippold}, journal={ArXiv}, year={2020}, volume={abs/2012.00614} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/android"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/android queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/android docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
# Index beir/cqadupstack/android
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_android')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/android qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/english"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/english queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/english docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
# Index beir/cqadupstack/english
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_english')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/english qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/gaming"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gaming queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gaming docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
# Index beir/cqadupstack/gaming
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gaming')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gaming qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/gis"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gis queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gis docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
# Index beir/cqadupstack/gis
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gis')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/gis qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/mathematica"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/mathematica queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/mathematica docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
# Index beir/cqadupstack/mathematica
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_mathematica')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/mathematica qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/physics"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/physics queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/physics docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
# Index beir/cqadupstack/physics
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_physics')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/physics qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/programmers"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/programmers queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/programmers docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
# Index beir/cqadupstack/programmers
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_programmers')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/programmers qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/stats"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/stats queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/stats docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
# Index beir/cqadupstack/stats
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_stats')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/stats qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/tex"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/tex queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/tex docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
# Index beir/cqadupstack/tex
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_tex')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/tex qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/unix"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/unix queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/unix docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
# Index beir/cqadupstack/unix
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_unix')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/unix qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/webmasters"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/webmasters queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/webmasters docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
# Index beir/cqadupstack/webmasters
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_webmasters')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/webmasters qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/cqadupstack/wordpress"

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/wordpress queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/wordpress docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
# Index beir/cqadupstack/wordpress
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_wordpress')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/cqadupstack/wordpress qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/dbpedia-entity"

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/dbpedia-entity/dev"

A random sample of 67 queries from the official test set, used as a dev set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/dbpedia-entity

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/dbpedia-entity/test"

A the official test set, without 67 queries used as a dev set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/dbpedia-entity

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/dbpedia-entity/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fever"

A version of the FEVER dataset for fact verification. Includes queries from the /train /dev and /test subsets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fever/dev"

The official dev set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fever

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fever/test"

The official test set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fever

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fever/train"

The official train set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fever

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fever/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fiqa"

A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train /dev and /test subsets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fiqa/dev"

Random sample of 500 queries from the official dataset.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fiqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fiqa/test"

Random sample of 648 queries from the official dataset.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fiqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/fiqa/train"

Official dataset without the 1148 queries sampled for /dev and /test.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/fiqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/fiqa/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/hotpotqa"

A version of the Hotpot QA dataset for multi-hop question answering. Queries include all those in /train /dev and /test.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/hotpotqa/dev"

Random selection of the 5447 queries from /train.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/hotpotqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/hotpotqa/test"

Official dev set from HotpotQA, here used as a test set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/hotpotqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/hotpotqa/train"

Official train set, without the random selection of the 5447 queries used for /dev.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/hotpotqa

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/hotpotqa/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/msmarco"

A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/msmarco/dev"

A version of the MS MARCO passage ranking dev set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/msmarco

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/msmarco/test"

A version of the TREC Deep Learning 2019 set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/msmarco

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Craswell2019TrecDl,Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} } @inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/msmarco/train"

A version of the MS MARCO passage ranking train set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/msmarco

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/msmarco/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/nfcorpus"

A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant of the query, which here are often natural language questions. Queries include all those from /train /dev and /test.

Data pre-processing may be different than what is done in nfcorpus.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/nfcorpus/dev"

Combined dev set of NFCorpus.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/nfcorpus

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/nfcorpus/test"

Combined test set of NFCorpus.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/nfcorpus

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/nfcorpus/train"

Combined train set of NFCorpus.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/nfcorpus

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nfcorpus/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/nq"

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nq queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nq docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
# Index beir/nq
indexer = pt.IterDictIndexer('./indices/beir_nq')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/nq qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq,Thakur2021Beir}

Bibtex:

@article{Kwiatkowski2019Nq, title = {Natural Questions: a Benchmark for Question Answering Research}, author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov}, year = {2019}, journal = {TACL} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/quora"

A version of the Quora duplicate question detection dataset (QQP). Includes queries from /dev and /test sets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/quora/dev"

A 5,000 question subset of the original dataset, without overlaps in the other subsets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/dev queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/quora

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/dev docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/quora/test"

A 10,000 question subset of the original dataset, without overlaps in the other subsets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/quora

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/quora/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/scidocs"

A version of the SciDocs dataset, used for citation retrieval.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scidocs queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scidocs docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
# Index beir/scidocs
indexer = pt.IterDictIndexer('./indices/beir_scidocs')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scidocs qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Cohan2020Scidocs,Thakur2021Beir}

Bibtex:

@inproceedings{Cohan2020Scidocs, title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers", author = "Cohan, Arman and Feldman, Sergey and Beltagy, Iz and Downey, Doug and Weld, Daniel", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.acl-main.207", doi = "10.18653/v1/2020.acl-main.207", pages = "2270--2282" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/scifact"

A version of the SciFact dataset, for fact verification. Queries include those form the /train and /test sets.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/scifact/test"

The official dev set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/test queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/scifact

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/test docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/scifact/train"

The official train set.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/train queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from beir/scifact

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/train docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/scifact/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/trec-covid"

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/trec-covid queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/trec-covid docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
# Index beir/trec-covid
indexer = pt.IterDictIndexer('./indices/beir_trec-covid')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/trec-covid qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Wang2020Cord19,Voorhees2020TrecCovid,Thakur2021Beir}

Bibtex:

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} } @article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

"beir/webis-touche2020"

A version of the Touchè-2020 dataset, for argument retrieval.

Negative relevance judgments from the original dataset are replaced with 0.

queries

Language: en

Query type:
BeirQuery: (namedtuple)
  1. query_id: str
  2. text: str
  3. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/webis-touche2020 queries
[query_id]    [text]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Document type:
BeirDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str
  4. metadata: Dict[str,str]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, metadata>

You can find more details about the Python API here.

CLI
ir_datasets export beir/webis-touche2020 docs
[doc_id]    [text]    [title]    [metadata]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
# Index beir/webis-touche2020
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.Definition

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export beir/webis-touche2020 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Bondarenko2020Tuche,Thakur2021Beir}

Bibtex:

@inproceedings{Bondarenko2020Tuche, title={Overview of Touch{\'e} 2020: Argument Retrieval}, author={Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen}, booktitle={CLEF}, year={2020} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }